[jira] [Updated] (SPARK-20876) If the input parameter is float type for ceil or floor, the result is not what we expected
[ https://issues.apache.org/jira/browse/SPARK-20876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liuxian updated SPARK-20876:
----------------------------
    Affects Version/s: 2.2.0

> If the input parameter is float type for ceil or floor, the result is not what we expected
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20876
>                 URL: https://issues.apache.org/jira/browse/SPARK-20876
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0, 2.2.0
>            Reporter: liuxian
>
> spark-sql> SELECT ceil(cast(12345.1233 as float));
> spark-sql> 12345
> For this case, the result we expected is 12346.
> spark-sql> SELECT floor(cast(-12345.1233 as float));
> spark-sql> -12345
> For this case, the result we expected is -12346.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
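The effect is easier to see outside Spark: round-tripping the literal through IEEE-754 single precision changes its value slightly, but ceil/floor of the resulting float should still land on the next integer. A minimal sketch in Python (standard library only; `to_float32` is a hypothetical helper name, not a Spark API):

```python
import math
import struct

# Round-trip a Python double through IEEE-754 single precision, mimicking
# CAST(12345.1233 AS FLOAT). `to_float32` is a hypothetical helper name.
def to_float32(x: float) -> float:
    return struct.unpack("f", struct.pack("f", x))[0]

f = to_float32(12345.1233)
print(f)                # ~12345.123046875: precision is lost, but the
                        # fractional part survives
print(math.ceil(f))     # 12346 -- the result the reporter expects
print(math.floor(to_float32(-12345.1233)))  # -12346
```

So even after the precision loss from the cast, a correct ceil/floor on the float value yields 12346 and -12346; returning 12345/-12345 means the fractional part was dropped before rounding.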
[jira] [Commented] (SPARK-6628) ClassCastException occurs when executing sql statement "insert into" on hbase table
[ https://issues.apache.org/jira/browse/SPARK-6628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027214#comment-16027214 ]

Apache Spark commented on SPARK-6628:
-------------------------------------

User 'weiqingy' has created a pull request for this issue:
https://github.com/apache/spark/pull/18127

> ClassCastException occurs when executing sql statement "insert into" on hbase table
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-6628
>                 URL: https://issues.apache.org/jira/browse/SPARK-6628
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: meiyoula
>
> Error: org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in
> stage 3.0 (TID 12, vm-17): java.lang.ClassCastException:
> org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to
> org.apache.hadoop.hive.ql.io.HiveOutputFormat
>         at org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat$lzycompute(hiveWriterContainers.scala:72)
>         at org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat(hiveWriterContainers.scala:71)
>         at org.apache.spark.sql.hive.SparkHiveWriterContainer.getOutputName(hiveWriterContainers.scala:91)
>         at org.apache.spark.sql.hive.SparkHiveWriterContainer.initWriters(hiveWriterContainers.scala:115)
>         at org.apache.spark.sql.hive.SparkHiveWriterContainer.executorSideSetup(hiveWriterContainers.scala:84)
>         at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:112)
>         at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
>         at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>         at org.apache.spark.scheduler.Task.run(Task.scala:56)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[jira] [Updated] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit
[ https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-19372:
---------------------------------
    Fix Version/s: 2.2.0

> Code generation for Filter predicate including many OR conditions exceeds JVM method size limit
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19372
>                 URL: https://issues.apache.org/jira/browse/SPARK-19372
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Jay Pranavamurthi
>            Assignee: Kazuaki Ishizaki
>             Fix For: 2.2.0, 2.3.0
>
>         Attachments: wide400cols.csv
>
> For the attached csv file, the code below causes the exception
> "org.codehaus.janino.JaninoRuntimeException: Code of method
> "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate"
> grows beyond 64 KB"
>
> Code:
> {code:borderStyle=solid}
> val conf = new SparkConf().setMaster("local[1]")
> val sqlContext = SparkSession.builder().config(conf).getOrCreate().sqlContext
> val dataframe =
>   sqlContext
>     .read
>     .format("com.databricks.spark.csv")
>     .load("wide400cols.csv")
> val filter = (0 to 399)
>   .foldLeft(lit(false))((e, index) =>
>     e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}"))
> val filtered = dataframe.filter(filter)
> filtered.show(100)
> {code}
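As a rough illustration in plain Python (not Spark): the foldLeft above chains 400 predicates into a single OR expression, and Spark's whole-stage codegen inlines every comparison into one generated Java method, which is what exceeds the JVM's 64 KB per-method bytecode limit. A toy version of the same fold, with rows as lists of strings:

```python
from functools import reduce

# Plain-Python analogue (not Spark) of the foldLeft in the report:
# OR-combine one predicate per column. Rows here are toy lists of strings.
NUM_COLS = 400

def make_pred(idx):
    # column value at idx differs from the literal s"column${idx + 1}"
    return lambda row: row[idx] != f"column{idx + 1}"

preds = [make_pred(i) for i in range(NUM_COLS)]
combined = reduce(lambda acc, p: (lambda row: acc(row) or p(row)),
                  preds,
                  lambda row: False)  # lit(false) seed

row = [f"column{i + 1}" for i in range(NUM_COLS)]  # every predicate is False
print(combined(row))  # False
row[0] = "something else"
print(combined(row))  # True
```

Python composes this as 400 nested closures with no size limit; the JVM instead receives one flat method body with all 400 comparisons, hence the janino error until codegen learned to split such expressions.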
[jira] [Resolved] (SPARK-20843) Cannot gracefully kill drivers which take longer than 10 seconds to die
[ https://issues.apache.org/jira/browse/SPARK-20843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu resolved SPARK-20843.
----------------------------------
       Resolution: Fixed
         Assignee: Shixiong Zhu
    Fix Version/s: 2.2.0
                   2.1.2

> Cannot gracefully kill drivers which take longer than 10 seconds to die
> -----------------------------------------------------------------------
>
>                 Key: SPARK-20843
>                 URL: https://issues.apache.org/jira/browse/SPARK-20843
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.1
>            Reporter: Michael Allman
>            Assignee: Shixiong Zhu
>           Labels: regression
>          Fix For: 2.1.2, 2.2.0
>
> Commit https://github.com/apache/spark/commit/1c9a386c6b6812a3931f3fb0004249894a01f657 changed the behavior of driver process termination. Whereas before `Process.destroyForcibly` was never called, now it is called (on Java VMs supporting that API) if the driver process does not die within 10 seconds. This prevents apps that take longer than 10 seconds to shut down gracefully from doing so. For example, streaming apps with a large batch duration (say, 30+ seconds) can take minutes to shut down.
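The termination sequence at issue can be sketched outside the JVM. In this Python analogue, `terminate()` plays the role of `Process.destroy` (SIGTERM) and `kill()` the role of `Process.destroyForcibly`; `stop_driver` and `grace_seconds` are hypothetical names, and the report's complaint is precisely that the grace period is a hard-coded 10 seconds rather than configurable:

```python
import subprocess
import sys

# Hypothetical sketch of the destroy-then-destroyForcibly pattern under
# discussion; `stop_driver` and `grace_seconds` are invented names.
def stop_driver(proc: subprocess.Popen, grace_seconds: float = 10.0) -> None:
    proc.terminate()                      # polite shutdown request (SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)  # give the app time to clean up
    except subprocess.TimeoutExpired:
        proc.kill()                       # forcible kill after the grace period
        proc.wait()

p = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
stop_driver(p, grace_seconds=1.0)
print(p.returncode is not None)  # True: the process is gone either way
```

A streaming driver that needs a minute to drain an in-flight batch would hit the forcible-kill branch with a fixed 10-second grace period, which is why making it configurable resolves the regression.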
[jira] [Resolved] (SPARK-20748) Built-in SQL Function Support - CH[A]R
[ https://issues.apache.org/jira/browse/SPARK-20748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-20748.
-----------------------------
       Resolution: Fixed
         Assignee: Yuming Wang
    Fix Version/s: 2.3.0

> Built-in SQL Function Support - CH[A]R
> --------------------------------------
>
>                 Key: SPARK-20748
>                 URL: https://issues.apache.org/jira/browse/SPARK-20748
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Xiao Li
>            Assignee: Yuming Wang
>           Labels: starter
>          Fix For: 2.3.0
>
> {noformat}
> CH[A]R()
> {noformat}
> Returns a character when given its ASCII code.
> Ref: https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions019.htm
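A rough Python analogue of the CH[A]R() behavior described: return the character for an ASCII code. The mod-256 wrap for out-of-range codes is an assumption borrowed from common SQL dialects, not something stated in the ticket:

```python
# Hypothetical sketch of CHR(); the mod-256 wrap for codes above 255 is
# an assumption, not part of the ticket text.
def chr_sql(n: int) -> str:
    return chr(int(n) % 256)

print(chr_sql(65))   # A
print(chr_sql(321))  # A (321 % 256 == 65)
```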
[jira] [Created] (SPARK-20905) When running spark with yarn-client, large executor-cores will lead to bad performance.
Cherry Zhang created SPARK-20905:
------------------------------------

             Summary: When running spark with yarn-client, large executor-cores will lead to bad performance.
                 Key: SPARK-20905
                 URL: https://issues.apache.org/jira/browse/SPARK-20905
             Project: Spark
          Issue Type: Question
          Components: Examples
    Affects Versions: 2.0.0
            Reporter: Cherry Zhang

Hi, all:

When I run a training job in Spark with yarn-client and set executor-cores=20 (less than the 24 vcores) and executor-num=4 (my cluster has 4 slaves), one node's computing time is always larger than the others'.

I checked some blogs, and they say executor-cores should be set to less than 5 when there are many concurrent threads. I tried executor-cores=4 and executor-num=20, and that worked. But I don't know why; can you explain? Thank you very much.
[jira] [Created] (SPARK-20904) Task failures during shutdown cause problems with preempted executors
Marcelo Vanzin created SPARK-20904:
--------------------------------------

             Summary: Task failures during shutdown cause problems with preempted executors
                 Key: SPARK-20904
                 URL: https://issues.apache.org/jira/browse/SPARK-20904
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, YARN
    Affects Versions: 1.6.0
            Reporter: Marcelo Vanzin

Spark runs tasks in a thread pool that uses daemon threads in each executor. That means that when the JVM gets a signal to shut down, those tasks keep running.

Now, when YARN preempts an executor, it sends a SIGTERM to the process, triggering the JVM shutdown. That causes shutdown hooks to run, which may cause user code running in those tasks to fail and report task failures to the driver. Those failures are then counted towards the maximum number of allowed failures, even though in this case we don't want that because the executor was preempted.

So we need a better way to handle that situation.
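The daemon-thread behavior can be illustrated in miniature, in Python rather than the JVM (the child script and the timing threshold are purely illustrative): a daemon worker never blocks process exit, so on shutdown an in-flight "task" is cut off rather than allowed to finish.

```python
import subprocess
import sys
import time

# Minimal illustration of the behavior described above: daemon threads die
# with the process, so shutdown interrupts the running "task" mid-flight.
child_script = r"""
import threading, time

def task():
    time.sleep(30)          # long-running "task"
    print("task finished")  # never reached: daemon threads die with the process

threading.Thread(target=task, daemon=True).start()
print("main exiting")       # the interpreter exits without waiting for task()
"""

start = time.time()
result = subprocess.run([sys.executable, "-c", child_script],
                        capture_output=True, text=True)
print(result.stdout.strip())        # main exiting
print(time.time() - start < 10.0)   # True: exit was immediate, task abandoned
```

On the JVM the same interruption happens mid-task during SIGTERM-triggered shutdown, except that there the dying user code also surfaces as task failures counted against the job.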
[jira] [Updated] (SPARK-20462) Spark-Kinesis Direct Connector
[ https://issues.apache.org/jira/browse/SPARK-20462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-20462:
-------------------------------------
    Component/s:     (was: Input/Output)
                 DStreams

> Spark-Kinesis Direct Connector
> ------------------------------
>
>                 Key: SPARK-20462
>                 URL: https://issues.apache.org/jira/browse/SPARK-20462
>             Project: Spark
>          Issue Type: New Feature
>          Components: DStreams
>    Affects Versions: 2.1.0
>            Reporter: Lauren Moos
>
> I'd like to propose and vet the design for a direct connector between Spark and Kinesis.
[jira] [Commented] (SPARK-20178) Improve Scheduler fetch failures
[ https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027036#comment-16027036 ]

Sital Kedia commented on SPARK-20178:
-------------------------------------

>> So to get the robustness for now I'm fine with just invalidating it immediately and see how that works.

[~tgraves] - Let me know if you want me to resurrect https://github.com/apache/spark/pull/17088, which does exactly that. It was closed inadvertently as a stale PR.

> Improve Scheduler fetch failures
> --------------------------------
>
>                 Key: SPARK-20178
>                 URL: https://issues.apache.org/jira/browse/SPARK-20178
>             Project: Spark
>          Issue Type: Epic
>          Components: Scheduler
>    Affects Versions: 2.1.0
>            Reporter: Thomas Graves
>
> We have been having a lot of discussions around improving the handling of fetch failures. There are 4 JIRAs currently related to this: SPARK-20163, SPARK-20091, SPARK-14649, and SPARK-19753.
> We should try to get a list of things we want to improve and come up with one cohesive design.
> I will put my initial thoughts in a follow-on comment.
[jira] [Resolved] (SPARK-10643) Support remote application download in client mode spark submit
[ https://issues.apache.org/jira/browse/SPARK-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-10643.
-----------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0

> Support remote application download in client mode spark submit
> ---------------------------------------------------------------
>
>                 Key: SPARK-10643
>                 URL: https://issues.apache.org/jira/browse/SPARK-10643
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Submit
>            Reporter: Alan Braithwaite
>            Assignee: Yu Peng
>           Priority: Minor
>          Fix For: 2.2.0
>
> When using mesos with docker and marathon, it would be nice to be able to make spark-submit deployable on marathon and have that download a jar from HDFS instead of having to package the jar with the docker.
> {code}
> $ docker run -it docker.example.com/spark:latest /usr/local/spark/bin/spark-submit --class com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar
> Warning: Skip remote jar hdfs://hdfs/tmp/application.jar.
> java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>         at java.lang.Class.forName0(Native Method)
>         at java.lang.Class.forName(Class.java:348)
>         at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
>         at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639)
>         at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>         at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> Although I'm aware that we can run in cluster mode with mesos, we've already built some nice tools surrounding marathon for logging and monitoring.
> Code in question: https://github.com/apache/spark/blob/132718ad7f387e1002b708b19e471d9cd907e105/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L723-L736
[jira] [Assigned] (SPARK-10643) Support remote application download in client mode spark submit
[ https://issues.apache.org/jira/browse/SPARK-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li reassigned SPARK-10643:
-------------------------------
    Assignee: Yu Peng

> Support remote application download in client mode spark submit
> ---------------------------------------------------------------
>
>                 Key: SPARK-10643
>                 URL: https://issues.apache.org/jira/browse/SPARK-10643
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Submit
>            Reporter: Alan Braithwaite
>            Assignee: Yu Peng
>           Priority: Minor
>          Fix For: 2.2.0
>
> When using mesos with docker and marathon, it would be nice to be able to make spark-submit deployable on marathon and have that download a jar from HDFS instead of having to package the jar with the docker.
> {code}
> $ docker run -it docker.example.com/spark:latest /usr/local/spark/bin/spark-submit --class com.example.spark.streaming.EventHandler hdfs://hdfs/tmp/application.jar
> Warning: Skip remote jar hdfs://hdfs/tmp/application.jar.
> java.lang.ClassNotFoundException: com.example.spark.streaming.EventHandler
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>         at java.lang.Class.forName0(Native Method)
>         at java.lang.Class.forName(Class.java:348)
>         at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
>         at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:639)
>         at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>         at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> Although I'm aware that we can run in cluster mode with mesos, we've already built some nice tools surrounding marathon for logging and monitoring.
> Code in question: https://github.com/apache/spark/blob/132718ad7f387e1002b708b19e471d9cd907e105/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L723-L736
[jira] [Resolved] (SPARK-20873) Improve the error message for unsupported Column Type
[ https://issues.apache.org/jira/browse/SPARK-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-20873.
-----------------------------
       Resolution: Fixed
         Assignee: Ruben Janssen
    Fix Version/s: 2.3.0

> Improve the error message for unsupported Column Type
> -----------------------------------------------------
>
>                 Key: SPARK-20873
>                 URL: https://issues.apache.org/jira/browse/SPARK-20873
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Ruben Janssen
>            Assignee: Ruben Janssen
>          Fix For: 2.3.0
>
> For unsupported column type, we simply output the column type instead of the type name.
> {noformat}
> java.lang.Exception: Unsupported type: org.apache.spark.sql.execution.columnar.ColumnTypeSuite$$anonfun$4$$anon$1@2205a05d
> {noformat}
> We should improve it by outputting its name.
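A small sketch of the requested improvement, in Python for brevity: raise with the type's *name* rather than its default representation, which for a JVM anonymous class prints something like `...ColumnTypeSuite$$anonfun$4$$anon$1@2205a05d`. `UnsupportedColumnType` and `check_column_type` are invented for illustration, not Spark APIs:

```python
# Hypothetical sketch: include a readable type name in the error message
# instead of the object's default representation.
class UnsupportedColumnType(Exception):
    pass

def check_column_type(value, supported=(int, float, str)):
    if not isinstance(value, supported):
        raise UnsupportedColumnType(
            f"Unsupported type: {type(value).__name__}")  # name, not default repr
    return value

try:
    check_column_type(object())
except UnsupportedColumnType as exc:
    print(exc)  # Unsupported type: object
```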
[jira] [Resolved] (SPARK-20694) Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide
[ https://issues.apache.org/jira/browse/SPARK-20694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-20694.
-----------------------------
       Resolution: Fixed
         Assignee: Maciej Szymkiewicz
    Fix Version/s: 2.2.0

> Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide
> ----------------------------------------------------------------------
>
>                 Key: SPARK-20694
>                 URL: https://issues.apache.org/jira/browse/SPARK-20694
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation, Examples, SQL
>    Affects Versions: 2.2.0
>            Reporter: Maciej Szymkiewicz
>            Assignee: Maciej Szymkiewicz
>          Fix For: 2.2.0
>
> - Spark SQL, DataFrames and Datasets Guide should contain a section about partitioned, sorted and bucketed writes.
> - Bucketing should be removed from Unsupported Hive Functionality.
[jira] [Assigned] (SPARK-20843) Cannot gracefully kill drivers which take longer than 10 seconds to die
[ https://issues.apache.org/jira/browse/SPARK-20843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20843:
------------------------------------
    Assignee: Apache Spark

> Cannot gracefully kill drivers which take longer than 10 seconds to die
> -----------------------------------------------------------------------
>
>                 Key: SPARK-20843
>                 URL: https://issues.apache.org/jira/browse/SPARK-20843
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.1
>            Reporter: Michael Allman
>            Assignee: Apache Spark
>           Labels: regression
>
> Commit https://github.com/apache/spark/commit/1c9a386c6b6812a3931f3fb0004249894a01f657 changed the behavior of driver process termination. Whereas before `Process.destroyForcibly` was never called, now it is called (on Java VMs supporting that API) if the driver process does not die within 10 seconds. This prevents apps that take longer than 10 seconds to shut down gracefully from doing so. For example, streaming apps with a large batch duration (say, 30+ seconds) can take minutes to shut down.
[jira] [Assigned] (SPARK-20843) Cannot gracefully kill drivers which take longer than 10 seconds to die
[ https://issues.apache.org/jira/browse/SPARK-20843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20843:
------------------------------------
    Assignee: (was: Apache Spark)

> Cannot gracefully kill drivers which take longer than 10 seconds to die
> -----------------------------------------------------------------------
>
>                 Key: SPARK-20843
>                 URL: https://issues.apache.org/jira/browse/SPARK-20843
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.1
>            Reporter: Michael Allman
>           Labels: regression
>
> Commit https://github.com/apache/spark/commit/1c9a386c6b6812a3931f3fb0004249894a01f657 changed the behavior of driver process termination. Whereas before `Process.destroyForcibly` was never called, now it is called (on Java VMs supporting that API) if the driver process does not die within 10 seconds. This prevents apps that take longer than 10 seconds to shut down gracefully from doing so. For example, streaming apps with a large batch duration (say, 30+ seconds) can take minutes to shut down.
[jira] [Commented] (SPARK-20843) Cannot gracefully kill drivers which take longer than 10 seconds to die
[ https://issues.apache.org/jira/browse/SPARK-20843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026870#comment-16026870 ]

Apache Spark commented on SPARK-20843:
--------------------------------------

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/18126

> Cannot gracefully kill drivers which take longer than 10 seconds to die
> -----------------------------------------------------------------------
>
>                 Key: SPARK-20843
>                 URL: https://issues.apache.org/jira/browse/SPARK-20843
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.1
>            Reporter: Michael Allman
>           Labels: regression
>
> Commit https://github.com/apache/spark/commit/1c9a386c6b6812a3931f3fb0004249894a01f657 changed the behavior of driver process termination. Whereas before `Process.destroyForcibly` was never called, now it is called (on Java VMs supporting that API) if the driver process does not die within 10 seconds. This prevents apps that take longer than 10 seconds to shut down gracefully from doing so. For example, streaming apps with a large batch duration (say, 30+ seconds) can take minutes to shut down.
[jira] [Commented] (SPARK-20894) Error while checkpointing to HDFS (similar to JIRA SPARK-19268)
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026846#comment-16026846 ]

Shixiong Zhu commented on SPARK-20894:
--------------------------------------

Thanks for reporting it. I'm wondering if you can provide the driver log and the logs of all executors. "HDFSBackedStateStoreProvider" will write a log line when it writes a file, so I want to know whether it does write "/usr/local/hadoop/checkpoint/state/0/0/1.delta". The write operation may happen on another executor, which is why I want to see all the logs.

> Error while checkpointing to HDFS (similar to JIRA SPARK-19268)
> ---------------------------------------------------------------
>
>                 Key: SPARK-20894
>                 URL: https://issues.apache.org/jira/browse/SPARK-20894
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.1.1
>      Environment: Ubuntu, Spark 2.1.1, hadoop 2.7
>            Reporter: kant kodali
>
> Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 hours", "24 hours"), df1.col("AppName")).count();
> StreamingQuery query = df2.writeStream().foreach(new KafkaSink()).option("checkpointLocation", "/usr/local/hadoop/checkpoint").outputMode("update").start();
> query.awaitTermination();
>
> For some reason this fails with the error:
>
> ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalStateException: Error reading delta file /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist
>
> I cleared all the checkpoint data in /usr/local/hadoop/checkpoint/ and all consumer offsets in Kafka from all brokers prior to running, and yet this error still persists.
[jira] [Updated] (SPARK-20894) Error while checkpointing to HDFS (similar to JIRA SPARK-19268)
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-20894:
---------------------------------
    Docs Text:   (was: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/05/25 23:01:05 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 1453@ip-172-31-25-189
17/05/25 23:01:05 INFO SignalUtils: Registered signal handler for TERM
17/05/25 23:01:05 INFO SignalUtils: Registered signal handler for HUP
17/05/25 23:01:05 INFO SignalUtils: Registered signal handler for INT
17/05/25 23:01:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/05/25 23:01:06 INFO SecurityManager: Changing view acls to: ubuntu
17/05/25 23:01:06 INFO SecurityManager: Changing modify acls to: ubuntu
17/05/25 23:01:06 INFO SecurityManager: Changing view acls groups to:
17/05/25 23:01:06 INFO SecurityManager: Changing modify acls groups to:
17/05/25 23:01:06 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
17/05/25 23:01:06 INFO TransportClientFactory: Successfully created connection to /192.31.29.39:52000 after 55 ms (0 ms spent in bootstraps)
17/05/25 23:01:06 INFO SecurityManager: Changing view acls to: ubuntu
17/05/25 23:01:06 INFO SecurityManager: Changing modify acls to: ubuntu
17/05/25 23:01:06 INFO SecurityManager: Changing view acls groups to:
17/05/25 23:01:06 INFO SecurityManager: Changing modify acls groups to:
17/05/25 23:01:06 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
17/05/25 23:01:06 INFO TransportClientFactory: Successfully created connection to /192.31.29.39:52000 after 1 ms (0 ms spent in bootstraps)
17/05/25 23:01:06 INFO DiskBlockManager: Created local directory at /usr/local/spark/temp/spark-14760b98-21b0-458f-9646-5321c472e66d/executor-d13962fb-be68-4243-832b-78e68f65e784/blockmgr-bc0640eb-2d3d-4933-b83c-1b0222740de5
17/05/25 23:01:06 INFO MemoryStore: MemoryStore started with capacity 912.3 MB
17/05/25 23:01:06 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@192.31.29.39:52000
17/05/25 23:01:06 INFO WorkerWatcher: Connecting to worker spark://Worker@192.31.25.189:58000
17/05/25 23:01:06 INFO WorkerWatcher: Successfully connected to spark://Worker@192.31.25.189:58000
17/05/25 23:01:06 INFO TransportClientFactory: Successfully created connection to /192.31.25.189:58000 after 4 ms (0 ms spent in bootstraps)
17/05/25 23:01:06 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
17/05/25 23:01:06 INFO Executor: Starting executor ID 0 on host 192.31.25.189
17/05/25 23:01:06 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52100.
17/05/25 23:01:06 INFO NettyBlockTransferService: Server created on 192.31.25.189:52100
17/05/25 23:01:06 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/05/25 23:01:06 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(0, 192.31.25.189, 52100, None)
17/05/25 23:01:06 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(0, 192.31.25.189, 52100, None)
17/05/25 23:01:06 INFO BlockManager: Initialized BlockManager: BlockManagerId(0, 192.31.25.189, 52100, None)
17/05/25 23:01:10 INFO CoarseGrainedExecutorBackend: Got assigned task 0
17/05/25 23:01:10 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/05/25 23:01:10 INFO Executor: Fetching spark://192.31.29.39:52000/jars/dataprocessing-client-stream.jar with timestamp 1495753264902
17/05/25 23:01:10 INFO TransportClientFactory: Successfully created connection to /192.31.29.39:52000 after 2 ms (0 ms spent in bootstraps)
17/05/25 23:01:10 INFO Utils: Fetching spark://192.31.29.39:52000/jars/dataprocessing-client-stream.jar to /usr/local/spark/temp/spark-14760b98-21b0-458f-9646-5321c472e66d/executor-d13962fb-be68-4243-832b-78e68f65e784/spark-4d4ed6ae-202c-4ef0-88ea-ab9f1bdf424e/fetchFileTemp2546265996353018358.tmp
17/05/25 23:01:10 INFO Utils: Copying /usr/local/spark/temp/spark-14760b98-21b0-458f-9646-5321c472e66d/executor-d13962fb-be68-4243-832b-78e68f65e784/spark-4d4ed6ae-202c-4ef0-88ea-ab9f1bdf424e/2978712601495753264902_cache to /usr/local/spark/work/app-20170525230105-0019/0/./dataprocessing-client-stream.jar
17/05/25 23:01:10 INFO Executor: Adding file:/usr/local/spark/work/app-20170525230105-0019/0/./dataprocessing-client-stream.jar to class loader
17/05/25 23:01:10 INFO TorrentBroadcast:
[jira] [Commented] (SPARK-20894) Error while checkpointing to HDFS (similar to JIRA SPARK-19268)
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026848#comment-16026848 ]

Shixiong Zhu commented on SPARK-20894:
--------------------------------------

By the way, you can click "More" -> "Attach Files" to upload logs instead.

> Error while checkpointing to HDFS (similar to JIRA SPARK-19268)
> ---------------------------------------------------------------
>
>                 Key: SPARK-20894
>                 URL: https://issues.apache.org/jira/browse/SPARK-20894
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.1.1
>      Environment: Ubuntu, Spark 2.1.1, hadoop 2.7
>            Reporter: kant kodali
>
> Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 hours", "24 hours"), df1.col("AppName")).count();
> StreamingQuery query = df2.writeStream().foreach(new KafkaSink()).option("checkpointLocation", "/usr/local/hadoop/checkpoint").outputMode("update").start();
> query.awaitTermination();
>
> This for some reason fails with the error:
>
> ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalStateException: Error reading delta file /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist
>
> I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/ and all consumer offsets in Kafka from all brokers prior to running, and yet this error still persists.
[jira] [Commented] (SPARK-20843) Cannot gracefully kill drivers which take longer than 10 seconds to die
[ https://issues.apache.org/jira/browse/SPARK-20843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026841#comment-16026841 ]

Michael Allman commented on SPARK-20843:
----------------------------------------

bq. Will a per-cluster config be enough for your usage?

Yes.

> Cannot gracefully kill drivers which take longer than 10 seconds to die
> -----------------------------------------------------------------------
>
>                 Key: SPARK-20843
>                 URL: https://issues.apache.org/jira/browse/SPARK-20843
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.1
>            Reporter: Michael Allman
>           Labels: regression
>
> Commit https://github.com/apache/spark/commit/1c9a386c6b6812a3931f3fb0004249894a01f657 changed the behavior of driver process termination. Whereas before `Process.destroyForcibly` was never called, now it is called (on Java VMs supporting that API) if the driver process does not die within 10 seconds. This prevents apps that take longer than 10 seconds to shut down gracefully from doing so. For example, streaming apps with a large batch duration (say, 30+ seconds) can take minutes to shut down.
[jira] [Assigned] (SPARK-20891) Reduce duplicate code in typedaggregators.scala
[ https://issues.apache.org/jira/browse/SPARK-20891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20891: Assignee: (was: Apache Spark) > Reduce duplicate code in typedaggregators.scala > --- > > Key: SPARK-20891 > URL: https://issues.apache.org/jira/browse/SPARK-20891 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ruben Janssen > > With SPARK-20411, a significant amount of functions will be added to > typedaggregators.scala, resulting in a large amount of duplicate code -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20891) Reduce duplicate code in typedaggregators.scala
[ https://issues.apache.org/jira/browse/SPARK-20891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20891: Assignee: Apache Spark > Reduce duplicate code in typedaggregators.scala > --- > > Key: SPARK-20891 > URL: https://issues.apache.org/jira/browse/SPARK-20891 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ruben Janssen >Assignee: Apache Spark > > With SPARK-20411, a significant amount of functions will be added to > typedaggregators.scala, resulting in a large amount of duplicate code -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20891) Reduce duplicate code in typedaggregators.scala
[ https://issues.apache.org/jira/browse/SPARK-20891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026831#comment-16026831 ] Apache Spark commented on SPARK-20891: -- User 'setjet' has created a pull request for this issue: https://github.com/apache/spark/pull/18125 > Reduce duplicate code in typedaggregators.scala > --- > > Key: SPARK-20891 > URL: https://issues.apache.org/jira/browse/SPARK-20891 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ruben Janssen > > With SPARK-20411, a significant amount of functions will be added to > typedaggregators.scala, resulting in a large amount of duplicate code -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
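The duplication being removed follows a common shape: each typed aggregator pairs an extractor with a zero value and a combine function. A generic sketch of that factoring, written in plain Java rather than Spark's Scala Aggregator API (all class and method names here are illustrative):

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

public class TypedAggSketch {
    // Generic stand-in for the repeated pattern in typedaggregators.scala:
    // each aggregator extracts a value from the input and folds it with a
    // zero element and a combine function.
    static final class FoldAggregator<IN, OUT> {
        private final Function<IN, OUT> extract;
        private final OUT zero;
        private final BinaryOperator<OUT> combine;

        FoldAggregator(Function<IN, OUT> extract, OUT zero, BinaryOperator<OUT> combine) {
            this.extract = extract;
            this.zero = zero;
            this.combine = combine;
        }

        OUT aggregate(List<IN> rows) {
            OUT acc = zero;
            for (IN row : rows) {
                acc = combine.apply(acc, extract.apply(row));
            }
            return acc;
        }
    }

    // "sum" and "max" no longer need separate hand-written classes:
    static long sumLong(List<Long> xs) {
        return new FoldAggregator<Long, Long>(x -> x, 0L, Long::sum).aggregate(xs);
    }

    static long maxLong(List<Long> xs) {
        return new FoldAggregator<Long, Long>(x -> x, Long.MIN_VALUE, Math::max).aggregate(xs);
    }

    public static void main(String[] args) {
        List<Long> xs = Arrays.asList(3L, 1L, 2L);
        System.out.println(sumLong(xs)); // 6
        System.out.println(maxLong(xs)); // 3
    }
}
```

Each new aggregate function then becomes a one-line instantiation instead of a copy of the whole class body, which is the spirit of the proposed refactoring.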
[jira] [Commented] (SPARK-20843) Cannot gracefully kill drivers which take longer than 10 seconds to die
[ https://issues.apache.org/jira/browse/SPARK-20843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026809#comment-16026809 ] Shixiong Zhu commented on SPARK-20843: -- [~michael] Will a per-cluster config be enough for your usage? A per-app config requires more changes and it's too risky for 2.2 now. > Cannot gracefully kill drivers which take longer than 10 seconds to die > --- > > Key: SPARK-20843 > URL: https://issues.apache.org/jira/browse/SPARK-20843 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Michael Allman > Labels: regression > > Commit > https://github.com/apache/spark/commit/1c9a386c6b6812a3931f3fb0004249894a01f657 > changed the behavior of driver process termination. Whereas before > `Process.destroyForcibly` was never called, now it is called (on Java VM's > supporting that API) if the driver process does not die within 10 seconds. > This prevents apps which take longer than 10 seconds to shutdown gracefully > from shutting down gracefully. For example, streaming apps with a large batch > duration (say, 30 seconds+) can take minutes to shutdown. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19547) KafkaUtil throw 'No current assignment for partition' Exception
[ https://issues.apache.org/jira/browse/SPARK-19547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026807#comment-16026807 ] Sean Owen commented on SPARK-19547: --- Questions are for the mailing list, not JIRA. > KafkaUtil throw 'No current assignment for partition' Exception > --- > > Key: SPARK-19547 > URL: https://issues.apache.org/jira/browse/SPARK-19547 > Project: Spark > Issue Type: Question > Components: DStreams >Affects Versions: 1.6.1 >Reporter: wuchang > > Below is my scala code to create spark kafka stream: > val kafkaParams = Map[String, Object]( > "bootstrap.servers" -> "server110:2181,server110:9092", > "zookeeper" -> "server110:2181", > "key.deserializer" -> classOf[StringDeserializer], > "value.deserializer" -> classOf[StringDeserializer], > "group.id" -> "example", > "auto.offset.reset" -> "latest", > "enable.auto.commit" -> (false: java.lang.Boolean) > ) > val topics = Array("ABTest") > val stream = KafkaUtils.createDirectStream[String, String]( > ssc, > PreferConsistent, > Subscribe[String, String](topics, kafkaParams) > ) > But after run for 10 hours, it throws exceptions: > 2017-02-10 10:56:20,000 INFO [JobGenerator] internals.ConsumerCoordinator: > Revoking previously assigned partitions [ABTest-0, ABTest-1] for group example > 2017-02-10 10:56:20,000 INFO [JobGenerator] internals.AbstractCoordinator: > (Re-)joining group example > 2017-02-10 10:56:20,011 INFO [JobGenerator] internals.AbstractCoordinator: > (Re-)joining group example > 2017-02-10 10:56:40,057 INFO [JobGenerator] internals.AbstractCoordinator: > Successfully joined group example with generation 5 > 2017-02-10 10:56:40,058 INFO [JobGenerator] internals.ConsumerCoordinator: > Setting newly assigned partitions [ABTest-1] for group example > 2017-02-10 10:56:40,080 ERROR [JobScheduler] scheduler.JobScheduler: Error > generating jobs for time 148669538 ms > java.lang.IllegalStateException: No current assignment for partition ABTest-0 > at > 
org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231) > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:179) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:196) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248) > at >
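Independent of the rebalance error itself, the kafkaParams above mix ZooKeeper settings into the consumer config: bootstrap.servers should list only Kafka brokers (server110:9092), and a "zookeeper" key is not a property read by the new consumer that spark-streaming-kafka-0-10 uses. A hedged sketch of a cleaned-up parameter map in plain Java (host names are placeholders; the deserializer class names are written as strings so the sketch compiles without Kafka on the classpath):

```java
import java.util.HashMap;
import java.util.Map;

public class KafkaParamsSketch {
    // Builds consumer properties for the kafka010 direct stream.
    static Map<String, Object> consumerParams(String brokers, String groupId) {
        Map<String, Object> params = new HashMap<>();
        // Brokers only: no ZooKeeper address (port 2181) and no "zookeeper"
        // key at all; the new consumer never talks to ZooKeeper.
        params.put("bootstrap.servers", brokers);
        params.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        params.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        params.put("group.id", groupId);
        params.put("auto.offset.reset", "latest");
        // Let Spark manage offsets after outputs complete instead of the
        // consumer auto-committing mid-batch.
        params.put("enable.auto.commit", false);
        return params;
    }

    public static void main(String[] args) {
        Map<String, Object> p = consumerParams("server110:9092", "example");
        System.out.println(p.get("bootstrap.servers")); // server110:9092
    }
}
```

This does not by itself explain the "No current assignment for partition" failure, but pointing bootstrap.servers at a ZooKeeper port is a misconfiguration worth ruling out first.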
[jira] [Resolved] (SPARK-20014) Optimize mergeSpillsWithFileStream method
[ https://issues.apache.org/jira/browse/SPARK-20014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-20014. -- Resolution: Fixed Assignee: Sital Kedia Fix Version/s: 2.3.0 > Optimize mergeSpillsWithFileStream method > - > > Key: SPARK-20014 > URL: https://issues.apache.org/jira/browse/SPARK-20014 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Sital Kedia >Assignee: Sital Kedia > Fix For: 2.3.0 > > > When the individual partition size in a spill is small, the > mergeSpillsWithTransferTo method does many small disk I/Os, which is really > inefficient. One way to improve performance is to use the > mergeSpillsWithFileStream method by turning off transferTo and using > buffered file reads/writes to improve I/O throughput. > However, the current implementation of mergeSpillsWithFileStream does not do > buffered reads/writes of the files, and in addition it unnecessarily > flushes the output file for each partition.
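The buffered merge described above can be sketched in plain Java: copy each spill through buffered streams into one output and flush only once at the end. This is an illustrative sketch, not Spark's actual mergeSpillsWithFileStream; the buffer size and file handling are assumptions:

```java
import java.io.*;
import java.nio.file.Files;

public class BufferedMergeSketch {
    // Merge several spill files into one output using buffered streams,
    // with a single flush at close rather than one flush per partition.
    static void mergeSpills(File[] spills, File out) throws IOException {
        try (OutputStream merged =
                 new BufferedOutputStream(new FileOutputStream(out), 1 << 16)) {
            byte[] buf = new byte[1 << 16];
            for (File spill : spills) {
                try (InputStream in =
                         new BufferedInputStream(new FileInputStream(spill), 1 << 16)) {
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        merged.write(buf, 0, n);
                    }
                }
                // Note: no merged.flush() here; the per-partition flush was
                // exactly the inefficiency the issue calls out.
            }
        } // single implicit flush + close
    }

    public static void main(String[] args) throws IOException {
        File a = File.createTempFile("spill", "a");
        File b = File.createTempFile("spill", "b");
        Files.write(a.toPath(), "part1,".getBytes());
        Files.write(b.toPath(), "part2".getBytes());
        File out = File.createTempFile("merged", "out");
        mergeSpills(new File[] {a, b}, out);
        System.out.println(new String(Files.readAllBytes(out.toPath()))); // part1,part2
    }
}
```

The win comes from amortizing syscalls over large buffered chunks; the real patch additionally has to preserve per-partition offsets for the shuffle index, which this sketch omits.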
[jira] [Resolved] (SPARK-20844) Remove experimental from API and docs
[ https://issues.apache.org/jira/browse/SPARK-20844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-20844. -- Resolution: Fixed Assignee: Michael Armbrust Fix Version/s: 2.2.0 > Remove experimental from API and docs > - > > Key: SPARK-20844 > URL: https://issues.apache.org/jira/browse/SPARK-20844 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > Fix For: 2.2.0 > > > As of Spark 2.2, we know of several large-scale production use cases of > Structured Streaming. We should update the wording in the API and docs > accordingly before the release. > Let's leave {{Evolving}} on the internal APIs that we still plan to change, > though (source and sink).
[jira] [Commented] (SPARK-19547) KafkaUtil throw 'No current assignment for partition' Exception
[ https://issues.apache.org/jira/browse/SPARK-19547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026795#comment-16026795 ] Pankaj Rastogi commented on SPARK-19547: This ticket is marked as Invalid, but there is no explanation or reasoning. Can someone provide information on why this issue is invalid? > KafkaUtil throw 'No current assignment for partition' Exception > --- > > Key: SPARK-19547 > URL: https://issues.apache.org/jira/browse/SPARK-19547 > Project: Spark > Issue Type: Question > Components: DStreams >Affects Versions: 1.6.1 >Reporter: wuchang > > Below is my scala code to create spark kafka stream: > val kafkaParams = Map[String, Object]( > "bootstrap.servers" -> "server110:2181,server110:9092", > "zookeeper" -> "server110:2181", > "key.deserializer" -> classOf[StringDeserializer], > "value.deserializer" -> classOf[StringDeserializer], > "group.id" -> "example", > "auto.offset.reset" -> "latest", > "enable.auto.commit" -> (false: java.lang.Boolean) > ) > val topics = Array("ABTest") > val stream = KafkaUtils.createDirectStream[String, String]( > ssc, > PreferConsistent, > Subscribe[String, String](topics, kafkaParams) > ) > But after run for 10 hours, it throws exceptions: > 2017-02-10 10:56:20,000 INFO [JobGenerator] internals.ConsumerCoordinator: > Revoking previously assigned partitions [ABTest-0, ABTest-1] for group example > 2017-02-10 10:56:20,000 INFO [JobGenerator] internals.AbstractCoordinator: > (Re-)joining group example > 2017-02-10 10:56:20,011 INFO [JobGenerator] internals.AbstractCoordinator: > (Re-)joining group example > 2017-02-10 10:56:40,057 INFO [JobGenerator] internals.AbstractCoordinator: > Successfully joined group example with generation 5 > 2017-02-10 10:56:40,058 INFO [JobGenerator] internals.ConsumerCoordinator: > Setting newly assigned partitions [ABTest-1] for group example > 2017-02-10 10:56:40,080 ERROR [JobScheduler] scheduler.JobScheduler: Error > generating jobs for time 148669538 ms > 
java.lang.IllegalStateException: No current assignment for partition ABTest-0 > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231) > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:179) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:196) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) > at >
[jira] [Commented] (SPARK-20877) Investigate if tests will time out on CRAN
[ https://issues.apache.org/jira/browse/SPARK-20877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026760#comment-16026760 ] Felix Cheung commented on SPARK-20877: -- OK, I will track down the test run time on Windows. > Investigate if tests will time out on CRAN > -- > > Key: SPARK-20877 > URL: https://issues.apache.org/jira/browse/SPARK-20877 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >
[jira] [Commented] (SPARK-20900) ApplicationMaster crashes if SPARK_YARN_STAGING_DIR is not set
[ https://issues.apache.org/jira/browse/SPARK-20900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026694#comment-16026694 ] Apache Spark commented on SPARK-20900: -- User 'nonsleepr' has created a pull request for this issue: https://github.com/apache/spark/pull/18124 > ApplicationMaster crashes if SPARK_YARN_STAGING_DIR is not set > -- > > Key: SPARK-20900 > URL: https://issues.apache.org/jira/browse/SPARK-20900 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0, 1.6.0, 2.1.0 > Environment: Spark 2.1.0 >Reporter: Alexander Bessonov >Priority: Minor > > When running {{ApplicationMaster}} directly, if {{SPARK_YARN_STAGING_DIR}} is > not set or set to empty string, {{org.apache.hadoop.fs.Path}} will throw > {{IllegalArgumentException}} instead of returning {{null}}. This is not > handled and the exception crashes the job. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20900) ApplicationMaster crashes if SPARK_YARN_STAGING_DIR is not set
[ https://issues.apache.org/jira/browse/SPARK-20900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20900: Assignee: (was: Apache Spark) > ApplicationMaster crashes if SPARK_YARN_STAGING_DIR is not set > -- > > Key: SPARK-20900 > URL: https://issues.apache.org/jira/browse/SPARK-20900 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0, 1.6.0, 2.1.0 > Environment: Spark 2.1.0 >Reporter: Alexander Bessonov >Priority: Minor > > When running {{ApplicationMaster}} directly, if {{SPARK_YARN_STAGING_DIR}} is > not set or set to empty string, {{org.apache.hadoop.fs.Path}} will throw > {{IllegalArgumentException}} instead of returning {{null}}. This is not > handled and the exception crashes the job. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20900) ApplicationMaster crashes if SPARK_YARN_STAGING_DIR is not set
[ https://issues.apache.org/jira/browse/SPARK-20900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20900: Assignee: Apache Spark > ApplicationMaster crashes if SPARK_YARN_STAGING_DIR is not set > -- > > Key: SPARK-20900 > URL: https://issues.apache.org/jira/browse/SPARK-20900 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0, 1.6.0, 2.1.0 > Environment: Spark 2.1.0 >Reporter: Alexander Bessonov >Assignee: Apache Spark >Priority: Minor > > When running {{ApplicationMaster}} directly, if {{SPARK_YARN_STAGING_DIR}} is > not set or set to empty string, {{org.apache.hadoop.fs.Path}} will throw > {{IllegalArgumentException}} instead of returning {{null}}. This is not > handled and the exception crashes the job. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
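The defensive pattern the report calls for (treat an unset or empty SPARK_YARN_STAGING_DIR as absent instead of letting the Path constructor crash the ApplicationMaster) can be shown in plain Java. The Path class below is an illustrative stand-in that mimics the failure mode of the real org.apache.hadoop.fs.Path; everything else is an assumption, not Spark's actual code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class StagingDirSketch {
    // Stand-in mimicking Hadoop's Path: its constructor throws
    // IllegalArgumentException on an empty string instead of returning null.
    static final class Path {
        final String value;
        Path(String s) {
            if (s == null || s.isEmpty()) {
                throw new IllegalArgumentException("Can not create a Path from an empty string");
            }
            this.value = s;
        }
    }

    // Defensive lookup: check the variable before constructing the Path,
    // so a missing staging dir surfaces as Optional.empty(), not a crash.
    static Optional<Path> stagingDir(Map<String, String> env) {
        String raw = env.get("SPARK_YARN_STAGING_DIR");
        return (raw == null || raw.isEmpty())
            ? Optional.empty()
            : Optional.of(new Path(raw));
    }

    public static void main(String[] args) {
        Map<String, String> env = new HashMap<>();
        System.out.println(stagingDir(env).isPresent());          // false: unset
        env.put("SPARK_YARN_STAGING_DIR", "");
        System.out.println(stagingDir(env).isPresent());          // false: empty, but no crash
        env.put("SPARK_YARN_STAGING_DIR", "hdfs:///tmp/staging");
        System.out.println(stagingDir(env).isPresent());          // true
    }
}
```

The actual fix in the linked pull request may differ; the point is only that the empty-string case needs an explicit guard because the constructor, unlike a lookup, cannot return null.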
[jira] [Updated] (SPARK-20902) Word2Vec implementations with Negative Sampling
[ https://issues.apache.org/jira/browse/SPARK-20902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chopra updated SPARK-20902: --- Description: Spark MLlib Word2Vec currently only implements Skip-Gram+Hierarchical softmax. Both Continuous bag of words (CBOW) and SkipGram have shown comparative or better performance with Negative Sampling. This umbrella JIRA is to keep a track of the effort to add negative sampling based implementations of both CBOW and SkipGram models to Spark MLlib. Since word2vec is largely a pre-processing step, the performance often can depend on the application it is being used for, and the corpus it is estimated on. These implementation give users the choice of picking one that works best for their use-case. was:Spark MLlib Word2Vec currently only implements Skip-Gram+Hierarchical softmax. Both Continuous bag of words (CBOW) and SkipGram have shown comparative or better performance with Negative Sampling. This umbrella JIRA is to keep a track of the effort to add negative sampling based implementations of both CBOW and SkipGram models to Spark MLlib. > Word2Vec implementations with Negative Sampling > --- > > Key: SPARK-20902 > URL: https://issues.apache.org/jira/browse/SPARK-20902 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.1 >Reporter: Shubham Chopra > Labels: ML > > Spark MLlib Word2Vec currently only implements Skip-Gram+Hierarchical > softmax. Both Continuous bag of words (CBOW) and SkipGram have shown > comparative or better performance with Negative Sampling. This umbrella JIRA > is to keep a track of the effort to add negative sampling based > implementations of both CBOW and SkipGram models to Spark MLlib. > Since word2vec is largely a pre-processing step, the performance often can > depend on the application it is being used for, and the corpus it is > estimated on. 
These implementations give users the choice of picking the one that > works best for their use case.
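For readers unfamiliar with the technique, negative sampling draws "noise" words from a flattened unigram distribution (word counts raised to the 0.75 power in the original word2vec papers), so frequent words are sampled less often than their raw frequency share. A toy, self-contained sketch; the linear-scan draw is for clarity only, and real implementations precompute a table:

```java
import java.util.Random;

public class NegativeSamplingSketch {
    // Draw one negative-sample index with probability proportional to
    // count^power. O(V) per draw for clarity.
    static int sampleNegative(long[] counts, double power, Random rng) {
        double[] weights = new double[counts.length];
        double total = 0.0;
        for (int i = 0; i < counts.length; i++) {
            weights[i] = Math.pow(counts[i], power);
            total += weights[i];
        }
        double r = rng.nextDouble() * total;
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r <= 0) {
                return i;
            }
        }
        return weights.length - 1;
    }

    public static void main(String[] args) {
        long[] counts = {1000, 10, 1};   // toy vocabulary frequencies
        Random rng = new Random(42);
        int[] hits = new int[counts.length];
        for (int i = 0; i < 100_000; i++) {
            hits[sampleNegative(counts, 0.75, rng)]++;
        }
        // Word 0 still dominates, but with ~96% of draws rather than its
        // raw ~99% frequency share, because the 0.75 exponent flattens
        // the distribution.
        System.out.println(hits[0] + " " + hits[1] + " " + hits[2]);
    }
}
```

During training each (word, context) pair is then paired with a few such noise draws as negative targets, replacing the hierarchical-softmax tree traversal; the exponent 0.75 is the value used in the reference implementation, not something mandated by the method.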
[jira] [Commented] (SPARK-20903) Word2Vec Skip-Gram + Negative Sampling
[ https://issues.apache.org/jira/browse/SPARK-20903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026669#comment-16026669 ] Apache Spark commented on SPARK-20903: -- User 'shubhamchopra' has created a pull request for this issue: https://github.com/apache/spark/pull/18123 > Word2Vec Skip-Gram + Negative Sampling > -- > > Key: SPARK-20903 > URL: https://issues.apache.org/jira/browse/SPARK-20903 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.1.1 >Reporter: Shubham Chopra > > SkipGram + Negative Sampling is shown to be comparative or out-performing the > hierarchical softmax based approach currently implemented with Spark. Since > word2vec is largely a pre-processing step, the performance often can depend > on the application it is being used for, and the corpus it is estimated on. > These implementation give users the choice of picking one that works best for > their use-case. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20903) Word2Vec Skip-Gram + Negative Sampling
[ https://issues.apache.org/jira/browse/SPARK-20903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20903: Assignee: (was: Apache Spark) > Word2Vec Skip-Gram + Negative Sampling > -- > > Key: SPARK-20903 > URL: https://issues.apache.org/jira/browse/SPARK-20903 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.1.1 >Reporter: Shubham Chopra > > SkipGram + Negative Sampling is shown to be comparative or out-performing the > hierarchical softmax based approach currently implemented with Spark. Since > word2vec is largely a pre-processing step, the performance often can depend > on the application it is being used for, and the corpus it is estimated on. > These implementation give users the choice of picking one that works best for > their use-case. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20903) Word2Vec Skip-Gram + Negative Sampling
[ https://issues.apache.org/jira/browse/SPARK-20903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20903: Assignee: Apache Spark > Word2Vec Skip-Gram + Negative Sampling > -- > > Key: SPARK-20903 > URL: https://issues.apache.org/jira/browse/SPARK-20903 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.1.1 >Reporter: Shubham Chopra >Assignee: Apache Spark > > SkipGram + Negative Sampling is shown to be comparative or out-performing the > hierarchical softmax based approach currently implemented with Spark. Since > word2vec is largely a pre-processing step, the performance often can depend > on the application it is being used for, and the corpus it is estimated on. > These implementation give users the choice of picking one that works best for > their use-case. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20843) Cannot gracefully kill drivers which take longer than 10 seconds to die
[ https://issues.apache.org/jira/browse/SPARK-20843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026663#comment-16026663 ] Michael Allman commented on SPARK-20843: bq. I will say that I don't think its really safe to rely on clean shutdowns for correctness... I agree, but we don't want a "force kill" to become the norm for an app shutdown. An unclean shutdown should be an exceptional situation that will be cause for an ops alert, an investigation and a validation that no data loss or corruption occurred (i.e. our failure mechanisms held). Also, I would say that in some cases where an app integrates with other systems and cannot guarantee transactional or idempotent semantics in the event of a failure, an unclean shutdown *will* require cross-system validation and any necessary data synchronization or recovery. Cheers. > Cannot gracefully kill drivers which take longer than 10 seconds to die > --- > > Key: SPARK-20843 > URL: https://issues.apache.org/jira/browse/SPARK-20843 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Michael Allman > Labels: regression > > Commit > https://github.com/apache/spark/commit/1c9a386c6b6812a3931f3fb0004249894a01f657 > changed the behavior of driver process termination. Whereas before > `Process.destroyForcibly` was never called, now it is called (on Java VM's > supporting that API) if the driver process does not die within 10 seconds. > This prevents apps which take longer than 10 seconds to shutdown gracefully > from shutting down gracefully. For example, streaming apps with a large batch > duration (say, 30 seconds+) can take minutes to shutdown. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20903) Word2Vec Skip-Gram + Negative Sampling
Shubham Chopra created SPARK-20903: -- Summary: Word2Vec Skip-Gram + Negative Sampling Key: SPARK-20903 URL: https://issues.apache.org/jira/browse/SPARK-20903 Project: Spark Issue Type: Sub-task Components: ML, MLlib Affects Versions: 2.1.1 Reporter: Shubham Chopra Skip-Gram + Negative Sampling has been shown to be comparable to or better than the hierarchical-softmax-based approach currently implemented in Spark. Since word2vec is largely a pre-processing step, performance often depends on the application it is used for and the corpus it is estimated on. These implementations give users the choice of picking the one that works best for their use case.
[jira] [Updated] (SPARK-20372) Word2Vec Continuous Bag Of Words model
[ https://issues.apache.org/jira/browse/SPARK-20372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chopra updated SPARK-20372: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-20902 > Word2Vec Continuous Bag Of Words model > -- > > Key: SPARK-20372 > URL: https://issues.apache.org/jira/browse/SPARK-20372 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Shubham Chopra > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20902) Word2Vec implementations with Negative Sampling
Shubham Chopra created SPARK-20902: -- Summary: Word2Vec implementations with Negative Sampling Key: SPARK-20902 URL: https://issues.apache.org/jira/browse/SPARK-20902 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.1.1 Reporter: Shubham Chopra Spark MLlib Word2Vec currently only implements Skip-Gram + hierarchical softmax. Both continuous bag of words (CBOW) and Skip-Gram have shown comparable or better performance with negative sampling. This umbrella JIRA is to keep track of the effort to add negative-sampling-based implementations of both the CBOW and Skip-Gram models to Spark MLlib.
[jira] [Commented] (SPARK-20843) Cannot gracefully kill drivers which take longer than 10 seconds to die
[ https://issues.apache.org/jira/browse/SPARK-20843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026643#comment-16026643 ] Marcelo Vanzin commented on SPARK-20843: I mostly followed [~zsxwing]'s review since he didn't seem to have issues with it (and he filed the original bug, SPARK-13602). My comments were mostly stylistic. Given that it may cause problems it might be safer to revert this in 2.2 and fix it in master. Maybe [~bryanc] can chime in too since he worked on the change itself. > Cannot gracefully kill drivers which take longer than 10 seconds to die > --- > > Key: SPARK-20843 > URL: https://issues.apache.org/jira/browse/SPARK-20843 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Michael Allman > Labels: regression > > Commit > https://github.com/apache/spark/commit/1c9a386c6b6812a3931f3fb0004249894a01f657 > changed the behavior of driver process termination. Whereas before > `Process.destroyForcibly` was never called, now it is called (on Java VM's > supporting that API) if the driver process does not die within 10 seconds. > This prevents apps which take longer than 10 seconds to shutdown gracefully > from shutting down gracefully. For example, streaming apps with a large batch > duration (say, 30 seconds+) can take minutes to shutdown. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20843) Cannot gracefully kill drivers which take longer than 10 seconds to die
[ https://issues.apache.org/jira/browse/SPARK-20843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026634#comment-16026634 ] Michael Armbrust commented on SPARK-20843: -- I don't have much context here /cc [~zsxwing] and [~vanzin]. It does seem like this could at least be configurable. I will say that I don't think it's really safe to rely on clean shutdowns for correctness, and I don't think this change affects Structured Streaming. > Cannot gracefully kill drivers which take longer than 10 seconds to die > --- > > Key: SPARK-20843 > URL: https://issues.apache.org/jira/browse/SPARK-20843 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Michael Allman > Labels: regression > > Commit > https://github.com/apache/spark/commit/1c9a386c6b6812a3931f3fb0004249894a01f657 > changed the behavior of driver process termination. Whereas before > `Process.destroyForcibly` was never called, now it is called (on Java VM's > supporting that API) if the driver process does not die within 10 seconds. > This prevents apps which take longer than 10 seconds to shutdown gracefully > from shutting down gracefully. For example, streaming apps with a large batch > duration (say, 30 seconds+) can take minutes to shutdown.
[jira] [Updated] (SPARK-20393) Strengthen Spark to prevent XSS vulnerabilities
[ https://issues.apache.org/jira/browse/SPARK-20393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20393: -- Fix Version/s: (was: 2.3.0) 2.2.0 > Strengthen Spark to prevent XSS vulnerabilities > --- > > Key: SPARK-20393 > URL: https://issues.apache.org/jira/browse/SPARK-20393 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.2, 2.0.2, 2.1.0 >Reporter: Nicholas Marion >Assignee: Nicholas Marion >Priority: Minor > Labels: security > Fix For: 2.2.0 > > > Using IBM Security AppScan Standard, we discovered several easy-to-recreate > MHTML cross-site scripting vulnerabilities in the Apache Spark Web GUI > application; these vulnerabilities were found to exist in Spark versions > 1.5.2 and 2.0.2, the two levels we initially tested. A cross-site scripting > attack is not really an attack on the Spark server as much as an attack on > the end user, taking advantage of their trust in the Spark server to get them > to click on a URL like the ones in the examples below. So whether the user > could or could not change lots of stuff on the Spark server is not the key > point. It is an attack on the user themselves. If they click the link, the > script could run in their browser and compromise their device. Once the > browser is compromised it could submit Spark requests, but it also might not. 
> https://blogs.technet.microsoft.com/srd/2011/01/28/more-information-about-the-mhtml-script-injection-vulnerability/ > {quote} > Request: GET > /app/?appId=Content-Type:%20multipart/related;%20boundary=_AppScan%0d%0a-- > _AppScan%0d%0aContent-Location:foo%0d%0aContent-Transfer- > Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a > HTTP/1.1 > Excerpt from response: No running application with ID > Content-Type: multipart/related; > boundary=_AppScan > --_AppScan > Content-Location:foo > Content-Transfer-Encoding:base64 > PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+ > > Result: In the above payload the BASE64 data decodes as: > alert("XSS") > Request: GET > /history/app-20161012202114-0038/stages/stage?id=1=0=Content- > Type:%20multipart/related;%20boundary=_AppScan%0d%0a--_AppScan%0d%0aContent- > Location:foo%0d%0aContent-Transfer- > Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a > k.pageSize=100 HTTP/1.1 > Excerpt from response: Content-Type: multipart/related; > boundary=_AppScan > --_AppScan > Content-Location:foo > Content-Transfer-Encoding:base64 > PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+ > Result: In the above payload the BASE64 data decodes as: > alert("XSS") > Request: GET /log?appId=app-20170113131903-=0=Content- > Type:%20multipart/related;%20boundary=_AppScan%0d%0a--_AppScan%0d%0aContent- > Location:foo%0d%0aContent-Transfer- > Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a > eLength=0 HTTP/1.1 > Excerpt from response: Bytes 0-0 of 0 of > /u/nmarion/Spark_2.0.2.0/Spark-DK/work/app-20170113131903-/0/Content- > Type: multipart/related; boundary=_AppScan > --_AppScan > Content-Location:foo > Content-Transfer-Encoding:base64 > PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+ > Result: In the above payload the BASE64 data decodes as: > alert("XSS") > {quote} > security@apache was notified 
and recommended a PR.
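The remediation for this class of bug is output encoding: any request-derived string (such as the appId echoed back in the responses above) must be HTML-escaped before it is written into the page. A minimal sketch of such an escaper follows; this is an illustration, not Spark's actual UI escaping utility.

```java
public class HtmlEscape {
    // Minimal output-encoding sketch: replace the five HTML metacharacters
    // so reflected request input cannot break out into markup or script.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The decoded payload from the report, rendered inert.
        System.out.println(escape("<html><script>alert(\"XSS\")</script></html>"));
    }
}
```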
[jira] [Commented] (SPARK-20882) Executor is waiting for ShuffleBlockFetcherIterator
[ https://issues.apache.org/jira/browse/SPARK-20882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026595#comment-16026595 ] Shixiong Zhu commented on SPARK-20882: -- [~cenyuhai] do you mind to share what's the root cause? > Executor is waiting for ShuffleBlockFetcherIterator > --- > > Key: SPARK-20882 > URL: https://issues.apache.org/jira/browse/SPARK-20882 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1 >Reporter: cen yuhai > Attachments: executor_jstack, executor_log, screenshot-1.png, > screenshot-2.png, screenshot-3.png > > > This bug is like https://issues.apache.org/jira/browse/SPARK-19300. > but I have updated my client netty version to 4.0.43.Final. > The shuffle service handler is still 4.0.42.Final > spark.sql.adaptive.enabled is true > {code} > "Executor task launch worker for task 4808985" #5373 daemon prio=5 os_prio=0 > tid=0x7f54ef437000 nid=0x1aed0 waiting on condition [0x7f53aebfe000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > parking to wait for <0x000498c249c0> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:189) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:58) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) > at org.apache.spark.scheduler.Task.run(Task.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:323) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622) > at java.lang.Thread.run(Thread.java:834) > {code} > {code} > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: > Set(shuffle_5_1431_805, shuffle_5_1431_808, shuffle_5_1431_806, > shuffle_5_1431_809, shuffle_5_1431_807) > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: > Set(shuffle_5_1431_808, shuffle_5_1431_806, shuffle_5_1431_809, > shuffle_5_1431_807) > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: > Set(shuffle_5_1431_808, shuffle_5_1431_809, shuffle_5_1431_807) > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: > Set(shuffle_5_1431_808, shuffle_5_1431_809) > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: > Set(shuffle_5_1431_809) > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set() > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 21 > 17/05/26 
12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 20 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 19 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 18 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 17 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 16 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 15 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 14 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 13 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 12 > 17/05/26 12:04:06 DEBUG
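The jstack above shows the task thread parked in LinkedBlockingQueue.take() inside ShuffleBlockFetcherIterator.next, which waits indefinitely if no fetch result is ever enqueued. The difference between an unbounded take() and a bounded poll(timeout) can be sketched as below; the helper is hypothetical and is not the actual iterator code.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class FetchWait {
    // Sketch: take() parks forever when the queue stays empty; poll(timeout)
    // lets the consumer surface a diagnostic instead of hanging silently.
    static String nextResult(LinkedBlockingQueue<String> results, long timeoutSeconds)
            throws InterruptedException {
        String r = results.poll(timeoutSeconds, TimeUnit.SECONDS);
        if (r == null) {
            throw new IllegalStateException(
                "no shuffle block arrived within " + timeoutSeconds + "s");
        }
        return r;
    }
}
```

With this style of wait, a stalled fetch shows up as an explicit error with a timestamp rather than a thread stuck in WAITING that only a jstack reveals.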
[jira] [Commented] (SPARK-20897) cached self-join should not fail
[ https://issues.apache.org/jira/browse/SPARK-20897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026588#comment-16026588 ] Michael Armbrust commented on SPARK-20897: -- Is this a regression? If so, can you please make sure that it's targeted at the 2.2.0 release. > cached self-join should not fail > > > Key: SPARK-20897 > URL: https://issues.apache.org/jira/browse/SPARK-20897 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > > Code to reproduce this bug: > {code} > // force to plan sort merge join > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0") > val df = Seq(1 -> "a").toDF("i", "j") > val df1 = df.as("t1") > val df2 = df.as("t2") > assert(df1.join(df2, $"t1.i" === $"t2.i").cache().count() == 1) > {code}
[jira] [Created] (SPARK-20901) Feature parity for ORC with Parquet
Dongjoon Hyun created SPARK-20901: - Summary: Feature parity for ORC with Parquet Key: SPARK-20901 URL: https://issues.apache.org/jira/browse/SPARK-20901 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.1 Reporter: Dongjoon Hyun This issue aims to track the feature parity for ORC with Parquet.
[jira] [Updated] (SPARK-11412) Support merge schema for ORC
[ https://issues.apache.org/jira/browse/SPARK-11412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-11412: -- Affects Version/s: 2.1.1 > Support merge schema for ORC > > > Key: SPARK-11412 > URL: https://issues.apache.org/jira/browse/SPARK-11412 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.1 >Reporter: Dave > > When I tried to load partitioned ORC files with a slight difference in a > nested column, say: > column > -- request: struct (nullable = true) > ||-- datetime: string (nullable = true) > ||-- host: string (nullable = true) > ||-- ip: string (nullable = true) > ||-- referer: string (nullable = true) > ||-- request_uri: string (nullable = true) > ||-- uri: string (nullable = true) > ||-- useragent: string (nullable = true) > And then there's a page_url_lists attribute in the later partitions. > I tried to use > val s = sqlContext.read.format("orc").option("mergeSchema", > "true").load("/data/warehouse/") to load the data. > But the schema doesn't show request.page_url_lists. > I am wondering whether schema merge works for ORC.
[jira] [Created] (SPARK-20900) ApplicationMaster crashes if SPARK_YARN_STAGING_DIR is not set
Alexander Bessonov created SPARK-20900: -- Summary: ApplicationMaster crashes if SPARK_YARN_STAGING_DIR is not set Key: SPARK-20900 URL: https://issues.apache.org/jira/browse/SPARK-20900 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.1.0, 1.6.0, 1.2.0 Environment: Spark 2.1.0 Reporter: Alexander Bessonov Priority: Minor When running {{ApplicationMaster}} directly, if {{SPARK_YARN_STAGING_DIR}} is not set or is set to an empty string, {{org.apache.hadoop.fs.Path}} will throw {{IllegalArgumentException}} instead of returning {{null}}. This is not handled, and the exception crashes the job.
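The crash boils down to constructing a Hadoop Path from a null or empty environment value. A defensive read that fails with an explicit message instead can be sketched as below; the helper name is hypothetical and this is not the actual fix in ApplicationMaster.

```java
import java.util.HashMap;
import java.util.Map;

public class StagingDir {
    // Sketch: validate the environment variable up front so a missing value
    // produces a clear error instead of an unhandled IllegalArgumentException
    // from org.apache.hadoop.fs.Path.
    static String stagingDir(Map<String, String> env) {
        String dir = env.get("SPARK_YARN_STAGING_DIR");
        if (dir == null || dir.isEmpty()) {
            throw new IllegalStateException(
                "SPARK_YARN_STAGING_DIR must be set when running ApplicationMaster");
        }
        return dir;
    }

    public static void main(String[] args) {
        Map<String, String> env = new HashMap<>();
        env.put("SPARK_YARN_STAGING_DIR", "/user/test/.sparkStaging"); // hypothetical path
        System.out.println(stagingDir(env));
    }
}
```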
[jira] [Reopened] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-19809: --- IMO, we had better be more robust on this. The 3rd party tools (reported pig or sqoop) sometimes introduce this issues. {code} scala> sql("create table empty_orc(a int) stored as orc location '/tmp/empty_orc'").show ++ || ++ ++ $ touch /tmp/empty_orc/zero.orc scala> sql("select * from empty_orc").show {code} > NullPointerException on empty ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2, 2.1.1 >Reporter: Michał Dawid > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > 
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at
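Until split generation is hardened, one defensive workaround is to filter out zero-byte files (like the zero.orc created by `touch` above) before they reach the ORC reader. A sketch of such a filter follows, using plain java.io for illustration rather than the HDFS FileSystem API the real code path would use.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class SkipEmptyFiles {
    // Sketch: drop 0-byte files before handing paths to the ORC reader,
    // mirroring the defensive check discussed for SPARK-19809.
    static List<File> nonEmptyFiles(File dir) {
        List<File> result = new ArrayList<>();
        File[] children = dir.listFiles();
        if (children != null) {
            for (File f : children) {
                if (f.isFile() && f.length() > 0) {
                    result.add(f);
                }
            }
        }
        return result;
    }
}
```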
[jira] [Comment Edited] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026515#comment-16026515 ] Dongjoon Hyun edited comment on SPARK-19809 at 5/26/17 5:09 PM: IMO, we had better be more robust on this. The 3rd party tools (reported pig or sqoop) sometimes introduce this issues. {code} scala> sql("create table empty_orc(a int) stored as orc location '/tmp/empty_orc'").show ++ || ++ ++ $ touch /tmp/empty_orc/zero.orc scala> sql("select * from empty_orc").show java.lang.RuntimeException: serious problem at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) {code} was (Author: dongjoon): IMO, we had better be more robust on this. The 3rd party tools (reported pig or sqoop) sometimes introduce this issues. {code} scala> sql("create table empty_orc(a int) stored as orc location '/tmp/empty_orc'").show ++ || ++ ++ $ touch /tmp/empty_orc/zero.orc scala> sql("select * from empty_orc").show {code} > NullPointerException on empty ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2, 2.1.1 >Reporter: Michał Dawid > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at 
org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at >
[jira] [Updated] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-19809: -- Affects Version/s: 2.1.1 > NullPointerException on empty ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2, 2.1.1 >Reporter: Michał Dawid > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > 
at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > 
org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >
[jira] [Assigned] (SPARK-20835) It should exit directly when the --total-executor-cores parameter is set to less than 0 when submitting an application
[ https://issues.apache.org/jira/browse/SPARK-20835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20835: - Assignee: eaton Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) > It should exit directly when the --total-executor-cores parameter is set to > less than 0 when submitting an application > - > > Key: SPARK-20835 > URL: https://issues.apache.org/jira/browse/SPARK-20835 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: eaton >Assignee: eaton >Priority: Minor > Fix For: 2.3.0 > > > In my test, the submitted app ran without an error when > --total-executor-cores was less than 0, > and gave the warning: > "2017-05-22 17:19:36,319 WARN org.apache.spark.scheduler.TaskSchedulerImpl: > Initial job has not accepted any resources; check your cluster UI to ensure > that workers are registered and have sufficient resources"; > It should exit directly when the --total-executor-cores parameter is set to > less than 0 when submitting an application
[jira] [Resolved] (SPARK-20835) It should exit directly when the --total-executor-cores parameter is set to less than 0 when submitting an application
[ https://issues.apache.org/jira/browse/SPARK-20835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20835. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18060 [https://github.com/apache/spark/pull/18060] > It should exit directly when the --total-executor-cores parameter is set to > less than 0 when submitting an application > - > > Key: SPARK-20835 > URL: https://issues.apache.org/jira/browse/SPARK-20835 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: eaton > Fix For: 2.3.0 > > > In my test, the submitted app ran without an error when > --total-executor-cores was less than 0, > and gave the warning: > "2017-05-22 17:19:36,319 WARN org.apache.spark.scheduler.TaskSchedulerImpl: > Initial job has not accepted any resources; check your cluster UI to ensure > that workers are registered and have sufficient resources"; > It should exit directly when the --total-executor-cores parameter is set to > less than 0 when submitting an application
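The requested behavior — rejecting a non-positive --total-executor-cores at submit time instead of hanging with "Initial job has not accepted any resources" — amounts to eager argument validation. The essence can be sketched as below; the helper name is hypothetical, and the actual check lives in spark-submit's argument parsing.

```java
public class SubmitArgs {
    // Sketch: validate eagerly so a bad value fails at submit time with a
    // clear message rather than producing a silently starved application.
    static int parseTotalExecutorCores(String value) {
        int cores = Integer.parseInt(value);
        if (cores <= 0) {
            throw new IllegalArgumentException(
                "--total-executor-cores must be a positive number, got: " + value);
        }
        return cores;
    }
}
```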
[jira] [Closed] (SPARK-20882) Executor is waiting for ShuffleBlockFetcherIterator
[ https://issues.apache.org/jira/browse/SPARK-20882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cen yuhai closed SPARK-20882. - Resolution: Not A Problem > Executor is waiting for ShuffleBlockFetcherIterator > --- > > Key: SPARK-20882 > URL: https://issues.apache.org/jira/browse/SPARK-20882 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1 >Reporter: cen yuhai > Attachments: executor_jstack, executor_log, screenshot-1.png, > screenshot-2.png, screenshot-3.png > > > This bug is like https://issues.apache.org/jira/browse/SPARK-19300. > but I have updated my client netty version to 4.0.43.Final. > The shuffle service handler is still 4.0.42.Final > spark.sql.adaptive.enabled is true > {code} > "Executor task launch worker for task 4808985" #5373 daemon prio=5 os_prio=0 > tid=0x7f54ef437000 nid=0x1aed0 waiting on condition [0x7f53aebfe000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > parking to wait for <0x000498c249c0> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:189) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:58) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) > at org.apache.spark.scheduler.Task.run(Task.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:323) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622) > at java.lang.Thread.run(Thread.java:834) > {code} > {code} > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: > Set(shuffle_5_1431_805, shuffle_5_1431_808, shuffle_5_1431_806, > shuffle_5_1431_809, shuffle_5_1431_807) > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: > Set(shuffle_5_1431_808, shuffle_5_1431_806, shuffle_5_1431_809, > shuffle_5_1431_807) > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: > Set(shuffle_5_1431_808, shuffle_5_1431_809, shuffle_5_1431_807) > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: > Set(shuffle_5_1431_808, shuffle_5_1431_809) > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: > Set(shuffle_5_1431_809) > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set() > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 21 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 20 > 
17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 19 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 18 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 17 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 16 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 15 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 14 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 13 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 12 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 11 > 17/05/26
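The jstack above shows the task thread parked in `LinkedBlockingQueue.take` inside `ShuffleBlockFetcherIterator.next`: a bare blocking take waits forever if a fetch result is silently lost (for example when the shuffle-service connection closes). A minimal pure-Python sketch of that failure mode follows; it is illustrative only, not Spark's implementation, and the `consume` helper and block names are hypothetical.

```python
import queue

def consume(results: queue.Queue, expected: int, timeout: float):
    """Drain `expected` results; fail loudly instead of hanging forever.

    A bare results.get() (like LinkedBlockingQueue.take) would block
    indefinitely when a result never arrives; a timeout surfaces the hang.
    """
    received = []
    for _ in range(expected):
        try:
            received.append(results.get(timeout=timeout))
        except queue.Empty:
            raise TimeoutError(
                "gave up after receiving %d of %d blocks" % (len(received), expected))
    return received

results = queue.Queue()
# The "fetcher" delivers only 2 of the 3 expected results, simulating a
# response lost when the connection to the shuffle service closed.
for block in ["shuffle_5_1431_805", "shuffle_5_1431_806"]:
    results.put(block)

try:
    consume(results, expected=3, timeout=0.1)
except TimeoutError as exc:
    print(exc)  # gave up after receiving 2 of 3 blocks
```

With a plain blocking `get()` the loop above would never return, which is the WAITING state captured in the executor jstack.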
[jira] [Assigned] (SPARK-20899) PySpark supports stringIndexerOrderType in RFormula
[ https://issues.apache.org/jira/browse/SPARK-20899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20899: Assignee: Apache Spark > PySpark supports stringIndexerOrderType in RFormula > --- > > Key: SPARK-20899 > URL: https://issues.apache.org/jira/browse/SPARK-20899 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.1.1 >Reporter: Wayne Zhang >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20899) PySpark supports stringIndexerOrderType in RFormula
[ https://issues.apache.org/jira/browse/SPARK-20899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20899: Assignee: (was: Apache Spark) > PySpark supports stringIndexerOrderType in RFormula > --- > > Key: SPARK-20899 > URL: https://issues.apache.org/jira/browse/SPARK-20899 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.1.1 >Reporter: Wayne Zhang > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20899) PySpark supports stringIndexerOrderType in RFormula
[ https://issues.apache.org/jira/browse/SPARK-20899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026489#comment-16026489 ] Apache Spark commented on SPARK-20899: -- User 'actuaryzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/18122 > PySpark supports stringIndexerOrderType in RFormula > --- > > Key: SPARK-20899 > URL: https://issues.apache.org/jira/browse/SPARK-20899 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.1.1 >Reporter: Wayne Zhang > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20899) PySpark supports stringIndexerOrderType in RFormula
Wayne Zhang created SPARK-20899: --- Summary: PySpark supports stringIndexerOrderType in RFormula Key: SPARK-20899 URL: https://issues.apache.org/jira/browse/SPARK-20899 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 2.1.1 Reporter: Wayne Zhang -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
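The `stringIndexerOrderType` this ticket exposes in PySpark controls how StringIndexer assigns indices to string categories. The following is a pure-Python sketch of the four ordering modes only; it is not the PySpark API being added, and the tie-breaking shown for the frequency modes is an assumption for illustration.

```python
from collections import Counter

def string_indexer_labels(values, order_type="frequencyDesc"):
    """Sketch of StringIndexer ordering semantics (not the Spark API).

    Supports the four modes: frequencyDesc, frequencyAsc,
    alphabetDesc, alphabetAsc. Frequency ties are broken
    alphabetically here (an illustrative choice).
    """
    counts = Counter(values)
    if order_type == "frequencyDesc":
        labels = sorted(counts, key=lambda lbl: (-counts[lbl], lbl))
    elif order_type == "frequencyAsc":
        labels = sorted(counts, key=lambda lbl: (counts[lbl], lbl))
    elif order_type == "alphabetDesc":
        labels = sorted(counts, reverse=True)
    elif order_type == "alphabetAsc":
        labels = sorted(counts)
    else:
        raise ValueError("unknown orderType: %s" % order_type)
    return {label: index for index, label in enumerate(labels)}

# Default frequencyDesc: the most frequent category gets index 0.
print(string_indexer_labels(["a", "b", "b", "c", "c", "c"]))  # {'c': 0, 'b': 1, 'a': 2}
```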
[jira] [Created] (SPARK-20898) spark.blacklist.killBlacklistedExecutors doesn't work in YARN
Thomas Graves created SPARK-20898: - Summary: spark.blacklist.killBlacklistedExecutors doesn't work in YARN Key: SPARK-20898 URL: https://issues.apache.org/jira/browse/SPARK-20898 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Thomas Graves I was trying out the new spark.blacklist.killBlacklistedExecutors on YARN but it doesn't appear to work. Every time I get: 17/05/26 16:28:12 WARN BlacklistTracker: Not attempting to kill blacklisted executor id 4 since allocation client is not defined This happens even though dynamic allocation is on. Taking a quick look, I think the way it creates the BlacklistTracker and passes in the allocation client is wrong: the scheduler backend is not set yet, so the allocation client is never passed to the BlacklistTracker correctly. Thus it will never kill the executor. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
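The failure described above is an initialization-order bug: a dependency is captured eagerly before the component that provides it has been wired up. The sketch below illustrates that general pattern in pure Python; the class and field names are hypothetical and are not Spark's actual BlacklistTracker code.

```python
class Scheduler:
    def __init__(self):
        self.backend = None  # the backend (which owns the allocation client) is set later

class EagerTracker:
    """Captures the allocation client at construction time: too early."""
    def __init__(self, scheduler):
        self.client = scheduler.backend  # backend not set yet, so this stays None

    def kill(self, executor_id):
        if self.client is None:
            return "WARN: not killing %s: allocation client is not defined" % executor_id
        return "killed %s" % executor_id

class LazyTracker:
    """Resolves the allocation client at kill time, after wiring is complete."""
    def __init__(self, scheduler):
        self.scheduler = scheduler

    def kill(self, executor_id):
        client = self.scheduler.backend  # looked up only when actually needed
        if client is None:
            return "WARN: not killing %s: allocation client is not defined" % executor_id
        return "killed %s" % executor_id

sched = Scheduler()
eager, lazy = EagerTracker(sched), LazyTracker(sched)
sched.backend = "yarn-allocation-client"  # backend becomes available later

print(eager.kill("4"))  # still warns: it captured None at construction
print(lazy.kill("4"))   # succeeds: the client is resolved lazily
```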
[jira] [Assigned] (SPARK-20897) cached self-join should not fail
[ https://issues.apache.org/jira/browse/SPARK-20897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20897: Assignee: Wenchen Fan (was: Apache Spark) > cached self-join should not fail > > > Key: SPARK-20897 > URL: https://issues.apache.org/jira/browse/SPARK-20897 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > > code to reproduce this bug: > {code} > // force to plan sort merge join > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0") > val df = Seq(1 -> "a").toDF("i", "j") > val df1 = df.as("t1") > val df2 = df.as("t2") > assert(df1.join(df2, $"t1.i" === $"t2.i").cache().count() == 1) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20897) cached self-join should not fail
[ https://issues.apache.org/jira/browse/SPARK-20897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20897: Assignee: Apache Spark (was: Wenchen Fan) > cached self-join should not fail > > > Key: SPARK-20897 > URL: https://issues.apache.org/jira/browse/SPARK-20897 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > > code to reproduce this bug: > {code} > // force to plan sort merge join > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0") > val df = Seq(1 -> "a").toDF("i", "j") > val df1 = df.as("t1") > val df2 = df.as("t2") > assert(df1.join(df2, $"t1.i" === $"t2.i").cache().count() == 1) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20897) cached self-join should not fail
[ https://issues.apache.org/jira/browse/SPARK-20897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026392#comment-16026392 ] Apache Spark commented on SPARK-20897: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/18121 > cached self-join should not fail > > > Key: SPARK-20897 > URL: https://issues.apache.org/jira/browse/SPARK-20897 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > > code to reproduce this bug: > {code} > // force to plan sort merge join > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0") > val df = Seq(1 -> "a").toDF("i", "j") > val df1 = df.as("t1") > val df2 = df.as("t2") > assert(df1.join(df2, $"t1.i" === $"t2.i").cache().count() == 1) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20897) cached self-join should not fail
Wenchen Fan created SPARK-20897: --- Summary: cached self-join should not fail Key: SPARK-20897 URL: https://issues.apache.org/jira/browse/SPARK-20897 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Wenchen Fan Assignee: Wenchen Fan code to reproduce this bug: {code} // force to plan sort merge join spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0") val df = Seq(1 -> "a").toDF("i", "j") val df1 = df.as("t1") val df2 = df.as("t2") assert(df1.join(df2, $"t1.i" === $"t2.i").cache().count() == 1) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20775) from_json should also have an API where the schema is specified with a string
[ https://issues.apache.org/jira/browse/SPARK-20775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-20775: --- Assignee: Ruben Janssen > from_json should also have an API where the schema is specified with a string > - > > Key: SPARK-20775 > URL: https://issues.apache.org/jira/browse/SPARK-20775 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Burak Yavuz >Assignee: Ruben Janssen > Fix For: 2.3.0 > > > Right now you also have to provide a java.util.Map which is not nice for > Scala users. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20887) support alternative keys in ConfigBuilder
[ https://issues.apache.org/jira/browse/SPARK-20887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20887. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18110 [https://github.com/apache/spark/pull/18110] > support alternative keys in ConfigBuilder > - > > Key: SPARK-20887 > URL: https://issues.apache.org/jira/browse/SPARK-20887 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
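"Alternative keys" in a config builder means a setting can be read under its primary key with fallback to older, deprecated key names, so renamed settings keep working. The sketch below shows that lookup pattern in pure Python; it is illustrative only and not Spark's ConfigBuilder API, and all names in it are hypothetical.

```python
class ConfigEntry:
    """A config entry readable under a primary key or older alternative keys."""
    def __init__(self, key, alternatives=(), default=None):
        self.key = key
        self.alternatives = tuple(alternatives)
        self.default = default

    def read(self, settings):
        # The primary key wins; deprecated alternatives are tried in order.
        for k in (self.key, *self.alternatives):
            if k in settings:
                return settings[k]
        return self.default

entry = ConfigEntry("spark.new.name", alternatives=["spark.old.name"], default="fallback")
print(entry.read({"spark.old.name": "42"}))  # old key still honored: 42
print(entry.read({}))                        # neither key set: fallback
```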
[jira] [Commented] (SPARK-19669) Open up visibility for sharedState, sessionState, and a few other functions
[ https://issues.apache.org/jira/browse/SPARK-19669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026219#comment-16026219 ] Steve Loughran commented on SPARK-19669: thanks for this, very nice to have Logging usable outside Spark's own codebase again; I had my own copy & paste but when you start playing with traits across things, life gets hard > Open up visibility for sharedState, sessionState, and a few other functions > --- > > Key: SPARK-19669 > URL: https://issues.apache.org/jira/browse/SPARK-19669 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.2.0 > > > To ease debugging, most of Spark SQL internals have public level visibility. > Two of the most important internal states, sharedState and sessionState, > however, are package private. It would make more sense to open these up as > well with clear documentation that they are internal. > In addition, users currently have way to set active/default SparkSession, but > no way to actually get them back. We should open those up as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20199) GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-20199: - Description: Spark GradientBoostedTreesModel doesn't have featureSubsetStrategy. It uses random forest internally, which has featureSubsetStrategy hardcoded to "all". It should be configurable by the user to allow randomness at the feature level. This parameter is available in H2O and XGBoost. Sample from H2O.ai: gbmParams._col_sample_rate Please provide the parameter. was: Spark GradientBoostedTreesModel doesn't have a column sampling rate parameter. This parameter is available in H2O and XGBoost. Sample from H2O.ai: gbmParams._col_sample_rate Please provide the parameter. > GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar > > Spark GradientBoostedTreesModel doesn't have featureSubsetStrategy. It uses > random forest internally, which has featureSubsetStrategy hardcoded to "all". > It should be configurable by the user to allow randomness at the feature level. > This parameter is available in H2O and XGBoost. > Sample from H2O.ai: > gbmParams._col_sample_rate > Please provide the parameter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20896) spark executor get java.lang.ClassCastException when trigger two job at same time
[ https://issues.apache.org/jira/browse/SPARK-20896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] poseidon updated SPARK-20896: - Description: 1、zeppelin 0.6.2 in *SCOPE* mode 2、spark 1.6.2 3、HDP 2.4 for HDFS YARN trigger scala code like : {quote} var tmpDataFrame = sql(" select b1,b2,b3 from xxx.x") val vectorDf = assembler.transform(tmpDataFrame) val vectRdd = vectorDf.select("features").map{x:Row => x.getAs[Vector](0)} val correlMatrix: Matrix = Statistics.corr(vectRdd, "spearman") val columns = correlMatrix.toArray.grouped(correlMatrix.numRows) val rows = columns.toSeq.transpose val vectors = rows.map(row => new DenseVector(row.toArray)) val vRdd = sc.parallelize(vectors) import sqlContext.implicits._ val dfV = vRdd.map(_.toArray).map{ case Array(b1,b2,b3) => (b1,b2,b3) }.toDF() val rows = dfV.rdd.zipWithIndex.map(_.swap) .join(sc.parallelize(Array("b1","b2","b3")).zipWithIndex.map(_.swap)) .values.map{case (row: Row, x: String) => Row.fromSeq(row.toSeq :+ x)} {quote} --- and code : {quote} var df = sql("select b1,b2 from .x") var i = 0 var threshold = Array(2.0,3.0) var inputCols = Array("b1","b2") var tmpDataFrame = df for (col <- inputCols){ val binarizer: Binarizer = new Binarizer().setInputCol(col) .setOutputCol(inputCols(i)+"_binary") .setThreshold(threshold(i)) tmpDataFrame = binarizer.transform(tmpDataFrame).drop(inputCols(i)) i = i+1 } var saveDFBin = tmpDataFrame val dfAppendBin = sql("select b3 from poseidon.corelatdemo") val rows = saveDFBin.rdd.zipWithIndex.map(_.swap) .join(dfAppendBin.rdd.zipWithIndex.map(_.swap)) .values.map{case (row1: Row, row2: Row) => Row.fromSeq(row1.toSeq ++ row2.toSeq)} import org.apache.spark.sql.types.StructType val rowSchema = StructType(saveDFBin.schema.fields ++ dfAppendBin.schema.fields) saveDFBin = sqlContext.createDataFrame(rows, rowSchema) //save result to table import org.apache.spark.sql.SaveMode saveDFBin.write.mode(SaveMode.Overwrite).saveAsTable(".") sql("alter table . 
set lifecycle 1") {quote} on zeppelin with two different notebook at same time. Found this exception log in executor : {quote} l1.dtdream.com): java.lang.ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot be cast to scala.Tuple2 at $line127359816836.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:34) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597) at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52) at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1875) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1875) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {quote} OR {quote} java.lang.ClassCastException: scala.Tuple2 cannot be cast to org.apache.spark.mllib.linalg.DenseVector at $line34684895436.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:57) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at 
org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {quote} some log from executor: {quote} 17/05/26 16:39:44 INFO executor.Executor: Running task 3.1 in stage 36.0 (TID 598) 17/05/26 16:39:44 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 30 17/05/26 16:39:44 INFO storage.MemoryStore: Block broadcast_30_piece0 stored as
[jira] [Updated] (SPARK-20896) spark executor get java.lang.ClassCastException when trigger two job at same time
[ https://issues.apache.org/jira/browse/SPARK-20896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] poseidon updated SPARK-20896: - Description: 1、zeppelin 0.6.2 2、spark 1.6.2 3、hdp 2.4 for HDFS YARN trigger scala code like : {quote} var tmpDataFrame = sql(" select b1,b2,b3 from xxx.x") val vectorDf = assembler.transform(tmpDataFrame) val vectRdd = vectorDf.select("features").map{x:Row => x.getAs[Vector](0)} val correlMatrix: Matrix = Statistics.corr(vectRdd, "spearman") val columns = correlMatrix.toArray.grouped(correlMatrix.numRows) val rows = columns.toSeq.transpose val vectors = rows.map(row => new DenseVector(row.toArray)) val vRdd = sc.parallelize(vectors) import sqlContext.implicits._ val dfV = vRdd.map(_.toArray).map{ case Array(b1,b2,b3) => (b1,b2,b3) }.toDF() val rows = dfV.rdd.zipWithIndex.map(_.swap) .join(sc.parallelize(Array("b1","b2","b3")).zipWithIndex.map(_.swap)) .values.map{case (row: Row, x: String) => Row.fromSeq(row.toSeq :+ x)} {quote} --- and code : {quote} var df = sql("select b1,b2 from .x") var i = 0 var threshold = Array(2.0,3.0) var inputCols = Array("b1","b2") var tmpDataFrame = df for (col <- inputCols){ val binarizer: Binarizer = new Binarizer().setInputCol(col) .setOutputCol(inputCols(i)+"_binary") .setThreshold(threshold(i)) tmpDataFrame = binarizer.transform(tmpDataFrame).drop(inputCols(i)) i = i+1 } var saveDFBin = tmpDataFrame val dfAppendBin = sql("select b3 from poseidon.corelatdemo") val rows = saveDFBin.rdd.zipWithIndex.map(_.swap) .join(dfAppendBin.rdd.zipWithIndex.map(_.swap)) .values.map{case (row1: Row, row2: Row) => Row.fromSeq(row1.toSeq ++ row2.toSeq)} import org.apache.spark.sql.types.StructType val rowSchema = StructType(saveDFBin.schema.fields ++ dfAppendBin.schema.fields) saveDFBin = sqlContext.createDataFrame(rows, rowSchema) //save result to table import org.apache.spark.sql.SaveMode saveDFBin.write.mode(SaveMode.Overwrite).saveAsTable(".") sql("alter table . 
set lifecycle 1") {quote} on Zeppelin from two different notebooks at the same time. Found this exception log in the executor: {quote} l1.dtdream.com): java.lang.ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot be cast to scala.Tuple2 at $line127359816836.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:34) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597) at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52) at org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1875) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1875) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {quote} OR {quote} java.lang.ClassCastException: scala.Tuple2 cannot be cast to org.apache.spark.mllib.linalg.DenseVector at $line34684895436.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:57) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at 
org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {quote} These two exceptions never appear in pairs. was: 1、zeppelin 0.6.2 2、spark 1.6.2 3、hdp 2.4 for HDFS YARN trigger scala code like : {quote} var tmpDataFrame = sql(" select b1,b2,b3 from xxx.x") val vectorDf = assembler.transform(tmpDataFrame) val vectRdd = vectorDf.select("features").map{x:Row =>
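Both reproductions above append columns from one table to another with the zipWithIndex-then-join trick. A pure-Python sketch of that trick follows, for readers unfamiliar with the pattern; it is illustrative only (real Spark RDDs do this in a distributed fashion, and the pairing is only meaningful when both sides preserve the same row order), and the `append_columns` helper is hypothetical.

```python
def append_columns(left_rows, right_rows):
    """Pair rows from two tables by positional index and concatenate them.

    Mirrors: left.zipWithIndex.map(_.swap).join(right.zipWithIndex.map(_.swap)).values
    Rows present on only one side are dropped, as in an inner join on the index.
    """
    left_indexed = {i: row for i, row in enumerate(left_rows)}
    right_indexed = {i: row for i, row in enumerate(right_rows)}
    common = sorted(left_indexed.keys() & right_indexed.keys())
    return [left_indexed[i] + right_indexed[i] for i in common]

# Append a "b3" column to two binarized rows, keyed purely by row position.
rows = append_columns([(1.0, 0.0), (0.0, 1.0)], [("b3-1",), ("b3-2",)])
print(rows)  # [(1.0, 0.0, 'b3-1'), (0.0, 1.0, 'b3-2')]
```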
[jira] [Updated] (SPARK-20896) spark executor get java.lang.ClassCastException when trigger two job at same time
[ https://issues.apache.org/jira/browse/SPARK-20896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] poseidon updated SPARK-20896: - Description: 1、zeppelin 0.6.2 2、spark 1.6.2 3、hdp 2.4 for HDFS YARN trigger scala code like : {quote} var tmpDataFrame = sql(" select b1,b2,b3 from xxx.x") val vectorDf = assembler.transform(tmpDataFrame) val vectRdd = vectorDf.select("features").map{x:Row => x.getAs[Vector](0)} val correlMatrix: Matrix = Statistics.corr(vectRdd, "spearman") val columns = correlMatrix.toArray.grouped(correlMatrix.numRows) val rows = columns.toSeq.transpose val vectors = rows.map(row => new DenseVector(row.toArray)) val vRdd = sc.parallelize(vectors) import sqlContext.implicits._ val dfV = vRdd.map(_.toArray).map{ case Array(b1,b2,b3) => (b1,b2,b3) }.toDF() val rows = dfV.rdd.zipWithIndex.map(_.swap) .join(sc.parallelize(Array("b1","b2","b3")).zipWithIndex.map(_.swap)) .values.map{case (row: Row, x: String) => Row.fromSeq(row.toSeq :+ x)} {quote} --- and code : {quote} var df = sql("select b1,b2 from .x") var i = 0 var threshold = Array(2.0,3.0) var inputCols = Array("b1","b2") var tmpDataFrame = df for (col <- inputCols){ val binarizer: Binarizer = new Binarizer().setInputCol(col) .setOutputCol(inputCols(i)+"_binary") .setThreshold(threshold(i)) tmpDataFrame = binarizer.transform(tmpDataFrame).drop(inputCols(i)) i = i+1 } var saveDFBin = tmpDataFrame val dfAppendBin = sql("select b3 from poseidon.corelatdemo") val rows = saveDFBin.rdd.zipWithIndex.map(_.swap) .join(dfAppendBin.rdd.zipWithIndex.map(_.swap)) .values.map{case (row1: Row, row2: Row) => Row.fromSeq(row1.toSeq ++ row2.toSeq)} import org.apache.spark.sql.types.StructType val rowSchema = StructType(saveDFBin.schema.fields ++ dfAppendBin.schema.fields) saveDFBin = sqlContext.createDataFrame(rows, rowSchema) //save result to table import org.apache.spark.sql.SaveMode saveDFBin.write.mode(SaveMode.Overwrite).saveAsTable(".") sql("alter table . 
set lifecycle 1") {quote} at the same time was: 1、zeppelin 0.6.2 2、spark 1.6.2 3、hdp 2.4 for HDFS YARN trigger scala code like > spark executor get java.lang.ClassCastException when trigger two job at same > time > - > > Key: SPARK-20896 > URL: https://issues.apache.org/jira/browse/SPARK-20896 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 >Reporter: poseidon > > 1、zeppelin 0.6.2 > 2、spark 1.6.2 > 3、hdp 2.4 for HDFS YARN > trigger scala code like : > {quote} > var tmpDataFrame = sql(" select b1,b2,b3 from xxx.x") > val vectorDf = assembler.transform(tmpDataFrame) > val vectRdd = vectorDf.select("features").map{x:Row => x.getAs[Vector](0)} > val correlMatrix: Matrix = Statistics.corr(vectRdd, "spearman") > val columns = correlMatrix.toArray.grouped(correlMatrix.numRows) > val rows = columns.toSeq.transpose > val vectors = rows.map(row => new DenseVector(row.toArray)) > val vRdd = sc.parallelize(vectors) > import sqlContext.implicits._ > val dfV = vRdd.map(_.toArray).map{ case Array(b1,b2,b3) => (b1,b2,b3) }.toDF() > val rows = dfV.rdd.zipWithIndex.map(_.swap) > > .join(sc.parallelize(Array("b1","b2","b3")).zipWithIndex.map(_.swap)) > .values.map{case (row: Row, x: String) => Row.fromSeq(row.toSeq > :+ x)} > {quote} > --- > and code : > {quote} > var df = sql("select b1,b2 from .x") > var i = 0 > var threshold = Array(2.0,3.0) > var inputCols = Array("b1","b2") > var tmpDataFrame = df > for (col <- inputCols){ > val binarizer: Binarizer = new Binarizer().setInputCol(col) > .setOutputCol(inputCols(i)+"_binary") > .setThreshold(threshold(i)) > tmpDataFrame = binarizer.transform(tmpDataFrame).drop(inputCols(i)) > i = i+1 > } > var saveDFBin = tmpDataFrame > val dfAppendBin = sql("select b3 from poseidon.corelatdemo") > val rows = saveDFBin.rdd.zipWithIndex.map(_.swap) > .join(dfAppendBin.rdd.zipWithIndex.map(_.swap)) > .values.map{case (row1: Row, row2: Row) => Row.fromSeq(row1.toSeq > ++ row2.toSeq)} > import 
org.apache.spark.sql.types.StructType > val rowSchema = StructType(saveDFBin.schema.fields ++ > dfAppendBin.schema.fields) > saveDFBin = sqlContext.createDataFrame(rows, rowSchema) > //save result to table > import org.apache.spark.sql.SaveMode > saveDFBin.write.mode(SaveMode.Overwrite).saveAsTable(".") > sql("alter table . set lifecycle 1") > {quote} > at the same time -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail:
[jira] [Updated] (SPARK-20896) spark executor get java.lang.ClassCastException when trigger two job at same time
[ https://issues.apache.org/jira/browse/SPARK-20896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] poseidon updated SPARK-20896: - Attachment: (was: token_err.log) > spark executor get java.lang.ClassCastException when trigger two job at same > time > - > > Key: SPARK-20896 > URL: https://issues.apache.org/jira/browse/SPARK-20896 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 >Reporter: poseidon > > 1、zeppelin 0.6.2 > 2、spark 1.6.2 > 3、hdp 2.4 for HDFS YARN > trigger scala code like -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20896) spark executor get java.lang.ClassCastException when trigger two job at same time
[ https://issues.apache.org/jira/browse/SPARK-20896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] poseidon updated SPARK-20896: - Attachment: token_err.log > spark executor get java.lang.ClassCastException when trigger two job at same > time > - > > Key: SPARK-20896 > URL: https://issues.apache.org/jira/browse/SPARK-20896 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 >Reporter: poseidon > > 1、zeppelin 0.6.2 > 2、spark 1.6.2 > 3、hdp 2.4 for HDFS YARN > trigger scala code like
[jira] [Created] (SPARK-20896) spark executor get java.lang.ClassCastException when trigger two job at same time
poseidon created SPARK-20896: Summary: spark executor get java.lang.ClassCastException when trigger two job at same time Key: SPARK-20896 URL: https://issues.apache.org/jira/browse/SPARK-20896 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.6.1 Reporter: poseidon 1、zeppelin 0.6.2 2、spark 1.6.2 3、hdp 2.4 for HDFS YARN trigger scala code like
[jira] [Comment Edited] (SPARK-20882) Executor is waiting for ShuffleBlockFetcherIterator
[ https://issues.apache.org/jira/browse/SPARK-20882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026040#comment-16026040 ] cen yuhai edited comment on SPARK-20882 at 5/26/17 9:13 AM: [~zsxwing] yes. {code} 17/05/26 12:04:14 INFO TransportResponseHandler: Still have 1 requests outstanding when connection from 10.0.139.110:7337 is closed 17/05/26 12:04:14 INFO TransportResponseHandler: Still have 1 requests outstanding when connection from 10.0.139.110:7337 is closed {code} Before the connection closed with 1 request still outstanding, the remainingBlocks set was already empty. {code} 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set(shuffle_5_1431_809) 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set() {code} > Executor is waiting for ShuffleBlockFetcherIterator > --- > > Key: SPARK-20882 > URL: https://issues.apache.org/jira/browse/SPARK-20882 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1 >Reporter: cen yuhai > Attachments: executor_jstack, executor_log, screenshot-1.png, > screenshot-2.png, screenshot-3.png > > > This bug is like https://issues.apache.org/jira/browse/SPARK-19300. > but I have updated my client netty version to 4.0.43.Final.
> The shuffle service handler is still 4.0.42.Final > spark.sql.adaptive.enabled is true > {code} > "Executor task launch worker for task 4808985" #5373 daemon prio=5 os_prio=0 > tid=0x7f54ef437000 nid=0x1aed0 waiting on condition [0x7f53aebfe000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > parking to wait for <0x000498c249c0> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:189) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:58) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) > at 
org.apache.spark.scheduler.Task.run(Task.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:323) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622) > at java.lang.Thread.run(Thread.java:834) > {code} > {code} > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: > Set(shuffle_5_1431_805, shuffle_5_1431_808, shuffle_5_1431_806, > shuffle_5_1431_809, shuffle_5_1431_807) > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: > Set(shuffle_5_1431_808, shuffle_5_1431_806, shuffle_5_1431_809, > shuffle_5_1431_807) > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: > Set(shuffle_5_1431_808, shuffle_5_1431_809, shuffle_5_1431_807) > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: > Set(shuffle_5_1431_808, shuffle_5_1431_809) > 17/05/26
[jira] [Commented] (SPARK-20882) Executor is waiting for ShuffleBlockFetcherIterator
[ https://issues.apache.org/jira/browse/SPARK-20882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026040#comment-16026040 ] cen yuhai commented on SPARK-20882: --- yes. {code} 17/05/26 12:04:14 INFO TransportResponseHandler: Still have 1 requests outstanding when connection from 10.0.139.110:7337 is closed 17/05/26 12:04:14 INFO TransportResponseHandler: Still have 1 requests outstanding when connection from 10.0.139.110:7337 is closed {code} Before the connection closed with 1 request still outstanding, the remainingBlocks set was already empty. {code} 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set(shuffle_5_1431_809) 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set() {code} > Executor is waiting for ShuffleBlockFetcherIterator > --- > > Key: SPARK-20882 > URL: https://issues.apache.org/jira/browse/SPARK-20882 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1 >Reporter: cen yuhai > Attachments: executor_jstack, executor_log, screenshot-1.png, > screenshot-2.png, screenshot-3.png > > > This bug is like https://issues.apache.org/jira/browse/SPARK-19300. > but I have updated my client netty version to 4.0.43.Final.
> The shuffle service handler is still 4.0.42.Final > spark.sql.adaptive.enabled is true > {code} > "Executor task launch worker for task 4808985" #5373 daemon prio=5 os_prio=0 > tid=0x7f54ef437000 nid=0x1aed0 waiting on condition [0x7f53aebfe000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > parking to wait for <0x000498c249c0> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:189) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) > at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:58) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) > at 
org.apache.spark.scheduler.Task.run(Task.scala:114) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:323) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622) > at java.lang.Thread.run(Thread.java:834) > {code} > {code} > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: > Set(shuffle_5_1431_805, shuffle_5_1431_808, shuffle_5_1431_806, > shuffle_5_1431_809, shuffle_5_1431_807) > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: > Set(shuffle_5_1431_808, shuffle_5_1431_806, shuffle_5_1431_809, > shuffle_5_1431_807) > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: > Set(shuffle_5_1431_808, shuffle_5_1431_809, shuffle_5_1431_807) > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: > Set(shuffle_5_1431_808, shuffle_5_1431_809) > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: > Set(shuffle_5_1431_809) > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set() > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 21 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 20 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 19 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 18 > 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in > flight 17 >
[jira] [Updated] (SPARK-20882) Executor is waiting for ShuffleBlockFetcherIterator
[ https://issues.apache.org/jira/browse/SPARK-20882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cen yuhai updated SPARK-20882: -- Description: This bug is like https://issues.apache.org/jira/browse/SPARK-19300. but I have updated my client netty version to 4.0.43.Final. The shuffle service handler is still 4.0.42.Final spark.sql.adaptive.enabled is true {code} "Executor task launch worker for task 4808985" #5373 daemon prio=5 os_prio=0 tid=0x7f54ef437000 nid=0x1aed0 waiting on condition [0x7f53aebfe000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) parking to wait for <0x000498c249c0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:189) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:58) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199) at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) at org.apache.spark.scheduler.Task.run(Task.scala:114) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:323) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622) at java.lang.Thread.run(Thread.java:834) {code} {code} 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set(shuffle_5_1431_805, shuffle_5_1431_808, shuffle_5_1431_806, shuffle_5_1431_809, shuffle_5_1431_807) 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set(shuffle_5_1431_808, shuffle_5_1431_806, shuffle_5_1431_809, shuffle_5_1431_807) 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set(shuffle_5_1431_808, shuffle_5_1431_809, shuffle_5_1431_807) 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set(shuffle_5_1431_808, shuffle_5_1431_809) 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set(shuffle_5_1431_809) 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: remainingBlocks: Set() 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in flight 21 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in flight 20 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in flight 19 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in flight 18 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in flight 17 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in flight 16 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in flight 15 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests 
in flight 14 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in flight 13 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in flight 12 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in flight 11 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in flight 10 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in flight 9 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in flight 8 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in flight 7 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in flight 6 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in flight 5 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in flight 4 17/05/26 12:04:06 DEBUG ShuffleBlockFetcherIterator: Number of requests in
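The jstack and DEBUG logs above show the worker parked in LinkedBlockingQueue.take: the iterator expects one result per outstanding request, so if a connection closes without a failure result ever being enqueued, the consumer blocks forever. A minimal Python sketch of that failure mode, with queue.Queue standing in for the Java blocking queue (illustrative only, not Spark code):

```python
# Sketch of the hang: a consumer blocks on results.get() (like
# LinkedBlockingQueue.take in ShuffleBlockFetcherIterator.next). If the
# connection dies and neither a success nor a failure result is enqueued,
# an unbounded take() parks the thread forever. A bounded wait raises
# instead of hanging, which surfaces the stall.
import queue

results = queue.Queue()
# Simulate "Still have 1 requests outstanding when connection ... is closed":
# the outstanding fetch result is simply never put on the queue.
try:
    results.get(timeout=0.1)  # bounded wait instead of an indefinite take()
    stalled = False
except queue.Empty:
    stalled = True            # the stall is detected rather than silent
```

In the real fix the transport layer has to convert a closed connection into an explicit failure result for every outstanding block, so the blocking take() always receives something.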
[jira] [Issue Comment Deleted] (SPARK-20199) GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-20199: - Comment: was deleted (was: Shouldn't there be a parameter in GBMParameters() val gbmParams = new GBMParameters() gbmParams._featuresubsetStrategy="auto" and then in private[ml] def train(data: RDD[LabeledPoint], oldStrategy: OldStrategy): DecisionTreeRegressionModel = { we should have oldStrategy.getStrategy() and set it in val trees = RandomForest.run(data, oldStrategy, numTrees = 1, featureSubsetStrategy = oldStrategy.getStrategy(), seed = $(seed), instr = Some(instr), parentUID = Some(uid))) > GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar > > Spark GradientBoostedTreesModel doesn't have a column sampling rate > parameter. This parameter is available in H2O and XGBoost. > Sample from H2O.ai > gbmParams._col_sample_rate > Please provide the parameter.
[jira] [Comment Edited] (SPARK-19581) running NaiveBayes model with 0 features can crash the executor with D rorreGEMV
[ https://issues.apache.org/jira/browse/SPARK-19581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16015345#comment-16015345 ] Yan Facai (颜发才) edited comment on SPARK-19581 at 5/26/17 9:02 AM: -- [~barrybecker4] Hi, Becker. I can't reproduce the bug on spark-2.1.1-bin-hadoop2.7. 1) For a feature vector of size 0, the exception is harmless. {code} val data = spark.read.format("libsvm").load("/user/facai/data/libsvm/sample_libsvm_data.txt").cache import org.apache.spark.ml.classification.NaiveBayes val model = new NaiveBayes().fit(data) import org.apache.spark.ml.linalg.{Vectors => SV} case class TestData(features: org.apache.spark.ml.linalg.Vector) val emptyVector = SV.sparse(0, Array.empty[Int], Array.empty[Double]) val test = Seq(TestData(emptyVector)).toDF scala> test.show +-+ | features| +-+ |(0,[],[])| +-+ scala> model.transform(test).show org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => vector) at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1072) ... 48 elided Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 692, x: 0 at scala.Predef$.require(Predef.scala:224) ... 99 more {code} 2) For an empty feature vector of size 692, it's OK. {code} scala> val emptyVector = SV.sparse(692, Array.empty[Int], Array.empty[Double]) emptyVector: org.apache.spark.ml.linalg.Vector = (692,[],[]) scala> val test = Seq(TestData(emptyVector)).toDF test: org.apache.spark.sql.DataFrame = [features: vector] scala> test.show +---+ | features| +---+ |(692,[],[])| +---+ scala> model.transform(test).show +---+++--+ | features| rawPrediction| probability|prediction| +---+++--+ |(692,[],[])|[-0.8407831793660...|[0.43137254901960...| 1.0| +---+++--+ {code} > running NaiveBayes model with 0 features can crash the executor with D > rorreGEMV > > > Key: SPARK-19581 > URL: https://issues.apache.org/jira/browse/SPARK-19581 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 > Environment: spark development or standalone mode on windows or linux. >Reporter: Barry Becker >Priority: Minor > > The severity of this bug is high (because nothing should cause spark to crash > like this) but the priority may be low (because there is an easy workaround).
> In our application, a user can select features and a target to run the > NaiveBayes inducer. If columns have too many values or all one value, they > will be removed before we call the inducer to create the model. As a result, > there are
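The crash reported above happens inside native BLAS, which aborts the whole process rather than raising an exception. A hedged sketch of the kind of guard that would turn the zero-feature case into an ordinary, catchable error (a pure-Python stand-in for gemv; this is not Spark's actual code path):

```python
# Hedged sketch, not Spark's implementation: validate dimensions before the
# matrix-vector product, so a 0-length feature vector raises a clear
# ValueError instead of aborting the process inside native BLAS with
# "** On entry to DGEMV parameter number 6 had an illegal value".
def gemv(matrix, x):
    for row in matrix:
        if len(row) != len(x):
            raise ValueError(
                f"The columns of A don't match the number of elements of x. "
                f"A: {len(row)}, x: {len(x)}")
    # plain-Python stand-in for the BLAS dgemv computation
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

theta = [[0.1] * 692]   # one row of model coefficients (692 features)
empty_features = []     # what a dataset with all features removed produces
try:
    gemv(theta, empty_features)
    failed_cleanly = False
except ValueError:
    failed_cleanly = True   # clear exception; the executor keeps running
```

The guard costs one length check per row and keeps the failure inside the JVM/Python error-handling model instead of letting the native library terminate the executor.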
[jira] [Commented] (SPARK-20895) Support fast execution based on an optimized plan and parameter placeholders
[ https://issues.apache.org/jira/browse/SPARK-20895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026028#comment-16026028 ] Reynold Xin commented on SPARK-20895: - Have you seen a specific case in practice in which the optimizer is the bottleneck? Maybe we should improve optimizer speed and add these in the future when needed ... it's a lot of complexity. > Support fast execution based on an optimized plan and parameter placeholders > > > Key: SPARK-20895 > URL: https://issues.apache.org/jira/browse/SPARK-20895 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > In database scenarios, users sometimes use parameterized queries for repeated > execution (e.g., by using prepared statements). > So, I think this functionality is also useful for Spark users. > What I suggest here seems to be like: > My prototype here: > https://github.com/apache/spark/compare/master...maropu:PreparedStmt2 > {code} > scala> Seq((1, 2), (2, 3)).toDF("col1", "col2").createOrReplaceTempView("t") > // Define a query with a parameter placeholder named `val` > scala> val df = sql("SELECT * FROM t WHERE col1 = $val") > scala> df.explain > == Physical Plan == > *Project [_1#13 AS col1#16, _2#14 AS col2#17] > +- *Filter (_1#13 = cast(parameterholder(val) as int)) >+- LocalTableScan [_1#13, _2#14] > // Apply optimizer rules and get an optimized logical plan with the parameter > placeholder > scala> val preparedDf = df.prepared > // Bind an actual value and do execution > scala> preparedDf.bindParam("val", 1).show() > +++ > |col1|col2| > +++ > | 1| 2| > +++ > {code} > To implement this, my prototype adds a new expression leaf node named > `ParameterHolder`. > In a binding phase, this node is replaced with `Literal` including an actual > value by using `bindParam`. > Currently, Spark sometimes consumes much time to rewrite logical plans in > `Optimizer` (e.g. 
constant propagation described in SPARK-19846). > So, I feel this approach is also helpful in that case: > {code} > def timer[R](f: => {}): Unit = { > val count = 9 > val iters = (0 until count).map { i => > val t0 = System.nanoTime() > f > val t1 = System.nanoTime() > val elapsed = t1 - t0 + 0.0 > println(s"#$i: ${elapsed / 10.0}") > elapsed > } > println("Avg. Elapsed Time: " + ((iters.sum / count) / 10.0) + "s") > } > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val numCols = 50 > val df = spark.range(100).selectExpr((0 until numCols).map(i => s"id AS > _c$i"): _*) > // Add conditions to take much time in Optimizer > val filter = (0 until 128).foldLeft(lit(false))((e, i) => > e.or(df.col(df.columns(i % numCols)) === (rand() * 10).cast("int"))) > val df2 = df.filter(filter).sort(df.columns(0)) > // Regular path > timer { > df2.filter(df2.col(df2.columns(0)) === lit(3)).collect > df2.filter(df2.col(df2.columns(0)) === lit(4)).collect > df2.filter(df2.col(df2.columns(0)) === lit(5)).collect > df2.filter(df2.col(df2.columns(0)) === lit(6)).collect > df2.filter(df2.col(df2.columns(0)) === lit(7)).collect > df2.filter(df2.col(df2.columns(0)) === lit(8)).collect > } > #0: 24.178487906 > #1: 22.619839888 > #2: 22.318617035 > #3: 22.131305502 > #4: 22.532095611 > #5: 22.245152778 > #6: 22.314114847 > #7: 22.284385952 > #8: 22.053593855 > Avg. Elapsed Time: 22.51973259712s > // Prepared path > val df3b = df2.filter(df2.col(df2.columns(0)) === param("val")).prepared > timer { > df3b.bindParam("val", 3).collect > df3b.bindParam("val", 4).collect > df3b.bindParam("val", 5).collect > df3b.bindParam("val", 6).collect > df3b.bindParam("val", 7).collect > df3b.bindParam("val", 8).collect > } > #0: 0.744693912 > #1: 0.743187129 > #2: 0.74513 > #3: 0.721668718 > #4: 0.757573342 > #5: 0.763240883 > #6: 0.731287275 > #7: 0.728740601 > #8: 0.674275592 > Avg.
Elapsed Time: 0.734418606112s > {code} > I'm not sure this approach is acceptable, so I welcome any suggestions and > advice.
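Separately from the Spark prototype above, the prepare-once / bind-many idea it benchmarks can be illustrated in miniature: pay the expensive planning cost once, then substitute the parameter cheaply on each execution. The names `prepare` and `bind_param` below are hypothetical stand-ins, and the "plan" is just a dictionary index, not a Spark logical plan:

```python
# Illustrative sketch of prepared execution (names are hypothetical, not
# Spark APIs): `prepare` does the expensive one-time planning work, and the
# returned `bind_param` closure only substitutes the parameter value,
# mirroring how the prototype replaces ParameterHolder with a Literal.
def prepare(rows, column):
    # expensive "optimization" happens once: build an index on the column
    index = {}
    for row in rows:
        index.setdefault(row[column], []).append(row)
    def bind_param(value):
        # cheap per-execution binding: no re-planning, just a lookup
        return index.get(value, [])
    return bind_param

query = prepare([{"col1": 1, "col2": 2}, {"col1": 2, "col2": 3}], "col1")
# query(1) -> [{"col1": 1, "col2": 2}]
```

The benchmark in the issue shows the same shape of win: roughly 22.5s per run on the regular path versus 0.73s once the optimized plan is reused.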
[jira] [Assigned] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20498: Assignee: Apache Spark (was: Xin Ren) > RandomForestRegressionModel should expose getMaxDepth in PySpark > > > Key: SPARK-20498 > URL: https://issues.apache.org/jira/browse/SPARK-20498 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.1.0 >Reporter: Nick Lothian >Assignee: Apache Spark >Priority: Minor > > Currently it isn't clear how to get the max depth of a > RandomForestRegressionModel (e.g., after doing a grid search) > It is possible to call > {{regressor._java_obj.getMaxDepth()}} > but most other decision trees allow > {{regressor.getMaxDepth()}}
[jira] [Assigned] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20498: Assignee: Xin Ren (was: Apache Spark) > RandomForestRegressionModel should expose getMaxDepth in PySpark > > > Key: SPARK-20498 > URL: https://issues.apache.org/jira/browse/SPARK-20498 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.1.0 >Reporter: Nick Lothian >Assignee: Xin Ren >Priority: Minor > > Currently it isn't clear how to get the max depth of a > RandomForestRegressionModel (e.g., after doing a grid search) > It is possible to call > {{regressor._java_obj.getMaxDepth()}} > but most other decision trees allow > {{regressor.getMaxDepth()}}
[jira] [Commented] (SPARK-20891) Reduce duplicate code in typedaggregators.scala
[ https://issues.apache.org/jira/browse/SPARK-20891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026021#comment-16026021 ] Sean Owen commented on SPARK-20891: --- Can you refactor the code first, then make your change? I understand your point but unless it's a huge change, you probably want to make the change plus necessary other updates together. Yes, someone might make a conflicting change in the meantime and you'd have to be vigilant to keep your PR up to date until it's merged. > Reduce duplicate code in typedaggregators.scala > --- > > Key: SPARK-20891 > URL: https://issues.apache.org/jira/browse/SPARK-20891 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ruben Janssen > > With SPARK-20411, a significant amount of functions will be added to > typedaggregators.scala, resulting in a large amount of duplicate code
[jira] [Commented] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026022#comment-16026022 ] Apache Spark commented on SPARK-20498: -- User 'facaiy' has created a pull request for this issue: https://github.com/apache/spark/pull/18120 > RandomForestRegressionModel should expose getMaxDepth in PySpark > > > Key: SPARK-20498 > URL: https://issues.apache.org/jira/browse/SPARK-20498 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.1.0 >Reporter: Nick Lothian >Assignee: Xin Ren >Priority: Minor > > Currently it isn't clear how to get the max depth of a > RandomForestRegressionModel (e.g., after doing a grid search) > It is possible to call > {{regressor._java_obj.getMaxDepth()}} > but most other decision trees allow > {{regressor.getMaxDepth()}}
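The delegation being requested in SPARK-20498 is thin: forward getMaxDepth() from the Python-side model to the underlying JVM object instead of making users reach through `_java_obj`. A hedged Python sketch with a fake stand-in for the py4j proxy (the real wrapper lives in pyspark.ml; the class names here are illustrative):

```python
# Hedged sketch of the requested change: a Python wrapper that exposes
# getMaxDepth() directly, delegating to the underlying JVM model object.
# FakeJavaModel stands in for the py4j proxy a real PySpark model holds.
class FakeJavaModel:
    def getMaxDepth(self):
        return 5  # a real model would return its trained maxDepth param

class RandomForestRegressionModelWrapper:
    def __init__(self, java_obj):
        self._java_obj = java_obj  # PySpark models keep the JVM object here

    def getMaxDepth(self):
        # expose the getter directly, instead of forcing users to call
        # model._java_obj.getMaxDepth() through a private attribute
        return self._java_obj.getMaxDepth()

model = RandomForestRegressionModelWrapper(FakeJavaModel())
# model.getMaxDepth() -> 5
```

This mirrors how other PySpark tree models already surface their params, which is why the issue calls the current asymmetry surprising.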
[jira] [Resolved] (SPARK-19581) running NaiveBayes model with 0 features can crash the executor with D rorreGEMV
[ https://issues.apache.org/jira/browse/SPARK-19581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19581. --- Resolution: Cannot Reproduce > running NaiveBayes model with 0 features can crash the executor with D > rorreGEMV > > > Key: SPARK-19581 > URL: https://issues.apache.org/jira/browse/SPARK-19581 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 > Environment: spark development or standalone mode on windows or linux. >Reporter: Barry Becker >Priority: Minor > > The severity of this bug is high (because nothing should cause spark to crash > like this) but the priority may be low (because there is an easy workaround). > In our application, a user can select features and a target to run the > NaiveBayes inducer. If columns have too many values or all one value, they > will be removed before we call the inducer to create the model. As a result, > there are some cases where all the features may get removed. When this > happens, executors will crash and get restarted (if on a cluster) or spark > will crash and need to be manually restarted (if in development mode). > It looks like NaiveBayes uses BLAS, and BLAS does not handle this case well > when it is encountered. It emits this vague error: > ** On entry to DGEMV parameter number 6 had an illegal value > and terminates. > My code looks like this: > {code} >val predictions = model.transform(testData) // Make predictions > // figure out how many were correctly predicted > val numCorrect = predictions.filter(new Column(actualTarget) === new > Column(PREDICTION_LABEL_COLUMN)).count() > val numIncorrect = testRowCount - numCorrect > {code} > The failure is at the line that does the count, but it is not the count that > causes the problem, it is the model.transform step (where the model contains > the NaiveBayes classifier).
> Here is the stack trace (in development mode): > {code} > [2017-02-13 06:28:39,946] TRACE evidence.EvidenceVizModel$ [] > [akka://JobServer/user/context-supervisor/sql-context] - done making > predictions in 232 > ** On entry to DGEMV parameter number 6 had an illegal value > ** On entry to DGEMV parameter number 6 had an illegal value > ** On entry to DGEMV parameter number 6 had an illegal value > [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] > [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has > already stopped! Dropping event SparkListenerSQLExecutionEnd(9,1486996120505) > [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] > [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has > already stopped! Dropping event > SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@1f6c4a29) > [2017-02-13 06:28:40,508] ERROR .scheduler.LiveListenerBus [] > [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has > already stopped! 
Dropping event > SparkListenerJobEnd(12,1486996120507,JobFailed(org.apache.spark.SparkException: > Job 12 cancelled because SparkContext was shut down)) > [2017-02-13 06:28:40,509] ERROR .jobserver.JobManagerActor [] > [akka://JobServer/user/context-supervisor/sql-context] - Got Throwable > org.apache.spark.SparkException: Job 12 cancelled because SparkContext was > shut down > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:808) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:806) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:78) > at > org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:806) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1668) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83) > at > org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1587) > at > org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1826) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1825) > at > org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:581) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) >
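The easy workaround the reporter mentions (prune problematic columns, then refuse to fit when nothing is left) can be sketched outside Spark. The helper names and the exact pruning rule below are assumptions for illustration, not Spark API:

```python
def prune_columns(columns, distinct_counts, max_values):
    """Drop constant columns and columns with too many distinct values
    (the preprocessing the reporter describes; the rule here is assumed)."""
    return [c for c in columns if 1 < distinct_counts.get(c, 0) <= max_values]

def validate_feature_columns(feature_cols):
    """Fail fast instead of letting a 0-feature model reach BLAS/DGEMV."""
    if not feature_cols:
        raise ValueError("no usable feature columns; refusing to fit on 0 features")
    return feature_cols
```

Checking the feature list before calling the inducer turns an executor crash into an ordinary, catchable error.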
[jira] [Resolved] (SPARK-20767) The training continuation for saved LDA model
[ https://issues.apache.org/jira/browse/SPARK-20767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20767. --- Resolution: Duplicate Let's roll this proposal into the existing issue > The training continuation for saved LDA model > - > > Key: SPARK-20767 > URL: https://issues.apache.org/jira/browse/SPARK-20767 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.1 >Reporter: Cezary Dendek >Priority: Minor > > Current online implementation of the LDA model fit (OnlineLDAOptimizer) does > not support the model update (ie. to account for the population/covariates > drift) nor the continuation of model fitting in case of the insufficient > number of iterations. > Technical aspects: > 1. The implementation of LDA fitting does not currently allow the > coefficients pre-setting (private setter), as noted by a comment in the > source code of OnlineLDAOptimizer.setLambda: "This is only used for testing > now. In the future, it can help support training stop/resume". > 2. The lambda matrix is always randomly initialized by the optimizer, which > needs fixing for preset lambda matrix. > The adaptation of the classes by the user is not possible due to protected > setters & sealed / final classes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
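The warm-start idea behind the request (pre-set the learned state instead of always random-initializing, as `OnlineLDAOptimizer.setLambda` would allow if it were public) can be sketched with a toy online optimizer. The class and its state layout are hypothetical, not Spark's LDA internals:

```python
import random

class OnlineOptimizer:
    """Toy online optimizer sketching a public warm-start hook.

    In Spark, OnlineLDAOptimizer always randomly initializes lambda and
    its setter is private; `init_state` here plays the role of a preset.
    """
    def __init__(self, dim, init_state=None, seed=0):
        rng = random.Random(seed)
        # Resume from a saved model's state when provided, else random init.
        self.state = list(init_state) if init_state is not None else [
            rng.random() for _ in range(dim)]

    def step(self, grad, lr=0.1):
        # One online update; repeated calls continue a previous fit.
        self.state = [s - lr * g for s, g in zip(self.state, grad)]
        return self.state
```

With such a hook, "continue training" is just constructing the optimizer from the saved model's coefficients and calling `step` again.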
[jira] [Commented] (SPARK-20895) Support fast execution based on an optimized plan and parameter placeholders
[ https://issues.apache.org/jira/browse/SPARK-20895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025976#comment-16025976 ] Takeshi Yamamuro commented on SPARK-20895: -- At least, I feel we could get performance improvements from this when the optimizer takes much time, as described in the description. > Support fast execution based on an optimized plan and parameter placeholders > > > Key: SPARK-20895 > URL: https://issues.apache.org/jira/browse/SPARK-20895 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > In database scenarios, users sometimes use parameterized queries for repeated > execution (e.g., by using prepared statements). > So, I think this functionality is also useful for Spark users. > What I suggest looks like the following; my prototype is here: > https://github.com/apache/spark/compare/master...maropu:PreparedStmt2 > {code} > scala> Seq((1, 2), (2, 3)).toDF("col1", "col2").createOrReplaceTempView("t") > // Define a query with a parameter placeholder named `val` > scala> val df = sql("SELECT * FROM t WHERE col1 = $val") > scala> df.explain > == Physical Plan == > *Project [_1#13 AS col1#16, _2#14 AS col2#17] > +- *Filter (_1#13 = cast(parameterholder(val) as int)) >+- LocalTableScan [_1#13, _2#14] > // Apply optimizer rules and get an optimized logical plan with the parameter > placeholder > scala> val preparedDf = df.prepared > // Bind an actual value and do execution > scala> preparedDf.bindParam("val", 1).show() > +----+----+ > |col1|col2| > +----+----+ > |   1|   2| > +----+----+ > {code} > To implement this, my prototype adds a new expression leaf node named > `ParameterHolder`. > In the binding phase, this node is replaced with a `Literal` holding the actual > value via `bindParam`. > Currently, Spark sometimes spends much time rewriting logical plans in > `Optimizer` (e.g. constant propagation described in SPARK-19846). 
> So, I feel this approach is also helpful in that case: > {code} > def timer[R](f: => Unit): Unit = { > val count = 9 > val iters = (0 until count).map { i => > val t0 = System.nanoTime() > f > val t1 = System.nanoTime() > val elapsed = t1 - t0 + 0.0 > println(s"#$i: ${elapsed / 1000000000.0}") > elapsed > } > println("Avg. Elapsed Time: " + ((iters.sum / count) / 1000000000.0) + "s") > } > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val numCols = 50 > val df = spark.range(100).selectExpr((0 until numCols).map(i => s"id AS > _c$i"): _*) > // Add conditions to take much time in Optimizer > val filter = (0 until 128).foldLeft(lit(false))((e, i) => > e.or(df.col(df.columns(i % numCols)) === (rand() * 10).cast("int"))) > val df2 = df.filter(filter).sort(df.columns(0)) > // Regular path > timer { > df2.filter(df2.col(df2.columns(0)) === lit(3)).collect > df2.filter(df2.col(df2.columns(0)) === lit(4)).collect > df2.filter(df2.col(df2.columns(0)) === lit(5)).collect > df2.filter(df2.col(df2.columns(0)) === lit(6)).collect > df2.filter(df2.col(df2.columns(0)) === lit(7)).collect > df2.filter(df2.col(df2.columns(0)) === lit(8)).collect > } > #0: 24.178487906 > #1: 22.619839888 > #2: 22.318617035 > #3: 22.131305502 > #4: 22.532095611 > #5: 22.245152778 > #6: 22.314114847 > #7: 22.284385952 > #8: 22.053593855 > Avg. Elapsed Time: 22.51973259712s > // Prepared path > val df3b = df2.filter(df2.col(df2.columns(0)) === param("val")).prepared > timer { > df3b.bindParam("val", 3).collect > df3b.bindParam("val", 4).collect > df3b.bindParam("val", 5).collect > df3b.bindParam("val", 6).collect > df3b.bindParam("val", 7).collect > df3b.bindParam("val", 8).collect > } > #0: 0.744693912 > #1: 0.743187129 > #2: 0.74513 > #3: 0.721668718 > #4: 0.757573342 > #5: 0.763240883 > #6: 0.731287275 > #7: 0.728740601 > #8: 0.674275592 > Avg. Elapsed Time: 0.734418606112s > {code} > I'm not sure this approach is acceptable, so any suggestions and advice are > welcome. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
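The prepare-once/bind-many pattern behind the proposal can be illustrated without Spark. `PreparedQuery`, `optimize_calls`, and `bind_param` below are illustrative names modeled on the prototype's `prepared`/`bindParam`; this is plain Python over dict rows, not Spark API:

```python
class PreparedQuery:
    """Toy prepare-once/bind-many query over a list of dict rows.

    `optimize_calls` counts the expensive 'optimization' work; binding a
    parameter only substitutes the literal, it never re-optimizes.
    """
    optimize_calls = 0

    def __init__(self, rows, column):
        self.rows = rows
        self.column = column
        self.plan = self._optimize()  # expensive step, done once at prepare time

    def _optimize(self):
        # Stand-in for the Optimizer pass whose cost the benchmark measures.
        PreparedQuery.optimize_calls += 1
        col = self.column
        return lambda value: [r for r in self.rows if r[col] == value]

    def bind_param(self, value):
        # Cheap: reuse the prepared plan with a new literal.
        return self.plan(value)
```

The speedup in the benchmark comes entirely from amortizing that one expensive step over many executions.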
[jira] [Commented] (SPARK-19293) Spark 2.1.x unstable with spark.speculation=true
[ https://issues.apache.org/jira/browse/SPARK-19293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025974#comment-16025974 ] Damian Momot commented on SPARK-19293: -- Some additional errors for speculative tasks {code} java.lang.reflect.UndeclaredThrowableException at com.sun.proxy.$Proxy18.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2102) at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1214) at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1210) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1210) at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386) at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162) at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:399) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:355) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:168) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:159) ... 28 more {code} {code} java.lang.IllegalArgumentException: requirement failed: cannot write to a closed ByteBufferOutputStream at scala.Predef$.require(Predef.scala:224) at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:40) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786) at java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1580) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:351) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100) at org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:253) at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:512) at 
org.apache.spark.executor.CoarseGrainedExecutorBackend.statusUpdate(CoarseGrainedExecutorBackend.scala:142) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} > Spark 2.1.x unstable with spark.speculation=true > > > Key: SPARK-19293 > URL: https://issues.apache.org/jira/browse/SPARK-19293 > Project: Spark >
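For context, speculation launches duplicate copies of a slow task and keeps the first result; the stack traces above come from the losing copies being interrupted, and the bug is that those benign failures destabilize the job. A minimal sketch of the idea with plain Python threads (not Spark's scheduler):

```python
import concurrent.futures as cf

def speculative_run(task, n_copies=2):
    # Launch duplicate copies of the same task; keep the first result.
    with cf.ThreadPoolExecutor(max_workers=n_copies) as pool:
        futures = [pool.submit(task, i) for i in range(n_copies)]
        done, pending = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        for f in pending:
            f.cancel()  # best-effort: already-running copies keep going
        return next(iter(done)).result()
```

Errors raised by the cancelled/interrupted losers should be swallowed, not propagated as job failures.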
[jira] [Commented] (SPARK-20895) Support fast execution based on an optimized plan and parameter placeholders
[ https://issues.apache.org/jira/browse/SPARK-20895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025970#comment-16025970 ] Reynold Xin commented on SPARK-20895: - What's the benefit? > Support fast execution based on an optimized plan and parameter placeholders > > > Key: SPARK-20895 > URL: https://issues.apache.org/jira/browse/SPARK-20895 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Takeshi Yamamuro >Priority: Minor
[jira] [Commented] (SPARK-20895) Support fast execution based on an optimized plan and parameter placeholders
[ https://issues.apache.org/jira/browse/SPARK-20895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025966#comment-16025966 ] Takeshi Yamamuro commented on SPARK-20895: -- cc: [~rxin][~cloud_fan][~smilegator] > Support fast execution based on an optimized plan and parameter placeholders > > > Key: SPARK-20895 > URL: https://issues.apache.org/jira/browse/SPARK-20895 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Takeshi Yamamuro >Priority: Minor
[jira] [Created] (SPARK-20895) Support fast execution based on an optimized plan and parameter placeholders
Takeshi Yamamuro created SPARK-20895: Summary: Support fast execution based on an optimized plan and parameter placeholders Key: SPARK-20895 URL: https://issues.apache.org/jira/browse/SPARK-20895 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.1 Reporter: Takeshi Yamamuro Priority: Minor In database scenarios, users sometimes use parameterized queries for repeated execution (e.g., by using prepared statements). So, I think this functionality is also useful for Spark users. What I suggest looks like the following; my prototype is here: https://github.com/apache/spark/compare/master...maropu:PreparedStmt2 {code} scala> Seq((1, 2), (2, 3)).toDF("col1", "col2").createOrReplaceTempView("t") // Define a query with a parameter placeholder named `val` scala> val df = sql("SELECT * FROM t WHERE col1 = $val") scala> df.explain == Physical Plan == *Project [_1#13 AS col1#16, _2#14 AS col2#17] +- *Filter (_1#13 = cast(parameterholder(val) as int)) +- LocalTableScan [_1#13, _2#14] // Apply optimizer rules and get an optimized logical plan with the parameter placeholder scala> val preparedDf = df.prepared // Bind an actual value and do execution scala> preparedDf.bindParam("val", 1).show() +----+----+ |col1|col2| +----+----+ |   1|   2| +----+----+ {code} To implement this, my prototype adds a new expression leaf node named `ParameterHolder`. In the binding phase, this node is replaced with a `Literal` holding the actual value via `bindParam`. Currently, Spark sometimes spends much time rewriting logical plans in `Optimizer` (e.g. constant propagation described in SPARK-19846). So, I feel this approach is also helpful in that case: {code} def timer[R](f: => Unit): Unit = { val count = 9 val iters = (0 until count).map { i => val t0 = System.nanoTime() f val t1 = System.nanoTime() val elapsed = t1 - t0 + 0.0 println(s"#$i: ${elapsed / 1000000000.0}") elapsed } println("Avg. 
Elapsed Time: " + ((iters.sum / count) / 1000000000.0) + "s") } import org.apache.spark.sql.Row import org.apache.spark.sql.types._ val numCols = 50 val df = spark.range(100).selectExpr((0 until numCols).map(i => s"id AS _c$i"): _*) // Add conditions to take much time in Optimizer val filter = (0 until 128).foldLeft(lit(false))((e, i) => e.or(df.col(df.columns(i % numCols)) === (rand() * 10).cast("int"))) val df2 = df.filter(filter).sort(df.columns(0)) // Regular path timer { df2.filter(df2.col(df2.columns(0)) === lit(3)).collect df2.filter(df2.col(df2.columns(0)) === lit(4)).collect df2.filter(df2.col(df2.columns(0)) === lit(5)).collect df2.filter(df2.col(df2.columns(0)) === lit(6)).collect df2.filter(df2.col(df2.columns(0)) === lit(7)).collect df2.filter(df2.col(df2.columns(0)) === lit(8)).collect } #0: 24.178487906 #1: 22.619839888 #2: 22.318617035 #3: 22.131305502 #4: 22.532095611 #5: 22.245152778 #6: 22.314114847 #7: 22.284385952 #8: 22.053593855 Avg. Elapsed Time: 22.51973259712s // Prepared path val df3b = df2.filter(df2.col(df2.columns(0)) === param("val")).prepared timer { df3b.bindParam("val", 3).collect df3b.bindParam("val", 4).collect df3b.bindParam("val", 5).collect df3b.bindParam("val", 6).collect df3b.bindParam("val", 7).collect df3b.bindParam("val", 8).collect } #0: 0.744693912 #1: 0.743187129 #2: 0.74513 #3: 0.721668718 #4: 0.757573342 #5: 0.763240883 #6: 0.731287275 #7: 0.728740601 #8: 0.674275592 Avg. Elapsed Time: 0.734418606112s {code} I'm not sure this approach is acceptable, so any suggestions and advice are welcome. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20775) from_json should also have an API where the schema is specified with a string
[ https://issues.apache.org/jira/browse/SPARK-20775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025962#comment-16025962 ] Ruben Janssen commented on SPARK-20775: --- Hi, it is RubenJanssen :) > from_json should also have an API where the schema is specified with a string > - > > Key: SPARK-20775 > URL: https://issues.apache.org/jira/browse/SPARK-20775 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Burak Yavuz > Fix For: 2.3.0 > > > Right now you also have to provide a java.util.Map which is not nice for > Scala users. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
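The convenience being requested is passing a DDL-style schema string such as "a INT, b STRING" instead of building a java.util.Map. A toy parser sketches the shape of such an API; the grammar here is deliberately simplified and the helper name is hypothetical (Spark's real parser also handles nested and parameterized types):

```python
def parse_ddl_schema(ddl):
    """Parse a simplified 'name TYPE, name TYPE' schema string into a dict.

    A toy illustration of the requested string-based API, not Spark code.
    """
    fields = {}
    for part in ddl.split(","):
        name, typ = part.split()
        fields[name.strip()] = typ.strip().upper()
    return fields
```

A one-line string is far friendlier from Scala or Python than constructing a Java map by hand.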
[jira] [Updated] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Ren updated SPARK-20498: Sure, please go ahead. > RandomForestRegressionModel should expose getMaxDepth in PySpark > > > Key: SPARK-20498 > URL: https://issues.apache.org/jira/browse/SPARK-20498 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.1.0 >Reporter: Nick Lothian >Assignee: Xin Ren >Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025949#comment-16025949 ] Yan Facai (颜发才) commented on SPARK-20498: - [~iamshrek] Hi, Xin Ren. As the task is quite easy, if you are a little busy, I'd be glad to work on it. Is that OK? > RandomForestRegressionModel should expose getMaxDepth in PySpark > > > Key: SPARK-20498 > URL: https://issues.apache.org/jira/browse/SPARK-20498 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.1.0 >Reporter: Nick Lothian >Assignee: Xin Ren >Priority: Minor > > Currently it isn't clear how to get the max depth of a > RandomForestRegressionModel (e.g., after doing a grid search) > It is possible to call > {{regressor._java_obj.getMaxDepth()}} > but most other decision trees allow > {{regressor.getMaxDepth()}} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025932#comment-16025932 ] pralabhkumar commented on SPARK-20199: -- Have created a pull request. Basically: 1) Moved featureSubsetStrategy to TreeEnsembleParams instead of having it on RandomForestParams, so that it can be used for both Random Forest and GBT. 2) Changed DecisionTreeRegressor's private train method to pass featureSubsetStrategy. 3) To test, changed GradientBoostedTreeClassifierExample with: val gbt = new GBTClassifier() .setLabelCol("indexedLabel") .setFeaturesCol("indexedFeatures") .setMaxIter(10) .setFeatureSubsetStrategy("auto") > GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar > > Spark GradientBoostedTreesModel doesn't have a column sampling rate parameter. > This parameter is available in H2O and XGBoost. > Sample from H2O.ai: > gbmParams._col_sample_rate > Please provide the parameter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
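For readers unfamiliar with the param, featureSubsetStrategy controls how many features each tree in the ensemble considers. A sketch of the sizing logic; the exact strategy-to-size mapping below is an assumption for illustration (Spark defines its own set of supported strategy strings):

```python
import math
import random

def subset_size(strategy, num_features):
    # Sizing rules modeled loosely on common ensemble strategies;
    # the exact mapping here is an assumption, not Spark's.
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return max(1, int(math.sqrt(num_features)))
    if strategy == "onethird":
        return max(1, num_features // 3)
    raise ValueError(f"unknown strategy: {strategy}")

def sample_features(strategy, features, rng):
    # Each tree trains on its own random subset of this size.
    return rng.sample(features, subset_size(strategy, len(features)))
```

Exposing this on TreeEnsembleParams lets GBT share the same column-subsampling knob that Random Forest already has.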
[jira] [Commented] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit
[ https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025927#comment-16025927 ] Apache Spark commented on SPARK-19372: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/18119 > Code generation for Filter predicate including many OR conditions exceeds JVM > method size limit > > > Key: SPARK-19372 > URL: https://issues.apache.org/jira/browse/SPARK-19372 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Jay Pranavamurthi >Assignee: Kazuaki Ishizaki > Fix For: 2.3.0 > > Attachments: wide400cols.csv > > > For the attached csv file, the code below causes the exception > "org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" > grows beyond 64 KB > Code: > {code:borderStyle=solid} > val conf = new SparkConf().setMaster("local[1]") > val sqlContext = > SparkSession.builder().config(conf).getOrCreate().sqlContext > val dataframe = > sqlContext > .read > .format("com.databricks.spark.csv") > .load("wide400cols.csv") > val filter = (0 to 399) > .foldLeft(lit(false))((e, index) => > e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}")) > val filtered = dataframe.filter(filter) > filtered.show(100) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
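A common fix for this class of problem is to split one huge generated method into smaller helpers so no single method exceeds the JVM's 64KB bytecode limit. The same chunking idea in plain Python (a structural sketch only; the actual PR may differ, and this is not Spark's codegen):

```python
def build_predicate(conditions, chunk_size=50):
    """OR together many row predicates in fixed-size chunks.

    Each chunk plays the role of a small generated helper method;
    the top-level predicate just ORs the chunk results.
    """
    chunks = [conditions[i:i + chunk_size]
              for i in range(0, len(conditions), chunk_size)]
    # Bind each chunk in its own closure (its own "method").
    preds = [(lambda chunk: lambda row: any(c(row) for c in chunk))(ch)
             for ch in chunks]
    return lambda row: any(p(row) for p in preds)
```

With 400 conditions and chunk_size=50, no single unit holds more than 50 comparisons, mirroring how splitting generated code keeps each JVM method small.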
[jira] [Updated] (SPARK-20199) GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-20199: - Summary: GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter (was: GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter) > GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar > > Spark GradientBoostedTreesModel doesn't have Column sampling rate parameter > . This parameter is available in H2O and XGBoost. > Sample from H2O.ai > gbmParams._col_sample_rate > Please provide the parameter . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025923#comment-16025923 ]

Apache Spark commented on SPARK-20199:
--------------------------------------

User 'pralabhkumar' has created a pull request for this issue:
https://github.com/apache/spark/pull/18118

> GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
> ---------------------------------------------------------------------
>
>                 Key: SPARK-20199
>                 URL: https://issues.apache.org/jira/browse/SPARK-20199
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 2.1.0
>            Reporter: pralabhkumar
>
> Spark's GradientBoostedTreesModel doesn't have a column sampling rate
> parameter. This parameter is available in H2O and XGBoost (e.g. H2O's
> gbmParams._col_sample_rate). Please provide the parameter.
[jira] [Assigned] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20199:
------------------------------------

    Assignee: Apache Spark

> GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
> ---------------------------------------------------------------------
>
>                 Key: SPARK-20199
>                 URL: https://issues.apache.org/jira/browse/SPARK-20199
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 2.1.0
>            Reporter: pralabhkumar
>            Assignee: Apache Spark
>
> Spark's GradientBoostedTreesModel doesn't have a column sampling rate
> parameter. This parameter is available in H2O and XGBoost (e.g. H2O's
> gbmParams._col_sample_rate). Please provide the parameter.
[jira] [Assigned] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20199:
------------------------------------

    Assignee:     (was: Apache Spark)

> GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
> ---------------------------------------------------------------------
>
>                 Key: SPARK-20199
>                 URL: https://issues.apache.org/jira/browse/SPARK-20199
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 2.1.0
>            Reporter: pralabhkumar
>
> Spark's GradientBoostedTreesModel doesn't have a column sampling rate
> parameter. This parameter is available in H2O and XGBoost (e.g. H2O's
> gbmParams._col_sample_rate). Please provide the parameter.