[jira] [Updated] (SPARK-18243) Converge the insert path of Hive tables with data source tables
[ https://issues.apache.org/jira/browse/SPARK-18243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-18243:
----------------------------
    Assignee: Wenchen Fan

> Converge the insert path of Hive tables with data source tables
> ---------------------------------------------------------------
>
>                 Key: SPARK-18243
>                 URL: https://issues.apache.org/jira/browse/SPARK-18243
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Wenchen Fan
>             Fix For: 2.2.0
>
> Inserting data into Hive tables has its own implementation that is distinct
> from data sources: InsertIntoHiveTable, SparkHiveWriterContainer and
> SparkHiveDynamicPartitionWriterContainer.
> I think it should be possible to unify these with the data source
> implementation InsertIntoHadoopFsRelationCommand. We can start by
> implementing an OutputWriterFactory/OutputWriter that uses Hive's serdes to
> write data.
> Note that one other major difference is that data source tables write
> directly to the final destination without using a staging directory, and
> then Spark itself adds the partitions/tables to the catalog. Hive tables
> actually write to a staging directory, and then call the Hive metastore's
> loadPartition/loadTable functions to load that data in.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
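The factory-produces-writers shape the issue proposes can be sketched in miniature. Apart from the names `OutputWriterFactory`/`OutputWriter`, which the issue itself mentions, everything below is hypothetical plain Python, not Spark's actual API: it only illustrates why one insert path can serve both Hive-serde and data source tables once writing goes through a common factory interface.

```python
from abc import ABC, abstractmethod

class OutputWriter(ABC):
    """One writer per task; knows how to serialize rows."""
    @abstractmethod
    def write(self, row): ...
    @abstractmethod
    def close(self): ...

class OutputWriterFactory(ABC):
    """Created on the driver, shipped to executors to open writers."""
    @abstractmethod
    def new_writer(self, path): ...

class HiveSerdeWriter(OutputWriter):
    """Hypothetical serde-backed writer: serializes each row with a
    caller-supplied serde function and 'commits' on close."""
    def __init__(self, path, serde):
        self.path, self.serde, self.rows = path, serde, []
    def write(self, row):
        self.rows.append(self.serde(row))  # serialize with the serde
    def close(self):
        return (self.path, self.rows)      # stand-in for a file commit

class HiveSerdeWriterFactory(OutputWriterFactory):
    def __init__(self, serde):
        self.serde = serde
    def new_writer(self, path):
        return HiveSerdeWriter(path, self.serde)

def insert_into(rows, factory, path):
    """The insert command only talks to the factory, so Hive tables and
    data source tables can share one write path."""
    writer = factory.new_writer(path)
    for row in rows:
        writer.write(row)
    return writer.close()
```

Under this shape, the unified insert command would not need to know whether the factory it holds is Parquet-backed or serde-backed.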
[jira] [Resolved] (SPARK-18243) Converge the insert path of Hive tables with data source tables
[ https://issues.apache.org/jira/browse/SPARK-18243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-18243.
-----------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0

Issue resolved by pull request 16517
[https://github.com/apache/spark/pull/16517]
[jira] [Commented] (SPARK-19115) SparkSQL unsupports the command " create external table if not exist new_tbl like old_tbl location '/warehouse/new_tbl' "
[ https://issues.apache.org/jira/browse/SPARK-19115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827547#comment-15827547 ]

Xiao Li commented on SPARK-19115:
---------------------------------

Sorry for the late reply. This sounds reasonable. Does Hive support such a query?

> SparkSQL unsupports the command " create external table if not exist new_tbl
> like old_tbl location '/warehouse/new_tbl' "
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-19115
>                 URL: https://issues.apache.org/jira/browse/SPARK-19115
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.1
>         Environment: spark2.0.1 hive1.2.1
>            Reporter: Xiaochen Ouyang
>
> spark 2.0.1 does not support the command "create external table if not
> exists new_tbl like old_tbl location '/warehouse/new_tbl'".
> We tried to modify the SqlBase.g4 file, changing
>   | CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
>     LIKE source=tableIdentifier
>     #createTableLike
> to
>   | CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier
>     LIKE source=tableIdentifier locationSpec?
>     #createTableLike
> We also modified the method 'visitCreateTableLike' in 'SparkSqlParser.scala'
> and updated the case class CreateTableLikeCommand in 'tables.scala'.
> Finally we compiled Spark, replaced the jars 'spark-catalyst-2.0.1.jar' and
> 'spark-sql_2.11-2.0.1.jar', and ran the command 'create external table if
> not exists new_tbl like old_tbl location '/warehouse/new_tbl'' successfully.
[jira] [Reopened] (SPARK-19115) SparkSQL unsupports the command " create external table if not exist new_tbl like old_tbl location '/warehouse/new_tbl' "
[ https://issues.apache.org/jira/browse/SPARK-19115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaochen Ouyang reopened SPARK-19115:
-------------------------------------

spark 2.x does not support the SQL command:
create external table if not exists gen_tbl like src_tbl location '/warehouse/gen_tbl'
[jira] [Updated] (SPARK-19115) SparkSQL unsupports the command " create external table if not exist new_tbl like old_tbl location '/warehouse/new_tbl' "
[ https://issues.apache.org/jira/browse/SPARK-19115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaochen Ouyang updated SPARK-19115:
------------------------------------
    Description:

spark 2.0.1 does not support the command "create external table if not exists new_tbl like old_tbl location '/warehouse/new_tbl'".
We tried to modify the SqlBase.g4 file, changing
  | CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier LIKE source=tableIdentifier #createTableLike
to
  | CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier LIKE source=tableIdentifier locationSpec? #createTableLike
We also modified the method 'visitCreateTableLike' in 'SparkSqlParser.scala' and updated the case class CreateTableLikeCommand in 'tables.scala'. Finally we compiled Spark, replaced the jars 'spark-catalyst-2.0.1.jar' and 'spark-sql_2.11-2.0.1.jar', and ran the command 'create external table if not exists new_tbl like old_tbl location '/warehouse/new_tbl'' successfully.

    was:

spark 2.0.1 does not support the command "create external table if not exists new_tbl like old_tbl location '/warehouse/new_tbl'".
We tried to modify the SqlBase.g4 file, changing
  | CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier LIKE source=tableIdentifier #createTableLike
to
  | CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier LIKE source=tableIdentifier locationSpec? #createTableLike
and then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After that we found we could run the command "create external table if not exists new_tbl like old_tbl" successfully; unfortunately, the generated table's type was MANAGED_TABLE rather than EXTERNAL_TABLE in the metastore database.
[jira] [Updated] (SPARK-19115) SparkSQL unsupports the command " create external table if not exist new_tbl like old_tbl location '/warehouse/new_tbl' "
[ https://issues.apache.org/jira/browse/SPARK-19115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaochen Ouyang updated SPARK-19115:
------------------------------------
    Description:

spark 2.0.1 does not support the command "create external table if not exists new_tbl like old_tbl location '/warehouse/new_tbl'".
We tried to modify the SqlBase.g4 file, changing
  | CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier LIKE source=tableIdentifier #createTableLike
to
  | CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier LIKE source=tableIdentifier locationSpec? #createTableLike
and then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After that we found we could run the command "create external table if not exists new_tbl like old_tbl" successfully; unfortunately, the generated table's type was MANAGED_TABLE rather than EXTERNAL_TABLE in the metastore database.

    was:

spark 2.0.1 does not support the command "create external table if not exists new_tbl like old_tbl location '/warehouse/new_tbl'".
We tried to modify the SqlBase.g4 file, changing
  | CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier LIKE source=tableIdentifier #createTableLike
to
  | CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier LIKE source=tableIdentifier #createTableLike
and then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After that we found we could run the command "create external table if not exists new_tbl like old_tbl" successfully; unfortunately, the generated table's type was MANAGED_TABLE rather than EXTERNAL_TABLE in the metastore database.
[jira] [Updated] (SPARK-19115) SparkSQL unsupports the command " create external table if not exist new_tbl like old_tbl location '/warehouse/new_tbl' "
[ https://issues.apache.org/jira/browse/SPARK-19115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaochen Ouyang updated SPARK-19115:
------------------------------------
    Description:

spark 2.0.1 does not support the command "create external table if not exists new_tbl like old_tbl".
We tried to modify the SqlBase.g4 file, changing
  | CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier LIKE source=tableIdentifier #createTableLike
to
  | CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier LIKE source=tableIdentifier #createTableLike
and then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After that we found we could run the command "create external table if not exists new_tbl like old_tbl" successfully; unfortunately, the generated table's type was MANAGED_TABLE rather than EXTERNAL_TABLE in the metastore database.

    was:

spark 2.0.1 did not support the command "create external table if not exists new_tbl like old_tbl".
We tried to modify the SqlBase.g4 file, changing
  | CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier LIKE source=tableIdentifier #createTableLike
to
  | CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier LIKE source=tableIdentifier #createTableLike
and then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After that we found we could run the command "create external table if not exists new_tbl like old_tbl" successfully; unfortunately, the generated table's type was MANAGED_TABLE rather than EXTERNAL_TABLE in the metastore database.
[jira] [Updated] (SPARK-19115) SparkSQL unsupports the command " create external table if not exist new_tbl like old_tbl location '/warehouse/new_tbl' "
[ https://issues.apache.org/jira/browse/SPARK-19115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaochen Ouyang updated SPARK-19115:
------------------------------------
    Description:

spark 2.0.1 does not support the command "create external table if not exists new_tbl like old_tbl location '/warehouse/new_tbl'".
We tried to modify the SqlBase.g4 file, changing
  | CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier LIKE source=tableIdentifier #createTableLike
to
  | CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier LIKE source=tableIdentifier #createTableLike
and then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After that we found we could run the command "create external table if not exists new_tbl like old_tbl" successfully; unfortunately, the generated table's type was MANAGED_TABLE rather than EXTERNAL_TABLE in the metastore database.

    was:

spark 2.0.1 does not support the command "create external table if not exists new_tbl like old_tbl".
We tried to modify the SqlBase.g4 file, changing
  | CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier LIKE source=tableIdentifier #createTableLike
to
  | CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier LIKE source=tableIdentifier #createTableLike
and then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After that we found we could run the command "create external table if not exists new_tbl like old_tbl" successfully; unfortunately, the generated table's type was MANAGED_TABLE rather than EXTERNAL_TABLE in the metastore database.
[jira] [Updated] (SPARK-19115) SparkSQL unsupports the command " create external table if not exist new_tbl like old_tbl location '/warehouse/new_tbl' "
[ https://issues.apache.org/jira/browse/SPARK-19115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaochen Ouyang updated SPARK-19115:
------------------------------------
    Summary: SparkSQL unsupports the command " create external table if not exist new_tbl like old_tbl location '/warehouse/new_tbl' "  (was: SparkSQL unsupports the command " create external table if not exist new_tbl like old_tbl")
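What the grammar change in this thread buys can be illustrated end to end: the optional `EXTERNAL` keyword and the optional `locationSpec` have to surface as a flag and a path in the parsed command, and the reported MANAGED_TABLE symptom is exactly what happens when the flag is dropped. The sketch below is a toy regex-based parser in plain Python with hypothetical names, not Spark's ANTLR-generated parser.

```python
import re

# Toy pattern loosely mirroring the proposed rule:
#   CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target LIKE source locationSpec?
PATTERN = re.compile(
    r"create\s+(?P<external>external\s+)?table\s+"
    r"(?:if\s+not\s+exists\s+)?(?P<target>\w+)\s+like\s+(?P<source>\w+)"
    r"(?:\s+location\s+'(?P<location>[^']*)')?\s*$",
    re.IGNORECASE,
)

def parse_create_table_like(sql):
    m = PATTERN.match(sql.strip())
    if not m:
        raise ValueError("not a CREATE TABLE ... LIKE statement")
    # The EXTERNAL flag must flow into the resulting command; if it is
    # silently discarded, the new table is registered as MANAGED_TABLE
    # even though the user wrote EXTERNAL -- the symptom reported above.
    return {
        "target": m.group("target"),
        "source": m.group("source"),
        "external": m.group("external") is not None,
        "location": m.group("location"),
    }
```

In the real fix, the equivalent of the dict above would be a `CreateTableLikeCommand` that carries the external flag and location through to the catalog.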
[jira] [Assigned] (SPARK-19270) Add summary table to GLM summary
[ https://issues.apache.org/jira/browse/SPARK-19270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19270:
------------------------------------
    Assignee:     (was: Apache Spark)

> Add summary table to GLM summary
> --------------------------------
>
>                 Key: SPARK-19270
>                 URL: https://issues.apache.org/jira/browse/SPARK-19270
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Wayne Zhang
>            Priority: Minor
>
> Add an R-like summary table to the GLM summary, which includes feature name
> (if it exists), parameter estimate, standard error, t-stat and p-value. This
> allows Scala users to easily gather these commonly used inference results.
[jira] [Assigned] (SPARK-19270) Add summary table to GLM summary
[ https://issues.apache.org/jira/browse/SPARK-19270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19270:
------------------------------------
    Assignee: Apache Spark
[jira] [Commented] (SPARK-19270) Add summary table to GLM summary
[ https://issues.apache.org/jira/browse/SPARK-19270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827521#comment-15827521 ]

Apache Spark commented on SPARK-19270:
--------------------------------------

User 'actuaryzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/16630
[jira] [Commented] (SPARK-2868) Support named accumulators in Python
[ https://issues.apache.org/jira/browse/SPARK-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827519#comment-15827519 ]

Kyle Heath commented on SPARK-2868:
-----------------------------------

Short version: Is there anything I can do to help bring this feature to PySpark?

Long version: I've been implementing large jobs in PySpark for about 6 months. The ability to monitor named accumulators in the web UI seems really important. Running complex jobs at scale has been a bit like flying blind. I'm new to this community, but want to help if I can.

> Support named accumulators in Python
> ------------------------------------
>
>                 Key: SPARK-2868
>                 URL: https://issues.apache.org/jira/browse/SPARK-2868
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>            Reporter: Patrick Wendell
>
> SPARK-2380 added this for Java/Scala. To allow this in Python we'll need to
> make some additional changes. One potential path is to have a 1:1
> correspondence with Scala accumulators (instead of a one-to-many). A
> challenge is exposing the stringified values of the accumulators to the
> Scala code.
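The "stringified values" challenge in the issue description can be made concrete with a toy sketch. This is plain Python, not PySpark's actual `Accumulator` API; the `name` field and `stringified` helper are hypothetical, illustrating what a Python accumulator with a 1:1 Scala counterpart would need to expose for the web UI to display.

```python
class NamedAccumulator:
    """Toy named accumulator: the driver holds the running value, and the
    name plus a string form of the value are what a UI would display."""
    def __init__(self, name, zero=0):
        self.name = name
        self.value = zero

    def add(self, v):
        self.value += v

    def stringified(self):
        # The Scala/JVM side of the UI can only render strings, so the
        # Python side would hand back a string form of the current value.
        return f"{self.name}={self.value}"

# Driver-side usage: count records across batches and expose the total.
records_seen = NamedAccumulator("records_seen")
for batch in ([1, 2, 3], [4, 5]):
    records_seen.add(len(batch))
```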
[jira] [Commented] (SPARK-8480) Add setName for Dataframe
[ https://issues.apache.org/jira/browse/SPARK-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827515#comment-15827515 ]

Kaushal Prajapati commented on SPARK-8480:
------------------------------------------

For example, using SparkContext we can get all cached RDDs:

{code:title=Code|borderStyle=solid}
scala> val rdd = sc.range(1, 1000)
scala> rdd.setName("myRdd")
scala> rdd.cache
scala> rdd.count
scala> sc.getPersistentRDDs.foreach(println)
(9,kaushal MapPartitionsRDD[9] at rdd at <console>:27)
(11,myRdd MapPartitionsRDD[11] at range at <console>:27)

scala> sc.getPersistentRDDs.filter(_._2.name == "myRdd").foreach(_._2.unpersist())
scala> sc.getPersistentRDDs.foreach(println)
(9,kaushal MapPartitionsRDD[9] at rdd at <console>:27)
{code}

So we can unpersist any RDD with a valid name. Likewise for Dataset: if we are able to list all cached Datasets with their corresponding names, it would be a good option to unpersist any Dataset by a particular name.

> Add setName for Dataframe
> -------------------------
>
>                 Key: SPARK-8480
>                 URL: https://issues.apache.org/jira/browse/SPARK-8480
>             Project: Spark
>          Issue Type: Wish
>          Components: SQL
>    Affects Versions: 1.4.0
>            Reporter: Peter Rudenko
>            Priority: Minor
>
> Rdd has a method setName, so in the Spark UI it's easier to understand what
> a cache is for (e.g. "data for LogisticRegression model"). It would be nice
> to have the same method for Dataframe, since the cache page displays a
> logical schema, which could be quite big.
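The unpersist-by-name pattern in the Scala snippet above amounts to a registry you can filter by name. The sketch below is hypothetical plain Python, not Spark's API: `getPersistentRDDs` is the real Spark call the snippet uses, but `CacheRegistry` and `Cached` here only illustrate the lookup-by-name idea the commenter wants for Dataset.

```python
class Cached:
    """Stand-in for a cached RDD/Dataset with a user-assigned name."""
    def __init__(self, name):
        self.name = name
        self.persisted = True

    def unpersist(self):
        self.persisted = False

class CacheRegistry:
    """Maps an id to a cached object, like sc.getPersistentRDDs."""
    def __init__(self):
        self._by_id = {}
        self._next_id = 0

    def persist(self, obj):
        self._by_id[self._next_id] = obj
        self._next_id += 1

    def items(self):
        return dict(self._by_id)

    def unpersist_by_name(self, name):
        # Mirrors: getPersistentRDDs.filter(_._2.name == "myRdd")
        #            .foreach(_._2.unpersist())
        for cache_id, obj in list(self._by_id.items()):
            if obj.name == name:
                obj.unpersist()
                del self._by_id[cache_id]
```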
[jira] [Closed] (SPARK-19081) spark sql use HIVE UDF throw exception when return a Map value
[ https://issues.apache.org/jira/browse/SPARK-19081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro closed SPARK-19081.
------------------------------------
    Resolution: Resolved
 Fix Version/s: 1.5.0

> spark sql use HIVE UDF throw exception when return a Map value
> --------------------------------------------------------------
>
>                 Key: SPARK-19081
>                 URL: https://issues.apache.org/jira/browse/SPARK-19081
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Davy Song
>             Fix For: 1.5.0
>
> I have met a problem like https://issues.apache.org/jira/browse/SPARK-3582,
> but not with a Map parameter: my evaluate function returns a Map:
> public Map evaluate(Text url) {...}
> When running spark-sql with this UDF, I get the following exception:
> scala.MatchError: interface java.util.Map (of class java.lang.Class)
>         at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:175)
>         at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:112)
>         at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:144)
>         at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:144)
>         at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:133)
>         at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25)
>         at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>         at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>         at org.apache.spark.sql.catalyst.plans.logical.Project.output(basicOperators.scala:25)
>         at org.apache.spark.sql.catalyst.plans.logical.InsertIntoTable.resolved$lzycompute(basicOperators.scala:149)
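The scala.MatchError above arises because the type-mapping match in that Spark version had no case for java.util.Map. The mechanism can be sketched in plain Python (hypothetical names, not Spark's actual HiveInspectors code): a mapping from Java class names to SQL data types fails on an unhandled class until a Map case is added, which is essentially what later versions did.

```python
# Hypothetical sketch of a javaClassToDataType-style mapping. Each
# supported Java return type maps to a SQL data type; an unhandled class
# raises -- the analogue of scala.MatchError on an unmatched pattern.
SUPPORTED = {
    "java.lang.String": "StringType",
    "java.lang.Integer": "IntegerType",
    "java.util.List": "ArrayType",
}

def java_class_to_data_type(cls_name):
    try:
        return SUPPORTED[cls_name]
    except KeyError:
        raise ValueError(f"MatchError-like failure: {cls_name} unhandled")

# The fix amounts to adding a case for java.util.Map, so a UDF returning
# a Map gets a proper SQL type instead of crashing during analysis.
SUPPORTED["java.util.Map"] = "MapType"
```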
[jira] [Created] (SPARK-19270) Add summary table to GLM summary
Wayne Zhang created SPARK-19270:
-----------------------------------

             Summary: Add summary table to GLM summary
                 Key: SPARK-19270
                 URL: https://issues.apache.org/jira/browse/SPARK-19270
             Project: Spark
          Issue Type: Improvement
          Components: ML
            Reporter: Wayne Zhang
            Priority: Minor

Add an R-like summary table to the GLM summary, which includes feature name (if it exists), parameter estimate, standard error, t-stat and p-value. This allows Scala users to easily gather these commonly used inference results.
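The requested columns (estimate, standard error, t-stat, p-value) can be sketched for the simplest case, ordinary least squares with one feature, in plain Python. The formulas are standard OLS inference; the normal approximation to the t distribution's tail is an assumption made here to stay within the standard library (a real implementation would use the t distribution, as R's summary does).

```python
import math

def ols_summary_row(x, y):
    """Slope estimate, standard error, t-stat, and approximate p-value
    for simple linear regression y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx                       # parameter estimate
    a = my - b * mx                     # intercept
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)                  # residual variance
    se = math.sqrt(s2 / sxx)            # standard error of the slope
    t = b / se                          # t-statistic
    # Two-sided tail via the normal approximation (assumption; the exact
    # answer uses the t distribution with n-2 degrees of freedom).
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))
    return {"estimate": b, "std_error": se, "t_stat": t, "p_value": p}

row = ols_summary_row([1, 2, 3, 4], [2.1, 4.2, 5.9, 8.1])
```

For a GLM, the same four columns come from the iteratively reweighted least squares fit rather than these closed forms, but the table shape is identical.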
[jira] [Comment Edited] (SPARK-19081) spark sql use HIVE UDF throw exception when return a Map value
[ https://issues.apache.org/jira/browse/SPARK-19081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827510#comment-15827510 ]

Takeshi Yamamuro edited comment on SPARK-19081 at 1/18/17 6:54 AM:
-------------------------------------------------------------------

Since the issue the reporter describes has been handled in v1.5, I'll close this as resolved.

was (Author: maropu):
Since the issue the reporter describes has been handled in v1.6, I'll close this as resolved.
[jira] [Commented] (SPARK-19081) spark sql use HIVE UDF throw exception when return a Map value
[ https://issues.apache.org/jira/browse/SPARK-19081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827510#comment-15827510 ]

Takeshi Yamamuro commented on SPARK-19081:
------------------------------------------

Since the issue the reporter describes has been handled in v1.6, I'll close this as resolved.
[jira] [Commented] (SPARK-19081) spark sql use HIVE UDF throw exception when return a Map value
[ https://issues.apache.org/jira/browse/SPARK-19081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827506#comment-15827506 ] Takeshi Yamamuro commented on SPARK-19081:
--
congrats!
[jira] [Commented] (SPARK-19081) spark sql use HIVE UDF throw exception when return a Map value
[ https://issues.apache.org/jira/browse/SPARK-19081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827503#comment-15827503 ] Davy Song commented on SPARK-19081:
--
@Takeshi, thanks very much. I have rewritten the UDF to inherit from GenericUDF to solve this problem, and the script now runs well.
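For context, the MatchError in this thread arises when Spark maps the UDF method's Java return type to a Catalyst data type via reflection. The following self-contained sketch (class and method names are made up for illustration, not Spark's) shows why a simple `evaluate` returning `java.util.Map` trips the old Spark 1.3 code path, which pattern-matched on the returned `Class` and had no case for the `Map` interface:

```java
import java.lang.reflect.Method;
import java.util.Map;

public class ReturnTypeDemo {
    // Shaped like the reporter's UDF: a simple evaluate method returning a Map.
    static class UrlParamsUDF {
        public Map<String, String> evaluate(String url) {
            return java.util.Collections.singletonMap("url", url);
        }
    }

    public static void main(String[] args) throws Exception {
        Method m = UrlParamsUDF.class.getMethod("evaluate", String.class);
        // Erasure leaves only the raw interface; Spark 1.3's javaClassToDataType
        // pattern-matched over this Class and had no case for java.util.Map,
        // hence "scala.MatchError: interface java.util.Map (of class java.lang.Class)".
        System.out.println(m.getReturnType()); // prints: interface java.util.Map
    }
}
```

A GenericUDF sidesteps this because it declares its result type through an ObjectInspector rather than through Java reflection on the `evaluate` signature, which is consistent with the reporter's fix of rewriting the UDF to inherit from GenericUDF.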
[jira] [Created] (SPARK-19269) Scheduler Delay in spark ui
xdcjie created SPARK-19269:
--
Summary: Scheduler Delay in spark ui
Key: SPARK-19269
URL: https://issues.apache.org/jira/browse/SPARK-19269
Project: Spark
Issue Type: Question
Components: Web UI
Affects Versions: 2.0.2
Reporter: xdcjie

In the Spark UI, "Scheduler Delay" reads as launch time (in the driver) minus start time (in the executor), but in the code it actually combines that gap with the status-update time (the time for the executor to report status back to the driver):

{code}
private[ui] def getSchedulerDelay(
    info: TaskInfo,
    metrics: TaskMetricsUIData,
    currentTime: Long): Long = {
  if (info.finished) {
    val totalExecutionTime = info.finishTime - info.launchTime
    val executorOverhead = (metrics.executorDeserializeTime +
      metrics.resultSerializationTime)
    math.max(
      0,
      totalExecutionTime - metrics.executorRunTime - executorOverhead -
        getGettingResultTime(info, currentTime))
  } else {
    // The task is still running and the metrics like executorRunTime are not available.
    0L
  }
}
{code}

Could we split "scheduler delay" into "scheduler delay time" (launch time - start time) and "status update time"? In my workspace, scheduler delay is 60s, but launch time - start time is very fast, <1s.
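The subtraction the reporter quotes can be sketched outside Spark as plain arithmetic over millisecond durations. This is a hedged illustration of the formula only; the parameter names are illustrative and are not Spark's TaskInfo/metrics API:

```java
public class SchedulerDelayDemo {
    // Whatever wall-clock time is not accounted for by running the task,
    // (de)serialization, or fetching the result is reported as "scheduler delay".
    static long schedulerDelay(long launchTime, long finishTime, long executorRunTime,
                               long deserializeTime, long resultSerializationTime,
                               long gettingResultTime) {
        long totalExecutionTime = finishTime - launchTime;
        long executorOverhead = deserializeTime + resultSerializationTime;
        return Math.max(0,
                totalExecutionTime - executorRunTime - executorOverhead - gettingResultTime);
    }

    public static void main(String[] args) {
        // 100ms wall clock, 60ms running, 15ms overhead, 5ms fetching the result
        // leaves 20ms unaccounted for.
        System.out.println(schedulerDelay(0, 100, 60, 10, 5, 5)); // prints 20
    }
}
```

This makes the reporter's point concrete: the reported number is a residual, so it absorbs both the driver-to-executor launch gap and the executor-to-driver status-update time, and the two cannot be told apart in the UI.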
[jira] [Commented] (SPARK-18011) SparkR serialize "NA" throws exception
[ https://issues.apache.org/jira/browse/SPARK-18011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827478#comment-15827478 ] Felix Cheung commented on SPARK-18011:
--
very cool, thanks for all the investigation. What version of R are you running? Is this on Mac or Linux?

> SparkR serialize "NA" throws exception
> --
>
> Key: SPARK-18011
> URL: https://issues.apache.org/jira/browse/SPARK-18011
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Reporter: Miao Wang
>
> For some versions of R, if a Date column has an "NA" field, the backend throws a negative-index exception.
> To reproduce the problem:
> {code}
> > a <- as.Date(c("2016-11-11", "NA"))
> > b <- as.data.frame(a)
> > c <- createDataFrame(b)
> > dim(c)
> 16/10/19 10:31:24 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.NegativeArraySizeException
> at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110)
> at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119)
> at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:128)
> at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:77)
> at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:61)
> at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:161)
> at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:160)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.Range.foreach(Range.scala:160)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at org.apache.spark.sql.api.r.SQLUtils$.bytesToRow(SQLUtils.scala:160)
> at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
> at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
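The failure mode at the top of that stack trace is easy to reproduce outside Spark: a reader like `readStringBytes` sizes a buffer from a length prefix sent by the R side, and an NA value can arrive encoded as a negative length. A hedged sketch follows; the method here is a stand-in written for illustration, not Spark's actual SerDe code:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class NegativeLengthDemo {
    // Stand-in for a length-prefixed read like SerDe.readStringBytes:
    // allocate a buffer of the advertised size, then fill it.
    static byte[] readLengthPrefixed(DataInputStream in) throws IOException {
        int len = in.readInt();          // NA may arrive as a negative sentinel
        byte[] buf = new byte[len];      // throws NegativeArraySizeException if len < 0
        in.readFully(buf);
        return buf;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = ByteBuffer.allocate(4).putInt(-1).array(); // length prefix of -1
        try {
            readLengthPrefixed(new DataInputStream(new ByteArrayInputStream(payload)));
        } catch (NegativeArraySizeException e) {
            System.out.println("NegativeArraySizeException, as in the stack trace above");
        }
    }
}
```

A defensive fix on the JVM side would be to check the sign of the length prefix and map a negative value to a null/NA before allocating.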
[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing
[ https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827475#comment-15827475 ] Felix Cheung commented on SPARK-12347: -- [~ethanlu...@gmail.com] would you be willing and able to spend more time on this? I would be interested in working with you to shepherd this for 2.2. > Write script to run all MLlib examples for testing > -- > > Key: SPARK-12347 > URL: https://issues.apache.org/jira/browse/SPARK-12347 > Project: Spark > Issue Type: Test > Components: ML, MLlib, PySpark, SparkR, Tests >Reporter: Joseph K. Bradley >Priority: Critical > > It would facilitate testing to have a script which runs all MLlib examples > for all languages. > Design sketch to ensure all examples are run: > * Generate a list of examples to run programmatically (not from a fixed list). > * Use a list of special examples to handle examples which require command > line arguments. > * Make sure data, etc. used are small to keep the tests quick. > This could be broken into subtasks for each language, though it would be nice > to provide a single script. > Not sure where the script should live; perhaps in {{bin/}}? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18348) Improve tree ensemble model summary
[ https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827474#comment-15827474 ] Felix Cheung commented on SPARK-18348:
--
[~yanboliang] would you like to run with this? You had a few great points when we discussed this earlier.

> Improve tree ensemble model summary
> ---
>
> Key: SPARK-18348
> URL: https://issues.apache.org/jira/browse/SPARK-18348
> Project: Spark
> Issue Type: Improvement
> Components: ML, SparkR
> Affects Versions: 2.0.0, 2.1.0
> Reporter: Felix Cheung
>
> During work on R APIs for tree ensemble models (e.g. Random Forest, GBT) it was discovered and discussed that
> - we don't have a good summary on nodes or trees for their observations, loss, probability and so on
> - we don't have a shared API with nicely formatted output
> We believe this could be a shared API that benefits multiple language bindings, including R, when available.
> For example, here is what R {{rpart}} shows for a model summary:
> {code}
> Call:
> rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
>     method = "class")
>   n= 81
>
>           CP nsplit rel error    xerror      xstd
> 1 0.17647059      0     1.000     1.000 0.2155872
> 2 0.01960784      1 0.8235294 0.9411765 0.2107780
> 3 0.0100            4 0.7647059 1.0588235 0.2200975
>
> Variable importance
>  Start    Age Number
>     64     24     12
>
> Node number 1: 81 observations,    complexity param=0.1764706
>   predicted class=absent  expected loss=0.2098765  P(node) =1
>     class counts:    64    17
>    probabilities: 0.790 0.210
>   left son=2 (62 obs) right son=3 (19 obs)
>   Primary splits:
>       Start  < 8.5  to the right, improve=6.762330, (0 missing)
>       Number < 5.5  to the left,  improve=2.866795, (0 missing)
>       Age    < 39.5 to the left,  improve=2.250212, (0 missing)
>   Surrogate splits:
>       Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)
>
> Node number 2: 62 observations,    complexity param=0.01960784
>   predicted class=absent  expected loss=0.09677419  P(node) =0.7654321
>     class counts:    56     6
>    probabilities: 0.903 0.097
>   left son=4 (29 obs) right son=5 (33 obs)
>   Primary splits:
>       Start  < 14.5 to the right, improve=1.0205280, (0 missing)
>       Age    < 55   to the left,  improve=0.6848635, (0 missing)
>       Number < 4.5  to the left,  improve=0.2975332, (0 missing)
>   Surrogate splits:
>       Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
>       Age    < 16   to the left,  agree=0.597, adj=0.138, (0 split)
>
> Node number 3: 19 observations
>   predicted class=present  expected loss=0.4210526  P(node) =0.2345679
>     class counts:     8    11
>    probabilities: 0.421 0.579
>
> Node number 4: 29 observations
>   predicted class=absent  expected loss=0  P(node) =0.3580247
>     class counts:    29     0
>    probabilities: 1.000 0.000
>
> Node number 5: 33 observations,    complexity param=0.01960784
>   predicted class=absent  expected loss=0.1818182  P(node) =0.4074074
>     class counts:    27     6
>    probabilities: 0.818 0.182
>   left son=10 (12 obs) right son=11 (21 obs)
>   Primary splits:
>       Age    < 55   to the left,  improve=1.2467530, (0 missing)
>       Start  < 12.5 to the right, improve=0.2887701, (0 missing)
>       Number < 3.5  to the right, improve=0.1753247, (0 missing)
>   Surrogate splits:
>       Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
>       Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)
>
> Node number 10: 12 observations
>   predicted class=absent  expected loss=0  P(node) =0.1481481
>     class counts:    12     0
>    probabilities: 1.000 0.000
>
> Node number 11: 21 observations,    complexity param=0.01960784
>   predicted class=absent  expected loss=0.2857143  P(node) =0.2592593
>     class counts:    15     6
>    probabilities: 0.714 0.286
>   left son=22 (14 obs) right son=23 (7 obs)
>   Primary splits:
>       Age    < 111  to the right, improve=1.71428600, (0 missing)
>       Start  < 12.5 to the right, improve=0.79365080, (0 missing)
>       Number < 3.5  to the right, improve=0.07142857, (0 missing)
>
> Node number 22: 14 observations
>   predicted class=absent  expected loss=0.1428571  P(node) =0.1728395
>     class counts:    12     2
>    probabilities: 0.857 0.143
>
> Node number 23: 7 observations
>   predicted class=present  expected loss=0.4285714  P(node) =0.08641975
>     class counts:     3     4
>    probabilities: 0.429 0.571
> {code}
[jira] [Updated] (SPARK-19066) SparkR LDA doesn't set optimizer correctly
[ https://issues.apache.org/jira/browse/SPARK-19066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-19066:
--
Fix Version/s: 2.1.1

> SparkR LDA doesn't set optimizer correctly
> --
>
> Key: SPARK-19066
> URL: https://issues.apache.org/jira/browse/SPARK-19066
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Reporter: Miao Wang
> Assignee: Miao Wang
> Fix For: 2.1.1, 2.2.0
>
> spark.lda passes the optimizer ("em" or "online") to the backend. However, LDAWrapper doesn't set the optimizer based on the value from R. Therefore, for optimizer "em", the `isDistributed` field is FALSE when it should be TRUE.
> In addition, the `summary` method should bring back the results related to `DistributedLDAModel`.
[jira] [Commented] (SPARK-19255) SQL Listener is causing out of memory, in case of large no of shuffle partition
[ https://issues.apache.org/jira/browse/SPARK-19255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827379#comment-15827379 ] Ashok Kumar commented on SPARK-19255:
--
@Takeshi, thanks for looking into the issue. You are right, we can handle it by increasing driver memory, but our actual scenario is 10 million shuffle partitions, which is 100 times that. After analysing the code we found that when a user looks for metrics via the Spark SQL UI, we merge all accumulator data and then display it.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala#L336
So one suggestion: what if we merge accumulator metrics after every configured number of tasks? In that case the out-of-memory issue would not occur. Please suggest; looking forward to your input.

> SQL Listener is causing out of memory, in case of large no of shuffle partition
> ---
>
> Key: SPARK-19255
> URL: https://issues.apache.org/jira/browse/SPARK-19255
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Environment: Linux
> Reporter: Ashok Kumar
> Priority: Minor
> Attachments: spark_sqllistener_oom.png
>
> Test steps:
> 1. CREATE TABLE sample(imei string,age int,task bigint,num double,level decimal(10,3),productdate timestamp,name string,point int) USING com.databricks.spark.csv OPTIONS (path "data.csv", header "false", inferSchema "false");
> 2. set spark.sql.shuffle.partitions=10;
> 3. select count(*) from (select task,sum(age) from sample group by task) t;
>
> After running the above query, the number of objects in the map variable _stageIdToStageMetrics increases to a very high number; this increment is proportional to the number of shuffle partitions.
> Please have a look at the attached screenshot
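The suggestion in the comment, folding accumulator updates into a running aggregate every N tasks instead of retaining every per-task update until the UI asks for them, can be sketched as follows. All names are illustrative; this is not the SQLListener API, just a model of the bounded-buffer idea:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BoundedMetricsBuffer {
    private final int mergeEvery;                            // configured task interval
    private final List<Map<Long, Long>> pending = new ArrayList<>();
    private final Map<Long, Long> merged = new HashMap<>();  // accumulatorId -> value

    BoundedMetricsBuffer(int mergeEvery) { this.mergeEvery = mergeEvery; }

    // Called once per finished task; memory held in `pending` is bounded by
    // mergeEvery rather than by the total task (shuffle partition) count.
    void onTaskEnd(Map<Long, Long> accumUpdates) {
        pending.add(accumUpdates);
        if (pending.size() >= mergeEvery) flush();
    }

    private void flush() {
        for (Map<Long, Long> update : pending) {
            update.forEach((id, v) -> merged.merge(id, v, Long::sum));
        }
        pending.clear();
    }

    Map<Long, Long> snapshot() {   // what the UI would display
        flush();
        return new HashMap<>(merged);
    }
}
```

With, say, mergeEvery=1000, ten million finished tasks never hold more than 1000 raw updates at once, at the cost of losing the ability to re-derive per-task metrics after a flush.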
[jira] [Commented] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
[ https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827371#comment-15827371 ] Genmao Yu commented on SPARK-19185:
--
[~c...@koeninger.org] I provided a PR to give a clearer hint to users when they encounter this problem.

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
> Issue Type: Bug
> Components: DStreams
> Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Reporter: Kalvin Chau
> Labels: streaming, windowing
>
> We've been running into ConcurrentModificationExceptions ("KafkaConsumer is not safe for multi-threaded access") with the CachedKafkaConsumer. I've been working through debugging this issue, and after looking through some of the Spark source code I think this is a bug.
> Our setup is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when, in one executor, two task worker threads are assigned the same Topic+Partition but different sets of offsets. They would both get the same CachedKafkaConsumer, and whichever task thread went first would seek and poll for all the records; at the same time the second thread would try to seek to its offset but fail because it is unable to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y > Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z > Time1 E0 Task0 - Seeks and starts to poll > Time1 E0 Task1 - Attempts to seek, but fails > Here are some relevant logs: > {code} > 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394204414 -> 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394238058 -> 4394257712 > 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394204414 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: > Initial fetch for spark-executor-consumer test-topic 2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Seeking to test-topic-2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting > block rdd_199_2 failed due to an exception > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block > rdd_199_2 could not be removed as it was not found on disk or in memory > 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in > task 49.0 in stage 45.0 (TID 3201) > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69) > at > 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) >
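The sharing the reporter describes can be modeled without Kafka at all: the per-executor cache keys consumers by topic and partition only, so two task threads scheduled for different offset ranges of the same partition receive the same non-thread-safe consumer instance. A hedged sketch, with made-up class names standing in for the real cache and KafkaConsumer:

```java
import java.util.concurrent.ConcurrentHashMap;

public class ConsumerCacheDemo {
    // Stand-in for a KafkaConsumer, which refuses multi-threaded access.
    static class SingleThreadedConsumer {}

    // Keyed by "topic-partition" alone -- no task identity in the key.
    static final ConcurrentHashMap<String, SingleThreadedConsumer> cache =
            new ConcurrentHashMap<>();

    static SingleThreadedConsumer get(String topicPartition) {
        return cache.computeIfAbsent(topicPartition, k -> new SingleThreadedConsumer());
    }

    public static void main(String[] args) {
        // Task0 wants offsets X..Y, Task1 wants Y..Z -- same TopicPartition("abc", 0).
        SingleThreadedConsumer task0 = get("abc-0");
        SingleThreadedConsumer task1 = get("abc-0");
        // Both tasks share one instance; concurrent seek/poll on it is exactly
        // what raises ConcurrentModificationException in the real KafkaConsumer.
        System.out.println(task0 == task1); // prints: true
    }
}
```

This is why the report pairs with spark.executor.cores 1 plus spark.mesos.extra.cores 1: the extra core lets two tasks for the same partition run concurrently in one executor, which is the precondition for the collision.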
[jira] [Assigned] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
[ https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19185:
--
Assignee: Apache Spark
[jira] [Assigned] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
[ https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19185: Assignee: (was: Apache Spark) > ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing > - > > Key: SPARK-19185 > URL: https://issues.apache.org/jira/browse/SPARK-19185 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 > Environment: Spark 2.0.2 > Spark Streaming Kafka 010 > Mesos 0.28.0 - client mode > spark.executor.cores 1 > spark.mesos.extra.cores 1 >Reporter: Kalvin Chau > Labels: streaming, windowing > > We've been running into ConcurrentModificationExcpetions "KafkaConsumer is > not safe for multi-threaded access" with the CachedKafkaConsumer. I've been > working through debugging this issue and after looking through some of the > spark source code I think this is a bug. > Our set up is: > Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using > Spark-Streaming-Kafka-010 > spark.executor.cores 1 > spark.mesos.extra.cores 1 > Batch interval: 10s, window interval: 180s, and slide interval: 30s > We would see the exception when in one executor there are two task worker > threads assigned the same Topic+Partition, but a different set of offsets. > They would both get the same CachedKafkaConsumer, and whichever task thread > went first would seek and poll for all the records, and at the same time the > second thread would try to seek to its offset but fail because it is unable > to acquire the lock. 
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y > Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z > Time1 E0 Task0 - Seeks and starts to poll > Time1 E0 Task1 - Attempts to seek, but fails > Here are some relevant logs: > {code} > 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394204414 -> 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394238058 -> 4394257712 > 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394204414 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: > Initial fetch for spark-executor-consumer test-topic 2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Seeking to test-topic-2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting > block rdd_199_2 failed due to an exception > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block > rdd_199_2 could not be removed as it was not found on disk or in memory > 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in > task 49.0 in stage 45.0 (TID 3201) > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69) > at > 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) > at
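The exception in the trace above originates in KafkaConsumer's lightweight ownership check: acquire() does not block on a lock, it records the owning thread's id with a CAS and throws immediately when any other thread calls in. A minimal, hypothetical sketch of that check (illustrative only, not Kafka's actual code, which lives in org.apache.kafka.clients.consumer.KafkaConsumer):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical re-creation of the single-thread ownership check that
// KafkaConsumer.acquire() performs.
class SingleThreadGuard {
    private static final long NO_OWNER = -1L;
    private final AtomicLong ownerThreadId = new AtomicLong(NO_OWNER);
    private int refCount = 0; // re-entrant acquires by the owner are allowed

    void acquire() {
        long me = Thread.currentThread().getId();
        // If a *different* thread holds the consumer, the CAS fails and we
        // throw right away -- there is no waiting, which is why two Spark
        // tasks sharing one CachedKafkaConsumer fail immediately.
        if (ownerThreadId.get() != me && !ownerThreadId.compareAndSet(NO_OWNER, me))
            throw new java.util.ConcurrentModificationException(
                "KafkaConsumer is not safe for multi-threaded access");
        refCount++;
    }

    void release() {
        if (--refCount == 0) ownerThreadId.set(NO_OWNER);
    }
}
```

This matches the reported behavior: worker-1 seeks and polls first, so worker-0's seek on the same cached consumer hits the ownership check and throws.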
[jira] [Commented] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
[ https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827368#comment-15827368 ] Apache Spark commented on SPARK-19185: -- User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/16629
[jira] [Commented] (SPARK-18627) Cannot connect to Hive metastore in client mode with proxy user
[ https://issues.apache.org/jira/browse/SPARK-18627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827347#comment-15827347 ] Mubashir Kazia commented on SPARK-18627: bq. it doesn't explain why that works in cluster mode. Just a wild guess: it probably also falls back to GSSAPI auth, but it is successful because there is a yarn TGT available in the UGI/subject/ticket cache because of the AMDelegationTokenRenewer. I can't confirm this is the case because the HMS Audit logging does not log the real user. bq. In fact, client mode shouldn't even need delegation tokens for the HMS, although with a proxy user maybe things get a little messed up. Agree. > Cannot connect to Hive metastore in client mode with proxy user > --- > > Key: SPARK-18627 > URL: https://issues.apache.org/jira/browse/SPARK-18627 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > Marking as "minor" since the security story for client mode with proxy users > is a little sketchy to start with, but it shouldn't fail, at least not in > this manner. Error you get is: > {noformat} > Caused by: MetaException(message:Could not connect to meta store using any of > the URIs provided. 
Most recent failure: > org.apache.thrift.transport.TTransportException: GSS initiate failed > at > org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232) > at > org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:316) > at > org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37) > at > org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52) > at > org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796) > at > org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:430) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:240) > at > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1528) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:67) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:82) > at > org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3238) > at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3257) > at > 
org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3482) > at > org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:225) > at > org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:209) > at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:332) > at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:293) > at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:268) > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:529) > at > org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:204) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:249) > {noformat} > Cluster mode works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail:
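The label mismatch Mubashir describes above (a delegation token fetched for the HMS but stored under the HS2 label) can be pictured with a toy credential store. The class and alias strings below are hypothetical; the real mechanism is the Credentials map inside Hadoop's UserGroupInformation:

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for the credentials map inside a UGI subject.
// Alias names used with it are made up for illustration.
class CredentialStore {
    private final Map<String, byte[]> tokens = new HashMap<>();

    void addToken(String alias, byte[] token) { tokens.put(alias, token); }

    // A lookup under the wrong alias simply misses, and the Thrift client
    // then falls back to Kerberos (GSSAPI) auth -- which only succeeds if a
    // TGT happens to be around, e.g. the YARN one in cluster mode.
    byte[] getToken(String alias) { return tokens.get(alias); }
}
```

If the token is stored under one alias but looked up under another, the lookup returns null even though a perfectly valid token is present, consistent with the GSSAPI fallback theory above.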
[jira] [Commented] (SPARK-19081) spark sql use HIVE UDF throw exception when return a Map value
[ https://issues.apache.org/jira/browse/SPARK-19081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827332#comment-15827332 ] Takeshi Yamamuro commented on SPARK-19081: -- Yea, great! Is any action needed for this? If not, I'll mark this issue as resolved. > spark sql use HIVE UDF throw exception when return a Map value > -- > > Key: SPARK-19081 > URL: https://issues.apache.org/jira/browse/SPARK-19081 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Davy Song > > I have run into a problem like https://issues.apache.org/jira/browse/SPARK-3582, > but not with a Map parameter; my evaluate function returns a Map: > public Map evaluate(Text url) {...} > When running spark-sql with this UDF, I get the following exception: > scala.MatchError: interface java.util.Map (of class java.lang.Class) > at > org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:175) > at > org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:112) > at > org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:144) > at > org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:144) > at > org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:133) > at > org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25) > at > org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > 
org.apache.spark.sql.catalyst.plans.logical.Project.output(basicOperators.scala:25) > at > org.apache.spark.sql.catalyst.plans.logical.InsertIntoTable.resolved$lzycompute(basicOperators.scala:149) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
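The scala.MatchError in this trace is the usual failure mode of a type-translation match with a missing case: HiveInspectors.javaClassToDataType matches on the UDF's Java return type, and the affected 1.3.0 code evidently had no branch for java.util.Map. A hypothetical Java rendering of that shape (not Spark's actual code, which is a Scala pattern match; the returned type names are plain strings purely for illustration):

```java
// Illustrative class-to-DataType mapping with a missing Map branch.
class TypeMapper {
    static String javaClassToDataType(Class<?> c) {
        if (c == String.class) return "StringType";
        if (c == Integer.class || c == int.class) return "IntegerType";
        if (java.util.List.class.isAssignableFrom(c)) return "ArrayType";
        // No branch for java.util.Map: the real Scala match throws a
        // MatchError here, exactly as in the stack trace above.
        throw new RuntimeException("scala.MatchError: " + c);
    }
}
```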
[jira] [Commented] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
[ https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827330#comment-15827330 ] Genmao Yu commented on SPARK-19185: --- IMHO, persisting cannot cover all scenarios, such as executor failure or cached data loss. Maybe we can expose a much clearer hint, like "KafkaConsumer is not safe for multi-threaded access; you may set 'useConsumerCache' to false, or avoid running multiple tasks in a single executor", or something else. Besides, we may expose useConsumerCache as a configuration (true by default). This will not change the behavior on a widespread basis. 
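The flag Genmao Yu proposes can be sketched as a cache that is simply bypassed when disabled, so two tasks on one executor never share a consumer. Note that 'useConsumerCache' is the name suggested in the comment, not an existing Spark configuration, and the class below is a generic sketch rather than CachedKafkaConsumer's real code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Generic sketch of an optional consumer cache: with caching off, every call
// builds a private instance, trading reconnect cost for thread isolation.
class ConsumerCache<K, C> {
    private final Map<K, C> cache = new HashMap<>();
    private final boolean useCache;       // proposed "useConsumerCache" flag
    private final Function<K, C> factory; // builds a new consumer for a key

    ConsumerCache(boolean useCache, Function<K, C> factory) {
        this.useCache = useCache;
        this.factory = factory;
    }

    synchronized C get(K key) {
        return useCache ? cache.computeIfAbsent(key, factory) : factory.apply(key);
    }
}
```

With the flag off, each task pays the cost of a fresh connection per TopicPartition, but the shared-consumer race described in the report cannot occur.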
[jira] [Commented] (SPARK-18627) Cannot connect to Hive metastore in client mode with proxy user
[ https://issues.apache.org/jira/browse/SPARK-18627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827328#comment-15827328 ] Marcelo Vanzin commented on SPARK-18627: bq. The title of this jira is also incorrect. It is successfully able to fetch HMS delegation token Right, changed. bq. Spark connects to and uses HMS not HS2. It fetches a delegation token for HMS...but stores it in the credential cache with the label for HS2 Although that does look fishy, it doesn't explain why that works in cluster mode. In fact, client mode shouldn't even need delegation tokens for the HMS, although with a proxy user maybe things get a little messed up.
[jira] [Updated] (SPARK-19250) In security cluster, spark beeline connect to hive metastore failed
[ https://issues.apache.org/jira/browse/SPARK-19250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-19250: - Description: 1. Start the thriftserver in security mode, with hive.metastore.uris set to the Hive metastore URI; Hive is also in security mode. 2. When using beeline to create a table, it cannot connect to the Hive metastore; "Failed to find any Kerberos tgt" occurs. {quote} 2017-01-17 16:25:53,618 | ERROR | [pool-25-thread-1] | SASL negotiation failure | org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:315) javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94) at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271) at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37) at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52) at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1738) at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:513) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:249) at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1533) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3119) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3138) at org.apache.hadoop.hive.ql.session.SessionState.setAuthorizerV2Config(SessionState.java:791) at org.apache.hadoop.hive.ql.session.SessionState.setupAuth(SessionState.java:755) at org.apache.hadoop.hive.ql.session.SessionState.getAuthenticator(SessionState.java:1461) at org.apache.hadoop.hive.ql.session.SessionState.getUserFromAuthenticator(SessionState.java:1014) at org.apache.hadoop.hive.ql.metadata.Table.getEmptyTable(Table.java:177) at org.apache.hadoop.hive.ql.metadata.Table.(Table.java:119) at org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$toHiveTable(HiveClientImpl.scala:803) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:430) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:430) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:430) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:284) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231) at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273) at org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:429) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:229) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:191) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:191) at
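The failure sequence in this description reduces to a simple decision: use a delegation token if one is present, otherwise fall back to the Kerberos TGT; with hive.metastore.uris overridden to local, no token is obtained, and the proxy user has no TGT either. A hypothetical sketch of that decision (the enum and class are illustrative, not Hive's actual client code):

```java
// Illustrative model of the metastore client's auth selection described above.
enum Auth { DELEGATION_TOKEN, KERBEROS_TGT }

class MetastoreAuthChooser {
    static Auth choose(boolean hasDelegationToken, boolean hasTgt) {
        if (hasDelegationToken) return Auth.DELEGATION_TOKEN; // DIGEST path
        if (hasTgt) return Auth.KERBEROS_TGT;                 // GSSAPI path
        // Proxy user: no token was obtained and no TGT exists in the subject.
        throw new IllegalStateException("Failed to find any Kerberos tgt");
    }
}
```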
[jira] [Updated] (SPARK-18627) Cannot connect to Hive metastore in client mode with proxy user
[ https://issues.apache.org/jira/browse/SPARK-18627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-18627: --- Summary: Cannot connect to Hive metastore in client mode with proxy user (was: Cannot fetch Hive delegation tokens in client mode with proxy user)
[jira] [Updated] (SPARK-19266) Ensure DiskStore properly encrypts cached data on disk.
[ https://issues.apache.org/jira/browse/SPARK-19266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-19266: --- Summary: Ensure DiskStore properly encrypts cached data on disk. (was: DiskStore does not encrypt serialized RDD data) > Ensure DiskStore properly encrypts cached data on disk. > --- > > Key: SPARK-19266 > URL: https://issues.apache.org/jira/browse/SPARK-19266 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Priority: Minor > > {{DiskStore.putBytes()}} writes serialized RDD data directly to disk, without > encrypting (or compressing) it. So any cached blocks that are evicted to disk > when using {{MEMORY_AND_DISK_SER}} will not be encrypted.
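The fix the new summary asks for amounts to routing evicted block bytes through an encrypting stream before they reach disk. Below is a minimal sketch using the JDK's own crypto classes; Spark's actual implementation goes through its internal crypto stream utilities, so treat this helper as illustrative only:

```java
import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.io.OutputStream;

// Illustrative helper: wrap the destination stream so DiskStore.putBytes()-style
// writes are transparently encrypted. AES/CTR keeps the byte length unchanged.
class EncryptingWriter {
    static OutputStream wrap(OutputStream out, SecretKey key, byte[] iv) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        return new CipherOutputStream(out, cipher);
    }
}
```

The read path would wrap the input stream with the same key and IV in DECRYPT_MODE, so blocks evicted under {{MEMORY_AND_DISK_SER}} round-trip transparently.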
[jira] [Commented] (SPARK-19250) In security cluster, spark beeline connect to hive metastore failed
[ https://issues.apache.org/jira/browse/SPARK-19250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827320#comment-15827320 ] meiyoula commented on SPARK-19250: -- [~rxin] I think you may be familiar with this bug. I found that you set hive.metastore.uris="" on thriftserver.hiveConf in commit https://github.com/apache/spark/commit/054f991c4350af1350af7a4109ee77f4a34822f0#diff-709404b0d3defeff035ef0c4f5a960e5. But when it is set to a local metastore, beeline's openSession will not obtain a token, and connecting to the remote metastore fails. Could you have a look and share some ideas? Thanks! > In security cluster, spark beeline connect to hive metastore failed > --- > > Key: SPARK-19250 > URL: https://issues.apache.org/jira/browse/SPARK-19250 > Project: Spark > Issue Type: Bug >Reporter: meiyoula > Labels: security-issue > > 1. Start the thrift server in security mode with hive.metastore.uris set to the Hive > metastore URI; Hive is also running in security mode. > 2. When beeline is used to create a table, it cannot connect to the Hive metastore; > "Failed to find any Kerberos tgt" occurs. > Reason: > When the Hive metastore client is opened, it first checks whether a token exists; > because hive.metastore.uris has been set to local, no token is obtained. It then > falls back to TGT authentication, but the current user is a proxy user, so opening > the metastore client fails.
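The failure mode the reporter describes can be condensed into one decision: a metastore delegation token is only fetched for a *remote* metastore, i.e. when hive.metastore.uris is non-empty. The sketch below is a hypothetical distillation of that logic (the method name and shape are mine, not Hive's or Spark's) to show why forcing the setting to "" skips the token even though the real metastore is remote and kerberized.

```java
public class TokenDecisionSketch {
    // Hypothetical condensation of the logic described in the issue:
    // a delegation token is fetched only when the metastore is remote
    // (hive.metastore.uris non-empty) on a secure cluster.
    static boolean needsDelegationToken(String metastoreUris, boolean secureCluster) {
        boolean remoteMetastore = metastoreUris != null && !metastoreUris.trim().isEmpty();
        return secureCluster && remoteMetastore;
    }

    public static void main(String[] args) {
        // The thrift server forces the uris to "", so no token is obtained...
        System.out.println(needsDelegationToken("", true));
        // ...even though the actual metastore is remote and kerberized:
        System.out.println(needsDelegationToken("thrift://metastore:9083", true));
    }
}
```

With no token and a proxy user that cannot use the TGT directly, the openSession call then has no credential that works against the remote metastore.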
[jira] [Updated] (SPARK-19266) DiskStore does not encrypt serialized RDD data
[ https://issues.apache.org/jira/browse/SPARK-19266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-19266: --- Priority: Minor (was: Major) > DiskStore does not encrypt serialized RDD data > -- > > Key: SPARK-19266 > URL: https://issues.apache.org/jira/browse/SPARK-19266 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Priority: Minor > > {{DiskStore.putBytes()}} writes serialized RDD data directly to disk, without > encrypting (or compressing) it. So any cached blocks that are evicted to disk > when using {{MEMORY_AND_DISK_SER}} will not be encrypted.
[jira] [Commented] (SPARK-19266) DiskStore does not encrypt serialized RDD data
[ https://issues.apache.org/jira/browse/SPARK-19266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827315#comment-15827315 ] Marcelo Vanzin commented on SPARK-19266: Hmm... seems {{partiallySerializedValues.finishWritingToStream}} will actually keep writing encrypted data when it's called (it uses a {{RedirectableOutputStream}} which just replaces the byte array stream with a file stream, keeping the existing encryption and compression state). But I can't find where that spilled data is read, and I'm not sure whether that's properly covered by existing tests. I'll try to write a test to exercise this path; worst case we have a new test that makes sure it works. > DiskStore does not encrypt serialized RDD data > -- > > Key: SPARK-19266 > URL: https://issues.apache.org/jira/browse/SPARK-19266 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin > > {{DiskStore.putBytes()}} writes serialized RDD data directly to disk, without > encrypting (or compressing) it. So any cached blocks that are evicted to disk > when using {{MEMORY_AND_DISK_SER}} will not be encrypted.
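The mechanism the comment relies on, a stream whose underlying sink can be swapped while the encryption/compression wrappers layered on top keep their state, can be shown in a few lines. This is a schematic of the idea only, not Spark's actual {{RedirectableOutputStream}} source; names other than that class are mine.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// A minimal redirectable stream: wrappers layered on top of it (e.g. an
// encrypting or compressing stream) keep their internal state while the
// underlying sink is swapped from a memory buffer to a "file".
class RedirectableStream extends OutputStream {
    private OutputStream target;
    RedirectableStream(OutputStream initial) { this.target = initial; }
    void redirect(OutputStream newTarget) { this.target = newTarget; }
    @Override public void write(int b) throws IOException { target.write(b); }
    @Override public void write(byte[] b, int off, int len) throws IOException {
        target.write(b, off, len);
    }
}

public class RedirectSketch {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream memory = new ByteArrayOutputStream();
        ByteArrayOutputStream file = new ByteArrayOutputStream();  // stand-in for a disk file

        RedirectableStream sink = new RedirectableStream(memory);
        // Imagine 'sink' is wrapped in a CipherOutputStream here; the wrapper
        // never notices the redirect, so bytes written after the swap stay
        // encrypted with the same cipher state.
        sink.write("first half ".getBytes("UTF-8"));
        sink.redirect(file);                       // memory ran out: spill to disk
        sink.write("second half".getBytes("UTF-8"));

        System.out.println("memory: " + memory.toString("UTF-8"));
        System.out.println("file: " + file.toString("UTF-8"));
    }
}
```

This is why {{finishWritingToStream}} can keep producing encrypted output across the spill boundary: only the sink changes, not the wrappers.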
[jira] [Updated] (SPARK-19250) In security cluster, spark beeline connect to hive metastore failed
[ https://issues.apache.org/jira/browse/SPARK-19250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-19250: - Labels: security-issue (was: ) > In security cluster, spark beeline connect to hive metastore failed > --- > > Key: SPARK-19250 > URL: https://issues.apache.org/jira/browse/SPARK-19250 > Project: Spark > Issue Type: Bug >Reporter: meiyoula > Labels: security-issue > > 1. Start the thrift server in security mode with hive.metastore.uris set to the Hive > metastore URI; Hive is also running in security mode. > 2. When beeline is used to create a table, it cannot connect to the Hive metastore; > "Failed to find any Kerberos tgt" occurs. > Reason: > When the Hive metastore client is opened, it first checks whether a token exists; > because hive.metastore.uris has been set to local, no token is obtained. It then > falls back to TGT authentication, but the current user is a proxy user, so opening > the metastore client fails.
[jira] [Comment Edited] (SPARK-19234) AFTSurvivalRegression chokes silently or with confusing errors when any labels are zero
[ https://issues.apache.org/jira/browse/SPARK-19234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827236#comment-15827236 ] Andrew MacKinlay edited comment on SPARK-19234 at 1/18/17 1:53 AM: --- [~yanboliang] I presume that you are the author judging by Github commits? Do you have an opinion on this? was (Author: admackin): [~yanbo] I presume that you are the author judging by Github commits? Do you have an opinion on this? > AFTSurvivalRegression chokes silently or with confusing errors when any > labels are zero > --- > > Key: SPARK-19234 > URL: https://issues.apache.org/jira/browse/SPARK-19234 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 > Environment: spark-shell or pyspark >Reporter: Andrew MacKinlay >Priority: Minor > Attachments: spark-aft-failure.txt > > > If you try and use AFTSurvivalRegression and any label in your input data is > 0.0, you get coefficients of 0.0 returned, and in many cases, errors like > this: > {{17/01/16 15:10:50 ERROR StrongWolfeLineSearch: Encountered bad values in > function evaluation. Decreasing step size to NaN}} > Zero should, I think, be an allowed value for survival analysis. I don't know > if this is a pathological case for AFT specifically as I don't know enough > about it, but this behaviour is clearly undesirable. If you have any labels > of 0.0, you get either a) obscure error messages, with no knowledge of the > cause and coefficients which are all zero or b) no error messages at all and > coefficients of zero (arguably worse, since you don't even have console > output to tell you something's gone awry). If AFT doesn't work with > zero-valued labels, Spark should fail fast and let the developer know why. If > it does, we should get results here.
[jira] [Commented] (SPARK-16578) Configurable hostname for RBackend
[ https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827237#comment-15827237 ] Shivaram Venkataraman commented on SPARK-16578: --- I think it's fine to remove the target version for this - as [~mengxr] said, it's not clear what the requirements are for this deploy mode or what kind of applications will use it. If somebody puts that together we can retarget? > Configurable hostname for RBackend > -- > > Key: SPARK-16578 > URL: https://issues.apache.org/jira/browse/SPARK-16578 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Junyang Qian > > One of the requirements that comes up with SparkR being a standalone package > is that users can now install just the R package on the client side and > connect to a remote machine which runs the RBackend class. > We should check if we can support this mode of execution and what are the > pros / cons of it
[jira] [Commented] (SPARK-19234) AFTSurvivalRegression chokes silently or with confusing errors when any labels are zero
[ https://issues.apache.org/jira/browse/SPARK-19234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827236#comment-15827236 ] Andrew MacKinlay commented on SPARK-19234: -- [~yanbo] I presume that you are the author judging by Github commits? Do you have an opinion on this? > AFTSurvivalRegression chokes silently or with confusing errors when any > labels are zero > --- > > Key: SPARK-19234 > URL: https://issues.apache.org/jira/browse/SPARK-19234 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 > Environment: spark-shell or pyspark >Reporter: Andrew MacKinlay >Priority: Minor > Attachments: spark-aft-failure.txt > > > If you try and use AFTSurvivalRegression and any label in your input data is > 0.0, you get coefficients of 0.0 returned, and in many cases, errors like > this: > {{17/01/16 15:10:50 ERROR StrongWolfeLineSearch: Encountered bad values in > function evaluation. Decreasing step size to NaN}} > Zero should, I think, be an allowed value for survival analysis. I don't know > if this is a pathological case for AFT specifically as I don't know enough > about it, but this behaviour is clearly undesirable. If you have any labels > of 0.0, you get either a) obscure error messages, with no knowledge of the > cause and coefficients which are all zero or b) no error messages at all and > coefficients of zero (arguably worse, since you don't even have console > output to tell you something's gone awry). If AFT doesn't work with > zero-valued labels, Spark should fail fast and let the developer know why. If > it does, we should get results here.
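A plausible mechanism for the reported behaviour, offered here as an illustration rather than a confirmed diagnosis of Spark's code, is that AFT models work with the logarithm of the survival time. A label of exactly 0.0 therefore contributes log(0) = negative infinity to the likelihood, and once an infinity meets a zero-valued term in a gradient computation the result is NaN, which matches the StrongWolfeLineSearch "Decreasing step size to NaN" error above.

```java
public class ZeroLabelSketch {
    public static void main(String[] args) {
        // AFT-style models work with log(label); a zero label degenerates.
        double label = 0.0;
        double logT = Math.log(label);
        System.out.println("log(0.0) = " + logT);          // negative infinity
        // Any arithmetic mixing that infinity with a zero-valued term
        // (as can happen inside a gradient computation) yields NaN:
        System.out.println("isNaN: " + Double.isNaN(logT * 0.0));
    }
}
```

If this is indeed the cause, it supports the reporter's request: a fail-fast validation on zero (or negative) labels would turn a silent NaN cascade into an actionable error.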
[jira] [Updated] (SPARK-19268) File does not exist: /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta
[ https://issues.apache.org/jira/browse/SPARK-19268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyan updated SPARK-19268: -- Description: bq. ./run-example sql.streaming.JavaStructuredKafkaWordCount 192.168.3.110:9092 subscribe topic03 when i run the spark example raises the following error: {quote} Exception in thread "main" 17/01/17 14:13:41 DEBUG ContextCleaner: Got cleaning task CleanBroadcast(4) org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to stage failure: Task 2 in stage 9.0 failed 1 times, most recent failure: Lost task 2.0 in stage 9.0 (TID 46, localhost, executor driver): java.lang.IllegalStateException: Error reading delta file /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta of HDFSStateStoreProvider[id = (op=0, part=2), dir = /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2]: /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta does not exist at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:354) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:306) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:303) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:303) at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:302) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:302) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:220) at org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:151) at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.FileNotFoundException: File does not exist: /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71) at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61) at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:587) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at
[jira] [Comment Edited] (SPARK-19264) Work should start driver, the same to AM of yarn
[ https://issues.apache.org/jira/browse/SPARK-19264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827201#comment-15827201 ] hustfxj edited comment on SPARK-19264 at 1/18/17 1:39 AM: -- So I must change the end of the code like this: {code} try { ssc.awaitTermination() } finally { System.exit(-1); } {code} I think the user shouldn't have to care about this issue. Since Spark provides awaitTermination(), the driver should exit once the main thread is done, whether or not it still contains unfinished non-daemon threads. was (Author: hustfxj): So I must change the last code like that {code} try { ssc.awaitTermination() } finally { System.exit(-1); } {code} > Work should start driver, the same to AM of yarn > --- > > Key: SPARK-19264 > URL: https://issues.apache.org/jira/browse/SPARK-19264 > Project: Spark > Issue Type: Improvement >Reporter: hustfxj > > I think the worker can't start the driver via "ProcessBuilderLike", because then we > can't tell whether the application's main thread is finished when it contains > non-daemon threads: the program terminates only when no non-daemon thread is > running (or someone calls System.exit), and the main thread may have finished > long ago. > The worker should start the driver like YARN's AM does, as follows: > {code:title=ApplicationMaster.scala|borderStyle=solid} > mainMethod.invoke(null, userArgs.toArray) > finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) > logDebug("Done running users class") > {code} > Then the worker can monitor the driver's main thread and know the application's state.
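The JVM behaviour underlying this whole thread is that a process exits only when all non-daemon threads have finished, regardless of main() returning, which is why the reporter resorts to System.exit. A minimal demonstration of that rule (plain JDK, nothing Spark-specific):

```java
public class DaemonSketch {
    public static void main(String[] args) {
        Thread worker = new Thread(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) { }
        });
        // With setDaemon(false) (the default) this JVM would linger for a
        // minute after main() returns, because a non-daemon thread is still
        // alive. Marking the thread as a daemon lets the process exit as
        // soon as main() is done.
        worker.setDaemon(true);
        worker.start();
        System.out.println("main done; JVM exits despite the sleeping thread");
    }
}
```

Flip `setDaemon(true)` to `false` and the program illustrates the reported problem: the driver process stays alive long after the user's main method has finished.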
[jira] [Created] (SPARK-19268) File does not exist: /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta
liyan created SPARK-19268: - Summary: File does not exist: /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta Key: SPARK-19268 URL: https://issues.apache.org/jira/browse/SPARK-19268 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Environment: - hadoop2.7 - Java 7 Reporter: liyan bq. ./run-example sql.streaming.JavaStructuredKafkaWordCount 192.168.3.110:9092 subscribe topic03 when i run the spark example raises the following error: {quote} Exception in thread "main" 17/01/17 14:13:41 DEBUG ContextCleaner: Got cleaning task CleanBroadcast(4) org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to stage failure: Task 2 in stage 9.0 failed 1 times, most recent failure: Lost task 2.0 in stage 9.0 (TID 46, localhost, executor driver): java.lang.IllegalStateException: Error reading delta file /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta of HDFSStateStoreProvider[id = (op=0, part=2), dir = /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2]: /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta does not exist at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:354) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:306) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:303) at scala.Option.getOrElse(Option.scala:121) at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:303) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:302) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:302) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:220) at org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:151) at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.FileNotFoundException: File does not exist: 
/tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71) at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:587) at
[jira] [Commented] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
[ https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827215#comment-15827215 ] Helena Edelson commented on SPARK-19185: I've seen this as well, the exceptions, as expected, are never raised if not using the cache. Spark 2.1. The exception is raised in the seek function. I added an opt-in config for that temporarily but will work on a better solution. Perf hits aren't something I can do :) > ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing > - > > Key: SPARK-19185 > URL: https://issues.apache.org/jira/browse/SPARK-19185 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 > Environment: Spark 2.0.2 > Spark Streaming Kafka 010 > Mesos 0.28.0 - client mode > spark.executor.cores 1 > spark.mesos.extra.cores 1 >Reporter: Kalvin Chau > Labels: streaming, windowing > > We've been running into ConcurrentModificationExcpetions "KafkaConsumer is > not safe for multi-threaded access" with the CachedKafkaConsumer. I've been > working through debugging this issue and after looking through some of the > spark source code I think this is a bug. > Our set up is: > Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using > Spark-Streaming-Kafka-010 > spark.executor.cores 1 > spark.mesos.extra.cores 1 > Batch interval: 10s, window interval: 180s, and slide interval: 30s > We would see the exception when in one executor there are two task worker > threads assigned the same Topic+Partition, but a different set of offsets. > They would both get the same CachedKafkaConsumer, and whichever task thread > went first would seek and poll for all the records, and at the same time the > second thread would try to seek to its offset but fail because it is unable > to acquire the lock. 
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y > Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z > Time1 E0 Task0 - Seeks and starts to poll > Time1 E0 Task1 - Attempts to seek, but fails > Here are some relevant logs: > {code} > 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394204414 -> 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394238058 -> 4394257712 > 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394204414 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: > Initial fetch for spark-executor-consumer test-topic 2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Seeking to test-topic-2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting > block rdd_199_2 failed due to an exception > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block > rdd_199_2 could not be removed as it was not found on disk or in memory > 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in > task 49.0 in stage 45.0 (TID 3201) > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69) > at > 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at
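The "not safe for multi-threaded access" failure in the trace above comes from the consumer's single-owner guard rather than from a lock being held. A minimal sketch of that guard, assuming a CAS-on-thread-id check like the one behind KafkaConsumer.acquire in the stack trace; the class and field names here are illustrative, not Kafka's actual internals:

```java
import java.util.ConcurrentModificationException;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative single-owner guard: one thread may use the consumer at a time;
// a second thread fails fast instead of corrupting internal fetch state.
public class AcquireSketch {
    private static final long NO_CURRENT_THREAD = -1L;
    private final AtomicLong currentThread = new AtomicLong(NO_CURRENT_THREAD);

    public void acquire() {
        long threadId = Thread.currentThread().getId();
        // Re-entry by the owning thread is fine; any other thread must win the
        // CAS from "unowned", which fails while the owner is mid-seek/poll.
        if (threadId != currentThread.get()
                && !currentThread.compareAndSet(NO_CURRENT_THREAD, threadId)) {
            throw new ConcurrentModificationException(
                "KafkaConsumer is not safe for multi-threaded access");
        }
    }

    public void release() {
        currentThread.set(NO_CURRENT_THREAD);
    }

    public static void main(String[] args) throws Exception {
        AcquireSketch consumer = new AcquireSketch();
        consumer.acquire();                      // task worker 0 seeks and polls
        Thread second = new Thread(() -> {       // task worker 1, same cached consumer
            try {
                consumer.acquire();
                System.out.println("acquired");
            } catch (ConcurrentModificationException e) {
                System.out.println("CME");       // this branch runs: prints "CME"
            }
        });
        second.start();
        second.join();
        consumer.release();
    }
}
```

This matches the reported scenario: two worker threads in one executor get the same cached consumer for the same TopicPartition, so whichever seeks second throws rather than blocking.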
[jira] [Commented] (SPARK-19264) Work should start driver, the same to AM of yarn
[ https://issues.apache.org/jira/browse/SPARK-19264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827201#comment-15827201 ] hustfxj commented on SPARK-19264: - So I must change the end of the code like this: {code} try { ssc.awaitTermination() } finally { System.exit(-1); } {code} > Work should start driver, the same to AM of yarn > --- > > Key: SPARK-19264 > URL: https://issues.apache.org/jira/browse/SPARK-19264 > Project: Spark > Issue Type: Improvement >Reporter: hustfxj > > I think work can't start driver by "ProcessBuilderLike", thus we can't > know the application's main thread is finished or not if the application's > main thread contains some non-daemon threads. Because the program terminates > when there no longer is any non-daemon thread running (or someone called > System.exit). The main thread can have finished long ago. > worker should start driver like AM of YARN . As followed: > {code:title=ApplicationMaster.scala|borderStyle=solid} > mainMethod.invoke(null, userArgs.toArray) > finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) > logDebug("Done running users class") > {code} > Then the work can monitor the driver's main thread, and know the > application's state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19081) spark sql use HIVE UDF throw exception when return a Map value
[ https://issues.apache.org/jira/browse/SPARK-19081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827199#comment-15827199 ] Davy Song commented on SPARK-19081: --- Thanks Takeshi, you are right. I tested Map on Databricks's spark-sql 2.0 and hit the exception as follows: com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.AnalysisException: Map type in java is unsupported because JVM type erasure makes spark fail to catch key and value types in Map<>; line 2 pos 12 at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:234) at org.apache.spark.sql.hive.HiveSimpleUDF.javaClassToDataType(hiveUDFs.scala:41) at org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:71) at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:71) at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionBuilder$1.apply(HiveSessionCatalog.scala:126) at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionBuilder$1.apply(HiveSessionCatalog.scala:122) at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:87) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:853) at org.apache.spark.sql.hive.HiveSessionCatalog.org$apache$spark$sql$hive$HiveSessionCatalog$$super$lookupFunction(HiveSessionCatalog.scala:186) at > spark sql use HIVE UDF throw exception when return a Map value > -- > > Key: SPARK-19081 > URL: https://issues.apache.org/jira/browse/SPARK-19081 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Davy Song > > I have met a problem like https://issues.apache.org/jira/browse/SPARK-3582, > but not with this parameter Map, my evaluate function returns a Map: > public Map evaluate(Text url) {...} > when running spark-sql with this UDF, I get the following exception: > scala.MatchError: interface 
java.util.Map (of class java.lang.Class) > at > org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:175) > at > org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:112) > at > org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:144) > at > org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:144) > at > org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:133) > at > org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25) > at > org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.plans.logical.Project.output(basicOperators.scala:25) > at > org.apache.spark.sql.catalyst.plans.logical.InsertIntoTable.resolved$lzycompute(basicOperators.scala:149) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
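The MatchError and AnalysisException above both boil down to Java type erasure: a UDF declared with a raw `Map` return type gives `javaClassToDataType` no key or value classes to map to a Spark SQL type. A minimal reflection sketch of the difference; `RawUdf` and `TypedUdf` are hypothetical stand-ins for real UDF classes, not Hive API:

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.Map;

// Sketch: a raw Map return type surfaces at runtime as the bare interface
// Class, while a parameterized declaration keeps its type arguments in the
// method's generic signature, recoverable via getGenericReturnType().
public class ErasureSketch {
    public static class RawUdf {
        public Map evaluate(String url) { return null; }              // raw: key/value unknown
    }
    public static class TypedUdf {
        public Map<String, Integer> evaluate(String url) { return null; }
    }

    public static void main(String[] args) throws Exception {
        Type raw = RawUdf.class.getMethod("evaluate", String.class)
                               .getGenericReturnType();
        Type typed = TypedUdf.class.getMethod("evaluate", String.class)
                                   .getGenericReturnType();
        // Raw Map: just a Class object, nothing to build a MapType from.
        System.out.println(raw instanceof ParameterizedType);    // false
        // Parameterized Map: key/value types are recoverable.
        System.out.println(typed instanceof ParameterizedType);  // true
        System.out.println(((ParameterizedType) typed)
            .getActualTypeArguments()[0]);                       // class java.lang.String
    }
}
```

In other words, declaring the UDF's return type as a parameterized `Map<K, V>` (rather than raw `Map`) is what gives the catalyst side a chance to infer the key and value types.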
[jira] [Comment Edited] (SPARK-19264) Work should start driver, the same to AM of yarn
[ https://issues.apache.org/jira/browse/SPARK-19264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827181#comment-15827181 ] hustfxj edited comment on SPARK-19264 at 1/18/17 1:18 AM: -- [~srowen] sorry, I didn't say it clearly. I mean the driver can't finish when it contains other unfinished non-daemon threads. Look at the following example: the driver program should crash due to the exception, but in fact the driver program can't crash because the timer threads are still running. {code:title=Test.scala|borderStyle=solid} val sparkConf = new SparkConf().setAppName("NetworkWordCount") sparkConf.set("spark.streaming.blockInterval", "1000ms") val ssc = new StreamingContext(sparkConf, Seconds(10)) //non-daemon thread val scheduledExecutorService = Executors.newSingleThreadScheduledExecutor( new ThreadFactory() { def newThread(r: Runnable): Thread = new Thread(r, "Driver-Commit-Thread") }) scheduledExecutorService.scheduleAtFixedRate( new Runnable() { def run() { try { System.out.println("runnable") } catch { case e: Exception => { System.out.println("ScheduledTask persistAllConsumerOffset exception: " + e) } } } }, 1000, 1000 * 5, TimeUnit.MILLISECONDS) val lines = ssc.receiverStream(new WordReceiver(StorageLevel.MEMORY_AND_DISK_2)) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKey((x: Int, y: Int) => x + y, 10) wordCounts.foreachRDD{rdd => rdd.collect().foreach(println) throw new RuntimeException } ssc.start() ssc.awaitTermination() {code} was (Author: hustfxj): [~srowen] sorry, I didn't say it clearly. I means the spark application can't be done when it contains other unfinished non-daemon. Look at the followed example. the driver program should crash due to the exception. But In fact the driver program can't crash because the timer threads still are runnning. 
{code:title=Test.scala|borderStyle=solid} val sparkConf = new SparkConf().setAppName("NetworkWordCount") sparkConf.set("spark.streaming.blockInterval", "1000ms") val ssc = new StreamingContext(sparkConf, Seconds(10)) //non-daemon thread val scheduledExecutorService = Executors.newSingleThreadScheduledExecutor( new ThreadFactory() { def newThread(r: Runnable): Thread = new Thread(r, "Driver-Commit-Thread") }) scheduledExecutorService.scheduleAtFixedRate( new Runnable() { def run() { try { System.out.println("runable") } catch { case e: Exception => { System.out.println("ScheduledTask persistAllConsumerOffset exception", e) } } } }, 1000, 1000 * 5, TimeUnit.MILLISECONDS) val lines = ssc.receiverStream(new WordReceiver(StorageLevel.MEMORY_AND_DISK_2)) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKey((x: Int, y: Int) => x + y, 10) wordCounts.foreachRDD{rdd => rdd.collect().foreach(println) throw new RuntimeException } ssc.start() ssc.awaitTermination() {code} > Work should start driver, the same to AM of yarn > --- > > Key: SPARK-19264 > URL: https://issues.apache.org/jira/browse/SPARK-19264 > Project: Spark > Issue Type: Improvement >Reporter: hustfxj > > I think work can't start driver by "ProcessBuilderLike", thus we can't > know the application's main thread is finished or not if the application's > main thread contains some non-daemon threads. Because the program terminates > when there no longer is any non-daemon thread running (or someone called > System.exit). The main thread can have finished long ago. > worker should start driver like AM of YARN . As followed: > {code:title=ApplicationMaster.scala|borderStyle=solid} > mainMethod.invoke(null, userArgs.toArray) > finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) > logDebug("Done running users class") > {code} > Then the work can monitor the driver's main thread, and know the > application's state. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19264) Work should start driver, the same to AM of yarn
[ https://issues.apache.org/jira/browse/SPARK-19264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827181#comment-15827181 ] hustfxj commented on SPARK-19264: - [~srowen] sorry, I didn't say it clearly. I mean a Spark application can't finish when it contains other unfinished non-daemon threads. For example: {code:title=Test.scala|borderStyle=solid} val sparkConf = new SparkConf().setAppName("NetworkWordCount") sparkConf.set("spark.streaming.blockInterval", "1000ms") val ssc = new StreamingContext(sparkConf, Seconds(10)) //non-daemon thread val scheduledExecutorService = Executors.newSingleThreadScheduledExecutor( new ThreadFactory() { def newThread(r: Runnable): Thread = new Thread(r, "Driver-Commit-Thread") }) scheduledExecutorService.scheduleAtFixedRate( new Runnable() { def run() { try { System.out.println("runnable") } catch { case e: Exception => { System.out.println("ScheduledTask persistAllConsumerOffset exception: " + e) } } } }, 1000, 1000 * 5, TimeUnit.MILLISECONDS) val lines = ssc.receiverStream(new WordReceiver(StorageLevel.MEMORY_AND_DISK_2)) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKey((x: Int, y: Int) => x + y, 10) wordCounts.foreachRDD{rdd => rdd.collect().foreach(println) throw new RuntimeException } ssc.start() ssc.awaitTermination() {code} > Work should start driver, the same to AM of yarn > --- > > Key: SPARK-19264 > URL: https://issues.apache.org/jira/browse/SPARK-19264 > Project: Spark > Issue Type: Improvement >Reporter: hustfxj > > I think work can't start driver by "ProcessBuilderLike", thus we can't > know the application's main thread is finished or not if the application's > main thread contains some non-daemon threads. Because the program terminates > when there no longer is any non-daemon thread running (or someone called > System.exit). The main thread can have finished long ago. > worker should start driver like AM of YARN . 
As followed: > {code:title=ApplicationMaster.scala|borderStyle=solid} > mainMethod.invoke(null, userArgs.toArray) > finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) > logDebug("Done running users class") > {code} > Then the work can monitor the driver's main thread, and know the > application's state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
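The ApplicationMaster pattern quoted above can be sketched without YARN: the supervisor invokes the user class's main() by reflection on a thread it owns, then joins that thread to observe when the application logic (rather than the whole JVM) has finished. `UserApp` and the printed status are illustrative stand-ins, not Spark's actual worker API:

```java
import java.lang.reflect.Method;

// Sketch of supervising user code the way YARN's ApplicationMaster does:
// run main() in-process on a monitored thread instead of a child process,
// so completion is observable even if non-daemon threads linger.
public class InvokeMainSketch {
    public static class UserApp {
        public static void main(String[] args) {
            System.out.println("user code ran");
        }
    }

    public static void main(String[] args) throws Exception {
        Method mainMethod = UserApp.class.getMethod("main", String[].class);
        Thread userThread = new Thread(() -> {
            try {
                // Equivalent to mainMethod.invoke(null, userArgs.toArray)
                mainMethod.invoke(null, (Object) new String[0]);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }, "Driver");
        userThread.start();
        userThread.join();               // supervisor sees completion directly
        System.out.println("SUCCEEDED"); // analogous to finish(FinalApplicationStatus.SUCCEEDED, ...)
    }
}
```

The design point the reporter makes: with ProcessBuilder-style launching the worker only sees the child JVM's exit code, whereas the in-process pattern lets it monitor the driver's main thread and report the application state as soon as main() returns or throws.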
[jira] [Updated] (SPARK-19264) Work should start driver, the same to AM of yarn
[ https://issues.apache.org/jira/browse/SPARK-19264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hustfxj updated SPARK-19264: Description: I think work can't start driver by "ProcessBuilderLike", thus we can't know the application's main thread is finished or not if the application's main thread contains some non-daemon threads. Because the program terminates when there no longer is any non-daemon thread running (or someone called System.exit). The main thread can have finished long ago. worker should start driver like AM of YARN . As followed: {code:title=ApplicationMaster.scala|borderStyle=solid} mainMethod.invoke(null, userArgs.toArray) finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) logDebug("Done running users class") {code} Then the work can monitor the driver's main thread, and know the application's state. was: I think work can't start driver by "ProcessBuilderLike", thus we can't know the application's main thread is finished or not if the application's main thread contains some daemon threads. Because the program terminates when there no longer is any non-daemon thread running (or someone called System.exit). The main thread can have finished long ago. worker should start driver like AM of YARN . As followed: {code:title=ApplicationMaster.scala|borderStyle=solid} mainMethod.invoke(null, userArgs.toArray) finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) logDebug("Done running users class") {code} Then the work can monitor the driver's main thread, and know the application's state. > Work should start driver, the same to AM of yarn > --- > > Key: SPARK-19264 > URL: https://issues.apache.org/jira/browse/SPARK-19264 > Project: Spark > Issue Type: Improvement >Reporter: hustfxj > > I think work can't start driver by "ProcessBuilderLike", thus we can't > know the application's main thread is finished or not if the application's > main thread contains some non-daemon threads. 
Because the program terminates > when there no longer is any non-daemon thread running (or someone called > System.exit). The main thread can have finished long ago. > worker should start driver like AM of YARN . As followed: > {code:title=ApplicationMaster.scala|borderStyle=solid} > mainMethod.invoke(null, userArgs.toArray) > finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) > logDebug("Done running users class") > {code} > Then the work can monitor the driver's main thread, and know the > application's state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19227) Typo in `org.apache.spark.internal.config.ConfigEntry`
[ https://issues.apache.org/jira/browse/SPARK-19227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827174#comment-15827174 ] Apache Spark commented on SPARK-19227: -- User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/16591 > Typo in `org.apache.spark.internal.config.ConfigEntry` > --- > > Key: SPARK-19227 > URL: https://issues.apache.org/jira/browse/SPARK-19227 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Biao Ma >Priority: Trivial > Labels: easyfix > > The parameter `defaultValue` does not exist in class > `org.apache.spark.internal.config.ConfigEntry`, but `_defaultValue` does in its sub > class `ConfigEntryWithDefault`. Also there are some unused imports; > should we modify this class? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19267) Fix a race condition when stopping StateStore
[ https://issues.apache.org/jira/browse/SPARK-19267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19267: Assignee: Shixiong Zhu (was: Apache Spark) > Fix a race condition when stopping StateStore > - > > Key: SPARK-19267 > URL: https://issues.apache.org/jira/browse/SPARK-19267 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19267) Fix a race condition when stopping StateStore
[ https://issues.apache.org/jira/browse/SPARK-19267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827135#comment-15827135 ] Apache Spark commented on SPARK-19267: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16627 > Fix a race condition when stopping StateStore > - > > Key: SPARK-19267 > URL: https://issues.apache.org/jira/browse/SPARK-19267 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19267) Fix a race condition when stopping StateStore
[ https://issues.apache.org/jira/browse/SPARK-19267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19267: Assignee: Apache Spark (was: Shixiong Zhu) > Fix a race condition when stopping StateStore > - > > Key: SPARK-19267 > URL: https://issues.apache.org/jira/browse/SPARK-19267 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19267) Fix a race condition when stopping StateStore
[ https://issues.apache.org/jira/browse/SPARK-19267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19267: - Priority: Minor (was: Major) > Fix a race condition when stopping StateStore > - > > Key: SPARK-19267 > URL: https://issues.apache.org/jira/browse/SPARK-19267 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19267) Fix a race condition when stopping StateStore
[ https://issues.apache.org/jira/browse/SPARK-19267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19267: - Affects Version/s: 2.0.0 2.0.1 2.0.2 2.1.0 > Fix a race condition when stopping StateStore > - > > Key: SPARK-19267 > URL: https://issues.apache.org/jira/browse/SPARK-19267 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19267) Fix a race condition when stopping StateStore
Shixiong Zhu created SPARK-19267: Summary: Fix a race condition when stopping StateStore Key: SPARK-19267 URL: https://issues.apache.org/jira/browse/SPARK-19267 Project: Spark Issue Type: Bug Reporter: Shixiong Zhu Assignee: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17747) WeightCol support non-double datatypes
[ https://issues.apache.org/jira/browse/SPARK-17747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-17747: -- Shepherd: Joseph K. Bradley Assignee: zhengruifeng Target Version/s: 2.2.0 > WeightCol support non-double datatypes > -- > > Key: SPARK-17747 > URL: https://issues.apache.org/jira/browse/SPARK-17747 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > WeightCol only supports the double type now; it should also accept other numeric > types, such as Int. > {code} > scala> df3.show(5) > +-----+--------------------+------+ > |label|features|weight| > +-----+--------------------+------+ > | 0.0|(692,[127,128,129...| 1| > | 1.0|(692,[158,159,160...| 1| > | 1.0|(692,[124,125,126...| 1| > | 1.0|(692,[152,153,154...| 1| > | 1.0|(692,[151,152,153...| 1| > +-----+--------------------+------+ > only showing top 5 rows > scala> val lr = new > LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8) > lr: org.apache.spark.ml.classification.LogisticRegression = > logreg_ee0308a72919 > scala> val lrm = lr.fit(df3) > 16/09/20 15:46:12 WARN LogisticRegression: LogisticRegression training > finished but the result is not converged because: max iterations reached > lrm: org.apache.spark.ml.classification.LogisticRegressionModel = > logreg_ee0308a72919 > scala> val lr = new > LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setWeightCol("weight") > lr: org.apache.spark.ml.classification.LogisticRegression = > logreg_ced7579d5680 > scala> val lrm = lr.fit(df3) > 16/09/20 15:46:27 WARN BlockManager: Putting block rdd_211_0 failed > 16/09/20 15:46:27 ERROR Executor: Exception in task 0.0 in stage 89.0 (TID 92) > scala.MatchError: > 
[0.0,1,(692,[127,128,129,130,131,154,155,156,157,158,159,181,182,183,184,185,186,187,188,189,207,208,209,210,211,212,213,214,215,216,217,235,236,237,238,239,240,241,242,243,244,245,262,263,264,265,266,267,268,269,270,271,272,273,289,290,291,292,293,294,295,296,297,300,301,302,316,317,318,319,320,321,328,329,330,343,344,345,346,347,348,349,356,357,358,371,372,373,374,384,385,386,399,400,401,412,413,414,426,427,428,429,440,441,442,454,455,456,457,466,467,468,469,470,482,483,484,493,494,495,496,497,510,511,512,520,521,522,523,538,539,540,547,548,549,550,566,567,568,569,570,571,572,573,574,575,576,577,578,594,595,596,597,598,599,600,601,602,603,604,622,623,624,625,626,627,628,629,630,651,652,653,654,655,656,657],[51.0,159.0,253.0,159.0,50.0,48.0,238.0,252.0,252.0,252.0,237.0,54.0,227.0,253.0,252.0,239.0,233.0,252.0,57.0,6.0,10.0,60.0,224.0,252.0,253.0,252.0,202.0,84.0,252.0,253.0,122.0,163.0,252.0,252.0,252.0,253.0,252.0,252.0,96.0,189.0,253.0,167.0,51.0,238.0,253.0,253.0,190.0,114.0,253.0,228.0,47.0,79.0,255.0,168.0,48.0,238.0,252.0,252.0,179.0,12.0,75.0,121.0,21.0,253.0,243.0,50.0,38.0,165.0,253.0,233.0,208.0,84.0,253.0,252.0,165.0,7.0,178.0,252.0,240.0,71.0,19.0,28.0,253.0,252.0,195.0,57.0,252.0,252.0,63.0,253.0,252.0,195.0,198.0,253.0,190.0,255.0,253.0,196.0,76.0,246.0,252.0,112.0,253.0,252.0,148.0,85.0,252.0,230.0,25.0,7.0,135.0,253.0,186.0,12.0,85.0,252.0,223.0,7.0,131.0,252.0,225.0,71.0,85.0,252.0,145.0,48.0,165.0,252.0,173.0,86.0,253.0,225.0,114.0,238.0,253.0,162.0,85.0,252.0,249.0,146.0,48.0,29.0,85.0,178.0,225.0,253.0,223.0,167.0,56.0,85.0,252.0,252.0,252.0,229.0,215.0,252.0,252.0,252.0,196.0,130.0,28.0,199.0,252.0,252.0,253.0,252.0,252.0,233.0,145.0,25.0,128.0,252.0,253.0,252.0,141.0,37.0])] > (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema) > at > org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266) > at > 
org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at
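The MatchError in the report above happens because the training code pattern-matches a row whose weight field is expected to already be a Double, while the DataFrame column actually holds an Int. A minimal plain-Scala sketch of the failure mode and of the fix direction (widening any numeric weight to double up front, analogous to casting the weight column to DoubleType); the helper names are hypothetical and this is not Spark's actual code:

```scala
// Strict extraction: mirrors the reported failure. Matching a boxed Int
// against Double throws scala.MatchError.
def weightStrict(v: Any): Double = v match {
  case d: Double => d
}

// Lenient extraction: accept any numeric type and widen to Double,
// the behavior the ticket asks weightCol to have.
def weightLenient(v: Any): Double = v match {
  case n: java.lang.Number => n.doubleValue()
}
```

With this change an integer weight of `1` is simply read as `1.0` instead of crashing the task.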
[jira] [Commented] (SPARK-19266) DiskStore does not encrypt serialized RDD data
[ https://issues.apache.org/jira/browse/SPARK-19266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827100#comment-15827100 ] Marcelo Vanzin commented on SPARK-19266: So I started down the path of looking at exactly what's going on here and it's a little more complicated than I first thought (and not as serious as I first thought, either). There are 3 cases I can see: * MEMORY_AND_DISK_SER with enough free memory: data is encrypted in memory. If the block is later evicted to disk, the data on disk is encrypted. * MEMORY_AND_DISK_SER without enough free memory: a chunk of the block may be stored in memory, encrypted, but whatever does not fit will be spilled to disk, unencrypted. This is in {{BlockManager.doPutIterator}} (see call to {{partiallySerializedValues.finishWritingToStream}}). * DISK_SER: data is encrypted on disk (see call to {{diskStore.put}} in {{BlockManager.doPutIterator}} for the deserialized case) So it seems the only gap is in the second case above, and it should be easy to fix. The read part should already be handled, but it would be nice to add unit tests. > DiskStore does not encrypt serialized RDD data > -- > > Key: SPARK-19266 > URL: https://issues.apache.org/jira/browse/SPARK-19266 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin > > {{DiskStore.putBytes()}} writes serialized RDD data directly to disk, without > encrypting (or compressing) it. So any cached blocks that are evicted to disk > when using {{MEMORY_AND_DISK_SER}} will not be encrypted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
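The three cases in the comment above can be summarized as a small decision table. This is just one reading of that comment expressed as plain Scala, not Spark code; the level labels follow the comment's wording:

```scala
// What ends up on disk for serialized RDD blocks, per the analysis:
// encrypted, unencrypted (the gap), or nothing applicable.
sealed trait Outcome
case object Encrypted extends Outcome
case object Unencrypted extends Outcome

def onDiskState(level: String, fitsInMemory: Boolean): Option[Outcome] = level match {
  // Case 1: block cached encrypted in memory; eviction re-writes it encrypted.
  case "MEMORY_AND_DISK_SER" if fitsInMemory => Some(Encrypted)
  // Case 2: the remainder that does not fit is spilled unencrypted -- the gap.
  case "MEMORY_AND_DISK_SER" => Some(Unencrypted)
  // Case 3: direct-to-disk path encrypts via diskStore.put.
  case "DISK_SER" => Some(Encrypted)
  case _ => None
}
```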
[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing
[ https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827089#comment-15827089 ] Yicheng Luo commented on SPARK-12347: - I have a half-baked improvement on my branch (taking some suggestions from the PR and using a YAML file to keep track of the list of required arguments). What I need is to extract the arguments from the examples, but it will require a significant amount of time, since the way the arguments are laid out in the examples is really not homogeneous. > Write script to run all MLlib examples for testing > -- > > Key: SPARK-12347 > URL: https://issues.apache.org/jira/browse/SPARK-12347 > Project: Spark > Issue Type: Test > Components: ML, MLlib, PySpark, SparkR, Tests >Reporter: Joseph K. Bradley >Priority: Critical > > It would facilitate testing to have a script which runs all MLlib examples > for all languages. > Design sketch to ensure all examples are run: > * Generate a list of examples to run programmatically (not from a fixed list). > * Use a list of special examples to handle examples which require command > line arguments. > * Make sure data, etc. used are small to keep the tests quick. > This could be broken into subtasks for each language, though it would be nice > to provide a single script. > Not sure where the script should live; perhaps in {{bin/}}? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
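The design sketch in the issue (generate the example list programmatically, and keep a lookup of the special examples that need command-line arguments) could look roughly like the following plain-Scala sketch. The example names, the data path, and the helper names are hypothetical, not part of any actual Spark script:

```scala
import java.io.File

// Hypothetical: examples that cannot run without arguments, mapped to a
// default small-data invocation so the test run stays quick.
val specialArgs: Map[String, Seq[String]] = Map(
  "DataFrameExample" -> Seq("--input", "data/mllib/sample_libsvm_data.txt")
)

// Everything not in the special list runs with no arguments.
def argsFor(example: String): Seq[String] =
  specialArgs.getOrElse(example, Seq.empty)

// Generate the list of examples from the source tree rather than a fixed list,
// so newly added examples are picked up automatically.
def listExamples(dir: File): Seq[String] =
  Option(dir.listFiles()).getOrElse(Array.empty[File])
    .filter(_.getName.endsWith(".scala"))
    .map(_.getName.stripSuffix(".scala")).toSeq
```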
[jira] [Updated] (SPARK-13721) Add support for LATERAL VIEW OUTER explode()
[ https://issues.apache.org/jira/browse/SPARK-13721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-13721: -- Assignee: Bogdan Raducanu > Add support for LATERAL VIEW OUTER explode() > > > Key: SPARK-13721 > URL: https://issues.apache.org/jira/browse/SPARK-13721 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Ian Hellstrom >Assignee: Bogdan Raducanu > Fix For: 2.2.0 > > > Hive supports the [LATERAL VIEW > OUTER|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView#LanguageManualLateralView-OuterLateralViews] > syntax to make sure that when an array is empty, the content from the outer > table is still returned. > Within Spark, this is currently only possible within the HiveContext and > executing HiveQL statements. It would be nice if the standard explode() > DataFrame method allows the same. A possible signature would be: > {code:scala} > explode[A, B](inputColumn: String, outputColumn: String, outer: Boolean = > false) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15799) Release SparkR on CRAN
[ https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827066#comment-15827066 ] Shivaram Venkataraman commented on SPARK-15799: --- The work on this is mostly complete, but you can put me down as a Shepherd; if required I can create more sub-tasks etc. > Release SparkR on CRAN > -- > > Key: SPARK-15799 > URL: https://issues.apache.org/jira/browse/SPARK-15799 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Xiangrui Meng > > Story: "As an R user, I would like to see SparkR released on CRAN, so I can > use SparkR easily in an existing R environment and have other packages built > on top of SparkR." > I made this JIRA with the following questions in mind: > * Are there known issues that prevent us releasing SparkR on CRAN? > * Do we want to package Spark jars in the SparkR release? > * Are there license issues? > * How does it fit into Spark's release process? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14567) Add instrumentation logs to MLlib training algorithms
[ https://issues.apache.org/jira/browse/SPARK-14567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-14567. --- Resolution: Fixed Fix Version/s: 2.2.0 > Add instrumentation logs to MLlib training algorithms > - > > Key: SPARK-14567 > URL: https://issues.apache.org/jira/browse/SPARK-14567 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Timothy Hunter >Assignee: Timothy Hunter > Fix For: 2.2.0 > > > In order to debug performance issues when training mllib algorithms, > it is useful to log some metrics about the training dataset, the training > parameters, etc. > This ticket is an umbrella to add some simple logging messages to the most > common MLlib estimators. There should be no performance impact on the current > implementation, and the output is simply printed in the logs. > Here are some values that are of interest when debugging training tasks: > * number of features > * number of instances > * number of partitions > * number of classes > * input RDD/DF cache level > * hyper-parameters -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13721) Add support for LATERAL VIEW OUTER explode()
[ https://issues.apache.org/jira/browse/SPARK-13721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-13721. --- Resolution: Fixed Fix Version/s: 2.2.0 > Add support for LATERAL VIEW OUTER explode() > > > Key: SPARK-13721 > URL: https://issues.apache.org/jira/browse/SPARK-13721 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Ian Hellstrom > Fix For: 2.2.0 > > > Hive supports the [LATERAL VIEW > OUTER|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView#LanguageManualLateralView-OuterLateralViews] > syntax to make sure that when an array is empty, the content from the outer > table is still returned. > Within Spark, this is currently only possible within the HiveContext and > executing HiveQL statements. It would be nice if the standard explode() > DataFrame method allows the same. A possible signature would be: > {code:scala} > explode[A, B](inputColumn: String, outputColumn: String, outer: Boolean = > false) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
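The outer semantics requested above — a row whose array is empty still produces one output row, with a null in the exploded column, instead of disappearing as with a plain explode — can be sketched on plain Scala collections. This illustrates the semantics only; it is not Spark's implementation:

```scala
// Outer explode over (key, array) rows: empty arrays yield a single row
// with None (standing in for SQL NULL); non-empty arrays yield one row
// per element, as with a regular explode.
def explodeOuter[A](rows: Seq[(String, Seq[A])]): Seq[(String, Option[A])] =
  rows.flatMap { case (key, xs) =>
    if (xs.isEmpty) Seq((key, None))
    else xs.map(x => (key, Some(x)))
  }
```

A plain (inner) explode would drop the `("b", Seq())` row entirely; the outer variant preserves it as `("b", None)`.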
[jira] [Resolved] (SPARK-18206) Log instrumentation in MPC, NB, LDA, AFT, GLR, Isotonic, LinReg
[ https://issues.apache.org/jira/browse/SPARK-18206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-18206. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 15671 [https://github.com/apache/spark/pull/15671] > Log instrumentation in MPC, NB, LDA, AFT, GLR, Isotonic, LinReg > --- > > Key: SPARK-18206 > URL: https://issues.apache.org/jira/browse/SPARK-18206 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: zhengruifeng >Priority: Minor > Fix For: 2.2.0 > > > See parent JIRA -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7146) Should ML sharedParams be a public API?
[ https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7146: - Target Version/s: (was: 2.2.0) > Should ML sharedParams be a public API? > --- > > Key: SPARK-7146 > URL: https://issues.apache.org/jira/browse/SPARK-7146 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley > > Proposal: Make most of the Param traits in sharedParams.scala public. Mark > them as DeveloperApi. > Pros: > * Sharing the Param traits helps to encourage standardized Param names and > documentation. > Cons: > * Users have to be careful since parameters can have different meanings for > different algorithms. > * If the shared Params are public, then implementations could test for the > traits. It is unclear if we want users to rely on these traits, which are > somewhat experimental. > Currently, the shared params are private. > h3. UPDATED proposal > * Some Params are clearly safe to make public. We will do so. > * Some Params could be made public but may require caveats in the trait doc. > * Some Params have turned out not to be shared in practice. We can move > those Params to the classes which use them. 
> *Public shared params*: > * I/O column params > ** HasFeaturesCol > ** HasInputCol > ** HasInputCols > ** HasLabelCol > ** HasOutputCol > ** HasPredictionCol > ** HasProbabilityCol > ** HasRawPredictionCol > ** HasVarianceCol > ** HasWeightCol > * Algorithm settings > ** HasCheckpointInterval > ** HasElasticNetParam > ** HasFitIntercept > ** HasMaxIter > ** HasRegParam > ** HasSeed > ** HasStandardization (less common) > ** HasStepSize > ** HasTol > *Questionable params*: > * HasHandleInvalid (only used in StringIndexer, but might be more widely used > later on) > * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but > same meaning as Optimizer in LDA) > *Params to be removed from sharedParams*: > * HasThreshold (only used in LogisticRegression) > * HasThresholds (only used in ProbabilisticClassifier) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10931) PySpark ML Models should contain Param values
[ https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10931: -- Shepherd: (was: Joseph K. Bradley) > PySpark ML Models should contain Param values > - > > Key: SPARK-10931 > URL: https://issues.apache.org/jira/browse/SPARK-10931 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley > > PySpark spark.ml Models are generally wrappers around Java objects and do not > even contain Param values. This JIRA is for copying the Param values from > the Estimator to the model. > This can likely be solved by modifying Estimator.fit to copy Param values, > but should also include proper unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7424) spark.ml classification, regression abstractions should add metadata to output column
[ https://issues.apache.org/jira/browse/SPARK-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7424: - Target Version/s: (was: 2.2.0) > spark.ml classification, regression abstractions should add metadata to > output column > - > > Key: SPARK-7424 > URL: https://issues.apache.org/jira/browse/SPARK-7424 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Update ClassificationModel, ProbabilisticClassificationModel prediction to > include numClasses in output column metadata. > Update RegressionModel to specify output column metadata as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14501) spark.ml parity for fpm - frequent items
[ https://issues.apache.org/jira/browse/SPARK-14501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827052#comment-15827052 ] Joseph K. Bradley commented on SPARK-14501: --- I'm doing a general pass to encourage the Shepherd requirement from the roadmap process in [SPARK-18813]. Is someone interested in shepherding this issue, or shall I remove the target version? I'd really like to get this into 2.2 but don't have time to review it right now. Could another committer take it? > spark.ml parity for fpm - frequent items > > > Key: SPARK-14501 > URL: https://issues.apache.org/jira/browse/SPARK-14501 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Joseph K. Bradley > > This is an umbrella for porting the spark.mllib.fpm subpackage to spark.ml. > I am initially creating a single subtask, which will require a brief design > doc for the DataFrame-based API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10931) PySpark ML Models should contain Param values
[ https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10931: -- Target Version/s: (was: 2.2.0) > PySpark ML Models should contain Param values > - > > Key: SPARK-10931 > URL: https://issues.apache.org/jira/browse/SPARK-10931 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley > > PySpark spark.ml Models are generally wrappers around Java objects and do not > even contain Param values. This JIRA is for copying the Param values from > the Estimator to the model. > This can likely be solved by modifying Estimator.fit to copy Param values, > but should also include proper unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15571) Pipeline unit test improvements for 2.2
[ https://issues.apache.org/jira/browse/SPARK-15571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15571: -- Shepherd: Joseph K. Bradley > Pipeline unit test improvements for 2.2 > --- > > Key: SPARK-15571 > URL: https://issues.apache.org/jira/browse/SPARK-15571 > Project: Spark > Issue Type: Test > Components: ML, PySpark >Reporter: Joseph K. Bradley > > Issue: > * There are several pieces of standard functionality shared by all > algorithms: Params, UIDs, fit/transform/save/load, etc. Currently, these > pieces are generally tested in ad hoc tests for each algorithm. > * This has led to inconsistent coverage, especially within the Python API. > Goal: > * Standardize unit tests for Scala and Python to improve and consolidate test > coverage for Params, persistence, and other common functionality. > * Eliminate duplicate code. Improve test coverage. Simplify adding these > standard unit tests for future algorithms and APIs. > This will require several subtasks. If you identify an issue, please create > a subtask, or comment below if the issue needs to be discussed first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14659) OneHotEncoder support drop first category alphabetically in the encoded vector
[ https://issues.apache.org/jira/browse/SPARK-14659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827041#comment-15827041 ] Joseph K. Bradley commented on SPARK-14659: --- I'm doing a general pass to encourage the Shepherd requirement from the roadmap process in [SPARK-18813]. Is someone interested in shepherding this issue, or shall I remove the target version? > OneHotEncoder support drop first category alphabetically in the encoded > vector > --- > > Key: SPARK-14659 > URL: https://issues.apache.org/jira/browse/SPARK-14659 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang > > R formula drops the first category alphabetically when encoding a string/category > feature. Spark RFormula uses OneHotEncoder to encode string/category features > into vectors, but only supports "dropLast" by string/category frequencies. > This will cause SparkR to produce different models compared with native R. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14706) Python ML persistence integration test
[ https://issues.apache.org/jira/browse/SPARK-14706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14706: -- Target Version/s: (was: 2.2.0) > Python ML persistence integration test > -- > > Key: SPARK-14706 > URL: https://issues.apache.org/jira/browse/SPARK-14706 > Project: Spark > Issue Type: Test > Components: ML, PySpark >Reporter: Joseph K. Bradley > > Goal: extend integration test in {{ml/tests.py}}. > In the {{PersistenceTest}} suite, there is a method {{_compare_pipelines}}. > This issue includes: > * Extending {{_compare_pipelines}} to handle CrossValidator, > TrainValidationSplit, and OneVsRest > * Adding an integration test in PersistenceTest which includes nested > meta-algorithms. E.g.: {{Pipeline[ CrossValidator[ TrainValidationSplit[ > OneVsRest[ LogisticRegression ] ] ] ]}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
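Extending {{_compare_pipelines}} to nested meta-algorithms like {{Pipeline[ CrossValidator[ TrainValidationSplit[ OneVsRest[ LogisticRegression ] ] ] ]}} essentially means a recursive structural comparison. A language-agnostic sketch of that idea in plain Scala (the real code lives in PySpark's {{ml/tests.py}}; the types and names here are hypothetical):

```scala
// A toy model of a pipeline: either a leaf estimator with its params,
// or a meta-algorithm wrapping an inner stage.
sealed trait Stage
case class Leaf(name: String, params: Map[String, Any]) extends Stage
case class Wrapper(name: String, inner: Stage) extends Stage

// Walk both structures in lockstep; compare leaf parameters at the bottom.
def samePipeline(a: Stage, b: Stage): Boolean = (a, b) match {
  case (Leaf(n1, p1), Leaf(n2, p2)) => n1 == n2 && p1 == p2
  case (Wrapper(n1, i1), Wrapper(n2, i2)) => n1 == n2 && samePipeline(i1, i2)
  case _ => false
}
```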
[jira] [Commented] (SPARK-15799) Release SparkR on CRAN
[ https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827038#comment-15827038 ] Joseph K. Bradley commented on SPARK-15799: --- I'm doing a general pass to encourage the Shepherd requirement from the roadmap process in [SPARK-18813]. Is someone interested in shepherding this issue, or shall I remove the target version? > Release SparkR on CRAN > -- > > Key: SPARK-15799 > URL: https://issues.apache.org/jira/browse/SPARK-15799 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Xiangrui Meng > > Story: "As an R user, I would like to see SparkR released on CRAN, so I can > use SparkR easily in an existing R environment and have other packages built > on top of SparkR." > I made this JIRA with the following questions in mind: > * Are there known issues that prevent us releasing SparkR on CRAN? > * Do we want to package Spark jars in the SparkR release? > * Are there license issues? > * How does it fit into Spark's release process? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16578) Configurable hostname for RBackend
[ https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827036#comment-15827036 ] Joseph K. Bradley commented on SPARK-16578: --- I'm doing a general pass to encourage the Shepherd requirement from the roadmap process in [SPARK-18813]. Is someone interested in shepherding this issue, or shall I remove the target version? > Configurable hostname for RBackend > -- > > Key: SPARK-16578 > URL: https://issues.apache.org/jira/browse/SPARK-16578 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Junyang Qian > > One of the requirements that comes up with SparkR being a standalone package > is that users can now install just the R package on the client side and > connect to a remote machine which runs the RBackend class. > We should check if we can support this mode of execution and what are the > pros / cons of it -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18011) SparkR serialize "NA" throws exception
[ https://issues.apache.org/jira/browse/SPARK-18011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827032#comment-15827032 ] Miao Wang edited comment on SPARK-18011 at 1/17/17 11:27 PM: - [~felixcheung] I did intensive debug in both R side and scala side. On R side, I debugged `createDataFrame.default` and `parallelize`, which converts the data.frame into RDD and DataFrame. The code of turning the data into RDD is done in `parallelize`: sliceLen <- ceiling(length(coll) / numSlices) slices <- split(coll, rep(1: (numSlices + 1), each = sliceLen)[1:length(coll)]) serializedSlices <- lapply(slices, serialize, connection = NULL) I add debug message after the `serialize`: lapply(serializedSlices, function(`x`) {message(paste("unserialized ", unserialize(x)))}) The data `NA` is unserialized successfully. Then, the serialized data is transferred to Scala side by jrdd <- callJStatic("org.apache.spark.api.r.RRDD", "createRDDFromArray", sc, serializedSlices) and returns a handle of the RDD in `jrdd`, which is later used by `createDataFrame.default`. I did not find anything wrong here. On the Scala side, the problem happens in def readString(in: DataInputStream): String = { val len = in.readInt() <=== it encounters the problem when reading `NA` as a string. readStringBytes(in, len) } Then, I changed the logic as follows: def readString(in: DataInputStream): String = { var len = in.readInt() if (len < 0) { len = 3<= I enforce reading 3 bytes in this case, because I believe that it is the case of `NA` } readStringBytes(in, len) } Then I run the following commands in sparkR: > a <- as.Date(NA) > b <- as.data.frame(a) > c <- collect(select(createDataFrame(b), "*")) > c a 1 NA It executes correctly without hitting the exception handling (I add debug information in the handling logic. If it is hit, error message will be print on the console and I verified that it is print out without the above logic). 
So we can conclude that the problem is caused by the `serialize` function in my local R installation, which serializes `NA` as a string without packing its length before the actual value. Since `unserialize` can decode the serialized data, this protocol should be by R design when handling `NA` as the `Date` type. I could not find the source of `serialize` in the R source code; it calls Internal(serialize(object, connection, type, version, refhook)). For the fix, we can either leave it as it is, relying on the exception handling, or explicitly add handling in readString when the length is negative. What do you think? Thanks!
[jira] [Commented] (SPARK-18011) SparkR serialize "NA" throws exception
[ https://issues.apache.org/jira/browse/SPARK-18011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827032#comment-15827032 ] Miao Wang commented on SPARK-18011: --- [~felixcheung] I did intensive debugging on both the R side and the Scala side. On the R side, I debugged `createDataFrame.default` and `parallelize`, which convert the data.frame into an RDD and then a DataFrame. The conversion of the data into an RDD is done in `parallelize`:
{code}
sliceLen <- ceiling(length(coll) / numSlices)
slices <- split(coll, rep(1: (numSlices + 1), each = sliceLen)[1:length(coll)])
serializedSlices <- lapply(slices, serialize, connection = NULL)
{code}
I added a debug message after the `serialize`:
{code}
lapply(serializedSlices, function(x) {message(paste("unserialized ", unserialize(x)))})
{code}
The data `NA` is unserialized successfully. Then the serialized data is transferred to the Scala side by
{code}
jrdd <- callJStatic("org.apache.spark.api.r.RRDD", "createRDDFromArray", sc, serializedSlices)
{code}
which returns a handle to the RDD in `jrdd`, later used by `createDataFrame.default`. I did not find anything wrong here. On the Scala side, the problem happens in
{code}
def readString(in: DataInputStream): String = {
  val len = in.readInt()  // <=== it encounters the problem when reading `NA` as a string
  readStringBytes(in, len)
}
{code}
Then I changed the logic as follows:
{code}
def readString(in: DataInputStream): String = {
  var len = in.readInt()
  if (len < 0) {
    len = 3  // <=== I enforce reading 3 bytes in this case, because I believe it is the `NA` case
  }
  readStringBytes(in, len)
}
{code}
Then I ran the following commands in sparkR:
{code}
> a <- as.Date(NA)
> b <- as.data.frame(a)
> c <- collect(select(createDataFrame(b), "*"))
> c
   a
1 NA
{code}
It executes correctly without hitting the exception handling. (I added debug information to the handling logic; if it were hit, an error message would be printed on the console, and I verified that it is printed without the above change.) 
So we can conclude that the problem is caused by the `serialize` function in my local R installation, which serializes `NA` as a string without packing its length before the actual value. Since `unserialize` can decode the serialized data, this protocol appears to be by R design when handling `NA` as the `Date` type. I could not find the source of `serialize` in the R source code; it calls Internal(serialize(object, connection, type, version, refhook)). For the fix, we can either leave the existing exception handling as it is, or explicitly handle a negative length in readString. What do you think? Thanks! > SparkR serialize "NA" throws exception > -- > > Key: SPARK-18011 > URL: https://issues.apache.org/jira/browse/SPARK-18011 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Miao Wang > > For some versions of R, if a Date column has an "NA" field, the backend will throw a negative > index exception. > To reproduce the problem: > {code} > > a <- as.Date(c("2016-11-11", "NA")) > > b <- as.data.frame(a) > > c <- createDataFrame(b) > > dim(c) > 16/10/19 10:31:24 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.NegativeArraySizeException > at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110) > at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119) > at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:128) > at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:77) > at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:61) > at > org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:161) > at > org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:160) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.Range.foreach(Range.scala:160) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at 
scala.collection.AbstractTraversable.map(Traversable.scala:104) > at org.apache.spark.sql.api.r.SQLUtils$.bytesToRow(SQLUtils.scala:160) > at > org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138) > at > org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown >
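The protocol issue described in the comment above (a string read that trips over a negative length field when the value is `NA`) can be sketched outside of Spark. The following is an illustrative Python model of a length-prefixed string codec, not SparkR's actual wire format; the `-1`-as-missing convention and the function names are assumptions for the sketch:

```python
import struct
from typing import Optional, Tuple

def write_string(s: Optional[str]) -> bytes:
    """Length-prefixed encoding: 4-byte big-endian length, then UTF-8 bytes.
    A missing value (None, akin to R's NA) is encoded as length -1."""
    if s is None:
        return struct.pack(">i", -1)
    data = s.encode("utf-8")
    return struct.pack(">i", len(data)) + data

def read_string(buf: bytes, offset: int = 0) -> Tuple[Optional[str], int]:
    """Decode one string; returns (value, new_offset). A negative length is
    treated as missing instead of being passed to a buffer allocation, which
    is what raised NegativeArraySizeException on the Scala side."""
    (length,) = struct.unpack_from(">i", buf, offset)
    offset += 4
    if length < 0:
        return None, offset
    value = buf[offset:offset + length].decode("utf-8")
    return value, offset + length
```

The point of the sketch is that the reader must check the length field before allocating, exactly the guard the comment adds to `readString`.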
[jira] [Created] (SPARK-19266) DiskStore does not encrypt serialized RDD data
Marcelo Vanzin created SPARK-19266: -- Summary: DiskStore does not encrypt serialized RDD data Key: SPARK-19266 URL: https://issues.apache.org/jira/browse/SPARK-19266 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Marcelo Vanzin {{DiskStore.putBytes()}} writes serialized RDD data directly to disk, without encrypting (or compressing) it. So any cached blocks that are evicted to disk when using {{MEMORY_AND_DISK_SER}} will not be encrypted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17455) IsotonicRegression takes non-polynomial time for some inputs
[ https://issues.apache.org/jira/browse/SPARK-17455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-17455: -- Shepherd: Joseph K. Bradley > IsotonicRegression takes non-polynomial time for some inputs > > > Key: SPARK-17455 > URL: https://issues.apache.org/jira/browse/SPARK-17455 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2, 2.0.0 >Reporter: Nic Eggert > > The Pool Adjacent Violators Algorithm (PAVA) implementation that's currently > in MLlib can take O(N!) time for certain inputs, when it should have > worst-case complexity of O(N^2). > To reproduce this, I pulled the private method poolAdjacentViolators out of > mllib.regression.IsotonicRegression and into a benchmarking harness. > Given this input > {code} > val x = (1 to length).toArray.map(_.toDouble) > val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 > else yi} > val w = Array.fill(length)(1d) > val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, > x), w) => (y, x, w)} > {code} > I vary the length of the input to get these timings: > || Input Length || Time (us) || > | 100 | 1.35 | > | 200 | 3.14 | > | 400 | 116.10 | > | 800 | 2134225.90 | > (tests were performed using > https://github.com/sirthias/scala-benchmarking-template) > I can also confirm that I run into this issue on a real dataset I'm working > on when trying to calibrate random forest probability output. Some partitions > take > 12 hours to run. This isn't a skew issue, since the largest partitions > finish in minutes. I can only assume that some partitions cause something > approaching this worst-case complexity. > I'm working on a patch that borrows the implementation that is used in > scikit-learn and the R "iso" package, both of which handle this particular > input in linear time and are quadratic in the worst case. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
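For reference, the behavior the proposed patch aims for can be sketched with a standard Pool Adjacent Violators implementation that merges blocks in a backward pass, giving linear time on typical inputs and quadratic worst case. This is a hedged Python sketch of the general technique (as used by scikit-learn and R), not Spark's MLlib code:

```python
def isotonic_regression(y, w=None):
    """Pool Adjacent Violators: returns the nondecreasing fit minimizing
    weighted squared error. Each point is pushed as its own block, then
    adjacent blocks that violate monotonicity are merged into their
    weighted mean; each point can be merged at most O(n) times."""
    if w is None:
        w = [1.0] * len(y)
    # each block holds [mean, total_weight, count]
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # merge while the last two blocks are out of order
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            tw = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / tw, tw, c1 + c2])
    out = []
    for m, _, c in blocks:
        out.extend([m] * c)
    return out
```

PAVA preserves the weighted sum of the input within each pooled block, and the output is nondecreasing by construction.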
[jira] [Commented] (SPARK-18348) Improve tree ensemble model summary
[ https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827023#comment-15827023 ] Joseph K. Bradley commented on SPARK-18348: --- I'm doing a general pass to enforce the Shepherd requirement from the roadmap process in [SPARK-18813]. Note that the process is a new proposal and can be updated as needed. Is someone interested in shepherding this issue, or shall I remove the target version? > Improve tree ensemble model summary > --- > > Key: SPARK-18348 > URL: https://issues.apache.org/jira/browse/SPARK-18348 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Affects Versions: 2.0.0, 2.1.0 >Reporter: Felix Cheung > > During work on the R APIs for tree ensemble models (e.g. Random Forest, GBT) it was > discovered and discussed that > - we don't have a good summary of nodes or trees covering their observations, > loss, probability, and so on > - we don't have a shared API with nicely formatted output > We believe this could be a shared API that benefits multiple language > bindings, including R, when available. 
> For example, here is what R {code}rpart{code} shows for model summary: > {code} > Call: > rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis, > method = "class") > n= 81 > CP nsplit rel errorxerror xstd > 1 0.17647059 0 1.000 1.000 0.2155872 > 2 0.01960784 1 0.8235294 0.9411765 0.2107780 > 3 0.0100 4 0.7647059 1.0588235 0.2200975 > Variable importance > StartAge Number > 64 24 12 > Node number 1: 81 observations,complexity param=0.1764706 > predicted class=absent expected loss=0.2098765 P(node) =1 > class counts:6417 >probabilities: 0.790 0.210 > left son=2 (62 obs) right son=3 (19 obs) > Primary splits: > Start < 8.5 to the right, improve=6.762330, (0 missing) > Number < 5.5 to the left, improve=2.866795, (0 missing) > Age< 39.5 to the left, improve=2.250212, (0 missing) > Surrogate splits: > Number < 6.5 to the left, agree=0.802, adj=0.158, (0 split) > Node number 2: 62 observations,complexity param=0.01960784 > predicted class=absent expected loss=0.09677419 P(node) =0.7654321 > class counts:56 6 >probabilities: 0.903 0.097 > left son=4 (29 obs) right son=5 (33 obs) > Primary splits: > Start < 14.5 to the right, improve=1.0205280, (0 missing) > Age< 55 to the left, improve=0.6848635, (0 missing) > Number < 4.5 to the left, improve=0.2975332, (0 missing) > Surrogate splits: > Number < 3.5 to the left, agree=0.645, adj=0.241, (0 split) > Age< 16 to the left, agree=0.597, adj=0.138, (0 split) > Node number 3: 19 observations > predicted class=present expected loss=0.4210526 P(node) =0.2345679 > class counts: 811 >probabilities: 0.421 0.579 > Node number 4: 29 observations > predicted class=absent expected loss=0 P(node) =0.3580247 > class counts:29 0 >probabilities: 1.000 0.000 > Node number 5: 33 observations,complexity param=0.01960784 > predicted class=absent expected loss=0.1818182 P(node) =0.4074074 > class counts:27 6 >probabilities: 0.818 0.182 > left son=10 (12 obs) right son=11 (21 obs) > Primary splits: > Age< 55 to the left, 
improve=1.2467530, (0 missing) > Start < 12.5 to the right, improve=0.2887701, (0 missing) > Number < 3.5 to the right, improve=0.1753247, (0 missing) > Surrogate splits: > Start < 9.5 to the left, agree=0.758, adj=0.333, (0 split) > Number < 5.5 to the right, agree=0.697, adj=0.167, (0 split) > Node number 10: 12 observations > predicted class=absent expected loss=0 P(node) =0.1481481 > class counts:12 0 >probabilities: 1.000 0.000 > Node number 11: 21 observations,complexity param=0.01960784 > predicted class=absent expected loss=0.2857143 P(node) =0.2592593 > class counts:15 6 >probabilities: 0.714 0.286 > left son=22 (14 obs) right son=23 (7 obs) > Primary splits: > Age< 111 to the right, improve=1.71428600, (0 missing) > Start < 12.5 to the right, improve=0.79365080, (0 missing) > Number < 3.5 to the right, improve=0.07142857, (0 missing) > Node number 22: 14 observations > predicted class=absent expected loss=0.1428571 P(node) =0.1728395 > class counts:12 2 >probabilities: 0.857 0.143 > Node number 23: 7 observations > predicted class=present expected loss=0.4285714 P(node) =0.08641975 > class counts: 3 4 >probabilities: 0.429 0.571 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To
[jira] [Commented] (SPARK-18569) Support R formula arithmetic
[ https://issues.apache.org/jira/browse/SPARK-18569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827021#comment-15827021 ] Joseph K. Bradley commented on SPARK-18569: --- +1 for putting together a design doc for RFormula to help us consider these edge cases and decide on priorities for adding functionality. [~felixcheung] do you want to shepherd this for 2.2, or shall I remove the target? > Support R formula arithmetic > - > > Key: SPARK-18569 > URL: https://issues.apache.org/jira/browse/SPARK-18569 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Felix Cheung > > I think we should support arithmetic which makes it a lot more convenient to > build model. Something like > {code} > log(y) ~ a + log(x) > {code} > And to avoid resolution confusions we should support the I() operator: > {code} > I > I(X∗Z) as is: include a new variable consisting of these variables multiplied > {code} > Such that this works: > {code} > y ~ a + I(b+c) > {code} > the term b+c is to be interpreted as the sum of b and c. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18618) SparkR GLM model predict should support type as a argument
[ https://issues.apache.org/jira/browse/SPARK-18618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827019#comment-15827019 ] Joseph K. Bradley commented on SPARK-18618: --- [~yanboliang] Will you be willing to shepherd this for the 2.2 release? > SparkR GLM model predict should support type as an argument > -- > > Key: SPARK-18618 > URL: https://issues.apache.org/jira/browse/SPARK-18618 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Yanbo Liang > Labels: 2.2.0 > > SparkR GLM model {{predict}} should support {{type}} as an argument. This will > make it consistent with native R predict, such as > https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores
[ https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18917. - Resolution: Fixed Assignee: Reynold Xin Fix Version/s: 2.2.0 > Dataframe - Time Out Issues / Taking long time in append mode on object stores > -- > > Key: SPARK-18917 > URL: https://issues.apache.org/jira/browse/SPARK-18917 > Project: Spark > Issue Type: Improvement > Components: EC2, SQL, YARN >Affects Versions: 2.0.2, 2.1.0 >Reporter: Anbu Cheeralan >Assignee: Reynold Xin >Priority: Minor > Fix For: 2.2.0 > > Original Estimate: 72h > Remaining Estimate: 72h > > When using Dataframe write in append mode on object stores (S3 / Google > Storage), the writes take a long time or hit read timeouts. > This is because dataframe.write lists all leaf folders in the target > directory. If there are a lot of subfolders due to partitions, this takes > forever. > In org.apache.spark.sql.execution.datasources.DataSource.write(), the > following code causes a huge number of RPC calls when the file system is an > object store (S3, GS): > if (mode == SaveMode.Append) { > val existingPartitionColumns = Try { > resolveRelation() > .asInstanceOf[HadoopFsRelation] > .location > .partitionSpec() > .partitionColumns > .fieldNames > .toSeq > }.getOrElse(Seq.empty[String]) > There should be a flag to skip the Partition Match Check in append mode. I can > work on the patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3162: - Target Version/s: (was: 2.2.0) > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing
[ https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826997#comment-15826997 ] Joseph K. Bradley commented on SPARK-12347: --- I really want this to get in but just don't have time to work on it right now. I'm going to remove the target version. However, if another person can shepherd it, please feel free to reinstate that field. > Write script to run all MLlib examples for testing > -- > > Key: SPARK-12347 > URL: https://issues.apache.org/jira/browse/SPARK-12347 > Project: Spark > Issue Type: Test > Components: ML, MLlib, PySpark, SparkR, Tests >Reporter: Joseph K. Bradley >Priority: Critical > > It would facilitate testing to have a script which runs all MLlib examples > for all languages. > Design sketch to ensure all examples are run: > * Generate a list of examples to run programmatically (not from a fixed list). > * Use a list of special examples to handle examples which require command > line arguments. > * Make sure data, etc. used are small to keep the tests quick. > This could be broken into subtasks for each language, though it would be nice > to provide a single script. > Not sure where the script should live; perhaps in {{bin/}}? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12347) Write script to run all MLlib examples for testing
[ https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-12347: -- Target Version/s: (was: 2.2.0) > Write script to run all MLlib examples for testing > -- > > Key: SPARK-12347 > URL: https://issues.apache.org/jira/browse/SPARK-12347 > Project: Spark > Issue Type: Test > Components: ML, MLlib, PySpark, SparkR, Tests >Reporter: Joseph K. Bradley >Priority: Critical > > It would facilitate testing to have a script which runs all MLlib examples > for all languages. > Design sketch to ensure all examples are run: > * Generate a list of examples to run programmatically (not from a fixed list). > * Use a list of special examples to handle examples which require command > line arguments. > * Make sure data, etc. used are small to keep the tests quick. > This could be broken into subtasks for each language, though it would be nice > to provide a single script. > Not sure where the script should live; perhaps in {{bin/}}? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18613) spark.ml LDA classes should not expose spark.mllib in APIs
[ https://issues.apache.org/jira/browse/SPARK-18613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18613: -- Shepherd: Joseph K. Bradley > spark.ml LDA classes should not expose spark.mllib in APIs > -- > > Key: SPARK-18613 > URL: https://issues.apache.org/jira/browse/SPARK-18613 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > spark.ml.LDAModel exposes dependencies on spark.mllib in 2 methods, but it > should not: > * {{def oldLocalModel: OldLocalLDAModel}} > * {{def getModel: OldLDAModel}} > This task is to deprecate those methods. I recommend creating > {{private[ml]}} versions of the methods which are used internally in order to > avoid deprecation warnings. > Setting target for 2.2, but I'm OK with getting it into 2.1 if we have time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826996#comment-15826996 ] Joseph K. Bradley commented on SPARK-18924: --- Per the 2.2 roadmap process, I'm going to add [~mengxr] as the shepherd, but others here are free to take that role instead. > Improve collect/createDataFrame performance in SparkR > - > > Key: SPARK-18924 > URL: https://issues.apache.org/jira/browse/SPARK-18924 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Xiangrui Meng >Priority: Critical > > SparkR has its own SerDe for data serialization between JVM and R. > The SerDe on the JVM side is implemented in: > * > [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala] > * > [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala] > The SerDe on the R side is implemented in: > * > [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R] > * > [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R] > The serialization between JVM and R suffers from huge storage and computation > overhead. For example, a short round trip of 1 million doubles surprisingly > took 3 minutes on my laptop: > {code} > > system.time(collect(createDataFrame(data.frame(x=runif(100) >user system elapsed > 14.224 0.582 189.135 > {code} > Collecting a medium-sized DataFrame to local and continuing with a local R > workflow is a use case we should pay attention to. SparkR will never be able > to cover all existing features from CRAN packages. It is also unnecessary for > Spark to do so because not all features need scalability. > Several factors contribute to the serialization overhead: > 1. The SerDe in R side is implemented using high-level R methods. > 2. DataFrame columns are not efficiently serialized, primitive type columns > in particular. > 3. 
Some overhead in the serialization protocol/impl. > 1) might have been discussed before, because R packages like rJava existed before > SparkR. I'm not sure whether we have a license issue in depending on those > libraries. Another option is to switch to the low-level R C interface or Rcpp, > which again might have license issues. I'm not an expert here. If we have to > implement our own, there is still much room for improvement, discussed > below. > 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, > which collects rows locally and then constructs columns. However, > * it ignores column types and results in boxing/unboxing overhead > * it collects all objects to the driver and results in high GC pressure > A relatively simple change is to implement a specialized column builder based > on column types, primitive types in particular. We need to handle null/NA > values properly. A simple data structure we can use is > {code} > val size: Int > val nullIndexes: Array[Int] > val notNullValues: Array[T] // specialized for primitive types > {code} > On the R side, we can use `readBin` and `writeBin` to read the entire vector > in a single method call. The speed seems reasonable (on the order of GB/s): > {code} > > x <- runif(1000) # 1e7, not 1e6 > > system.time(r <- writeBin(x, raw(0))) >user system elapsed > 0.036 0.021 0.059 > > > system.time(y <- readBin(r, double(), 1000)) >user system elapsed > 0.015 0.007 0.024 > {code} > This is just a proposal that needs to be discussed and formalized. But in > general, it should be feasible to obtain a 20x or more performance gain. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18924: -- Shepherd: Xiangrui Meng > Improve collect/createDataFrame performance in SparkR > - > > Key: SPARK-18924 > URL: https://issues.apache.org/jira/browse/SPARK-18924 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Xiangrui Meng >Priority: Critical > > SparkR has its own SerDe for data serialization between JVM and R. > The SerDe on the JVM side is implemented in: > * > [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala] > * > [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala] > The SerDe on the R side is implemented in: > * > [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R] > * > [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R] > The serialization between JVM and R suffers from huge storage and computation > overhead. For example, a short round trip of 1 million doubles surprisingly > took 3 minutes on my laptop: > {code} > > system.time(collect(createDataFrame(data.frame(x=runif(100) >user system elapsed > 14.224 0.582 189.135 > {code} > Collecting a medium-sized DataFrame to local and continuing with a local R > workflow is a use case we should pay attention to. SparkR will never be able > to cover all existing features from CRAN packages. It is also unnecessary for > Spark to do so because not all features need scalability. > Several factors contribute to the serialization overhead: > 1. The SerDe in R side is implemented using high-level R methods. > 2. DataFrame columns are not efficiently serialized, primitive type columns > in particular. > 3. Some overhead in the serialization protocol/impl. > 1) might be discussed before because R packages like rJava exist before > SparkR. 
I'm not sure whether we have a license issue in depending on those > libraries. Another option is to switch to the low-level R C interface or Rcpp, > which again might have license issues. I'm not an expert here. If we have to > implement our own, there is still much room for improvement, discussed > below. > 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, > which collects rows locally and then constructs columns. However, > * it ignores column types and results in boxing/unboxing overhead > * it collects all objects to the driver and results in high GC pressure > A relatively simple change is to implement a specialized column builder based > on column types, primitive types in particular. We need to handle null/NA > values properly. A simple data structure we can use is > {code} > val size: Int > val nullIndexes: Array[Int] > val notNullValues: Array[T] // specialized for primitive types > {code} > On the R side, we can use `readBin` and `writeBin` to read the entire vector > in a single method call. The speed seems reasonable (on the order of GB/s): > {code} > > x <- runif(1000) # 1e7, not 1e6 > > system.time(r <- writeBin(x, raw(0))) >user system elapsed > 0.036 0.021 0.059 > > > system.time(y <- readBin(r, double(), 1000)) >user system elapsed > 0.015 0.007 0.024 > {code} > This is just a proposal that needs to be discussed and formalized. But in > general, it should be feasible to obtain a 20x or more performance gain. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
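The `size / nullIndexes / notNullValues` layout proposed in the description above can be sketched in a few lines. The following is an illustrative Python model for a double column; the exact byte format is an assumption for the sketch, not SparkR's actual protocol. Packing the non-null primitives contiguously is what would let the R side read the bulk with a single `readBin` call:

```python
import struct

def encode_double_column(values):
    """Encode a column of doubles with possible missing values:
    total size, the null indexes, then the non-null primitives packed
    contiguously so they can be read back in one bulk operation."""
    nulls = [i for i, v in enumerate(values) if v is None]
    present = [v for v in values if v is not None]
    out = struct.pack(">i", len(values))
    out += struct.pack(">i", len(nulls))
    out += struct.pack(f">{len(nulls)}i", *nulls)
    out += struct.pack(f">{len(present)}d", *present)
    return out

def decode_double_column(buf):
    """Reverse of encode_double_column: restore Nones at the recorded
    null indexes and fill the rest from the packed primitive run."""
    (size,) = struct.unpack_from(">i", buf, 0)
    (n_nulls,) = struct.unpack_from(">i", buf, 4)
    nulls = set(struct.unpack_from(f">{n_nulls}i", buf, 8))
    present = iter(struct.unpack_from(f">{size - n_nulls}d", buf, 8 + 4 * n_nulls))
    return [None if i in nulls else next(present) for i in range(size)]
```

Compared with boxing each value individually, the primitive run keeps per-element overhead constant and avoids one object allocation per cell, which is the GC-pressure point the description calls out.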
[jira] [Commented] (SPARK-19261) Support `ALTER TABLE table_name ADD COLUMNS(..)` statement
[ https://issues.apache.org/jira/browse/SPARK-19261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826982#comment-15826982 ] Apache Spark commented on SPARK-19261: -- User 'xwu0226' has created a pull request for this issue: https://github.com/apache/spark/pull/16626 > Support `ALTER TABLE table_name ADD COLUMNS(..)` statement > -- > > Key: SPARK-19261 > URL: https://issues.apache.org/jira/browse/SPARK-19261 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: StanZhai > > We should support `ALTER TABLE table_name ADD COLUMNS(..)` statement, which > already be used in version < 2.x. > This is very useful for those who want to upgrade there Spark version to 2.x. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17874) Additional SSL port on HistoryServer should be configurable
[ https://issues.apache.org/jira/browse/SPARK-17874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17874: Assignee: Apache Spark > Additional SSL port on HistoryServer should be configurable > --- > > Key: SPARK-17874 > URL: https://issues.apache.org/jira/browse/SPARK-17874 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.1 >Reporter: Andrew Ash >Assignee: Apache Spark > > When turning on SSL on the HistoryServer with > {{spark.ssl.historyServer.enabled=true}} this opens up a second port, at the > [hardcoded|https://github.com/apache/spark/blob/v2.0.1/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L262] > result of calculating {{spark.history.ui.port + 400}}, and sets up a > redirect from the original (http) port to the new (https) port. > {noformat} > $ netstat -nlp | grep 23714 > (Not all processes could be identified, non-owned process info > will not be shown, you would have to be root to see it all.) > tcp0 0 :::18080:::* > LISTEN 23714/java > tcp0 0 :::18480:::* > LISTEN 23714/java > {noformat} > By enabling {{spark.ssl.historyServer.enabled}} I would have expected the one > open port to change protocol from http to https, not to have 1) additional > ports open 2) the http port remain open 3) the additional port at a value I > didn't specify. > To fix this could take one of two approaches: > Approach 1: > - one port always, which is configured with {{spark.history.ui.port}} > - the protocol on that port is http by default > - or if {{spark.ssl.historyServer.enabled=true}} then it's https > Approach 2: > - add a new configuration item {{spark.history.ui.sslPort}} which configures > the second port that starts up > In approach 1 we probably need a way to specify to Spark jobs whether the > history server has ssl or not, based on SPARK-16988 > That makes me think we should go with approach 2. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17874) Additional SSL port on HistoryServer should be configurable
[ https://issues.apache.org/jira/browse/SPARK-17874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17874: Assignee: (was: Apache Spark) > Additional SSL port on HistoryServer should be configurable > --- > > Key: SPARK-17874 > URL: https://issues.apache.org/jira/browse/SPARK-17874 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.1 >Reporter: Andrew Ash 