[jira] [Updated] (SPARK-18243) Converge the insert path of Hive tables with data source tables

2017-01-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18243:

Assignee: Wenchen Fan

> Converge the insert path of Hive tables with data source tables
> ---
>
> Key: SPARK-18243
> URL: https://issues.apache.org/jira/browse/SPARK-18243
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>
> Inserting data into Hive tables has its own implementation that is distinct
> from data sources: InsertIntoHiveTable, SparkHiveWriterContainer and
> SparkHiveDynamicPartitionWriterContainer.
> I think it should be possible to unify these with the data source implementation
> InsertIntoHadoopFsRelationCommand. We can start by implementing an
> OutputWriterFactory/OutputWriter that uses Hive's serdes to write data.
> Note that one other major difference is that data source tables write
> directly to the final destination without using a staging directory, and
> then Spark itself adds the partitions/tables to the catalog. Hive tables
> actually write to a staging directory, and then call the Hive metastore's
> loadPartition/loadTable functions to load the data in.
>  
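For illustration, a minimal sketch of the SerDe-backed writer idea described above. The traits below are simplified, hypothetical stand-ins, not Spark's actual OutputWriter/OutputWriterFactory API or Hive's SerDe interface:

{code}
// Sketch only: simplified placeholder interfaces, not the real Spark or Hive APIs.
trait Row { def values: Seq[Any] }
trait HiveSerde { def serialize(values: Seq[Any]): Array[Byte] }

trait OutputWriter {
  def write(row: Row): Unit
  def close(): Unit
}

trait OutputWriterFactory {
  def newInstance(path: String): OutputWriter
}

// An OutputWriter that delegates row serialization to a Hive SerDe and appends the
// serialized bytes to the destination file, so the data source insert path could
// drive Hive-format writes.
class HiveSerdeOutputWriter(path: String, serde: HiveSerde) extends OutputWriter {
  private val out = new java.io.BufferedOutputStream(new java.io.FileOutputStream(path))
  override def write(row: Row): Unit = out.write(serde.serialize(row.values))
  override def close(): Unit = out.close()
}

class HiveSerdeOutputWriterFactory(serde: HiveSerde) extends OutputWriterFactory {
  override def newInstance(path: String): OutputWriter =
    new HiveSerdeOutputWriter(path, serde)
}
{code}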






[jira] [Resolved] (SPARK-18243) Converge the insert path of Hive tables with data source tables

2017-01-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-18243.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16517
[https://github.com/apache/spark/pull/16517]

> Converge the insert path of Hive tables with data source tables
> ---
>
> Key: SPARK-18243
> URL: https://issues.apache.org/jira/browse/SPARK-18243
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
> Fix For: 2.2.0
>
>
> Inserting data into Hive tables has its own implementation that is distinct
> from data sources: InsertIntoHiveTable, SparkHiveWriterContainer and
> SparkHiveDynamicPartitionWriterContainer.
> I think it should be possible to unify these with the data source implementation
> InsertIntoHadoopFsRelationCommand. We can start by implementing an
> OutputWriterFactory/OutputWriter that uses Hive's serdes to write data.
> Note that one other major difference is that data source tables write
> directly to the final destination without using a staging directory, and
> then Spark itself adds the partitions/tables to the catalog. Hive tables
> actually write to a staging directory, and then call the Hive metastore's
> loadPartition/loadTable functions to load the data in.
>  






[jira] [Commented] (SPARK-19115) SparkSQL unsupports the command " create external table if not exist new_tbl like old_tbl location '/warehouse/new_tbl' "

2017-01-17 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827547#comment-15827547
 ] 

Xiao Li commented on SPARK-19115:
-

Sorry for the late reply. This sounds reasonable. Does Hive support such a 
query?

> SparkSQL unsupports the command " create external table if not exist  new_tbl 
> like old_tbl location '/warehouse/new_tbl' "
> --
>
> Key: SPARK-19115
> URL: https://issues.apache.org/jira/browse/SPARK-19115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: spark2.0.1 hive1.2.1
>Reporter: Xiaochen Ouyang
>
> Spark 2.0.1 does not support the command "create external table if not exists
> new_tbl like old_tbl location '/warehouse/new_tbl'".
> We tried to modify the SqlBase.g4 file, changing
> "| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
> LIKE source=tableIdentifier
> #createTableLike"
> to
> "| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier
> LIKE source=tableIdentifier locationSpec?
> #createTableLike",
> then modified the method 'visitCreateTableLike' in 'SparkSqlParser.scala'
> and updated the case class CreateTableLikeCommand in 'tables.scala'.
> Finally we compiled Spark, replaced the jars 'spark-catalyst-2.0.1.jar' and
> 'spark-sql_2.11-2.0.1.jar', and ran the command "create external table if not
> exists new_tbl like old_tbl location '/warehouse/new_tbl'" successfully.
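For illustration, a rough sketch of the kind of change described above, using simplified, hypothetical types (this is not Spark's actual SparkSqlParser or command API):

{code}
// Sketch only: hypothetical, simplified types standing in for Spark's parser output.
case class TableIdentifier(table: String, database: Option[String] = None)

case class CreateTableLikeCommand(
    target: TableIdentifier,
    source: TableIdentifier,
    location: Option[String],
    isExternal: Boolean,
    ifNotExists: Boolean)

object CreateTableLikeSketch {
  // The reporter's fix in spirit: carry the EXTERNAL keyword and the optional LOCATION
  // through to the command, so the catalog entry can be created as EXTERNAL_TABLE with
  // the given path instead of always MANAGED_TABLE.
  def visitCreateTableLike(
      target: TableIdentifier,
      source: TableIdentifier,
      hasExternalKeyword: Boolean,
      locationSpec: Option[String],
      ifNotExists: Boolean): CreateTableLikeCommand =
    CreateTableLikeCommand(target, source, locationSpec, hasExternalKeyword, ifNotExists)
}
{code}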






[jira] [Reopened] (SPARK-19115) SparkSQL unsupports the command " create external table if not exist new_tbl like old_tbl location '/warehouse/new_tbl' "

2017-01-17 Thread Xiaochen Ouyang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaochen Ouyang reopened SPARK-19115:
-

Spark 2.x does not support the SQL command: create external table if not exists
gen_tbl like src_tbl location '/warehouse/gen_tbl'

> SparkSQL unsupports the command " create external table if not exist  new_tbl 
> like old_tbl location '/warehouse/new_tbl' "
> --
>
> Key: SPARK-19115
> URL: https://issues.apache.org/jira/browse/SPARK-19115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: spark2.0.1 hive1.2.1
>Reporter: Xiaochen Ouyang
>
> Spark 2.0.1 does not support the command "create external table if not exists
> new_tbl like old_tbl location '/warehouse/new_tbl'".
> We tried to modify the SqlBase.g4 file, changing
> "| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
> LIKE source=tableIdentifier
> #createTableLike"
> to
> "| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier
> LIKE source=tableIdentifier locationSpec?
> #createTableLike",
> then modified the method 'visitCreateTableLike' in 'SparkSqlParser.scala'
> and updated the case class CreateTableLikeCommand in 'tables.scala'.
> Finally we compiled Spark, replaced the jars 'spark-catalyst-2.0.1.jar' and
> 'spark-sql_2.11-2.0.1.jar', and ran the command "create external table if not
> exists new_tbl like old_tbl location '/warehouse/new_tbl'" successfully.






[jira] [Updated] (SPARK-19115) SparkSQL unsupports the command " create external table if not exist new_tbl like old_tbl location '/warehouse/new_tbl' "

2017-01-17 Thread Xiaochen Ouyang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaochen Ouyang updated SPARK-19115:

Description: 
Spark 2.0.1 does not support the command "create external table if not exists
new_tbl like old_tbl location '/warehouse/new_tbl'".
We tried to modify the SqlBase.g4 file, changing
"| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
LIKE source=tableIdentifier
#createTableLike"
to
"| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier
LIKE source=tableIdentifier locationSpec?
#createTableLike",
then modified the method 'visitCreateTableLike' in 'SparkSqlParser.scala'
and updated the case class CreateTableLikeCommand in 'tables.scala'.
Finally we compiled Spark, replaced the jars 'spark-catalyst-2.0.1.jar' and
'spark-sql_2.11-2.0.1.jar', and ran the command "create external table if not
exists new_tbl like old_tbl location '/warehouse/new_tbl'" successfully.

  was:
Spark 2.0.1 does not support the command "create external table if not exists
new_tbl like old_tbl location '/warehouse/new_tbl'".
We tried to modify the SqlBase.g4 file, changing
"| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
LIKE source=tableIdentifier
#createTableLike"
to
"| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier
LIKE source=tableIdentifier locationSpec?
#createTableLike".
We then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After
that we could run the command "create external table if not exists new_tbl like
old_tbl" successfully, but unfortunately the generated table's type in the
metastore database is MANAGED_TABLE rather than EXTERNAL_TABLE.


> SparkSQL unsupports the command " create external table if not exist  new_tbl 
> like old_tbl location '/warehouse/new_tbl' "
> --
>
> Key: SPARK-19115
> URL: https://issues.apache.org/jira/browse/SPARK-19115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: spark2.0.1 hive1.2.1
>Reporter: Xiaochen Ouyang
>
> Spark 2.0.1 does not support the command "create external table if not exists
> new_tbl like old_tbl location '/warehouse/new_tbl'".
> We tried to modify the SqlBase.g4 file, changing
> "| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
> LIKE source=tableIdentifier
> #createTableLike"
> to
> "| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier
> LIKE source=tableIdentifier locationSpec?
> #createTableLike",
> then modified the method 'visitCreateTableLike' in 'SparkSqlParser.scala'
> and updated the case class CreateTableLikeCommand in 'tables.scala'.
> Finally we compiled Spark, replaced the jars 'spark-catalyst-2.0.1.jar' and
> 'spark-sql_2.11-2.0.1.jar', and ran the command "create external table if not
> exists new_tbl like old_tbl location '/warehouse/new_tbl'" successfully.






[jira] [Updated] (SPARK-19115) SparkSQL unsupports the command " create external table if not exist new_tbl like old_tbl location '/warehouse/new_tbl' "

2017-01-17 Thread Xiaochen Ouyang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaochen Ouyang updated SPARK-19115:

Description: 
Spark 2.0.1 does not support the command "create external table if not exists
new_tbl like old_tbl location '/warehouse/new_tbl'".
We tried to modify the SqlBase.g4 file, changing
"| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
LIKE source=tableIdentifier
#createTableLike"
to
"| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier
LIKE source=tableIdentifier locationSpec?
#createTableLike".
We then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After
that we could run the command "create external table if not exists new_tbl like
old_tbl" successfully, but unfortunately the generated table's type in the
metastore database is MANAGED_TABLE rather than EXTERNAL_TABLE.

  was:
Spark 2.0.1 does not support the command "create external table if not exists
new_tbl like old_tbl location '/warehouse/new_tbl'".
We tried to modify the SqlBase.g4 file, changing
"| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
LIKE source=tableIdentifier
#createTableLike"
to
"| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier
LIKE source=tableIdentifier
#createTableLike".
We then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After
that we could run the command "create external table if not exists new_tbl like
old_tbl" successfully, but unfortunately the generated table's type in the
metastore database is MANAGED_TABLE rather than EXTERNAL_TABLE.


> SparkSQL unsupports the command " create external table if not exist  new_tbl 
> like old_tbl location '/warehouse/new_tbl' "
> --
>
> Key: SPARK-19115
> URL: https://issues.apache.org/jira/browse/SPARK-19115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: spark2.0.1 hive1.2.1
>Reporter: Xiaochen Ouyang
>
> Spark 2.0.1 does not support the command "create external table if not exists
> new_tbl like old_tbl location '/warehouse/new_tbl'".
> We tried to modify the SqlBase.g4 file, changing
> "| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
> LIKE source=tableIdentifier
> #createTableLike"
> to
> "| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier
> LIKE source=tableIdentifier locationSpec?
> #createTableLike".
> We then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After
> that we could run the command "create external table if not exists new_tbl
> like old_tbl" successfully, but unfortunately the generated table's type in
> the metastore database is MANAGED_TABLE rather than EXTERNAL_TABLE.






[jira] [Updated] (SPARK-19115) SparkSQL unsupports the command " create external table if not exist new_tbl like old_tbl location '/warehouse/new_tbl' "

2017-01-17 Thread Xiaochen Ouyang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaochen Ouyang updated SPARK-19115:

Description: 
Spark 2.0.1 does not support the command "create external table if not exists
new_tbl like old_tbl".
We tried to modify the SqlBase.g4 file, changing
"| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
LIKE source=tableIdentifier
#createTableLike"
to
"| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier
LIKE source=tableIdentifier
#createTableLike".
We then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After
that we could run the command "create external table if not exists new_tbl like
old_tbl" successfully, but unfortunately the generated table's type in the
metastore database is MANAGED_TABLE rather than EXTERNAL_TABLE.

  was:
Spark 2.0.1 does not support the command "create external table if not exists
new_tbl like old_tbl".
We tried to modify the SqlBase.g4 file, changing
"| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
LIKE source=tableIdentifier
#createTableLike"
to
"| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier
LIKE source=tableIdentifier
#createTableLike".
We then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After
that we could run the command "create external table if not exists new_tbl like
old_tbl" successfully, but unfortunately the generated table's type in the
metastore database is MANAGED_TABLE rather than EXTERNAL_TABLE.


> SparkSQL unsupports the command " create external table if not exist  new_tbl 
> like old_tbl location '/warehouse/new_tbl' "
> --
>
> Key: SPARK-19115
> URL: https://issues.apache.org/jira/browse/SPARK-19115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: spark2.0.1 hive1.2.1
>Reporter: Xiaochen Ouyang
>
> Spark 2.0.1 does not support the command "create external table if not exists
> new_tbl like old_tbl".
> We tried to modify the SqlBase.g4 file, changing
> "| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
> LIKE source=tableIdentifier
> #createTableLike"
> to
> "| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier
> LIKE source=tableIdentifier
> #createTableLike".
> We then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After
> that we could run the command "create external table if not exists new_tbl
> like old_tbl" successfully, but unfortunately the generated table's type in
> the metastore database is MANAGED_TABLE rather than EXTERNAL_TABLE.






[jira] [Updated] (SPARK-19115) SparkSQL unsupports the command " create external table if not exist new_tbl like old_tbl location '/warehouse/new_tbl' "

2017-01-17 Thread Xiaochen Ouyang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaochen Ouyang updated SPARK-19115:

Description: 
Spark 2.0.1 does not support the command "create external table if not exists
new_tbl like old_tbl location '/warehouse/new_tbl'".
We tried to modify the SqlBase.g4 file, changing
"| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
LIKE source=tableIdentifier
#createTableLike"
to
"| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier
LIKE source=tableIdentifier
#createTableLike".
We then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After
that we could run the command "create external table if not exists new_tbl like
old_tbl" successfully, but unfortunately the generated table's type in the
metastore database is MANAGED_TABLE rather than EXTERNAL_TABLE.

  was:
Spark 2.0.1 does not support the command "create external table if not exists
new_tbl like old_tbl".
We tried to modify the SqlBase.g4 file, changing
"| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
LIKE source=tableIdentifier
#createTableLike"
to
"| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier
LIKE source=tableIdentifier
#createTableLike".
We then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After
that we could run the command "create external table if not exists new_tbl like
old_tbl" successfully, but unfortunately the generated table's type in the
metastore database is MANAGED_TABLE rather than EXTERNAL_TABLE.


> SparkSQL unsupports the command " create external table if not exist  new_tbl 
> like old_tbl location '/warehouse/new_tbl' "
> --
>
> Key: SPARK-19115
> URL: https://issues.apache.org/jira/browse/SPARK-19115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: spark2.0.1 hive1.2.1
>Reporter: Xiaochen Ouyang
>
> Spark 2.0.1 does not support the command "create external table if not exists
> new_tbl like old_tbl location '/warehouse/new_tbl'".
> We tried to modify the SqlBase.g4 file, changing
> "| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
> LIKE source=tableIdentifier
> #createTableLike"
> to
> "| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier
> LIKE source=tableIdentifier
> #createTableLike".
> We then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After
> that we could run the command "create external table if not exists new_tbl
> like old_tbl" successfully, but unfortunately the generated table's type in
> the metastore database is MANAGED_TABLE rather than EXTERNAL_TABLE.






[jira] [Updated] (SPARK-19115) SparkSQL unsupports the command " create external table if not exist new_tbl like old_tbl location '/warehouse/new_tbl' "

2017-01-17 Thread Xiaochen Ouyang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaochen Ouyang updated SPARK-19115:

Summary: SparkSQL unsupports the command " create external table if not 
exist  new_tbl like old_tbl location '/warehouse/new_tbl' "  (was: SparkSQL 
unsupports the command " create external table if not exist  new_tbl like 
old_tbl")

> SparkSQL unsupports the command " create external table if not exist  new_tbl 
> like old_tbl location '/warehouse/new_tbl' "
> --
>
> Key: SPARK-19115
> URL: https://issues.apache.org/jira/browse/SPARK-19115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: spark2.0.1 hive1.2.1
>Reporter: Xiaochen Ouyang
>
> Spark 2.0.1 does not support the command "create external table if not exists
> new_tbl like old_tbl".
> We tried to modify the SqlBase.g4 file, changing
> "| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
> LIKE source=tableIdentifier
> #createTableLike"
> to
> "| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier
> LIKE source=tableIdentifier
> #createTableLike".
> We then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After
> that we could run the command "create external table if not exists new_tbl
> like old_tbl" successfully, but unfortunately the generated table's type in
> the metastore database is MANAGED_TABLE rather than EXTERNAL_TABLE.






[jira] [Assigned] (SPARK-19270) Add summary table to GLM summary

2017-01-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19270:


Assignee: (was: Apache Spark)

> Add summary table to GLM summary
> 
>
> Key: SPARK-19270
> URL: https://issues.apache.org/jira/browse/SPARK-19270
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Wayne Zhang
>Priority: Minor
>
> Add R-like summary table to GLM summary, which includes feature name (if
> present), parameter estimate, standard error, t-stat and p-value. This allows
> Scala users to easily gather these commonly used inference results.
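For illustration, a minimal sketch of the kind of R-style coefficient table being requested. The helper, its inputs, and the toy numbers below are hypothetical, not the proposed API:

{code}
// Sketch only: assemble an R-style coefficient table from per-feature estimates and
// the usual inference statistics (std. error, t-stat, p-value).
object GlmSummaryTableSketch {
  case class CoefficientRow(
      feature: String, estimate: Double, stdError: Double, tValue: Double, pValue: Double)

  def summaryTable(
      features: Seq[String],
      estimates: Seq[Double],
      stdErrors: Seq[Double],
      tValues: Seq[Double],
      pValues: Seq[Double]): Seq[CoefficientRow] =
    features.indices.map { i =>
      CoefficientRow(features(i), estimates(i), stdErrors(i), tValues(i), pValues(i))
    }

  def main(args: Array[String]): Unit = {
    // Toy numbers purely for illustration.
    val rows = summaryTable(
      features = Seq("(Intercept)", "x1"),
      estimates = Seq(0.5, 1.2),
      stdErrors = Seq(0.1, 0.3),
      tValues = Seq(5.0, 4.0),
      pValues = Seq(0.001, 0.002))
    println("Feature        Estimate  Std.Error    t value   Pr(>|t|)")
    rows.foreach { r =>
      println(f"${r.feature}%-12s ${r.estimate}%10.4f ${r.stdError}%10.4f ${r.tValue}%10.4f ${r.pValue}%10.4f")
    }
  }
}
{code}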






[jira] [Assigned] (SPARK-19270) Add summary table to GLM summary

2017-01-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19270:


Assignee: Apache Spark

> Add summary table to GLM summary
> 
>
> Key: SPARK-19270
> URL: https://issues.apache.org/jira/browse/SPARK-19270
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Wayne Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> Add R-like summary table to GLM summary, which includes feature name (if
> present), parameter estimate, standard error, t-stat and p-value. This allows
> Scala users to easily gather these commonly used inference results.






[jira] [Commented] (SPARK-19270) Add summary table to GLM summary

2017-01-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827521#comment-15827521
 ] 

Apache Spark commented on SPARK-19270:
--

User 'actuaryzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/16630

> Add summary table to GLM summary
> 
>
> Key: SPARK-19270
> URL: https://issues.apache.org/jira/browse/SPARK-19270
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Wayne Zhang
>Priority: Minor
>
> Add R-like summary table to GLM summary, which includes feature name (if
> present), parameter estimate, standard error, t-stat and p-value. This allows
> Scala users to easily gather these commonly used inference results.






[jira] [Commented] (SPARK-2868) Support named accumulators in Python

2017-01-17 Thread Kyle Heath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827519#comment-15827519
 ] 

Kyle Heath commented on SPARK-2868:
---

Short version: Is there anything I can do to help bring this feature to 
pyspark?  
Long version: I've been implementing large jobs in pyspark for about 6 months.  
The ability to monitor named accumulators in the web-ui seems really important. 
 Running complex jobs at scale has been a bit like flying blind.  I'm new to 
this community, but want to help if I can.



> Support named accumulators in Python
> 
>
> Key: SPARK-2868
> URL: https://issues.apache.org/jira/browse/SPARK-2868
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Patrick Wendell
>
> SPARK-2380 added this for Java/Scala. To allow this in Python we'll need to 
> make some additional changes. One potential path is to have a 1:1 
> correspondence with Scala accumulators (instead of a one-to-many). A 
> challenge is exposing the stringified values of the accumulators to the Scala 
> code.
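For context, this is roughly what named accumulators look like on the Scala side today (a spark-shell sketch assuming the usual `sc`; the accumulator name and the toy job are illustrative), which is the behaviour this issue asks to expose in PySpark:

{code}
// Sketch only: a named accumulator shows up under its name on the stage page of the web UI.
val processed = sc.longAccumulator("records processed")

sc.parallelize(1 to 1000)
  .map { x => processed.add(1); x * 2 }
  .count()

println(processed.value)  // 1000
{code}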






[jira] [Commented] (SPARK-8480) Add setName for Dataframe

2017-01-17 Thread Kaushal Prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827515#comment-15827515
 ] 

Kaushal Prajapati commented on SPARK-8480:
--

For example, using SparkContext we can get all cached RDDs:

{code:title=Code|borderStyle=solid}
scala> val rdd = sc.range(1,1000)
scala> rdd.setName("myRdd")
scala> rdd.cache
scala> rdd.count

scala> sc.getPersistentRDDs.foreach(println)
(9,kaushal MapPartitionsRDD[9] at rdd at <console>:27)
(11,myRdd MapPartitionsRDD[11] at range at <console>:27)

scala> sc.getPersistentRDDs.filter(_._2.name == "myRdd").foreach(_._2.unpersist())

scala> sc.getPersistentRDDs.foreach(println)
(9,kaushal MapPartitionsRDD[9] at rdd at <console>:27)
{code}

So we can unpersist any RDD with a valid name.
Likewise for Datasets: if we were able to list all cached Datasets with their
corresponding names, it would be a good option to unpersist any Dataset by its
name.

> Add setName for Dataframe
> -
>
> Key: SPARK-8480
> URL: https://issues.apache.org/jira/browse/SPARK-8480
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Peter Rudenko
>Priority: Minor
>
> RDD has a setName method, so in the Spark UI it's easier to understand what a
> cached entry is for (e.g. "data for LogisticRegression model", etc.).
> It would be nice to have the same method for DataFrame, since the cache page
> displays the logical schema, which can be quite big.






[jira] [Closed] (SPARK-19081) spark sql use HIVE UDF throw exception when return a Map value

2017-01-17 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro closed SPARK-19081.

   Resolution: Resolved
Fix Version/s: 1.5.0

> spark sql use HIVE UDF throw exception when return a Map value
> --
>
> Key: SPARK-19081
> URL: https://issues.apache.org/jira/browse/SPARK-19081
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Davy Song
> Fix For: 1.5.0
>
>
> I have met a problem similar to https://issues.apache.org/jira/browse/SPARK-3582,
> but not with a Map parameter; my evaluate function returns a Map:
> public Map evaluate(Text url) {...}
> When running spark-sql with this UDF, I get the following exception:
> scala.MatchError: interface java.util.Map (of class java.lang.Class)
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:175)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:112)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:144)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:144)
> at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:133)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project.output(basicOperators.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.InsertIntoTable.resolved$lzycompute(basicOperators.scala:149)






[jira] [Created] (SPARK-19270) Add summary table to GLM summary

2017-01-17 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-19270:
---

 Summary: Add summary table to GLM summary
 Key: SPARK-19270
 URL: https://issues.apache.org/jira/browse/SPARK-19270
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Wayne Zhang
Priority: Minor


Add R-like summary table to GLM summary, which includes feature name (if
present), parameter estimate, standard error, t-stat and p-value. This allows
Scala users to easily gather these commonly used inference results.








[jira] [Comment Edited] (SPARK-19081) spark sql use HIVE UDF throw exception when return a Map value

2017-01-17 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827510#comment-15827510
 ] 

Takeshi Yamamuro edited comment on SPARK-19081 at 1/18/17 6:54 AM:
---

Since the issue the reporter describes has been handled in v1.5, I'll close this
as resolved.


was (Author: maropu):
Since the issue the reporter describes has been handled in v1.6, I'll close this
as resolved.

> spark sql use HIVE UDF throw exception when return a Map value
> --
>
> Key: SPARK-19081
> URL: https://issues.apache.org/jira/browse/SPARK-19081
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Davy Song
>
> I have met a problem similar to https://issues.apache.org/jira/browse/SPARK-3582,
> but not with a Map parameter; my evaluate function returns a Map:
> public Map evaluate(Text url) {...}
> When running spark-sql with this UDF, I get the following exception:
> scala.MatchError: interface java.util.Map (of class java.lang.Class)
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:175)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:112)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:144)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:144)
> at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:133)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project.output(basicOperators.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.InsertIntoTable.resolved$lzycompute(basicOperators.scala:149)






[jira] [Commented] (SPARK-19081) spark sql use HIVE UDF throw exception when return a Map value

2017-01-17 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827510#comment-15827510
 ] 

Takeshi Yamamuro commented on SPARK-19081:
--

Since the issue the reporter describes has been handled in v1.6, I'll close this
as resolved.

> spark sql use HIVE UDF throw exception when return a Map value
> --
>
> Key: SPARK-19081
> URL: https://issues.apache.org/jira/browse/SPARK-19081
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Davy Song
>
> I have met a problem similar to https://issues.apache.org/jira/browse/SPARK-3582,
> but not with a Map parameter; my evaluate function returns a Map:
> public Map evaluate(Text url) {...}
> When running spark-sql with this UDF, I get the following exception:
> scala.MatchError: interface java.util.Map (of class java.lang.Class)
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:175)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:112)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:144)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:144)
> at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:133)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project.output(basicOperators.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.InsertIntoTable.resolved$lzycompute(basicOperators.scala:149)






[jira] [Commented] (SPARK-19081) spark sql use HIVE UDF throw exception when return a Map value

2017-01-17 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827506#comment-15827506
 ] 

Takeshi Yamamuro commented on SPARK-19081:
--

congrats!

> spark sql use HIVE UDF throw exception when return a Map value
> --
>
> Key: SPARK-19081
> URL: https://issues.apache.org/jira/browse/SPARK-19081
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Davy Song
>
> I have met a problem similar to https://issues.apache.org/jira/browse/SPARK-3582,
> but not with a Map parameter; my evaluate function returns a Map:
> public Map evaluate(Text url) {...}
> When running spark-sql with this UDF, I get the following exception:
> scala.MatchError: interface java.util.Map (of class java.lang.Class)
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:175)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:112)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:144)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:144)
> at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:133)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project.output(basicOperators.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.InsertIntoTable.resolved$lzycompute(basicOperators.scala:149)






[jira] [Commented] (SPARK-19081) spark sql use HIVE UDF throw exception when return a Map value

2017-01-17 Thread Davy Song (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827503#comment-15827503
 ] 

Davy Song commented on SPARK-19081:
---

@Takeshi, thanks very much. I have rewritten the UDF to extend GenericUDF, which
solves this problem, and now the script runs well.
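For illustration, a minimal sketch of a map-returning GenericUDF of the kind described in this comment (the class name and toy payload are hypothetical, not the reporter's actual code; written in Scala against the Hive UDF API and assuming the usual Hive dependencies on the classpath):

{code}
// Sketch only: declaring the return type as map<string,string> through an ObjectInspector
// lets Spark/Hive derive the schema instead of reflecting on a raw java.util.Map, which is
// what triggers the javaClassToDataType MatchError on the simple-UDF path.
import java.util.{HashMap => JHashMap, Map => JMap}

import org.apache.hadoop.hive.ql.exec.UDFArgumentException
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

class UrlToMapUDF extends GenericUDF {
  override def initialize(arguments: Array[ObjectInspector]): ObjectInspector = {
    if (arguments.length != 1) throw new UDFArgumentException("url_to_map expects one argument")
    ObjectInspectorFactory.getStandardMapObjectInspector(
      PrimitiveObjectInspectorFactory.javaStringObjectInspector,
      PrimitiveObjectInspectorFactory.javaStringObjectInspector)
  }

  override def evaluate(arguments: Array[DeferredObject]): AnyRef = {
    val url = String.valueOf(arguments(0).get())
    val result: JMap[String, String] = new JHashMap[String, String]()
    result.put("url", url) // toy payload purely for illustration
    result
  }

  override def getDisplayString(children: Array[String]): String =
    "url_to_map(" + children.mkString(", ") + ")"
}
{code}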

> spark sql use HIVE UDF throw exception when return a Map value
> --
>
> Key: SPARK-19081
> URL: https://issues.apache.org/jira/browse/SPARK-19081
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Davy Song
>
> I have met a problem similar to https://issues.apache.org/jira/browse/SPARK-3582,
> but not with a Map parameter; my evaluate function returns a Map:
> public Map evaluate(Text url) {...}
> When running spark-sql with this UDF, I get the following exception:
> scala.MatchError: interface java.util.Map (of class java.lang.Class)
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:175)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:112)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:144)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:144)
> at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:133)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project.output(basicOperators.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.InsertIntoTable.resolved$lzycompute(basicOperators.scala:149)






[jira] [Created] (SPARK-19269) Scheduler Delay in spark ui

2017-01-17 Thread xdcjie (JIRA)
xdcjie created SPARK-19269:
--

 Summary: Scheduler Delay in spark ui
 Key: SPARK-19269
 URL: https://issues.apache.org/jira/browse/SPARK-19269
 Project: Spark
  Issue Type: Question
  Components: Web UI
Affects Versions: 2.0.2
Reporter: xdcjie


In the Spark UI, "Scheduler Delay" is presented as launch time (in the driver) minus
start time (in the executor), but in the code it actually consists of both the
launch-to-start gap and the status-update time (the time for the executor to report
task status back to the driver):

{code}
private[ui] def getSchedulerDelay(
    info: TaskInfo, metrics: TaskMetricsUIData, currentTime: Long): Long = {
  if (info.finished) {
    val totalExecutionTime = info.finishTime - info.launchTime
    val executorOverhead = (metrics.executorDeserializeTime +
      metrics.resultSerializationTime)
    math.max(
      0,
      totalExecutionTime - metrics.executorRunTime - executorOverhead -
        getGettingResultTime(info, currentTime))
  } else {
    // The task is still running and the metrics like executorRunTime are not
    // available.
    0L
  }
}
{code}

Could we split "scheduler delay" into "scheduler delay time" (launch time - start time)
and "status update time"? In my workload, the reported scheduler delay is 60s, but
launch time - start time is very fast, <1s.
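For illustration, a rough sketch of the proposed split. The field names below are hypothetical, not the actual TaskInfo/TaskMetricsUIData fields:

{code}
// Sketch only: breaking the current lump-sum "scheduler delay" into two separately
// reported components.
case class TaskTimes(
    launchTime: Long,          // when the driver launched the task
    executorStartTime: Long,   // when the executor actually started running it
    finishTime: Long,
    executorRunTime: Long,
    executorOverhead: Long,    // deserialization + result serialization
    gettingResultTime: Long)

object SchedulerDelaySplitSketch {
  // "scheduler delay time": driver launch -> executor start
  def launchDelay(t: TaskTimes): Long =
    math.max(0L, t.executorStartTime - t.launchTime)

  // "status update time": whatever of the total wall time is not otherwise accounted for
  def statusUpdateDelay(t: TaskTimes): Long = {
    val total = t.finishTime - t.launchTime
    val accounted = launchDelay(t) + t.executorRunTime + t.executorOverhead + t.gettingResultTime
    math.max(0L, total - accounted)
  }
}
{code}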






[jira] [Commented] (SPARK-18011) SparkR serialize "NA" throws exception

2017-01-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827478#comment-15827478
 ] 

Felix Cheung commented on SPARK-18011:
--

Very cool, thanks for all the investigation. What version of R are you running?
Is this on Mac or Linux?

> SparkR serialize "NA" throws exception
> --
>
> Key: SPARK-18011
> URL: https://issues.apache.org/jira/browse/SPARK-18011
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Miao Wang
>
> For some versions of R, if a Date column has an "NA" field, the backend will
> throw a negative array size exception.
> To reproduce the problem:
> {code}
> > a <- as.Date(c("2016-11-11", "NA"))
> > b <- as.data.frame(a)
> > c <- createDataFrame(b)
> > dim(c)
> 16/10/19 10:31:24 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.NegativeArraySizeException
>   at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110)
>   at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119)
>   at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:128)
>   at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:77)
>   at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:61)
>   at 
> org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:161)
>   at 
> org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:160)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.Range.foreach(Range.scala:160)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at org.apache.spark.sql.api.r.SQLUtils$.bytesToRow(SQLUtils.scala:160)
>   at 
> org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
>   at 
> org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing

2017-01-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827475#comment-15827475
 ] 

Felix Cheung commented on SPARK-12347:
--

[~ethanlu...@gmail.com] would you be willing and able to spend more time on 
this? I would be interested in working with you to shepherd this for 2.2.

> Write script to run all MLlib examples for testing
> --
>
> Key: SPARK-12347
> URL: https://issues.apache.org/jira/browse/SPARK-12347
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib, PySpark, SparkR, Tests
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> It would facilitate testing to have a script which runs all MLlib examples 
> for all languages.
> Design sketch to ensure all examples are run:
> * Generate a list of examples to run programmatically (not from a fixed list).
> * Use a list of special examples to handle examples which require command 
> line arguments.
> * Make sure data, etc. used are small to keep the tests quick.
> This could be broken into subtasks for each language, though it would be nice 
> to provide a single script.
> Not sure where the script should live; perhaps in {{bin/}}?






[jira] [Commented] (SPARK-18348) Improve tree ensemble model summary

2017-01-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827474#comment-15827474
 ] 

Felix Cheung commented on SPARK-18348:
--

[~yanboliang] would you like to run with this? You had a few great points when
we discussed this earlier.

> Improve tree ensemble model summary
> ---
>
> Key: SPARK-18348
> URL: https://issues.apache.org/jira/browse/SPARK-18348
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Felix Cheung
>
> During work on R APIs for tree ensemble models (e.g. Random Forest, GBT) it was
> discovered and discussed that
> - we don't have a good summary on nodes or trees for their observations, 
> loss, probability and so on
> - we don't have a shared API with nicely formatted output
> We believe this could be a shared API that benefits multiple language 
> bindings, including R, when available.
> For example, here is what R {code}rpart{code} shows for model summary:
> {code}
> Call:
> rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
> method = "class")
>   n= 81
>           CP nsplit rel error    xerror      xstd
> 1 0.17647059      0 1.0000000 1.0000000 0.2155872
> 2 0.01960784      1 0.8235294 0.9411765 0.2107780
> 3 0.01000000      4 0.7647059 1.0588235 0.2200975
> Variable importance
>  Start    Age Number
>     64     24     12
> Node number 1: 81 observations,    complexity param=0.1764706
>   predicted class=absent   expected loss=0.2098765  P(node) =1
>     class counts:    64    17
>    probabilities: 0.790 0.210
>   left son=2 (62 obs) right son=3 (19 obs)
>   Primary splits:
>       Start  < 8.5  to the right, improve=6.762330, (0 missing)
>       Number < 5.5  to the left,  improve=2.866795, (0 missing)
>       Age    < 39.5 to the left,  improve=2.250212, (0 missing)
>   Surrogate splits:
>       Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)
> Node number 2: 62 observations,    complexity param=0.01960784
>   predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
>     class counts:    56     6
>    probabilities: 0.903 0.097
>   left son=4 (29 obs) right son=5 (33 obs)
>   Primary splits:
>       Start  < 14.5 to the right, improve=1.0205280, (0 missing)
>       Age    < 55   to the left,  improve=0.6848635, (0 missing)
>       Number < 4.5  to the left,  improve=0.2975332, (0 missing)
>   Surrogate splits:
>       Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
>       Age    < 16   to the left,  agree=0.597, adj=0.138, (0 split)
> Node number 3: 19 observations
>   predicted class=present  expected loss=0.4210526  P(node) =0.2345679
>     class counts:     8    11
>    probabilities: 0.421 0.579
> Node number 4: 29 observations
>   predicted class=absent   expected loss=0  P(node) =0.3580247
>     class counts:    29     0
>    probabilities: 1.000 0.000
> Node number 5: 33 observations,    complexity param=0.01960784
>   predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
>     class counts:    27     6
>    probabilities: 0.818 0.182
>   left son=10 (12 obs) right son=11 (21 obs)
>   Primary splits:
>       Age    < 55   to the left,  improve=1.2467530, (0 missing)
>       Start  < 12.5 to the right, improve=0.2887701, (0 missing)
>       Number < 3.5  to the right, improve=0.1753247, (0 missing)
>   Surrogate splits:
>       Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
>       Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)
> Node number 10: 12 observations
>   predicted class=absent   expected loss=0  P(node) =0.1481481
>     class counts:    12     0
>    probabilities: 1.000 0.000
> Node number 11: 21 observations,    complexity param=0.01960784
>   predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
>     class counts:    15     6
>    probabilities: 0.714 0.286
>   left son=22 (14 obs) right son=23 (7 obs)
>   Primary splits:
>       Age    < 111  to the right, improve=1.71428600, (0 missing)
>       Start  < 12.5 to the right, improve=0.79365080, (0 missing)
>       Number < 3.5  to the right, improve=0.07142857, (0 missing)
> Node number 22: 14 observations
>   predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
>     class counts:    12     2
>    probabilities: 0.857 0.143
> Node number 23: 7 observations
>   predicted class=present  expected loss=0.4285714  P(node) =0.08641975
>     class counts:     3     4
>    probabilities: 0.429 0.571
> {code}






[jira] [Updated] (SPARK-19066) SparkR LDA doesn't set optimizer correctly

2017-01-17 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19066:
-
Fix Version/s: 2.1.1

> SparkR LDA doesn't set optimizer correctly
> --
>
> Key: SPARK-19066
> URL: https://issues.apache.org/jira/browse/SPARK-19066
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.1.1, 2.2.0
>
>
> spark.lda passes the optimizer ("em" or "online") to the backend. However,
> LDAWrapper doesn't set the optimizer based on the value from R. Therefore, for
> optimizer "em", the `isDistributed` field is FALSE when it should be TRUE.
> In addition, the `summary` method should return the results related to
> `DistributedLDAModel`.






[jira] [Commented] (SPARK-19255) SQL Listener is causing out of memory, in case of large no of shuffle partition

2017-01-17 Thread Ashok Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827379#comment-15827379
 ] 

Ashok Kumar commented on SPARK-19255:
-

@Takeshi, thanks for looking into the issue.
You are right, we can handle it by increasing driver memory, but our actual
scenario is 10 million shuffle partitions, which is 100 times that.

After analysing the code, we found that when the user looks at metrics in the
Spark SQL UI, we merge all of the accumulator data and then display it:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala#L336

So one suggestion is: what if we merged accumulator metrics after every
configured number of tasks? In that case the out-of-memory issue would not occur.
Please suggest; looking forward to your input.

> SQL Listener is causing out of memory, in case of large no of shuffle 
> partition
> ---
>
> Key: SPARK-19255
> URL: https://issues.apache.org/jira/browse/SPARK-19255
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
> Environment: Linux
>Reporter: Ashok Kumar
>Priority: Minor
> Attachments: spark_sqllistener_oom.png
>
>
> Test steps.
> 1.CREATE TABLE sample(imei string,age int,task bigint,num double,level 
> decimal(10,3),productdate timestamp,name string,point int)USING 
> com.databricks.spark.csv OPTIONS (path "data.csv", header "false", 
> inferSchema "false");
> 2. set spark.sql.shuffle.partitions=10;
> 3. select count(*) from (select task,sum(age) from sample group by task) t;
> After running the above query, the number of objects in the map variable
> _stageIdToStageMetrics has increased to a very high number; the increase is
> proportional to the number of shuffle partitions.
> Please have a look at the attached screenshot.






[jira] [Commented] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2017-01-17 Thread Genmao Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827371#comment-15827371
 ] 

Genmao Yu commented on SPARK-19185:
---

[~c...@koeninger.org] I have provided a PR to give a clearer hint to users when
they encounter this problem.

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>  Labels: streaming, windowing
>
> We've been running into ConcurrentModificationExceptions ("KafkaConsumer is
> not safe for multi-threaded access") with the CachedKafkaConsumer. I've been
> working through debugging this issue, and after looking through some of the
> Spark source code I think this is a bug.
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
> 

[jira] [Assigned] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2017-01-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19185:


Assignee: Apache Spark

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>Assignee: Apache Spark
>  Labels: streaming, windowing
>
> We've been running into ConcurrentModificationExcpetions "KafkaConsumer is 
> not safe for multi-threaded access" with the CachedKafkaConsumer. I've been 
> working through debugging this issue and after looking through some of the 
> spark source code I think this is a bug.
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   

[jira] [Assigned] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2017-01-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19185:


Assignee: (was: Apache Spark)

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>  Labels: streaming, windowing
>
> We've been running into ConcurrentModificationExcpetions "KafkaConsumer is 
> not safe for multi-threaded access" with the CachedKafkaConsumer. I've been 
> working through debugging this issue and after looking through some of the 
> spark source code I think this is a bug.
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at 

[jira] [Commented] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2017-01-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827368#comment-15827368
 ] 

Apache Spark commented on SPARK-19185:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16629

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>  Labels: streaming, windowing
>
> We've been running into ConcurrentModificationExcpetions "KafkaConsumer is 
> not safe for multi-threaded access" with the CachedKafkaConsumer. I've been 
> working through debugging this issue and after looking through some of the 
> spark source code I think this is a bug.
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> 

[jira] [Commented] (SPARK-18627) Cannot connect to Hive metastore in client mode with proxy user

2017-01-17 Thread Mubashir Kazia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827347#comment-15827347
 ] 

Mubashir Kazia commented on SPARK-18627:


bq. it doesn't explain why that works in cluster mode.

Just a wild guess: it probably also falls back to GSSAPI auth, but it succeeds 
because a YARN TGT is available in the UGI/subject/ticket cache thanks to the 
AMDelegationTokenRenewer. I can't confirm this is the case because the HMS audit 
logging does not log the real user.

bq. In fact, client mode shouldn't even need delegation tokens for the HMS, 
although with a proxy user maybe things get a little messed up.
Agree.

> Cannot connect to Hive metastore in client mode with proxy user
> ---
>
> Key: SPARK-18627
> URL: https://issues.apache.org/jira/browse/SPARK-18627
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Marking as "minor" since the security story for client mode with proxy users 
> is a little sketchy to start with, but it shouldn't fail, at least not in 
> this manner. Error you get is:
> {noformat}
> Caused by: MetaException(message:Could not connect to meta store using any of 
> the URIs provided. Most recent failure: 
> org.apache.thrift.transport.TTransportException: GSS initiate failed
> at 
> org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)
> at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:316)
> at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:430)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:240)
> at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1528)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:67)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:82)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3238)
> at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3257)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3482)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:225)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:209)
> at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:332)
> at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:293)
> at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:268)
> at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:529)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:204)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:249)
> {noformat}
> Cluster mode works fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-19081) spark sql use HIVE UDF throw exception when return a Map value

2017-01-17 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827332#comment-15827332
 ] 

Takeshi Yamamuro commented on SPARK-19081:
--

Yea, great! Is any further action needed for this? If not, I'll mark this issue as resolved.

> spark sql use HIVE UDF throw exception when return a Map value
> --
>
> Key: SPARK-19081
> URL: https://issues.apache.org/jira/browse/SPARK-19081
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Davy Song
>
> I have met a problem like https://issues.apache.org/jira/browse/SPARK-3582,
> but not with this parameter Map, my evaluate function return a Map:
> public Map evaluate(Text url) {...}
> when run spark-sql with this udf, getting the following exception:
> scala.MatchError: interface java.util.Map (of class java.lang.Class)
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:175)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:112)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:144)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:144)
> at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:133)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project.output(basicOperators.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.InsertIntoTable.resolved$lzycompute(basicOperators.scala:149)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2017-01-17 Thread Genmao Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827330#comment-15827330
 ] 

Genmao Yu commented on SPARK-19185:
---

IMHO, persisting cannot cover every scenario, for example executor failure or loss of 
cached data. Maybe we can expose a much clearer hint, like "KafkaConsumer is not 
safe for multi-threaded access; you may set 'useConsumerCache' to false, or avoid 
running multiple tasks in a single executor", or something else. Besides, we could 
expose useConsumerCache as a configuration (true by default). This will not 
change the behavior on a widespread basis.
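
A minimal sketch of the switch being proposed (the configuration key and helper 
below are hypothetical illustrations of the idea, not an existing Spark API):

{code}
import org.apache.spark.SparkConf

// Hypothetical: a proposed flag that lets users bypass the per-executor
// consumer cache and create a dedicated KafkaConsumer per task instead.
object ConsumerCachePolicy {
  val CacheEnabledKey = "spark.streaming.kafka.consumer.cache.enabled" // proposed key, default true

  def useCache(conf: SparkConf): Boolean =
    conf.getBoolean(CacheEnabledKey, defaultValue = true)
}

// Conceptual use inside KafkaRDD.compute:
//   val consumer = if (ConsumerCachePolicy.useCache(conf)) cachedConsumer(...) else freshConsumer(...)
{code}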

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>  Labels: streaming, windowing
>
> We've been running into ConcurrentModificationExcpetions "KafkaConsumer is 
> not safe for multi-threaded access" with the CachedKafkaConsumer. I've been 
> working through debugging this issue and after looking through some of the 
> spark source code I think this is a bug.
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951)
>   at 
> 

[jira] [Commented] (SPARK-18627) Cannot connect to Hive metastore in client mode with proxy user

2017-01-17 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827328#comment-15827328
 ] 

Marcelo Vanzin commented on SPARK-18627:


bq. The title of this jira is also incorrect. It is successfully able to fetch 
HMS delegation token

Right, changed.

bq. Spark connects to and uses HMS not HS2. It fetches a delegation token for 
HMS...but stores it in the credential cache with the label for HS2

Although that does look fishy, it doesn't explain why that works in cluster 
mode. In fact, client mode shouldn't even need delegation tokens for the HMS, 
although with a proxy user maybe things get a little messed up.

> Cannot connect to Hive metastore in client mode with proxy user
> ---
>
> Key: SPARK-18627
> URL: https://issues.apache.org/jira/browse/SPARK-18627
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Marking as "minor" since the security story for client mode with proxy users 
> is a little sketchy to start with, but it shouldn't fail, at least not in 
> this manner. Error you get is:
> {noformat}
> Caused by: MetaException(message:Could not connect to meta store using any of 
> the URIs provided. Most recent failure: 
> org.apache.thrift.transport.TTransportException: GSS initiate failed
> at 
> org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)
> at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:316)
> at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:430)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:240)
> at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1528)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:67)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:82)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3238)
> at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3257)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3482)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:225)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:209)
> at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:332)
> at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:293)
> at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:268)
> at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:529)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:204)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:249)
> {noformat}
> Cluster mode works fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Updated] (SPARK-19250) In security cluster, spark beeline connect to hive metastore failed

2017-01-17 Thread meiyoula (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meiyoula updated SPARK-19250:
-
Description: 
1. Start the thriftserver in security mode, with hive.metastore.uris set to the Hive 
metastore URI; Hive is also in security mode.
2. When beeline is used to create a table, it cannot connect to the Hive metastore; 
"Failed to find any Kerberos tgt" occurs.
{quote}
2017-01-17 16:25:53,618 | ERROR | [pool-25-thread-1] | SASL negotiation failure 
| org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:315)
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)]
at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at 
org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
at 
org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1738)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:513)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:249)
at 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1533)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at 
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3119)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3138)
at 
org.apache.hadoop.hive.ql.session.SessionState.setAuthorizerV2Config(SessionState.java:791)
at 
org.apache.hadoop.hive.ql.session.SessionState.setupAuth(SessionState.java:755)
at 
org.apache.hadoop.hive.ql.session.SessionState.getAuthenticator(SessionState.java:1461)
at 
org.apache.hadoop.hive.ql.session.SessionState.getUserFromAuthenticator(SessionState.java:1014)
at 
org.apache.hadoop.hive.ql.metadata.Table.getEmptyTable(Table.java:177)
at org.apache.hadoop.hive.ql.metadata.Table.(Table.java:119)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$toHiveTable(HiveClientImpl.scala:803)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:430)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:430)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:430)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:284)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:429)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:229)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:191)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:191)
at 

[jira] [Updated] (SPARK-18627) Cannot connect to Hive metastore in client mode with proxy user

2017-01-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-18627:
---
Summary: Cannot connect to Hive metastore in client mode with proxy user  
(was: Cannot fetch Hive delegation tokens in client mode with proxy user)

> Cannot connect to Hive metastore in client mode with proxy user
> ---
>
> Key: SPARK-18627
> URL: https://issues.apache.org/jira/browse/SPARK-18627
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Marking as "minor" since the security story for client mode with proxy users 
> is a little sketchy to start with, but it shouldn't fail, at least not in 
> this manner. Error you get is:
> {noformat}
> Caused by: MetaException(message:Could not connect to meta store using any of 
> the URIs provided. Most recent failure: 
> org.apache.thrift.transport.TTransportException: GSS initiate failed
> at 
> org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)
> at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:316)
> at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:430)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:240)
> at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1528)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:67)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:82)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3238)
> at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3257)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3482)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:225)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:209)
> at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:332)
> at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:293)
> at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:268)
> at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:529)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:204)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:249)
> {noformat}
> Cluster mode works fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19266) Ensure DiskStore properly encrypts cached data on disk.

2017-01-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-19266:
---
Summary: Ensure DiskStore properly encrypts cached data on disk.  (was: 
DiskStore does not encrypt serialized RDD data)

> Ensure DiskStore properly encrypts cached data on disk.
> ---
>
> Key: SPARK-19266
> URL: https://issues.apache.org/jira/browse/SPARK-19266
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> {{DiskStore.putBytes()}} writes serialized RDD data directly to disk, without 
> encrypting (or compressing) it. So any cached blocks that are evicted to disk 
> when using {{MEMORY_AND_DISK_SER}} will not be encrypted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19250) In security cluster, spark beeline connect to hive metastore failed

2017-01-17 Thread meiyoula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827320#comment-15827320
 ] 

meiyoula commented on SPARK-19250:
--

[~rxin] I think you may understand this bug.
I found that you set hive.metastore.uris="" on thriftserver.hiveConf in commit 
https://github.com/apache/spark/commit/054f991c4350af1350af7a4109ee77f4a34822f0#diff-709404b0d3defeff035ef0c4f5a960e5.
But when it is set to the local metastore, the beeline openSession will not obtain a 
token, and connecting to the remote metastore will fail.
Can you have a look and give some ideas? Thanks!



> In security cluster, spark beeline connect to hive metastore failed
> ---
>
> Key: SPARK-19250
> URL: https://issues.apache.org/jira/browse/SPARK-19250
> Project: Spark
>  Issue Type: Bug
>Reporter: meiyoula
>  Labels: security-issue
>
> 1. starting thriftserver in security mode, set hive.metastore.uris to hive 
> metastore uri, also hive is in security mode.
> 2. when use beeline to create table, it can't connect to hive metastore 
> successfully, occurs "Failed to find any Kerberos tgt".
> Reason:
> When open hivemetastore client, first check if has token, because the 
> hive.metastore.uris has been set to local, so it don't obtain token; secondly 
> use tgt to auth, but current user is a proxyuser. So open metastore client 
> failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19266) DiskStore does not encrypt serialized RDD data

2017-01-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-19266:
---
Priority: Minor  (was: Major)

> DiskStore does not encrypt serialized RDD data
> --
>
> Key: SPARK-19266
> URL: https://issues.apache.org/jira/browse/SPARK-19266
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> {{DiskStore.putBytes()}} writes serialized RDD data directly to disk, without 
> encrypting (or compressing) it. So any cached blocks that are evicted to disk 
> when using {{MEMORY_AND_DISK_SER}} will not be encrypted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19266) DiskStore does not encrypt serialized RDD data

2017-01-17 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827315#comment-15827315
 ] 

Marcelo Vanzin commented on SPARK-19266:


Hmm... seems {{partiallySerializedValues.finishWritingToStream}} will actually 
keep writing encrypted data when it's called (it uses a 
{{RedirectableOutputStream}} which just replaces the byte array stream with a 
file stream, keeping the existing encryption and compression state). But I 
can't find where that spilled data is read, and I'm not sure whether that's 
properly covered by existing tests.

I'll try to write a test to exercise this path; worst case we have a new test 
that makes sure it works.
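
A rough sketch of the kind of check I have in mind (a standalone smoke test under 
stated assumptions: local mode, {{spark.io.encryption.enabled}} turned on, and 
storage memory shrunk so the serialized block is evicted to disk; this is not the 
actual test to be added):

{code}
import java.io.File
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Sketch only: cache a block too big for storage memory so it is evicted to disk,
// then scan the local dir for the plaintext marker. If encryption is applied on
// the eviction path, the marker should not appear in any on-disk file.
object DiskEncryptionSmokeCheck {
  def main(args: Array[String]): Unit = {
    val localDir = Files.createTempDirectory("spark-enc-check").toFile
    val conf = new SparkConf()
      .setMaster("local[1]")
      .setAppName("disk-encryption-check")
      .set("spark.io.encryption.enabled", "true")
      .set("spark.local.dir", localDir.getAbsolutePath)
      .set("spark.memory.fraction", "0.0001") // tiny storage memory => block goes to disk
    val sc = new SparkContext(conf)

    val marker = "PLAINTEXT_MARKER_1234567890"
    sc.parallelize(Seq.fill(100000)(marker), 1)
      .persist(StorageLevel.MEMORY_AND_DISK_SER)
      .count() // materialize the cached block

    def files(f: File): Seq[File] =
      if (f.isDirectory) f.listFiles().toSeq.flatMap(files) else Seq(f)
    val leaked = files(localDir).exists { f =>
      new String(Files.readAllBytes(f.toPath), "ISO-8859-1").contains(marker)
    }
    println(s"plaintext marker found on disk: $leaked") // expect false if encryption works
    sc.stop()
  }
}
{code}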

> DiskStore does not encrypt serialized RDD data
> --
>
> Key: SPARK-19266
> URL: https://issues.apache.org/jira/browse/SPARK-19266
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>
> {{DiskStore.putBytes()}} writes serialized RDD data directly to disk, without 
> encrypting (or compressing) it. So any cached blocks that are evicted to disk 
> when using {{MEMORY_AND_DISK_SER}} will not be encrypted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19250) In security cluster, spark beeline connect to hive metastore failed

2017-01-17 Thread meiyoula (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meiyoula updated SPARK-19250:
-
Labels: security-issue  (was: )

> In security cluster, spark beeline connect to hive metastore failed
> ---
>
> Key: SPARK-19250
> URL: https://issues.apache.org/jira/browse/SPARK-19250
> Project: Spark
>  Issue Type: Bug
>Reporter: meiyoula
>  Labels: security-issue
>
> 1. starting thriftserver in security mode, set hive.metastore.uris to hive 
> metastore uri, also hive is in security mode.
> 2. when use beeline to create table, it can't connect to hive metastore 
> successfully, occurs "Failed to find any Kerberos tgt".
> Reason:
> When open hivemetastore client, first check if has token, because the 
> hive.metastore.uris has been set to local, so it don't obtain token; secondly 
> use tgt to auth, but current user is a proxyuser. So open metastore client 
> failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19234) AFTSurvivalRegression chokes silently or with confusing errors when any labels are zero

2017-01-17 Thread Andrew MacKinlay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827236#comment-15827236
 ] 

Andrew MacKinlay edited comment on SPARK-19234 at 1/18/17 1:53 AM:
---

[~yanboliang] I presume that you are the author judging by Github commits? Do 
you have an opinion on this? 


was (Author: admackin):
[~yanbo] I presume that you are the author judging by Github commits? Do you 
have an opinion on this? 

> AFTSurvivalRegression chokes silently or with confusing errors when any 
> labels are zero
> ---
>
> Key: SPARK-19234
> URL: https://issues.apache.org/jira/browse/SPARK-19234
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
> Environment: spark-shell or pyspark
>Reporter: Andrew MacKinlay
>Priority: Minor
> Attachments: spark-aft-failure.txt
>
>
> If you try and use AFTSurvivalRegression and any label in your input data is 
> 0.0, you get coefficients of 0.0 returned, and in many cases, errors like 
> this:
> {{17/01/16 15:10:50 ERROR StrongWolfeLineSearch: Encountered bad values in 
> function evaluation. Decreasing step size to NaN}}
> Zero should, I think, be an allowed value for survival analysis. I don't know 
> if this is a pathological case for AFT specifically as I don't know enough 
> about it, but this behaviour is clearly undesirable. If you have any labels 
> of 0.0, you get either a) obscure error messages, with no knowledge of the 
> cause and coefficients which are all zero or b) no errors messages at all and 
> coefficients of zero (arguably worse, since you don't even have console 
> output to tell you something's gone awry). If AFT doesn't work with 
> zero-valued labels, Spark should fail fast and let the developer know why. If 
> it does, we should get results here.
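
For reference, a minimal spark-shell sketch of the reported scenario (a hypothetical 
reproduction; the column names and data values are arbitrary, only the 0.0 label matters):

{code}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.AFTSurvivalRegression

// One row with label 0.0 is enough to trigger the behaviour described above.
val training = spark.createDataFrame(Seq(
  (0.0,   1.0, Vectors.dense(1.560, -0.605)),  // zero-valued label
  (4.199, 0.0, Vectors.dense(0.346, 2.158)),
  (2.949, 1.0, Vectors.dense(1.380, 0.231)),
  (6.119, 1.0, Vectors.dense(0.520, 1.151))
)).toDF("label", "censor", "features")

val model = new AFTSurvivalRegression().fit(training)
println(model.coefficients) // reportedly all zeros; StrongWolfeLineSearch errors may appear in the log
{code}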



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16578) Configurable hostname for RBackend

2017-01-17 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827237#comment-15827237
 ] 

Shivaram Venkataraman commented on SPARK-16578:
---

I think it's fine to remove the target version for this. As [~mengxr] said, it's 
not clear what the requirements are for this deploy mode or what kind of 
applications will use it. If somebody puts that together, we can retarget?

> Configurable hostname for RBackend
> --
>
> Key: SPARK-16578
> URL: https://issues.apache.org/jira/browse/SPARK-16578
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Junyang Qian
>
> One of the requirements that comes up with SparkR being a standalone package 
> is that users can now install just the R package on the client side and 
> connect to a remote machine which runs the RBackend class.
> We should check if we can support this mode of execution and what are the 
> pros / cons of it



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19234) AFTSurvivalRegression chokes silently or with confusing errors when any labels are zero

2017-01-17 Thread Andrew MacKinlay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827236#comment-15827236
 ] 

Andrew MacKinlay commented on SPARK-19234:
--

[~yanbo] I presume that you are the author judging by Github commits? Do you 
have an opinion on this? 

> AFTSurvivalRegression chokes silently or with confusing errors when any 
> labels are zero
> ---
>
> Key: SPARK-19234
> URL: https://issues.apache.org/jira/browse/SPARK-19234
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
> Environment: spark-shell or pyspark
>Reporter: Andrew MacKinlay
>Priority: Minor
> Attachments: spark-aft-failure.txt
>
>
> If you try and use AFTSurvivalRegression and any label in your input data is 
> 0.0, you get coefficients of 0.0 returned, and in many cases, errors like 
> this:
> {{17/01/16 15:10:50 ERROR StrongWolfeLineSearch: Encountered bad values in 
> function evaluation. Decreasing step size to NaN}}
> Zero should, I think, be an allowed value for survival analysis. I don't know 
> if this is a pathological case for AFT specifically as I don't know enough 
> about it, but this behaviour is clearly undesirable. If you have any labels 
> of 0.0, you get either a) obscure error messages, with no knowledge of the 
> cause and coefficients which are all zero or b) no errors messages at all and 
> coefficients of zero (arguably worse, since you don't even have console 
> output to tell you something's gone awry). If AFT doesn't work with 
> zero-valued labels, Spark should fail fast and let the developer know why. If 
> it does, we should get results here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19268) File does not exist: /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta

2017-01-17 Thread liyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyan updated SPARK-19268:
--
Description: 
bq. ./run-example sql.streaming.JavaStructuredKafkaWordCount 192.168.3.110:9092 
subscribe topic03

When I run the Spark example, it raises the following error:
{quote}
Exception in thread "main" 17/01/17 14:13:41 DEBUG ContextCleaner: Got cleaning 
task CleanBroadcast(4)
org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to 
stage failure: Task 2 in stage 9.0 failed 1 times, most recent failure: Lost 
task 2.0 in stage 9.0 (TID 46, localhost, executor driver): 
java.lang.IllegalStateException: Error reading delta file 
/tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta of 
HDFSStateStoreProvider[id = (op=0, part=2), dir = 
/tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2]: 
/tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta does not 
exist
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:354)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:306)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:303)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:303)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:302)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:302)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:220)
at 
org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:151)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: File does not exist: 
/tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:587)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 

[jira] [Comment Edited] (SPARK-19264) Work should start driver, the same to AM of yarn

2017-01-17 Thread hustfxj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827201#comment-15827201
 ] 

hustfxj edited comment on SPARK-19264 at 1/18/17 1:39 AM:
--

So I have to change the last piece of code to something like this:
{code}
try {
  ssc.awaitTermination()
} finally {
  System.exit(-1)
}
{code}

I think the user shouldn't have to care about this. Since Spark provides 
awaitTermination(), the driver should quit once the main thread is done, 
whether or not it still contains unfinished non-daemon threads.
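
To illustrate the non-daemon thread point with a standalone sketch (independent of Spark; everything here is a plain JVM example): the process below keeps running after main returns, because the scheduled executor's worker thread is non-daemon.
{code}
import java.util.concurrent.{Executors, TimeUnit}

object NonDaemonDemo {
  def main(args: Array[String]): Unit = {
    // The default thread factory creates non-daemon threads.
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = println("still alive: " + System.currentTimeMillis())
    }, 0, 5, TimeUnit.SECONDS)

    println("main is done")
    // main returns here, but the JVM does NOT exit: the non-daemon scheduler
    // thread keeps it alive until scheduler.shutdown() or System.exit() is called.
  }
}
{code}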



was (Author: hustfxj):
So I must change the last  code like that
{code}
try {
  ssc.awaitTermination()
} finally {
System.exit(-1);
  }
{code}

> Work should start driver, the same to  AM  of yarn 
> ---
>
> Key: SPARK-19264
> URL: https://issues.apache.org/jira/browse/SPARK-19264
> Project: Spark
>  Issue Type: Improvement
>Reporter: hustfxj
>
>   I think work can't start driver by "ProcessBuilderLike",  thus we can't 
> know the application's main thread is finished or not if the application's 
> main thread contains some non-daemon threads. Because the program terminates 
> when there no longer is any non-daemon thread running (or someone called 
> System.exit). The main thread can have finished long ago. 
> worker should  start driver like AM of YARN . As followed:
> {code:title=ApplicationMaster.scala|borderStyle=solid}
>  mainMethod.invoke(null, userArgs.toArray)
>  finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
>  logDebug("Done running users class")
> {code}
> Then the work can monitor the driver's main thread, and know the 
> application's state. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19268) File does not exist: /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta

2017-01-17 Thread liyan (JIRA)
liyan created SPARK-19268:
-

 Summary: File does not exist: 
/tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta
 Key: SPARK-19268
 URL: https://issues.apache.org/jira/browse/SPARK-19268
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
 Environment: - hadoop2.7
- Java 7
Reporter: liyan


bq. ./run-example sql.streaming.JavaStructuredKafkaWordCount 192.168.3.110:9092 
subscribe topic03

when I run the Spark example, it raises the following error:
{quote}
Exception in thread "main" 17/01/17 14:13:41 DEBUG ContextCleaner: Got cleaning 
task CleanBroadcast(4)
org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to 
stage failure: Task 2 in stage 9.0 failed 1 times, most recent failure: Lost 
task 2.0 in stage 9.0 (TID 46, localhost, executor driver): 
java.lang.IllegalStateException: Error reading delta file 
/tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta of 
HDFSStateStoreProvider[id = (op=0, part=2), dir = 
/tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2]: 
/tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta does not 
exist
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:354)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:306)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:303)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:303)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:302)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:302)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:220)
at 
org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:151)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: File does not exist: 
/tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:587)
at 

[jira] [Commented] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2017-01-17 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827215#comment-15827215
 ] 

Helena Edelson commented on SPARK-19185:


I've seen this as well; as expected, the exceptions are never raised when the 
cache is not used. This is on Spark 2.1.

The exception is raised in the seek function. I added an opt-in config for that 
temporarily, but I will work on a better solution. A performance hit isn't 
something I can accept :)
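
For context, a minimal sketch of the kind of windowed direct-stream job described in the report below (broker address, topic, and group id are placeholders; the batch, window, and slide intervals are taken from the report):
{code}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("WindowedKafkaJob")
val ssc = new StreamingContext(conf, Seconds(10))            // 10s batch interval

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",                      // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "windowed-consumer",                         // placeholder
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("test-topic"), kafkaParams))

// 180s window sliding every 30s: each window covers several KafkaRDD offset ranges
// for the same TopicPartition, which is where the cached consumer gets shared.
stream.map(_.value).window(Seconds(180), Seconds(30)).count().print()

ssc.start()
ssc.awaitTermination()
{code}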

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>  Labels: streaming, windowing
>
> We've been running into ConcurrentModificationExceptions "KafkaConsumer is 
> not safe for multi-threaded access" with the CachedKafkaConsumer. I've been 
> working through debugging this issue and after looking through some of the 
> spark source code I think this is a bug.
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at 

[jira] [Commented] (SPARK-19264) Work should start driver, the same to AM of yarn

2017-01-17 Thread hustfxj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827201#comment-15827201
 ] 

hustfxj commented on SPARK-19264:
-

So I have to change the last piece of code to something like this:
{code}
try {
  ssc.awaitTermination()
} finally {
  System.exit(-1)
}
{code}

> Work should start driver, the same to  AM  of yarn 
> ---
>
> Key: SPARK-19264
> URL: https://issues.apache.org/jira/browse/SPARK-19264
> Project: Spark
>  Issue Type: Improvement
>Reporter: hustfxj
>
>   I think work can't start driver by "ProcessBuilderLike",  thus we can't 
> know the application's main thread is finished or not if the application's 
> main thread contains some non-daemon threads. Because the program terminates 
> when there no longer is any non-daemon thread running (or someone called 
> System.exit). The main thread can have finished long ago. 
> worker should  start driver like AM of YARN . As followed:
> {code:title=ApplicationMaster.scala|borderStyle=solid}
>  mainMethod.invoke(null, userArgs.toArray)
>  finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
>  logDebug("Done running users class")
> {code}
> Then the work can monitor the driver's main thread, and know the 
> application's state. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19081) spark sql use HIVE UDF throw exception when return a Map value

2017-01-17 Thread Davy Song (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827199#comment-15827199
 ] 

Davy Song commented on SPARK-19081:
---

Thanks Takeshi, you are right. I tested Map on Databricks' Spark SQL 2.0 and 
hit the following exception:

com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: 
org.apache.spark.sql.AnalysisException: Map type in java is unsupported because 
JVM type erasure makes spark fail to catch key and value types in Map<>; line 2 
pos 12 at 
org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:234)
 at 
org.apache.spark.sql.hive.HiveSimpleUDF.javaClassToDataType(hiveUDFs.scala:41) 
at 
org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:71) 
at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:71) at 
org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionBuilder$1.apply(HiveSessionCatalog.scala:126)
 at 
org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionBuilder$1.apply(HiveSessionCatalog.scala:122)
 at 
org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:87)
 at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:853)
 at 
org.apache.spark.sql.hive.HiveSessionCatalog.org$apache$spark$sql$hive$HiveSessionCatalog$$super$lookupFunction(HiveSessionCatalog.scala:186)
 at 

> spark sql use HIVE UDF throw exception when return a Map value
> --
>
> Key: SPARK-19081
> URL: https://issues.apache.org/jira/browse/SPARK-19081
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Davy Song
>
> I have met a problem like https://issues.apache.org/jira/browse/SPARK-3582,
> but with a Map return value rather than a Map parameter; my evaluate function 
> returns a Map:
> public Map evaluate(Text url) {...}
> When running spark-sql with this UDF, I get the following exception:
> scala.MatchError: interface java.util.Map (of class java.lang.Class)
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:175)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:112)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:144)
> at 
> org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:144)
> at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:133)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicOperators.scala:25)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project.output(basicOperators.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.InsertIntoTable.resolved$lzycompute(basicOperators.scala:149)
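
For illustration, a minimal sketch of the kind of UDF involved (the class and its logic are hypothetical). Because HiveSimpleUDF discovers the return type of {{evaluate}} via Java reflection, the erased {{java.util.Map}} carries no key/value type information, which is what the error above is complaining about:
{code}
import java.util.{HashMap => JHashMap, Map => JMap}
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Hypothetical UDF: splits a query string like "a=1&b=2" into a map.
class UrlParamsUDF extends UDF {
  def evaluate(url: Text): JMap[String, String] = {
    val result = new JHashMap[String, String]()
    if (url != null) {
      url.toString.split("&").foreach { kv =>
        val parts = kv.split("=", 2)
        if (parts.length == 2) result.put(parts(0), parts(1))
      }
    }
    // Spark only sees the raw return type java.util.Map here, hence the
    // MatchError / "Map type in java is unsupported" failure described above.
    result
  }
}
{code}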



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19264) Work should start driver, the same to AM of yarn

2017-01-17 Thread hustfxj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827181#comment-15827181
 ] 

hustfxj edited comment on SPARK-19264 at 1/18/17 1:18 AM:
--

[~srowen] sorry, I didn't say it clearly. I mean the driver can't finish while 
it still contains unfinished non-daemon threads. Look at the following example: 
the driver program should crash because of the exception, but in fact it can't, 
because the timer thread is still running.
{code:title=Test.scala|borderStyle=solid}
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    sparkConf.set("spark.streaming.blockInterval", "1000ms")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    // non-daemon thread
    val scheduledExecutorService = Executors.newSingleThreadScheduledExecutor(
      new ThreadFactory() {
        def newThread(r: Runnable): Thread = new Thread(r, "Driver-Commit-Thread")
      })
    scheduledExecutorService.scheduleAtFixedRate(
      new Runnable() {
        def run() {
          try {
            System.out.println("runnable")
          } catch {
            case e: Exception =>
              System.out.println("ScheduledTask persistAllConsumerOffset exception: " + e)
          }
        }
      }, 1000, 1000 * 5, TimeUnit.MILLISECONDS)
    val lines = ssc.receiverStream(new WordReceiver(StorageLevel.MEMORY_AND_DISK_2))
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey((x: Int, y: Int) => x + y, 10)
    wordCounts.foreachRDD { rdd =>
      rdd.collect().foreach(println)
      throw new RuntimeException
    }
    ssc.start()
    ssc.awaitTermination()
{code}



was (Author: hustfxj):
[~srowen] sorry, I didn't say it clearly. I means the spark application can't 
be done when it contains other unfinished non-daemon. Look at the followed 
example. the driver program should crash due to the exception. But In fact the 
driver program can't crash because the timer threads still are runnning.
{code:title=Test.scala|borderStyle=solid}   
   val sparkConf = new SparkConf().setAppName("NetworkWordCount")
sparkConf.set("spark.streaming.blockInterval", "1000ms")
val ssc = new StreamingContext(sparkConf, Seconds(10))
//non-daemon thread
val scheduledExecutorService = Executors.newSingleThreadScheduledExecutor(
  new ThreadFactory() {
def newThread(r: Runnable): Thread = new Thread(r, 
"Driver-Commit-Thread")
  })
scheduledExecutorService.scheduleAtFixedRate(
  new Runnable() {
def run() {
  try {
System.out.println("runable")
  } catch {
case e: Exception => {
  System.out.println("ScheduledTask persistAllConsumerOffset 
exception", e)
}
  }
}
  }, 1000, 1000 * 5, TimeUnit.MILLISECONDS)
val lines = ssc.receiverStream(new 
WordReceiver(StorageLevel.MEMORY_AND_DISK_2))
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey((x: Int, y: Int) => x + 
y, 10)
wordCounts.foreachRDD{rdd =>
  rdd.collect().foreach(println)
  throw new RuntimeException
}
ssc.start()
ssc.awaitTermination()
{code}


> Work should start driver, the same to  AM  of yarn 
> ---
>
> Key: SPARK-19264
> URL: https://issues.apache.org/jira/browse/SPARK-19264
> Project: Spark
>  Issue Type: Improvement
>Reporter: hustfxj
>
>   I think work can't start driver by "ProcessBuilderLike",  thus we can't 
> know the application's main thread is finished or not if the application's 
> main thread contains some non-daemon threads. Because the program terminates 
> when there no longer is any non-daemon thread running (or someone called 
> System.exit). The main thread can have finished long ago. 
> worker should  start driver like AM of YARN . As followed:
> {code:title=ApplicationMaster.scala|borderStyle=solid}
>  mainMethod.invoke(null, userArgs.toArray)
>  finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
>  logDebug("Done running users class")
> {code}
> Then the work can monitor the driver's main thread, and know the 
> application's state. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19264) Work should start driver, the same to AM of yarn

2017-01-17 Thread hustfxj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827181#comment-15827181
 ] 

hustfxj edited comment on SPARK-19264 at 1/18/17 1:16 AM:
--

[~srowen] sorry, I didn't say it clearly. I mean the Spark application can't 
finish while it still contains unfinished non-daemon threads. Look at the 
following example: the driver program should crash because of the exception, 
but in fact it can't, because the timer thread is still running.
{code:title=Test.scala|borderStyle=solid}
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    sparkConf.set("spark.streaming.blockInterval", "1000ms")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    // non-daemon thread
    val scheduledExecutorService = Executors.newSingleThreadScheduledExecutor(
      new ThreadFactory() {
        def newThread(r: Runnable): Thread = new Thread(r, "Driver-Commit-Thread")
      })
    scheduledExecutorService.scheduleAtFixedRate(
      new Runnable() {
        def run() {
          try {
            System.out.println("runnable")
          } catch {
            case e: Exception =>
              System.out.println("ScheduledTask persistAllConsumerOffset exception: " + e)
          }
        }
      }, 1000, 1000 * 5, TimeUnit.MILLISECONDS)
    val lines = ssc.receiverStream(new WordReceiver(StorageLevel.MEMORY_AND_DISK_2))
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey((x: Int, y: Int) => x + y, 10)
    wordCounts.foreachRDD { rdd =>
      rdd.collect().foreach(println)
      throw new RuntimeException
    }
    ssc.start()
    ssc.awaitTermination()
{code}



was (Author: hustfxj):
[~srowen] sorry, I didn't say it clearly. I means the spark application can't 
be done when it contains other unfinished non-daemon. Look at the follows 
example. the driver program should crash due to the exception. But In fact the 
driver program can't crash because the timer threads still are runnning.
{code:title=Test.scala|borderStyle=solid}   
   val sparkConf = new SparkConf().setAppName("NetworkWordCount")
sparkConf.set("spark.streaming.blockInterval", "1000ms")
val ssc = new StreamingContext(sparkConf, Seconds(10))
//non-daemon thread
val scheduledExecutorService = Executors.newSingleThreadScheduledExecutor(
  new ThreadFactory() {
def newThread(r: Runnable): Thread = new Thread(r, 
"Driver-Commit-Thread")
  })
scheduledExecutorService.scheduleAtFixedRate(
  new Runnable() {
def run() {
  try {
System.out.println("runable")
  } catch {
case e: Exception => {
  System.out.println("ScheduledTask persistAllConsumerOffset 
exception", e)
}
  }
}
  }, 1000, 1000 * 5, TimeUnit.MILLISECONDS)
val lines = ssc.receiverStream(new 
WordReceiver(StorageLevel.MEMORY_AND_DISK_2))
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey((x: Int, y: Int) => x + 
y, 10)
wordCounts.foreachRDD{rdd =>
  rdd.collect().foreach(println)
  throw new RuntimeException
}
ssc.start()
ssc.awaitTermination()
{code}


> Work should start driver, the same to  AM  of yarn 
> ---
>
> Key: SPARK-19264
> URL: https://issues.apache.org/jira/browse/SPARK-19264
> Project: Spark
>  Issue Type: Improvement
>Reporter: hustfxj
>
>   I think work can't start driver by "ProcessBuilderLike",  thus we can't 
> know the application's main thread is finished or not if the application's 
> main thread contains some non-daemon threads. Because the program terminates 
> when there no longer is any non-daemon thread running (or someone called 
> System.exit). The main thread can have finished long ago. 
> worker should  start driver like AM of YARN . As followed:
> {code:title=ApplicationMaster.scala|borderStyle=solid}
>  mainMethod.invoke(null, userArgs.toArray)
>  finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
>  logDebug("Done running users class")
> {code}
> Then the work can monitor the driver's main thread, and know the 
> application's state. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19264) Work should start driver, the same to AM of yarn

2017-01-17 Thread hustfxj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827181#comment-15827181
 ] 

hustfxj edited comment on SPARK-19264 at 1/18/17 1:16 AM:
--

[~srowen] sorry, I didn't say it clearly. I mean the Spark application can't 
finish while it still contains unfinished non-daemon threads. Look at the 
following example: the driver program should crash because of the exception, 
but in fact it can't, because the timer thread is still running.
{code:title=Test.scala|borderStyle=solid}
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    sparkConf.set("spark.streaming.blockInterval", "1000ms")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    // non-daemon thread
    val scheduledExecutorService = Executors.newSingleThreadScheduledExecutor(
      new ThreadFactory() {
        def newThread(r: Runnable): Thread = new Thread(r, "Driver-Commit-Thread")
      })
    scheduledExecutorService.scheduleAtFixedRate(
      new Runnable() {
        def run() {
          try {
            System.out.println("runnable")
          } catch {
            case e: Exception =>
              System.out.println("ScheduledTask persistAllConsumerOffset exception: " + e)
          }
        }
      }, 1000, 1000 * 5, TimeUnit.MILLISECONDS)
    val lines = ssc.receiverStream(new WordReceiver(StorageLevel.MEMORY_AND_DISK_2))
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey((x: Int, y: Int) => x + y, 10)
    wordCounts.foreachRDD { rdd =>
      rdd.collect().foreach(println)
      throw new RuntimeException
    }
    ssc.start()
    ssc.awaitTermination()
{code}



was (Author: hustfxj):
[~srowen] sorry, I didn't say it clearly. I means an spark application can't be 
done when it contains other unfinished non-daemon. for example
{code:title=Test.scala|borderStyle=solid}   
   val sparkConf = new SparkConf().setAppName("NetworkWordCount")
sparkConf.set("spark.streaming.blockInterval", "1000ms")
val ssc = new StreamingContext(sparkConf, Seconds(10))
//non-daemon thread
val scheduledExecutorService = Executors.newSingleThreadScheduledExecutor(
  new ThreadFactory() {
def newThread(r: Runnable): Thread = new Thread(r, 
"Driver-Commit-Thread")
  })
scheduledExecutorService.scheduleAtFixedRate(
  new Runnable() {
def run() {
  try {
System.out.println("runable")
  } catch {
case e: Exception => {
  System.out.println("ScheduledTask persistAllConsumerOffset 
exception", e)
}
  }
}
  }, 1000, 1000 * 5, TimeUnit.MILLISECONDS)
val lines = ssc.receiverStream(new 
WordReceiver(StorageLevel.MEMORY_AND_DISK_2))
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey((x: Int, y: Int) => x + 
y, 10)
wordCounts.foreachRDD{rdd =>
  rdd.collect().foreach(println)
  throw new RuntimeException
}
ssc.start()
ssc.awaitTermination()
{code}



> Work should start driver, the same to  AM  of yarn 
> ---
>
> Key: SPARK-19264
> URL: https://issues.apache.org/jira/browse/SPARK-19264
> Project: Spark
>  Issue Type: Improvement
>Reporter: hustfxj
>
>   I think work can't start driver by "ProcessBuilderLike",  thus we can't 
> know the application's main thread is finished or not if the application's 
> main thread contains some non-daemon threads. Because the program terminates 
> when there no longer is any non-daemon thread running (or someone called 
> System.exit). The main thread can have finished long ago. 
> worker should  start driver like AM of YARN . As followed:
> {code:title=ApplicationMaster.scala|borderStyle=solid}
>  mainMethod.invoke(null, userArgs.toArray)
>  finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
>  logDebug("Done running users class")
> {code}
> Then the work can monitor the driver's main thread, and know the 
> application's state. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19264) Work should start driver, the same to AM of yarn

2017-01-17 Thread hustfxj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827181#comment-15827181
 ] 

hustfxj edited comment on SPARK-19264 at 1/18/17 1:12 AM:
--

[~srowen] sorry, I didn't say it clearly. I mean a Spark application can't 
finish while it still contains unfinished non-daemon threads. For example:
{code:title=Test.scala|borderStyle=solid}
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    sparkConf.set("spark.streaming.blockInterval", "1000ms")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    // non-daemon thread
    val scheduledExecutorService = Executors.newSingleThreadScheduledExecutor(
      new ThreadFactory() {
        def newThread(r: Runnable): Thread = new Thread(r, "Driver-Commit-Thread")
      })
    scheduledExecutorService.scheduleAtFixedRate(
      new Runnable() {
        def run() {
          try {
            System.out.println("runnable")
          } catch {
            case e: Exception =>
              System.out.println("ScheduledTask persistAllConsumerOffset exception: " + e)
          }
        }
      }, 1000, 1000 * 5, TimeUnit.MILLISECONDS)
    val lines = ssc.receiverStream(new WordReceiver(StorageLevel.MEMORY_AND_DISK_2))
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey((x: Int, y: Int) => x + y, 10)
    wordCounts.foreachRDD { rdd =>
      rdd.collect().foreach(println)
      throw new RuntimeException
    }
    ssc.start()
    ssc.awaitTermination()
{code}




was (Author: hustfxj):
[~srowen] sorry, I didn't say it clearly. I means an spark application can't be 
done when it contain other unfinished non-daemon. for example
{code:title=Test.scala|borderStyle=solid}   
   val sparkConf = new SparkConf().setAppName("NetworkWordCount")
sparkConf.set("spark.streaming.blockInterval", "1000ms")
val ssc = new StreamingContext(sparkConf, Seconds(10))
//non-daemon thread
val scheduledExecutorService = Executors.newSingleThreadScheduledExecutor(
  new ThreadFactory() {
def newThread(r: Runnable): Thread = new Thread(r, 
"Driver-Commit-Thread")
  })
scheduledExecutorService.scheduleAtFixedRate(
  new Runnable() {
def run() {
  try {
System.out.println("runable")
  } catch {
case e: Exception => {
  System.out.println("ScheduledTask persistAllConsumerOffset 
exception", e)
}
  }
}
  }, 1000, 1000 * 5, TimeUnit.MILLISECONDS)
val lines = ssc.receiverStream(new 
WordReceiver(StorageLevel.MEMORY_AND_DISK_2))
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey((x: Int, y: Int) => x + 
y, 10)
wordCounts.foreachRDD{rdd =>
  rdd.collect().foreach(println)
  throw new RuntimeException
}
ssc.start()
ssc.awaitTermination()
{code}



> Work should start driver, the same to  AM  of yarn 
> ---
>
> Key: SPARK-19264
> URL: https://issues.apache.org/jira/browse/SPARK-19264
> Project: Spark
>  Issue Type: Improvement
>Reporter: hustfxj
>
>   I think work can't start driver by "ProcessBuilderLike",  thus we can't 
> know the application's main thread is finished or not if the application's 
> main thread contains some non-daemon threads. Because the program terminates 
> when there no longer is any non-daemon thread running (or someone called 
> System.exit). The main thread can have finished long ago. 
> worker should  start driver like AM of YARN . As followed:
> {code:title=ApplicationMaster.scala|borderStyle=solid}
>  mainMethod.invoke(null, userArgs.toArray)
>  finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
>  logDebug("Done running users class")
> {code}
> Then the work can monitor the driver's main thread, and know the 
> application's state. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19264) Work should start driver, the same to AM of yarn

2017-01-17 Thread hustfxj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827181#comment-15827181
 ] 

hustfxj commented on SPARK-19264:
-

[~srowen] sorry, I didn't say it clearly. I mean a Spark application can't 
finish while it still contains unfinished non-daemon threads. For example:
{code:title=Test.scala|borderStyle=solid}
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    sparkConf.set("spark.streaming.blockInterval", "1000ms")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    // non-daemon thread
    val scheduledExecutorService = Executors.newSingleThreadScheduledExecutor(
      new ThreadFactory() {
        def newThread(r: Runnable): Thread = new Thread(r, "Driver-Commit-Thread")
      })
    scheduledExecutorService.scheduleAtFixedRate(
      new Runnable() {
        def run() {
          try {
            System.out.println("runnable")
          } catch {
            case e: Exception =>
              System.out.println("ScheduledTask persistAllConsumerOffset exception: " + e)
          }
        }
      }, 1000, 1000 * 5, TimeUnit.MILLISECONDS)
    val lines = ssc.receiverStream(new WordReceiver(StorageLevel.MEMORY_AND_DISK_2))
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey((x: Int, y: Int) => x + y, 10)
    wordCounts.foreachRDD { rdd =>
      rdd.collect().foreach(println)
      throw new RuntimeException
    }
    ssc.start()
    ssc.awaitTermination()
{code}



> Work should start driver, the same to  AM  of yarn 
> ---
>
> Key: SPARK-19264
> URL: https://issues.apache.org/jira/browse/SPARK-19264
> Project: Spark
>  Issue Type: Improvement
>Reporter: hustfxj
>
>   I think work can't start driver by "ProcessBuilderLike",  thus we can't 
> know the application's main thread is finished or not if the application's 
> main thread contains some non-daemon threads. Because the program terminates 
> when there no longer is any non-daemon thread running (or someone called 
> System.exit). The main thread can have finished long ago. 
> worker should  start driver like AM of YARN . As followed:
> {code:title=ApplicationMaster.scala|borderStyle=solid}
>  mainMethod.invoke(null, userArgs.toArray)
>  finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
>  logDebug("Done running users class")
> {code}
> Then the work can monitor the driver's main thread, and know the 
> application's state. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19264) Work should start driver, the same to AM of yarn

2017-01-17 Thread hustfxj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hustfxj updated SPARK-19264:

Description: 
  I think the worker can't start the driver via "ProcessBuilderLike", because 
then we can't tell whether the application's main thread has finished when it 
also contains non-daemon threads: the process only terminates once there is no 
longer any non-daemon thread running (or someone calls System.exit), so the 
main thread may have finished long ago. 
The worker should start the driver the way the AM of YARN does. As follows:

{code:title=ApplicationMaster.scala|borderStyle=solid}
 mainMethod.invoke(null, userArgs.toArray)
 finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
 logDebug("Done running users class")
{code}

Then the worker can monitor the driver's main thread and know the application's 
state. 

  was:
  I think work can't start driver by "ProcessBuilderLike",  thus we can't know 
the application's main thread is finished or not if the application's main 
thread contains some daemon threads. Because the program terminates when there 
no longer is any non-daemon thread running (or someone called System.exit). The 
main thread can have finished long ago. 
worker should  start driver like AM of YARN . As followed:

{code:title=ApplicationMaster.scala|borderStyle=solid}
 mainMethod.invoke(null, userArgs.toArray)
 finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
 logDebug("Done running users class")
{code}

Then the work can monitor the driver's main thread, and know the application's 
state. 


> Work should start driver, the same to  AM  of yarn 
> ---
>
> Key: SPARK-19264
> URL: https://issues.apache.org/jira/browse/SPARK-19264
> Project: Spark
>  Issue Type: Improvement
>Reporter: hustfxj
>
>   I think work can't start driver by "ProcessBuilderLike",  thus we can't 
> know the application's main thread is finished or not if the application's 
> main thread contains some non-daemon threads. Because the program terminates 
> when there no longer is any non-daemon thread running (or someone called 
> System.exit). The main thread can have finished long ago. 
> worker should  start driver like AM of YARN . As followed:
> {code:title=ApplicationMaster.scala|borderStyle=solid}
>  mainMethod.invoke(null, userArgs.toArray)
>  finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
>  logDebug("Done running users class")
> {code}
> Then the work can monitor the driver's main thread, and know the 
> application's state. 
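
A minimal sketch of the proposed idea (this is not Spark's actual Worker code; the class name and arguments are placeholders): invoke the user's main method in-process on a dedicated thread and join it, so the worker observes exactly when the driver's main method returns, as the YARN AM does.
{code}
// Hypothetical: run and monitor the driver's main thread in-process.
val driverClass = Class.forName("com.example.UserDriver")      // placeholder class name
val mainMethod  = driverClass.getMethod("main", classOf[Array[String]])
val userArgs: Array[String] = Array.empty                      // placeholder arguments

val driverThread = new Thread(new Runnable {
  override def run(): Unit = mainMethod.invoke(null, userArgs)
})
driverThread.setName("driver-main-thread")
driverThread.start()

driverThread.join()  // returns when the user's main method returns,
                     // even if other non-daemon threads are still running
// ...the worker could now report SUCCEEDED/FAILED to the master, like the AM's finish()...
{code}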



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19227) Typo in `org.apache.spark.internal.config.ConfigEntry`

2017-01-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827174#comment-15827174
 ] 

Apache Spark commented on SPARK-19227:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16591

> Typo  in `org.apache.spark.internal.config.ConfigEntry`
> ---
>
> Key: SPARK-19227
> URL: https://issues.apache.org/jira/browse/SPARK-19227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Biao Ma
>Priority: Trivial
>  Labels: easyfix
>
> The parameter `defaultValue` does not exist in class 
> `org.apache.spark.internal.config.ConfigEntry`; it is `_defaultValue` in its 
> subclass `ConfigEntryWithDefault`. Also, there are some unused imports. 
> Should we modify this class?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19267) Fix a race condition when stopping StateStore

2017-01-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19267:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Fix a race condition when stopping StateStore
> -
>
> Key: SPARK-19267
> URL: https://issues.apache.org/jira/browse/SPARK-19267
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19267) Fix a race condition when stopping StateStore

2017-01-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827135#comment-15827135
 ] 

Apache Spark commented on SPARK-19267:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16627

> Fix a race condition when stopping StateStore
> -
>
> Key: SPARK-19267
> URL: https://issues.apache.org/jira/browse/SPARK-19267
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19267) Fix a race condition when stopping StateStore

2017-01-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19267:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Fix a race condition when stopping StateStore
> -
>
> Key: SPARK-19267
> URL: https://issues.apache.org/jira/browse/SPARK-19267
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19267) Fix a race condition when stopping StateStore

2017-01-17 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19267:
-
Priority: Minor  (was: Major)

> Fix a race condition when stopping StateStore
> -
>
> Key: SPARK-19267
> URL: https://issues.apache.org/jira/browse/SPARK-19267
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19267) Fix a race condition when stopping StateStore

2017-01-17 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19267:
-
Affects Version/s: 2.0.0
   2.0.1
   2.0.2
   2.1.0

> Fix a race condition when stopping StateStore
> -
>
> Key: SPARK-19267
> URL: https://issues.apache.org/jira/browse/SPARK-19267
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19267) Fix a race condition when stopping StateStore

2017-01-17 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-19267:


 Summary: Fix a race condition when stopping StateStore
 Key: SPARK-19267
 URL: https://issues.apache.org/jira/browse/SPARK-19267
 Project: Spark
  Issue Type: Bug
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17747) WeightCol support non-double datatypes

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17747:
--
Shepherd: Joseph K. Bradley
Assignee: zhengruifeng
Target Version/s: 2.2.0

> WeightCol support non-double datatypes
> --
>
> Key: SPARK-17747
> URL: https://issues.apache.org/jira/browse/SPARK-17747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> WeightCol only supports the double type now; it should also accept other 
> numeric types, such as Int.
> {code}
> scala> df3.show(5)
> +-++--+
> |label|features|weight|
> +-++--+
> |  0.0|(692,[127,128,129...| 1|
> |  1.0|(692,[158,159,160...| 1|
> |  1.0|(692,[124,125,126...| 1|
> |  1.0|(692,[152,153,154...| 1|
> |  1.0|(692,[151,152,153...| 1|
> +-++--+
> only showing top 5 rows
> scala> val lr = new 
> LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
> lr: org.apache.spark.ml.classification.LogisticRegression = 
> logreg_ee0308a72919
> scala> val lrm = lr.fit(df3)
> 16/09/20 15:46:12 WARN LogisticRegression: LogisticRegression training 
> finished but the result is not converged because: max iterations reached
> lrm: org.apache.spark.ml.classification.LogisticRegressionModel = 
> logreg_ee0308a72919
> scala> val lr = new 
> LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setWeightCol("weight")
> lr: org.apache.spark.ml.classification.LogisticRegression = 
> logreg_ced7579d5680
> scala> val lrm = lr.fit(df3)
> 16/09/20 15:46:27 WARN BlockManager: Putting block rdd_211_0 failed
> 16/09/20 15:46:27 ERROR Executor: Exception in task 0.0 in stage 89.0 (TID 92)
> scala.MatchError: 
> [0.0,1,(692,[127,128,129,130,131,154,155,156,157,158,159,181,182,183,184,185,186,187,188,189,207,208,209,210,211,212,213,214,215,216,217,235,236,237,238,239,240,241,242,243,244,245,262,263,264,265,266,267,268,269,270,271,272,273,289,290,291,292,293,294,295,296,297,300,301,302,316,317,318,319,320,321,328,329,330,343,344,345,346,347,348,349,356,357,358,371,372,373,374,384,385,386,399,400,401,412,413,414,426,427,428,429,440,441,442,454,455,456,457,466,467,468,469,470,482,483,484,493,494,495,496,497,510,511,512,520,521,522,523,538,539,540,547,548,549,550,566,567,568,569,570,571,572,573,574,575,576,577,578,594,595,596,597,598,599,600,601,602,603,604,622,623,624,625,626,627,628,629,630,651,652,653,654,655,656,657],[51.0,159.0,253.0,159.0,50.0,48.0,238.0,252.0,252.0,252.0,237.0,54.0,227.0,253.0,252.0,239.0,233.0,252.0,57.0,6.0,10.0,60.0,224.0,252.0,253.0,252.0,202.0,84.0,252.0,253.0,122.0,163.0,252.0,252.0,252.0,253.0,252.0,252.0,96.0,189.0,253.0,167.0,51.0,238.0,253.0,253.0,190.0,114.0,253.0,228.0,47.0,79.0,255.0,168.0,48.0,238.0,252.0,252.0,179.0,12.0,75.0,121.0,21.0,253.0,243.0,50.0,38.0,165.0,253.0,233.0,208.0,84.0,253.0,252.0,165.0,7.0,178.0,252.0,240.0,71.0,19.0,28.0,253.0,252.0,195.0,57.0,252.0,252.0,63.0,253.0,252.0,195.0,198.0,253.0,190.0,255.0,253.0,196.0,76.0,246.0,252.0,112.0,253.0,252.0,148.0,85.0,252.0,230.0,25.0,7.0,135.0,253.0,186.0,12.0,85.0,252.0,223.0,7.0,131.0,252.0,225.0,71.0,85.0,252.0,145.0,48.0,165.0,252.0,173.0,86.0,253.0,225.0,114.0,238.0,253.0,162.0,85.0,252.0,249.0,146.0,48.0,29.0,85.0,178.0,225.0,253.0,223.0,167.0,56.0,85.0,252.0,252.0,252.0,229.0,215.0,252.0,252.0,252.0,196.0,130.0,28.0,199.0,252.0,252.0,253.0,252.0,252.0,233.0,145.0,25.0,128.0,252.0,253.0,252.0,141.0,37.0])]
>  (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>   at 
> org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
>   at 
> org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at 
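One possible interim workaround for the failure quoted above, assuming nothing 
more than casting the weight column to double before fitting ({{df3}} and the 
column name "weight" come from the example; this is a sketch, not part of the fix):

{code:scala}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.functions.col

// Cast the integer weight column to double so that setWeightCol works today.
val df3d = df3.withColumn("weight", col("weight").cast("double"))

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
  .setWeightCol("weight")

val lrm = lr.fit(df3d)
{code}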

[jira] [Commented] (SPARK-19266) DiskStore does not encrypt serialized RDD data

2017-01-17 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827100#comment-15827100
 ] 

Marcelo Vanzin commented on SPARK-19266:


So I started down the path of looking at exactly what's going on here, and it's 
a little more complicated than I first thought (and not as serious as I first 
thought, either). There are 3 cases I can see:

* MEMORY_AND_DISK_SER with enough free memory: data is encrypted in memory. If 
the block is later evicted to disk, the data on disk is encrypted.

* MEMORY_AND_DISK_SER without enough free memory: a chunk of the block may be 
stored in memory, encrypted, but whatever does not fit will be spilled to disk, 
unencrypted. This is in {{BlockManager.doPutIterator}} (see call to 
{{partiallySerializedValues.finishWritingToStream}}).

* DISK_SER: data is encrypted on disk (see call to {{diskStore.put}} in 
{{BlockManager.doPutIterator}} for the deserialized case)

So it seems the only gap is in the second case above, and it should be easy to 
fix. The read part should already be handled, but it would be nice to add unit 
tests.
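As a rough illustration of what closing that second gap amounts to — this is not 
Spark's actual code path, and the key/IV handling is deliberately simplified — 
the stream that the partially serialized remainder is flushed to would need to be 
wrapped in an encrypting stream before it reaches disk, e.g.:

{code:scala}
import java.io.{FileOutputStream, OutputStream}
import javax.crypto.{Cipher, CipherOutputStream}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

// Illustrative only: any bytes spilled to disk go through an encrypting
// stream, so the partially serialized remainder of a block is covered too.
def encryptedDiskStream(path: String, key: Array[Byte], iv: Array[Byte]): OutputStream = {
  val cipher = Cipher.getInstance("AES/CTR/NoPadding")
  cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv))
  new CipherOutputStream(new FileOutputStream(path), cipher)
}
{code}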

> DiskStore does not encrypt serialized RDD data
> --
>
> Key: SPARK-19266
> URL: https://issues.apache.org/jira/browse/SPARK-19266
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>
> {{DiskStore.putBytes()}} writes serialized RDD data directly to disk, without 
> encrypting (or compressing) it. So any cached blocks that are evicted to disk 
> when using {{MEMORY_AND_DISK_SER}} will not be encrypted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing

2017-01-17 Thread Yicheng Luo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827089#comment-15827089
 ] 

Yicheng Luo commented on SPARK-12347:
-

I have a half-baked improvement on my branch (taking some suggestions from the 
PR and using a YAML file to keep track of the list of required arguments). What 
I still need to do is extract the arguments from the examples, but that will 
require a lot of time, since the way the arguments are laid out in the examples 
is really not homogeneous.

> Write script to run all MLlib examples for testing
> --
>
> Key: SPARK-12347
> URL: https://issues.apache.org/jira/browse/SPARK-12347
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib, PySpark, SparkR, Tests
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> It would facilitate testing to have a script which runs all MLlib examples 
> for all languages.
> Design sketch to ensure all examples are run:
> * Generate a list of examples to run programmatically (not from a fixed list).
> * Use a list of special examples to handle examples which require command 
> line arguments.
> * Make sure data, etc. used are small to keep the tests quick.
> This could be broken into subtasks for each language, though it would be nice 
> to provide a single script.
> Not sure where the script should live; perhaps in {{bin/}}?
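A very rough sketch of the discover-and-run idea in Scala, with assumed paths, 
an assumed skip list, and the assumption that {{bin/run-example}} is the 
launcher; none of this reflects a project decision:

{code:scala}
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._
import scala.sys.process._

// Discover ML example classes programmatically and run each one.
val exampleDir = Paths.get("examples/src/main/scala/org/apache/spark/examples/ml")
val needsArgs = Set("DataFrameExample")  // assumed skip list for examples needing arguments

val names = Files.walk(exampleDir).iterator().asScala
  .map(_.getFileName.toString)
  .filter(_.endsWith(".scala"))
  .map(_.stripSuffix(".scala"))
  .filterNot(needsArgs)
  .toList

names.foreach { name =>
  val exitCode = Seq("bin/run-example", s"ml.$name").!
  println(s"$name finished with exit code $exitCode")
}
{code}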



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13721) Add support for LATERAL VIEW OUTER explode()

2017-01-17 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-13721:
--
Assignee: Bogdan Raducanu

> Add support for LATERAL VIEW OUTER explode()
> 
>
> Key: SPARK-13721
> URL: https://issues.apache.org/jira/browse/SPARK-13721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Ian Hellstrom
>Assignee: Bogdan Raducanu
> Fix For: 2.2.0
>
>
> Hive supports the [LATERAL VIEW 
> OUTER|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView#LanguageManualLateralView-OuterLateralViews]
>  syntax to make sure that when an array is empty, the content from the outer 
> table is still returned. 
> Within Spark, this is currently only possible within the HiveContext and 
> executing HiveQL statements. It would be nice if the standard explode() 
> DataFrame method allows the same. A possible signature would be: 
> {code:scala}
> explode[A, B](inputColumn: String, outputColumn: String, outer: Boolean = 
> false)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-01-17 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827066#comment-15827066
 ] 

Shivaram Venkataraman commented on SPARK-15799:
---

The work on this is mostly complete, but you can put me down as a Shepherd; 
if required, I can create more sub-tasks etc.

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14567) Add instrumentation logs to MLlib training algorithms

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-14567.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

> Add instrumentation logs to MLlib training algorithms
> -
>
> Key: SPARK-14567
> URL: https://issues.apache.org/jira/browse/SPARK-14567
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
> Fix For: 2.2.0
>
>
> In order to debug performance issues when training mllib algorithms,
> it is useful to log some metrics about the training dataset, the training 
> parameters, etc.
> This ticket is an umbrella to add some simple logging messages to the most 
> common MLlib estimators. There should be no performance impact on the current 
> implementation, and the output is simply printed in the logs.
> Here are some values that are of interest when debugging training tasks:
> * number of features
> * number of instances
> * number of partitions
> * number of classes
> * input RDD/DF cache level
> * hyper-parameters
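A minimal sketch of the kind of logging being proposed — the object and method 
names below are illustrative, not the actual instrumentation API that was added:

{code:scala}
import org.apache.spark.sql.Dataset
import org.slf4j.LoggerFactory

// Illustrative only: log basic facts about the training data and parameters;
// writing to the logs has no effect on training itself.
object TrainingInstrumentation {
  private val log = LoggerFactory.getLogger(getClass)

  def logTrainingSummary(dataset: Dataset[_], params: Map[String, Any]): Unit = {
    log.info(s"training: numPartitions=${dataset.rdd.getNumPartitions}")
    log.info(s"training: storageLevel=${dataset.rdd.getStorageLevel}")
    params.foreach { case (name, value) => log.info(s"training: param $name=$value") }
  }
}
{code}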



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13721) Add support for LATERAL VIEW OUTER explode()

2017-01-17 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-13721.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

> Add support for LATERAL VIEW OUTER explode()
> 
>
> Key: SPARK-13721
> URL: https://issues.apache.org/jira/browse/SPARK-13721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Ian Hellstrom
> Fix For: 2.2.0
>
>
> Hive supports the [LATERAL VIEW 
> OUTER|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView#LanguageManualLateralView-OuterLateralViews]
>  syntax to make sure that when an array is empty, the content from the outer 
> table is still returned. 
> Within Spark, this is currently only possible within the HiveContext and 
> executing HiveQL statements. It would be nice if the standard explode() 
> DataFrame method allows the same. A possible signature would be: 
> {code:scala}
> explode[A, B](inputColumn: String, outputColumn: String, outer: Boolean = 
> false)
> {code}
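For reference, a minimal sketch of the outer behaviour from the DataFrame side, 
assuming the {{explode_outer}}-style function shape that the fix introduced; 
{{df}} and its columns are made up:

{code:scala}
import org.apache.spark.sql.functions.{col, explode_outer}

// Rows whose "items" array is empty or null are kept, with a null "item",
// instead of being dropped as plain explode() would do.
val exploded = df.select(col("id"), explode_outer(col("items")).as("item"))

// SQL equivalent, mirroring HiveQL:
//   SELECT id, item FROM tbl LATERAL VIEW OUTER explode(items) t AS item
{code}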



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18206) Log instrumentation in MPC, NB, LDA, AFT, GLR, Isotonic, LinReg

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-18206.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 15671
[https://github.com/apache/spark/pull/15671]

> Log instrumentation in MPC, NB, LDA, AFT, GLR, Isotonic, LinReg
> ---
>
> Key: SPARK-18206
> URL: https://issues.apache.org/jira/browse/SPARK-18206
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.2.0
>
>
> See parent JIRA



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7146) Should ML sharedParams be a public API?

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7146:
-
Target Version/s:   (was: 2.2.0)

> Should ML sharedParams be a public API?
> ---
>
> Key: SPARK-7146
> URL: https://issues.apache.org/jira/browse/SPARK-7146
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Proposal: Make most of the Param traits in sharedParams.scala public.  Mark 
> them as DeveloperApi.
> Pros:
> * Sharing the Param traits helps to encourage standardized Param names and 
> documentation.
> Cons:
> * Users have to be careful since parameters can have different meanings for 
> different algorithms.
> * If the shared Params are public, then implementations could test for the 
> traits.  It is unclear if we want users to rely on these traits, which are 
> somewhat experimental.
> Currently, the shared params are private.
> h3. UPDATED proposal
> * Some Params are clearly safe to make public.  We will do so.
> * Some Params could be made public but may require caveats in the trait doc.
> * Some Params have turned out not to be shared in practice.  We can move 
> those Params to the classes which use them.
> *Public shared params*:
> * I/O column params
> ** HasFeaturesCol
> ** HasInputCol
> ** HasInputCols
> ** HasLabelCol
> ** HasOutputCol
> ** HasPredictionCol
> ** HasProbabilityCol
> ** HasRawPredictionCol
> ** HasVarianceCol
> ** HasWeightCol
> * Algorithm settings
> ** HasCheckpointInterval
> ** HasElasticNetParam
> ** HasFitIntercept
> ** HasMaxIter
> ** HasRegParam
> ** HasSeed
> ** HasStandardization (less common)
> ** HasStepSize
> ** HasTol
> *Questionable params*:
> * HasHandleInvalid (only used in StringIndexer, but might be more widely used 
> later on)
> * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but 
> same meaning as Optimizer in LDA)
> *Params to be removed from sharedParams*:
> * HasThreshold (only used in LogisticRegression)
> * HasThresholds (only used in ProbabilisticClassifier)
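For context, a rough sketch of the shape of one such shared trait if exposed 
(modeled on HasMaxIter; this mirrors the general pattern rather than reproducing 
the generated sharedParams.scala code):

{code:scala}
import org.apache.spark.ml.param.{IntParam, Params, ParamValidators}

// Mixing this in gives an algorithm a "maxIter" param with a shared name,
// doc string, and validator.
trait HasMaxIterSketch extends Params {
  final val maxIter: IntParam =
    new IntParam(this, "maxIter", "maximum number of iterations (>= 0)", ParamValidators.gtEq(0))

  final def getMaxIter: Int = $(maxIter)
}
{code}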



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10931) PySpark ML Models should contain Param values

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10931:
--
Shepherd:   (was: Joseph K. Bradley)

> PySpark ML Models should contain Param values
> -
>
> Key: SPARK-10931
> URL: https://issues.apache.org/jira/browse/SPARK-10931
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> PySpark spark.ml Models are generally wrappers around Java objects and do not 
> even contain Param values.  This JIRA is for copying the Param values from 
> the Estimator to the model.
> This can likely be solved by modifying Estimator.fit to copy Param values, 
> but should also include proper unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7424) spark.ml classification, regression abstractions should add metadata to output column

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7424:
-
Target Version/s:   (was: 2.2.0)

> spark.ml classification, regression abstractions should add metadata to 
> output column
> -
>
> Key: SPARK-7424
> URL: https://issues.apache.org/jira/browse/SPARK-7424
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Update ClassificationModel, ProbabilisticClassificationModel prediction to 
> include numClasses in output column metadata.
> Update RegressionModel to specify output column metadata as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14501) spark.ml parity for fpm - frequent items

2017-01-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827052#comment-15827052
 ] 

Joseph K. Bradley commented on SPARK-14501:
---

I'm doing a general pass to encourage the Shepherd requirement from the roadmap 
process in [SPARK-18813].

Is someone interested in shepherding this issue, or shall I remove the target 
version?

I'd really like to get this into 2.2 but don't have time to review it right 
now.  Could another committer take it?

> spark.ml parity for fpm - frequent items
> 
>
> Key: SPARK-14501
> URL: https://issues.apache.org/jira/browse/SPARK-14501
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This is an umbrella for porting the spark.mllib.fpm subpackage to spark.ml.
> I am initially creating a single subtask, which will require a brief design 
> doc for the DataFrame-based API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10931) PySpark ML Models should contain Param values

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10931:
--
Target Version/s:   (was: 2.2.0)

> PySpark ML Models should contain Param values
> -
>
> Key: SPARK-10931
> URL: https://issues.apache.org/jira/browse/SPARK-10931
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> PySpark spark.ml Models are generally wrappers around Java objects and do not 
> even contain Param values.  This JIRA is for copying the Param values from 
> the Estimator to the model.
> This can likely be solved by modifying Estimator.fit to copy Param values, 
> but should also include proper unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15571) Pipeline unit test improvements for 2.2

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15571:
--
Shepherd: Joseph K. Bradley

> Pipeline unit test improvements for 2.2
> ---
>
> Key: SPARK-15571
> URL: https://issues.apache.org/jira/browse/SPARK-15571
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> Issue:
> * There are several pieces of standard functionality shared by all 
> algorithms: Params, UIDs, fit/transform/save/load, etc.  Currently, these 
> pieces are generally tested in ad hoc tests for each algorithm.
> * This has led to inconsistent coverage, especially within the Python API.
> Goal:
> * Standardize unit tests for Scala and Python to improve and consolidate test 
> coverage for Params, persistence, and other common functionality.
> * Eliminate duplicate code.  Improve test coverage.  Simplify adding these 
> standard unit tests for future algorithms and APIs.
> This will require several subtasks.  If you identify an issue, please create 
> a subtask, or comment below if the issue needs to be discussed first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14659) OneHotEncoder support drop first category alphabetically in the encoded vector

2017-01-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827041#comment-15827041
 ] 

Joseph K. Bradley commented on SPARK-14659:
---

I'm doing a general pass to encourage the Shepherd requirement from the roadmap 
process in [SPARK-18813].

Is someone interested in shepherding this issue, or shall I remove the target 
version?

> OneHotEncoder support drop first category alphabetically in the encoded 
> vector 
> ---
>
> Key: SPARK-14659
> URL: https://issues.apache.org/jira/browse/SPARK-14659
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> R formula drops the first category alphabetically when encoding a string/category 
> feature. Spark RFormula uses OneHotEncoder to encode string/category features 
> into a vector, but only supports "dropLast" by string/category frequencies. 
> This will cause SparkR to produce different models compared with native R.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14706) Python ML persistence integration test

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14706:
--
Target Version/s:   (was: 2.2.0)

> Python ML persistence integration test
> --
>
> Key: SPARK-14706
> URL: https://issues.apache.org/jira/browse/SPARK-14706
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> Goal: extend integration test in {{ml/tests.py}}.
> In the {{PersistenceTest}} suite, there is a method {{_compare_pipelines}}.  
> This issue includes:
> * Extending {{_compare_pipelines}} to handle CrossValidator, 
> TrainValidationSplit, and OneVsRest
> * Adding an integration test in PersistenceTest which includes nested 
> meta-algorithms.  E.g.: {{Pipeline[ CrossValidator[ TrainValidationSplit[ 
> OneVsRest[ LogisticRegression ] ] ] ]}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-01-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827038#comment-15827038
 ] 

Joseph K. Bradley commented on SPARK-15799:
---

I'm doing a general pass to encourage the Shepherd requirement from the roadmap 
process in [SPARK-18813].

Is someone interested in shepherding this issue, or shall I remove the target 
version?

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16578) Configurable hostname for RBackend

2017-01-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827036#comment-15827036
 ] 

Joseph K. Bradley commented on SPARK-16578:
---

I'm doing a general pass to encourage the Shepherd requirement from the roadmap 
process in [SPARK-18813].

Is someone interested in shepherding this issue, or shall I remove the target 
version?

> Configurable hostname for RBackend
> --
>
> Key: SPARK-16578
> URL: https://issues.apache.org/jira/browse/SPARK-16578
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Junyang Qian
>
> One of the requirements that comes up with SparkR being a standalone package 
> is that users can now install just the R package on the client side and 
> connect to a remote machine which runs the RBackend class.
> We should check whether we can support this mode of execution and what the 
> pros / cons of it are.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18011) SparkR serialize "NA" throws exception

2017-01-17 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827032#comment-15827032
 ] 

Miao Wang edited comment on SPARK-18011 at 1/17/17 11:27 PM:
-

[~felixcheung] I did some intensive debugging on both the R side and the Scala side.

On the R side, I debugged `createDataFrame.default` and `parallelize`, which 
convert the data.frame into an RDD and then a DataFrame. The conversion of the 
data into an RDD is done in `parallelize`:
 sliceLen <- ceiling(length(coll) / numSlices)
 slices <- split(coll, rep(1: (numSlices + 1), each = sliceLen)[1:length(coll)])
 serializedSlices <- lapply(slices, serialize, connection = NULL)

I added a debug message after the `serialize` call:
lapply(serializedSlices, function(x) {message(paste("unserialized ", 
unserialize(x)))})

The `NA` value is unserialized successfully. 

Then the serialized data is transferred to the Scala side by jrdd <- 
callJStatic("org.apache.spark.api.r.RRDD", "createRDDFromArray", sc, 
serializedSlices), which returns a handle to the RDD in `jrdd` that is later 
used by `createDataFrame.default`.

I did not find anything wrong here.

On the Scala side, the problem happens in 

def readString(in: DataInputStream): String = {
    val len = in.readInt() // <=== it hits the problem here when reading `NA` as a string
    readStringBytes(in, len)
  }

Then I changed the logic as follows:
 def readString(in: DataInputStream): String = {
    var len = in.readInt()
    if (len < 0) {
      len = 3 // <= force reading 3 bytes in this case, because I believe this is the `NA` case
    }
    readStringBytes(in, len)
  }

Then I ran the following commands in sparkR:

> a <- as.Date(NA)
> b <- as.data.frame(a)
> c <- collect(select(createDataFrame(b), "*"))
> c
   a
1 NA

It executes correctly without hitting the exception handling (I added debug 
information to the handling logic; if it were hit, an error message would be 
printed on the console, and I verified that it is printed without the change 
above).

So we can conclude that the problem is caused by the `serialize` function in my 
local R installation, which serializes `NA` as a string without packing its 
length before the actual value. Since `unserialize` can decode the serialized 
data, this protocol appears to be by design in R when handling `NA` of `Date` 
type. I could not find the implementation of `serialize` in the R source code; 
it calls .Internal(serialize(object, connection, type, version, refhook)).

For the fix, we can either leave it as it is, covered by the existing exception 
handling, or explicitly handle a negative length in readString.

What do you think? Thanks!  



was (Author: wm624):
[~felixcheung] I did intensive debug in both R side and scala side.

On R side, I debugged `createDataFrame.default` and `parallelize`, which 
converts the data.frame into RDD and DataFrame. The code of turning the data 
into RDD is done in `parallelize`:
 sliceLen <- ceiling(length(coll) / numSlices)
 slices <- split(coll, rep(1: (numSlices + 1), each = sliceLen)[1:length(coll)])
 serializedSlices <- lapply(slices, serialize, connection = NULL)

I add debug message after the `serialize`:
lapply(serializedSlices, function(x) {message(paste("unserialized ", 
unserialize(x)))})

The data `NA` is unserialized successfully. 

Then, the serialized data is transferred to Scala side by jrdd <- 
callJStatic("org.apache.spark.api.r.RRDD", "createRDDFromArray", sc, 
serializedSlices) and returns a handle of the RDD in `jrdd`, which is later 
used by `createDataFrame.default`.

I did not find anything wrong here.

On the Scala side, the problem happens in 

def readString(in: DataInputStream): String = {
val len = in.readInt() <=== it encounters the problem when reading `NA` as 
a string.
readStringBytes(in, len)
  }

Then, I changed the logic as follows:
 def readString(in: DataInputStream): String = {
var len = in.readInt()
if (len < 0) {
  len = 3<= I enforce reading 3 bytes in this case, because I believe 
that it is the case of `NA`
}
readStringBytes(in, len)
  }

Then I run the following commands in sparkR:

> a <- as.Date(NA)
> b <- as.data.frame(a)
> c <- collect(select(createDataFrame(b), "*"))
> c
   a
1 NA
It executes correctly without hitting the exception handling (I add debug 
information in the handling logic. If it is hit, error message will be print on 
the console and I verified that it is print out without the above logic).

So, we can conclude that the problem is caused by `serialize` function with my 
local R installation, which serialize `NA` as string without packing its length 
before the actual value. Since `unserialize` can decode the seralized data, 
this protocol should be by R design when handling `NA` as `Date` type. I don't 
find the source code of `serialize` in R source code, which calls 
Internal(serialize(object, connection, type, version, refhook))

For the fix, we can either 

[jira] [Commented] (SPARK-18011) SparkR serialize "NA" throws exception

2017-01-17 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827032#comment-15827032
 ] 

Miao Wang commented on SPARK-18011:
---

[~felixcheung] I did some intensive debugging on both the R side and the Scala side.

On the R side, I debugged `createDataFrame.default` and `parallelize`, which 
convert the data.frame into an RDD and then a DataFrame. The conversion of the 
data into an RDD is done in `parallelize`:
 sliceLen <- ceiling(length(coll) / numSlices)
 slices <- split(coll, rep(1: (numSlices + 1), each = sliceLen)[1:length(coll)])
 serializedSlices <- lapply(slices, serialize, connection = NULL)

I added a debug message after the `serialize` call:
lapply(serializedSlices, function(x) {message(paste("unserialized ", 
unserialize(x)))})

The `NA` value is unserialized successfully. 

Then the serialized data is transferred to the Scala side by jrdd <- 
callJStatic("org.apache.spark.api.r.RRDD", "createRDDFromArray", sc, 
serializedSlices), which returns a handle to the RDD in `jrdd` that is later 
used by `createDataFrame.default`.

I did not find anything wrong here.

On the Scala side, the problem happens in 

def readString(in: DataInputStream): String = {
    val len = in.readInt() // <=== it hits the problem here when reading `NA` as a string
    readStringBytes(in, len)
  }

Then I changed the logic as follows:
 def readString(in: DataInputStream): String = {
    var len = in.readInt()
    if (len < 0) {
      len = 3 // <= force reading 3 bytes in this case, because I believe this is the `NA` case
    }
    readStringBytes(in, len)
  }

Then I ran the following commands in sparkR:

> a <- as.Date(NA)
> b <- as.data.frame(a)
> c <- collect(select(createDataFrame(b), "*"))
> c
   a
1 NA

It executes correctly without hitting the exception handling (I added debug 
information to the handling logic; if it were hit, an error message would be 
printed on the console, and I verified that it is printed without the change 
above).

So we can conclude that the problem is caused by the `serialize` function in my 
local R installation, which serializes `NA` as a string without packing its 
length before the actual value. Since `unserialize` can decode the serialized 
data, this protocol appears to be by design in R when handling `NA` of `Date` 
type. I could not find the implementation of `serialize` in the R source code; 
it calls .Internal(serialize(object, connection, type, version, refhook)).

For the fix, we can either leave it as it is, covered by the existing exception 
handling, or explicitly handle a negative length in readString.

What do you think? Thanks!  


> SparkR serialize "NA" throws exception
> --
>
> Key: SPARK-18011
> URL: https://issues.apache.org/jira/browse/SPARK-18011
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Miao Wang
>
> For some versions of R, if a Date column has an "NA" field, the backend will 
> throw a negative index exception.
> To reproduce the problem:
> {code}
> > a <- as.Date(c("2016-11-11", "NA"))
> > b <- as.data.frame(a)
> > c <- createDataFrame(b)
> > dim(c)
> 16/10/19 10:31:24 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.NegativeArraySizeException
>   at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110)
>   at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119)
>   at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:128)
>   at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:77)
>   at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:61)
>   at 
> org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:161)
>   at 
> org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:160)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.Range.foreach(Range.scala:160)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at org.apache.spark.sql.api.r.SQLUtils$.bytesToRow(SQLUtils.scala:160)
>   at 
> org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
>   at 
> org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  

[jira] [Created] (SPARK-19266) DiskStore does not encrypt serialized RDD data

2017-01-17 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-19266:
--

 Summary: DiskStore does not encrypt serialized RDD data
 Key: SPARK-19266
 URL: https://issues.apache.org/jira/browse/SPARK-19266
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Marcelo Vanzin


{{DiskStore.putBytes()}} writes serialized RDD data directly to disk, without 
encrypting (or compressing) it. So any cached blocks that are evicted to disk 
when using {{MEMORY_AND_DISK_SER}} will not be encrypted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17455) IsotonicRegression takes non-polynomial time for some inputs

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17455:
--
Shepherd: Joseph K. Bradley

> IsotonicRegression takes non-polynomial time for some inputs
> 
>
> Key: SPARK-17455
> URL: https://issues.apache.org/jira/browse/SPARK-17455
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2, 2.0.0
>Reporter: Nic Eggert
>
> The Pool Adjacent Violators Algorithm (PAVA) implementation that's currently 
> in MLlib can take O(N!) time for certain inputs, when it should have 
> worst-case complexity of O(N^2).
> To reproduce this, I pulled the private method poolAdjacentViolators out of 
> mllib.regression.IsotonicRegression and into a benchmarking harness.
> Given this input
> {code}
> val x = (1 to length).toArray.map(_.toDouble)
> val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 
> else yi}
> val w = Array.fill(length)(1d)
> val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, 
> x), w) => (y, x, w)}
> {code}
> I vary the length of the input to get these timings:
> || Input Length || Time (us) ||
> | 100 | 1.35 |
> | 200 | 3.14 | 
> | 400 | 116.10 |
> | 800 | 2134225.90 |
> (tests were performed using 
> https://github.com/sirthias/scala-benchmarking-template)
> I can also confirm that I run into this issue on a real dataset I'm working 
> on when trying to calibrate random forest probability output. Some partitions 
> take > 12 hours to run. This isn't a skew issue, since the largest partitions 
> finish in minutes. I can only assume that some partitions cause something 
> approaching this worst-case complexity.
> I'm working on a patch that borrows the implementation that is used in 
> scikit-learn and the R "iso" package, both of which handle this particular 
> input in linear time and are quadratic in the worst case.
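For reference, the pooling approach used by those implementations can be 
sketched roughly as follows; this is an illustration of the algorithm, not the 
patch itself:

{code:scala}
import scala.collection.mutable.ArrayBuffer

// Pool Adjacent Violators on (label, feature, weight) triples already sorted by
// feature: blocks are merged with a stack, so each point is pushed and popped
// at most once.
def poolAdjacentViolators(input: Array[(Double, Double, Double)]): Array[Double] = {
  case class Block(var sum: Double, var weight: Double, var count: Int) {
    def mean: Double = sum / weight
  }
  val blocks = ArrayBuffer.empty[Block]
  for ((y, _, w) <- input) {
    blocks += Block(y * w, w, 1)
    // Merge backwards while the non-decreasing constraint is violated.
    while (blocks.length > 1 && blocks(blocks.length - 2).mean > blocks.last.mean) {
      val top = blocks.remove(blocks.length - 1)
      val prev = blocks.last
      prev.sum += top.sum
      prev.weight += top.weight
      prev.count += top.count
    }
  }
  blocks.toArray.flatMap(b => Array.fill(b.count)(b.mean))
}
{code}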



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18348) Improve tree ensemble model summary

2017-01-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827023#comment-15827023
 ] 

Joseph K. Bradley commented on SPARK-18348:
---

I'm doing a general pass to enforce the Shepherd requirement from the roadmap 
process in [SPARK-18813].  Note that the process is a new proposal and can be 
updated as needed.

Is someone interested in shepherding this issue, or shall I remove the target 
version?


> Improve tree ensemble model summary
> ---
>
> Key: SPARK-18348
> URL: https://issues.apache.org/jira/browse/SPARK-18348
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Felix Cheung
>
> During work on the R APIs for tree ensemble models (e.g. Random Forest, GBT) it 
> was discovered and discussed that
> - we don't have a good summary of nodes or trees covering their observations, 
> loss, probability and so on
> - we don't have a shared API with nicely formatted output
> We believe this could be a shared API that benefits multiple language 
> bindings, including R, when available.
> For example, here is what R {code}rpart{code} shows for model summary:
> {code}
> Call:
> rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
> method = "class")
>   n= 81
>   CP nsplit rel error    xerror      xstd
> 1 0.17647059  0 1.000 1.000 0.2155872
> 2 0.01960784  1 0.8235294 0.9411765 0.2107780
> 3 0.0100  4 0.7647059 1.0588235 0.2200975
> Variable importance
>  StartAge Number
> 64 24 12
> Node number 1: 81 observations,complexity param=0.1764706
>   predicted class=absent   expected loss=0.2098765  P(node) =1
> class counts:    64    17
>probabilities: 0.790 0.210
>   left son=2 (62 obs) right son=3 (19 obs)
>   Primary splits:
>   Start  < 8.5  to the right, improve=6.762330, (0 missing)
>   Number < 5.5  to the left,  improve=2.866795, (0 missing)
>   Age< 39.5 to the left,  improve=2.250212, (0 missing)
>   Surrogate splits:
>   Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)
> Node number 2: 62 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
> class counts:56 6
>probabilities: 0.903 0.097
>   left son=4 (29 obs) right son=5 (33 obs)
>   Primary splits:
>   Start  < 14.5 to the right, improve=1.0205280, (0 missing)
>   Age< 55   to the left,  improve=0.6848635, (0 missing)
>   Number < 4.5  to the left,  improve=0.2975332, (0 missing)
>   Surrogate splits:
>   Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
>   Age< 16   to the left,  agree=0.597, adj=0.138, (0 split)
> Node number 3: 19 observations
>   predicted class=present  expected loss=0.4210526  P(node) =0.2345679
> class counts:     8    11
>probabilities: 0.421 0.579
> Node number 4: 29 observations
>   predicted class=absent   expected loss=0  P(node) =0.3580247
> class counts:29 0
>probabilities: 1.000 0.000
> Node number 5: 33 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
> class counts:27 6
>probabilities: 0.818 0.182
>   left son=10 (12 obs) right son=11 (21 obs)
>   Primary splits:
>   Age< 55   to the left,  improve=1.2467530, (0 missing)
>   Start  < 12.5 to the right, improve=0.2887701, (0 missing)
>   Number < 3.5  to the right, improve=0.1753247, (0 missing)
>   Surrogate splits:
>   Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
>   Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)
> Node number 10: 12 observations
>   predicted class=absent   expected loss=0  P(node) =0.1481481
> class counts:12 0
>probabilities: 1.000 0.000
> Node number 11: 21 observations,complexity param=0.01960784
>   predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
> class counts:15 6
>probabilities: 0.714 0.286
>   left son=22 (14 obs) right son=23 (7 obs)
>   Primary splits:
>   Age< 111  to the right, improve=1.71428600, (0 missing)
>   Start  < 12.5 to the right, improve=0.79365080, (0 missing)
>   Number < 3.5  to the right, improve=0.07142857, (0 missing)
> Node number 22: 14 observations
>   predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
> class counts:12 2
>probabilities: 0.857 0.143
> Node number 23: 7 observations
>   predicted class=present  expected loss=0.4285714  P(node) =0.08641975
> class counts: 3 4
>probabilities: 0.429 0.571
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To 

[jira] [Commented] (SPARK-18569) Support R formula arithmetic

2017-01-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827021#comment-15827021
 ] 

Joseph K. Bradley commented on SPARK-18569:
---

+1 for putting together a design doc for RFormula to help us consider these 
edge cases and decide on priorities for adding functionality.  [~felixcheung] 
do you want to shepherd this for 2.2, or shall I remove the target?

> Support R formula arithmetic 
> -
>
> Key: SPARK-18569
> URL: https://issues.apache.org/jira/browse/SPARK-18569
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Felix Cheung
>
> I think we should support arithmetic, which makes it a lot more convenient to 
> build models. Something like
> {code}
>   log(y) ~ a + log(x)
> {code}
> And to avoid resolution confusion we should support the I() operator:
> {code}
> I
>  I(X∗Z) as is: include a new variable consisting of these variables multiplied
> {code}
> Such that this works:
> {code}
> y ~ a + I(b+c)
> {code}
> the term b+c is to be interpreted as the sum of b and c.
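For contrast, the plain additive form already works through spark.ml's RFormula; 
the arithmetic and I() terms above are the part this issue would add. A small 
sketch ({{df}} and the column names are made up):

{code:scala}
import org.apache.spark.ml.feature.RFormula

// Works today: a plain additive formula.
val rf = new RFormula()
  .setFormula("y ~ a + b + c")
  .setFeaturesCol("features")
  .setLabelCol("label")

val prepared = rf.fit(df).transform(df)

// Not supported yet (the subject of this issue):
//   log(y) ~ a + log(x)
//   y ~ a + I(b + c)
{code}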



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18618) SparkR GLM model predict should support type as a argument

2017-01-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827019#comment-15827019
 ] 

Joseph K. Bradley commented on SPARK-18618:
---

[~yanboliang] Will you be willing to shepherd this for the 2.2 release?

> SparkR GLM model predict should support type as a argument
> --
>
> Key: SPARK-18618
> URL: https://issues.apache.org/jira/browse/SPARK-18618
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>  Labels: 2.2.0
>
> SparkR GLM model {{predict}} should support {{type}} as an argument. This will 
> make it consistent with native R predict, such as 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18917) Dataframe - Time Out Issues / Taking long time in append mode on object stores

2017-01-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18917.
-
   Resolution: Fixed
 Assignee: Reynold Xin
Fix Version/s: 2.2.0

> Dataframe - Time Out Issues / Taking long time in append mode on object stores
> --
>
> Key: SPARK-18917
> URL: https://issues.apache.org/jira/browse/SPARK-18917
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2, SQL, YARN
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Anbu Cheeralan
>Assignee: Reynold Xin
>Priority: Minor
> Fix For: 2.2.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> When using Dataframe write in append mode on object stores (S3 / Google 
> Storage), the writes take a long time or hit read timeouts. This is because 
> dataframe.write lists all leaf folders in the target directory. If there are 
> a lot of subfolders due to partitioning, this takes forever.
> In org.apache.spark.sql.execution.datasources.DataSource.write(), the 
> following code causes a huge number of RPC calls when the file system is an 
> Object Store (S3, GS):
> if (mode == SaveMode.Append) {
>   val existingPartitionColumns = Try {
>     resolveRelation()
>       .asInstanceOf[HadoopFsRelation]
>       .location
>       .partitionSpec()
>       .partitionColumns
>       .fieldNames
>       .toSeq
>   }.getOrElse(Seq.empty[String])
> There should be a flag to skip Partition Match Check in append mode. I can 
> work on the patch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3162) Train DecisionTree locally when possible

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3162:
-
Target Version/s:   (was: 2.2.0)

> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.  
> However, at deeper levels in the tree, it is possible that a small set of 
> training data will be matched with any given node.  If the node’s training 
> data can fit in one machine's memory, it may be more efficient to shuffle the 
> data and do local training for the rest of the subtree rooted at that node.
> Note: It is possible that local training would become possible at different 
> levels in different branches of the tree.  There are multiple options for 
> handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained 
> locally.  This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with 
> distributed training of the other branches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing

2017-01-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826997#comment-15826997
 ] 

Joseph K. Bradley commented on SPARK-12347:
---

I really want this to get in but just don't have time to work on it right now.  
I'm going to remove the target version.  However, if another person can 
shepherd it, please feel free to reinstate that field.

> Write script to run all MLlib examples for testing
> --
>
> Key: SPARK-12347
> URL: https://issues.apache.org/jira/browse/SPARK-12347
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib, PySpark, SparkR, Tests
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> It would facilitate testing to have a script which runs all MLlib examples 
> for all languages.
> Design sketch to ensure all examples are run:
> * Generate a list of examples to run programmatically (not from a fixed list).
> * Use a list of special examples to handle examples which require command 
> line arguments.
> * Make sure data, etc. used are small to keep the tests quick.
> This could be broken into subtasks for each language, though it would be nice 
> to provide a single script.
> Not sure where the script should live; perhaps in {{bin/}}?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12347) Write script to run all MLlib examples for testing

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12347:
--
Target Version/s:   (was: 2.2.0)

> Write script to run all MLlib examples for testing
> --
>
> Key: SPARK-12347
> URL: https://issues.apache.org/jira/browse/SPARK-12347
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib, PySpark, SparkR, Tests
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> It would facilitate testing to have a script which runs all MLlib examples 
> for all languages.
> Design sketch to ensure all examples are run:
> * Generate a list of examples to run programmatically (not from a fixed list).
> * Use a list of special examples to handle examples which require command 
> line arguments.
> * Make sure data, etc. used are small to keep the tests quick.
> This could be broken into subtasks for each language, though it would be nice 
> to provide a single script.
> Not sure where the script should live; perhaps in {{bin/}}?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18613) spark.ml LDA classes should not expose spark.mllib in APIs

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18613:
--
Shepherd: Joseph K. Bradley

> spark.ml LDA classes should not expose spark.mllib in APIs
> --
>
> Key: SPARK-18613
> URL: https://issues.apache.org/jira/browse/SPARK-18613
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> spark.ml.LDAModel exposes dependencies on spark.mllib in 2 methods, but it 
> should not:
> * {{def oldLocalModel: OldLocalLDAModel}}
> * {{def getModel: OldLDAModel}}
> This task is to deprecate those methods.  I recommend creating 
> {{private[ml]}} versions of the methods which are used internally in order to 
> avoid deprecation warnings.
> Setting target for 2.2, but I'm OK with getting it into 2.1 if we have time.
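The recommendation in the description can be sketched as follows; the class and 
field names here are placeholders rather than the real LDAModel internals:

{code:scala}
package org.apache.spark.ml.clustering  // needed for the private[ml] qualifier

import org.apache.spark.mllib.clustering.{LocalLDAModel => OldLocalLDAModel}

class LDAModelSketch private[ml] (private val oldModel: OldLocalLDAModel) {

  // Kept public for compatibility, but flagged for removal.
  @deprecated("This exposes a spark.mllib type and will be removed.", "2.2.0")
  def oldLocalModel: OldLocalLDAModel = internalOldLocalModel

  // private[ml] twin used by internal callers, so they avoid deprecation warnings.
  private[ml] def internalOldLocalModel: OldLocalLDAModel = oldModel
}
{code}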



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR

2017-01-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826996#comment-15826996
 ] 

Joseph K. Bradley commented on SPARK-18924:
---

Per the 2.2 roadmap process, I'm going to add [~mengxr] as the shepherd, but 
others here are free to take that role instead.

> Improve collect/createDataFrame performance in SparkR
> -
>
> Key: SPARK-18924
> URL: https://issues.apache.org/jira/browse/SPARK-18924
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Xiangrui Meng
>Priority: Critical
>
> SparkR has its own SerDe for data serialization between JVM and R.
> The SerDe on the JVM side is implemented in:
> * 
> [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
> * 
> [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]
> The SerDe on the R side is implemented in:
> * 
> [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
> * 
> [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]
> The serialization between the JVM and R suffers from huge storage and 
> computation overhead. For example, a short round trip of 1 million doubles 
> surprisingly took 3 minutes on my laptop:
> {code}
> > system.time(collect(createDataFrame(data.frame(x=runif(1e6)))))
>user  system elapsed
>  14.224   0.582 189.135
> {code}
> Collecting a medium-sized DataFrame to local and continuing with a local R 
> workflow is a use case we should pay attention to. SparkR will never be able 
> to cover all existing features from CRAN packages. It is also unnecessary for 
> Spark to do so because not all features need scalability. 
> Several factors contribute to the serialization overhead:
> 1. The SerDe on the R side is implemented using high-level R methods.
> 2. DataFrame columns are not efficiently serialized, primitive type columns 
> in particular.
> 3. Some overhead in the serialization protocol/impl.
> 1) may have been discussed before, because R packages like rJava existed 
> before SparkR. I'm not sure whether we have a license issue in depending on 
> those libraries. Another option is to switch to the low-level R C interface 
> or Rcpp, which again might have license issues. I'm not an expert here. If 
> we have to implement our own, there is still much room for improvement, as 
> discussed below.
> 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, 
> which collects rows to the driver and then constructs columns. However,
> * it ignores column types and incurs boxing/unboxing overhead
> * it collects all objects to the driver, resulting in high GC pressure
> A relatively simple change is to implement specialized column builders based 
> on column type, primitive types in particular. We need to handle null/NA 
> values properly. A simple data structure we can use is
> {code}
> val size: Int
> val nullIndexes: Array[Int]
> val notNullValues: Array[T] // specialized for primitive types
> {code}
> On the R side, we can use `readBin` and `writeBin` to read and write the 
> entire vector in a single call. The speed seems reasonable (on the order of 
> GB/s):
> {code}
> > x <- runif(1e7) # 1e7, not 1e6
> > system.time(r <- writeBin(x, raw(0)))
>user  system elapsed
>   0.036   0.021   0.059
> > system.time(y <- readBin(r, double(), 1e7))
>user  system elapsed
>   0.015   0.007   0.024
> {code}
> This is just a proposal that needs to be discussed and formalized. But in 
> general, it should be feasible to obtain a 20x or greater performance gain.
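
To make the proposed column-builder layout above concrete, here is a minimal sketch of a specialized builder for a double column, following the (size, nullIndexes, notNullValues) structure in the description. The class and method names are made up for illustration and are not an existing SparkR SerDe API.

{code}
import scala.collection.mutable.ArrayBuilder

// Specialized builder for a double column: non-null values stay unboxed and
// NA/null positions are tracked separately by row index.
class DoubleColumnBuilder {
  private var size = 0
  private val nullIndexes = new ArrayBuilder.ofInt
  private val notNullValues = new ArrayBuilder.ofDouble

  def append(value: Any): Unit = {
    if (value == null) nullIndexes += size
    else notNullValues += value.asInstanceOf[Double]
    size += 1
  }

  // (size, null positions, dense non-null values) -- the structure proposed above.
  def build(): (Int, Array[Int], Array[Double]) =
    (size, nullIndexes.result(), notNullValues.result())
}

// Usage sketch: build a column in one pass over collected values.
object DoubleColumnBuilderExample {
  def main(args: Array[String]): Unit = {
    val b = new DoubleColumnBuilder
    Seq[Any](1.5, null, 3.0).foreach(b.append)
    val (n, nulls, values) = b.build()
    println(s"size=$n nulls=${nulls.mkString(",")} values=${values.mkString(",")}")
  }
}
{code}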



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18924) Improve collect/createDataFrame performance in SparkR

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18924:
--
Shepherd: Xiangrui Meng

> Improve collect/createDataFrame performance in SparkR
> -
>
> Key: SPARK-18924
> URL: https://issues.apache.org/jira/browse/SPARK-18924
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Xiangrui Meng
>Priority: Critical
>
> SparkR has its own SerDe for data serialization between JVM and R.
> The SerDe on the JVM side is implemented in:
> * 
> [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
> * 
> [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]
> The SerDe on the R side is implemented in:
> * 
> [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
> * 
> [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]
> The serialization between the JVM and R suffers from huge storage and 
> computation overhead. For example, a short round trip of 1 million doubles 
> surprisingly took 3 minutes on my laptop:
> {code}
> > system.time(collect(createDataFrame(data.frame(x=runif(1e6)))))
>user  system elapsed
>  14.224   0.582 189.135
> {code}
> Collecting a medium-sized DataFrame to local and continuing with a local R 
> workflow is a use case we should pay attention to. SparkR will never be able 
> to cover all existing features from CRAN packages. It is also unnecessary for 
> Spark to do so because not all features need scalability. 
> Several factors contribute to the serialization overhead:
> 1. The SerDe on the R side is implemented using high-level R methods.
> 2. DataFrame columns are not efficiently serialized, primitive type columns 
> in particular.
> 3. Some overhead in the serialization protocol/impl.
> 1) may have been discussed before, because R packages like rJava existed 
> before SparkR. I'm not sure whether we have a license issue in depending on 
> those libraries. Another option is to switch to the low-level R C interface 
> or Rcpp, which again might have license issues. I'm not an expert here. If 
> we have to implement our own, there is still much room for improvement, as 
> discussed below.
> 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, 
> which collects rows to the driver and then constructs columns. However,
> * it ignores column types and incurs boxing/unboxing overhead
> * it collects all objects to the driver, resulting in high GC pressure
> A relatively simple change is to implement specialized column builders based 
> on column type, primitive types in particular. We need to handle null/NA 
> values properly. A simple data structure we can use is
> {code}
> val size: Int
> val nullIndexes: Array[Int]
> val notNullValues: Array[T] // specialized for primitive types
> {code}
> On the R side, we can use `readBin` and `writeBin` to read and write the 
> entire vector in a single call. The speed seems reasonable (on the order of 
> GB/s):
> {code}
> > x <- runif(1e7) # 1e7, not 1e6
> > system.time(r <- writeBin(x, raw(0)))
>user  system elapsed
>   0.036   0.021   0.059
> > system.time(y <- readBin(r, double(), 1e7))
>user  system elapsed
>   0.015   0.007   0.024
> {code}
> This is just a proposal that needs to be discussed and formalized. But in 
> general, it should be feasible to obtain a 20x or greater performance gain.
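
Complementing the R-side readBin/writeBin numbers above, a small sketch of how the JVM side could turn such a raw double payload into an unboxed array in one bulk copy. The little-endian byte order is an assumption (it is what writeBin produces on common platforms), and the object and method names are illustrative, not an existing SerDe entry point.

{code}
import java.io.DataInputStream
import java.nio.{ByteBuffer, ByteOrder}

object RawDoubleVector {
  // Reinterpret a byte array written by R's writeBin (assumed little-endian)
  // as `count` doubles, copied in bulk with no per-element boxing.
  def fromBytes(bytes: Array[Byte], count: Int): Array[Double] = {
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val out = new Array[Double](count)
    buf.asDoubleBuffer().get(out)
    out
  }

  // Variant for a stream, e.g. the socket between R and the JVM: read the
  // whole payload, then reinterpret it as doubles.
  def fromStream(in: DataInputStream, count: Int): Array[Double] = {
    val bytes = new Array[Byte](count * java.lang.Double.BYTES)
    in.readFully(bytes)
    fromBytes(bytes, count)
  }
}
{code}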



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19261) Support `ALTER TABLE table_name ADD COLUMNS(..)` statement

2017-01-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826982#comment-15826982
 ] 

Apache Spark commented on SPARK-19261:
--

User 'xwu0226' has created a pull request for this issue:
https://github.com/apache/spark/pull/16626

> Support `ALTER TABLE table_name ADD COLUMNS(..)` statement
> --
>
> Key: SPARK-19261
> URL: https://issues.apache.org/jira/browse/SPARK-19261
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: StanZhai
>
> We should support the `ALTER TABLE table_name ADD COLUMNS(..)` statement, 
> which could already be used in versions before 2.x.
> This is very useful for those who want to upgrade their Spark version to 2.x.
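
For concreteness, a minimal sketch of the statement this issue asks for, issued through SparkSession.sql. The table and column names are made up, and on the affected 2.x versions the ALTER TABLE line fails instead of adding the column.

{code}
import org.apache.spark.sql.SparkSession

object AlterTableAddColumnsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("alter-table-add-columns")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE TABLE IF NOT EXISTS events (id INT, name STRING)")
    // The statement this issue asks Spark 2.x to accept again:
    spark.sql("ALTER TABLE events ADD COLUMNS (event_time TIMESTAMP)")
    spark.sql("DESCRIBE events").show()

    spark.stop()
  }
}
{code}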



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17874) Additional SSL port on HistoryServer should be configurable

2017-01-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17874:


Assignee: Apache Spark

> Additional SSL port on HistoryServer should be configurable
> ---
>
> Key: SPARK-17874
> URL: https://issues.apache.org/jira/browse/SPARK-17874
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.1
>Reporter: Andrew Ash
>Assignee: Apache Spark
>
> Turning on SSL on the HistoryServer with 
> {{spark.ssl.historyServer.enabled=true}} opens up a second port, at the 
> [hardcoded|https://github.com/apache/spark/blob/v2.0.1/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L262]
>  result of calculating {{spark.history.ui.port + 400}}, and sets up a 
> redirect from the original (http) port to the new (https) port.
> {noformat}
> $ netstat -nlp | grep 23714
> (Not all processes could be identified, non-owned process info
>  will not be shown, you would have to be root to see it all.)
> tcp0  0 :::18080:::*
> LISTEN  23714/java
> tcp0  0 :::18480:::*
> LISTEN  23714/java
> {noformat}
> By enabling {{spark.ssl.historyServer.enabled}} I would have expected the one 
> open port to change protocol from http to https, not to end up with 1) an 
> additional port open, 2) the http port still open, and 3) the additional port 
> at a value I didn't specify.
> A fix could take one of two approaches:
> Approach 1:
> - one port always, which is configured with {{spark.history.ui.port}}
> - the protocol on that port is http by default
> - or if {{spark.ssl.historyServer.enabled=true}} then it's https
> Approach 2:
> - add a new configuration item {{spark.history.ui.sslPort}} which configures 
> the second port that starts up
> In approach 1 we would probably need a way to tell Spark jobs whether the 
> history server has SSL or not, based on SPARK-16988.
> That makes me think we should go with approach 2.
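
To make approach 2 concrete, a small sketch of the configuration surface it would add. {{spark.history.ui.sslPort}} is the key proposed in this issue and does not exist yet, the other two keys already exist, and in practice the history server would read these from spark-defaults.conf rather than a SparkConf built in code; this is illustration only.

{code}
import org.apache.spark.SparkConf

object HistoryServerSslConfigSketch {
  // Approach 2: set the SSL listener port explicitly instead of deriving it
  // as spark.history.ui.port + 400.
  val historyConf: SparkConf = new SparkConf()
    .set("spark.history.ui.port", "18080")            // existing key: plain HTTP port
    .set("spark.ssl.historyServer.enabled", "true")   // existing key: enable SSL
    .set("spark.history.ui.sslPort", "18443")         // proposed key, not yet implemented
}
{code}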



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17874) Additional SSL port on HistoryServer should be configurable

2017-01-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17874:


Assignee: (was: Apache Spark)

> Additional SSL port on HistoryServer should be configurable
> ---
>
> Key: SPARK-17874
> URL: https://issues.apache.org/jira/browse/SPARK-17874
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.1
>Reporter: Andrew Ash
>
> Turning on SSL on the HistoryServer with 
> {{spark.ssl.historyServer.enabled=true}} opens up a second port, at the 
> [hardcoded|https://github.com/apache/spark/blob/v2.0.1/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L262]
>  result of calculating {{spark.history.ui.port + 400}}, and sets up a 
> redirect from the original (http) port to the new (https) port.
> {noformat}
> $ netstat -nlp | grep 23714
> (Not all processes could be identified, non-owned process info
>  will not be shown, you would have to be root to see it all.)
> tcp0  0 :::18080:::*
> LISTEN  23714/java
> tcp0  0 :::18480:::*
> LISTEN  23714/java
> {noformat}
> By enabling {{spark.ssl.historyServer.enabled}} I would have expected the one 
> open port to change protocol from http to https, not to end up with 1) an 
> additional port open, 2) the http port still open, and 3) the additional port 
> at a value I didn't specify.
> A fix could take one of two approaches:
> Approach 1:
> - one port always, which is configured with {{spark.history.ui.port}}
> - the protocol on that port is http by default
> - or if {{spark.ssl.historyServer.enabled=true}} then it's https
> Approach 2:
> - add a new configuration item {{spark.history.ui.sslPort}} which configures 
> the second port that starts up
> In approach 1 we would probably need a way to tell Spark jobs whether the 
> history server has SSL or not, based on SPARK-16988.
> That makes me think we should go with approach 2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


