[jira] [Commented] (SPARK-21554) Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster

2017-07-27 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104472#comment-16104472
 ] 

Hyukjin Kwon commented on SPARK-21554:
--

Would you mind describing steps to reproduce this?

> Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: 
> XXX' when run on yarn cluster
> --
>
> Key: SPARK-21554
> URL: https://issues.apache.org/jira/browse/SPARK-21554
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1
> Environment: We are deploying pyspark scripts on EMR 5.7
>Reporter: Subhod Lagade
>
> Traceback (most recent call last):
>   File "Test.py", line 7, in <module>
> hc = HiveContext(sc)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/context.py",
>  line 514, in __init__
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/session.py",
>  line 179, in getOrCreate
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>  line 1133, in __call__
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/utils.py",
>  line 79, in deco
> pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 
> 'org.apache.spark.sql.hive.HiveSessionState':"



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21554) Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster

2017-07-27 Thread Subhod Lagade (JIRA)
Subhod Lagade created SPARK-21554:
-

 Summary: Spark Hive reporting pyspark.sql.utils.AnalysisException: 
u'Table not found: XXX' when run on yarn cluster
 Key: SPARK-21554
 URL: https://issues.apache.org/jira/browse/SPARK-21554
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.1.1
 Environment: We are deploying pyspark scripts on EMR 5.7
Reporter: Subhod Lagade


Traceback (most recent call last):
  File "Test.py", line 7, in <module>
hc = HiveContext(sc)
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/context.py",
 line 514, in __init__
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/session.py",
 line 179, in getOrCreate
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1133, in __call__
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/utils.py",
 line 79, in deco
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 
'org.apache.spark.sql.hive.HiveSessionState':"
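A minimal script consistent with the traceback above might look like the sketch below (a hypothetical reconstruction; the actual Test.py is not attached to this issue, and the table name is made up):

{code:python}
# Hypothetical reconstruction of Test.py, based only on the traceback above;
# the real script is not attached to this issue.
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = SparkConf().setAppName("Test")
sc = SparkContext(conf=conf)
hc = HiveContext(sc)                       # fails while instantiating HiveSessionState
hc.sql("SELECT * FROM some_table").show()  # never reached
{code}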



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21553) Added the description of the default value of master parameter in the spark-shell

2017-07-27 Thread Donghui Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104424#comment-16104424
 ] 

Donghui Xu commented on SPARK-21553:


Please review the code on 
[https://github.com/apache/spark/pull/18755|https://github.com/apache/spark/pull/18755]

> Added the description of the default value of master parameter in the 
> spark-shell
> -
>
> Key: SPARK-21553
> URL: https://issues.apache.org/jira/browse/SPARK-21553
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 2.2.0
>Reporter: Donghui Xu
>Priority: Minor
>
> When I type spark-shell --help, I find that the description of the default 
> value for the master parameter is missing. Users do not know what the default 
> is when --master is omitted, so we should add the default value of the master 
> parameter to the help output.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21548) Support insert into serial columns of table

2017-07-27 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104420#comment-16104420
 ] 

Takeshi Yamamuro commented on SPARK-21548:
--

It makes some sense to me because this is SQL-compliant: 
http://developer.mimer.com/documentation/html_101/Mimer_SQL_Engine_DocSet/SQL_Statements65.html.
Could you open a PR? We'd better discuss the implementation on GitHub, I think.

> Support insert into serial columns of table
> ---
>
> Key: SPARK-21548
> URL: https://issues.apache.org/jira/browse/SPARK-21548
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: LvDongrong
>
> When we use the 'insert into ...' statement, we can only insert values for 
> all the columns of a table. But in some cases our table has many columns and 
> we are only interested in some of them, so we want to support the statement 
> "insert into table tbl (column1, column2, ...) values (value1, value2, 
> value3, ...)".
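For illustration, the proposed statement form could be exercised like this (a sketch only; the table and column names are made up, and, as described above, current Spark does not yet accept the column list):

{code:python}
# Sketch of the proposed syntax only; table and column names are made up.
# As of the Spark version discussed here, the column list after the table name
# is not accepted, which is what this issue asks to support.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("CREATE TABLE IF NOT EXISTS tbl (column1 INT, column2 STRING, column3 DOUBLE)")

# Proposed: insert into a subset of columns; unspecified columns would get NULL/defaults.
spark.sql("INSERT INTO TABLE tbl (column1, column2) VALUES (1, 'a')")
{code}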



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21509) Add a config to enable adaptive query execution only for the last query execution.

2017-07-27 Thread jin xing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jin xing closed SPARK-21509.

Resolution: Won't Fix

>  Add a config to enable adaptive query execution only for the last query 
> execution.
> ---
>
> Key: SPARK-21509
> URL: https://issues.apache.org/jira/browse/SPARK-21509
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: jin xing
>
> The adaptive query execution feature is a good way to avoid generating too 
> many small files on HDFS, as mentioned in SPARK-16188.
> When adaptive query execution is enabled, all shuffles will be coordinated. 
> The drawbacks:
> 1. It's hard to balance the number of reducers (which determines processing 
> speed) against the file size on HDFS.
> 2. It generates some unnecessary shuffles 
> (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L101).
> 3. It generates lots of jobs, which add extra scheduling cost.
> We can add a config and enable adaptive query execution only for the last 
> shuffle.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2017-07-27 Thread Subhod Lagade (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104418#comment-16104418
 ] 

Subhod Lagade commented on SPARK-15345:
---

Spark Hive reports pyspark.sql.utils.AnalysisException: u'Table not found: 
XXX' when run on a YARN cluster. We are still facing this issue with Spark 
2.1.1. Any update on this? Has it been resolved?
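As an aside, on Spark 2.x the usual entry point for Hive catalog access in PySpark is SparkSession with Hive support enabled. A sketch is below (the table name is made up, and whether this resolves the reporter's situation is not confirmed here):

{code:python}
# Hedged sketch: the Spark 2.x entry point with Hive support, instead of HiveContext(sc).
# The database/table names are made up for the example.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Test") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()
spark.sql("SELECT * FROM some_db.some_table").show()
{code}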

> SparkSession's conf doesn't take effect when there's already an existing 
> SparkContext
> -
>
> Key: SPARK-15345
> URL: https://issues.apache.org/jira/browse/SPARK-15345
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Piotr Milanowski
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I am working with branch-2.0; Spark is compiled with Hive support (-Phive and 
> -Phive-thriftserver).
> I am trying to access databases using this snippet:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> hc.sql("show databases").collect()
> [Row(result='default')]
> {code}
> This means that Spark doesn't find any databases specified in the configuration.
> Using the same configuration (i.e. hive-site.xml and core-site.xml) in Spark 
> 1.6 and launching the above snippet, I can print out the existing databases.
> When run in DEBUG mode, this is what Spark (2.0) prints out:
> {code}
> 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases
> 16/05/16 12:17:47 DEBUG SimpleAnalyzer: 
> === Result of Batch Resolution ===
> !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, 
> string])) null else input[0, string].toString, 
> StructField(result,StringType,false)), result#2) AS #3]   Project 
> [createexternalrow(if (isnull(result#2)) null else result#2.toString, 
> StructField(result,StringType,false)) AS #3]
>  +- LocalRelation [result#2]  
>   
>  +- LocalRelation [result#2]
> 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  private final 
> org.apache.spark.sql.types.StructType 
> org.apache.spark.sql.Dataset$$anonfun$53.structType$1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no 

[jira] [Resolved] (SPARK-21325) The shell of 'spark-submit' about '--jars' and '--fils', jars and files can be placed on local and hdfs.

2017-07-27 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21325.
--
Resolution: Invalid

The affected version you set is 2.3.0, this looks fixed in master, and the 
description is not clear to me either. I am resolving this as Invalid for now.

> The shell of 'spark-submit' about '--jars' and '--fils', jars and files can 
> be placed on local and hdfs.
> 
>
> Key: SPARK-21325
> URL: https://issues.apache.org/jira/browse/SPARK-21325
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> 1.My submit way:
> spark-submit --class cn.gxl.TestSql{color:red} --jars 
> hdfs://nameservice:/gxl/spark-core_2.11-2.3.0-SNAPSHOT.jar,hdfs://nameservice:/gxl/zookeeper-3.4.6.jar
>  --files 
> hdfs://nameservice:/gxl/value1.txt,hdfs://nameservice:/gxl/value2.txt{color} 
> hdfs://nameservice:/gxl/spark_2.0.2_project.jar
> 2.spark-submit description:
> --jars {color:red}JARS Comma-separated list of local jars{color} to include 
> on the driver and executor classpaths.
> --files{color:red} FILES Comma-separated list of files{color} to be placed in 
> the working directory of each executor. File paths of these files in 
> executors can be accessed via SparkFiles.get(fileName).
> 3. Problem description:
> {color:red}Jars and files can be placed not only on the local filesystem but 
> also on HDFS.
> The description of '--jars' says they can only be local, which is wrong.
> The description of '--files' does not make clear whether they can be local or 
> on HDFS. This is blurry and not helpful for developers to understand and 
> use.{color}
> So, this is an improvement that deserves to be made.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21325) The shell of 'spark-submit' about '--jars' and '--fils', jars and files can be placed on local and hdfs.

2017-07-27 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-21325:
--

> The shell of 'spark-submit' about '--jars' and '--fils', jars and files can 
> be placed on local and hdfs.
> 
>
> Key: SPARK-21325
> URL: https://issues.apache.org/jira/browse/SPARK-21325
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> 1.My submit way:
> spark-submit --class cn.gxl.TestSql{color:red} --jars 
> hdfs://nameservice:/gxl/spark-core_2.11-2.3.0-SNAPSHOT.jar,hdfs://nameservice:/gxl/zookeeper-3.4.6.jar
>  --files 
> hdfs://nameservice:/gxl/value1.txt,hdfs://nameservice:/gxl/value2.txt{color} 
> hdfs://nameservice:/gxl/spark_2.0.2_project.jar
> 2.spark-submit description:
> --jars {color:red}JARS Comma-separated list of local jars{color} to include 
> on the driver and executor classpaths.
> --files{color:red} FILES Comma-separated list of files{color} to be placed in 
> the working directory of each executor. File paths of these files in 
> executors can be accessed via SparkFiles.get(fileName).
> 3. Problem description:
> {color:red}Jars and files can be placed not only on the local filesystem but 
> also on HDFS.
> The description of '--jars' says they can only be local, which is wrong.
> The description of '--files' does not make clear whether they can be local or 
> on HDFS. This is blurry and not helpful for developers to understand and 
> use.{color}
> So, this is an improvement that deserves to be made.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21325) The shell of 'spark-submit' about '--jars' and '--fils', jars and files can be placed on local and hdfs.

2017-07-27 Thread guoxiaolongzte (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104410#comment-16104410
 ] 

guoxiaolongzte commented on SPARK-21325:


https://issues.apache.org/jira/browse/SPARK-21012
Please help me close this JIRA, thanks.

> The shell of 'spark-submit' about '--jars' and '--fils', jars and files can 
> be placed on local and hdfs.
> 
>
> Key: SPARK-21325
> URL: https://issues.apache.org/jira/browse/SPARK-21325
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> 1.My submit way:
> spark-submit --class cn.gxl.TestSql{color:red} --jars 
> hdfs://nameservice:/gxl/spark-core_2.11-2.3.0-SNAPSHOT.jar,hdfs://nameservice:/gxl/zookeeper-3.4.6.jar
>  --files 
> hdfs://nameservice:/gxl/value1.txt,hdfs://nameservice:/gxl/value2.txt{color} 
> hdfs://nameservice:/gxl/spark_2.0.2_project.jar
> 2.spark-submit description:
> --jars {color:red}JARS Comma-separated list of local jars{color} to include 
> on the driver and executor classpaths.
> --files{color:red} FILES Comma-separated list of files{color} to be placed in 
> the working directory of each executor. File paths of these files in 
> executors can be accessed via SparkFiles.get(fileName).
> 3. Problem description:
> {color:red}Jars and files can be placed not only on the local filesystem but 
> also on HDFS.
> The description of '--jars' says they can only be local, which is wrong.
> The description of '--files' does not make clear whether they can be local or 
> on HDFS. This is blurry and not helpful for developers to understand and 
> use.{color}
> So, this is an improvement that deserves to be made.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21553) Added the description of the default value of master parameter in the spark-shell

2017-07-27 Thread Donghui Xu (JIRA)
Donghui Xu created SPARK-21553:
--

 Summary: Added the description of the default value of master 
parameter in the spark-shell
 Key: SPARK-21553
 URL: https://issues.apache.org/jira/browse/SPARK-21553
 Project: Spark
  Issue Type: Improvement
  Components: Spark Shell
Affects Versions: 2.2.0
Reporter: Donghui Xu
Priority: Minor


When I type spark-shell --help, I find that the description of the default 
value for the master parameter is missing. Users do not know what the default 
is when --master is omitted, so we should add the default value of the master 
parameter to the help output.
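As a point of reference, the effective default can be checked from a shell started without --master. A sketch that assumes the PySpark shell, where sc is provided by the shell:

{code:python}
# Sketch: check which master the shell actually picked when --master was omitted.
# Assumes a pyspark shell, where `sc` is the SparkContext created by the shell;
# on a plain local installation this typically reports "local[*]".
print(sc.master)
print(sc.getConf().get("spark.master", "not set"))
{code}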



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21325) The shell of 'spark-submit' about '--jars' and '--fils', jars and files can be placed on local and hdfs.

2017-07-27 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104402#comment-16104402
 ] 

Hyukjin Kwon commented on SPARK-21325:
--

[~guoxiaolongzte], Would you point out which JIRA this one duplicates?

> The shell of 'spark-submit' about '--jars' and '--fils', jars and files can 
> be placed on local and hdfs.
> 
>
> Key: SPARK-21325
> URL: https://issues.apache.org/jira/browse/SPARK-21325
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> 1.My submit way:
> spark-submit --class cn.gxl.TestSql{color:red} --jars 
> hdfs://nameservice:/gxl/spark-core_2.11-2.3.0-SNAPSHOT.jar,hdfs://nameservice:/gxl/zookeeper-3.4.6.jar
>  --files 
> hdfs://nameservice:/gxl/value1.txt,hdfs://nameservice:/gxl/value2.txt{color} 
> hdfs://nameservice:/gxl/spark_2.0.2_project.jar
> 2.spark-submit description:
> --jars {color:red}JARS Comma-separated list of local jars{color} to include 
> on the driver and executor classpaths.
> --files{color:red} FILES Comma-separated list of files{color} to be placed in 
> the working directory of each executor. File paths of these files in 
> executors can be accessed via SparkFiles.get(fileName).
> 3. Problem description:
> {color:red}Jars and files can be placed not only on the local filesystem but 
> also on HDFS.
> The description of '--jars' says they can only be local, which is wrong.
> The description of '--files' does not make clear whether they can be local or 
> on HDFS. This is blurry and not helpful for developers to understand and 
> use.{color}
> So, this is an improvement that deserves to be made.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21552) Add decimal type support to ArrowWriter.

2017-07-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-21552:
-

 Summary: Add decimal type support to ArrowWriter.
 Key: SPARK-21552
 URL: https://issues.apache.org/jira/browse/SPARK-21552
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Takuya Ueshin


Decimal type is not yet supported in ArrowWriter.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs

2017-07-27 Thread roncenzhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104373#comment-16104373
 ] 

roncenzhao commented on SPARK-17321:


Hi, I am encountering this problem too. Any progress on this bug? Thanks~

> YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
> --
>
> Key: SPARK-17321
> URL: https://issues.apache.org/jira/browse/SPARK-17321
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.2, 2.0.0
>Reporter: yunjiong zhao
>
> We run Spark on YARN. After enabling Spark dynamic allocation, we noticed that 
> some Spark applications failed randomly due to YarnShuffleService.
> From the logs I found:
> {quote}
> 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: 
> Error while initializing Netty pipeline
> java.lang.NullPointerException
> at 
> org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
> at 
> org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
> at 
> org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
> at 
> io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> {quote} 
> This was caused by the first disk in yarn.nodemanager.local-dirs being broken.
> If we enabled spark.yarn.shuffle.stopOnFailure (SPARK-16505) we might lose 
> hundreds of nodes, which is unacceptable.
> We have 12 disks in yarn.nodemanager.local-dirs, so why not use the other good 
> disks if the first one is broken?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21306) OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier

2017-07-27 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-21306.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1
   2.1.2
   2.0.3

> OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier
> 
>
> Key: SPARK-21306
> URL: https://issues.apache.org/jira/browse/SPARK-21306
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Cathal Garvey
>Assignee: Yan Facai (颜发才)
>Priority: Critical
>  Labels: classification, ml
> Fix For: 2.0.3, 2.1.2, 2.2.1, 2.3.0
>
>
> Hi folks, thanks for Spark! :)
> I've been learning to use `ml` and `mllib`, and I've encountered a block 
> while trying to use `ml.classification.OneVsRest` with 
> `ml.classification.LogisticRegression`. Basically, [here in the 
> code|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala#L320],
>  only two columns are being extracted and fed to the underlying classifiers.. 
> however with some configurations, more than two columns are required.
> Specifically: I want to do multiclass learning with Logistic Regression, on a 
> very imbalanced dataset. In my dataset, I have lots of imbalances, so I was 
> planning to use weights. I set a column, `"weight"`, as the inverse frequency 
> of each field, and I configured my `LogisticRegression` class to use this 
> column, then put it in a `OneVsRest` wrapper.
> However, `OneVsRest` strips all but two columns out of a dataset before 
> training, so I get an error from within `LogisticRegression` that it can't 
> find the `"weight"` column.
> It would be nice to have this fixed! I can see a few ways, but a very 
> conservative fix would be to include a parameter in `OneVsRest.fit` for 
> additional columns to `select` before passing to the underlying model.
> Thanks!
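For illustration, a sketch of the kind of setup described above in PySpark (the data and column names are made up; the issue as filed is about the Scala implementation):

{code:python}
# Sketch of the setup described above (made-up data and column names):
# a weighted LogisticRegression wrapped in OneVsRest. The report is that OneVsRest
# passes only the label and features columns to the inner classifier, so the
# "weight" column cannot be found during training.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.0), 10.0),
     (1.0, Vectors.dense(1.0, 0.0), 1.0),
     (2.0, Vectors.dense(1.0, 1.0), 1.0)],
    ["label", "features", "weight"])

lr = LogisticRegression(weightCol="weight")
ovr = OneVsRest(classifier=lr)
model = ovr.fit(df)  # reported to fail because "weight" is stripped before training
{code}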



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-07-27 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104284#comment-16104284
 ] 

Kazuaki Ishizaki commented on SPARK-18016:
--

[~jamcon] Thank you for reporting the problem.
We fixed a problem for a large number of columns (e.g. 4000). However, we know 
that we have not yet solved the problem for a very large number of columns 
(e.g. 12000).
I have just pinged the author who created the fix about solving these two problems.
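A rough way to probe the column counts mentioned above from PySpark (a sketch only; the numbers are illustrative, not a verified reproduction):

{code:python}
# Sketch only: build and evaluate a projection over a very wide schema to probe
# the column-count thresholds mentioned above (around 4000 vs 12000 columns).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
ncols = 4000  # raise toward 12000 to probe the case that is still open
wide = spark.range(1).selectExpr(*["id AS c{}".format(i) for i in range(ncols)])
wide.selectExpr(*["c{} + 1 AS d{}".format(i, i) for i in range(ncols)]).collect()
{code}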


> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0xFFFF
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> 

[jira] [Closed] (SPARK-21325) The shell of 'spark-submit' about '--jars' and '--fils', jars and files can be placed on local and hdfs.

2017-07-27 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte closed SPARK-21325.
--

duplicate

> The shell of 'spark-submit' about '--jars' and '--fils', jars and files can 
> be placed on local and hdfs.
> 
>
> Key: SPARK-21325
> URL: https://issues.apache.org/jira/browse/SPARK-21325
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> 1.My submit way:
> spark-submit --class cn.gxl.TestSql{color:red} --jars 
> hdfs://nameservice:/gxl/spark-core_2.11-2.3.0-SNAPSHOT.jar,hdfs://nameservice:/gxl/zookeeper-3.4.6.jar
>  --files 
> hdfs://nameservice:/gxl/value1.txt,hdfs://nameservice:/gxl/value2.txt{color} 
> hdfs://nameservice:/gxl/spark_2.0.2_project.jar
> 2.spark-submit description:
> --jars {color:red}JARS Comma-separated list of local jars{color} to include 
> on the driver and executor classpaths.
> --files{color:red} FILES Comma-separated list of files{color} to be placed in 
> the working directory of each executor. File paths of these files in 
> executors can be accessed via SparkFiles.get(fileName).
> 3. Problem description:
> {color:red}Jars and files can be placed not only on the local filesystem but 
> also on HDFS.
> The description of '--jars' says they can only be local, which is wrong.
> The description of '--files' does not make clear whether they can be local or 
> on HDFS. This is blurry and not helpful for developers to understand and 
> use.{color}
> So, this is an improvement that deserves to be made.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-07-27 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103202#comment-16103202
 ] 

xinzhang edited comment on SPARK-21067 at 7/28/17 1:06 AM:
---

Same here.
The problem reappeared in the Spark 2.1.0 Thrift Server:

Open Beeline Session 1
Create Table 1 (Success)
Open Beeline Session 2
Create Table 2 (Success)
Close Beeline Session 1
Create Table 3 in Beeline Session 2 (FAIL)

With Parquet, the issue is not present.

[~cloud_fan]


was (Author: zhangxin0112zx):
same here.
problem  reappeared in Spark 2.1.0 thriftserver :

Open Beeline Session 1
Create Table 1 (Success)
Open Beeline Session 2
Create Table 2 (Success)
Close Beeline Session 1
Create Table 3 in Beeline Session 2 (FAIL)

use parquet,  the issue is not present .

@Wenchen Fan

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021) which states that the 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> 

[jira] [Resolved] (SPARK-21538) Attribute resolution inconsistency in Dataset API

2017-07-27 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21538.
-
   Resolution: Fixed
 Assignee: Anton Okolnychyi
Fix Version/s: 2.3.0
   2.2.1

> Attribute resolution inconsistency in Dataset API
> -
>
> Key: SPARK-21538
> URL: https://issues.apache.org/jira/browse/SPARK-21538
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Adrian Ionescu
>Assignee: Anton Okolnychyi
> Fix For: 2.2.1, 2.3.0
>
>
> {code}
> spark.range(1).withColumnRenamed("id", "x").sort(col("id"))  // works
> spark.range(1).withColumnRenamed("id", "x").sort($"id")  // works
> spark.range(1).withColumnRenamed("id", "x").sort('id) // works
> spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with:
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among 
> (x);
> ...
> {code}
> It looks like the Dataset API functions taking {{String}} use the basic 
> resolver that only looks at the columns at that level, whereas all the other 
> means of expressing an attribute are lazily resolved by the analyzer.
> The reason why the first 3 calls work is explained in the docs for {{object 
> ResolveMissingReferences}}:
> {code}
>   /**
>* In many dialects of SQL it is valid to sort by attributes that are not 
> present in the SELECT
>* clause.  This rule detects such queries and adds the required attributes 
> to the original
>* projection, so that they will be available during sorting. Another 
> projection is added to
>* remove these attributes after sorting.
>*
>* The HAVING clause could also used a grouping columns that is not 
> presented in the SELECT.
>*/
> {code}
> For consistency, it would be good to use the same attribute resolution 
> mechanism everywhere.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-07-27 Thread James Conner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103853#comment-16103853
 ] 

James Conner edited comment on SPARK-18016 at 7/27/17 8:31 PM:
---

The issue does not appear to be completely fixed.  I downloaded the master 
today (last commit 2ff35a057efd36bd5c8a545a1ec3bc341432a904, Spark 
2.3.0-SNAPSHOT), and attempted to perform a Gradient Boosted Tree (GBT) 
Regression on a dataframe which contains 2658 columns.  I'm still getting the 
same constant pool exceeding JVM 0xFFFF error.  The steps that I'm using to 
generate the error are:

{code:java}
// Imports
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor}
import org.apache.spark.ml.feature.{VectorAssembler, Imputer, ImputerModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.sql.{SQLContext, Row, DataFrame, Column}

// Load data
val mainDF = 
spark.read.parquet("/path/to/data/input/main_pqt").repartition(1024)

// Impute data
val inColsMain = mainDF.columns.filter(_ != "ID").filter(_ != "SCORE")
val outColsMain = inColsMain.map(i=>(i+"_imputed"))
val mainImputer = new 
Imputer().setInputCols(inColsMain).setOutputCols(outColsMain).setStrategy("mean")
val mainImputerBuild = mainImputer.fit(mainDF)
val imputedMainDF = mainImputerBuild.transform(mainDF)

// Drop original feature columns, retain imputed
val fullData = imputedMainDF.select(imputedMainDF.columns.filter(colName => 
!inColsMain.contains(colName)).map(colName => new Column(colName)): _*)

// Split data for testing & training
val Array(trainDF, testDF) = fullData.randomSplit(Array(0.80, 0.20),seed = 
12345)

// Vector Assembler
val arrayName = fullData.columns.filter(_ != "ID").filter(_ != "SCORE")
val assembler = new 
VectorAssembler().setInputCols(arrayName).setOutputCol("features")

// GBT Object
val gbt = new 
GBTRegressor().setLabelCol("SCORE").setFeaturesCol("features").setMaxIter(5).setSeed(1993).setLossType("squared").setSubsamplingRate(1)

// Pipeline Object
val pipeline = new Pipeline().setStages(Array(assembler, gbt))

// Hyper Parameter Grid Object
val paramGrid = new ParamGridBuilder().addGrid(gbt.maxBins, Array(2, 4, 
8)).addGrid(gbt.maxDepth, Array(1, 2, 4)).addGrid(gbt.stepSize, Array(0.1, 
0.2)).build()

// Evaluator Object
val evaluator = new 
RegressionEvaluator().setLabelCol("SCORE").setPredictionCol("prediction").setMetricName("rmse")

// Cross Validation Object
val cv = new 
CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(5)

// Build the model
val gbtModel = cv.fit(trainDF)
{code}

Upon building the model, it errors out with the following cause:
{code:java}
java.lang.RuntimeException: Error while encoding: 
org.codehaus.janino.JaninoRuntimeException: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Constant pool for class 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass2
 has grown past JVM limit of 0xFFFF
{code}


was (Author: jamcon):
The issue does not appear to be completely fixed.  I downloaded the master 
today (last commit 2ff35a057efd36bd5c8a545a1ec3bc341432a904, Spark 
2.3.0-SNAPSHOT), and attempted to perform a Gradient Boosted Tree (GBT) 
Regression on a dataframe which contains 2658 columns.  I'm still getting the 
same constant pool exceeding JVM 0xFFFF error.  The steps that I'm using to 
generate the error are:

{code:scala}
// Imports
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor}
import org.apache.spark.ml.feature.{VectorAssembler, Imputer, ImputerModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.sql.{SQLContext, Row, DataFrame, Column}

// Load data
val mainDF = 
spark.read.parquet("/path/to/data/input/main_pqt").repartition(1024)

// Impute data
val inColsMain = mainDF.columns.filter(_ != "ID").filter(_ != "SCORE")
val outColsMain = inColsMain.map(i=>(i+"_imputed"))
val mainImputer = new 
Imputer().setInputCols(inColsMain).setOutputCols(outColsMain).setStrategy("mean")
val mainImputerBuild = mainImputer.fit(mainDF)
val imputedMainDF = mainImputerBuild.transform(mainDF)

// Drop original feature columns, retain imputed
val fullData = imputedMainDF.select(imputedMainDF.columns.filter(colName => 
!inColsMain.contains(colName)).map(colName => new Column(colName)): _*)

// Split data for testing & training
val Array(trainDF, testDF) 

[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-07-27 Thread James Conner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103853#comment-16103853
 ] 

James Conner commented on SPARK-18016:
--

The issue does not appear to be completely fixed.  I downloaded the master 
today (last commit 2ff35a057efd36bd5c8a545a1ec3bc341432a904, Spark 
2.3.0-SNAPSHOT), and attempted to perform a Gradient Boosted Tree (GBT) 
Regression on a dataframe which contains 2658 columns.  I'm still getting the 
same constant pool exceeding JVM 0xFFFF error.  The steps that I'm using to 
generate the error are:

{code:scala}
// Imports
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor}
import org.apache.spark.ml.feature.{VectorAssembler, Imputer, ImputerModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.sql.{SQLContext, Row, DataFrame, Column}

// Load data
val mainDF = 
spark.read.parquet("/path/to/data/input/main_pqt").repartition(1024)

// Impute data
val inColsMain = mainDF.columns.filter(_ != "ID").filter(_ != "SCORE")
val outColsMain = inColsMain.map(i=>(i+"_imputed"))
val mainImputer = new 
Imputer().setInputCols(inColsMain).setOutputCols(outColsMain).setStrategy("mean")
val mainImputerBuild = mainImputer.fit(mainDF)
val imputedMainDF = mainImputerBuild.transform(mainDF)

// Drop original feature columns, retain imputed
val fullData = imputedMainDF.select(imputedMainDF.columns.filter(colName => 
!inColsMain.contains(colName)).map(colName => new Column(colName)): _*)

// Split data for testing & training
val Array(trainDF, testDF) = fullData.randomSplit(Array(0.80, 0.20),seed = 
12345)

// Vector Assembler
val arrayName = fullData.columns.filter(_ != "ID").filter(_ != "SCORE")
val assembler = new 
VectorAssembler().setInputCols(arrayName).setOutputCol("features")

// GBT Object
val gbt = new 
GBTRegressor().setLabelCol("SCORE").setFeaturesCol("features").setMaxIter(5).setSeed(1993).setLossType("squared").setSubsamplingRate(1)

// Pipeline Object
val pipeline = new Pipeline().setStages(Array(assembler, gbt))

// Hyper Parameter Grid Object
val paramGrid = new ParamGridBuilder().addGrid(gbt.maxBins, Array(2, 4, 
8)).addGrid(gbt.maxDepth, Array(1, 2, 4)).addGrid(gbt.stepSize, Array(0.1, 
0.2)).build()

// Evaluator Object
val evaluator = new 
RegressionEvaluator().setLabelCol("SCORE").setPredictionCol("prediction").setMetricName("rmse")

// Cross Validation Object
val cv = new 
CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(5)

// Build the model
val gbtModel = cv.fit(trainDF)
{code}

Upon building the model, it errors out with the following cause:
{code:java}
java.lang.RuntimeException: Error while encoding: 
org.codehaus.janino.JaninoRuntimeException: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Constant pool for class 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass2
 has grown past JVM limit of 0xFFFF
{code}

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0xFFFF
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at 

[jira] [Commented] (SPARK-21550) approxQuantiles throws "next on empty iterator" on empty data

2017-07-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103629#comment-16103629
 ] 

Sean Owen commented on SPARK-21550:
---

I think this was fixed in https://issues.apache.org/jira/browse/SPARK-19573

> approxQuantiles throws "next on empty iterator" on empty data
> -
>
> Key: SPARK-21550
> URL: https://issues.apache.org/jira/browse/SPARK-21550
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: peay
>
> The documentation says:
> {code}
> null and NaN values will be removed from the numerical column before 
> calculation. If
> the dataframe is empty or the column only contains null or NaN, an empty 
> array is returned.
> {code}
> However, this small pyspark example
> {code}
> sql_context.range(10).filter(col("id") == 42).approxQuantile("id", [0.99], 
> 0.001)
> {code}
> throws
> {code}
> Py4JJavaError: An error occurred while calling o3493.approxQuantile.
> : java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at 
> scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
>   at scala.collection.IterableLike$class.head(IterableLike.scala:107)
>   at 
> scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
>   at 
> scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
>   at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
>   at 
> scala.collection.TraversableLike$class.last(TraversableLike.scala:431)
>   at 
> scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$last(ArrayOps.scala:186)
>   at 
> scala.collection.IndexedSeqOptimized$class.last(IndexedSeqOptimized.scala:132)
>   at scala.collection.mutable.ArrayOps$ofRef.last(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.catalyst.util.QuantileSummaries.query(QuantileSummaries.scala:207)
>   at 
> org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply$mcDD$sp(StatFunctions.scala:92)
>   at 
> org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92)
>   at 
> org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21550) approxQuantiles throws "next on empty iterator" on empty data

2017-07-27 Thread peay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay closed SPARK-21550.

   Resolution: Duplicate
Fix Version/s: 2.2.0

> approxQuantiles throws "next on empty iterator" on empty data
> -
>
> Key: SPARK-21550
> URL: https://issues.apache.org/jira/browse/SPARK-21550
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: peay
> Fix For: 2.2.0
>
>
> The documentation says:
> {code}
> null and NaN values will be removed from the numerical column before 
> calculation. If
> the dataframe is empty or the column only contains null or NaN, an empty 
> array is returned.
> {code}
> However, this small pyspark example
> {code}
> sql_context.range(10).filter(col("id") == 42).approxQuantile("id", [0.99], 
> 0.001)
> {code}
> throws
> {code}
> Py4JJavaError: An error occurred while calling o3493.approxQuantile.
> : java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at 
> scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
>   at scala.collection.IterableLike$class.head(IterableLike.scala:107)
>   at 
> scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
>   at 
> scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
>   at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
>   at 
> scala.collection.TraversableLike$class.last(TraversableLike.scala:431)
>   at 
> scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$last(ArrayOps.scala:186)
>   at 
> scala.collection.IndexedSeqOptimized$class.last(IndexedSeqOptimized.scala:132)
>   at scala.collection.mutable.ArrayOps$ofRef.last(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.catalyst.util.QuantileSummaries.query(QuantileSummaries.scala:207)
>   at 
> org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply$mcDD$sp(StatFunctions.scala:92)
>   at 
> org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92)
>   at 
> org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit

2017-07-27 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103547#comment-16103547
 ] 

yuhao yang commented on SPARK-21535:


It's not, in my opinion. https://issues.apache.org/jira/browse/SPARK-21086 proposes storing all of the trained models in the TrainValidationSplitModel or CrossValidatorModel, according to the discussion, behind a control parameter that is turned off by default. In any case, changing the training process hardly has an impact on that.

> Reduce memory requirement for CrossValidator and TrainValidationSplit 
> --
>
> Key: SPARK-21535
> URL: https://issues.apache.org/jira/browse/SPARK-21535
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> CrossValidator and TrainValidationSplit both use 
> {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where 
> epm is Array[ParamMap].
> Even though the training process is sequential, the current implementation 
> consumes extra driver memory to hold the trained models, which is not 
> necessary and often leads to memory exceptions for both CrossValidator and 
> TrainValidationSplit. My proposal is to optimize the training implementation 
> so that each used model can be collected by GC, avoiding unnecessary OOM 
> exceptions.
> E.g. when the grid search space has 12 candidates, the old implementation holds all 12 
> trained models in driver memory at the same time, while the new 
> implementation only needs to hold 1 trained model at a time, and the previous 
> model can be cleared by GC.
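For illustration, a minimal sketch of the proposed training loop, assuming an estimator, an evaluator, a parameter grid, and training/validation DataFrames (all placeholder names, not the actual patch): fit one ParamMap at a time and keep only the metric, so at most one candidate model stays reachable on the driver.

{code:scala}
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataFrame

// Sketch only: fit the candidates sequentially and retain just the metrics,
// so at most one trained model is reachable from the driver at a time.
def evaluateGrid[M <: Model[M]](
    est: Estimator[M],
    eval: Evaluator,
    epm: Array[ParamMap],
    training: DataFrame,
    validation: DataFrame): Array[Double] = {
  epm.map { params =>
    val model = est.fit(training, params)            // train a single candidate
    eval.evaluate(model.transform(validation, params))
    // `model` goes out of scope here and can be garbage collected
  }
}
{code}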



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow

2017-07-27 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103543#comment-16103543
 ] 

Reynold Xin commented on SPARK-21551:
-

Sure.

> pyspark's collect fails when getaddrinfo is too slow
> 
>
> Key: SPARK-21551
> URL: https://issues.apache.org/jira/browse/SPARK-21551
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: peay
>Priority: Critical
>
> Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and 
> {{DataFrame.collect}} all work by starting an ephemeral server in the driver, 
> and having Python connect to it to download the data.
> All three are implemented along the lines of:
> {code}
> port = self._jdf.collectToPython()
> return list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
> {code}
> The server has **a hardcoded timeout of 3 seconds** 
> (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695)
>  -- i.e., the Python process has 3 seconds to connect to it from the very 
> moment the driver server starts.
> In general, that seems fine, but I have been encountering frequent timeouts 
> leading to `Exception: could not open socket`.
> After investigating a bit, it turns out that {{_load_from_socket}} makes a 
> call to {{getaddrinfo}}:
> {code}
> def _load_from_socket(port, serializer):
> sock = None
> # Support for both IPv4 and IPv6.
> # On most of IPv6-ready systems, IPv6 will take precedence.
> for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
> socket.SOCK_STREAM):
>.. connect ..
> {code}
> I am not sure why, but while most such calls to {{getaddrinfo}} on my machine 
> only take a couple milliseconds, about 10% of them take between 2 and 10 
> seconds, leading to about 10% of jobs failing. I don't think we can always 
> expect {{getaddrinfo}} to return instantly. More generally, Python may 
> sometimes pause for a couple seconds, which may not leave enough time for the 
> process to connect to the server.
> Especially since the server timeout is hardcoded, I think it would be best to 
> set a rather generous value (15 seconds?) to avoid such situations.
> A {{getaddrinfo}}  specific fix could avoid doing it every single time, or do 
> it before starting up the driver server.
>  
> cc SPARK-677 [~davies]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow

2017-07-27 Thread peay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103539#comment-16103539
 ] 

peay commented on SPARK-21551:
--

Sure, does 15 seconds sound good?

> pyspark's collect fails when getaddrinfo is too slow
> 
>
> Key: SPARK-21551
> URL: https://issues.apache.org/jira/browse/SPARK-21551
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: peay
>Priority: Critical
>
> Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and 
> {{DataFrame.collect}} all work by starting an ephemeral server in the driver, 
> and having Python connect to it to download the data.
> All three are implemented along the lines of:
> {code}
> port = self._jdf.collectToPython()
> return list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
> {code}
> The server has **a hardcoded timeout of 3 seconds** 
> (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695)
>  -- i.e., the Python process has 3 seconds to connect to it from the very 
> moment the driver server starts.
> In general, that seems fine, but I have been encountering frequent timeouts 
> leading to `Exception: could not open socket`.
> After investigating a bit, it turns out that {{_load_from_socket}} makes a 
> call to {{getaddrinfo}}:
> {code}
> def _load_from_socket(port, serializer):
> sock = None
> # Support for both IPv4 and IPv6.
> # On most of IPv6-ready systems, IPv6 will take precedence.
> for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
> socket.SOCK_STREAM):
>.. connect ..
> {code}
> I am not sure why, but while most such calls to {{getaddrinfo}} on my machine 
> only take a couple milliseconds, about 10% of them take between 2 and 10 
> seconds, leading to about 10% of jobs failing. I don't think we can always 
> expect {{getaddrinfo}} to return instantly. More generally, Python may 
> sometimes pause for a couple seconds, which may not leave enough time for the 
> process to connect to the server.
> Especially since the server timeout is hardcoded, I think it would be best to 
> set a rather generous value (15 seconds?) to avoid such situations.
> A {{getaddrinfo}}  specific fix could avoid doing it every single time, or do 
> it before starting up the driver server.
>  
> cc SPARK-677 [~davies]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow

2017-07-27 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103536#comment-16103536
 ] 

Reynold Xin commented on SPARK-21551:
-

Do you want to submit a pull request?


> pyspark's collect fails when getaddrinfo is too slow
> 
>
> Key: SPARK-21551
> URL: https://issues.apache.org/jira/browse/SPARK-21551
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: peay
>Priority: Critical
>
> Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and 
> {{DataFrame.collect}} all work by starting an ephemeral server in the driver, 
> and having Python connect to it to download the data.
> All three are implemented along the lines of:
> {code}
> port = self._jdf.collectToPython()
> return list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
> {code}
> The server has **a hardcoded timeout of 3 seconds** 
> (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695)
>  -- i.e., the Python process has 3 seconds to connect to it from the very 
> moment the driver server starts.
> In general, that seems fine, but I have been encountering frequent timeouts 
> leading to `Exception: could not open socket`.
> After investigating a bit, it turns out that {{_load_from_socket}} makes a 
> call to {{getaddrinfo}}:
> {code}
> def _load_from_socket(port, serializer):
> sock = None
> # Support for both IPv4 and IPv6.
> # On most of IPv6-ready systems, IPv6 will take precedence.
> for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
> socket.SOCK_STREAM):
>.. connect ..
> {code}
> I am not sure why, but while most such calls to {{getaddrinfo}} on my machine 
> only take a couple milliseconds, about 10% of them take between 2 and 10 
> seconds, leading to about 10% of jobs failing. I don't think we can always 
> expect {{getaddrinfo}} to return instantly. More generally, Python may 
> sometimes pause for a couple seconds, which may not leave enough time for the 
> process to connect to the server.
> Especially since the server timeout is hardcoded, I think it would be best to 
> set a rather generous value (15 seconds?) to avoid such situations.
> A {{getaddrinfo}}  specific fix could avoid doing it every single time, or do 
> it before starting up the driver server.
>  
> cc SPARK-677 [~davies]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow

2017-07-27 Thread peay (JIRA)
peay created SPARK-21551:


 Summary: pyspark's collect fails when getaddrinfo is too slow
 Key: SPARK-21551
 URL: https://issues.apache.org/jira/browse/SPARK-21551
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.0
Reporter: peay
Priority: Critical


Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and 
{{DataFrame.collect}} all work by starting an ephemeral server in the driver, 
and having Python connect to it to download the data.

All three are implemented along the lines of:

{code}
port = self._jdf.collectToPython()
return list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
{code}

The server has **a hardcoded timeout of 3 seconds** 
(https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695)
 -- i.e., the Python process has 3 seconds to connect to it from the very 
moment the driver server starts.

In general, that seems fine, but I have been encountering frequent timeouts 
leading to `Exception: could not open socket`.

After investigating a bit, it turns out that {{_load_from_socket}} makes a call 
to {{getaddrinfo}}:

{code}
def _load_from_socket(port, serializer):
sock = None
# Support for both IPv4 and IPv6.
# On most of IPv6-ready systems, IPv6 will take precedence.
for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
socket.SOCK_STREAM):
   .. connect ..
{code}

I am not sure why, but while most such calls to {{getaddrinfo}} on my machine 
only take a couple milliseconds, about 10% of them take between 2 and 10 
seconds, leading to about 10% of jobs failing. I don't think we can always 
expect {{getaddrinfo}} to return instantly. More generally, Python may 
sometimes pause for a couple seconds, which may not leave enough time for the 
process to connect to the server.

Especially since the server timeout is hardcoded, I think it would be best to 
set a rather generous value (15 seconds?) to avoid such situations.

A {{getaddrinfo}}  specific fix could avoid doing it every single time, or do 
it before starting up the driver server.
 
cc SPARK-677 [~davies]
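For illustration only (a hedged sketch, not the actual PythonRDD code or a merged patch): the driver-side hand-off amounts to a loopback ServerSocket whose accept timeout is the value reported as hardcoded, and parameterizing it is the kind of change suggested above.

{code:scala}
import java.net.{InetAddress, ServerSocket}

// Sketch of the driver-side hand-off, with the accept timeout as a parameter
// instead of a hardcoded 3000 ms. `writeData` is a placeholder for serialization.
def serveOnce(timeoutMs: Int)(writeData: java.net.Socket => Unit): Int = {
  val server = new ServerSocket(0, 1, InetAddress.getByName("localhost"))
  server.setSoTimeout(timeoutMs)           // Python must connect within this window
  new Thread("serve-one-connection") {
    setDaemon(true)
    override def run(): Unit = {
      try {
        val sock = server.accept()         // SocketTimeoutException if Python is too slow
        try writeData(sock) finally sock.close()
      } finally server.close()
    }
  }.start()
  server.getLocalPort                      // returned to Python, as with collectToPython()
}
{code}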



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21550) approxQuantiles throws "next on empty iterator" on empty data

2017-07-27 Thread peay (JIRA)
peay created SPARK-21550:


 Summary: approxQuantiles throws "next on empty iterator" on empty 
data
 Key: SPARK-21550
 URL: https://issues.apache.org/jira/browse/SPARK-21550
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: peay


The documentation says:
{code}
null and NaN values will be removed from the numerical column before 
calculation. If
the dataframe is empty or the column only contains null or NaN, an empty array 
is returned.
{code}

However, this small pyspark example
{code}
sql_context.range(10).filter(col("id") == 42).approxQuantile("id", [0.99], 
0.001)
{code}

throws

{code}
Py4JJavaError: An error occurred while calling o3493.approxQuantile.
: java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
at 
scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
at scala.collection.IterableLike$class.head(IterableLike.scala:107)
at 
scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
at 
scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
at 
scala.collection.TraversableLike$class.last(TraversableLike.scala:431)
at 
scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$last(ArrayOps.scala:186)
at 
scala.collection.IndexedSeqOptimized$class.last(IndexedSeqOptimized.scala:132)
at scala.collection.mutable.ArrayOps$ofRef.last(ArrayOps.scala:186)
at 
org.apache.spark.sql.catalyst.util.QuantileSummaries.query(QuantileSummaries.scala:207)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply$mcDD$sp(StatFunctions.scala:92)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92)
{code}
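Until the behaviour matches the documentation, a caller-side guard avoids the exception. A minimal, self-contained sketch in the Scala DataFrame API:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("approx-quantile-guard").getOrCreate()
import spark.implicits._

val df = spark.range(10).filter($"id" === 42).toDF("id")

// Guard against the empty-input case before calling approxQuantile,
// mirroring the documented contract of returning an empty result.
val quantiles: Array[Double] =
  if (df.take(1).isEmpty) Array.empty[Double]
  else df.stat.approxQuantile("id", Array(0.99), 0.001)
{code}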



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21549) Spark fails to abort job correctly in case of custom OutputFormat implementations

2017-07-27 Thread Sergey Zhemzhitsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-21549:
---
Description: 
Spark fails to abort job correctly in case of custom OutputFormat 
implementations.

There are OutputFormat implementations which do not need to use 
*mapreduce.output.fileoutputformat.outputdir* standard hadoop property.

[But spark reads this property from the 
configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
 while setting up an OutputCommitter
{code:javascript}
val committer = FileCommitProtocol.instantiate(
  className = classOf[HadoopMapReduceCommitProtocol].getName,
  jobId = stageId.toString,
  outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
  isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
committer.setupJob(jobContext)
{code}

In that case if job fails Spark executes 
[committer.abortJob|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L106]
{code:javascript}
committer.abortJob(jobContext)
{code}
... and fails with the following exception
{code}
Can not create a Path from a null string
java.lang.IllegalArgumentException: Can not create a Path from a null string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
  at org.apache.hadoop.fs.Path.(Path.java:135)
  at org.apache.hadoop.fs.Path.(Path.java:89)
  at 
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
  at 
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
  at 
org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at 
org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
{code}

  was:
Spark fails to abort job correctly in case of custom OutputFormat 
implementations.

There are OutputFormat implementations which do not need to use 
*mapreduce.output.fileoutputformat.outputdir* standard hadoop property.

[But spark reads this property from the 
configuration.|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
 while setting up an OutputCommitter
{code:javascript}
val committer = FileCommitProtocol.instantiate(
  className = classOf[HadoopMapReduceCommitProtocol].getName,
  jobId = stageId.toString,
  outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
  isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
committer.setupJob(jobContext)
{code}

In that case if job fails Spark executes 
[committer.abortJob|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L106]
{code:javascript}
committer.abortJob(jobContext)
{code}
... and fails with the following exception
{code}
Can not create a Path from a null string
java.lang.IllegalArgumentException: Can not create a Path from a null string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
  at org.apache.hadoop.fs.Path.(Path.java:135)
  at org.apache.hadoop.fs.Path.(Path.java:89)
  at 
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
  at 
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
  at 
org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at 
org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)

[jira] [Created] (SPARK-21549) Spark fails to abort job correctly in case of custom OutputFormat implementations

2017-07-27 Thread Sergey Zhemzhitsky (JIRA)
Sergey Zhemzhitsky created SPARK-21549:
--

 Summary: Spark fails to abort job correctly in case of custom 
OutputFormat implementations
 Key: SPARK-21549
 URL: https://issues.apache.org/jira/browse/SPARK-21549
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
 Environment: spark 2.2.0
scala 2.11
Reporter: Sergey Zhemzhitsky
Priority: Critical


Spark fails to abort job correctly in case of custom OutputFormat 
implementations.

There are OutputFormat implementations which do not need to use 
*mapreduce.output.fileoutputformat.outputdir* standard hadoop property.

[But spark reads this property from the 
configuration.|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
 while setting up an OutputCommitter
{code:javascript}
val committer = FileCommitProtocol.instantiate(
  className = classOf[HadoopMapReduceCommitProtocol].getName,
  jobId = stageId.toString,
  outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
  isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
committer.setupJob(jobContext)
{code}

In that case if job fails Spark executes 
[committer.abortJob|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L106]
{code:javascript}
committer.abortJob(jobContext)
{code}
... and fails with the following exception
{code}
Can not create a Path from a null string
java.lang.IllegalArgumentException: Can not create a Path from a null string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
  at org.apache.hadoop.fs.Path.(Path.java:135)
  at org.apache.hadoop.fs.Path.(Path.java:89)
  at 
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
  at 
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
  at 
org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at 
org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
{code}
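A hedged sketch of the kind of guard being asked for (not the actual Spark patch): treat a missing mapreduce.output.fileoutputformat.outputdir as "nothing to clean up" instead of building a Path from a null string. The helper names below are hypothetical.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch only: resolve the output directory defensively. Custom OutputFormats
// that never set this property yield None rather than a Path built from null.
def outputPathIfAny(conf: Configuration): Option[Path] =
  Option(conf.get("mapreduce.output.fileoutputformat.outputdir")).map(new Path(_))

// An abort path can then skip staging-directory cleanup when no path is configured:
def cleanupStagingDir(conf: Configuration)(delete: Path => Unit): Unit =
  outputPathIfAny(conf).map(p => new Path(p, ".spark-staging")).foreach(delete)
{code}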



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21547) Spark cleaner cost too many time

2017-07-27 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103459#comment-16103459
 ] 

DjvuLee commented on SPARK-21547:
-

Yes, I agree that this is related to the amount of work being done, but doing nothing for 
about 3 minutes is too long for a streaming application.

My proposal is that we inspect whether the current cleaner strategy is 
good enough.

> Spark cleaner cost too many time
> 
>
> Key: SPARK-21547
> URL: https://issues.apache.org/jira/browse/SPARK-21547
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: DjvuLee
>
> Spark Streaming sometimes spends a lot of time on cleaning, and this can 
> become worse when dynamic allocation is enabled.
> I posted the driver's log in the following comments; there we can see that the 
> cleaner takes more than 2 minutes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21548) Support insert into serial columns of table

2017-07-27 Thread LvDongrong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103336#comment-16103336
 ] 

LvDongrong commented on SPARK-21548:


We are trying to solve it this way: https://github.com/lvdongr/spark/pull/1

> Support insert into serial columns of table
> ---
>
> Key: SPARK-21548
> URL: https://issues.apache.org/jira/browse/SPARK-21548
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: LvDongrong
>
> When we use the 'insert into ...' statement, we can only insert all of the 
> columns into the table. But in some cases, our table has many columns and we are 
> only interested in some of them. So we want to support the statement "insert 
> into table tbl (column1, column2, ...) values (value1, value2, value3, ...)".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21319) UnsafeExternalRowSorter.RowComparator memory leak

2017-07-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21319.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18679
[https://github.com/apache/spark/pull/18679]

> UnsafeExternalRowSorter.RowComparator memory leak
> -
>
> Key: SPARK-21319
> URL: https://issues.apache.org/jira/browse/SPARK-21319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: James Baker
> Fix For: 2.3.0
>
> Attachments: 
> 0001-SPARK-21319-Fix-memory-leak-in-UnsafeExternalRowSort.patch, hprof.png
>
>
> When we wish to sort within partitions, we produce an 
> UnsafeExternalRowSorter. This contains an UnsafeExternalSorter, which 
> contains the UnsafeExternalRowComparator.
> The UnsafeExternalSorter adds a task completion listener which performs any 
> additional required cleanup. The upshot of this is that we maintain a 
> reference to the UnsafeExternalRowSorter.RowComparator until the end of the 
> task.
> The RowComparator looks like
> {code:java}
>   private static final class RowComparator extends RecordComparator {
> private final Ordering ordering;
> private final int numFields;
> private final UnsafeRow row1;
> private final UnsafeRow row2;
> RowComparator(Ordering ordering, int numFields) {
>   this.numFields = numFields;
>   this.row1 = new UnsafeRow(numFields);
>   this.row2 = new UnsafeRow(numFields);
>   this.ordering = ordering;
> }
> @Override
> public int compare(Object baseObj1, long baseOff1, Object baseObj2, long 
> baseOff2) {
>   // TODO: Why are the sizes -1?
>   row1.pointTo(baseObj1, baseOff1, -1);
>   row2.pointTo(baseObj2, baseOff2, -1);
>   return ordering.compare(row1, row2);
> }
> }
> {code}
> which means that this will retain references to the last baseObjs that were 
> passed in, without tracking them for purposes of memory allocation.
> We have a job which sorts within partitions and then coalesces partitions - 
> this has a tendency to OOM because of the references to old UnsafeRows that 
> were used during the sorting.
> Attached is a screenshot of a memory dump during a task - our JVM has two 
> executor threads.
> It can be seen that we have 2 references inside of row iterators, and 11 more 
> which are only known in the task completion listener or as part of memory 
> management.
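For illustration (this may differ from what the merged pull request actually does), one way to stop the retained comparator from pinning the last compared pages is to avoid keeping the reusable UnsafeRows as long-lived fields, trading two small allocations per comparison for not holding stale references:

{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.util.collection.unsafe.sort.RecordComparator

// Sketch only: the rows are created per call, so no reference to the last
// compared base objects survives the comparison.
final class NonRetainingRowComparator(
    ordering: Ordering[InternalRow],
    numFields: Int) extends RecordComparator {

  override def compare(baseObj1: AnyRef, baseOff1: Long,
                       baseObj2: AnyRef, baseOff2: Long): Int = {
    val row1 = new UnsafeRow(numFields)
    val row2 = new UnsafeRow(numFields)
    row1.pointTo(baseObj1, baseOff1, -1)
    row2.pointTo(baseObj2, baseOff2, -1)
    ordering.compare(row1, row2)
  }
}
{code}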



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21319) UnsafeExternalRowSorter.RowComparator memory leak

2017-07-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21319:
---

Assignee: Wenchen Fan

> UnsafeExternalRowSorter.RowComparator memory leak
> -
>
> Key: SPARK-21319
> URL: https://issues.apache.org/jira/browse/SPARK-21319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: James Baker
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
> Attachments: 
> 0001-SPARK-21319-Fix-memory-leak-in-UnsafeExternalRowSort.patch, hprof.png
>
>
> When we wish to sort within partitions, we produce an 
> UnsafeExternalRowSorter. This contains an UnsafeExternalSorter, which 
> contains the UnsafeExternalRowComparator.
> The UnsafeExternalSorter adds a task completion listener which performs any 
> additional required cleanup. The upshot of this is that we maintain a 
> reference to the UnsafeExternalRowSorter.RowComparator until the end of the 
> task.
> The RowComparator looks like
> {code:java}
>   private static final class RowComparator extends RecordComparator {
> private final Ordering ordering;
> private final int numFields;
> private final UnsafeRow row1;
> private final UnsafeRow row2;
> RowComparator(Ordering ordering, int numFields) {
>   this.numFields = numFields;
>   this.row1 = new UnsafeRow(numFields);
>   this.row2 = new UnsafeRow(numFields);
>   this.ordering = ordering;
> }
> @Override
> public int compare(Object baseObj1, long baseOff1, Object baseObj2, long 
> baseOff2) {
>   // TODO: Why are the sizes -1?
>   row1.pointTo(baseObj1, baseOff1, -1);
>   row2.pointTo(baseObj2, baseOff2, -1);
>   return ordering.compare(row1, row2);
> }
> }
> {code}
> which means that this will retain references to the last baseObjs that were 
> passed in, without tracking them for purposes of memory allocation.
> We have a job which sorts within partitions and then coalesces partitions - 
> this has a tendency to OOM because of the references to old UnsafeRows that 
> were used during the sorting.
> Attached is a screenshot of a memory dump during a task - our JVM has two 
> executor threads.
> It can be seen that we have 2 references inside of row iterators, and 11 more 
> which are only known in the task completion listener or as part of memory 
> management.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21539) Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger then current allocated number by yarn

2017-07-27 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang closed SPARK-21539.

Resolution: Duplicate

> Job should not be aborted when dynamic allocation is enabled or 
> spark.executor.instances larger then current allocated number by yarn
> -
>
> Key: SPARK-21539
> URL: https://issues.apache.org/jira/browse/SPARK-21539
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> For Spark on YARN.
> Right now, a TaskSet may be unable to run on any node or host, which means 
> blacklistedEverywhere is true in TaskSetManager#abortIfCompleteBlacklisted.
> However, if dynamic allocation is enabled, we should wait for YARN to 
> allocate a new NodeManager in order to execute the job successfully.
> How to reproduce?
> 1. Set up a YARN cluster with 5 nodes, and give node1 much larger CPU 
> and memory, so that YARN can launch containers on this node even though it is 
> blacklisted by the TaskScheduler.
> 2. Modify BlockManager#registerWithExternalShuffleServer
> {code:java}
> logInfo("Registering executor with local external shuffle service.")
> val shuffleConfig = new ExecutorShuffleInfo(
>   diskBlockManager.localDirs.map(_.toString),
>   diskBlockManager.subDirsPerLocalDir,
>   shuffleManager.getClass.getName)
> val MAX_ATTEMPTS = conf.get(config.SHUFFLE_REGISTRATION_MAX_ATTEMPTS)
> val SLEEP_TIME_SECS = 5
> for (i <- 1 to MAX_ATTEMPTS) {
>   try {
> {color:red}if (shuffleServerId.host.equals("node1's address")) {
>  throw new Exception
> }{color}
> // Synchronous and will throw an exception if we cannot connect.
> 
> shuffleClient.asInstanceOf[ExternalShuffleClient].registerWithShuffleServer(
>   shuffleServerId.host, shuffleServerId.port, 
> shuffleServerId.executorId, shuffleConfig)
> return
>   } catch {
> case e: Exception if i < MAX_ATTEMPTS =>
>   logError(s"Failed to connect to external shuffle server, will retry 
> ${MAX_ATTEMPTS - i}"
> + s" more times after waiting $SLEEP_TIME_SECS seconds...", e)
>   Thread.sleep(SLEEP_TIME_SECS * 1000)
> case NonFatal(e) =>
>   throw new SparkException("Unable to register with external shuffle 
> server due to : " +
> e.getMessage, e)
>   }
> }
> {code}
> adding the logic shown in red.
> 3. Enable the shuffle service and turn on the external shuffle service for YARN.
> YARN will then always launch executors on node1, but they fail because the shuffle 
> service cannot register successfully.
> The job is then aborted.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19270) Add summary table to GLM summary

2017-07-27 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-19270.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Add summary table to GLM summary
> 
>
> Key: SPARK-19270
> URL: https://issues.apache.org/jira/browse/SPARK-19270
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>Priority: Minor
> Fix For: 2.3.0
>
>
> Add R-like summary table to GLM summary, which includes feature name (if 
> exist), parameter estimate, standard error, t-stat and p-value. This allows 
> scala users to easily gather these commonly used inference results. 
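A hedged usage sketch, assuming a DataFrame named trainingData with the usual features/label columns; the per-statistic accessors shown already exist on the training summary, and the new summary table aggregates them in one place:

{code:scala}
import org.apache.spark.ml.regression.GeneralizedLinearRegression

val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")

val model = glr.fit(trainingData)   // trainingData: DataFrame with "features" and "label"
val summary = model.summary

// The pieces an R-like summary table is built from:
println(summary.coefficientStandardErrors.mkString(", "))
println(summary.tValues.mkString(", "))
println(summary.pValues.mkString(", "))
{code}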



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19700) Design an API for pluggable scheduler implementations

2017-07-27 Thread Rob Genova (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103228#comment-16103228
 ] 

Rob Genova commented on SPARK-19700:


FYI, The Apache Spark fork enhanced to support HashiCorp Nomad as a scheduler 
is now located at: https://github.com/hashicorp/nomad-spark. If you are 
interested in trying it out, the best place to get started is: 
https://www.nomadproject.io/guides/spark/spark.html.

> Design an API for pluggable scheduler implementations
> -
>
> Key: SPARK-19700
> URL: https://issues.apache.org/jira/browse/SPARK-19700
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Matt Cheah
>
> One point that was brought up in discussing SPARK-18278 was that schedulers 
> cannot easily be added to Spark without forking the whole project. The main 
> reason is that much of the scheduler's behavior fundamentally depends on the 
> CoarseGrainedSchedulerBackend class, which is not part of the public API of 
> Spark and is in fact quite a complex module. As resource management and 
> allocation continues evolves, Spark will need to be integrated with more 
> cluster managers, but maintaining support for all possible allocators in the 
> Spark project would be untenable. Furthermore, it would be impossible for 
> Spark to support proprietary frameworks that are developed by specific users 
> for their other particular use cases.
> Therefore, this ticket proposes making scheduler implementations fully 
> pluggable. The idea is that Spark will provide a Java/Scala interface that is 
> to be implemented by a scheduler that is backed by the cluster manager of 
> interest. The user can compile their scheduler's code into a JAR that is 
> placed on the driver's classpath. Finally, as is the case in the current 
> world, the scheduler implementation is selected and dynamically loaded 
> depending on the user's provided master URL.
> Determining the correct API is the most challenging problem. The current 
> CoarseGrainedSchedulerBackend handles many responsibilities, some of which 
> will be common across all cluster managers, and some which will be specific 
> to a particular cluster manager. For example, the particular mechanism for 
> creating the executor processes will differ between YARN and Mesos, but, once 
> these executors have started running, the means to submit tasks to them over 
> the Netty RPC is identical across the board.
> We must also consider a plugin model and interface for submitting the 
> application as well, because different cluster managers support different 
> configuration options, and thus the driver must be bootstrapped accordingly. 
> For example, in YARN mode the application and Hadoop configuration must be 
> packaged and shipped to the distributed cache prior to launching the job. A 
> prototype of a Kubernetes implementation starts a Kubernetes pod that runs 
> the driver in cluster mode.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20992) Support for Nomad as scheduler backend

2017-07-27 Thread Rob Genova (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103224#comment-16103224
 ] 

Rob Genova commented on SPARK-20992:


The Apache Spark fork enhanced to support HashiCorp Nomad as a scheduler is now 
located at: https://github.com/hashicorp/nomad-spark. If you are interested in 
trying it out, the best place to get started is here: 
https://www.nomadproject.io/guides/spark/spark.html.

> Support for Nomad as scheduler backend
> --
>
> Key: SPARK-20992
> URL: https://issues.apache.org/jira/browse/SPARK-20992
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Ben Barnard
>
> It is convenient to have scheduler backend support for running applications 
> on [Nomad|https://github.com/hashicorp/nomad], as with YARN and Mesos, so 
> that users can run Spark applications on a Nomad cluster without the need to 
> bring up a Spark Standalone cluster in the Nomad cluster.
> Both client and cluster deploy modes should be supported.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-07-27 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103202#comment-16103202
 ] 

xinzhang commented on SPARK-21067:
--

Same here.
The problem reappeared in the Spark 2.1.0 thriftserver:

Open Beeline Session 1
Create Table 1 (Success)
Open Beeline Session 2
Create Table 2 (Success)
Close Beeline Session 1
Create Table 3 in Beeline Session 2 (FAIL)

When using parquet, the issue is not present.

Wenchen Fan

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021) which states that the 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
> at 
> 

[jira] [Comment Edited] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-07-27 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103202#comment-16103202
 ] 

xinzhang edited comment on SPARK-21067 at 7/27/17 1:27 PM:
---

Same here.
The problem reappeared in the Spark 2.1.0 thriftserver:

Open Beeline Session 1
Create Table 1 (Success)
Open Beeline Session 2
Create Table 2 (Success)
Close Beeline Session 1
Create Table 3 in Beeline Session 2 (FAIL)

When using parquet, the issue is not present.

@Wenchen Fan


was (Author: zhangxin0112zx):
same here.
problem  reappeared in Spark 2.1.0 thriftserver :

Open Beeline Session 1
Create Table 1 (Success)
Open Beeline Session 2
Create Table 2 (Success)
Close Beeline Session 1
Create Table 3 in Beeline Session 2 (FAIL)

use parquet,  the issue is not present .

Wenchen Fan

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021) which states that the 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> 

[jira] [Created] (SPARK-21548) Support insert into serial columns of table

2017-07-27 Thread LvDongrong (JIRA)
LvDongrong created SPARK-21548:
--

 Summary: Support insert into serial columns of table
 Key: SPARK-21548
 URL: https://issues.apache.org/jira/browse/SPARK-21548
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.2.0
Reporter: LvDongrong


When we use the 'insert into ...' statement, we can only insert all of the columns 
into the table. But in some cases, our table has many columns and we are only 
interested in some of them. So we want to support the statement "insert into table 
tbl (column1, column2, ...) values (value1, value2, value3, ...)".
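Until a column list is supported in INSERT, a workaround sketch in the DataFrame API (the column and table names below are hypothetical): project the partial data onto the table's full schema, filling the omitted columns with nulls, then insert.

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Sketch only: `partial` holds just column1/column2, while table `tbl` also has column3.
def insertPartial(partial: DataFrame): Unit = {
  val full = partial
    .withColumn("column3", lit(null).cast("string"))   // fill the column we don't care about
    .select("column1", "column2", "column3")           // match the table's column order
  full.write.insertInto("tbl")
}
{code}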



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit

2017-07-27 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103152#comment-16103152
 ] 

Nick Pentreath commented on SPARK-21535:


Isn't this in direct opposition to 
https://issues.apache.org/jira/browse/SPARK-21086?

> Reduce memory requirement for CrossValidator and TrainValidationSplit 
> --
>
> Key: SPARK-21535
> URL: https://issues.apache.org/jira/browse/SPARK-21535
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> CrossValidator and TrainValidationSplit both use 
> {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where 
> epm is Array[ParamMap].
> Even though the training process is sequential, the current implementation 
> consumes extra driver memory to hold the trained models, which is not 
> necessary and often leads to memory exceptions for both CrossValidator and 
> TrainValidationSplit. My proposal is to optimize the training implementation 
> so that each used model can be collected by GC, avoiding unnecessary OOM 
> exceptions.
> E.g. when the grid search space has 12 candidates, the old implementation holds all 12 
> trained models in driver memory at the same time, while the new 
> implementation only needs to hold 1 trained model at a time, and the previous 
> model can be cleared by GC.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21440) Refactor ArrowConverters and add ArrayType and StructType support.

2017-07-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21440.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18655
[https://github.com/apache/spark/pull/18655]

> Refactor ArrowConverters and add ArrayType and StructType support.
> --
>
> Key: SPARK-21440
> URL: https://issues.apache.org/jira/browse/SPARK-21440
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
> Fix For: 2.3.0
>
>
> This is a refactoring of {{ArrowConverters}} and related classes.
> # Refactor {{ColumnWriter}} as {{ArrowWriter}}.
> # Add {{ArrayType}} and {{StructType}} support.
> # Refactor {{ArrowConverters}} to skip intermediate {{ArrowRecordBatch}} 
> creation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21440) Refactor ArrowConverters and add ArrayType and StructType support.

2017-07-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21440:
---

Assignee: Takuya Ueshin

> Refactor ArrowConverters and add ArrayType and StructType support.
> --
>
> Key: SPARK-21440
> URL: https://issues.apache.org/jira/browse/SPARK-21440
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 2.3.0
>
>
> This is a refactoring of {{ArrowConverters}} and related classes.
> # Refactor {{ColumnWriter}} as {{ArrowWriter}}.
> # Add {{ArrayType}} and {{StructType}} support.
> # Refactor {{ArrowConverters}} to skip intermediate {{ArrowRecordBatch}} 
> creation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15142) Spark Mesos dispatcher becomes unusable when the Mesos master restarts

2017-07-27 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103035#comment-16103035
 ] 

Stavros Kontopoulos edited comment on SPARK-15142 at 7/27/17 10:12 AM:
---

[~devaraj.k] Thanks. By the way, I was not able to reproduce the problem with drivers 
being queued. After the master was back up, things went fine, and I was able to launch 
more drivers.  https://gist.github.com/skonto/0e23af643e7271e0125e321b04e630a2


was (Author: skonto):
[~devaraj.k] Thnx. Btw I was not able to reproduce the problem with drivers 
being queued. After master was up then things went fine. I was able to launch 
more drivers.  https://gist.github.com/skonto/0e23af643e7271e0125e321b04e630a2

> Spark Mesos dispatcher becomes unusable when the Mesos master restarts
> --
>
> Key: SPARK-15142
> URL: https://issues.apache.org/jira/browse/SPARK-15142
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Mesos
>Reporter: Devaraj K
>Priority: Minor
> Attachments: 
> spark-devaraj-org.apache.spark.deploy.mesos.MesosClusterDispatcher-1-stobdtserver5.out
>
>
> While Spark Mesos dispatcher running if the Mesos master gets restarted then 
> Spark Mesos dispatcher will keep running and queues up all the submitted 
> applications and will not launch them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15142) Spark Mesos dispatcher becomes unusable when the Mesos master restarts

2017-07-27 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103035#comment-16103035
 ] 

Stavros Kontopoulos commented on SPARK-15142:
-

[~devaraj.k] Thanks. By the way, I was not able to reproduce the problem with drivers 
being queued. After the master was back up, things went fine, and I was able to launch 
more drivers.  https://gist.github.com/skonto/0e23af643e7271e0125e321b04e630a2

> Spark Mesos dispatcher becomes unusable when the Mesos master restarts
> --
>
> Key: SPARK-15142
> URL: https://issues.apache.org/jira/browse/SPARK-15142
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Mesos
>Reporter: Devaraj K
>Priority: Minor
> Attachments: 
> spark-devaraj-org.apache.spark.deploy.mesos.MesosClusterDispatcher-1-stobdtserver5.out
>
>
> While Spark Mesos dispatcher running if the Mesos master gets restarted then 
> Spark Mesos dispatcher will keep running and queues up all the submitted 
> applications and will not launch them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21547) Spark cleaner cost too many time

2017-07-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103021#comment-16103021
 ] 

Sean Owen commented on SPARK-21547:
---

It's not clear that this is "too slow", relative to whatever work it's doing. 
Why is it unnecessarily slow, or what change are you proposing?

> Spark cleaner cost too many time
> 
>
> Key: SPARK-21547
> URL: https://issues.apache.org/jira/browse/SPARK-21547
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: DjvuLee
>
> Spark Streaming sometimes spends a lot of time on cleaning, and this can 
> become worse when dynamic allocation is enabled.
> I posted the driver's log in the comments below; it shows that the cleaner 
> takes more than 2 minutes.
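
For context, the blocking behaviour of this cleanup phase is controlled by a 
couple of configuration properties; below is a minimal sketch of setting them 
on a streaming application. The application name and the chosen values are 
illustrative only, not a recommendation for this particular workload.

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative settings: make non-shuffle cleanup non-blocking and run the
// periodic GC more often so ContextCleaner has smaller batches to process.
val conf = new SparkConf()
  .setAppName("cleaner-tuning-sketch")
  .set("spark.cleaner.referenceTracking.blocking", "false") // default: true
  .set("spark.cleaner.periodicGC.interval", "10min")        // default: 30min

val ssc = new StreamingContext(conf, Seconds(30))
{code}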



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21547) Spark cleaner cost too many time

2017-07-27 Thread DjvuLee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DjvuLee updated SPARK-21547:

Description: 
Spark Streaming sometimes spends a lot of time on cleaning, and this can 
become worse when dynamic allocation is enabled.

I posted the driver's log in the comments below; it shows that the cleaner 
takes more than 2 minutes.

  was:
Spark Streaming sometime cost so many time deal with cleaning, and this can 
become worse when enable the dynamic allocation.

I post the Driver Log in the following, in this log we can find that the 
cleaner cost more than 2min.


> Spark cleaner cost too many time
> 
>
> Key: SPARK-21547
> URL: https://issues.apache.org/jira/browse/SPARK-21547
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: DjvuLee
>
> Spark Streaming sometimes spends a lot of time on cleaning, and this can 
> become worse when dynamic allocation is enabled.
> I posted the driver's log in the comments below; it shows that the cleaner 
> takes more than 2 minutes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21547) Spark cleaner cost too many time

2017-07-27 Thread DjvuLee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DjvuLee updated SPARK-21547:

Description: 
Spark Streaming sometimes spends a lot of time on cleaning, and this can 
become worse when dynamic allocation is enabled.

I posted the driver log below; in this log we can see that the cleaner takes 
more than 2 minutes.

  was:Spark Streaming sometime cost so many time deal with cleaning, and this 
can become worse when enable the dynamic allocation.


> Spark cleaner cost too many time
> 
>
> Key: SPARK-21547
> URL: https://issues.apache.org/jira/browse/SPARK-21547
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: DjvuLee
>
> Spark Streaming sometimes spends a lot of time on cleaning, and this can 
> become worse when dynamic allocation is enabled.
> I posted the driver log below; in this log we can see that the cleaner takes 
> more than 2 minutes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21547) Spark cleaner cost too many time

2017-07-27 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103005#comment-16103005
 ] 

DjvuLee commented on SPARK-21547:
-

17/07/27 11:29:51 INFO TaskSetManager: Finished task 169.0 in stage 1504.0 (TID 
1504369) in 43975 ms on n6-195-137.byted.org (999/1000)
17/07/27 11:29:55 INFO TaskSetManager: Finished task 882.0 in stage 1504.0 (TID 
1504905) in 44153 ms on n6-195-137.byted.org (1000/1000)
17/07/27 11:29:55 INFO YarnScheduler: Removed TaskSet 1504.0, whose tasks have 
all completed, from pool
17/07/27 11:29:55 INFO DAGScheduler: ResultStage 1504 (call at 
/spark2/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py:2230) finished in 
457.863 s
17/07/27 11:29:55 INFO DAGScheduler: Job 1504 finished: call at 
/spark2/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py:2230, took 
457.877969 s
17/07/27 11:30:02 INFO JobScheduler: Added jobs for time 150112620 ms
17/07/27 11:30:32 INFO JobScheduler: Added jobs for time 150112623 ms
17/07/27 11:31:02 INFO JobScheduler: Added jobs for time 150112626 ms
17/07/27 11:31:32 INFO JobScheduler: Added jobs for time 150112629 ms
17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906391
17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906392
17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906396
17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906402
17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906404
17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 12492509
17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 12492508
17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 12492507
17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 12492506
17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 12492505
17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 12492504
17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 12492503
17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 12492502
...
7/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906397
17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906398
17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906395
17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906399
17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906403
17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906400
17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906401
17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on 
10.6.131.75:23734 in memory (size: 35.9 KB, free: 2.4 GB)
17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on 
n8-157-227.byted.org:13090 in memory (size: 35.9 KB, free: 9.4 GB)
17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on 
n8-157-158.byted.org:21120 in memory (size: 35.9 KB, free: 9.4 GB)
17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on 
n6-195-150.byted.org:13277 in memory (size: 35.9 KB, free: 9.4 GB)
17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on 
n8-156-165.byted.org:35355 in memory (size: 35.9 KB, free: 9.4 GB)
17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on 
n6-132-023.byted.org:52521 in memory (size: 35.9 KB, free: 9.4 GB)
17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on 
n8-136-133.byted.org:25696 in memory (size: 35.9 KB, free: 9.4 GB)
17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on 
n8-150-029.byted.org:34673 in memory (size: 35.9 KB, free: 9.4 GB)
17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on 
n8-148-038.byted.org:22503 in memory (size: 35.9 KB, free: 9.4 GB)
17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on 
n8-150-038.byted.org:28209 in memory (size: 35.9 KB, free: 9.4 GB)

...

17/07/27 11:32:01 INFO BlockManagerInfo: Removed broadcast_1442_piece0 on 
n8-163-151.byted.org:33703 in memory (size: 35.9 KB, free: 9.4 GB)
17/07/27 11:32:01 INFO BlockManagerInfo: Removed broadcast_1442_piece0 on 
n8-148-028.byted.org:36086 in memory (size: 35.9 KB, free: 9.4 GB)
17/07/27 11:32:01 INFO BlockManagerInfo: Removed broadcast_1442_piece0 on 
n8-151-039.byted.org:21081 in memory (size: 35.9 KB, free: 9.4 GB)
17/07/27 11:32:01 INFO BlockManagerInfo: Removed broadcast_1442_piece0 on 
n8-157-167.byted.org:29370 in memory (size: 35.9 KB, free: 9.4 GB)
17/07/27 11:32:02 INFO JobScheduler: Added jobs for time 150112632 ms
17/07/27 11:32:32 INFO JobScheduler: Added jobs for time 150112635 ms
17/07/27 11:32:45 INFO JobScheduler: Finished job streaming job 150111696 
ms.0 from job set of time 150111696 ms
17/07/27 11:32:45 INFO JobScheduler: Total delay: 9405.183 s for time 
150111696 ms (execution: 1169.595 s)
17/07/27 11:32:45 INFO JobScheduler: Starting job streaming 

[jira] [Updated] (SPARK-21547) Spark cleaner cost too many time

2017-07-27 Thread DjvuLee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DjvuLee updated SPARK-21547:

Description: Spark Streaming sometimes spends a lot of time on cleaning, and 
this can become worse when dynamic allocation is enabled.

> Spark cleaner cost too many time
> 
>
> Key: SPARK-21547
> URL: https://issues.apache.org/jira/browse/SPARK-21547
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: DjvuLee
>
> Spark Streaming sometimes spends a lot of time on cleaning, and this can 
> become worse when dynamic allocation is enabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21547) Spark cleaner cost too many time

2017-07-27 Thread DjvuLee (JIRA)
DjvuLee created SPARK-21547:
---

 Summary: Spark cleaner cost too many time
 Key: SPARK-21547
 URL: https://issues.apache.org/jira/browse/SPARK-21547
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.0.0
Reporter: DjvuLee






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20990) Multi-line support for JSON

2017-07-27 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102975#comment-16102975
 ] 

Marco Gaido commented on SPARK-20990:
-

A PR fixing it is ready: https://github.com/apache/spark/pull/18731.

> Multi-line support for JSON
> ---
>
> Key: SPARK-20990
> URL: https://issues.apache.org/jira/browse/SPARK-20990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> When `multiLine` option is on, the existing JSON parser only reads the first 
> record. We should read the other records in the same file.
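
For context, a minimal sketch of the mode being discussed; the file path is 
made up for illustration:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multiline-json-sketch").getOrCreate()

// A file that contains several pretty-printed (multi-line) JSON records.
// The issue above is that, in this mode, only the first record per file is read.
val df = spark.read
  .option("multiLine", true)      // available since Spark 2.2
  .json("/tmp/records.json")      // hypothetical path

df.show()
{code}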



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10802) Let ALS recommend for subset of data

2017-07-27 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102971#comment-16102971
 ] 

Nick Pentreath commented on SPARK-10802:


For those that may be interested - I opened a PR to add this functionality to 
{{ml}}'s {{ALSModel}} here: https://github.com/apache/spark/pull/18748

> Let ALS recommend for subset of data
> 
>
> Key: SPARK-10802
> URL: https://issues.apache.org/jira/browse/SPARK-10802
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> Currently MatrixFactorizationModel allows to get recommendations for
> - single user 
> - single product 
> - all users
> - all products
> Recommendations for all users/products do a cartesian join inside.
> It would be useful in some cases to get recommendations for a subset of 
> users/products by providing an RDD with which MatrixFactorizationModel could 
> do an intersection before doing the cartesian join. This would make it much 
> faster in situations where recommendations are needed only for a subset of 
> users/products, and where the subset is still too large to make it feasible 
> to recommend one-by-one.
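
For context, a rough sketch of how the subset API proposed in that PR is meant 
to be used. The method name recommendForUserSubset follows the PR; the column 
names and input path are made up, and the API should be treated as illustrative 
until the PR is merged.

{code:scala}
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("als-subset-sketch").getOrCreate()

// Hypothetical ratings data with user, item and rating columns.
val ratings = spark.read.parquet("/tmp/ratings.parquet")

val model = new ALS()
  .setUserCol("user")
  .setItemCol("item")
  .setRatingCol("rating")
  .fit(ratings)

// Recommend the top 10 items only for the users in `someUsers`,
// instead of doing the full cartesian recommendForAllUsers.
val someUsers = ratings.select("user").distinct().limit(100)
val recs = model.recommendForUserSubset(someUsers, 10)
recs.show()
{code}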



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

2017-07-27 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102969#comment-16102969
 ] 

Liang-Chi Hsieh commented on SPARK-21274:
-

[~Tagar] Is the rewrite of INTERSECT ALL correct?

Take the example at 
https://github.com/apache/spark/pull/11106#issuecomment-182603275:
{code}
[1, 2, 2] intersect_all [1, 2] == [1, 2]
[1, 2, 2] intersect_all [1, 2, 2] == [1, 2, 2]
{code}

It looks like the rewrite returns [1, 2, 2] for both queries, doesn't it? Or 
did I misread something?


> Implement EXCEPT ALL and INTERSECT ALL
> --
>
> Key: SPARK-21274
> URL: https://issues.apache.org/jira/browse/SPARK-21274
> Project: Spark
>  Issue Type: New Feature
>  Components: Optimizer, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: set, sql
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following outer join:
> {code}
> SELECT a, b, c
> FROM tab1 t1
>   LEFT OUTER JOIN tab2 t2
>     ON (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
> WHERE
>   COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register this second query as a temp view under the name 
> "*t1_except_t2_df*"; it is also used for INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following anti-join using the t1_except_t2_df we 
> defined above:
> {code}
> SELECT a, b, c
> FROM tab1 t1
> WHERE NOT EXISTS (
>   SELECT 1
>   FROM t1_except_t2_df e
>   WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
> )
> {code}
> So the suggestion is just to use the above query rewrites to implement both 
> EXCEPT ALL and INTERSECT ALL SQL set operations.
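
For what it's worth, the multiplicity question raised in the comment above can 
be checked with a small sketch that runs the proposed rewrites on the 
[1, 2, 2] / [1, 2] example (reduced to a single column; the table and view 
names are made up, and this only exercises the rewrite, it is not a patch):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("intersect-all-rewrite-check").getOrCreate()
import spark.implicits._

Seq(1, 2, 2).toDF("a").createOrReplaceTempView("tab1")
Seq(1, 2).toDF("a").createOrReplaceTempView("tab2")

// Proposed EXCEPT ALL rewrite, reduced to a single column.
spark.sql("""
  SELECT t1.a
  FROM tab1 t1 LEFT OUTER JOIN tab2 t2 ON t1.a = t2.a
  WHERE t2.a IS NULL
""").createOrReplaceTempView("t1_except_t2_df")

// Proposed INTERSECT ALL rewrite on top of it.
val intersectAll = spark.sql("""
  SELECT a FROM tab1 t1
  WHERE NOT EXISTS (SELECT 1 FROM t1_except_t2_df e WHERE t1.a = e.a)
""")

// [1, 2, 2] INTERSECT ALL [1, 2] should give [1, 2], but because the rewritten
// EXCEPT ALL comes back empty here, the anti-join keeps every row of tab1,
// i.e. [1, 2, 2] -- which is exactly the concern raised above.
intersectAll.show()
{code}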



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21546) dropDuplicates with watermark yields RuntimeException due to binding failure

2017-07-27 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-21546:

Description: 
With today's master...

The following streaming query with watermark and {{dropDuplicates}} yields 
{{RuntimeException}} due to failure in binding.

{code}
val topic1 = spark.
  readStream.
  format("kafka").
  option("subscribe", "topic1").
  option("kafka.bootstrap.servers", "localhost:9092").
  option("startingoffsets", "earliest").
  load

val records = topic1.
  withColumn("eventtime", 'timestamp).  // <-- just to put the right name given 
the purpose
  withWatermark(eventTime = "eventtime", delayThreshold = "30 seconds"). // <-- 
use the renamed eventtime column
  dropDuplicates("value").  // dropDuplicates will use watermark
// only when eventTime column exists
  // include the watermark column => internal design leak?
  select('key cast "string", 'value cast "string", 'eventtime).
  as[(String, String, java.sql.Timestamp)]

scala> records.explain
== Physical Plan ==
*Project [cast(key#0 as string) AS key#169, cast(value#1 as string) AS 
value#170, eventtime#157-T3ms]
+- StreamingDeduplicate [value#1], 
StatefulOperatorStateInfo(,93c3de98-3f85-41a4-8aef-d09caf8ea693,0,0), 0
   +- Exchange hashpartitioning(value#1, 200)
  +- EventTimeWatermark eventtime#157: timestamp, interval 30 seconds
 +- *Project [key#0, value#1, timestamp#5 AS eventtime#157]
+- StreamingRelation kafka, [key#0, value#1, topic#2, partition#3, 
offset#4L, timestamp#5, timestampType#6]

import org.apache.spark.sql.streaming.{OutputMode, Trigger}
val sq = records.
  writeStream.
  format("console").
  option("truncate", false).
  trigger(Trigger.ProcessingTime("10 seconds")).
  queryName("from-kafka-topic1-to-console").
  outputMode(OutputMode.Update).
  start
{code}

{code}
---
Batch: 0
---
17/07/27 10:28:58 ERROR Executor: Exception in task 3.0 in stage 13.0 (TID 438)
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: eventtime#157-T3ms
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:977)
at 
org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:370)
at 
org.apache.spark.sql.execution.streaming.StreamingDeduplicateExec.org$apache$spark$sql$execution$streaming$WatermarkSupport$$super$newPredicate(statefulOperators.scala:350)
at 
org.apache.spark.sql.execution.streaming.WatermarkSupport$$anonfun$watermarkPredicateForKeys$1.apply(statefulOperators.scala:160)
at 
org.apache.spark.sql.execution.streaming.WatermarkSupport$$anonfun$watermarkPredicateForKeys$1.apply(statefulOperators.scala:160)
at scala.Option.map(Option.scala:146)
at 
org.apache.spark.sql.execution.streaming.WatermarkSupport$class.watermarkPredicateForKeys(statefulOperators.scala:160)
at 
org.apache.spark.sql.execution.streaming.StreamingDeduplicateExec.watermarkPredicateForKeys$lzycompute(statefulOperators.scala:350)
at 

[jira] [Commented] (SPARK-21544) Test jar of some module should not install or deploy twice

2017-07-27 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102954#comment-16102954
 ] 

zhoukang commented on SPARK-21544:
--

This is not a serious problem, and it actually depends on the Maven 
repository's policy. But after looking into the pom.xml files, I found some 
redundant duplicate executions, so I modified some of them.

> Test jar of some module should not install or deploy twice
> --
>
> Key: SPARK-21544
> URL: https://issues.apache.org/jira/browse/SPARK-21544
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> For the modules below:
> common/network-common
> streaming
> sql/core
> sql/catalyst
> the tests.jar will be installed or deployed twice, like:
> {code:java}
> [INFO] Installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
> [DEBUG] Writing tracking file 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/_remote.repositories
> [DEBUG] Installing 
> org.apache.spark:spark-streaming_2.11:2.1.0-mdh2.1.0.1-SNAPSHOT/maven-metadata.xml
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/maven-metadata-local.xml
> [DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml 
> to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml
> [INFO] Installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
> [DEBUG] Skipped re-installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar,
>  seems unchanged
> {code}
> The reason is below:
> {code:java}
> [DEBUG]   (f) artifact = 
> org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT
> [DEBUG]   (f) attachedArtifacts = 
> [org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT,
>  org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, 
> org.apache.spark:spark
> -streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, 
> org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT,
>  org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0
> -mdh2.1.0.1-SNAPSHOT]
> {code}
> When executing 'mvn deploy' to Nexus during a release, it will fail since the 
> release Nexus repository cannot be overwritten.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21546) dropDuplicates with watermark yields RuntimeException due to binding failure

2017-07-27 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-21546:

Summary: dropDuplicates with watermark yields RuntimeException due to 
binding failure  (was: dropDuplicates followed by select yields 
RuntimeException due to binding failure)

> dropDuplicates with watermark yields RuntimeException due to binding failure
> 
>
> Key: SPARK-21546
> URL: https://issues.apache.org/jira/browse/SPARK-21546
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>
> With today's master...
> The following streaming query yields {{RuntimeException}} due to failure in 
> binding (most likely due to {{select}} operator).
> {code}
> val topic1 = spark.
>   readStream.
>   format("kafka").
>   option("subscribe", "topic1").
>   option("kafka.bootstrap.servers", "localhost:9092").
>   option("startingoffsets", "earliest").
>   load
> val records = topic1.
>   withColumn("eventtime", 'timestamp).  // <-- just to put the right name 
> given the purpose
>   withWatermark(eventTime = "eventtime", delayThreshold = "30 seconds"). // 
> <-- use the renamed eventtime column
>   dropDuplicates("value").  // dropDuplicates will use watermark
> // only when eventTime column exists
>   // include the watermark column => internal design leak?
>   select('key cast "string", 'value cast "string", 'eventtime).
>   as[(String, String, java.sql.Timestamp)]
> scala> records.explain
> == Physical Plan ==
> *Project [cast(key#0 as string) AS key#169, cast(value#1 as string) AS 
> value#170, eventtime#157-T3ms]
> +- StreamingDeduplicate [value#1], 
> StatefulOperatorStateInfo(,93c3de98-3f85-41a4-8aef-d09caf8ea693,0,0),
>  0
>+- Exchange hashpartitioning(value#1, 200)
>   +- EventTimeWatermark eventtime#157: timestamp, interval 30 seconds
>  +- *Project [key#0, value#1, timestamp#5 AS eventtime#157]
> +- StreamingRelation kafka, [key#0, value#1, topic#2, 
> partition#3, offset#4L, timestamp#5, timestampType#6]
> import org.apache.spark.sql.streaming.{OutputMode, Trigger}
> val sq = records.
>   writeStream.
>   format("console").
>   option("truncate", false).
>   trigger(Trigger.ProcessingTime("10 seconds")).
>   queryName("from-kafka-topic1-to-console").
>   outputMode(OutputMode.Update).
>   start
> {code}
> {code}
> ---
> Batch: 0
> ---
> 17/07/27 10:28:58 ERROR Executor: Exception in task 3.0 in stage 13.0 (TID 
> 438)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: eventtime#157-T3ms
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:977)
>   at 
> org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:370)
>   at 
> 

[jira] [Commented] (SPARK-21544) Test jar of some module should not install or deploy twice

2017-07-27 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102950#comment-16102950
 ] 

zhoukang commented on SPARK-21544:
--

This depends on the Maven repository's policy, i.e. whether artifacts can be 
overwritten or not.


> Test jar of some module should not install or deploy twice
> --
>
> Key: SPARK-21544
> URL: https://issues.apache.org/jira/browse/SPARK-21544
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> For the modules below:
> common/network-common
> streaming
> sql/core
> sql/catalyst
> the tests.jar will be installed or deployed twice, like:
> {code:java}
> [INFO] Installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
> [DEBUG] Writing tracking file 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/_remote.repositories
> [DEBUG] Installing 
> org.apache.spark:spark-streaming_2.11:2.1.0-mdh2.1.0.1-SNAPSHOT/maven-metadata.xml
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/maven-metadata-local.xml
> [DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml 
> to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml
> [INFO] Installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
> [DEBUG] Skipped re-installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar,
>  seems unchanged
> {code}
> The reason is below:
> {code:java}
> [DEBUG]   (f) artifact = 
> org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT
> [DEBUG]   (f) attachedArtifacts = 
> [org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT,
>  org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, 
> org.apache.spark:spark
> -streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, 
> org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT,
>  org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0
> -mdh2.1.0.1-SNAPSHOT]
> {code}
> When executing 'mvn deploy' to Nexus during a release, it will fail since the 
> release Nexus repository cannot be overwritten.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21546) dropDuplicates followed by select yields RuntimeException due to binding failure

2017-07-27 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-21546:
---

 Summary: dropDuplicates followed by select yields RuntimeException 
due to binding failure
 Key: SPARK-21546
 URL: https://issues.apache.org/jira/browse/SPARK-21546
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Jacek Laskowski


With today's master...

The following streaming query yields {{RuntimeException}} due to failure in 
binding (most likely due to {{select}} operator).

{code}
val topic1 = spark.
  readStream.
  format("kafka").
  option("subscribe", "topic1").
  option("kafka.bootstrap.servers", "localhost:9092").
  option("startingoffsets", "earliest").
  load

val records = topic1.
  withColumn("eventtime", 'timestamp).  // <-- just to put the right name given 
the purpose
  withWatermark(eventTime = "eventtime", delayThreshold = "30 seconds"). // <-- 
use the renamed eventtime column
  dropDuplicates("value").  // dropDuplicates will use watermark
// only when eventTime column exists
  // include the watermark column => internal design leak?
  select('key cast "string", 'value cast "string", 'eventtime).
  as[(String, String, java.sql.Timestamp)]

scala> records.explain
== Physical Plan ==
*Project [cast(key#0 as string) AS key#169, cast(value#1 as string) AS 
value#170, eventtime#157-T3ms]
+- StreamingDeduplicate [value#1], 
StatefulOperatorStateInfo(,93c3de98-3f85-41a4-8aef-d09caf8ea693,0,0), 0
   +- Exchange hashpartitioning(value#1, 200)
  +- EventTimeWatermark eventtime#157: timestamp, interval 30 seconds
 +- *Project [key#0, value#1, timestamp#5 AS eventtime#157]
+- StreamingRelation kafka, [key#0, value#1, topic#2, partition#3, 
offset#4L, timestamp#5, timestampType#6]

import org.apache.spark.sql.streaming.{OutputMode, Trigger}
val sq = records.
  writeStream.
  format("console").
  option("truncate", false).
  trigger(Trigger.ProcessingTime("10 seconds")).
  queryName("from-kafka-topic1-to-console").
  outputMode(OutputMode.Update).
  start
{code}

{code}
---
Batch: 0
---
17/07/27 10:28:58 ERROR Executor: Exception in task 3.0 in stage 13.0 (TID 438)
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: eventtime#157-T3ms
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:977)
at 
org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:370)
at 
org.apache.spark.sql.execution.streaming.StreamingDeduplicateExec.org$apache$spark$sql$execution$streaming$WatermarkSupport$$super$newPredicate(statefulOperators.scala:350)
at 
org.apache.spark.sql.execution.streaming.WatermarkSupport$$anonfun$watermarkPredicateForKeys$1.apply(statefulOperators.scala:160)
at 
org.apache.spark.sql.execution.streaming.WatermarkSupport$$anonfun$watermarkPredicateForKeys$1.apply(statefulOperators.scala:160)
at scala.Option.map(Option.scala:146)
at 

[jira] [Commented] (SPARK-21543) Should not count executor initialize failed towards task failures

2017-07-27 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102937#comment-16102937
 ] 

zhoukang commented on SPARK-21543:
--

And in my production cluster there was a case where executor initialization 
failed because of a bad disk (it could not register with the shuffle server); 
after 4 such failures the job exited.
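
For context, the 4-attempt limit above is presumably spark.task.maxFailures 
(default 4). Below is a minimal sketch of raising it as a stopgap while the 
counting behaviour is discussed; the value and application name are 
illustrative only.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Stopgap only: tolerate more failures per task so that a handful of executors
// dying during initialization does not immediately fail the whole job.
val conf = new SparkConf()
  .setAppName("max-failures-sketch")
  .set("spark.task.maxFailures", "8") // default: 4
val sc = new SparkContext(conf)
{code}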

> Should not count executor initialize failed towards task failures
> -
>
> Key: SPARK-21543
> URL: https://issues.apache.org/jira/browse/SPARK-21543
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> Currently, when executor initialization fails and the executor exits with 
> error code 1, this counts toward task failures. I think executor 
> initialization failures should not be counted as task failures.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21543) Should not count executor initialize failed towards task failures

2017-07-27 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102934#comment-16102934
 ] 

zhoukang commented on SPARK-21543:
--

From my understanding, when executor initialization fails, no task has 
actually been launched yet.

> Should not count executor initialize failed towards task failures
> -
>
> Key: SPARK-21543
> URL: https://issues.apache.org/jira/browse/SPARK-21543
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> Currently, when executor initialization fails and the executor exits with 
> error code 1, this counts toward task failures. I think executor 
> initialization failures should not be counted as task failures.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21544) Test jar of some module should not install or deploy twice

2017-07-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102929#comment-16102929
 ] 

Sean Owen commented on SPARK-21544:
---

Spark is successfully released to Maven Central by its maintainers.  I am not 
sure it's meant to be something you can release yourself via a different 
mechanism. I'm also not clear what the impact is on the Spark release process.

> Test jar of some module should not install or deploy twice
> --
>
> Key: SPARK-21544
> URL: https://issues.apache.org/jira/browse/SPARK-21544
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> For the modules below:
> common/network-common
> streaming
> sql/core
> sql/catalyst
> the tests.jar will be installed or deployed twice, like:
> {code:java}
> [INFO] Installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
> [DEBUG] Writing tracking file 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/_remote.repositories
> [DEBUG] Installing 
> org.apache.spark:spark-streaming_2.11:2.1.0-mdh2.1.0.1-SNAPSHOT/maven-metadata.xml
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/maven-metadata-local.xml
> [DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml 
> to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml
> [INFO] Installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
> [DEBUG] Skipped re-installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar,
>  seems unchanged
> {code}
> The reason is below:
> {code:java}
> [DEBUG]   (f) artifact = 
> org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT
> [DEBUG]   (f) attachedArtifacts = 
> [org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT,
>  org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, 
> org.apache.spark:spark
> -streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, 
> org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT,
>  org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0
> -mdh2.1.0.1-SNAPSHOT]
> {code}
> When executing 'mvn deploy' to Nexus during a release, it will fail since the 
> release Nexus repository cannot be overwritten.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21544) Test jar of some module should not install or deploy twice

2017-07-27 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102918#comment-16102918
 ] 

zhoukang commented on SPARK-21544:
--

I have updated the log. This causes our deploy to Nexus to fail: when 
executing 'mvn deploy', the tests.jar is uploaded twice. For release 
repositories we cannot upload the same artifact twice, so the deploy fails.
A workaround is to deploy with the -pl option, but I don't think that is a 
clean solution.

> Test jar of some module should not install or deploy twice
> --
>
> Key: SPARK-21544
> URL: https://issues.apache.org/jira/browse/SPARK-21544
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> For the modules below:
> common/network-common
> streaming
> sql/core
> sql/catalyst
> the tests.jar will be installed or deployed twice, like:
> {code:java}
> [INFO] Installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
> [DEBUG] Writing tracking file 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/_remote.repositories
> [DEBUG] Installing 
> org.apache.spark:spark-streaming_2.11:2.1.0-mdh2.1.0.1-SNAPSHOT/maven-metadata.xml
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/maven-metadata-local.xml
> [DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml 
> to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml
> [INFO] Installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
> [DEBUG] Skipped re-installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar,
>  seems unchanged
> {code}
> The reason is below:
> {code:java}
> [DEBUG]   (f) artifact = 
> org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT
> [DEBUG]   (f) attachedArtifacts = 
> [org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT,
>  org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, 
> org.apache.spark:spark
> -streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, 
> org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT,
>  org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0
> -mdh2.1.0.1-SNAPSHOT]
> {code}
> When executing 'mvn deploy' to Nexus during a release, it will fail since the 
> release Nexus repository cannot be overwritten.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21544) Test jar of some module should not install or deploy twice

2017-07-27 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-21544:
-
Description: 
For the modules below:
common/network-common
streaming
sql/core
sql/catalyst
the tests.jar will be installed or deployed twice, like:

{code:java}
[INFO] Installing 
/home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
 to 
/home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
[DEBUG] Writing tracking file 
/home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/_remote.repositories
[DEBUG] Installing 
org.apache.spark:spark-streaming_2.11:2.1.0-mdh2.1.0.1-SNAPSHOT/maven-metadata.xml
 to 
/home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/maven-metadata-local.xml
[DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml to 
/home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml
[INFO] Installing 
/home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
 to 
/home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
[DEBUG] Skipped re-installing 
/home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
 to 
/home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar,
 seems unchanged
{code}
The reason is below:

{code:java}
[DEBUG]   (f) artifact = 
org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT
[DEBUG]   (f) attachedArtifacts = 
[org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT,
 org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, 
org.apache.spark:spark
-streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, 
org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT,
 org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0
-mdh2.1.0.1-SNAPSHOT]
{code}
When executing 'mvn deploy' to Nexus during a release, it will fail since the 
release Nexus repository cannot be overwritten.


  was:
For moudle below:
common/network-common
streaming
sql/core
sql/catalyst
tests.jar will install or deploy twice.Like:

{code:java}
[DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml to 
/home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml
[INFO] Installing 
/home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
 to 
/home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
[DEBUG] Skipped re-installing 
/home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
 to 
/home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar,
 seems unchanged
{code}
The reason is below:

{code:java}
[DEBUG]   (f) artifact = 
org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT
[DEBUG]   (f) attachedArtifacts = 
[org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT,
 org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, 
org.apache.spark:spark
-streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, 
org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT,
 org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0
-mdh2.1.0.1-SNAPSHOT]
{code}
when executing 'mvn deploy' to nexus during release.I will fail since release 
nexus can not be override.



> Test jar of some module should not install or deploy twice
> --
>
> Key: SPARK-21544
> URL: https://issues.apache.org/jira/browse/SPARK-21544
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> For the modules below:
> common/network-common
> streaming
> sql/core
> sql/catalyst
> the tests.jar will be installed or deployed twice, like:
> {code:java}
> [INFO] Installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
> [DEBUG] Writing tracking file 
> 

[jira] [Resolved] (SPARK-21545) pyspark2

2017-07-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21545.
---
  Resolution: Invalid
   Fix Version/s: (was: 2.2.0)
Target Version/s:   (was: 2.2.0)

Please read http://spark.apache.org/contributing.html
This is not the place for questions.

> pyspark2
> 
>
> Key: SPARK-21545
> URL: https://issues.apache.org/jira/browse/SPARK-21545
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: Spark2.2 with CDH5.12,python3.6.1,java jdk1.8_b31.
>Reporter: gumpcheng
>  Labels: cdh, spark2.2
>
> I installed Spark 2.2 following the official steps with CDH 5.12.
> Everything looks okay in Cloudera Manager,
> but I failed to initialize pyspark2.
> My environment: Python 3.6.1, JDK 1.8, CDH 5.12.
> The problem has been driving me crazy for several days,
> and I have found no way to solve it.
> Can anyone help me?
> Thank you very much!
> [hdfs@Master /data/soft/spark2.2]$ pyspark2
> Python 3.6.1 (default, Jul 27 2017, 11:07:01) 
> [GCC 4.4.6 20110731 (Red Hat 4.4.6-4)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/07/27 12:02:09 ERROR spark.SparkContext: Error initializing SparkContext.
> org.apache.spark.SparkException: Yarn application has already ended! It might 
> have been killed or unable to launch application master.
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
>   at org.apache.spark.SparkContext.(SparkContext.scala:509)
>   at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:236)
>   at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
>   at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:748)
> 17/07/27 12:02:09 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
> Attempted to request executors before the AM has registered!
> 17/07/27 12:02:09 ERROR util.Utils: Uncaught exception in thread Thread-2
> java.lang.NullPointerException
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.close(ExternalShuffleClient.java:141)
>   at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1485)
>   at org.apache.spark.SparkEnv.stop(SparkEnv.scala:90)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1937)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317)
>   at org.apache.spark.SparkContext.stop(SparkContext.scala:1936)
>   at org.apache.spark.SparkContext.(SparkContext.scala:587)
>   at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:236)
>   at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
>   at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:748)
> /opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/python/pyspark/shell.py:52:
>  UserWarning: Fall back to non-hive support because failing to access 
> HiveConf, please make 

[jira] [Commented] (SPARK-21544) Test jar of some module should not install or deploy twice

2017-07-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102856#comment-16102856
 ] 

Sean Owen commented on SPARK-21544:
---

What is the problem? I don't see two installations in the log you posted, and 
it's not clear what the effect would be.

> Test jar of some module should not install or deploy twice
> --
>
> Key: SPARK-21544
> URL: https://issues.apache.org/jira/browse/SPARK-21544
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> For the modules below:
> common/network-common
> streaming
> sql/core
> sql/catalyst
> the tests.jar will be installed or deployed twice, like:
> {code:java}
> [DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml 
> to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml
> [INFO] Installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
> [DEBUG] Skipped re-installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar,
>  seems unchanged
> {code}
> The reason is below:
> {code:java}
> [DEBUG]   (f) artifact = 
> org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT
> [DEBUG]   (f) attachedArtifacts = 
> [org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT,
>  org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, 
> org.apache.spark:spark
> -streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, 
> org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT,
>  org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0
> -mdh2.1.0.1-SNAPSHOT]
> {code}
> When executing 'mvn deploy' to Nexus during a release, it will fail since the 
> release Nexus repository cannot be overwritten.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21543) Should not count executor initialize failed towards task failures

2017-07-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102854#comment-16102854
 ] 

Sean Owen commented on SPARK-21543:
---

Why shouldn't it?

> Should not count executor initialize failed towards task failures
> -
>
> Key: SPARK-21543
> URL: https://issues.apache.org/jira/browse/SPARK-21543
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> Currently, when executor initialization fails and the executor exits with 
> error code 1, this counts toward task failures. I think executor 
> initialization failures should not be counted as task failures.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21271) UnsafeRow.hashCode assertion when sizeInBytes not multiple of 8

2017-07-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21271:
---

Assignee: Kazuaki Ishizaki

> UnsafeRow.hashCode assertion when sizeInBytes not multiple of 8
> ---
>
> Key: SPARK-21271
> URL: https://issues.apache.org/jira/browse/SPARK-21271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>Assignee: Kazuaki Ishizaki
> Fix For: 2.3.0
>
>
> The method is:
> {code}
> public int hashCode() {
> return Murmur3_x86_32.hashUnsafeWords(baseObject, baseOffset, 
> sizeInBytes, 42);
>   }
> {code}
> but sizeInBytes is not always a multiple of 8 (in which case hashUnsafeWords 
> throws an assertion error) - for example here: 
> {code}FixedLengthRowBasedKeyValueBatch.appendRow{code}
> The fix could be to use hashUnsafeBytes, or to use hashUnsafeWords on a 
> prefix whose length is a multiple of 8.
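
To make the two proposed options concrete, here is a minimal sketch (not the actual patch from PR 18503), assuming the Murmur3_x86_32 helpers in Spark's unsafe module keep their (base, offset, lengthInBytes, seed) signatures:

{code:scala}
// Hedged sketch: two ways a row hash could avoid the word-alignment assertion.
// Murmur3_x86_32 is assumed to live in org.apache.spark.unsafe.hash.
import org.apache.spark.unsafe.hash.Murmur3_x86_32

object UnsafeRowHashSketch {
  // Option 1: hash byte-by-byte; hashUnsafeBytes has no 8-byte alignment requirement.
  def hashAllBytes(base: AnyRef, offset: Long, sizeInBytes: Int): Int =
    Murmur3_x86_32.hashUnsafeBytes(base, offset, sizeInBytes, 42)

  // Option 2: keep the word-at-a-time hash, but only over the prefix whose length
  // is rounded down to a multiple of 8, so the assertion in hashUnsafeWords holds.
  def hashWordPrefix(base: AnyRef, offset: Long, sizeInBytes: Int): Int =
    Murmur3_x86_32.hashUnsafeWords(base, offset, sizeInBytes & ~0x7, 42)
}
{code}

Option 2 is faster but ignores any trailing bytes, so rows differing only in those bytes would collide; option 1 hashes everything at a small per-byte cost.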



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21271) UnsafeRow.hashCode assertion when sizeInBytes not multiple of 8

2017-07-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21271.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18503
[https://github.com/apache/spark/pull/18503]

> UnsafeRow.hashCode assertion when sizeInBytes not multiple of 8
> ---
>
> Key: SPARK-21271
> URL: https://issues.apache.org/jira/browse/SPARK-21271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
> Fix For: 2.3.0
>
>
> The method is:
> {code}
> public int hashCode() {
> return Murmur3_x86_32.hashUnsafeWords(baseObject, baseOffset, 
> sizeInBytes, 42);
>   }
> {code}
> but sizeInBytes is not always a multiple of 8 (in which case hashUnsafeWords 
> throws an assertion error) - for example here: 
> {code}FixedLengthRowBasedKeyValueBatch.appendRow{code}
> The fix could be to use hashUnsafeBytes, or to use hashUnsafeWords on a 
> prefix whose length is a multiple of 8.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19511) insert into table does not work on second session of beeline

2017-07-27 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102804#comment-16102804
 ] 

xinzhang commented on SPARK-19511:
--

[~chenerlu]
Hi, it always appears. In which scenario does it not appear?

> insert into table does not work on second session of beeline
> 
>
> Key: SPARK-19511
> URL: https://issues.apache.org/jira/browse/SPARK-19511
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Centos 7.2, java 1.7.0_91
>Reporter: sanjiv marathe
>
> Same issue as SPARK-11083 ...reopen?
> insert into table works in the first beeline session but fails in the 
> second beeline session.
> Every time, I had to restart the Thrift server and reconnect to get it working.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21538) Attribute resolution inconsistency in Dataset API

2017-07-27 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102782#comment-16102782
 ] 

Xiao Li commented on SPARK-21538:
-

https://github.com/apache/spark/pull/18740

> Attribute resolution inconsistency in Dataset API
> -
>
> Key: SPARK-21538
> URL: https://issues.apache.org/jira/browse/SPARK-21538
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Adrian Ionescu
>
> {code}
> spark.range(1).withColumnRenamed("id", "x").sort(col("id"))  // works
> spark.range(1).withColumnRenamed("id", "x").sort($"id")  // works
> spark.range(1).withColumnRenamed("id", "x").sort('id) // works
> spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with:
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among 
> (x);
> ...
> {code}
> It looks like the Dataset API functions taking a {{String}} use the basic 
> resolver that only looks at the columns at that level, whereas all the other 
> means of expressing an attribute are lazily resolved during analysis.
> The reason why the first 3 calls work is explained in the docs for {{object 
> ResolveMissingReferences}}:
> {code}
>   /**
>* In many dialects of SQL it is valid to sort by attributes that are not 
> present in the SELECT
>* clause.  This rule detects such queries and adds the required attributes 
> to the original
>* projection, so that they will be available during sorting. Another 
> projection is added to
>* remove these attributes after sorting.
>*
>* The HAVING clause could also used a grouping columns that is not 
> presented in the SELECT.
>*/
> {code}
> For consistency, it would be good to use the same attribute resolution 
> mechanism everywhere.
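
To see the described mechanism in action, a quick sketch (assuming an active SparkSession named {{spark}}, as in spark-shell):

{code:scala}
import org.apache.spark.sql.functions.col

val ds = spark.range(1).withColumnRenamed("id", "x")

// Lazily-resolved Column: the analyzer re-adds the missing "id" attribute via an
// extra projection (ResolveMissingReferences), so this succeeds; the analyzed
// plan shows the attribute being added and then projected away.
ds.sort(col("id")).explain(true)

// Eagerly-resolved String: only the top-level columns (here just "x") are visible,
// so this fails with: Cannot resolve column name "id" among (x)
// ds.sort("id")
{code}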



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11083) insert overwrite table failed when beeline reconnect

2017-07-27 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102775#comment-16102775
 ] 

xinzhang commented on SPARK-11083:
--

Reappeared in Spark 2.1.0.
Is anyone working on this issue?

> insert overwrite table failed when beeline reconnect
> 
>
> Key: SPARK-11083
> URL: https://issues.apache.org/jira/browse/SPARK-11083
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Spark: master branch
> Hadoop: 2.7.1
> JDK: 1.8.0_60
>Reporter: Weizhong
>Assignee: Davies Liu
>
> 1. Start Thriftserver
> 2. Use beeline to connect to the Thrift server, then execute an "insert overwrite 
> table_name ..." clause -- success
> 3. Exit beeline
> 4. Reconnect to the Thrift server, then execute an "insert overwrite table_name 
> ..." clause -- failed
> {noformat}
> 15/10/13 18:44:35 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.sql.hive.client.Shim_v1_2.loadDynamicPartitions(HiveShim.scala:520)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.loadDynamicPartitions(ClientWrapper.scala:505)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:225)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:58)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:144)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:129)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:739)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:224)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://9.91.8.214:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_returns/.hive-staging_hive_2015-10-13_18-44-17_606_2400736035447406540-2/-ext-1/cr_returned_date=2003-08-27/part-00048
>  to destination 
> 

[jira] [Commented] (SPARK-21543) Should not count executor initialize failed towards task failures

2017-07-27 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102767#comment-16102767
 ] 

zhoukang commented on SPARK-21543:
--

I have created a PR: https://github.com/apache/spark/pull/18743

> Should not count executor initialize failed towards task failures
> -
>
> Key: SPARK-21543
> URL: https://issues.apache.org/jira/browse/SPARK-21543
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> Currently, when executor initialization fails and the executor exits with error 
> code 1, it counts toward task failures. I think executor initialization failures 
> should not count toward task failures.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21544) Test jar of some module should not install or deploy twice

2017-07-27 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102766#comment-16102766
 ] 

zhoukang commented on SPARK-21544:
--

I have created a PR: https://github.com/apache/spark/pull/18745

> Test jar of some module should not install or deploy twice
> --
>
> Key: SPARK-21544
> URL: https://issues.apache.org/jira/browse/SPARK-21544
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> For the modules below:
> common/network-common
> streaming
> sql/core
> sql/catalyst
> the tests.jar will be installed or deployed twice. For example:
> {code:java}
> [DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml 
> to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml
> [INFO] Installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
> [DEBUG] Skipped re-installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar,
>  seems unchanged
> {code}
> The reason is below:
> {code:java}
> [DEBUG]   (f) artifact = 
> org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT
> [DEBUG]   (f) attachedArtifacts = 
> [org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT,
>  org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, 
> org.apache.spark:spark
> -streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, 
> org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT,
>  org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0
> -mdh2.1.0.1-SNAPSHOT]
> {code}
> When executing 'mvn deploy' to Nexus during a release, it will fail since the 
> release Nexus repository cannot be overridden.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org