[jira] [Commented] (SPARK-21554) Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster
[ https://issues.apache.org/jira/browse/SPARK-21554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104472#comment-16104472 ]

Hyukjin Kwon commented on SPARK-21554:
--------------------------------------

Would you mind describing the steps to reproduce this?

> Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster
> ----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-21554
> URL: https://issues.apache.org/jira/browse/SPARK-21554
> Project: Spark
> Issue Type: Bug
> Components: Deploy
> Affects Versions: 2.1.1
> Environment: We are deploying pyspark scripts on EMR 5.7
> Reporter: Subhod Lagade
>
> Traceback (most recent call last):
>   File "Test.py", line 7, in
>     hc = HiveContext(sc)
>   File "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/context.py", line 514, in __init__
>   File "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/session.py", line 179, in getOrCreate
>   File "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
>   File "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/utils.py", line 79, in deco
> pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21554) Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster
Subhod Lagade created SPARK-21554:
---------------------------------

Summary: Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster
Key: SPARK-21554
URL: https://issues.apache.org/jira/browse/SPARK-21554
Project: Spark
Issue Type: Bug
Components: Deploy
Affects Versions: 2.1.1
Environment: We are deploying pyspark scripts on EMR 5.7
Reporter: Subhod Lagade

Traceback (most recent call last):
  File "Test.py", line 7, in
    hc = HiveContext(sc)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/context.py", line 514, in __init__
  File "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/session.py", line 179, in getOrCreate
  File "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/utils.py", line 79, in deco
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"
[jira] [Commented] (SPARK-21553) Added the description of the default value of master parameter in the spark-shell
[ https://issues.apache.org/jira/browse/SPARK-21553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104424#comment-16104424 ]

Donghui Xu commented on SPARK-21553:
------------------------------------

Please review the code at https://github.com/apache/spark/pull/18755

> Added the description of the default value of master parameter in the spark-shell
> ---------------------------------------------------------------------------------
>
> Key: SPARK-21553
> URL: https://issues.apache.org/jira/browse/SPARK-21553
> Project: Spark
> Issue Type: Improvement
> Components: Spark Shell
> Affects Versions: 2.2.0
> Reporter: Donghui Xu
> Priority: Minor
>
> When I type spark-shell --help, I find that the description of the default value for the master parameter is missing. The user does not know what the default value is when the master parameter is omitted, so we need to add the master parameter's default value to the help information.
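The behaviour requested above (surfacing a parameter's default in `--help` output) is a common CLI convention. A small illustrative sketch with Python's `argparse` — the program and option names here are hypothetical stand-ins, not Spark's actual implementation, though spark-shell's documented default master is indeed `local[*]`:

```python
import argparse

# Declare the option with a help string that interpolates its default,
# so users see what happens when --master is omitted.
parser = argparse.ArgumentParser(prog="spark-shell-like")
parser.add_argument(
    "--master",
    default="local[*]",
    help="master URL for the cluster (default: %(default)s)",
)

# With no arguments, the declared default applies.
args = parser.parse_args([])
print(args.master)
```

`%(default)s` is expanded by argparse when the help text is rendered, so `--help` would show the default without it being duplicated by hand.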
[jira] [Commented] (SPARK-21548) Support insert into serial columns of table
[ https://issues.apache.org/jira/browse/SPARK-21548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104420#comment-16104420 ]

Takeshi Yamamuro commented on SPARK-21548:
------------------------------------------

It makes some sense to me because this is SQL-compliant: http://developer.mimer.com/documentation/html_101/Mimer_SQL_Engine_DocSet/SQL_Statements65.html. Could you make a PR? (I think we had better discuss the implementation in GitHub.)

> Support insert into serial columns of table
> -------------------------------------------
>
> Key: SPARK-21548
> URL: https://issues.apache.org/jira/browse/SPARK-21548
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: LvDongrong
>
> When we use the 'insert into ...' statement, we can only insert all of the columns into a table. But in some cases our table has many columns and we are only interested in some of them, so we want to support the statement "insert into table tbl (column1, column2, ...) values (value1, value2, value3, ...)".
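The requested SQL-standard behaviour can be seen in any compliant engine; a minimal sketch using Python's built-in sqlite3 (the table and column names are illustrative, and this demonstrates the semantics, not Spark SQL itself):

```python
import sqlite3

# INSERT with an explicit column list: columns left out of the list
# fall back to their declared DEFAULT (or NULL) -- the behaviour the
# issue asks Spark SQL to support.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (column1 INT, column2 TEXT, column3 TEXT DEFAULT 'n/a')"
)
conn.execute(
    "INSERT INTO tbl (column1, column2) VALUES (1, 'only two columns')"
)
row = conn.execute("SELECT column1, column2, column3 FROM tbl").fetchone()
print(row)  # column3 was filled with its declared default
```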
[jira] [Closed] (SPARK-21509) Add a config to enable adaptive query execution only for the last query execution.
[ https://issues.apache.org/jira/browse/SPARK-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jin xing closed SPARK-21509.
----------------------------
Resolution: Won't Fix

> Add a config to enable adaptive query execution only for the last query execution.
> ----------------------------------------------------------------------------------
>
> Key: SPARK-21509
> URL: https://issues.apache.org/jira/browse/SPARK-21509
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: jin xing
>
> Adaptive query execution is a good way to avoid generating too many small files on HDFS, as mentioned in SPARK-16188.
> When adaptive query execution is enabled, all shuffles will be coordinated. The drawbacks:
> 1. It's hard to balance the number of reducers (which decides the processing speed) against the file size on HDFS.
> 2. It generates some unnecessary shuffles (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L101).
> 3. It generates lots of jobs, which add extra scheduling cost.
> We can add a config and enable adaptive query execution only for the last shuffle.
[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext
[ https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104418#comment-16104418 ]

Subhod Lagade commented on SPARK-15345:
---------------------------------------

Spark Hive is reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on a yarn cluster. We are still facing this issue with Spark 2.1.1. Any update on this? Is it resolved?

> SparkSession's conf doesn't take effect when there's already an existing SparkContext
> -------------------------------------------------------------------------------------
>
> Key: SPARK-15345
> URL: https://issues.apache.org/jira/browse/SPARK-15345
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Reporter: Piotr Milanowski
> Assignee: Reynold Xin
> Priority: Blocker
> Fix For: 2.0.0
>
> I am working with branch-2.0; spark is compiled with hive support (-Phive and -Phive-thriftserver).
> I am trying to access databases using this snippet:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> hc.sql("show databases").collect()
> [Row(result='default')]
> {code}
> This means that spark doesn't find any databases specified in configuration. Using the same configuration (i.e. hive-site.xml and core-site.xml) in spark 1.6 and launching the above snippet, I can print out existing databases.
> When run in DEBUG mode this is what spark (2.0) prints out: > {code} > 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases > 16/05/16 12:17:47 DEBUG SimpleAnalyzer: > === Result of Batch Resolution === > !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, > string])) null else input[0, string].toString, > StructField(result,StringType,false)), result#2) AS #3] Project > [createexternalrow(if (isnull(result#2)) null else result#2.toString, > StructField(result,StringType,false)) AS #3] > +- LocalRelation [result#2] > > +- LocalRelation [result#2] > > 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure > (org.apache.spark.sql.Dataset$$anonfun$53) +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared fields: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public static final long > org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID > 16/05/16 12:17:47 DEBUG ClosureCleaner: private final > org.apache.spark.sql.types.StructType > org.apache.spark.sql.Dataset$$anonfun$53.structType$1 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared methods: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object) > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow) > 16/05/16 12:17:47 DEBUG ClosureCleaner: + inner classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer objects: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + populating accessed fields because > this is the starting closure > 16/05/16 12:17:47 DEBUG ClosureCleaner: + fields accessed by starting > closure: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + there are no enclosing objects! 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ closure > (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure > (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1) > +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared fields: 1 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public static final long > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared methods: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object) > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator) > 16/05/16 12:17:47 DEBUG ClosureCleaner: + inner classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer objects: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + populating accessed fields because > this is the starting closure > 16/05/16 12:17:47 DEBUG ClosureCleaner: + fields accessed by starting > closure: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + there are no
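The behaviour in this issue's title (a second set of options silently ignored because a context already exists) matches a getOrCreate-style singleton pattern. A toy pure-Python sketch of that pattern, offered as an analogy only — this is not Spark's actual implementation:

```python
class Session:
    """Toy stand-in for a getOrCreate-style singleton (analogy only)."""
    _instance = None

    def __init__(self, conf):
        self.conf = conf

    @classmethod
    def get_or_create(cls, conf):
        # If an instance already exists, the newly supplied conf is
        # ignored -- mirroring why SparkSession settings don't take
        # effect when a SparkContext is already running.
        if cls._instance is None:
            cls._instance = cls(conf)
        return cls._instance

first = Session.get_or_create({"catalog": "hive"})
second = Session.get_or_create({"catalog": "in-memory"})
print(second.conf)  # still the first conf: {'catalog': 'hive'}
```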
[jira] [Resolved] (SPARK-21325) The shell of 'spark-submit' about '--jars' and '--files', jars and files can be placed on local and hdfs.
[ https://issues.apache.org/jira/browse/SPARK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-21325.
----------------------------------
Resolution: Invalid

The affected version you set is 2.3.0, this looks fixed in master, and the description is not clear to me either. I am resolving this as Invalid for now.

> The shell of 'spark-submit' about '--jars' and '--files', jars and files can be placed on local and hdfs.
> ----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-21325
> URL: https://issues.apache.org/jira/browse/SPARK-21325
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: guoxiaolongzte
> Priority: Minor
>
> 1. My submit command:
> spark-submit --class cn.gxl.TestSql{color:red} --jars hdfs://nameservice:/gxl/spark-core_2.11-2.3.0-SNAPSHOT.jar,hdfs://nameservice:/gxl/zookeeper-3.4.6.jar --files hdfs://nameservice:/gxl/value1.txt,hdfs://nameservice:/gxl/value2.txt{color} hdfs://nameservice:/gxl/spark_2.0.2_project.jar
> 2. spark-submit help text:
> --jars {color:red}JARS Comma-separated list of local jars{color} to include on the driver and executor classpaths.
> --files {color:red}FILES Comma-separated list of files{color} to be placed in the working directory of each executor. File paths of these files in executors can be accessed via SparkFiles.get(fileName).
> 3. Problem description:
> {color:red}Jars and files can be placed not only on the local filesystem but also on HDFS.
> The description of '--jars' says they can only be local; this is wrong.
> The description of '--files' does not make clear that files can be local or on HDFS; this is ambiguous and unhelpful to developers.{color}
> So this is an optimization worth making.
[jira] [Reopened] (SPARK-21325) The shell of 'spark-submit' about '--jars' and '--files', jars and files can be placed on local and hdfs.
[ https://issues.apache.org/jira/browse/SPARK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reopened SPARK-21325:
----------------------------------

> The shell of 'spark-submit' about '--jars' and '--files', jars and files can be placed on local and hdfs.
>
> Key: SPARK-21325
> URL: https://issues.apache.org/jira/browse/SPARK-21325
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: guoxiaolongzte
> Priority: Minor
[jira] [Commented] (SPARK-21325) The shell of 'spark-submit' about '--jars' and '--files', jars and files can be placed on local and hdfs.
[ https://issues.apache.org/jira/browse/SPARK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104410#comment-16104410 ]

guoxiaolongzte commented on SPARK-21325:
----------------------------------------

This duplicates https://issues.apache.org/jira/browse/SPARK-21012. Please help me close this JIRA, thanks.

> The shell of 'spark-submit' about '--jars' and '--files', jars and files can be placed on local and hdfs.
>
> Key: SPARK-21325
> URL: https://issues.apache.org/jira/browse/SPARK-21325
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: guoxiaolongzte
> Priority: Minor
[jira] [Created] (SPARK-21553) Added the description of the default value of master parameter in the spark-shell
Donghui Xu created SPARK-21553:
-------------------------------

Summary: Added the description of the default value of master parameter in the spark-shell
Key: SPARK-21553
URL: https://issues.apache.org/jira/browse/SPARK-21553
Project: Spark
Issue Type: Improvement
Components: Spark Shell
Affects Versions: 2.2.0
Reporter: Donghui Xu
Priority: Minor

When I type spark-shell --help, I find that the description of the default value for the master parameter is missing. The user does not know what the default value is when the master parameter is omitted, so we need to add the master parameter's default value to the help information.
[jira] [Commented] (SPARK-21325) The shell of 'spark-submit' about '--jars' and '--files', jars and files can be placed on local and hdfs.
[ https://issues.apache.org/jira/browse/SPARK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104402#comment-16104402 ]

Hyukjin Kwon commented on SPARK-21325:
--------------------------------------

[~guoxiaolongzte], would you point out which JIRA this one duplicates?

> The shell of 'spark-submit' about '--jars' and '--files', jars and files can be placed on local and hdfs.
>
> Key: SPARK-21325
> URL: https://issues.apache.org/jira/browse/SPARK-21325
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: guoxiaolongzte
> Priority: Minor
[jira] [Created] (SPARK-21552) Add decimal type support to ArrowWriter.
Takuya Ueshin created SPARK-21552:
---------------------------------

Summary: Add decimal type support to ArrowWriter.
Key: SPARK-21552
URL: https://issues.apache.org/jira/browse/SPARK-21552
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 2.3.0
Reporter: Takuya Ueshin

Decimal type is not yet supported in ArrowWriter.
[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
[ https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104373#comment-16104373 ]

roncenzhao commented on SPARK-17321:
------------------------------------

Hi, I have encountered this problem too. Any progress on this bug? Thanks~

> YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
> --------------------------------------------------------------------------
>
> Key: SPARK-17321
> URL: https://issues.apache.org/jira/browse/SPARK-17321
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.6.2, 2.0.0
> Reporter: yunjiong zhao
>
> We run spark on yarn. After enabling spark dynamic allocation, we noticed that some spark applications failed randomly due to YarnShuffleService.
> From the log I found:
> {quote}
> 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: Error while initializing Netty pipeline
> java.lang.NullPointerException
> at org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77)
> at org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
> at org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
> at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
> at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
> at io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
> at io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
> at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
> at io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
> at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
> at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> {quote}
> This was caused by the first disk in yarn.nodemanager.local-dirs being broken.
> If we enable spark.yarn.shuffle.stopOnFailure (SPARK-16505) we might lose hundreds of nodes, which is unacceptable.
> We have 12 disks in yarn.nodemanager.local-dirs, so why not use the other good disks if the first one is broken?
[jira] [Resolved] (SPARK-21306) OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier
[ https://issues.apache.org/jira/browse/SPARK-21306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yanbo Liang resolved SPARK-21306.
---------------------------------
Resolution: Fixed
Fix Version/s: 2.3.0, 2.2.1, 2.1.2, 2.0.3

> OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier
> ------------------------------------------------------------------------
>
> Key: SPARK-21306
> URL: https://issues.apache.org/jira/browse/SPARK-21306
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.1.1
> Reporter: Cathal Garvey
> Assignee: Yan Facai (颜发才)
> Priority: Critical
> Labels: classification, ml
> Fix For: 2.0.3, 2.1.2, 2.2.1, 2.3.0
>
> Hi folks, thanks for Spark! :)
> I've been learning to use `ml` and `mllib`, and I've encountered a block while trying to use `ml.classification.OneVsRest` with `ml.classification.LogisticRegression`. Basically, [here in the code|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala#L320], only two columns are extracted and fed to the underlying classifiers; however, with some configurations more than two columns are required.
> Specifically: I want to do multiclass learning with Logistic Regression on a very imbalanced dataset. My dataset has many imbalances, so I was planning to use weights. I set a column, `"weight"`, as the inverse frequency of each field, and I configured my `LogisticRegression` class to use this column, then put it in a `OneVsRest` wrapper.
> However, `OneVsRest` strips all but two columns out of a dataset before training, so I get an error from within `LogisticRegression` that it can't find the `"weight"` column.
> It would be nice to have this fixed! I can see a few ways, but a very conservative fix would be to include a parameter in `OneVsRest.fit` for additional columns to `select` before passing to the underlying model.
> Thanks!
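The reporter's weighting scheme above (a `"weight"` column set to the inverse frequency of each class) can be computed up front. A minimal pure-Python sketch of that preprocessing step, independent of Spark ML's API — the function name and the normalization by dataset size are illustrative choices:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Map each label to a weight proportional to 1 / its frequency,
    so rare classes get larger weights (as in the reporter's setup)."""
    counts = Counter(labels)
    n = len(labels)
    return {label: n / count for label, count in counts.items()}

# An imbalanced toy dataset: "b" is three times rarer than "a",
# so it receives three times the weight.
labels = ["a", "a", "a", "b"]
weights = inverse_frequency_weights(labels)
# each row would then carry weights[label] in its "weight" column
```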
[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104284#comment-16104284 ]

Kazuaki Ishizaki commented on SPARK-18016:
------------------------------------------

[~jamcon] Thank you for reporting the problem. We fixed the issue for a large number of columns (e.g. 4000); however, we know we have not yet solved it for a very large number of columns (e.g. 12000). I have just pinged the author of the fix about solving both problems.

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -----------------------------------------------------------------
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Aleksander Eskilson
> Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
> When attempting to encode collections of large Java objects to Datasets having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection has grown past JVM limit of 0x
> at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
> at org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
> at org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
> at org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
> at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
> at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
> at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
> at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
> at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
> at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
> at
org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > 
org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) > at >
[jira] [Closed] (SPARK-21325) The shell of 'spark-submit' about '--jars' and '--files', jars and files can be placed on local and hdfs.
[ https://issues.apache.org/jira/browse/SPARK-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

guoxiaolongzte closed SPARK-21325.
----------------------------------
Resolution: Duplicate

> The shell of 'spark-submit' about '--jars' and '--files', jars and files can be placed on local and hdfs.
>
> Key: SPARK-21325
> URL: https://issues.apache.org/jira/browse/SPARK-21325
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: guoxiaolongzte
> Priority: Minor
[jira] [Comment Edited] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103202#comment-16103202 ] xinzhang edited comment on SPARK-21067 at 7/28/17 1:06 AM: --- same here. problem reappeared in Spark 2.1.0 thriftserver : Open Beeline Session 1 Create Table 1 (Success) Open Beeline Session 2 Create Table 2 (Success) Close Beeline Session 1 Create Table 3 in Beeline Session 2 (FAIL) use parquet, the issue is not present . [~cloud_fan] was (Author: zhangxin0112zx): same here. problem reappeared in Spark 2.1.0 thriftserver : Open Beeline Session 1 Create Table 1 (Success) Open Beeline Session 2 Create Table 2 (Success) Close Beeline Session 1 Create Table 3 in Beeline Session 2 (FAIL) use parquet, the issue is not present . @Wenchen Fan > Thrift Server - CTAS fail with Unable to move source > > > Key: SPARK-21067 > URL: https://issues.apache.org/jira/browse/SPARK-21067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 > Environment: Yarn > Hive MetaStore > HDFS (HA) >Reporter: Dominic Ricard > > After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS > would fail, sometimes... > Most of the time, the CTAS would work only once, after starting the thrift > server. 
After that, dropping the table and re-issuing the same CTAS would > fail with the following message (sometimes it fails right away, sometimes it > works for a long period of time): > {noformat} > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > We have already found the following Jira > (https://issues.apache.org/jira/browse/SPARK-11021) which states that the > {{hive.exec.stagingdir}} had to be added in order for Spark to be able to > handle CREATE TABLE properly as of 2.0. As you can see in the error, we have > ours set to "/tmp/hive-staging/\{user.name\}" > Same issue with INSERT statements: > {noformat} > CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE > dricard.test SELECT 1; > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > This worked fine in 1.6.2, which we currently run in our Production > Environment, but since 2.0+ we haven't been able to CREATE TABLE consistently > on the cluster. > SQL to reproduce issue: > {noformat} > DROP SCHEMA IF EXISTS dricard CASCADE; > CREATE SCHEMA dricard; > CREATE TABLE dricard.test (col1 int); > INSERT INTO TABLE dricard.test SELECT 1; > SELECT * from dricard.test; > DROP TABLE dricard.test; > CREATE TABLE dricard.test AS select 1 as `col1`; > SELECT * from dricard.test > {noformat} > Thrift server usually fails at INSERT... 
> Tried the same procedure in a spark context using spark.sql() and didn't > encounter the same issue. > Full stack Trace: > {noformat} > 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error > executing query, currentState RUNNING, > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0 > to desti > nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at >
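As the SPARK-11021 reference in the report suggests, the failed move is tied to where {{hive.exec.stagingdir}} points. A workaround sometimes reported for this class of error is to leave the staging directory relative (Hive's default, {{.hive-staging}}), so staging files are created under the destination directory and the final move never crosses directories. This is an illustrative configuration sketch only, not a confirmed fix for this ticket:

```
# Illustrative: pass the Hive staging dir through to Spark's Hive client.
# A relative value keeps staging files under each table's target directory,
# so the final rename stays within one location.
spark-submit ... --conf spark.hadoop.hive.exec.stagingdir=.hive-staging
```

Whether this helps depends on the filesystem layout; the reporter's absolute "/tmp/hive-staging/..." setting is what produces the cross-directory move shown in the trace.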
[jira] [Resolved] (SPARK-21538) Attribute resolution inconsistency in Dataset API
[ https://issues.apache.org/jira/browse/SPARK-21538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21538. - Resolution: Fixed Assignee: Anton Okolnychyi Fix Version/s: 2.3.0 2.2.1 > Attribute resolution inconsistency in Dataset API > - > > Key: SPARK-21538 > URL: https://issues.apache.org/jira/browse/SPARK-21538 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Adrian Ionescu >Assignee: Anton Okolnychyi > Fix For: 2.2.1, 2.3.0 > > > {code} > spark.range(1).withColumnRenamed("id", "x").sort(col("id")) // works > spark.range(1).withColumnRenamed("id", "x").sort($"id") // works > spark.range(1).withColumnRenamed("id", "x").sort('id) // works > spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with: > org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among > (x); > ... > {code} > It looks like the Dataset API functions taking {{String}} use the basic > resolver that only look at the columns at that level, whereas all the other > means of expressing an attribute are lazily resolved during the analyzer. > The reason why the first 3 calls work is explained in the docs for {{object > ResolveMissingReferences}}: > {code} > /** >* In many dialects of SQL it is valid to sort by attributes that are not > present in the SELECT >* clause. This rule detects such queries and adds the required attributes > to the original >* projection, so that they will be available during sorting. Another > projection is added to >* remove these attributes after sorting. >* >* The HAVING clause could also used a grouping columns that is not > presented in the SELECT. >*/ > {code} > For consistency, it would be good to use the same attribute resolution > mechanism everywhere. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
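The inconsistency described above can be modeled outside Spark: string arguments go through the eager, basic resolver that only sees the current output columns, while {{Column}} expressions are resolved later, when the analyzer can also recover attributes hidden by the rename. A toy sketch in plain Python (all names here are hypothetical; this is not Spark's resolver):

```python
# Toy model of eager (string) vs. deferred (Column expression) resolution.
# Purely illustrative: names and logic are assumptions, not Spark internals.

class UnresolvedColumn:
    """Like col("id"): resolution is deferred until analysis time."""
    def __init__(self, name):
        self.name = name

def sort_key(columns, upstream, key):
    """Resolve `key` against a projection.

    columns:  output columns of the current plan node.
    upstream: attributes that existed before a rename/projection, which a
              ResolveMissingReferences-style rule can still re-add.
    """
    if isinstance(key, str):
        # Eager, basic resolver: only the current output columns are visible.
        if key not in columns:
            raise ValueError(
                f'Cannot resolve column name "{key}" among ({", ".join(columns)})')
        return key
    # Deferred resolution: missing attributes can be recovered from upstream.
    if key.name in columns or key.name in upstream:
        return key.name
    raise ValueError(f'Cannot resolve column name "{key.name}"')

# After withColumnRenamed("id", "x"): output is ["x"], but "id" exists upstream.
print(sort_key(["x"], ["id"], UnresolvedColumn("id")))  # resolves, like sort(col("id"))
try:
    sort_key(["x"], ["id"], "id")  # fails, like sort("id")
except ValueError as e:
    print(e)
```

The consistency fix proposed in the ticket amounts to routing the string overloads through the same deferred mechanism.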
[jira] [Comment Edited] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103853#comment-16103853 ] James Conner edited comment on SPARK-18016 at 7/27/17 8:31 PM: --- The issue does not appear to be completely fixed. I downloaded the master today (last commit 2ff35a057efd36bd5c8a545a1ec3bc341432a904, Spark 2.3.0-SNAPSHOT), and attempted to perform a Gradient Boosted Tree (GBT) Regression on a dataframe which contains 2658 columns. I'm still getting the same constant pool exceeding JVM 0x error. The steps that I'm using to generate the error are: {code:java} // Imports import org.apache.spark.sql.SparkSession import org.apache.spark.sql.functions._ import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor} import org.apache.spark.ml.feature.{VectorAssembler, Imputer, ImputerModel} import org.apache.spark.ml.linalg.Vectors import org.apache.spark.ml.Pipeline import org.apache.spark.ml.evaluation.RegressionEvaluator import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator} import org.apache.spark.sql.{SQLContext, Row, DataFrame, Column} // Load data val mainDF = spark.read.parquet("/path/to/data/input/main_pqt").repartition(1024) // Impute data val inColsMain = mainDF.columns.filter(_ != "ID").filter(_ != "SCORE") val outColsMain = inColsMain.map(i=>(i+"_imputed")) val mainImputer = new Imputer().setInputCols(inColsMain).setOutputCols(outColsMain).setStrategy("mean") val mainImputerBuild = mainImputer.fit(mainDF) val imputedMainDF = mainImputerBuild.transform(mainDF) // Drop original feature columns, retain imputed val fullData = imputedMainDF.select(imputedMainDF.columns.filter(colName => !inColsMain.contains(colName)).map(colName => new Column(colName)): _*) // Split data for testing & training val Array(trainDF, testDF) = fullData.randomSplit(Array(0.80, 0.20),seed = 12345) // Vector Assembler val arrayName = fullData.columns.filter(_ != "ID").filter(_ != "SCORE") val assembler = new 
VectorAssembler().setInputCols(arrayName).setOutputCol("features") // GBT Object val gbt = new GBTRegressor().setLabelCol("SCORE").setFeaturesCol("features").setMaxIter(5).setSeed(1993).setLossType("squared").setSubsamplingRate(1) // Pipeline Object val pipeline = new Pipeline().setStages(Array(assembler, gbt)) // Hyper Parameter Grid Object val paramGrid = new ParamGridBuilder().addGrid(gbt.maxBins, Array(2, 4, 8)).addGrid(gbt.maxDepth, Array(1, 2, 4)).addGrid(gbt.stepSize, Array(0.1, 0.2)).build() // Evaluator Object val evaluator = new RegressionEvaluator().setLabelCol("SCORE").setPredictionCol("prediction").setMetricName("rmse") // Cross Validation Object val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(5) // Build the model val gbtModel = cv.fit(trainDF) {code} Upon building the model, it errors out with the following cause: {code:java} java.lang.RuntimeException: Error while encoding: org.codehaus.janino.JaninoRuntimeException: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass2 has grown past JVM limit of 0x {code} was (Author: jamcon): The issue does not appear to be completely fixed. I downloaded the master today (last commit 2ff35a057efd36bd5c8a545a1ec3bc341432a904, Spark 2.3.0-SNAPSHOT), and attempted to perform a Gradient Boosted Tree (GBT) Regression on a dataframe which contains 2658 columns. I'm still getting the same constant pool exceeding JVM 0x error. 
The steps that I'm using to generate the error are: {code:scala} // Imports import org.apache.spark.sql.SparkSession import org.apache.spark.sql.functions._ import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor} import org.apache.spark.ml.feature.{VectorAssembler, Imputer, ImputerModel} import org.apache.spark.ml.linalg.Vectors import org.apache.spark.ml.Pipeline import org.apache.spark.ml.evaluation.RegressionEvaluator import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator} import org.apache.spark.sql.{SQLContext, Row, DataFrame, Column} // Load data val mainDF = spark.read.parquet("/path/to/data/input/main_pqt").repartition(1024) // Impute data val inColsMain = mainDF.columns.filter(_ != "ID").filter(_ != "SCORE") val outColsMain = inColsMain.map(i=>(i+"_imputed")) val mainImputer = new Imputer().setInputCols(inColsMain).setOutputCols(outColsMain).setStrategy("mean") val mainImputerBuild = mainImputer.fit(mainDF) val imputedMainDF = mainImputerBuild.transform(mainDF) // Drop original feature columns, retain imputed val fullData = imputedMainDF.select(imputedMainDF.columns.filter(colName => !inColsMain.contains(colName)).map(colName => new Column(colName)): _*) // Split data for testing & training val Array(trainDF, testDF)
[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103853#comment-16103853 ] James Conner commented on SPARK-18016: -- The issue does not appear to be completely fixed. I downloaded the master today (last commit 2ff35a057efd36bd5c8a545a1ec3bc341432a904, Spark 2.3.0-SNAPSHOT), and attempted to perform a Gradient Boosted Tree (GBT) Regression on a dataframe which contains 2658 columns. I'm still getting the same constant pool exceeding JVM 0x error. The steps that I'm using to generate the error are: {code:scala} // Imports import org.apache.spark.sql.SparkSession import org.apache.spark.sql.functions._ import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor} import org.apache.spark.ml.feature.{VectorAssembler, Imputer, ImputerModel} import org.apache.spark.ml.linalg.Vectors import org.apache.spark.ml.Pipeline import org.apache.spark.ml.evaluation.RegressionEvaluator import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator} import org.apache.spark.sql.{SQLContext, Row, DataFrame, Column} // Load data val mainDF = spark.read.parquet("/path/to/data/input/main_pqt").repartition(1024) // Impute data val inColsMain = mainDF.columns.filter(_ != "ID").filter(_ != "SCORE") val outColsMain = inColsMain.map(i=>(i+"_imputed")) val mainImputer = new Imputer().setInputCols(inColsMain).setOutputCols(outColsMain).setStrategy("mean") val mainImputerBuild = mainImputer.fit(mainDF) val imputedMainDF = mainImputerBuild.transform(mainDF) // Drop original feature columns, retain imputed val fullData = imputedMainDF.select(imputedMainDF.columns.filter(colName => !inColsMain.contains(colName)).map(colName => new Column(colName)): _*) // Split data for testing & training val Array(trainDF, testDF) = fullData.randomSplit(Array(0.80, 0.20),seed = 12345) // Vector Assembler val arrayName = fullData.columns.filter(_ != "ID").filter(_ != "SCORE") val assembler = new 
VectorAssembler().setInputCols(arrayName).setOutputCol("features") // GBT Object val gbt = new GBTRegressor().setLabelCol("SCORE").setFeaturesCol("features").setMaxIter(5).setSeed(1993).setLossType("squared").setSubsamplingRate(1) // Pipeline Object val pipeline = new Pipeline().setStages(Array(assembler, gbt)) // Hyper Parameter Grid Object val paramGrid = new ParamGridBuilder().addGrid(gbt.maxBins, Array(2, 4, 8)).addGrid(gbt.maxDepth, Array(1, 2, 4)).addGrid(gbt.stepSize, Array(0.1, 0.2)).build() // Evaluator Object val evaluator = new RegressionEvaluator().setLabelCol("SCORE").setPredictionCol("prediction").setMetricName("rmse") // Cross Validation Object val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(5) // Build the model val gbtModel = cv.fit(trainDF) {code} Upon building the model, it errors out with the following cause: {code:java} java.lang.RuntimeException: Error while encoding: org.codehaus.janino.JaninoRuntimeException: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass2 has grown past JVM limit of 0x {code} > Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Aleksander Eskilson > Fix For: 2.3.0 > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > 
at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at
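The fix referenced in this ticket splits generated code across nested classes so that no single class file exhausts the constant pool (the JVM caps it at 0xFFFF entries per class). A rough sketch of the chunking arithmetic in plain Python; the per-field cost and helper name are illustrative assumptions, not Spark's actual codegen:

```python
# Illustrative only: split N generated "field initializers" into chunks so each
# hypothetical generated class stays under a per-class constant-pool budget.
POOL_LIMIT = 0xFFFF          # JVM constant-pool cap per class file (65535 entries)
CONSTANTS_PER_FIELD = 30     # rough, assumed pool cost of one generated field

def split_into_classes(num_fields, limit=POOL_LIMIT, per_field=CONSTANTS_PER_FIELD):
    """Return the number of fields assigned to each nested class."""
    fields_per_class = max(1, limit // per_field)
    return [min(fields_per_class, num_fields - i)
            for i in range(0, num_fields, fields_per_class)]

# 2658 columns, as in the report: fits in a couple of nested classes
# under this (assumed) budget instead of one oversized class.
print(split_into_classes(2658))
```

The report suggests that some code paths (here, `SpecificUnsafeProjection$NestedClass2`) still accumulate enough entries in a single nested class to hit the cap, i.e. the splitting is not yet applied, or not applied finely enough, everywhere.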
[jira] [Commented] (SPARK-21550) approxQuantiles throws "next on empty iterator" on empty data
[ https://issues.apache.org/jira/browse/SPARK-21550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103629#comment-16103629 ] Sean Owen commented on SPARK-21550: --- I think this was fixed in https://issues.apache.org/jira/browse/SPARK-19573 > approxQuantiles throws "next on empty iterator" on empty data > - > > Key: SPARK-21550 > URL: https://issues.apache.org/jira/browse/SPARK-21550 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: peay > > The documentation says: > {code} > null and NaN values will be removed from the numerical column before > calculation. If > the dataframe is empty or the column only contains null or NaN, an empty > array is returned. > {code} > However, this small pyspark example > {code} > sql_context.range(10).filter(col("id") == 42).approxQuantile("id", [0.99], > 0.001) > {code} > throws > {code} > Py4JJavaError: An error occurred while calling o3493.approxQuantile. > : java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at > scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63) > at scala.collection.IterableLike$class.head(IterableLike.scala:107) > at > scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186) > at > scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126) > at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186) > at > scala.collection.TraversableLike$class.last(TraversableLike.scala:431) > at > scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$last(ArrayOps.scala:186) > at > scala.collection.IndexedSeqOptimized$class.last(IndexedSeqOptimized.scala:132) > at scala.collection.mutable.ArrayOps$ofRef.last(ArrayOps.scala:186) > at > 
org.apache.spark.sql.catalyst.util.QuantileSummaries.query(QuantileSummaries.scala:207) > at > org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply$mcDD$sp(StatFunctions.scala:92) > at > org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92) > at > org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
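The stack trace shows an unguarded {{head}}/{{last}} on an empty sampled buffer inside {{QuantileSummaries.query}}. A minimal plain-Python sketch of the documented contract (empty or all-null input yields an empty array rather than an exception); this is a toy lookup, not Spark's Greenwald-Khanna implementation:

```python
def approx_quantiles(values, probabilities):
    """Toy quantile lookup with the guard the report asks for:
    an empty (or all-null) input yields [], never an exception."""
    sampled = sorted(v for v in values if v is not None)
    if not sampled:            # the missing guard: empty summaries
        return []
    return [sampled[min(len(sampled) - 1, int(p * len(sampled)))]
            for p in probabilities]

print(approx_quantiles([], [0.99]))        # [] instead of "next on empty iterator"
print(approx_quantiles([1, 2, 3, 4], [0.5]))
```

In Spark the equivalent guard belongs in `QuantileSummaries.query` (or its caller in `StatFunctions`), before `sampled` is dereferenced.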
[jira] [Closed] (SPARK-21550) approxQuantiles throws "next on empty iterator" on empty data
[ https://issues.apache.org/jira/browse/SPARK-21550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peay closed SPARK-21550. Resolution: Duplicate Fix Version/s: 2.2.0 > approxQuantiles throws "next on empty iterator" on empty data > - > > Key: SPARK-21550 > URL: https://issues.apache.org/jira/browse/SPARK-21550 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: peay > Fix For: 2.2.0 > > > The documentation says: > {code} > null and NaN values will be removed from the numerical column before > calculation. If > the dataframe is empty or the column only contains null or NaN, an empty > array is returned. > {code} > However, this small pyspark example > {code} > sql_context.range(10).filter(col("id") == 42).approxQuantile("id", [0.99], > 0.001) > {code} > throws > {code} > Py4JJavaError: An error occurred while calling o3493.approxQuantile. > : java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at > scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63) > at scala.collection.IterableLike$class.head(IterableLike.scala:107) > at > scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186) > at > scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126) > at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186) > at > scala.collection.TraversableLike$class.last(TraversableLike.scala:431) > at > scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$last(ArrayOps.scala:186) > at > scala.collection.IndexedSeqOptimized$class.last(IndexedSeqOptimized.scala:132) > at scala.collection.mutable.ArrayOps$ofRef.last(ArrayOps.scala:186) > at > org.apache.spark.sql.catalyst.util.QuantileSummaries.query(QuantileSummaries.scala:207) > at > 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply$mcDD$sp(StatFunctions.scala:92) > at > org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92) > at > org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103547#comment-16103547 ] yuhao yang commented on SPARK-21535: Not in my opinion. https://issues.apache.org/jira/browse/SPARK-21086 is trying to store all the trained models in the TrainValidationSplitModel or CrossValidatorModel according to the discussion, with a control parameter which is turned off by default. In any case, changing the training process hardly has an impact on that. > Reduce memory requirement for CrossValidator and TrainValidationSplit > -- > > Key: SPARK-21535 > URL: https://issues.apache.org/jira/browse/SPARK-21535 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > CrossValidator and TrainValidationSplit both use > {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where > epm is Array[ParamMap]. > Even though the training process is sequential, the current implementation > consumes extra driver memory to hold all the trained models, which is not > necessary and often leads to memory exceptions for both CrossValidator and > TrainValidationSplit. My proposal is to optimize the training implementation > so that each model, once used, can be collected by GC, avoiding the > unnecessary OOM exceptions. > E.g., when the grid search space is 12, the old implementation needs to hold all 12 > trained models in driver memory at the same time, while the new > implementation only needs to hold 1 trained model at a time, and each previous > model can be cleared by GC. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
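The proposal amounts to never materializing the whole {{models}} array: fit, evaluate, and drop each model before the next one is trained. A schematic plain-Python version of that loop; {{fit_one}} and {{evaluate}} are stand-in functions, not the Spark ML API:

```python
def fit_one(param_map):
    # Stand-in for est.fit(trainingDataset, paramMap): returns a "model"
    # whose payload (weights) dominates driver memory.
    return {"params": param_map, "weights": [0.0] * 1000}

def evaluate(model):
    # Stand-in for evaluator.evaluate(model.transform(validationDataset)).
    return sum(model["params"].values())

def cross_validate(param_grid):
    metrics = []
    for pm in param_grid:          # sequential: only one model alive at a time
        model = fit_one(pm)
        metrics.append(evaluate(model))
        del model                  # the previous model becomes collectible
    return metrics

grid = [{"maxDepth": d, "stepSize": s} for d in (1, 2) for s in (0.1, 0.2)]
print(cross_validate(grid))        # metrics only; never all models held at once
```

Only the per-candidate metrics survive the loop, so peak driver memory is one model regardless of grid size, which is the improvement the ticket describes.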
[jira] [Commented] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow
[ https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103543#comment-16103543 ] Reynold Xin commented on SPARK-21551: - Sure. > pyspark's collect fails when getaddrinfo is too slow > > > Key: SPARK-21551 > URL: https://issues.apache.org/jira/browse/SPARK-21551 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: peay >Priority: Critical > > Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and > {{DataFrame.collect}} all work by starting an ephemeral server in the driver, > and having Python connect to it to download the data. > All three are implemented along the lines of: > {code} > port = self._jdf.collectToPython() > return list(_load_from_socket(port, BatchedSerializer(PickleSerializer( > {code} > The server has **a hardcoded timeout of 3 seconds** > (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695) > -- i.e., the Python process has 3 seconds to connect to it from the very > moment the driver server starts. > In general, that seems fine, but I have been encountering frequent timeouts > leading to `Exception: could not open socket`. > After investigating a bit, it turns out that {{_load_from_socket}} makes a > call to {{getaddrinfo}}: > {code} > def _load_from_socket(port, serializer): > sock = None > # Support for both IPv4 and IPv6. > # On most of IPv6-ready systems, IPv6 will take precedence. > for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, > socket.SOCK_STREAM): >.. connect .. > {code} > I am not sure why, but while most such calls to {{getaddrinfo}} on my machine > only take a couple milliseconds, about 10% of them take between 2 and 10 > seconds, leading to about 10% of jobs failing. I don't think we can always > expect {{getaddrinfo}} to return instantly. 
More generally, Python may > sometimes pause for a couple seconds, which may not leave enough time for the > process to connect to the server. > Especially since the server timeout is hardcoded, I think it would be best to > set a rather generous value (15 seconds?) to avoid such situations. > A {{getaddrinfo}} specific fix could avoid doing it every single time, or do > it before starting up the driver server. > > cc SPARK-677 [~davies] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow
[ https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103539#comment-16103539 ] peay commented on SPARK-21551: -- Sure, does 15 seconds sound good? > pyspark's collect fails when getaddrinfo is too slow > > > Key: SPARK-21551 > URL: https://issues.apache.org/jira/browse/SPARK-21551 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: peay >Priority: Critical > > Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and > {{DataFrame.collect}} all work by starting an ephemeral server in the driver, > and having Python connect to it to download the data. > All three are implemented along the lines of: > {code} > port = self._jdf.collectToPython() > return list(_load_from_socket(port, BatchedSerializer(PickleSerializer( > {code} > The server has **a hardcoded timeout of 3 seconds** > (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695) > -- i.e., the Python process has 3 seconds to connect to it from the very > moment the driver server starts. > In general, that seems fine, but I have been encountering frequent timeouts > leading to `Exception: could not open socket`. > After investigating a bit, it turns out that {{_load_from_socket}} makes a > call to {{getaddrinfo}}: > {code} > def _load_from_socket(port, serializer): > sock = None > # Support for both IPv4 and IPv6. > # On most of IPv6-ready systems, IPv6 will take precedence. > for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, > socket.SOCK_STREAM): >.. connect .. > {code} > I am not sure why, but while most such calls to {{getaddrinfo}} on my machine > only take a couple milliseconds, about 10% of them take between 2 and 10 > seconds, leading to about 10% of jobs failing. I don't think we can always > expect {{getaddrinfo}} to return instantly. 
More generally, Python may > sometimes pause for a couple seconds, which may not leave enough time for the > process to connect to the server. > Especially since the server timeout is hardcoded, I think it would be best to > set a rather generous value (15 seconds?) to avoid such situations. > A {{getaddrinfo}} specific fix could avoid doing it every single time, or do > it before starting up the driver server. > > cc SPARK-677 [~davies] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow
[ https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103536#comment-16103536 ] Reynold Xin commented on SPARK-21551: - Do you want to submit a pull request? > pyspark's collect fails when getaddrinfo is too slow > > > Key: SPARK-21551 > URL: https://issues.apache.org/jira/browse/SPARK-21551 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: peay >Priority: Critical > > Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and > {{DataFrame.collect}} all work by starting an ephemeral server in the driver, > and having Python connect to it to download the data. > All three are implemented along the lines of: > {code} > port = self._jdf.collectToPython() > return list(_load_from_socket(port, BatchedSerializer(PickleSerializer( > {code} > The server has **a hardcoded timeout of 3 seconds** > (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695) > -- i.e., the Python process has 3 seconds to connect to it from the very > moment the driver server starts. > In general, that seems fine, but I have been encountering frequent timeouts > leading to `Exception: could not open socket`. > After investigating a bit, it turns out that {{_load_from_socket}} makes a > call to {{getaddrinfo}}: > {code} > def _load_from_socket(port, serializer): > sock = None > # Support for both IPv4 and IPv6. > # On most of IPv6-ready systems, IPv6 will take precedence. > for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, > socket.SOCK_STREAM): >.. connect .. > {code} > I am not sure why, but while most such calls to {{getaddrinfo}} on my machine > only take a couple milliseconds, about 10% of them take between 2 and 10 > seconds, leading to about 10% of jobs failing. I don't think we can always > expect {{getaddrinfo}} to return instantly. 
More generally, Python may > sometimes pause for a couple seconds, which may not leave enough time for the > process to connect to the server. > Especially since the server timeout is hardcoded, I think it would be best to > set a rather generous value (15 seconds?) to avoid such situations. > A {{getaddrinfo}} specific fix could avoid doing it every single time, or do > it before starting up the driver server. > > cc SPARK-677 [~davies] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow
peay created SPARK-21551: Summary: pyspark's collect fails when getaddrinfo is too slow Key: SPARK-21551 URL: https://issues.apache.org/jira/browse/SPARK-21551 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.1.0 Reporter: peay Priority: Critical Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and {{DataFrame.collect}} all work by starting an ephemeral server in the driver, and having Python connect to it to download the data. All three are implemented along the lines of: {code} port = self._jdf.collectToPython() return list(_load_from_socket(port, BatchedSerializer(PickleSerializer( {code} The server has **a hardcoded timeout of 3 seconds** (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695) -- i.e., the Python process has 3 seconds to connect to it from the very moment the driver server starts. In general, that seems fine, but I have been encountering frequent timeouts leading to `Exception: could not open socket`. After investigating a bit, it turns out that {{_load_from_socket}} makes a call to {{getaddrinfo}}: {code} def _load_from_socket(port, serializer): sock = None # Support for both IPv4 and IPv6. # On most of IPv6-ready systems, IPv6 will take precedence. for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM): .. connect .. {code} I am not sure why, but while most such calls to {{getaddrinfo}} on my machine only take a couple milliseconds, about 10% of them take between 2 and 10 seconds, leading to about 10% of jobs failing. I don't think we can always expect {{getaddrinfo}} to return instantly. More generally, Python may sometimes pause for a couple seconds, which may not leave enough time for the process to connect to the server. Especially since the server timeout is hardcoded, I think it would be best to set a rather generous value (15 seconds?) to avoid such situations. 
A {{getaddrinfo}}-specific fix could avoid calling it every single time, or perform the lookup before starting up the driver server. cc SPARK-677 [~davies] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
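The suggested fix above can be sketched in plain Python. This is a hypothetical illustration, not Spark's actual code: the address is resolved once and memoized, so a slow {{getaddrinfo}} call cannot eat into the server's connect timeout, and the connect itself retries against a generous deadline.

```python
import socket
import time

# Illustrative cache so getaddrinfo runs at most once per (host, port) pair.
_ADDRINFO_CACHE = {}

def resolve_localhost(port):
    """Resolve (and memoize) address info for localhost on the given port."""
    key = ("localhost", port)
    if key not in _ADDRINFO_CACHE:
        _ADDRINFO_CACHE[key] = socket.getaddrinfo(
            "localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
    return _ADDRINFO_CACHE[key]

def connect_with_retry(port, deadline_secs=15.0):
    """Try each resolved address until one connects or the deadline passes."""
    end = time.monotonic() + deadline_secs
    last_error = None
    while time.monotonic() < end:
        for family, socktype, proto, _, sockaddr in resolve_localhost(port):
            sock = socket.socket(family, socktype, proto)
            try:
                sock.settimeout(max(0.1, end - time.monotonic()))
                sock.connect(sockaddr)
                return sock
            except OSError as e:
                last_error = e
                sock.close()
    raise Exception("could not open socket") from last_error
```

With this shape, the occasional multi-second {{getaddrinfo}} stall is paid at most once, before the ephemeral server's own timeout starts ticking.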
[jira] [Created] (SPARK-21550) approxQuantiles throws "next on empty iterator" on empty data
peay created SPARK-21550: Summary: approxQuantiles throws "next on empty iterator" on empty data Key: SPARK-21550 URL: https://issues.apache.org/jira/browse/SPARK-21550 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: peay The documentation says: {code} null and NaN values will be removed from the numerical column before calculation. If the dataframe is empty or the column only contains null or NaN, an empty array is returned. {code} However, this small pyspark example {code} sql_context.range(10).filter(col("id") == 42).approxQuantile("id", [0.99], 0.001) {code} throws {code} Py4JJavaError: An error occurred while calling o3493.approxQuantile. : java.util.NoSuchElementException: next on empty iterator at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63) at scala.collection.IterableLike$class.head(IterableLike.scala:107) at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186) at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126) at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186) at scala.collection.TraversableLike$class.last(TraversableLike.scala:431) at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$last(ArrayOps.scala:186) at scala.collection.IndexedSeqOptimized$class.last(IndexedSeqOptimized.scala:132) at scala.collection.mutable.ArrayOps$ofRef.last(ArrayOps.scala:186) at org.apache.spark.sql.catalyst.util.QuantileSummaries.query(QuantileSummaries.scala:207) at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply$mcDD$sp(StatFunctions.scala:92) at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92) at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(StatFunctions.scala:92) {code}
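The documented contract can be illustrated with a pure-Python sketch. This mirrors the docstring quoted above (nulls/NaNs dropped first, empty input yields an empty array), not Spark's actual Greenwald-Khanna QuantileSummaries implementation; the nearest-rank pick below is a deliberate simplification.

```python
import math

def approx_quantile(values, probabilities):
    """Sketch of the documented approxQuantile behaviour on plain lists."""
    # Drop None and NaN before computing, as the documentation specifies.
    cleaned = sorted(v for v in values
                     if v is not None and not math.isnan(v))
    if not cleaned:
        # The documented behaviour for an empty or all-null/NaN column:
        # return an empty array rather than raising "next on empty iterator".
        return []
    result = []
    for p in probabilities:
        # Nearest-rank pick; the real algorithm is an approximate summary.
        idx = min(int(p * len(cleaned)), len(cleaned) - 1)
        result.append(cleaned[idx])
    return result
```

The bug report is precisely that Spark's implementation raises instead of taking the `return []` branch above.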
[jira] [Updated] (SPARK-21549) Spark fails to abort job correctly in case of custom OutputFormat implementations
[ https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Zhemzhitsky updated SPARK-21549: --- Description: Spark fails to abort the job correctly in the case of custom OutputFormat implementations. There are OutputFormat implementations which do not need to use the standard Hadoop property *mapreduce.output.fileoutputformat.outputdir*. [But Spark reads this property from the configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79] while setting up an OutputCommitter: {code:scala} val committer = FileCommitProtocol.instantiate( className = classOf[HadoopMapReduceCommitProtocol].getName, jobId = stageId.toString, outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"), isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol] committer.setupJob(jobContext) {code} In that case, if the job fails, Spark executes [committer.abortJob|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L106] {code:scala} committer.abortJob(jobContext) {code} ... 
and fails with the following exception {code} Can not create a Path from a null string java.lang.IllegalArgumentException: Can not create a Path from a null string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123) at org.apache.hadoop.fs.Path.(Path.java:135) at org.apache.hadoop.fs.Path.(Path.java:89) at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58) at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141) at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084) {code}
[jira] [Created] (SPARK-21549) Spark fails to abort job correctly in case of custom OutputFormat implementations
Sergey Zhemzhitsky created SPARK-21549: -- Summary: Spark fails to abort job correctly in case of custom OutputFormat implementations Key: SPARK-21549 URL: https://issues.apache.org/jira/browse/SPARK-21549 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Environment: spark 2.2.0 scala 2.11 Reporter: Sergey Zhemzhitsky Priority: Critical Spark fails to abort the job correctly in the case of custom OutputFormat implementations. There are OutputFormat implementations which do not need to use the standard Hadoop property *mapreduce.output.fileoutputformat.outputdir*. [But Spark reads this property from the configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79] while setting up an OutputCommitter: {code:scala} val committer = FileCommitProtocol.instantiate( className = classOf[HadoopMapReduceCommitProtocol].getName, jobId = stageId.toString, outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"), isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol] committer.setupJob(jobContext) {code} In that case, if the job fails, Spark executes [committer.abortJob|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L106] {code:scala} committer.abortJob(jobContext) {code} ... 
and fails with the following exception {code} Can not create a Path from a null string java.lang.IllegalArgumentException: Can not create a Path from a null string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123) at org.apache.hadoop.fs.Path.(Path.java:135) at org.apache.hadoop.fs.Path.(Path.java:89) at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58) at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141) at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084) {code}
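The defensive handling the report argues for can be sketched as follows. This is a hypothetical illustration in Python, not Spark's commit-protocol code; the function names are made up for the example. The point is to treat the missing property as optional instead of passing a null into path construction during abortJob.

```python
def resolve_output_path(conf):
    """Treat mapreduce.output.fileoutputformat.outputdir as optional."""
    # Custom OutputFormats may not set this property at all, so a missing
    # key must not propagate as null into Path construction.
    return conf.get("mapreduce.output.fileoutputformat.outputdir")

def abort_job(conf):
    """Sketch of an abort path that tolerates a missing output dir."""
    path = resolve_output_path(conf)
    if path is None:
        # Skip staging-dir cleanup entirely rather than crash with
        # "Can not create a Path from a null string".
        return "aborted without staging-dir cleanup"
    return f"cleaned staging dir under {path}"
```

With the current code, the `path is None` branch is never taken; the null flows into `new Path(null)` and the abort itself fails.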
[jira] [Commented] (SPARK-21547) Spark cleaner cost too many time
[ https://issues.apache.org/jira/browse/SPARK-21547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103459#comment-16103459 ] DjvuLee commented on SPARK-21547: - Yes, I agree that this is related to the workload, but doing nothing for about 3 minutes is too long for a streaming application. My proposal is that we inspect whether the current cleaner strategy is good enough. > Spark cleaner cost too many time > > > Key: SPARK-21547 > URL: https://issues.apache.org/jira/browse/SPARK-21547 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: DjvuLee > > Spark Streaming sometimes spends a long time on cleaning, and this can > become worse when dynamic allocation is enabled. > I posted the driver's log in the following comments; it shows that the > cleaner costs more than 2 minutes.
[jira] [Commented] (SPARK-21548) Support insert into serial columns of table
[ https://issues.apache.org/jira/browse/SPARK-21548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103336#comment-16103336 ] LvDongrong commented on SPARK-21548: We tried to solve it this way: https://github.com/lvdongr/spark/pull/1 > Support insert into serial columns of table > --- > > Key: SPARK-21548 > URL: https://issues.apache.org/jira/browse/SPARK-21548 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: LvDongrong > > When we use the 'insert into ...' statement, we can only insert all the > columns into the table. But in some cases our table has many columns and we > are only interested in some of them, so we want to support the statement > "insert into table tbl (column1, column2, ...) values (value1, value2, value3, ...)".
[jira] [Resolved] (SPARK-21319) UnsafeExternalRowSorter.RowComparator memory leak
[ https://issues.apache.org/jira/browse/SPARK-21319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21319. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18679 [https://github.com/apache/spark/pull/18679] > UnsafeExternalRowSorter.RowComparator memory leak > - > > Key: SPARK-21319 > URL: https://issues.apache.org/jira/browse/SPARK-21319 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: James Baker > Fix For: 2.3.0 > > Attachments: > 0001-SPARK-21319-Fix-memory-leak-in-UnsafeExternalRowSort.patch, hprof.png > > > When we wish to sort within partitions, we produce an > UnsafeExternalRowSorter. This contains an UnsafeExternalSorter, which > contains the UnsafeExternalRowComparator. > The UnsafeExternalSorter adds a task completion listener which performs any > additional required cleanup. The upshot of this is that we maintain a > reference to the UnsafeExternalRowSorter.RowComparator until the end of the > task. > The RowComparator looks like > {code:java} > private static final class RowComparator extends RecordComparator { > private final Ordering ordering; > private final int numFields; > private final UnsafeRow row1; > private final UnsafeRow row2; > RowComparator(Ordering ordering, int numFields) { > this.numFields = numFields; > this.row1 = new UnsafeRow(numFields); > this.row2 = new UnsafeRow(numFields); > this.ordering = ordering; > } > @Override > public int compare(Object baseObj1, long baseOff1, Object baseObj2, long > baseOff2) { > // TODO: Why are the sizes -1? > row1.pointTo(baseObj1, baseOff1, -1); > row2.pointTo(baseObj2, baseOff2, -1); > return ordering.compare(row1, row2); > } > } > {code} > which means that this will contain references to the last baseObjs that were > passed in, and without tracking them for purposes of memory allocation. 
> We have a job which sorts within partitions and then coalesces partitions - > this has a tendency to OOM because of the references to old UnsafeRows that > were used during the sorting. > Attached is a screenshot of a memory dump during a task - our JVM has two > executor threads. > It can be seen that we have 2 references inside of row iterators, and 11 more > which are only known in the task completion listener or as part of memory > management.
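The leak pattern described above is language-neutral; here is a small self-checking Python illustration of it. A comparator that re-points its member fields at the last objects it compared keeps those objects reachable for as long as the comparator itself is retained (e.g. by a task-completion listener). The class names mirror the report but the code is illustrative, not Spark's.

```python
import gc
import weakref

class RowComparator:
    """Analogous to UnsafeExternalRowSorter.RowComparator."""
    def __init__(self):
        self.row1 = None  # like an UnsafeRow re-pointed in compare()
        self.row2 = None

    def compare(self, a, b):
        # Retains references, like row1.pointTo(baseObj1, baseOff1, -1).
        self.row1 = a
        self.row2 = b
        return 0

class Row:
    pass

comparator = RowComparator()   # imagine this held until task completion
row = Row()
tracker = weakref.ref(row)
comparator.compare(row, row)
del row
gc.collect()
assert tracker() is not None   # still pinned by the retained comparator

# The fix amounts to dropping the references when sorting is done:
comparator.row1 = comparator.row2 = None
gc.collect()
assert tracker() is None       # now collectable
```

This is exactly why the coalesce-after-sort job OOMs: the rows are dead from the program's point of view but still reachable through the comparator.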
[jira] [Assigned] (SPARK-21319) UnsafeExternalRowSorter.RowComparator memory leak
[ https://issues.apache.org/jira/browse/SPARK-21319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21319: --- Assignee: Wenchen Fan > UnsafeExternalRowSorter.RowComparator memory leak > - > > Key: SPARK-21319 > URL: https://issues.apache.org/jira/browse/SPARK-21319 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: James Baker >Assignee: Wenchen Fan > Fix For: 2.3.0
[jira] [Closed] (SPARK-21539) Job should not be aborted when dynamic allocation is enabled or spark.executor.instances larger then current allocated number by yarn
[ https://issues.apache.org/jira/browse/SPARK-21539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang closed SPARK-21539. Resolution: Duplicate > Job should not be aborted when dynamic allocation is enabled or > spark.executor.instances larger then current allocated number by yarn > - > > Key: SPARK-21539 > URL: https://issues.apache.org/jira/browse/SPARK-21539 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1, 2.1.0 >Reporter: zhoukang > > For Spark on YARN: > right now, when a TaskSet cannot run on any node or host, blacklistedEverywhere is true > in TaskSetManager#abortIfCompleteBlacklisted. > However, if dynamic allocation is enabled, we should wait for YARN to > allocate a new NodeManager in order to execute the job successfully. > How to reproduce? > 1. Set up a YARN cluster with 5 nodes, and give node1 much more CPU > and memory, so that YARN can launch containers on this node even when it is > blacklisted by the TaskScheduler. > 2. Modify BlockManager#registerWithExternalShuffleServer: > {code:java} > logInfo("Registering executor with local external shuffle service.") > val shuffleConfig = new ExecutorShuffleInfo( > diskBlockManager.localDirs.map(_.toString), > diskBlockManager.subDirsPerLocalDir, > shuffleManager.getClass.getName) > val MAX_ATTEMPTS = conf.get(config.SHUFFLE_REGISTRATION_MAX_ATTEMPTS) > val SLEEP_TIME_SECS = 5 > for (i <- 1 to MAX_ATTEMPTS) { > try { > {color:red}if (shuffleId.host.equals("node1's address")) { > throw new Exception > }{color} > // Synchronous and will throw an exception if we cannot connect. 
> > shuffleClient.asInstanceOf[ExternalShuffleClient].registerWithShuffleServer( > shuffleServerId.host, shuffleServerId.port, > shuffleServerId.executorId, shuffleConfig) > return > } catch { > case e: Exception if i < MAX_ATTEMPTS => > logError(s"Failed to connect to external shuffle server, will retry > ${MAX_ATTEMPTS - i}" > + s" more times after waiting $SLEEP_TIME_SECS seconds...", e) > Thread.sleep(SLEEP_TIME_SECS * 1000) > case NonFatal(e) => > throw new SparkException("Unable to register with external shuffle > server due to : " + > e.getMessage, e) > } > } > {code} > with the logic in red added. > 3. Set the shuffle service to enabled and open the shuffle service for YARN. > Then YARN will always launch the executor on node1, but it fails because the shuffle > service cannot register successfully, and the job is aborted.
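The bounded-retry loop quoted above can be sketched in plain Python. This is an illustrative reduction of the pattern (names are made up), useful for seeing why the job aborts: once the fixed number of attempts is exhausted, the failure is final, with no waiting for YARN to allocate a different node.

```python
import time

def register_with_retry(register, max_attempts=3, sleep_secs=0.0):
    """Attempt registration up to max_attempts times, sleeping between
    failures, then give up with a wrapped error (mirroring the Scala loop)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return register()
        except Exception as e:
            last_error = e
            if attempt < max_attempts:
                time.sleep(sleep_secs)
    raise RuntimeError(
        "Unable to register with external shuffle server") from last_error
```

In the reproduction above, node1 always fails, so every attempt is consumed and the final RuntimeError propagates, aborting the job.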
[jira] [Resolved] (SPARK-19270) Add summary table to GLM summary
[ https://issues.apache.org/jira/browse/SPARK-19270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-19270. - Resolution: Fixed Fix Version/s: 2.3.0 > Add summary table to GLM summary > > > Key: SPARK-19270 > URL: https://issues.apache.org/jira/browse/SPARK-19270 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Wayne Zhang >Assignee: Wayne Zhang >Priority: Minor > Fix For: 2.3.0 > > > Add an R-like summary table to the GLM summary, which includes feature name (if > it exists), parameter estimate, standard error, t-stat and p-value. This allows > Scala users to easily gather these commonly used inference results.
[jira] [Commented] (SPARK-19700) Design an API for pluggable scheduler implementations
[ https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103228#comment-16103228 ] Rob Genova commented on SPARK-19700: FYI, the Apache Spark fork enhanced to support HashiCorp Nomad as a scheduler is now located at: https://github.com/hashicorp/nomad-spark. If you are interested in trying it out, the best place to get started is: https://www.nomadproject.io/guides/spark/spark.html. > Design an API for pluggable scheduler implementations > - > > Key: SPARK-19700 > URL: https://issues.apache.org/jira/browse/SPARK-19700 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Matt Cheah > > One point that was brought up in discussing SPARK-18278 was that schedulers > cannot easily be added to Spark without forking the whole project. The main > reason is that much of the scheduler's behavior fundamentally depends on the > CoarseGrainedSchedulerBackend class, which is not part of the public API of > Spark and is in fact quite a complex module. As resource management and > allocation continues to evolve, Spark will need to be integrated with more > cluster managers, but maintaining support for all possible allocators in the > Spark project would be untenable. Furthermore, it would be impossible for > Spark to support proprietary frameworks that are developed by specific users > for their own particular use cases. > Therefore, this ticket proposes making scheduler implementations fully > pluggable. The idea is that Spark will provide a Java/Scala interface that is > to be implemented by a scheduler that is backed by the cluster manager of > interest. The user can compile their scheduler's code into a JAR that is > placed on the driver's classpath. Finally, as is the case in the current > world, the scheduler implementation is selected and dynamically loaded > depending on the user's provided master URL. > Determining the correct API is the most challenging problem. 
The current > CoarseGrainedSchedulerBackend handles many responsibilities, some of which > will be common across all cluster managers, and some which will be specific > to a particular cluster manager. For example, the particular mechanism for > creating the executor processes will differ between YARN and Mesos, but, once > these executors have started running, the means to submit tasks to them over > the Netty RPC is identical across the board. > We must also consider a plugin model and interface for submitting the > application as well, because different cluster managers support different > configuration options, and thus the driver must be bootstrapped accordingly. > For example, in YARN mode the application and Hadoop configuration must be > packaged and shipped to the distributed cache prior to launching the job. A > prototype of a Kubernetes implementation starts a Kubernetes pod that runs > the driver in cluster mode.
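The selection-by-master-URL idea described in the ticket can be sketched as a tiny plugin registry. This is a hedged illustration in Python; the interface, class names, and URL schemes below are hypothetical and are not an actual Spark API.

```python
class SchedulerBackendPlugin:
    """Hypothetical interface: each backend declares which master URLs
    it can handle, mirroring the ticket's dynamic-selection idea."""
    def can_create(self, master_url: str) -> bool:
        raise NotImplementedError
    def name(self) -> str:
        raise NotImplementedError

class YarnBackend(SchedulerBackendPlugin):
    def can_create(self, master_url):
        return master_url == "yarn"
    def name(self):
        return "yarn"

class NomadBackend(SchedulerBackendPlugin):
    def can_create(self, master_url):
        return master_url.startswith("nomad:")
    def name(self):
        return "nomad"

def select_backend(master_url, plugins):
    """Pick the first registered plugin that claims the master URL."""
    for plugin in plugins:
        if plugin.can_create(master_url):
            return plugin
    raise ValueError(f"No scheduler backend for master URL: {master_url}")
```

In the proposed design, the plugin implementations would live in user-supplied JARs on the driver's classpath, and a registry like `select_backend` would be the only coupling point with Spark core.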
[jira] [Commented] (SPARK-20992) Support for Nomad as scheduler backend
[ https://issues.apache.org/jira/browse/SPARK-20992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103224#comment-16103224 ] Rob Genova commented on SPARK-20992: The Apache Spark fork enhanced to support HashiCorp Nomad as a scheduler is now located at: https://github.com/hashicorp/nomad-spark. If you are interested in trying it out, the best place to get started is here: https://www.nomadproject.io/guides/spark/spark.html. > Support for Nomad as scheduler backend > -- > > Key: SPARK-20992 > URL: https://issues.apache.org/jira/browse/SPARK-20992 > Project: Spark > Issue Type: New Feature > Components: Scheduler >Affects Versions: 2.1.1 >Reporter: Ben Barnard > > It is convenient to have scheduler backend support for running applications > on [Nomad|https://github.com/hashicorp/nomad], as with YARN and Mesos, so > that users can run Spark applications on a Nomad cluster without the need to > bring up a Spark Standalone cluster in the Nomad cluster. > Both client and cluster deploy modes should be supported.
[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103202#comment-16103202 ] xinzhang commented on SPARK-21067: -- Same here. The problem reappeared in the Spark 2.1.0 Thrift server: Open Beeline Session 1, Create Table 1 (success). Open Beeline Session 2, Create Table 2 (success). Close Beeline Session 1, Create Table 3 in Beeline Session 2 (FAIL). With Parquet, the issue is not present. Wenchen Fan > Thrift Server - CTAS fail with Unable to move source > > > Key: SPARK-21067 > URL: https://issues.apache.org/jira/browse/SPARK-21067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 > Environment: Yarn > Hive MetaStore > HDFS (HA) >Reporter: Dominic Ricard > > After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS > would fail, sometimes... > Most of the time, the CTAS would work only once, after starting the thrift > server. After that, dropping the table and re-issuing the same CTAS would > fail with the following message (sometimes it fails right away, sometimes it > works for a long period of time): > {noformat} > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > We have already found the following Jira > (https://issues.apache.org/jira/browse/SPARK-11021) which states that the > {{hive.exec.stagingdir}} had to be added in order for Spark to be able to > handle CREATE TABLE properly as of 2.0. 
As you can see in the error, we have > ours set to "/tmp/hive-staging/\{user.name\}" > Same issue with INSERT statements: > {noformat} > CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE > dricard.test SELECT 1; > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > This worked fine in 1.6.2, which we currently run in our Production > Environment but since 2.0+, we haven't been able to CREATE TABLE consistently > on the cluster. > SQL to reproduce issue: > {noformat} > DROP SCHEMA IF EXISTS dricard CASCADE; > CREATE SCHEMA dricard; > CREATE TABLE dricard.test (col1 int); > INSERT INTO TABLE dricard.test SELECT 1; > SELECT * from dricard.test; > DROP TABLE dricard.test; > CREATE TABLE dricard.test AS select 1 as `col1`; > SELECT * from dricard.test > {noformat} > Thrift server usually fails at INSERT... > Tried the same procedure in a spark context using spark.sql() and didn't > encounter the same issue. 
> Full stack Trace: > {noformat} > 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error > executing query, currentState RUNNING, > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0 > to desti > nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) > at >
[jira] [Comment Edited] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103202#comment-16103202 ] xinzhang edited comment on SPARK-21067 at 7/27/17 1:27 PM: --- Same here. The problem reappeared in the Spark 2.1.0 Thrift server: Open Beeline Session 1, Create Table 1 (success). Open Beeline Session 2, Create Table 2 (success). Close Beeline Session 1, Create Table 3 in Beeline Session 2 (FAIL). With Parquet, the issue is not present. @Wenchen Fan > Thrift Server - CTAS fail with Unable to move source > > > Key: SPARK-21067 > URL: https://issues.apache.org/jira/browse/SPARK-21067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 > Environment: Yarn > Hive MetaStore > HDFS (HA) >Reporter: Dominic Ricard > > After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS > would fail, sometimes... > Most of the time, the CTAS would work only once, after starting the thrift > server. 
After that, dropping the table and re-issuing the same CTAS would > fail with the following message (Sometime, it fails right away, sometime it > work for a long period of time): > {noformat} > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > We have already found the following Jira > (https://issues.apache.org/jira/browse/SPARK-11021) which state that the > {{hive.exec.stagingdir}} had to be added in order for Spark to be able to > handle CREATE TABLE properly as of 2.0. As you can see in the error, we have > ours set to "/tmp/hive-staging/\{user.name\}" > Same issue with INSERT statements: > {noformat} > CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE > dricard.test SELECT 1; > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > This worked fine in 1.6.2, which we currently run in our Production > Environment but since 2.0+, we haven't been able to CREATE TABLE consistently > on the cluster. > SQL to reproduce issue: > {noformat} > DROP SCHEMA IF EXISTS dricard CASCADE; > CREATE SCHEMA dricard; > CREATE TABLE dricard.test (col1 int); > INSERT INTO TABLE dricard.test SELECT 1; > SELECT * from dricard.test; > DROP TABLE dricard.test; > CREATE TABLE dricard.test AS select 1 as `col1`; > SELECT * from dricard.test > {noformat} > Thrift server usually fails at INSERT... 
> Tried the same procedure in a spark context using spark.sql() and didn't > encounter the same issue. > Full stack Trace: > {noformat} > 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error > executing query, currentState RUNNING, > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0 > to desti > nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at >
[jira] [Created] (SPARK-21548) Support insert into serial columns of table
LvDongrong created SPARK-21548: -- Summary: Support insert into serial columns of table Key: SPARK-21548 URL: https://issues.apache.org/jira/browse/SPARK-21548 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.2.0 Reporter: LvDongrong When we use the 'insert into ...' statement, we can only insert all of the columns into a table. But in some cases our table has many columns and we are only interested in some of them, so we want to support the statement "insert into table tbl (column1, column2,...) values (value1, value2, value3,...)". -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
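The behavior requested above, inserting into only a named subset of a table's columns, is standard SQL in most engines; a minimal illustration using SQLite rather than Spark (the table and column names are made up for the example):

```python
import sqlite3

# Standard SQL allows naming a subset of columns in INSERT INTO;
# omitted columns fall back to NULL or their declared defaults.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (column1 INT, column2 TEXT, column3 TEXT DEFAULT 'n/a')"
)
conn.execute("INSERT INTO tbl (column1, column2) VALUES (?, ?)", (1, "a"))
row = conn.execute("SELECT column1, column2, column3 FROM tbl").fetchone()
print(row)  # (1, 'a', 'n/a') -- column3 filled from its default
```

The feature request is for Spark's parser to accept the same column-list form.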
[jira] [Commented] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103152#comment-16103152 ] Nick Pentreath commented on SPARK-21535: Isn't this in direct opposition to https://issues.apache.org/jira/browse/SPARK-21086? > Reduce memory requirement for CrossValidator and TrainValidationSplit > -- > > Key: SPARK-21535 > URL: https://issues.apache.org/jira/browse/SPARK-21535 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > CrossValidator and TrainValidationSplit both use > {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where > epm is Array[ParamMap]. > Even though the training process is sequential, current implementation > consumes extra driver memory for holding the trained models, which is not > necessary and often leads to memory exception for both CrossValidator and > TrainValidationSplit. My proposal is to optimize the training implementation, > thus that used model can be collected by GC, and avoid the unnecessary OOM > exceptions. > E.g. when grid search space is 12, old implementation needs to hold all 12 > trained models in the driver memory at the same time, while the new > implementation only needs to hold 1 trained model at a time, and previous > model can be cleared by GC. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
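The proposal in the issue above, sketched in plain Python without Spark: instead of materializing every fitted model for the whole parameter grid on the driver, fit and evaluate one grid point at a time so each previous model becomes garbage-collectible. `fit` and `evaluate` here are stand-ins for the real estimator/evaluator API.

```python
def fit(params):
    # stand-in for est.fit(trainingDataset, paramMap); pretend the model is large
    return {"params": params, "weights": [0.0] * 1000}

def evaluate(model):
    # stand-in for the evaluator; score by the grid point itself
    return model["params"]

grid = range(12)

# Old approach: all 12 trained models are alive on the driver at once.
models = [fit(p) for p in grid]
metrics_old = [evaluate(m) for m in models]

# Proposed approach: only one model is referenced at a time, so each
# previous model can be reclaimed by GC before the next one is trained.
metrics_new = [evaluate(fit(p)) for p in grid]

assert metrics_old == metrics_new  # same metrics, smaller peak memory
```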
[jira] [Resolved] (SPARK-21440) Refactor ArrowConverters and add ArrayType and StructType support.
[ https://issues.apache.org/jira/browse/SPARK-21440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21440. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18655 [https://github.com/apache/spark/pull/18655] > Refactor ArrowConverters and add ArrayType and StructType support. > -- > > Key: SPARK-21440 > URL: https://issues.apache.org/jira/browse/SPARK-21440 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin > Fix For: 2.3.0 > > > This is a refactoring of {{ArrowConverters}} and related classes. > # Refactor {{ColumnWriter}} as {{ArrowWriter}}. > # Add {{ArrayType}} and {{StructType}} support. > # Refactor {{ArrowConverters}} to skip intermediate {{ArrowRecordBatch}} > creation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21440) Refactor ArrowConverters and add ArrayType and StructType support.
[ https://issues.apache.org/jira/browse/SPARK-21440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21440: --- Assignee: Takuya Ueshin > Refactor ArrowConverters and add ArrayType and StructType support. > -- > > Key: SPARK-21440 > URL: https://issues.apache.org/jira/browse/SPARK-21440 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 2.3.0 > > > This is a refactoring of {{ArrowConverters}} and related classes. > # Refactor {{ColumnWriter}} as {{ArrowWriter}}. > # Add {{ArrayType}} and {{StructType}} support. > # Refactor {{ArrowConverters}} to skip intermediate {{ArrowRecordBatch}} > creation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15142) Spark Mesos dispatcher becomes unusable when the Mesos master restarts
[ https://issues.apache.org/jira/browse/SPARK-15142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103035#comment-16103035 ] Stavros Kontopoulos edited comment on SPARK-15142 at 7/27/17 10:12 AM: --- [~devaraj.k] Thnx. Btw I was not able to reproduce the problem with drivers being queued. After master was up then things went fine. I was able to launch more drivers. https://gist.github.com/skonto/0e23af643e7271e0125e321b04e630a2 was (Author: skonto): [~devaraj.k] Thnx. Btw I was not able to reproduce the problem with drivers being queued. After master was up then things went fine. I was able to launch more drivers. https://gist.github.com/skonto/0e23af643e7271e0125e321b04e630a2 > Spark Mesos dispatcher becomes unusable when the Mesos master restarts > -- > > Key: SPARK-15142 > URL: https://issues.apache.org/jira/browse/SPARK-15142 > Project: Spark > Issue Type: Bug > Components: Deploy, Mesos >Reporter: Devaraj K >Priority: Minor > Attachments: > spark-devaraj-org.apache.spark.deploy.mesos.MesosClusterDispatcher-1-stobdtserver5.out > > > While Spark Mesos dispatcher running if the Mesos master gets restarted then > Spark Mesos dispatcher will keep running and queues up all the submitted > applications and will not launch them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15142) Spark Mesos dispatcher becomes unusable when the Mesos master restarts
[ https://issues.apache.org/jira/browse/SPARK-15142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103035#comment-16103035 ] Stavros Kontopoulos commented on SPARK-15142: - [~devaraj.k] Thnx. Btw I was not able to reproduce the problem with drivers being queued. After master was up then things went fine. I was able to launch more drivers. https://gist.github.com/skonto/0e23af643e7271e0125e321b04e630a2 > Spark Mesos dispatcher becomes unusable when the Mesos master restarts > -- > > Key: SPARK-15142 > URL: https://issues.apache.org/jira/browse/SPARK-15142 > Project: Spark > Issue Type: Bug > Components: Deploy, Mesos >Reporter: Devaraj K >Priority: Minor > Attachments: > spark-devaraj-org.apache.spark.deploy.mesos.MesosClusterDispatcher-1-stobdtserver5.out > > > While Spark Mesos dispatcher running if the Mesos master gets restarted then > Spark Mesos dispatcher will keep running and queues up all the submitted > applications and will not launch them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21547) Spark cleaner cost too many time
[ https://issues.apache.org/jira/browse/SPARK-21547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103021#comment-16103021 ] Sean Owen commented on SPARK-21547: --- It's not clear that this is "too slow", relative to whatever work it's doing. Why is it unnecessarily slow, or what change are you proposing? > Spark cleaner cost too many time > > > Key: SPARK-21547 > URL: https://issues.apache.org/jira/browse/SPARK-21547 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: DjvuLee > > Spark Streaming sometime cost so many time deal with cleaning, and this can > become worse when enable the dynamic allocation. > I post the Driver's Log in the following comments, we can find that the > cleaner costs more than 2min. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21547) Spark cleaner cost too many time
[ https://issues.apache.org/jira/browse/SPARK-21547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DjvuLee updated SPARK-21547: Description: Spark Streaming sometime cost so many time deal with cleaning, and this can become worse when enable the dynamic allocation. I post the Driver's Log in the following comments, we can find that the cleaner costs more than 2min. was: Spark Streaming sometime cost so many time deal with cleaning, and this can become worse when enable the dynamic allocation. I post the Driver Log in the following, in this log we can find that the cleaner cost more than 2min. > Spark cleaner cost too many time > > > Key: SPARK-21547 > URL: https://issues.apache.org/jira/browse/SPARK-21547 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: DjvuLee > > Spark Streaming sometime cost so many time deal with cleaning, and this can > become worse when enable the dynamic allocation. > I post the Driver's Log in the following comments, we can find that the > cleaner costs more than 2min. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21547) Spark cleaner cost too many time
[ https://issues.apache.org/jira/browse/SPARK-21547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DjvuLee updated SPARK-21547: Description: Spark Streaming sometime cost so many time deal with cleaning, and this can become worse when enable the dynamic allocation. I post the Driver Log in the following, in this log we can find that the cleaner cost more than 2min. was:Spark Streaming sometime cost so many time deal with cleaning, and this can become worse when enable the dynamic allocation. > Spark cleaner cost too many time > > > Key: SPARK-21547 > URL: https://issues.apache.org/jira/browse/SPARK-21547 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: DjvuLee > > Spark Streaming sometime cost so many time deal with cleaning, and this can > become worse when enable the dynamic allocation. > I post the Driver Log in the following, in this log we can find that the > cleaner cost more than 2min. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21547) Spark cleaner cost too many time
[ https://issues.apache.org/jira/browse/SPARK-21547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103005#comment-16103005 ] DjvuLee commented on SPARK-21547: - 17/07/27 11:29:51 INFO TaskSetManager: Finished task 169.0 in stage 1504.0 (TID 1504369) in 43975 ms on n6-195-137.byted.org (999/1000) 17/07/27 11:29:55 INFO TaskSetManager: Finished task 882.0 in stage 1504.0 (TID 1504905) in 44153 ms on n6-195-137.byted.org (1000/1000) 17/07/27 11:29:55 INFO YarnScheduler: Removed TaskSet 1504.0, whose tasks have all completed, from pool 17/07/27 11:29:55 INFO DAGScheduler: ResultStage 1504 (call at /spark2/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py:2230) finished in 457.863 s 17/07/27 11:29:55 INFO DAGScheduler: Job 1504 finished: call at /spark2/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py:2230, took 457.877969 s 17/07/27 11:30:02 INFO JobScheduler: Added jobs for time 150112620 ms 17/07/27 11:30:32 INFO JobScheduler: Added jobs for time 150112623 ms 17/07/27 11:31:02 INFO JobScheduler: Added jobs for time 150112626 ms 17/07/27 11:31:32 INFO JobScheduler: Added jobs for time 150112629 ms 17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906391 17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906392 17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906396 17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906402 17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906404 17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 12492509 17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 12492508 17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 12492507 17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 12492506 17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 12492505 17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 12492504 17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 12492503 17/07/27 11:31:53 INFO ContextCleaner: 
Cleaned accumulator 12492502 ... 7/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906397 17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906398 17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906395 17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906399 17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906403 17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906400 17/07/27 11:31:53 INFO ContextCleaner: Cleaned accumulator 10906401 17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on 10.6.131.75:23734 in memory (size: 35.9 KB, free: 2.4 GB) 17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on n8-157-227.byted.org:13090 in memory (size: 35.9 KB, free: 9.4 GB) 17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on n8-157-158.byted.org:21120 in memory (size: 35.9 KB, free: 9.4 GB) 17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on n6-195-150.byted.org:13277 in memory (size: 35.9 KB, free: 9.4 GB) 17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on n8-156-165.byted.org:35355 in memory (size: 35.9 KB, free: 9.4 GB) 17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on n6-132-023.byted.org:52521 in memory (size: 35.9 KB, free: 9.4 GB) 17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on n8-136-133.byted.org:25696 in memory (size: 35.9 KB, free: 9.4 GB) 17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on n8-150-029.byted.org:34673 in memory (size: 35.9 KB, free: 9.4 GB) 17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on n8-148-038.byted.org:22503 in memory (size: 35.9 KB, free: 9.4 GB) 17/07/27 11:31:53 INFO BlockManagerInfo: Removed broadcast_1504_piece0 on n8-150-038.byted.org:28209 in memory (size: 35.9 KB, free: 9.4 GB) ... 
17/07/27 11:32:01 INFO BlockManagerInfo: Removed broadcast_1442_piece0 on n8-163-151.byted.org:33703 in memory (size: 35.9 KB, free: 9.4 GB) 17/07/27 11:32:01 INFO BlockManagerInfo: Removed broadcast_1442_piece0 on n8-148-028.byted.org:36086 in memory (size: 35.9 KB, free: 9.4 GB) 17/07/27 11:32:01 INFO BlockManagerInfo: Removed broadcast_1442_piece0 on n8-151-039.byted.org:21081 in memory (size: 35.9 KB, free: 9.4 GB) 17/07/27 11:32:01 INFO BlockManagerInfo: Removed broadcast_1442_piece0 on n8-157-167.byted.org:29370 in memory (size: 35.9 KB, free: 9.4 GB) 17/07/27 11:32:02 INFO JobScheduler: Added jobs for time 150112632 ms 17/07/27 11:32:32 INFO JobScheduler: Added jobs for time 150112635 ms 17/07/27 11:32:45 INFO JobScheduler: Finished job streaming job 150111696 ms.0 from job set of time 150111696 ms 17/07/27 11:32:45 INFO JobScheduler: Total delay: 9405.183 s for time 150111696 ms (execution: 1169.595 s) 17/07/27 11:32:45 INFO JobScheduler: Starting job streaming
[jira] [Updated] (SPARK-21547) Spark cleaner cost too many time
[ https://issues.apache.org/jira/browse/SPARK-21547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DjvuLee updated SPARK-21547: Description: Spark Streaming sometime cost so many time deal with cleaning, and this can become worse when enable the dynamic allocation. > Spark cleaner cost too many time > > > Key: SPARK-21547 > URL: https://issues.apache.org/jira/browse/SPARK-21547 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: DjvuLee > > Spark Streaming sometime cost so many time deal with cleaning, and this can > become worse when enable the dynamic allocation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21547) Spark cleaner cost too many time
DjvuLee created SPARK-21547: --- Summary: Spark cleaner cost too many time Key: SPARK-21547 URL: https://issues.apache.org/jira/browse/SPARK-21547 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.0.0 Reporter: DjvuLee -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20990) Multi-line support for JSON
[ https://issues.apache.org/jira/browse/SPARK-20990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102975#comment-16102975 ] Marco Gaido commented on SPARK-20990: - A PR fixing it is ready: https://github.com/apache/spark/pull/18731. > Multi-line support for JSON > --- > > Key: SPARK-20990 > URL: https://issues.apache.org/jira/browse/SPARK-20990 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li > > When `multiLine` option is on, the existing JSON parser only reads the first > record. We should read the other records in the same file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
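What the fix needs to do, sketched with the standard `json` module rather than Spark's actual Jackson-based parser: with `multiLine` on, one file may hold several records, and the parser must resume from where the previous record ended instead of stopping after the first.

```python
import json

def parse_all_records(text):
    """Repeatedly invoke the decoder, resuming after each parsed record."""
    decoder = json.JSONDecoder()
    pos, records = 0, []
    while pos < len(text):
        # skip inter-record whitespace (including newlines)
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        record, pos = decoder.raw_decode(text, pos)
        records.append(record)
    return records

# two records, the second spanning multiple lines
multi_line = '{"a": 1}\n{"a": 2,\n "b": 3}\n'
print(parse_all_records(multi_line))  # [{'a': 1}, {'a': 2, 'b': 3}]
```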
[jira] [Commented] (SPARK-10802) Let ALS recommend for subset of data
[ https://issues.apache.org/jira/browse/SPARK-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102971#comment-16102971 ] Nick Pentreath commented on SPARK-10802: For those that may be interested - I opened a PR to add this functionality to {{ml}}'s {{ALSModel}} here: https://github.com/apache/spark/pull/18748 > Let ALS recommend for subset of data > > > Key: SPARK-10802 > URL: https://issues.apache.org/jira/browse/SPARK-10802 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.5.0 >Reporter: Tomasz Bartczak >Priority: Minor > > Currently MatrixFactorizationModel allows to get recommendations for > - single user > - single product > - all users > - all products > recommendation for all users/products do a cartesian join inside. > It would be useful in some cases to get recommendations for subset of > users/products by providing an RDD with which MatrixFactorizationModel could > do an intersection before doing a cartesian join. This would make it much > faster in situation where recommendations are needed only for subset of > users/products, and when the subset is still too large to make it feasible to > recommend one-by-one. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
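The idea behind recommending for a subset can be shown with plain dot products: restrict the user factors to the requested users before scoring against the item factors, instead of scoring the full cartesian product. This is a toy sketch of the underlying computation with made-up factor values, not the actual {{ml}} API added by the PR.

```python
# Toy factor matrices: 3 users, 4 items, rank-2 latent factors.
user_factors = {0: (1.0, 0.0), 1: (0.0, 1.0), 2: (1.0, 1.0)}
item_factors = {10: (0.9, 0.1), 11: (0.2, 0.8), 12: (0.5, 0.5), 13: (0.0, 1.0)}

def recommend_for_subset(users, k):
    """Score only the requested users against all items; keep top-k each."""
    out = {}
    for u in users:
        uf = user_factors[u]
        scores = [
            (item, uf[0] * f[0] + uf[1] * f[1])
            for item, f in item_factors.items()
        ]
        scores.sort(key=lambda t: -t[1])
        out[u] = [item for item, _ in scores[:k]]
    return out

# user 2 never gets scored, which is the point of the subset variant
recs = recommend_for_subset({0, 1}, k=2)
print(recs)
```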
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102969#comment-16102969 ] Liang-Chi Hsieh commented on SPARK-21274: - [~Tagar] Is the rewrite of INTERSECT ALL correct? Take the example at https://github.com/apache/spark/pull/11106#issuecomment-182603275: {code} [1, 2, 2] intersect_all [1, 2] == [1, 2] [1, 2, 2] intersect_all [1, 2, 2] == [1, 2, 2] {code} Looks like the rewrite returns [1, 2, 2] for both queries. Isn't it? Or did I misread something? > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: Optimizer, SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov > Labels: set, sql > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as the following outer join: > {code} > SELECT a,b,c > FROM tab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register this second query as a temp. view under the name "*t1_except_t2_df*", which can also be used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as the following anti-join using the t1_except_t2_df we defined above: > {code} > SELECT a,b,c > FROM tab1 t1 > WHERE > NOT EXISTS > (SELECT 1 > FROM t1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) > ) > {code} > So the suggestion is just to use the above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL SQL set operations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
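The multiset semantics this comment is probing can be pinned down with `collections.Counter`: under standard SQL, INTERSECT ALL keeps each row min(m, n) times and EXCEPT ALL keeps max(m - n, 0) copies, so the two queries in the comment must give different results, and a rewrite returning [1, 2, 2] for both would be wrong.

```python
from collections import Counter

def intersect_all(left, right):
    # each value survives min(count_left, count_right) times
    return sorted((Counter(left) & Counter(right)).elements())

def except_all(left, right):
    # each value survives max(count_left - count_right, 0) times
    return sorted((Counter(left) - Counter(right)).elements())

# the example from the PR comment: the duplicate only survives when
# it is duplicated on BOTH sides
assert intersect_all([1, 2, 2], [1, 2]) == [1, 2]
assert intersect_all([1, 2, 2], [1, 2, 2]) == [1, 2, 2]
assert except_all([1, 2, 2], [1, 2]) == [2]
```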
[jira] [Updated] (SPARK-21546) dropDuplicates with watermark yields RuntimeException due to binding failure
[ https://issues.apache.org/jira/browse/SPARK-21546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-21546: Description: With today's master... The following streaming query with watermark and {{dropDuplicates}} yields {{RuntimeException}} due to failure in binding. {code} val topic1 = spark. readStream. format("kafka"). option("subscribe", "topic1"). option("kafka.bootstrap.servers", "localhost:9092"). option("startingoffsets", "earliest"). load val records = topic1. withColumn("eventtime", 'timestamp). // <-- just to put the right name given the purpose withWatermark(eventTime = "eventtime", delayThreshold = "30 seconds"). // <-- use the renamed eventtime column dropDuplicates("value"). // dropDuplicates will use watermark // only when eventTime column exists // include the watermark column => internal design leak? select('key cast "string", 'value cast "string", 'eventtime). as[(String, String, java.sql.Timestamp)] scala> records.explain == Physical Plan == *Project [cast(key#0 as string) AS key#169, cast(value#1 as string) AS value#170, eventtime#157-T3ms] +- StreamingDeduplicate [value#1], StatefulOperatorStateInfo(,93c3de98-3f85-41a4-8aef-d09caf8ea693,0,0), 0 +- Exchange hashpartitioning(value#1, 200) +- EventTimeWatermark eventtime#157: timestamp, interval 30 seconds +- *Project [key#0, value#1, timestamp#5 AS eventtime#157] +- StreamingRelation kafka, [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestampType#6] import org.apache.spark.sql.streaming.{OutputMode, Trigger} val sq = records. writeStream. format("console"). option("truncate", false). trigger(Trigger.ProcessingTime("10 seconds")). queryName("from-kafka-topic1-to-console"). outputMode(OutputMode.Update). 
start {code} {code} --- Batch: 0 --- 17/07/27 10:28:58 ERROR Executor: Exception in task 3.0 in stage 13.0 (TID 438) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: eventtime#157-T3ms at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:977) at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:370) at 
org.apache.spark.sql.execution.streaming.StreamingDeduplicateExec.org$apache$spark$sql$execution$streaming$WatermarkSupport$$super$newPredicate(statefulOperators.scala:350) at org.apache.spark.sql.execution.streaming.WatermarkSupport$$anonfun$watermarkPredicateForKeys$1.apply(statefulOperators.scala:160) at org.apache.spark.sql.execution.streaming.WatermarkSupport$$anonfun$watermarkPredicateForKeys$1.apply(statefulOperators.scala:160) at scala.Option.map(Option.scala:146) at org.apache.spark.sql.execution.streaming.WatermarkSupport$class.watermarkPredicateForKeys(statefulOperators.scala:160) at org.apache.spark.sql.execution.streaming.StreamingDeduplicateExec.watermarkPredicateForKeys$lzycompute(statefulOperators.scala:350) at
[jira] [Commented] (SPARK-21544) Test jar of some module should not install or deploy twice
[ https://issues.apache.org/jira/browse/SPARK-21544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102954#comment-16102954 ] zhoukang commented on SPARK-21544: -- This is not a serious problem, and it actually depends on the Maven central repository's strategy. But after looking into pom.xml, I found some meaningless duplicate executions, so I modified some of them. > Test jar of some module should not install or deploy twice > -- > > Key: SPARK-21544 > URL: https://issues.apache.org/jira/browse/SPARK-21544 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.6.1, 2.1.0 >Reporter: zhoukang > > For the modules below: > common/network-common > streaming > sql/core > sql/catalyst > the tests.jar will be installed or deployed twice, like: > {code:java} > [INFO] Installing > /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar > to > /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar > [DEBUG] Writing tracking file > /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/_remote.repositories > [DEBUG] Installing > org.apache.spark:spark-streaming_2.11:2.1.0-mdh2.1.0.1-SNAPSHOT/maven-metadata.xml > to > /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/maven-metadata-local.xml > [DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml > to > /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml > [INFO] Installing > /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar > to > /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar > [DEBUG] Skipped re-installing > 
/home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar > to > /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar, > seems unchanged > {code} > The reason is below: > {code:java} > [DEBUG] (f) artifact = > org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT > [DEBUG] (f) attachedArtifacts = > [org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, > org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, > org.apache.spark:spark > -streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, > org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT, > org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0 > -mdh2.1.0.1-SNAPSHOT] > {code} > When executing 'mvn deploy' to Nexus during a release, it will fail since artifacts in a release Nexus repository cannot be overwritten. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21546) dropDuplicates with watermark yields RuntimeException due to binding failure
[ https://issues.apache.org/jira/browse/SPARK-21546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-21546: Summary: dropDuplicates with watermark yields RuntimeException due to binding failure (was: dropDuplicates followed by select yields RuntimeException due to binding failure) > dropDuplicates with watermark yields RuntimeException due to binding failure > > > Key: SPARK-21546 > URL: https://issues.apache.org/jira/browse/SPARK-21546 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski > > With today's master... > The following streaming query yields {{RuntimeException}} due to failure in > binding (most likely due to {{select}} operator). > {code} > val topic1 = spark. > readStream. > format("kafka"). > option("subscribe", "topic1"). > option("kafka.bootstrap.servers", "localhost:9092"). > option("startingoffsets", "earliest"). > load > val records = topic1. > withColumn("eventtime", 'timestamp). // <-- just to put the right name > given the purpose > withWatermark(eventTime = "eventtime", delayThreshold = "30 seconds"). // > <-- use the renamed eventtime column > dropDuplicates("value"). // dropDuplicates will use watermark > // only when eventTime column exists > // include the watermark column => internal design leak? > select('key cast "string", 'value cast "string", 'eventtime). 
> as[(String, String, java.sql.Timestamp)] > scala> records.explain > == Physical Plan == > *Project [cast(key#0 as string) AS key#169, cast(value#1 as string) AS > value#170, eventtime#157-T3ms] > +- StreamingDeduplicate [value#1], > StatefulOperatorStateInfo(,93c3de98-3f85-41a4-8aef-d09caf8ea693,0,0), > 0 >+- Exchange hashpartitioning(value#1, 200) > +- EventTimeWatermark eventtime#157: timestamp, interval 30 seconds > +- *Project [key#0, value#1, timestamp#5 AS eventtime#157] > +- StreamingRelation kafka, [key#0, value#1, topic#2, > partition#3, offset#4L, timestamp#5, timestampType#6] > import org.apache.spark.sql.streaming.{OutputMode, Trigger} > val sq = records. > writeStream. > format("console"). > option("truncate", false). > trigger(Trigger.ProcessingTime("10 seconds")). > queryName("from-kafka-topic1-to-console"). > outputMode(OutputMode.Update). > start > {code} > {code} > --- > Batch: 0 > --- > 17/07/27 10:28:58 ERROR Executor: Exception in task 3.0 in stage 13.0 (TID > 438) > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: eventtime#157-T3ms > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:977) > at > org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:370) > at >
[jira] [Commented] (SPARK-21544) Test jar of some module should not install or deploy twice
[ https://issues.apache.org/jira/browse/SPARK-21544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102950#comment-16102950 ] zhoukang commented on SPARK-21544: -- This depends on Maven Central's strategy, i.e. whether artifacts can be overwritten or not.
[jira] [Created] (SPARK-21546) dropDuplicates followed by select yields RuntimeException due to binding failure
Jacek Laskowski created SPARK-21546: --- Summary: dropDuplicates followed by select yields RuntimeException due to binding failure Key: SPARK-21546 URL: https://issues.apache.org/jira/browse/SPARK-21546 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Jacek Laskowski With today's master... The following streaming query yields {{RuntimeException}} due to failure in binding (most likely due to {{select}} operator). {code} val topic1 = spark. readStream. format("kafka"). option("subscribe", "topic1"). option("kafka.bootstrap.servers", "localhost:9092"). option("startingoffsets", "earliest"). load val records = topic1. withColumn("eventtime", 'timestamp). // <-- just to put the right name given the purpose withWatermark(eventTime = "eventtime", delayThreshold = "30 seconds"). // <-- use the renamed eventtime column dropDuplicates("value"). // dropDuplicates will use watermark // only when eventTime column exists // include the watermark column => internal design leak? select('key cast "string", 'value cast "string", 'eventtime). as[(String, String, java.sql.Timestamp)] scala> records.explain == Physical Plan == *Project [cast(key#0 as string) AS key#169, cast(value#1 as string) AS value#170, eventtime#157-T3ms] +- StreamingDeduplicate [value#1], StatefulOperatorStateInfo(,93c3de98-3f85-41a4-8aef-d09caf8ea693,0,0), 0 +- Exchange hashpartitioning(value#1, 200) +- EventTimeWatermark eventtime#157: timestamp, interval 30 seconds +- *Project [key#0, value#1, timestamp#5 AS eventtime#157] +- StreamingRelation kafka, [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestampType#6] import org.apache.spark.sql.streaming.{OutputMode, Trigger} val sq = records. writeStream. format("console"). option("truncate", false). trigger(Trigger.ProcessingTime("10 seconds")). queryName("from-kafka-topic1-to-console"). outputMode(OutputMode.Update). 
start {code} {code} --- Batch: 0 --- 17/07/27 10:28:58 ERROR Executor: Exception in task 3.0 in stage 13.0 (TID 438) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: eventtime#157-T3ms at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:977) at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:370) at 
org.apache.spark.sql.execution.streaming.StreamingDeduplicateExec.org$apache$spark$sql$execution$streaming$WatermarkSupport$$super$newPredicate(statefulOperators.scala:350) at org.apache.spark.sql.execution.streaming.WatermarkSupport$$anonfun$watermarkPredicateForKeys$1.apply(statefulOperators.scala:160) at org.apache.spark.sql.execution.streaming.WatermarkSupport$$anonfun$watermarkPredicateForKeys$1.apply(statefulOperators.scala:160) at scala.Option.map(Option.scala:146) at
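The "Binding attribute, tree: eventtime#157-T3ms" failure above happens when the watermark predicate references an attribute that is absent from the schema it is bound against (the downstream {{select}} projected it away). A toy model of reference binding, simplified from Catalyst's actual bind-by-exprId logic (the names and ids here are illustrative only):

```python
# Toy model of expression binding: resolve each referenced attribute id to
# an ordinal in the operator's input schema. A reference whose id is missing
# raises, mirroring the "Binding attribute" TreeNodeException in the trace.
def bind_reference(attr_id, input_schema):
    """Return the ordinal of attr_id in input_schema, or raise if absent."""
    for ordinal, schema_id in enumerate(input_schema):
        if schema_id == attr_id:
            return ordinal
    raise ValueError(f"Couldn't find attribute #{attr_id} in input schema")

input_schema = [0, 1, 5]                # roughly: key#0, value#1, timestamp#5
print(bind_reference(1, input_schema))  # 1: value#1 binds fine
try:
    bind_reference(157, input_schema)   # eventtime#157 is not in this input
except ValueError as e:
    print(e)
```

This is only a sketch of the failure mode; the real fix belongs in how StreamingDeduplicate builds its watermark predicate, not in user code.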
[jira] [Commented] (SPARK-21543) Should not count executor initialize failed towards task failures
[ https://issues.apache.org/jira/browse/SPARK-21543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102937#comment-16102937 ] zhoukang commented on SPARK-21543: -- And in my production cluster there was a case where executor initialization failed because of a bad disk (the executor could not register with the shuffle server); after 4 such failures the job exited. > Should not count executor initialize failed towards task failures > - > > Key: SPARK-21543 > URL: https://issues.apache.org/jira/browse/SPARK-21543 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.1, 2.1.0 >Reporter: zhoukang > > Until now, when an executor fails to initialize and exits with error code 1, this counts toward task failures. I think executor initialization failures should not count toward task failures.
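The change being argued for can be sketched as a failure counter that treats executor-initialization failures separately from failures of tasks that actually launched. This is a conceptual model, not Spark's scheduler code; the limit of 4 matches the default of spark.task.maxFailures, consistent with the "after 4 failures the job exited" observation:

```python
# Conceptual sketch: only failures of launched tasks count toward the
# task-failure limit; an executor that failed to initialize never ran a task.
MAX_TASK_FAILURES = 4  # analogous to spark.task.maxFailures (default 4)

def should_abort(events, max_failures=MAX_TASK_FAILURES):
    """events: list of 'task_failed' or 'executor_init_failed' strings."""
    task_failures = sum(1 for e in events if e == "task_failed")
    return task_failures >= max_failures

# Four bad-disk executor init failures alone should not abort the job...
print(should_abort(["executor_init_failed"] * 4))  # False
# ...whereas four genuine task failures should.
print(should_abort(["task_failed"] * 4))           # True
```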
[jira] [Commented] (SPARK-21543) Should not count executor initialize failed towards task failures
[ https://issues.apache.org/jira/browse/SPARK-21543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102934#comment-16102934 ] zhoukang commented on SPARK-21543: -- From my understanding, if executor initialization failed, then no task has actually been launched.
[jira] [Commented] (SPARK-21544) Test jar of some module should not install or deploy twice
[ https://issues.apache.org/jira/browse/SPARK-21544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102929#comment-16102929 ] Sean Owen commented on SPARK-21544: --- Spark is successfully released to Maven Central by its maintainers. I am not sure it's meant to be something you can release yourself via a different mechanism. I'm also not clear what the impact is on the Spark release process.
[jira] [Commented] (SPARK-21544) Test jar of some module should not install or deploy twice
[ https://issues.apache.org/jira/browse/SPARK-21544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102918#comment-16102918 ] zhoukang commented on SPARK-21544: -- I have updated the log. This causes our deploy to Nexus to fail: when executing 'mvn deploy', the tests.jar is uploaded twice, but release repositories do not allow uploading the same artifact twice, so the deploy fails. A workaround is to deploy with the -pl option, but I don't think that is a clean solution.
[jira] [Updated] (SPARK-21544) Test jar of some module should not install or deploy twice
[ https://issues.apache.org/jira/browse/SPARK-21544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-21544: - Description: For moudle below: common/network-common streaming sql/core sql/catalyst tests.jar will install or deploy twice.Like: {code:java} [INFO] Installing /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar [DEBUG] Writing tracking file /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/_remote.repositories [DEBUG] Installing org.apache.spark:spark-streaming_2.11:2.1.0-mdh2.1.0.1-SNAPSHOT/maven-metadata.xml to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/maven-metadata-local.xml [DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml [INFO] Installing /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar [DEBUG] Skipped re-installing /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar, seems unchanged {code} The reason is below: {code:java} [DEBUG] (f) artifact = org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT [DEBUG] (f) attachedArtifacts = [org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, 
org.apache.spark:spark -streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0 -mdh2.1.0.1-SNAPSHOT] {code} when executing 'mvn deploy' to nexus during release.I will fail since release nexus can not be override. was: For moudle below: common/network-common streaming sql/core sql/catalyst tests.jar will install or deploy twice.Like: {code:java} [DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml [INFO] Installing /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar [DEBUG] Skipped re-installing /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar, seems unchanged {code} The reason is below: {code:java} [DEBUG] (f) artifact = org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT [DEBUG] (f) attachedArtifacts = [org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark -streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0 -mdh2.1.0.1-SNAPSHOT] {code} when executing 'mvn deploy' to nexus during release.I will fail since release nexus can not be override. 
> Test jar of some module should not install or deploy twice > -- > > Key: SPARK-21544 > URL: https://issues.apache.org/jira/browse/SPARK-21544 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.6.1, 2.1.0 >Reporter: zhoukang > > For moudle below: > common/network-common > streaming > sql/core > sql/catalyst > tests.jar will install or deploy twice.Like: > {code:java} > [INFO] Installing > /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar > to > /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar > [DEBUG] Writing tracking file >
[jira] [Resolved] (SPARK-21545) pyspark2
[ https://issues.apache.org/jira/browse/SPARK-21545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21545. --- Resolution: Invalid Fix Version/s: (was: 2.2.0) Target Version/s: (was: 2.2.0) Please read http://spark.apache.org/contributing.html This is not the place for questions. > pyspark2 > > > Key: SPARK-21545 > URL: https://issues.apache.org/jira/browse/SPARK-21545 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 > Environment: Spark 2.2 with CDH 5.12, Python 3.6.1, JDK 1.8_b31. >Reporter: gumpcheng > Labels: cdh, spark2.2 > > I installed Spark 2.2 on CDH 5.12 following the official steps. > Everything looks okay in Cloudera Manager, > but I failed to initialize pyspark2. > My environment: Python 3.6.1, JDK 1.8, CDH 5.12. > The problem has been driving me crazy for several days, > and I have found no way to solve it. > Can anyone help me? > Thank you! > [hdfs@Master /data/soft/spark2.2]$ pyspark2 > Python 3.6.1 (default, Jul 27 2017, 11:07:01) > [GCC 4.4.6 20110731 (Red Hat 4.4.6-4)] on linux > Type "help", "copyright", "credits" or "license" for more information. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/07/27 12:02:09 ERROR spark.SparkContext: Error initializing SparkContext. > org.apache.spark.SparkException: Yarn application has already ended! It might > have been killed or unable to launch application master. 
> at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173) > at org.apache.spark.SparkContext.(SparkContext.scala:509) > at > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:236) > at > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) > at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:748) > 17/07/27 12:02:09 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: > Attempted to request executors before the AM has registered! 
> 17/07/27 12:02:09 ERROR util.Utils: Uncaught exception in thread Thread-2 > java.lang.NullPointerException > at > org.apache.spark.network.shuffle.ExternalShuffleClient.close(ExternalShuffleClient.java:141) > at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1485) > at org.apache.spark.SparkEnv.stop(SparkEnv.scala:90) > at > org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1937) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1936) > at org.apache.spark.SparkContext.(SparkContext.scala:587) > at > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:236) > at > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) > at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:748) > /opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/python/pyspark/shell.py:52: > UserWarning: Fall back to non-hive support because failing to access > HiveConf, please make
[jira] [Commented] (SPARK-21544) Test jar of some module should not install or deploy twice
[ https://issues.apache.org/jira/browse/SPARK-21544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102856#comment-16102856 ] Sean Owen commented on SPARK-21544: --- What is the problem? I don't see two installations from the log you create, and not clear what the effect would be. > Test jar of some module should not install or deploy twice > -- > > Key: SPARK-21544 > URL: https://issues.apache.org/jira/browse/SPARK-21544 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.6.1, 2.1.0 >Reporter: zhoukang > > For moudle below: > common/network-common > streaming > sql/core > sql/catalyst > tests.jar will install or deploy twice.Like: > {code:java} > [DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml > to > /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml > [INFO] Installing > /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar > to > /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar > [DEBUG] Skipped re-installing > /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar > to > /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar, > seems unchanged > {code} > The reason is below: > {code:java} > [DEBUG] (f) artifact = > org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT > [DEBUG] (f) attachedArtifacts = > [org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, > org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, > org.apache.spark:spark > -streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, > org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT, 
> org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0 > -mdh2.1.0.1-SNAPSHOT] > {code} > When executing 'mvn deploy' to Nexus during a release, the build will fail, since artifacts in a Nexus release repository cannot be overwritten. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
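The log above shows the same tests jar attached twice with identical coordinates and classifier "tests" but different types (test-jar vs jar), so the install and deploy plugins process it twice. As a rough illustration of why deduplicating the attached-artifact list resolves this, here is a toy Python sketch; the function and data names are hypothetical and this is not the actual change made in the Spark build:

```python
# Attached artifacts as (artifactId, type, classifier), mirroring the
# [DEBUG] attachedArtifacts output above. The first two entries are the
# duplicate pair: same artifactId and classifier, different type.
attached = [
    ("spark-streaming_2.11", "test-jar", "tests"),
    ("spark-streaming_2.11", "jar", "tests"),
    ("spark-streaming_2.11", "java-source", "sources"),
    ("spark-streaming_2.11", "java-source", "test-sources"),
    ("spark-streaming_2.11", "javadoc", "javadoc"),
]

def dedupe_attached(artifacts):
    """Keep only the first artifact seen per (artifactId, classifier) pair."""
    seen = set()
    kept = []
    for artifact_id, a_type, classifier in artifacts:
        key = (artifact_id, classifier)
        if key not in seen:
            seen.add(key)
            kept.append((artifact_id, a_type, classifier))
    return kept
```

With the duplicate removed, only one tests jar per module would be installed or deployed, avoiding the re-deploy rejection on a release Nexus.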
[jira] [Commented] (SPARK-21543) Should not count executor initialize failed towards task failures
[ https://issues.apache.org/jira/browse/SPARK-21543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102854#comment-16102854 ] Sean Owen commented on SPARK-21543: --- Why shouldn't it? > Should not count executor initialize failed towards task failures > - > > Key: SPARK-21543 > URL: https://issues.apache.org/jira/browse/SPARK-21543 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.1, 2.1.0 >Reporter: zhoukang > > Currently, when an executor fails to initialize and exits with error code 1, the failure counts toward task failures. I think executor initialization failures should not be counted as task failures.
[jira] [Assigned] (SPARK-21271) UnsafeRow.hashCode assertion when sizeInBytes not multiple of 8
[ https://issues.apache.org/jira/browse/SPARK-21271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21271: --- Assignee: Kazuaki Ishizaki > UnsafeRow.hashCode assertion when sizeInBytes not multiple of 8 > --- > > Key: SPARK-21271 > URL: https://issues.apache.org/jira/browse/SPARK-21271 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Bogdan Raducanu >Assignee: Kazuaki Ishizaki > Fix For: 2.3.0 > > > The method is: > {code} > public int hashCode() { > return Murmur3_x86_32.hashUnsafeWords(baseObject, baseOffset, sizeInBytes, 42); > } > {code} > but sizeInBytes is not always a multiple of 8 (in which case hashUnsafeWords > throws an assertion error), for example in: > {code}FixedLengthRowBasedKeyValueBatch.appendRow{code} > The fix could be to use hashUnsafeBytes, or to use hashUnsafeWords on a > prefix whose length is a multiple of 8.
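To make the two proposed fixes concrete, here is a minimal Python sketch, not Spark's Murmur3_x86_32, of a word-oriented hash that, like hashUnsafeWords, asserts its input length is a multiple of 8, alongside a byte-oriented fallback and a combined variant that hashes the aligned prefix word-by-word and folds in the trailing bytes. All function names and the hash mixing itself are hypothetical:

```python
import struct

def hash_words(data: bytes, seed: int = 42) -> int:
    # Word-oriented hash: requires 8-byte alignment, like hashUnsafeWords.
    assert len(data) % 8 == 0, "length must be a multiple of 8"
    h = seed
    for (word,) in struct.iter_unpack("<q", data):
        h = (h * 31 + word) & 0xFFFFFFFF
    return h

def hash_bytes(data: bytes, seed: int = 42) -> int:
    # Byte-oriented hash: no alignment requirement (first proposed fix).
    h = seed
    for b in data:
        h = (h * 31 + b) & 0xFFFFFFFF
    return h

def hash_row(data: bytes, seed: int = 42) -> int:
    # Second proposed fix: hash the largest 8-byte-aligned prefix word by
    # word, then mix in the unaligned tail byte by byte.
    aligned = len(data) - (len(data) % 8)
    h = hash_words(data[:aligned], seed)
    return hash_bytes(data[aligned:], h)
```

For an aligned input, hash_row degenerates to hash_words, so rows whose sizeInBytes is already a multiple of 8 hash identically under both paths; only the previously-asserting unaligned case gains new behavior.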
[jira] [Resolved] (SPARK-21271) UnsafeRow.hashCode assertion when sizeInBytes not multiple of 8
[ https://issues.apache.org/jira/browse/SPARK-21271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21271. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18503 [https://github.com/apache/spark/pull/18503] > UnsafeRow.hashCode assertion when sizeInBytes not multiple of 8 > --- > > Key: SPARK-21271 > URL: https://issues.apache.org/jira/browse/SPARK-21271 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Bogdan Raducanu > Fix For: 2.3.0 > > > The method is: > {code} > public int hashCode() { > return Murmur3_x86_32.hashUnsafeWords(baseObject, baseOffset, sizeInBytes, 42); > } > {code} > but sizeInBytes is not always a multiple of 8 (in which case hashUnsafeWords > throws an assertion error), for example in: > {code}FixedLengthRowBasedKeyValueBatch.appendRow{code} > The fix could be to use hashUnsafeBytes, or to use hashUnsafeWords on a > prefix whose length is a multiple of 8.
[jira] [Commented] (SPARK-19511) insert into table does not work on second session of beeline
[ https://issues.apache.org/jira/browse/SPARK-19511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102804#comment-16102804 ] xinzhang commented on SPARK-19511: -- [~chenerlu] Hi, it always appears. In which scenario does it not appear? > insert into table does not work on second session of beeline > > > Key: SPARK-19511 > URL: https://issues.apache.org/jira/browse/SPARK-19511 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Centos 7.2, java 1.7.0_91 >Reporter: sanjiv marathe > > Same issue as SPARK-11083... reopen? > "insert into table" works in the first beeline session and fails in the > second. Every time, I had to restart the Thrift server and reconnect to get it working.
[jira] [Commented] (SPARK-21538) Attribute resolution inconsistency in Dataset API
[ https://issues.apache.org/jira/browse/SPARK-21538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102782#comment-16102782 ] Xiao Li commented on SPARK-21538: - https://github.com/apache/spark/pull/18740 > Attribute resolution inconsistency in Dataset API > - > > Key: SPARK-21538 > URL: https://issues.apache.org/jira/browse/SPARK-21538 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Adrian Ionescu > > {code} > spark.range(1).withColumnRenamed("id", "x").sort(col("id")) // works > spark.range(1).withColumnRenamed("id", "x").sort($"id") // works > spark.range(1).withColumnRenamed("id", "x").sort('id) // works > spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with: > org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among > (x); > ... > {code} > It looks like the Dataset API functions taking {{String}} use the basic > resolver, which only looks at the columns at that level, whereas all the other > ways of expressing an attribute are lazily resolved by the analyzer. > The reason why the first 3 calls work is explained in the docs for {{object > ResolveMissingReferences}}: > {code} > /** >* In many dialects of SQL it is valid to sort by attributes that are not > present in the SELECT >* clause. This rule detects such queries and adds the required attributes > to the original >* projection, so that they will be available during sorting. Another > projection is added to >* remove these attributes after sorting. >* >* The HAVING clause could also use grouping columns that are not > present in the SELECT. >*/ > {code} > For consistency, it would be good to use the same attribute resolution > mechanism everywhere.
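The eager-vs-lazy distinction described above can be modeled with a toy plan tree. This Python sketch is not Spark's analyzer; the names (Plan, sort_by_string, sort_by_expr) are hypothetical, and the string path mirrors the behavior of the Dataset overloads that take a String while the expression path mimics how ResolveMissingReferences can pull a missing sort attribute from a child plan:

```python
class Plan:
    def __init__(self, columns, child=None):
        self.columns = columns  # output schema at this level
        self.child = child      # upstream plan node, if any

def sort_by_string(plan, name):
    # Eager, string-based resolution: consult only the current schema.
    if name not in plan.columns:
        raise ValueError(
            f'Cannot resolve column name "{name}" among ({", ".join(plan.columns)})')
    return Plan(plan.columns, child=plan)

def sort_by_expr(plan, name):
    # Lazy, expression-based resolution: the analyzer may also search
    # child plans for the attribute before failing.
    node = plan
    while node is not None:
        if name in node.columns:
            return Plan(plan.columns, child=plan)
        node = node.child
    raise ValueError(f'Cannot resolve column name "{name}"')

# range(1).withColumnRenamed("id", "x") leaves "id" visible one level down:
renamed = Plan(["x"], child=Plan(["id"]))
```

In this model, sort_by_expr(renamed, "id") succeeds because "id" exists in the child plan, while sort_by_string(renamed, "id") fails immediately, reproducing the inconsistency in the ticket.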
[jira] [Commented] (SPARK-11083) insert overwrite table failed when beeline reconnect
[ https://issues.apache.org/jira/browse/SPARK-11083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102775#comment-16102775 ] xinzhang commented on SPARK-11083: -- Reappeared in Spark 2.1.0. Is anyone working on this issue? > insert overwrite table failed when beeline reconnect > > > Key: SPARK-11083 > URL: https://issues.apache.org/jira/browse/SPARK-11083 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: Spark: master branch > Hadoop: 2.7.1 > JDK: 1.8.0_60 >Reporter: Weizhong >Assignee: Davies Liu > > 1. Start the Thrift server > 2. Use beeline to connect to the Thrift server, then execute an "insert overwrite > table_name ..." clause -- success > 3. Exit beeline > 4. Reconnect to the Thrift server, then execute the "insert overwrite table_name > ..." clause again -- failed > {noformat} > 15/10/13 18:44:35 ERROR SparkExecuteStatementOperation: Error executing > query, currentState RUNNING, > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.sql.hive.client.Shim_v1_2.loadDynamicPartitions(HiveShim.scala:520) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(ClientWrapper.scala:506) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256) > at > org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211) > at > 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248) > at > org.apache.spark.sql.hive.client.ClientWrapper.loadDynamicPartitions(ClientWrapper.scala:505) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:225) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:58) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:58) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:144) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:129) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:739) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:224) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at 
java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move > source > hdfs://9.91.8.214:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_returns/.hive-staging_hive_2015-10-13_18-44-17_606_2400736035447406540-2/-ext-1/cr_returned_date=2003-08-27/part-00048 > to destination >
[jira] [Commented] (SPARK-21543) Should not count executor initialize failed towards task failures
[ https://issues.apache.org/jira/browse/SPARK-21543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102767#comment-16102767 ] zhoukang commented on SPARK-21543: -- I have created a PR: https://github.com/apache/spark/pull/18743 > Should not count executor initialize failed towards task failures > - > > Key: SPARK-21543 > URL: https://issues.apache.org/jira/browse/SPARK-21543 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.1, 2.1.0 >Reporter: zhoukang > > Currently, when an executor fails to initialize and exits with error code 1, the failure counts toward task failures. I think executor initialization failures should not be counted as task failures.
[jira] [Commented] (SPARK-21544) Test jar of some module should not install or deploy twice
[ https://issues.apache.org/jira/browse/SPARK-21544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102766#comment-16102766 ] zhoukang commented on SPARK-21544: -- I have created a PR: https://github.com/apache/spark/pull/18745 > Test jar of some module should not install or deploy twice > -- > > Key: SPARK-21544 > URL: https://issues.apache.org/jira/browse/SPARK-21544 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.6.1, 2.1.0 >Reporter: zhoukang > > For the modules below: > common/network-common > streaming > sql/core > sql/catalyst > the tests jar will be installed or deployed twice, like: > {code:java} > [DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml > to > /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml > [INFO] Installing > /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar > to > /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar > [DEBUG] Skipped re-installing > /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar > to > /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar, > seems unchanged > {code} > The reason is as follows: > {code:java} > [DEBUG] (f) artifact = > org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT > [DEBUG] (f) attachedArtifacts = > [org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, > org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, > org.apache.spark:spark > -streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, > org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT, > 
org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0 > -mdh2.1.0.1-SNAPSHOT] > {code} > When executing 'mvn deploy' to Nexus during a release, the build will fail, since artifacts in a Nexus release repository cannot be overwritten.