[jira] [Updated] (SPARK-25602) SparkPlan.getByteArrayRdd should not consume the input when not necessary
[ https://issues.apache.org/jira/browse/SPARK-25602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-25602:
--------------------------------
    Summary: SparkPlan.getByteArrayRdd should not consume the input when not necessary
    (was: range metrics can be wrong if the result rows are not fully consumed)

> SparkPlan.getByteArrayRdd should not consume the input when not necessary
> -------------------------------------------------------------------------
>
>                 Key: SPARK-25602
>                 URL: https://issues.apache.org/jira/browse/SPARK-25602
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25601) Register Grouped aggregate UDF Vectorized UDFs for SQL Statement
[ https://issues.apache.org/jira/browse/SPARK-25601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-25601.
----------------------------------
       Resolution: Fixed
         Assignee: Hyukjin Kwon
    Fix Version/s: 3.0.0
                   2.4.0

Fixed in https://github.com/apache/spark/pull/22620

> Register Grouped aggregate UDF Vectorized UDFs for SQL Statement
> ----------------------------------------------------------------
>
>                 Key: SPARK-25601
>                 URL: https://issues.apache.org/jira/browse/SPARK-25601
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 2.4.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Major
>             Fix For: 2.4.0, 3.0.0
>
> Make it possible to register grouped aggregate UDFs and then use them in a SQL statement.
> For example,
> {code}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
>
> @pandas_udf("integer", PandasUDFType.GROUPED_AGG)  # doctest: +SKIP
> def sum_udf(v):
>     return v.sum()
>
> spark.udf.register("sum_udf", sum_udf)  # doctest: +SKIP
> q = "SELECT sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2"
> spark.sql(q).show()
>
> +-----------+
> |sum_udf(v1)|
> +-----------+
> |          1|
> |          5|
> +-----------+
> {code}
[jira] [Comment Edited] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None
[ https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637657#comment-16637657 ]

Chongyuan Xiang edited comment on SPARK-25461 at 10/4/18 12:37 AM:
-------------------------------------------------------------------
Hi all, thanks for looking into the issue! As a follow-up, I noticed that there were similar issues with casting to float as well. Just reusing my example and changing the return type to be FloatType:

Script:
{code:java}
import pandas as pd
import random
import pyspark
from pyspark.sql.functions import col, lit, pandas_udf

values = [None] * 3 + [1.0] * 17 + [2.0] * 600
random.shuffle(values)
pdf = pd.DataFrame({'A': values})
df = spark.createDataFrame(pdf)

@pandas_udf(returnType=pyspark.sql.types.FloatType())
def gt_2(column):
    return (column >= 2).where(column.notnull())

calculated_df = (df.select(['A'])
                 .withColumn('potential_bad_col', gt_2('A'))
                 )
calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) | (col("A").isNull()))
calculated_df.filter(col("A") == 2).show(30)
{code}

Output:
{code:java}
+---+-----------------+-----------+
|  A|potential_bad_col|correct_col|
+---+-----------------+-----------+
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
+---+-----------------+-----------+
{code}

was (Author: xiangcy):
Hi all, thanks for looking into the issue! As a follow up, I noticed that there were similar issues with casting to float as well. Just reusing my example and changing the return type to be `FloatType`:

Script:
{code:java}
import pandas as pd
import random
import pyspark
from pyspark.sql.functions import col, lit, pandas_udf

values = [None] * 3 + [1.0] * 17 + [2.0] * 600
random.shuffle(values)
pdf = pd.DataFrame({'A': values})
df = spark.createDataFrame(pdf)

@pandas_udf(returnType=pyspark.sql.types.FloatType())
def gt_2(column):
    return (column >= 2).where(column.notnull())

calculated_df = (df.select(['A'])
                 .withColumn('potential_bad_col', gt_2('A'))
                 )
calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) | (col("A").isNull()))
calculated_df.filter(col("A") == 2).show(30)
{code}

Output:
{code:java}
+---+-----------------+-----------+
|  A|potential_bad_col|correct_col|
+---+-----------------+-----------+
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
+---+-----------------+-----------+
{code}

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25461
>                 URL: https://issues.apache.org/jira/browse/SPARK-25461
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>         Environment: I reproduced this issue by running pyspark locally on mac:
>                      Spark version: 2.3.1 pre-built with Hadoop 2.7
>                      Python library versions: pyarrow==0.10.0, pandas==0.20.2
>            Reporter: Chongyuan Xiang
>            Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column given column 'A'. When column 'A' contains None, the results look incorrect.
> Script:
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
>
> values = [None] * 3 + [1.0] * 17 + [2.0] * 600
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
>
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
>     return (column >= 2).where(column.notnull())
>
> calculated_df = (df.select(['A'])
>                  .withColumn('potential_bad_col', gt_2('A'))
>                  )
> calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) | (col("A").isNull()))
> calculated_df.show()
> {code}
> Output:
> {code:java}
> +---+-----------------+-----------+
> |  A|potential_bad_col|correct_col|
> +---+-----------------+-----------+
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |1.0|            false|      false|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> +---+-----------------+-----------+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input column does not contain None.
[jira] [Commented] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None
[ https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637657#comment-16637657 ]

Chongyuan Xiang commented on SPARK-25461:
-----------------------------------------
Hi all, thanks for looking into the issue! As a follow-up, I noticed that there were similar issues with casting to float as well. Just reusing my example and changing the return type to be `FloatType`:

Script:
{code:java}
import pandas as pd
import random
import pyspark
from pyspark.sql.functions import col, lit, pandas_udf

values = [None] * 3 + [1.0] * 17 + [2.0] * 600
random.shuffle(values)
pdf = pd.DataFrame({'A': values})
df = spark.createDataFrame(pdf)

@pandas_udf(returnType=pyspark.sql.types.FloatType())
def gt_2(column):
    return (column >= 2).where(column.notnull())

calculated_df = (df.select(['A'])
                 .withColumn('potential_bad_col', gt_2('A'))
                 )
calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) | (col("A").isNull()))
calculated_df.filter(col("A") == 2).show(30)
{code}

Output:
{code:java}
+---+-----------------+-----------+
|  A|potential_bad_col|correct_col|
+---+-----------------+-----------+
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
+---+-----------------+-----------+
{code}

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25461
>                 URL: https://issues.apache.org/jira/browse/SPARK-25461
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>         Environment: I reproduced this issue by running pyspark locally on mac:
>                      Spark version: 2.3.1 pre-built with Hadoop 2.7
>                      Python library versions: pyarrow==0.10.0, pandas==0.20.2
>            Reporter: Chongyuan Xiang
>            Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column given column 'A'. When column 'A' contains None, the results look incorrect.
> Script:
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
>
> values = [None] * 3 + [1.0] * 17 + [2.0] * 600
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
>
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
>     return (column >= 2).where(column.notnull())
>
> calculated_df = (df.select(['A'])
>                  .withColumn('potential_bad_col', gt_2('A'))
>                  )
> calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) | (col("A").isNull()))
> calculated_df.show()
> {code}
> Output:
> {code:java}
> +---+-----------------+-----------+
> |  A|potential_bad_col|correct_col|
> +---+-----------------+-----------+
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |1.0|            false|      false|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> +---+-----------------+-----------+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input column does not contain None.
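The bug report above hinges on what the UDF body hands back to Spark. A pandas-only sketch (no Spark or Arrow needed; the variable names are illustrative, not from Spark's code) of what `(column >= 2).where(column.notnull())` actually produces:

```python
import pandas as pd

# The masked comparison yields an object-dtype Series mixing Python
# booleans and NaN, rather than a plain bool Series. Converting such
# a column through Arrow is where the reported miscasts can creep in.
column = pd.Series([None, 1.0, 2.0])
masked = (column >= 2).where(column.notnull())

print(masked.dtype)  # object, not bool
print(masked.tolist())
```

The `.where(column.notnull())` call re-inserts NaN for the null inputs, which forces the result out of the `bool` dtype; Spark then has to cast that object column back to the declared BooleanType (or FloatType) on the JVM side.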
[jira] [Comment Edited] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None
[ https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637626#comment-16637626 ]

Bryan Cutler edited comment on SPARK-25461 at 10/3/18 11:53 PM:
----------------------------------------------------------------
I filed ARROW-3428, which deals with the incorrect cast from float to bool

was (Author: bryanc):
I file ARROW-3428, which deals with the incorrect cast from float to bool

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25461
>                 URL: https://issues.apache.org/jira/browse/SPARK-25461
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>         Environment: I reproduced this issue by running pyspark locally on mac:
>                      Spark version: 2.3.1 pre-built with Hadoop 2.7
>                      Python library versions: pyarrow==0.10.0, pandas==0.20.2
>            Reporter: Chongyuan Xiang
>            Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column given column 'A'. When column 'A' contains None, the results look incorrect.
> Script:
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
>
> values = [None] * 3 + [1.0] * 17 + [2.0] * 600
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
>
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
>     return (column >= 2).where(column.notnull())
>
> calculated_df = (df.select(['A'])
>                  .withColumn('potential_bad_col', gt_2('A'))
>                  )
> calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) | (col("A").isNull()))
> calculated_df.show()
> {code}
> Output:
> {code:java}
> +---+-----------------+-----------+
> |  A|potential_bad_col|correct_col|
> +---+-----------------+-----------+
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |1.0|            false|      false|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> +---+-----------------+-----------+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input column does not contain None.
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637630#comment-16637630 ]

Steven Rand commented on SPARK-25538:
-------------------------------------
Thanks all!

> incorrect row counts after distinct()
> -------------------------------------
>
>                 Key: SPARK-25538
>                 URL: https://issues.apache.org/jira/browse/SPARK-25538
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>         Environment: Reproduced on a Centos7 VM and from source in Intellij on OS X.
>            Reporter: Steven Rand
>            Assignee: Marco Gaido
>            Priority: Blocker
>              Labels: correctness
>             Fix For: 2.4.0
>
>         Attachments: SPARK-25538-repro.tgz
>
> It appears that {{df.distinct.count}} can return incorrect values after SPARK-23713. It's possible that other operations are affected as well; {{distinct}} just happens to be the one that we noticed. I believe that this issue was introduced by SPARK-23713 because I can't reproduce it until that commit, and I've been able to reproduce it after that commit as well as with {{tags/v2.4.0-rc1}}.
> Below are example spark-shell sessions to illustrate the problem. Unfortunately the data used in these examples can't be uploaded to this Jira ticket. I'll try to create test data which also reproduces the issue, and will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
>
> scala> df.count
> res0: Long = 123
>
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
>
> scala> df.count
> res0: Long = 123
>
> scala> df.distinct.count
> res1: Long = 116
>
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
>
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}
[jira] [Commented] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None
[ https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637626#comment-16637626 ]

Bryan Cutler commented on SPARK-25461:
--------------------------------------
I filed ARROW-3428, which deals with the incorrect cast from float to bool

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25461
>                 URL: https://issues.apache.org/jira/browse/SPARK-25461
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>         Environment: I reproduced this issue by running pyspark locally on mac:
>                      Spark version: 2.3.1 pre-built with Hadoop 2.7
>                      Python library versions: pyarrow==0.10.0, pandas==0.20.2
>            Reporter: Chongyuan Xiang
>            Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column given column 'A'. When column 'A' contains None, the results look incorrect.
> Script:
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
>
> values = [None] * 3 + [1.0] * 17 + [2.0] * 600
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
>
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
>     return (column >= 2).where(column.notnull())
>
> calculated_df = (df.select(['A'])
>                  .withColumn('potential_bad_col', gt_2('A'))
>                  )
> calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) | (col("A").isNull()))
> calculated_df.show()
> {code}
> Output:
> {code:java}
> +---+-----------------+-----------+
> |  A|potential_bad_col|correct_col|
> +---+-----------------+-----------+
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |1.0|            false|      false|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> +---+-----------------+-----------+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input column does not contain None.
[jira] [Updated] (SPARK-25586) toString method of GeneralizedLinearRegressionTrainingSummary runs in infinite loop throwing StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-25586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin updated SPARK-25586:
-----------------------------------
    Issue Type: Bug  (was: Improvement)

> toString method of GeneralizedLinearRegressionTrainingSummary runs in infinite loop throwing StackOverflowError
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-25586
>                 URL: https://issues.apache.org/jira/browse/SPARK-25586
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, Spark Core
>    Affects Versions: 2.3.0
>            Reporter: Ankur Gupta
>            Assignee: Ankur Gupta
>            Priority: Minor
>             Fix For: 3.0.0
>
> After the change in SPARK-25118, which enables spark-shell to run with default log level, test_glr_summary started failing with a StackOverflow error.
> Cause: ClosureCleaner calls logDebug on various objects, and when it is called for GeneralizedLinearRegressionTrainingSummary, it starts a Spark job which runs into an infinite loop and fails with the below exception.
> {code}
> ======================================================================
> ERROR: test_glr_summary (pyspark.ml.tests.TrainingSummaryTest)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests.py", line 1809, in test_glr_summary
>     self.assertTrue(isinstance(s.aic, float))
>   File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/regression.py", line 1781, in aic
>     return self._call_java("aic")
>   File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/wrapper.py", line 55, in _call_java
>     return _java2py(sc, m(*java_args))
>   File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
>     answer, self.gateway_client, self.target_id, self.name)
>   File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py", line 63, in deco
>     return f(*a, **kw)
>   File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
>     format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o31639.aic.
> : java.lang.StackOverflowError
> 	at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
> 	at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
> 	at java.io.File.exists(File.java:819)
> 	at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1245)
> 	at sun.misc.URLClassPath$FileLoader.findResource(URLClassPath.java:1212)
> 	at sun.misc.URLClassPath.findResource(URLClassPath.java:188)
> 	at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
> 	at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
> 	at java.lang.ClassLoader.getResource(ClassLoader.java:1093)
> 	at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
> 	at java.lang.Class.getResourceAsStream(Class.java:2223)
> 	at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:43)
> 	at org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:87)
> 	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:269)
> 	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
> 	at org.apache.spark.SparkContext.clean(SparkContext.scala:2342)
> 	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:864)
> 	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:863)
> 	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> 	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> 	at org.apache.spark.rdd.RDD.withScope(RDD.scala:364)
> 	at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:863)
> 	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:613)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> 	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> {code}
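The failure mode described above — a debug log renders an object, and rendering it triggers the very computation that logs again — can be illustrated with a loose Python analogy. This is not Spark's actual code path; the class and function names below are invented for illustration:

```python
# Toy model of a toString/logging cycle: rendering the summary triggers a
# computation, and a debug hook inside the computation renders the summary
# again, recursing until the interpreter's stack limit is hit.
class TrainingSummary:
    def _compute_metric(self):
        # Stand-in for a framework hook that logs `self` during computation,
        # analogous to ClosureCleaner calling logDebug on the summary object.
        debug_log(self)
        return 42.0

    def __repr__(self):
        # toString-style method whose rendering requires a computation.
        return f"TrainingSummary(aic={self._compute_metric()})"

def debug_log(obj):
    repr(obj)  # formatting the log message re-enters __repr__

try:
    repr(TrainingSummary())
except RecursionError:
    print("unbounded toString recursion")
```

The fix direction in such cases is to break the cycle: either make the string representation cheap (no job submission) or keep logging from formatting objects whose rendering is expensive.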
[jira] [Assigned] (SPARK-25637) SparkException: Could not find CoarseGrainedScheduler occurs during the application stop
[ https://issues.apache.org/jira/browse/SPARK-25637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25637:
------------------------------------
    Assignee: (was: Apache Spark)

> SparkException: Could not find CoarseGrainedScheduler occurs during the application stop
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-25637
>                 URL: https://issues.apache.org/jira/browse/SPARK-25637
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.3.2
>            Reporter: Devaraj K
>            Priority: Minor
>
> {code:xml}
> 2018-10-03 14:51:33 ERROR Inbox:91 - Ignoring error
> org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
> 	at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
> 	at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:140)
> 	at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:187)
> 	at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:528)
> 	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:449)
> 	at org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:638)
> 	at org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:201)
> 	at org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:197)
> 	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> 	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> 	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> 	at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> 	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> 	at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
> 	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> 	at org.apache.spark.HeartbeatReceiver.org$apache$spark$HeartbeatReceiver$$expireDeadHosts(HeartbeatReceiver.scala:197)
> 	at org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1.applyOrElse(HeartbeatReceiver.scala:120)
> 	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:105)
> 	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
> 	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
> 	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
> SPARK-14228 fixed these kinds of errors, but this still occurs while performing reviveOffers.
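The stack trace above is a shutdown race: a heartbeat-expiry callback fires after stop() has already unregistered the scheduler endpoint, so the message post fails. A hypothetical toy model of that race (class and method names invented here, not Spark's RPC API):

```python
# Minimal sketch of posting to a dispatcher after its endpoint registry
# has been cleared during application stop, mirroring the
# expireDeadHosts -> reviveOffers timing in the report.
class Dispatcher:
    def __init__(self):
        self.endpoints = {"CoarseGrainedScheduler": object()}

    def post_message(self, name, msg):
        if name not in self.endpoints:
            raise RuntimeError(f"Could not find {name}.")

    def stop(self):
        self.endpoints.clear()  # stop() removes all registered endpoints

dispatcher = Dispatcher()
dispatcher.stop()
try:
    # A late timer callback still tries to reach the scheduler endpoint.
    dispatcher.post_message("CoarseGrainedScheduler", "ReviveOffers")
except RuntimeError as e:
    print(e)
```

Since the send happens on a timer thread that outlives the endpoint, a fix along the lines of SPARK-14228 typically either cancels the timer before unregistering or downgrades the missing-endpoint case to a benign log during shutdown.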
[jira] [Assigned] (SPARK-25637) SparkException: Could not find CoarseGrainedScheduler occurs during the application stop
[ https://issues.apache.org/jira/browse/SPARK-25637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25637:
------------------------------------
    Assignee: Apache Spark

> SparkException: Could not find CoarseGrainedScheduler occurs during the application stop
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-25637
>                 URL: https://issues.apache.org/jira/browse/SPARK-25637
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.3.2
>            Reporter: Devaraj K
>            Assignee: Apache Spark
>            Priority: Minor
>
> {code:xml}
> 2018-10-03 14:51:33 ERROR Inbox:91 - Ignoring error
> org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
> 	at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
> 	at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:140)
> 	at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:187)
> 	at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:528)
> 	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:449)
> 	at org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:638)
> 	at org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:201)
> 	at org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:197)
> 	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> 	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> 	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> 	at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> 	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> 	at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
> 	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> 	at org.apache.spark.HeartbeatReceiver.org$apache$spark$HeartbeatReceiver$$expireDeadHosts(HeartbeatReceiver.scala:197)
> 	at org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1.applyOrElse(HeartbeatReceiver.scala:120)
> 	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:105)
> 	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
> 	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
> 	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
> SPARK-14228 fixed these kinds of errors, but this still occurs while performing reviveOffers.
[jira] [Commented] (SPARK-25637) SparkException: Could not find CoarseGrainedScheduler occurs during the application stop
[ https://issues.apache.org/jira/browse/SPARK-25637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637604#comment-16637604 ]

Apache Spark commented on SPARK-25637:
--------------------------------------
User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/22625

> SparkException: Could not find CoarseGrainedScheduler occurs during the application stop
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-25637
>                 URL: https://issues.apache.org/jira/browse/SPARK-25637
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.3.2
>            Reporter: Devaraj K
>            Priority: Minor
>
> {code:xml}
> 2018-10-03 14:51:33 ERROR Inbox:91 - Ignoring error
> org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
> 	at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
> 	at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:140)
> 	at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:187)
> 	at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:528)
> 	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:449)
> 	at org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:638)
> 	at org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:201)
> 	at org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:197)
> 	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> 	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> 	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> 	at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> 	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> 	at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
> 	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> 	at org.apache.spark.HeartbeatReceiver.org$apache$spark$HeartbeatReceiver$$expireDeadHosts(HeartbeatReceiver.scala:197)
> 	at org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1.applyOrElse(HeartbeatReceiver.scala:120)
> 	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:105)
> 	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
> 	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
> 	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
> SPARK-14228 fixed these kinds of errors, but this still occurs while performing reviveOffers.
[jira] [Commented] (SPARK-25586) toString method of GeneralizedLinearRegressionTrainingSummary runs in infinite loop throwing StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-25586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637605#comment-16637605 ] Marcelo Vanzin commented on SPARK-25586: bq. This is not a bug Actually it's a bug if you set your log level to DEBUG and happen to be using that class... regardless of the other change. > toString method of GeneralizedLinearRegressionTrainingSummary runs in > infinite loop throwing StackOverflowError > --- > > Key: SPARK-25586 > URL: https://issues.apache.org/jira/browse/SPARK-25586 > Project: Spark > Issue Type: Improvement > Components: MLlib, Spark Core >Affects Versions: 2.3.0 >Reporter: Ankur Gupta >Assignee: Ankur Gupta >Priority: Minor > Fix For: 3.0.0 > > > After the change in SPARK-25118, which enables spark-shell to run with > default log level, test_glr_summary started failing with StackOverflow error. > Cause: ClosureCleaner calls logDebug on various objects and when it is called > for GeneralizedLinearRegressionTrainingSummary, it starts a spark job which > runs into infinite loop and fails with the below exception. 
> {code} > == > ERROR: test_glr_summary (pyspark.ml.tests.TrainingSummaryTest) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests.py", > line 1809, in test_glr_summary > self.assertTrue(isinstance(s.aic, float)) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/regression.py", > line 1781, in aic > return self._call_java("aic") > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/wrapper.py", > line 55, in _call_java > return _java2py(sc, m(*java_args)) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o31639.aic. 
> : java.lang.StackOverflowError > at java.io.UnixFileSystem.getBooleanAttributes0(Native Method) > at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242) > at java.io.File.exists(File.java:819) > at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1245) > at sun.misc.URLClassPath$FileLoader.findResource(URLClassPath.java:1212) > at sun.misc.URLClassPath.findResource(URLClassPath.java:188) > at java.net.URLClassLoader$2.run(URLClassLoader.java:569) > at java.net.URLClassLoader$2.run(URLClassLoader.java:567) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findResource(URLClassLoader.java:566) > at java.lang.ClassLoader.getResource(ClassLoader.java:1093) > at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232) > at java.lang.Class.getResourceAsStream(Class.java:2223) > at > org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:43) > at > org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:87) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:269) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2342) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:864) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:863) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:364) > at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:863) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:613) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at >
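The recursion described above is easy to reproduce without Spark. Below is a minimal, Spark-free Python sketch (all names are illustrative, not Spark's actual API) of the failure mode: the object's string form launches work, the work debug-logs the object, and building the log message re-enters the string form until the stack overflows. The fix analogue keeps the string form cheap and side-effect free.

```python
def debug_log(make_msg, enabled=True):
    # Toy stand-in for logDebug: when debug logging is enabled, the
    # message is actually built, which forces the object's repr/toString.
    if enabled:
        return make_msg()
    return None

class Summary:
    """Toy stand-in for a training summary whose string form kicks off
    work. The "job" debug-logs the captured object, re-entering
    __repr__ forever."""
    def _run_job(self):
        # ClosureCleaner analogue: debug-log the object being cleaned
        debug_log(lambda: "cleaning closure over %r" % self)

    def __repr__(self):
        self._run_job()  # computing the string launches work: the bug
        return "Summary"

class FixedSummary(Summary):
    def __repr__(self):
        # fix analogue: keep the string form cheap and side-effect free
        return "FixedSummary"

try:
    repr(Summary())
except RecursionError:
    print("repr -> job -> debug log -> repr ... overflows the stack")

print(repr(FixedSummary()))
```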
[jira] [Resolved] (SPARK-25586) toString method of GeneralizedLinearRegressionTrainingSummary runs in infinite loop throwing StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-25586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-25586. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22616 [https://github.com/apache/spark/pull/22616] > toString method of GeneralizedLinearRegressionTrainingSummary runs in > infinite loop throwing StackOverflowError > --- > > Key: SPARK-25586 > URL: https://issues.apache.org/jira/browse/SPARK-25586 > Project: Spark > Issue Type: Improvement > Components: MLlib, Spark Core >Affects Versions: 2.3.0 >Reporter: Ankur Gupta >Assignee: Ankur Gupta >Priority: Minor > Fix For: 3.0.0 > > > After the change in SPARK-25118, which enables spark-shell to run with > default log level, test_glr_summary started failing with StackOverflow error. > Cause: ClosureCleaner calls logDebug on various objects and when it is called > for GeneralizedLinearRegressionTrainingSummary, it starts a spark job which > runs into infinite loop and fails with the below exception. 
> {code} > == > ERROR: test_glr_summary (pyspark.ml.tests.TrainingSummaryTest) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests.py", > line 1809, in test_glr_summary > self.assertTrue(isinstance(s.aic, float)) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/regression.py", > line 1781, in aic > return self._call_java("aic") > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/wrapper.py", > line 55, in _call_java > return _java2py(sc, m(*java_args)) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o31639.aic. 
> : java.lang.StackOverflowError > at java.io.UnixFileSystem.getBooleanAttributes0(Native Method) > at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242) > at java.io.File.exists(File.java:819) > at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1245) > at sun.misc.URLClassPath$FileLoader.findResource(URLClassPath.java:1212) > at sun.misc.URLClassPath.findResource(URLClassPath.java:188) > at java.net.URLClassLoader$2.run(URLClassLoader.java:569) > at java.net.URLClassLoader$2.run(URLClassLoader.java:567) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findResource(URLClassLoader.java:566) > at java.lang.ClassLoader.getResource(ClassLoader.java:1093) > at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232) > at java.lang.Class.getResourceAsStream(Class.java:2223) > at > org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:43) > at > org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:87) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:269) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2342) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:864) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:863) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:364) > at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:863) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:613) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) >
[jira] [Assigned] (SPARK-25586) toString method of GeneralizedLinearRegressionTrainingSummary runs in infinite loop throwing StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-25586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-25586: -- Assignee: Ankur Gupta > toString method of GeneralizedLinearRegressionTrainingSummary runs in > infinite loop throwing StackOverflowError > --- > > Key: SPARK-25586 > URL: https://issues.apache.org/jira/browse/SPARK-25586 > Project: Spark > Issue Type: Improvement > Components: MLlib, Spark Core >Affects Versions: 2.3.0 >Reporter: Ankur Gupta >Assignee: Ankur Gupta >Priority: Minor > Fix For: 3.0.0 > > > After the change in SPARK-25118, which enables spark-shell to run with > default log level, test_glr_summary started failing with StackOverflow error. > Cause: ClosureCleaner calls logDebug on various objects and when it is called > for GeneralizedLinearRegressionTrainingSummary, it starts a spark job which > runs into infinite loop and fails with the below exception. > {code} > == > ERROR: test_glr_summary (pyspark.ml.tests.TrainingSummaryTest) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests.py", > line 1809, in test_glr_summary > self.assertTrue(isinstance(s.aic, float)) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/regression.py", > line 1781, in aic > return self._call_java("aic") > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/wrapper.py", > line 55, in _call_java > return _java2py(sc, m(*java_args)) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > 
format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o31639.aic. > : java.lang.StackOverflowError > at java.io.UnixFileSystem.getBooleanAttributes0(Native Method) > at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242) > at java.io.File.exists(File.java:819) > at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1245) > at sun.misc.URLClassPath$FileLoader.findResource(URLClassPath.java:1212) > at sun.misc.URLClassPath.findResource(URLClassPath.java:188) > at java.net.URLClassLoader$2.run(URLClassLoader.java:569) > at java.net.URLClassLoader$2.run(URLClassLoader.java:567) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findResource(URLClassLoader.java:566) > at java.lang.ClassLoader.getResource(ClassLoader.java:1093) > at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232) > at java.lang.Class.getResourceAsStream(Class.java:2223) > at > org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:43) > at > org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:87) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:269) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2342) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:864) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:863) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:364) > at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:863) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:613) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at >
[jira] [Created] (SPARK-25637) SparkException: Could not find CoarseGrainedScheduler occurs during the application stop
Devaraj K created SPARK-25637: - Summary: SparkException: Could not find CoarseGrainedScheduler occurs during the application stop Key: SPARK-25637 URL: https://issues.apache.org/jira/browse/SPARK-25637 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.2 Reporter: Devaraj K {code:xml} 2018-10-03 14:51:33 ERROR Inbox:91 - Ignoring error org.apache.spark.SparkException: Could not find CoarseGrainedScheduler. at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:140) at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:187) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:528) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:449) at org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:638) at org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:201) at org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:197) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at scala.collection.mutable.HashMap.foreach(HashMap.scala:130) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at org.apache.spark.HeartbeatReceiver.org$apache$spark$HeartbeatReceiver$$expireDeadHosts(HeartbeatReceiver.scala:197) at 
org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1.applyOrElse(HeartbeatReceiver.scala:120) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:105) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} SPARK-14228 fixed these kind of errors but still this is occurring while performing reviveOffers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
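As a rough illustration of the shutdown race above, here is a minimal, Spark-free Python sketch (class, endpoint, and message names are hypothetical): application stop unregisters the scheduler endpoint while the heartbeat-expiry path may still try to send to it; a SPARK-14228-style mitigation treats such sends as benign once shutdown has begun.

```python
import threading

class RpcEnv:
    """Toy dispatcher: send() fails if the endpoint was unregistered."""
    def __init__(self):
        self._endpoints = {}
        self._lock = threading.Lock()

    def register(self, name, fn):
        with self._lock:
            self._endpoints[name] = fn

    def unregister(self, name):
        with self._lock:
            self._endpoints.pop(name, None)

    def send(self, name, msg):
        with self._lock:
            if name not in self._endpoints:
                raise RuntimeError(f"Could not find {name}.")
            self._endpoints[name](msg)

env = RpcEnv()
env.register("CoarseGrainedScheduler", lambda m: None)

# Application stop unregisters the scheduler endpoint...
env.unregister("CoarseGrainedScheduler")

# ...but the heartbeat expiry path may still call reviveOffers afterwards:
try:
    env.send("CoarseGrainedScheduler", "ReviveOffers")
except RuntimeError as e:
    print("race during shutdown:", e)

# One mitigation (hypothetical, mirroring the SPARK-14228 approach):
# treat sends to already-removed endpoints as benign during shutdown.
stopped = True

def safe_send(name, msg):
    try:
        env.send(name, msg)
    except RuntimeError:
        if not stopped:
            raise  # real error: the endpoint should still exist

safe_send("CoarseGrainedScheduler", "ReviveOffers")  # silently ignored
```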
[jira] [Commented] (SPARK-23781) Merge YARN and Mesos token renewal code
[ https://issues.apache.org/jira/browse/SPARK-23781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637582#comment-16637582 ] Apache Spark commented on SPARK-23781: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/22624 > Merge YARN and Mesos token renewal code > --- > > Key: SPARK-23781 > URL: https://issues.apache.org/jira/browse/SPARK-23781 > Project: Spark > Issue Type: Improvement > Components: Mesos, YARN >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > > With the fix for SPARK-23361, the code that handles delegation tokens in > Mesos and YARN ends up being very similar. > We should refactor that code so that both backends are sharing the same code, > which also would make it easier for other cluster managers to use that code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
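One way the shared abstraction could look (a hedged sketch only; the class and method names below are hypothetical, not Spark's actual API): keep the obtain/renew loop in a common base class and leave only token distribution to the cluster-manager backends.

```python
import abc
import time

class DelegationTokenManager(abc.ABC):
    """Hypothetical shared renewal core: obtaining and renewing tokens
    is common, only how executors receive them differs per backend."""
    def __init__(self, renew_interval_s):
        self.renew_interval_s = renew_interval_s

    def obtain_tokens(self):
        # common path: fetch fresh tokens from Hadoop services
        return {"hdfs": f"token-{int(time.time())}"}

    @abc.abstractmethod
    def distribute(self, tokens):
        ...  # backend-specific delivery of the tokens

    def renew_once(self):
        tokens = self.obtain_tokens()
        self.distribute(tokens)
        return tokens

class YarnTokenManager(DelegationTokenManager):
    def __init__(self):
        super().__init__(renew_interval_s=3600)
        self.sent = []
    def distribute(self, tokens):
        self.sent.append(("via-AM-credentials", tokens))

class MesosTokenManager(DelegationTokenManager):
    def __init__(self):
        super().__init__(renew_interval_s=3600)
        self.sent = []
    def distribute(self, tokens):
        self.sent.append(("via-driver-RPC", tokens))

for mgr in (YarnTokenManager(), MesosTokenManager()):
    mgr.renew_once()
    print(type(mgr).__name__, "distributed", len(mgr.sent), "batch")
```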
[jira] [Assigned] (SPARK-23781) Merge YARN and Mesos token renewal code
[ https://issues.apache.org/jira/browse/SPARK-23781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23781: Assignee: Apache Spark > Merge YARN and Mesos token renewal code > --- > > Key: SPARK-23781 > URL: https://issues.apache.org/jira/browse/SPARK-23781 > Project: Spark > Issue Type: Improvement > Components: Mesos, YARN >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Major > > With the fix for SPARK-23361, the code that handles delegation tokens in > Mesos and YARN ends up being very similar. > We should refactor that code so that both backends are sharing the same code, > which also would make it easier for other cluster managers to use that code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23781) Merge YARN and Mesos token renewal code
[ https://issues.apache.org/jira/browse/SPARK-23781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23781: Assignee: (was: Apache Spark) > Merge YARN and Mesos token renewal code > --- > > Key: SPARK-23781 > URL: https://issues.apache.org/jira/browse/SPARK-23781 > Project: Spark > Issue Type: Improvement > Components: Mesos, YARN >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > > With the fix for SPARK-23361, the code that handles delegation tokens in > Mesos and YARN ends up being very similar. > We should refactor that code so that both backends are sharing the same code, > which also would make it easier for other cluster managers to use that code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25005) Structured streaming doesn't support kafka transaction (creating empty offset with abort & markers)
[ https://issues.apache.org/jira/browse/SPARK-25005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637564#comment-16637564 ] Shixiong Zhu commented on SPARK-25005: -- [~qambard] Not sure about your question. If a Kafka consumer fetches nothing, it will not update the position. And yes, if a partition is full of invisible messages, we have to wait for the timeout. I don't see any API to avoid this. > Structured streaming doesn't support kafka transaction (creating empty offset > with abort & markers) > --- > > Key: SPARK-25005 > URL: https://issues.apache.org/jira/browse/SPARK-25005 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Quentin Ambard >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.4.0 > > > Structured streaming can't consume kafka transaction. > We could try to apply SPARK-24720 (DStream) logic to Structured Streaming > source -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25005) Structured streaming doesn't support kafka transaction (creating empty offset with abort & markers)
[ https://issues.apache.org/jira/browse/SPARK-25005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637560#comment-16637560 ] Quentin Ambard commented on SPARK-25005: OK, I see, great idea. And does the consumer ensure that the position won't be updated if poll returns an empty list for any reason? Also, if a partition is full of invisible messages due to a transaction abort, we'll have to wait for the poll timeout every time (at least that's what I see in my tests). It could hurt throughput, especially if we have to wait for each partition. Not sure how we could solve that... > Structured streaming doesn't support kafka transaction (creating empty offset > with abort & markers) > --- > > Key: SPARK-25005 > URL: https://issues.apache.org/jira/browse/SPARK-25005 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Quentin Ambard >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.4.0 > > > Structured streaming can't consume kafka transaction. > We could try to apply SPARK-24720 (DStream) logic to Structured Streaming > source -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25005) Structured streaming doesn't support kafka transaction (creating empty offset with abort & markers)
[ https://issues.apache.org/jira/browse/SPARK-25005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637541#comment-16637541 ] Shixiong Zhu commented on SPARK-25005: -- [~qambard] If `poll` returns and the offset has changed, it means the Kafka consumer fetched something but all of the messages are invisible, so the consumer returns empty. If `poll` returns but the offset doesn't change, it means Kafka fetched nothing before the timeout. In this case, we just throw "TimeoutException". Spark will retry the task or just fail the job. A large GC pause can cause a timeout, and the user should tune the configs to avoid this happening. We cannot do much in Spark. > Structured streaming doesn't support kafka transaction (creating empty offset > with abort & markers) > --- > > Key: SPARK-25005 > URL: https://issues.apache.org/jira/browse/SPARK-25005 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Quentin Ambard >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.4.0 > > > Structured streaming can't consume kafka transaction. > We could try to apply SPARK-24720 (DStream) logic to Structured Streaming > source -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25005) Structured streaming doesn't support kafka transaction (creating empty offset with abort & markers)
[ https://issues.apache.org/jira/browse/SPARK-25005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637532#comment-16637532 ] Quentin Ambard commented on SPARK-25005: How do you tell the difference between data loss and data missing when .poll() doesn't return any value, [~zsxwing]? Correct me if I'm wrong, but you could lose data in this situation, no? I think there is a third case here [https://github.com/zsxwing/spark/blob/ea804cfe840196519cc9444be9bedf03d10aa11a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala#L474] which is: something went wrong, data is available in Kafka but I failed to get it. I've seen it happen when the max.poll size is big with big messages and the heap is getting full. Messages exist, but the JVM lags and the consumer times out before getting the messages. > Structured streaming doesn't support kafka transaction (creating empty offset > with abort & markers) > --- > > Key: SPARK-25005 > URL: https://issues.apache.org/jira/browse/SPARK-25005 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Quentin Ambard >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.4.0 > > > Structured streaming can't consume kafka transaction. > We could try to apply SPARK-24720 (DStream) logic to Structured Streaming > source -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
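The distinction discussed in this thread (records returned, empty-but-position-advanced, and empty-with-no-movement) can be modeled without Kafka. The following toy Python sketch (all names are hypothetical, not the Kafka consumer API) shows the classification logic: an empty poll whose position advanced means the fetched range held only invisible messages and is safe to skip, while an empty poll with no movement is a timeout, where retrying or failing is the only safe reaction.

```python
class FakeConsumer:
    """Toy consumer: poll() may return records, return nothing while
    the position advances (all messages aborted/invisible), or return
    nothing with the position unchanged (fetch timed out)."""
    def __init__(self, behavior):
        self.behavior = behavior
        self._pos = 100

    def position(self):
        return self._pos

    def poll(self, timeout_ms):
        if self.behavior == "records":
            self._pos += 5
            return ["r"] * 5
        if self.behavior == "aborted":
            self._pos += 5  # consumer skipped invisible messages
            return []
        return []           # timeout: position stays put

def classify_empty_poll(consumer, timeout_ms=500):
    before = consumer.position()
    records = consumer.poll(timeout_ms)
    if records:
        return "got-data"
    if consumer.position() != before:
        # empty result but the position moved: the fetched range held
        # only aborted-transaction/control messages, safe to skip
        return "invisible-messages"
    # empty result and no movement: nothing fetched before the timeout;
    # guessing here would risk silent data loss, so retry or fail
    return "timeout"

assert classify_empty_poll(FakeConsumer("records")) == "got-data"
assert classify_empty_poll(FakeConsumer("aborted")) == "invisible-messages"
assert classify_empty_poll(FakeConsumer("timeout")) == "timeout"
```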
[jira] [Assigned] (SPARK-25636) spark-submit swallows the failure reason when there is an error connecting to master
[ https://issues.apache.org/jira/browse/SPARK-25636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25636: Assignee: Apache Spark > spark-submit swallows the failure reason when there is an error connecting to > master > > > Key: SPARK-25636 > URL: https://issues.apache.org/jira/browse/SPARK-25636 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Devaraj K >Assignee: Apache Spark >Priority: Minor > > {code:xml} > [apache-spark]$ ./bin/spark-submit --verbose --master spark:// > > Error: Exception thrown in awaitResult: > Run with --help for usage help or --verbose for debug output > {code} > When the spark submit cannot connect to master, there is no error shown. I > think it should display the cause for the problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25636) spark-submit swallows the failure reason when there is an error connecting to master
[ https://issues.apache.org/jira/browse/SPARK-25636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637530#comment-16637530 ] Apache Spark commented on SPARK-25636: -- User 'devaraj-kavali' has created a pull request for this issue: https://github.com/apache/spark/pull/22623 > spark-submit swallows the failure reason when there is an error connecting to > master > > > Key: SPARK-25636 > URL: https://issues.apache.org/jira/browse/SPARK-25636 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Devaraj K >Priority: Minor > > {code:xml} > [apache-spark]$ ./bin/spark-submit --verbose --master spark:// > > Error: Exception thrown in awaitResult: > Run with --help for usage help or --verbose for debug output > {code} > When the spark submit cannot connect to master, there is no error shown. I > think it should display the cause for the problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25636) spark-submit swallows the failure reason when there is an error connecting to master
[ https://issues.apache.org/jira/browse/SPARK-25636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25636: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-25636) spark-submit swallows the failure reason when there is an error connecting to master
Devaraj K created SPARK-25636: - Summary: spark-submit swallows the failure reason when there is an error connecting to master Key: SPARK-25636 URL: https://issues.apache.org/jira/browse/SPARK-25636 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.2 Reporter: Devaraj K {code:xml} [apache-spark]$ ./bin/spark-submit --verbose --master spark:// Error: Exception thrown in awaitResult: Run with --help for usage help or --verbose for debug output {code} When spark-submit cannot connect to the master, the underlying cause of the failure is not shown. It should display the reason for the problem.
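The fix direction is to surface the root cause of the wrapped exception instead of printing only the wrapper's message. The actual change lives in Spark's Scala SparkSubmit code; the following is only a minimal, hypothetical Python sketch of the same idea, walking an exception chain down to its deepest cause (nothing here is Spark's real code).

```python
def root_cause(exc: BaseException) -> BaseException:
    # Follow the explicit (__cause__) or implicit (__context__) chain down
    # to the original error, analogous to Throwable.getCause() on the JVM.
    while exc.__cause__ is not None or exc.__context__ is not None:
        exc = exc.__cause__ or exc.__context__
    return exc

try:
    try:
        # Stand-in for the low-level connection failure that spark-submit hides.
        raise ConnectionError("Connection refused: spark://host:7077")
    except ConnectionError as e:
        raise RuntimeError("Exception thrown in awaitResult:") from e
except RuntimeError as wrapper:
    # Print the wrapper AND its root cause, not just the wrapper.
    message = f"Error: {wrapper} {root_cause(wrapper)}"

print(message)
```

With this pattern the operator sees "Connection refused: spark://host:7077" instead of only the opaque "Exception thrown in awaitResult:" line from the bug report.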
[jira] [Assigned] (SPARK-25635) Support selective direct encoding in native ORC write
[ https://issues.apache.org/jira/browse/SPARK-25635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25635: Assignee: Dongjoon Hyun (was: Apache Spark)
[jira] [Commented] (SPARK-25635) Support selective direct encoding in native ORC write
[ https://issues.apache.org/jira/browse/SPARK-25635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637506#comment-16637506 ] Apache Spark commented on SPARK-25635: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/22622
[jira] [Assigned] (SPARK-25635) Support selective direct encoding in native ORC write
[ https://issues.apache.org/jira/browse/SPARK-25635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25635: Assignee: Apache Spark (was: Dongjoon Hyun)
[jira] [Assigned] (SPARK-25635) Support selective direct encoding in native ORC write
[ https://issues.apache.org/jira/browse/SPARK-25635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25635: - Assignee: Dongjoon Hyun
[jira] [Created] (SPARK-25635) Support selective direct encoding in native ORC write
Dongjoon Hyun created SPARK-25635: - Summary: Support selective direct encoding in native ORC write Key: SPARK-25635 URL: https://issues.apache.org/jira/browse/SPARK-25635 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Dongjoon Hyun Before ORC 1.5.3, `orc.dictionary.key.threshold` and `hive.exec.orc.dictionary.key.size.threshold` are applied to all columns. This is a big hurdle to enabling dictionary encoding. From ORC 1.5.3, `orc.column.encoding.direct` is added to enforce direct encoding selectively in a column-wise manner. This issue aims to add that feature by upgrading ORC from 1.5.2 to 1.5.3. The following are the patches in ORC 1.5.3; this feature is the only one related to Spark directly. {code} ORC-406: ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts multi-byte data (gopalv) ORC-403: [C++] Add checks to avoid invalid offsets in InputStream ORC-405. Remove calcite as a dependency from the benchmarks. ORC-375: Fix libhdfs on gcc7 by adding #include two places. ORC-383: Parallel builds fails with ConcurrentModificationException ORC-382: Apache rat exclusions + add rat check to travis ORC-401: Fix incorrect quoting in specification. ORC-385. Change RecordReader to extend Closeable. ORC-384: [C++] fix memory leak when loading non-ORC files ORC-391: [c++] parseType does not accept underscore in the field name ORC-397. Allow selective disabling of dictionary encoding. Original patch was by Mithun Radhakrishnan. ORC-389: Add ability to not decode Acid metadata columns {code}
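To see what the two knobs control, here is a toy, pure-Python simulation of the encoding decision, not ORC's actual writer logic: `key_threshold` plays the role of `orc.dictionary.key.threshold` (dictionary-encode only when the distinct/total ratio stays at or below it), and `direct_columns` plays the role of the per-column override `orc.column.encoding.direct` added in ORC 1.5.3. The function name and signature are hypothetical.

```python
def choose_encoding(column, values, key_threshold=0.8, direct_columns=()):
    """Toy model: pick 'DICTIONARY' or 'DIRECT' for one column."""
    if column in direct_columns:
        return "DIRECT"  # forced per-column, regardless of cardinality
    ratio = len(set(values)) / len(values)
    return "DICTIONARY" if ratio <= key_threshold else "DIRECT"

# Low-cardinality column: a dictionary pays off (2 distinct / 4 total = 0.5).
print(choose_encoding("city", ["NY", "NY", "SF", "NY"]))   # DICTIONARY
# High-cardinality column: every value distinct (ratio 1.0 > 0.8).
print(choose_encoding("uuid", ["a", "b", "c", "d"]))       # DIRECT
# Selective per-column override, the new capability this issue tracks.
print(choose_encoding("city", ["NY", "NY", "SF", "NY"],
                      direct_columns=("city",)))           # DIRECT
```

The point of the override is the last case: before ORC 1.5.3 the threshold applied globally, so a single column could not be forced to direct encoding without disabling dictionaries everywhere.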
[jira] [Commented] (SPARK-25633) Performance Improvement for Drools Spark Jobs.
[ https://issues.apache.org/jira/browse/SPARK-25633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637485#comment-16637485 ] Koushik commented on SPARK-25633: - yes we can connect @ 11 AM ET tomorrow.
[jira] [Commented] (SPARK-17895) Improve documentation of "rowsBetween" and "rangeBetween"
[ https://issues.apache.org/jira/browse/SPARK-17895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637464#comment-16637464 ] Antonio Pedro de Sousa Vieira commented on SPARK-17895: --- These changes seem to have been applied only to the Scala docs; the SparkR and PySpark docs are still identical for the two methods. Should this be reopened? > Improve documentation of "rowsBetween" and "rangeBetween" > - > > Key: SPARK-17895 > URL: https://issues.apache.org/jira/browse/SPARK-17895 > Project: Spark > Issue Type: Documentation > Components: PySpark, SparkR, SQL >Reporter: Weiluo Ren >Assignee: Weiluo Ren >Priority: Minor > Fix For: 2.1.0 > > > This is an issue found by [~junyangq] when he was fixing SparkR docs. > In WindowSpec we have two methods "rangeBetween" and "rowsBetween" (See > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala#L82]). > However, the description of "rangeBetween" does not clearly differentiate it > from "rowsBetween". Even though in > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L109] > we have a pretty nice description for "RangeFrame" and "RowFrame" which are > used in "rangeBetween" and "rowsBetween", I cannot find them in the online > Spark scala api.
> We could add small examples to the description of "rangeBetween" and > "rowsBetween" like > {code} > val df = Seq(1,1,2).toDF("id") > df.withColumn("sum", sum('id) over Window.orderBy('id).rangeBetween(0,1)).show > /** > * It shows > * +---+---+ > * | id|sum| > * +---+---+ > * | 1| 4| > * | 1| 4| > * | 2| 2| > * +---+---+ > */ > df.withColumn("sum", sum('id) over Window.orderBy('id).rowsBetween(0,1)).show > /** > * It shows > * +---+---+ > * | id|sum| > * +---+---+ > * | 1| 2| > * | 1| 3| > * | 2| 2| > * +---+---+ > */ > {code}
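The difference between the two frames is easy to check without Spark. Below is a plain-Python sketch of the two frame semantics over the same `[1, 1, 2]` column from the example above; the helper names are hypothetical, and the input is assumed to already be sorted by the ordering column.

```python
def rows_between(values, start, end):
    # Physical frame: for row i, sum the rows at offsets [i+start, i+end],
    # clipped to the partition boundaries. This is rowsBetween semantics.
    n = len(values)
    return [sum(values[max(0, i + start):min(n, i + end + 1)]) for i in range(n)]

def range_between(values, start, end):
    # Logical frame: for each row with ORDER-BY value v, sum every row whose
    # value lies in [v+start, v+end]. This is rangeBetween semantics.
    return [sum(w for w in values if v + start <= w <= v + end) for v in values]

ids = [1, 1, 2]
print(rows_between(ids, 0, 1))   # [2, 3, 2]  -- matches the rowsBetween table
print(range_between(ids, 0, 1))  # [4, 4, 2]  -- matches the rangeBetween table
```

Note how both rows with id=1 get the same result under `range_between` (peers by value share a frame) but different results under `rows_between` (the frame is positional), which is exactly the distinction the documentation should spell out.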
[jira] [Updated] (SPARK-25633) Performance Improvement for Drools Spark Jobs.
[ https://issues.apache.org/jira/browse/SPARK-25633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koushik updated SPARK-25633: Attachment: RTTA Performance Issue.pptx
[jira] [Commented] (SPARK-25634) New Metrics in External Shuffle Service to help identify abusing application
[ https://issues.apache.org/jira/browse/SPARK-25634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637436#comment-16637436 ] Ye Zhou commented on SPARK-25634: - [~felixcheung] [~vanzin] [~tgraves] [~irashid] [~zsxwing] More comments? Thanks
[jira] [Created] (SPARK-25634) New Metrics in External Shuffle Service to help identify abusing application
Ye Zhou created SPARK-25634: --- Summary: New Metrics in External Shuffle Service to help identify abusing application Key: SPARK-25634 URL: https://issues.apache.org/jira/browse/SPARK-25634 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 2.4.0 Reporter: Ye Zhou We run Spark on YARN and deploy the Spark external shuffle service as part of the YARN NM aux service. The External Shuffle Service is shared by all Spark applications. SPARK-24355 enables thread reservation to handle non-ChunkFetchRequest messages. SPARK-21501 limits the memory usage of the Guava Cache to avoid OOM in the shuffle service, which could crash the NodeManager. Still, some applications may generate a large number of shuffle blocks, which can heavily degrade performance on some shuffle servers. When this abusive behavior happens, it can further degrade overall performance for other applications that happen to use the same shuffle servers. We have been seeing issues like this in our cluster, but there is no way for us to figure out which application is abusing the shuffle service. SPARK-18364 enabled exposing shuffle service metrics to the Hadoop Metrics System. It would be better if we also had the following metrics, broken down by application ID: 1. *shuffle server on-heap memory consumption for caching shuffle indexes* 2. *breakdown of shuffle index caching memory consumption by local executors* We can generate these metrics in ExternalShuffleBlockHandler-->getSortBasedShuffleBlockData, which triggers the cache load. We can roughly derive the metrics from the shuffle index file size when an entry is put into the cache and when it is evicted from the cache. 3. *shuffle server load for shuffle block fetch requests* 4. *breakdown of shuffle server block fetch request load by remote executors* We can generate these metrics in ExternalShuffleBlockHandler-->handleMessage when a new OpenBlocks message is received. Open discussion for more metrics that could potentially influence overall shuffle service performance. We can print those metrics, divided by application IDs, in the log, since it is hard to define a fixed key and use a numerical value for this kind of metric.
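Metrics 1 and 2 above boil down to per-application byte accounting around the index cache. A minimal toy sketch of that accounting follows; the class and method names are hypothetical (nothing here exists in Spark), with `on_load` standing in for the cache-load path and `on_evict` for the cache's removal listener.

```python
from collections import defaultdict

class ShuffleIndexCacheMetrics:
    """Toy per-application accounting for cached shuffle-index bytes."""

    def __init__(self):
        self.bytes_by_app = defaultdict(int)

    def on_load(self, app_id, index_file_bytes):
        # Called when an index file is loaded into the cache.
        self.bytes_by_app[app_id] += index_file_bytes

    def on_evict(self, app_id, index_file_bytes):
        # Called when a cache entry is evicted.
        self.bytes_by_app[app_id] -= index_file_bytes

    def top_consumers(self, n=3):
        # The "abusing application" is simply the biggest current consumer.
        return sorted(self.bytes_by_app.items(), key=lambda kv: -kv[1])[:n]

metrics = ShuffleIndexCacheMetrics()
metrics.on_load("app-1", 4096)
metrics.on_load("app-2", 1024)
metrics.on_evict("app-2", 1024)
print(metrics.top_consumers(1))  # [('app-1', 4096)]
```

Keying the counters by application ID is what makes the abusing application identifiable, which is exactly what the issue says today's aggregate metrics cannot do.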
[jira] [Created] (SPARK-25633) Performance Improvement for Drools Spark Jobs.
Koushik created SPARK-25633: --- Summary: Performance Improvement for Drools Spark Jobs. Key: SPARK-25633 URL: https://issues.apache.org/jira/browse/SPARK-25633 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.0 Reporter: Koushik Fix For: 2.2.0 We have the below region-wise compute instances on the performance environment; when we reduce the compute instances, we face performance issues. We have already done code optimization. |Region|Compute instances on performance environment| |MWSW|6| |SE|6| |W|6| |Total|*18*| For the above combination, 98% of the data is processed within 30 seconds, but performance degrades when we reduce instances. We would provide all additional details to the respective support team on request.
[jira] [Created] (SPARK-25632) KafkaRDDSuite: compacted topic 2 min 5 sec.
Xiao Li created SPARK-25632: --- Summary: KafkaRDDSuite: compacted topic 2 min 5 sec. Key: SPARK-25632 URL: https://issues.apache.org/jira/browse/SPARK-25632 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.streaming.kafka010.KafkaRDDSuite.compacted topic Took 2 min 5 sec.
[jira] [Created] (SPARK-25631) KafkaRDDSuite: basic usage 2 min 4 sec
Xiao Li created SPARK-25631: --- Summary: KafkaRDDSuite: basic usage 2 min 4 sec Key: SPARK-25631 URL: https://issues.apache.org/jira/browse/SPARK-25631 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.streaming.kafka010.KafkaRDDSuite.basic usage Took 2 min 4 sec.
[jira] [Created] (SPARK-25630) HiveOrcHadoopFsRelationSuite: SPARK-8406: Avoids name collision while writing files 21 sec
Xiao Li created SPARK-25630: --- Summary: HiveOrcHadoopFsRelationSuite: SPARK-8406: Avoids name collision while writing files 21 sec Key: SPARK-25630 URL: https://issues.apache.org/jira/browse/SPARK-25630 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.hive.orc.HiveOrcHadoopFsRelationSuite.SPARK-8406: Avoids name collision while writing files Took 21 sec.
[jira] [Created] (SPARK-25629) ParquetFilterSuite: filter pushdown - decimal 16 sec
Xiao Li created SPARK-25629: --- Summary: ParquetFilterSuite: filter pushdown - decimal 16 sec Key: SPARK-25629 URL: https://issues.apache.org/jira/browse/SPARK-25629 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.execution.datasources.parquet.ParquetFilterSuite.filter pushdown - decimal Took 16 sec.
[jira] [Created] (SPARK-25628) DistributedSuite: recover from repeated node failures during shuffle-reduce 40 seconds
Xiao Li created SPARK-25628: --- Summary: DistributedSuite: recover from repeated node failures during shuffle-reduce 40 seconds Key: SPARK-25628 URL: https://issues.apache.org/jira/browse/SPARK-25628 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.DistributedSuite.recover from repeated node failures during shuffle-reduce 40 seconds
[jira] [Created] (SPARK-25627) ContinuousStressSuite - 8 mins 13 sec
Xiao Li created SPARK-25627: --- Summary: ContinuousStressSuite - 8 mins 13 sec Key: SPARK-25627 URL: https://issues.apache.org/jira/browse/SPARK-25627 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li ContinuousStressSuite - 8 mins 13 sec
[jira] [Created] (SPARK-25626) HiveClientSuites: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 46 sec
Xiao Li created SPARK-25626: --- Summary: HiveClientSuites: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 46 sec Key: SPARK-25626 URL: https://issues.apache.org/jira/browse/SPARK-25626 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li HiveClientSuite.2.3: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 46 sec Passed HiveClientSuite.2.2: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 45 sec Passed HiveClientSuite.2.1: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 42 sec Passed HiveClientSuite.2.0: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 39 sec Passed HiveClientSuite.1.2: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 37 sec Passed HiveClientSuite.1.1: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 36 sec Passed
[jira] [Created] (SPARK-25625) LogisticRegressionSuite.binary logistic regression with intercept with ElasticNet regularization - 33 sec
Xiao Li created SPARK-25625: --- Summary: LogisticRegressionSuite.binary logistic regression with intercept with ElasticNet regularization - 33 sec Key: SPARK-25625 URL: https://issues.apache.org/jira/browse/SPARK-25625 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li LogisticRegressionSuite.binary logistic regression with intercept with ElasticNet regularization Took 33 sec.
[jira] [Created] (SPARK-25624) LogisticRegressionSuite.multinomial logistic regression with intercept with elasticnet regularization 56 seconds
Xiao Li created SPARK-25624: --- Summary: LogisticRegressionSuite.multinomial logistic regression with intercept with elasticnet regularization 56 seconds Key: SPARK-25624 URL: https://issues.apache.org/jira/browse/SPARK-25624 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.ml.classification.LogisticRegressionSuite.multinomial logistic regression with intercept with elasticnet regularization Took 56 sec.
[jira] [Created] (SPARK-25623) LogisticRegressionSuite: multinomial logistic regression with intercept with L1 regularization 1 min 10 sec
Xiao Li created SPARK-25623: --- Summary: LogisticRegressionSuite: multinomial logistic regression with intercept with L1 regularization 1 min 10 sec Key: SPARK-25623 URL: https://issues.apache.org/jira/browse/SPARK-25623 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.ml.classification.LogisticRegressionSuite.multinomial logistic regression with intercept with L1 regularization Took 1 min 10 sec.
[jira] [Created] (SPARK-25622) BucketedReadWithHiveSupportSuite: read partitioning bucketed tables with bucket pruning filters - 42 seconds
Xiao Li created SPARK-25622: --- Summary: BucketedReadWithHiveSupportSuite: read partitioning bucketed tables with bucket pruning filters - 42 seconds Key: SPARK-25622 URL: https://issues.apache.org/jira/browse/SPARK-25622 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.sources.BucketedReadWithHiveSupportSuite.read partitioning bucketed tables with bucket pruning filters Took 42 sec.
[jira] [Created] (SPARK-25621) BucketedReadWithHiveSupportSuite: read partitioning bucketed tables having composite filters 45 sec
Xiao Li created SPARK-25621: --- Summary: BucketedReadWithHiveSupportSuite: read partitioning bucketed tables having composite filters 45 sec Key: SPARK-25621 URL: https://issues.apache.org/jira/browse/SPARK-25621 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.sources.BucketedReadWithHiveSupportSuite.read partitioning bucketed tables having composite filters Took 45 sec. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25619) WithAggregationKinesisStreamSuite: split and merge shards in a stream 2 min 15 sec
[ https://issues.apache.org/jira/browse/SPARK-25619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25619: Description: org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.split and merge shards in a stream 2 min 15 sec org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite.split and merge shards in a stream 1 min 52 sec. was: org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.split and merge shards in a stream 2 min 15 sec > WithAggregationKinesisStreamSuite: split and merge shards in a stream 2 min > 15 sec > -- > > Key: SPARK-25619 > URL: https://issues.apache.org/jira/browse/SPARK-25619 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.split > and merge shards in a stream 2 min 15 sec > org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite.split > and merge shards in a stream 1 min 52 sec. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25620) WithAggregationKinesisStreamSuite: failure recovery 1 min 36 seconds
[ https://issues.apache.org/jira/browse/SPARK-25620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25620: Description: org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.failure recovery Took 1 min 36 sec. org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite.failure recovery Took 1 min 24 sec. was: org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.failure recovery Took 1 min 36 sec. > WithAggregationKinesisStreamSuite: failure recovery 1 min 36 seconds > > > Key: SPARK-25620 > URL: https://issues.apache.org/jira/browse/SPARK-25620 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.failure > recovery > Took 1 min 36 sec. > org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite.failure > recovery > Took 1 min 24 sec. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-25582) Error in Spark logs when using the org.apache.spark:spark-sql_2.11:2.2.0 Java library
[ https://issues.apache.org/jira/browse/SPARK-25582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Brugiere reopened SPARK-25582: - > Error in Spark logs when using the org.apache.spark:spark-sql_2.11:2.2.0 Java > library > - > > Key: SPARK-25582 > URL: https://issues.apache.org/jira/browse/SPARK-25582 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.2.0 >Reporter: Thomas Brugiere >Priority: Major > Attachments: fileA.csv, fileB.csv, fileC.csv > > > I have noticed an error that appears in the Spark logs when using the Spark > SQL library in a Java 8 project. > When I run the code below with the attached files as input, I can see the > ERROR below in the application logs. > I am using the *org.apache.spark:spark-sql_2.11:2.2.0* library in my Java > project. > Note that the same logic implemented with the Python API (pyspark) doesn't > produce any Exception like this. > *Code* > {code:java} > SparkConf conf = new SparkConf().setAppName("SparkBug").setMaster("local"); > SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate(); > Dataset<Row> df_a = sparkSession.read().option("header", > true).csv("local/fileA.csv").dropDuplicates(); > Dataset<Row> df_b = sparkSession.read().option("header", > true).csv("local/fileB.csv").dropDuplicates(); > Dataset<Row> df_c = sparkSession.read().option("header", > true).csv("local/fileC.csv").dropDuplicates(); > String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", "colE", > "colF"}; > String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", "colE"}; > Dataset<Row> df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), "left"); > Dataset<Row> df_inventory_2 = df_inventory_1.join(df_c, > arrayToSeq(key_join_2), "left"); > df_inventory_2.show(); > {code} > *Error message* > {code:java} > 18/10/01 09:58:07 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 202, Column 18: Expression
"agg_isNull_28" is not an rvalue > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 202, Column 18: Expression "agg_isNull_28" is not an rvalue > at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821) > at > org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170) > at > org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332) > at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287) > at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053) > at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284) > at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977) > at > org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391) > at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474) > at > org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466) > at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958) > at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:393) > at > 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:385) > at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1286) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:385) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1285) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:825) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:411) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:390) > at >
[jira] [Resolved] (SPARK-25582) Error in Spark logs when using the org.apache.spark:spark-sql_2.11:2.2.0 Java library
[ https://issues.apache.org/jira/browse/SPARK-25582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Brugiere resolved SPARK-25582. - Resolution: Later > Error in Spark logs when using the org.apache.spark:spark-sql_2.11:2.2.0 Java > library > - > > Key: SPARK-25582 > URL: https://issues.apache.org/jira/browse/SPARK-25582 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.2.0 >Reporter: Thomas Brugiere >Priority: Major > Attachments: fileA.csv, fileB.csv, fileC.csv > > > I have noticed an error that appears in the Spark logs when using the Spark > SQL library in a Java 8 project. > When I run the code below with the attached files as input, I can see the > ERROR below in the application logs. > I am using the *org.apache.spark:spark-sql_2.11:2.2.0* library in my Java > project. > Note that the same logic implemented with the Python API (pyspark) doesn't > produce any Exception like this. > *Code* > {code:java} > SparkConf conf = new SparkConf().setAppName("SparkBug").setMaster("local"); > SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate(); > Dataset<Row> df_a = sparkSession.read().option("header", > true).csv("local/fileA.csv").dropDuplicates(); > Dataset<Row> df_b = sparkSession.read().option("header", > true).csv("local/fileB.csv").dropDuplicates(); > Dataset<Row> df_c = sparkSession.read().option("header", > true).csv("local/fileC.csv").dropDuplicates(); > String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", "colE", > "colF"}; > String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", "colE"}; > Dataset<Row> df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), "left"); > Dataset<Row> df_inventory_2 = df_inventory_1.join(df_c, > arrayToSeq(key_join_2), "left"); > df_inventory_2.show(); > {code} > *Error message* > {code:java} > 18/10/01 09:58:07 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 202, Column 18:
Expression "agg_isNull_28" is not an rvalue > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 202, Column 18: Expression "agg_isNull_28" is not an rvalue > at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821) > at > org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170) > at > org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332) > at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287) > at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053) > at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284) > at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977) > at > org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391) > at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474) > at > org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466) > at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958) > at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:393) > at > 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:385) > at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1286) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:385) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1285) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:825) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:411) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:390) > at >
[jira] [Created] (SPARK-25620) WithAggregationKinesisStreamSuite: failure recovery 1 min 36 seconds
Xiao Li created SPARK-25620: --- Summary: WithAggregationKinesisStreamSuite: failure recovery 1 min 36 seconds Key: SPARK-25620 URL: https://issues.apache.org/jira/browse/SPARK-25620 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.failure recovery Took 1 min 36 sec. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25619) WithAggregationKinesisStreamSuite: split and merge shards in a stream 2 min 15 sec
Xiao Li created SPARK-25619: --- Summary: WithAggregationKinesisStreamSuite: split and merge shards in a stream 2 min 15 sec Key: SPARK-25619 URL: https://issues.apache.org/jira/browse/SPARK-25619 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.split and merge shards in a stream 2 min 15 sec -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25501) Kafka delegation token support
[ https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637268#comment-16637268 ] Gabor Somogyi commented on SPARK-25501: --- Yeah, it's posted on the dev list. To answer your question, the token is not Structured Streaming specific; it's generic. Any code that uses the Kafka source/sink in kafka-0-10-sql can take advantage of it. > Kafka delegation token support > -- > > Key: SPARK-25501 > URL: https://issues.apache.org/jira/browse/SPARK-25501 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Gabor Somogyi >Priority: Major > Labels: SPIP > > In kafka version 1.1 delegation token support is released. As spark updated > it's kafka client to 2.0.0 now it's possible to implement delegation token > support. Please see description: > https://cwiki.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25618) KafkaContinuousSourceStressForDontFailOnDataLossSuite: stress test for failOnDataLoss=false 1 min 1 sec
Xiao Li created SPARK-25618: --- Summary: KafkaContinuousSourceStressForDontFailOnDataLossSuite: stress test for failOnDataLoss=false 1 min 1 sec Key: SPARK-25618 URL: https://issues.apache.org/jira/browse/SPARK-25618 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.kafka010.KafkaContinuousSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false 1 min 1 sec -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25617) KafkaContinuousSinkSuite: generic - write big data with small producer buffer 56 secs
Xiao Li created SPARK-25617: --- Summary: KafkaContinuousSinkSuite: generic - write big data with small producer buffer 56 secs Key: SPARK-25617 URL: https://issues.apache.org/jira/browse/SPARK-25617 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.kafka010.KafkaContinuousSinkSuite.generic - write big data with small producer buffer 56 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25616) KafkaSinkSuite: generic - write big data with small producer buffer 57 secs
Xiao Li created SPARK-25616: --- Summary: KafkaSinkSuite: generic - write big data with small producer buffer 57 secs Key: SPARK-25616 URL: https://issues.apache.org/jira/browse/SPARK-25616 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.kafka010.KafkaSinkSuite.generic - write big data with small producer buffer 57 secs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25615) KafkaSinkSuite: streaming - write to non-existing topic 1 min
Xiao Li created SPARK-25615: --- Summary: KafkaSinkSuite: streaming - write to non-existing topic 1 min Key: SPARK-25615 URL: https://issues.apache.org/jira/browse/SPARK-25615 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.kafka010.KafkaSinkSuite.streaming - write to non-existing topic 1 min -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25614) HiveSparkSubmitSuite: SPARK-18989: DESC TABLE should not fail with format class not found 38 seconds
Xiao Li created SPARK-25614: --- Summary: HiveSparkSubmitSuite: SPARK-18989: DESC TABLE should not fail with format class not found 38 seconds Key: SPARK-25614 URL: https://issues.apache.org/jira/browse/SPARK-25614 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-18989: DESC TABLE should not fail with format class not found 38 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25613) HiveSparkSubmitSuite: dir 1 min 3 seconds
Xiao Li created SPARK-25613: --- Summary: HiveSparkSubmitSuite: dir 1 min 3 seconds Key: SPARK-25613 URL: https://issues.apache.org/jira/browse/SPARK-25613 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.hive.HiveSparkSubmitSuite.dir 1 min 3 sec -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25612) CompressionCodecSuite: table-level compression is not set but session-level compressions 47 seconds
Xiao Li created SPARK-25612: --- Summary: CompressionCodecSuite: table-level compression is not set but session-level compressions 47 seconds Key: SPARK-25612 URL: https://issues.apache.org/jira/browse/SPARK-25612 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.hive.CompressionCodecSuite.table-level compression is not set but session-level compressions is set 47 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25611) CompressionCodecSuite: both table-level and session-level compression are set 2 min 20 sec
Xiao Li created SPARK-25611: --- Summary: CompressionCodecSuite: both table-level and session-level compression are set 2 min 20 sec Key: SPARK-25611 URL: https://issues.apache.org/jira/browse/SPARK-25611 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.hive.CompressionCodecSuite.both table-level and session-level compression are set: 2 min 20 sec -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25610) DatasetCacheSuite: cache UDF result correctly 25 seconds
Xiao Li created SPARK-25610: --- Summary: DatasetCacheSuite: cache UDF result correctly 25 seconds Key: SPARK-25610 URL: https://issues.apache.org/jira/browse/SPARK-25610 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.DatasetCacheSuite.cache UDF result correctly 25 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25609) DataFrameSuite: SPARK-22226: splitExpressions should not generate codes beyond 64KB 49 seconds
Xiao Li created SPARK-25609: --- Summary: DataFrameSuite: SPARK-22226: splitExpressions should not generate codes beyond 64KB 49 seconds Key: SPARK-25609 URL: https://issues.apache.org/jira/browse/SPARK-25609 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.DataFrameSuite.SPARK-22226: splitExpressions should not generate codes beyond 64KB 49 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25501) Kafka delegation token support
[ https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637248#comment-16637248 ] Thomas Graves commented on SPARK-25501: --- the spip title has "Structured Streaming", is there some reason it is limited to structured streaming and not just a generic get tokens from kafka if someone requests? Perhaps I'm doing a batch job and want to read from kafka > Kafka delegation token support > -- > > Key: SPARK-25501 > URL: https://issues.apache.org/jira/browse/SPARK-25501 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Gabor Somogyi >Priority: Major > Labels: SPIP > > In kafka version 1.1 delegation token support is released. As spark updated > it's kafka client to 2.0.0 now it's possible to implement delegation token > support. Please see description: > https://cwiki.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25608) HashAggregationQueryWithControlledFallbackSuite: multiple distinct multiple columns sets 38 seconds
Xiao Li created SPARK-25608: --- Summary: HashAggregationQueryWithControlledFallbackSuite: multiple distinct multiple columns sets 38 seconds Key: SPARK-25608 URL: https://issues.apache.org/jira/browse/SPARK-25608 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.hive.execution.HashAggregationQueryWithControlledFallbackSuite.multiple distinct multiple columns sets 38 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25607) HashAggregationQueryWithControlledFallbackSuite: single distinct column set 42 seconds
Xiao Li created SPARK-25607: --- Summary: HashAggregationQueryWithControlledFallbackSuite: single distinct column set 42 seconds Key: SPARK-25607 URL: https://issues.apache.org/jira/browse/SPARK-25607 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.hive.execution.HashAggregationQueryWithControlledFallbackSuite.single distinct column set 42 seconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25501) Kafka delegation token support
[ https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637240#comment-16637240 ] Thomas Graves commented on SPARK-25501: --- did you post SPIP to the dev list, I didn't see it go by but might have missed? > Kafka delegation token support > -- > > Key: SPARK-25501 > URL: https://issues.apache.org/jira/browse/SPARK-25501 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Gabor Somogyi >Priority: Major > Labels: SPIP > > In kafka version 1.1 delegation token support is released. As spark updated > it's kafka client to 2.0.0 now it's possible to implement delegation token > support. Please see description: > https://cwiki.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25606) DateExpressionsSuite: Hour 1 min
Xiao Li created SPARK-25606: --- Summary: DateExpressionsSuite: Hour 1 min Key: SPARK-25606 URL: https://issues.apache.org/jira/browse/SPARK-25606 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.catalyst.expressions.DateExpressionsSuite.Hour 1 min -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25605) CastSuite: cast string to timestamp 2 mins 31 sec
Xiao Li created SPARK-25605: --- Summary: CastSuite: cast string to timestamp 2 mins 31 sec Key: SPARK-25605 URL: https://issues.apache.org/jira/browse/SPARK-25605 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.catalyst.expressions.CastSuite.cast string to timestamp took 2 min 31 secs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25604) Reduce the overall time costs in Jenkins tests
Xiao Li created SPARK-25604: --- Summary: Reduce the overall time costs in Jenkins tests Key: SPARK-25604 URL: https://issues.apache.org/jira/browse/SPARK-25604 Project: Spark Issue Type: Umbrella Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li Currently, our Jenkins tests take almost 5 hours. To reduce the test time, below are my suggestions: * split the tests into multiple individual Jenkins jobs; * tune the confs in the test framework; * for the slow test cases, rewrite the test cases or even optimize the source code to speed them up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
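As a hedged sketch of the first suggestion above (splitting the tests into multiple individual Jenkins jobs), the suite durations reported in the sub-tasks of this umbrella could be bucketed greedily across jobs. The helper below is purely illustrative and not part of Spark's build tooling, and the suite list is a small hypothetical sample with rounded times:

```python
# Hypothetical sketch: longest-processing-time (LPT) bucketing of slow test
# suites across N parallel Jenkins jobs. Neither the helper nor the suite
# list is part of Spark; durations are rounded samples from this thread.
import heapq

def split_into_jobs(durations, n_jobs):
    """Assign each (name, seconds) pair to the currently lightest job."""
    heap = [(0, i) for i in range(n_jobs)]  # (total_seconds, job_index)
    heapq.heapify(heap)
    jobs = [[] for _ in range(n_jobs)]
    # Scheduling the longest suites first gives the classic LPT approximation.
    for name, secs in sorted(durations, key=lambda kv: -kv[1]):
        total, idx = heapq.heappop(heap)
        jobs[idx].append(name)
        heapq.heappush(heap, (total + secs, idx))
    return jobs

suites = [
    ("CastSuite.cast string to timestamp", 151),
    ("CompressionCodecSuite.table-level and session-level", 140),
    ("DateExpressionsSuite.Hour", 60),
    ("KafkaSinkSuite.streaming - write to non-existing topic", 60),
]
print(split_into_jobs(suites, 2))
```

A real setup would feed this from the recorded per-suite timings and emit one Jenkins job definition per bucket; the sketch only shows the balancing idea.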
[jira] [Commented] (SPARK-25062) Clean up BlockLocations in FileStatus objects
[ https://issues.apache.org/jira/browse/SPARK-25062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637089#comment-16637089 ] Andrei Stankevich commented on SPARK-25062: --- Hi [~dongjoon], yes, it is an improvement. > Clean up BlockLocations in FileStatus objects > - > > Key: SPARK-25062 > URL: https://issues.apache.org/jira/browse/SPARK-25062 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.2 >Reporter: andrzej.stankev...@gmail.com >Priority: Major > > When Spark lists a collection of files, it does it on the driver or creates tasks > to list the files, depending on the number of files; see > [https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L170] > If Spark creates tasks to list files, each task creates one FileStatus object > per file. Before sending a FileStatus to the driver, Spark converts it to a > SerializableFileStatus. On the driver side, Spark turns the SerializableFileStatus > back into a FileStatus, and it also creates a BlockLocation object for each > FileStatus using > > {code:java} > new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length) > {code} > > After deserialization on the driver side, the BlockLocation doesn't have much of the > information that the original HdfsBlockLocation had. > > If Spark does the listing on the driver side, the FileStatus objects hold > HdfsBlockLocation objects, and those carry a lot of info that Spark doesn't use. > Because of this, FileStatus objects take more memory than if they were created > on the executor side. > > Later, Spark puts all these objects into _SharedInMemoryCache_, and that cache > takes 2.2x more memory if the files were listed on the driver side than if they were > listed on the executor side. > > In our case, _SharedInMemoryCache_ takes 125M when we do the scan on executors > and 270M when we do it on the driver. This is for about 19000 files.
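The round trip described above can be modeled in isolation as a hedged sketch. The class and field names below are simplified stand-ins, not the actual Hadoop/Spark types; they only illustrate why the driver-side rebuild from the serialized form is leaner than a full HDFS block location:

```python
# Illustrative model of the round trip described above: the executor-side
# listing ships only four BlockLocation fields back to the driver, so the
# reconstructed object drops the extra HDFS metadata. All names here are
# simplified stand-ins for the real Hadoop/Spark classes.
from dataclasses import dataclass, field

@dataclass
class HdfsBlockLocation:
    names: list
    hosts: list
    offset: int
    length: int
    # Extra metadata present only when listing happens on the driver;
    # these field names are hypothetical examples of unused payload.
    topology_paths: list = field(default_factory=list)
    storage_ids: list = field(default_factory=list)

def to_serializable(loc):
    # What an executor task would ship back to the driver.
    return (loc.names, loc.hosts, loc.offset, loc.length)

def from_serializable(t):
    # Driver-side rebuild, mirroring
    # new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length)
    names, hosts, offset, length = t
    return HdfsBlockLocation(names, hosts, offset, length)

full = HdfsBlockLocation(["blk1"], ["h1"], 0, 128,
                         topology_paths=["/rack1/h1"], storage_ids=["DS-1"])
slim = from_serializable(to_serializable(full))
# slim keeps only the four shipped fields; the extras are gone.
```

Caching many `full`-shaped objects instead of `slim`-shaped ones is, in this simplified model, the source of the 2.2x memory difference reported above.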
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25538. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22602 [https://github.com/apache/spark/pull/22602] > incorrect row counts after distinct() > - > > Key: SPARK-25538 > URL: https://issues.apache.org/jira/browse/SPARK-25538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Reproduced on a Centos7 VM and from source in Intellij > on OS X. >Reporter: Steven Rand >Assignee: Marco Gaido >Priority: Blocker > Labels: correctness > Fix For: 2.4.0 > > Attachments: SPARK-25538-repro.tgz > > > It appears that {{df.distinct.count}} can return incorrect values after > SPARK-23713. It's possible that other operations are affected as well; > {{distinct}} just happens to be the one that we noticed. I believe that this > issue was introduced by SPARK-23713 because I can't reproduce it until that > commit, and I've been able to reproduce it after that commit as well as with > {{tags/v2.4.0-rc1}}. > Below are example spark-shell sessions to illustrate the problem. > Unfortunately the data used in these examples can't be uploaded to this Jira > ticket. I'll try to create test data which also reproduces the issue, and > will upload that if I'm able to do so. 
> Example from Spark 2.3.1, which behaves correctly: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 115 > {code} > Example from Spark 2.4.0-rc1, which returns different output: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 116 > scala> df.sort("col_0").distinct.count > res2: Long = 123 > scala> df.withColumnRenamed("col_0", "newName").distinct.count > res3: Long = 115 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25538: - Assignee: Marco Gaido > incorrect row counts after distinct() > - > > Key: SPARK-25538 > URL: https://issues.apache.org/jira/browse/SPARK-25538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Reproduced on a Centos7 VM and from source in Intellij > on OS X. >Reporter: Steven Rand >Assignee: Marco Gaido >Priority: Blocker > Labels: correctness > Fix For: 2.4.0 > > Attachments: SPARK-25538-repro.tgz > > > It appears that {{df.distinct.count}} can return incorrect values after > SPARK-23713. It's possible that other operations are affected as well; > {{distinct}} just happens to be the one that we noticed. I believe that this > issue was introduced by SPARK-23713 because I can't reproduce it until that > commit, and I've been able to reproduce it after that commit as well as with > {{tags/v2.4.0-rc1}}. > Below are example spark-shell sessions to illustrate the problem. > Unfortunately the data used in these examples can't be uploaded to this Jira > ticket. I'll try to create test data which also reproduces the issue, and > will upload that if I'm able to do so. 
> Example from Spark 2.3.1, which behaves correctly: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 115 > {code} > Example from Spark 2.4.0-rc1, which returns different output: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 116 > scala> df.sort("col_0").distinct.count > res2: Long = 123 > scala> df.withColumnRenamed("col_0", "newName").distinct.count > res3: Long = 115 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25603) `Projection` expression pushdown through `coalesce` and `limit`
DB Tsai created SPARK-25603: --- Summary: `Projection` expression pushdown through `coalesce` and `limit` Key: SPARK-25603 URL: https://issues.apache.org/jira/browse/SPARK-25603 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.1 Reporter: DB Tsai Assignee: DB Tsai -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25602) range metrics can be wrong if the result rows are not fully consumed
[ https://issues.apache.org/jira/browse/SPARK-25602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636995#comment-16636995 ] Apache Spark commented on SPARK-25602: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/22621 > range metrics can be wrong if the result rows are not fully consumed > > > Key: SPARK-25602 > URL: https://issues.apache.org/jira/browse/SPARK-25602 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25602) range metrics can be wrong if the result rows are not fully consumed
[ https://issues.apache.org/jira/browse/SPARK-25602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25602: Assignee: Apache Spark (was: Wenchen Fan) > range metrics can be wrong if the result rows are not fully consumed > > > Key: SPARK-25602 > URL: https://issues.apache.org/jira/browse/SPARK-25602 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25602) range metrics can be wrong if the result rows are not fully consumed
[ https://issues.apache.org/jira/browse/SPARK-25602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636994#comment-16636994 ] Apache Spark commented on SPARK-25602: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/22621 > range metrics can be wrong if the result rows are not fully consumed > > > Key: SPARK-25602 > URL: https://issues.apache.org/jira/browse/SPARK-25602 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25602) range metrics can be wrong if the result rows are not fully consumed
[ https://issues.apache.org/jira/browse/SPARK-25602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25602: Assignee: Wenchen Fan (was: Apache Spark) > range metrics can be wrong if the result rows are not fully consumed > > > Key: SPARK-25602 > URL: https://issues.apache.org/jira/browse/SPARK-25602 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList
[ https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636957#comment-16636957 ] Paul Praet commented on SPARK-21402: Still there in Spark 2.3.1. > Java encoders - switch fields on collectAsList > -- > > Key: SPARK-21402 > URL: https://issues.apache.org/jira/browse/SPARK-21402 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: mac os > spark 2.1.1 > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121 >Reporter: Tom >Priority: Major > > I have the following schema in a dataset - > root > |-- userId: string (nullable = true) > |-- data: map (nullable = true) > ||-- key: string > ||-- value: struct (valueContainsNull = true) > |||-- startTime: long (nullable = true) > |||-- endTime: long (nullable = true) > |-- offset: long (nullable = true) > And I have the following classes (+ setters and getters, which I omitted for simplicity) - > > {code:java} > public class MyClass { > private String userId; > private Map<String, MyDTO> data; > private Long offset; > } > public class MyDTO { > private long startTime; > private long endTime; > } > {code} > I collect the result the following way - > {code:java} > Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class); > Dataset<MyClass> results = raw_df.as(myClassEncoder); > List<MyClass> lst = results.collectAsList(); > {code} > > I do several calculations to get the result I want, and the result is correct > all the way through, until I collect it. 
> This is the result for - > {code:java} > results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false); > {code} > +--------------------------+------------------------+ > |data[2017-07-01].startTime|data[2017-07-01].endTime| > +--------------------------+------------------------+ > |1498854000 |1498870800 | > +--------------------------+------------------------+ > This is the result after collecting the results for - > {code:java} > MyClass userData = results.collectAsList().get(0); > MyDTO userDTO = userData.getData().get("2017-07-01"); > System.out.println("userDTO startTime: " + userDTO.getStartTime()); > System.out.println("userDTO endTime: " + userDTO.getEndTime()); > {code} > -- > userDTO startTime: 1498870800 > userDTO endTime: 1498854000 > I tend to believe it is a Spark issue. Would love any suggestions on how to > bypass it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
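The swapped startTime/endTime values look like fields being matched to the bean by position rather than by name. The ticket does not show the Spark-internal cause, so the following plain-Python sketch (with hypothetical field lists) only illustrates the failure mode: positional mapping corrupts data whenever the target's field order differs from the schema order, while name-based mapping does not.

```python
# Schema order of the struct in the dataset: (startTime, endTime).
schema_fields = ["startTime", "endTime"]
row = (1498854000, 1498870800)  # values in schema order

# The target "bean" declares its fields in a different order, e.g. the
# alphabetical getter order a bean introspector might produce.
bean_fields = ["endTime", "startTime"]

# Buggy: zip values to bean fields by position -> the fields get swapped.
by_position = dict(zip(bean_fields, row))

# Correct: look each bean field up by name in the schema.
by_name = {f: row[schema_fields.index(f)] for f in bean_fields}

assert by_position["startTime"] != by_name["startTime"]
assert by_name == {"startTime": 1498854000, "endTime": 1498870800}
```

The positional variant reproduces exactly the symptom in the comment above: `startTime` comes back as 1498870800 and `endTime` as 1498854000.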
[jira] [Comment Edited] (SPARK-25062) Clean up BlockLocations in FileStatus objects
[ https://issues.apache.org/jira/browse/SPARK-25062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636900#comment-16636900 ] Dongjoon Hyun edited comment on SPARK-25062 at 10/3/18 12:43 PM: - Hi, [~stanand99]. According to your description, this is an improvement, isn't it? was (Author: dongjoon): Hi, [~petertoth]. According to your description, this is an improvement, isn't it? > Clean up BlockLocations in FileStatus objects > - > > Key: SPARK-25062 > URL: https://issues.apache.org/jira/browse/SPARK-25062 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.2 >Reporter: andrzej.stankev...@gmail.com >Priority: Major > > When Spark lists a collection of files, it does it on the driver or creates tasks > to list the files, depending on the number of files; see > [https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L170] > If Spark creates tasks to list files, each task creates one FileStatus object > per file. Before sending a FileStatus to the driver, Spark converts it to a > SerializableFileStatus. On the driver side, Spark turns the SerializableFileStatus > back into a FileStatus, and it also creates a BlockLocation object for each > FileStatus using > > {code:java} > new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length) > {code} > > After deserialization on the driver side, the BlockLocation doesn't have a lot of > the information that the original HDFSBlockLocation had. > > If Spark does the listing on the driver side, the FileStatus objects have > HDFSBlockLocation objects, which carry a lot of info that Spark doesn't use. > Because of this, the FileStatus objects take more memory than if they were created > on the executor side. > > Later, Spark puts all these objects into _SharedInMemoryCache_, and that cache > takes 2.2x more memory if the files were listed on the driver side than if they were > listed on the executor side. 
> > In our case, _SharedInMemoryCache_ takes 125M when we do the scan on executors > and 270M when we do it on the driver. That is for about 19000 files. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
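The 270M vs 125M gap described above comes from caching rich HDFS-side objects whose extra fields Spark never reads. A generic mitigation is to copy only the needed fields into a slim record before the object enters a long-lived cache. The sketch below is a hypothetical Python illustration of that pattern, not Spark's actual fix; the class and field names are invented, apart from the four fields Spark passes to `new BlockLocation(names, hosts, offset, length)`.

```python
from dataclasses import dataclass

# A "rich" location object, standing in for HDFS's block location with
# extra fields (topology paths, storage ids, ...) the consumer never reads.
@dataclass
class RichBlockLocation:
    names: tuple
    hosts: tuple
    offset: int
    length: int
    topology_paths: tuple = ()
    storage_ids: tuple = ()

# The slim record keeps only what the planner actually uses -- the same
# four fields passed to new BlockLocation(names, hosts, offset, length).
@dataclass(frozen=True)
class SlimBlockLocation:
    names: tuple
    hosts: tuple
    offset: int
    length: int

def slim(loc: RichBlockLocation) -> SlimBlockLocation:
    """Drop unused fields before the object goes into a long-lived cache."""
    return SlimBlockLocation(loc.names, loc.hosts, loc.offset, loc.length)

rich = RichBlockLocation(("b1",), ("host1",), 0, 128, ("/rack/1",), ("s1",))
cached = slim(rich)
assert (cached.offset, cached.length) == (0, 128)
```

The cached object no longer carries the rich-only fields at all, which is where the memory saving comes from.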
[jira] [Commented] (SPARK-25062) Clean up BlockLocations in FileStatus objects
[ https://issues.apache.org/jira/browse/SPARK-25062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636900#comment-16636900 ] Dongjoon Hyun commented on SPARK-25062: --- Hi, [~petertoth]. According to your description, this is an improvement, isn't it? > Clean up BlockLocations in FileStatus objects > - > > Key: SPARK-25062 > URL: https://issues.apache.org/jira/browse/SPARK-25062 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2 >Reporter: andrzej.stankev...@gmail.com >Priority: Major > > When Spark lists a collection of files, it does it on the driver or creates tasks > to list the files, depending on the number of files; see > [https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L170] > If Spark creates tasks to list files, each task creates one FileStatus object > per file. Before sending a FileStatus to the driver, Spark converts it to a > SerializableFileStatus. On the driver side, Spark turns the SerializableFileStatus > back into a FileStatus, and it also creates a BlockLocation object for each > FileStatus using > > {code:java} > new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length) > {code} > > After deserialization on the driver side, the BlockLocation doesn't have a lot of > the information that the original HDFSBlockLocation had. > > If Spark does the listing on the driver side, the FileStatus objects have > HDFSBlockLocation objects, which carry a lot of info that Spark doesn't use. > Because of this, the FileStatus objects take more memory than if they were created > on the executor side. > > Later, Spark puts all these objects into _SharedInMemoryCache_, and that cache > takes 2.2x more memory if the files were listed on the driver side than if they were > listed on the executor side. > > In our case, _SharedInMemoryCache_ takes 125M when we do the scan on executors > and 270M when we do it on the driver. That is for about 19000 files. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25062) Clean up BlockLocations in FileStatus objects
[ https://issues.apache.org/jira/browse/SPARK-25062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25062: -- Issue Type: Improvement (was: Bug) > Clean up BlockLocations in FileStatus objects > - > > Key: SPARK-25062 > URL: https://issues.apache.org/jira/browse/SPARK-25062 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.2 >Reporter: andrzej.stankev...@gmail.com >Priority: Major > > When Spark lists a collection of files, it does it on the driver or creates tasks > to list the files, depending on the number of files; see > [https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L170] > If Spark creates tasks to list files, each task creates one FileStatus object > per file. Before sending a FileStatus to the driver, Spark converts it to a > SerializableFileStatus. On the driver side, Spark turns the SerializableFileStatus > back into a FileStatus, and it also creates a BlockLocation object for each > FileStatus using > > {code:java} > new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length) > {code} > > After deserialization on the driver side, the BlockLocation doesn't have a lot of > the information that the original HDFSBlockLocation had. > > If Spark does the listing on the driver side, the FileStatus objects have > HDFSBlockLocation objects, which carry a lot of info that Spark doesn't use. > Because of this, the FileStatus objects take more memory than if they were created > on the executor side. > > Later, Spark puts all these objects into _SharedInMemoryCache_, and that cache > takes 2.2x more memory if the files were listed on the driver side than if they were > listed on the executor side. > > In our case, _SharedInMemoryCache_ takes 125M when we do the scan on executors > and 270M when we do it on the driver. That is for about 19000 files. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25602) range metrics can be wrong if the result rows are not fully consumed
Wenchen Fan created SPARK-25602: --- Summary: range metrics can be wrong if the result rows are not fully consumed Key: SPARK-25602 URL: https://issues.apache.org/jira/browse/SPARK-25602 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25436) Bump master branch version to 2.5.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-25436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636833#comment-16636833 ] Dongjoon Hyun commented on SPARK-25436: --- I updated the versions to 3.0.0 since, as of yesterday, we no longer have 2.5.0. It may look weird because `Bump to 2.5.0-SNAPSHOT` sits under the `3.0.0` version number. However, it's necessary so that we don't lose this issue in the next release notes. > Bump master branch version to 2.5.0-SNAPSHOT > > > Key: SPARK-25436 > URL: https://issues.apache.org/jira/browse/SPARK-25436 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 3.0.0 > > > This patch bumps the master branch version to `2.5.0-SNAPSHOT`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25436) Bump master branch version to 2.5.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-25436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25436: -- Affects Version/s: (was: 2.5.0) 3.0.0 > Bump master branch version to 2.5.0-SNAPSHOT > > > Key: SPARK-25436 > URL: https://issues.apache.org/jira/browse/SPARK-25436 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 3.0.0 > > > This patch bumps the master branch version to `2.5.0-SNAPSHOT`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16323) Avoid unnecessary cast when doing integral divide
[ https://issues.apache.org/jira/browse/SPARK-16323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16323: -- Affects Version/s: (was: 2.5.0) 3.0.0 > Avoid unnecessary cast when doing integral divide > - > > Key: SPARK-16323 > URL: https://issues.apache.org/jira/browse/SPARK-16323 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Sean Zhong >Assignee: Marco Gaido >Priority: Minor > Fix For: 3.0.0 > > > This is a follow-up of issue SPARK-15776. > *Problem:* > For the integer divide operator {{div}}: > {code} > scala> spark.sql("select 6 div 3").explain(true) > ... > == Analyzed Logical Plan == > CAST((6 / 3) AS BIGINT): bigint > Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS CAST((6 / > 3) AS BIGINT)#5L] > +- OneRowRelation$ > ... > {code} > For performance reasons, we should not do the unnecessary casts {{cast(xx as > double)}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
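Routing an integral divide through a double is not just a performance problem: a 64-bit double cannot represent every 64-bit integer, so the round-trip in the analyzed plan above can silently change the answer. A small Python check makes this concrete (Python's `int` and `float` stand in for bigint and double here):

```python
# bigint div bigint computed exactly vs. via a double round-trip,
# mirroring cast(cast(a as double) / cast(b as double) as bigint).
a = 2**53 + 1  # smallest positive integer a 64-bit double cannot represent
b = 1

exact = a // b                          # integral divide, no cast
via_double = int(float(a) / float(b))   # the plan's double round-trip

assert exact == 2**53 + 1
assert via_double == 2**53  # the double cast silently dropped the +1
```

So eliminating the cast is both a speedup and a correctness improvement for large operands.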
[jira] [Updated] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata
[ https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25423: -- Affects Version/s: (was: 2.5.0) 3.0.0 > Output "dataFilters" in DataSourceScanExec.metadata > --- > > Key: SPARK-25423 > URL: https://issues.apache.org/jira/browse/SPARK-25423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maryann Xue >Assignee: Yuming Wang >Priority: Trivial > Labels: starter > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25390) data source V2 API refactoring
[ https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25390: -- Target Version/s: 3.0.0 (was: 2.5.0) > data source V2 API refactoring > -- > > Key: SPARK-25390 > URL: https://issues.apache.org/jira/browse/SPARK-25390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently it's not very clear how we should abstract the data source V2 API. The > abstraction should be unified between batch and streaming, or similar but with a > well-defined difference between the two. The > abstraction should also include catalog/table. > An example of the abstraction: > {code} > batch: catalog -> table -> scan > streaming: catalog -> table -> stream -> scan > {code} > We should refactor the data source V2 API according to this abstraction. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25390) data source V2 API refactoring
[ https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25390: -- Affects Version/s: (was: 2.5.0) 3.0.0 > data source V2 API refactoring > -- > > Key: SPARK-25390 > URL: https://issues.apache.org/jira/browse/SPARK-25390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently it's not very clear how we should abstract the data source V2 API. The > abstraction should be unified between batch and streaming, or similar but with a > well-defined difference between the two. The > abstraction should also include catalog/table. > An example of the abstraction: > {code} > batch: catalog -> table -> scan > streaming: catalog -> table -> stream -> scan > {code} > We should refactor the data source V2 API according to this abstraction. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
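The catalog -> table -> (stream ->) scan layering in the ticket can be made concrete. The class and method names below are hypothetical, invented just to show how batch and streaming would share everything except the extra `stream` level; they are not Spark's actual DSv2 interfaces.

```python
# Hypothetical sketch of the proposed layering; names are illustrative.
class Scan:
    def read(self):
        return ["row1", "row2"]

class Stream:
    """Streaming adds one level: a stream hands out per-epoch scans."""
    def scan_for_epoch(self, epoch: int) -> Scan:
        return Scan()

class Table:
    def new_scan(self) -> Scan:       # batch:     table -> scan
        return Scan()
    def new_stream(self) -> Stream:   # streaming: table -> stream -> scan
        return Stream()

class Catalog:
    def load_table(self, name: str) -> Table:
        return Table()

catalog = Catalog()
batch_rows = catalog.load_table("t").new_scan().read()
stream_rows = catalog.load_table("t").new_stream().scan_for_epoch(0).read()
assert batch_rows == stream_rows
```

The point of the shape is that the batch and streaming paths differ only in the one extra `stream` hop, so most of the API surface can be shared.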
[jira] [Updated] (SPARK-25444) Refactor GenArrayData.genCodeToCreateArrayData() method
[ https://issues.apache.org/jira/browse/SPARK-25444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25444: -- Affects Version/s: (was: 2.5.0) 3.0.0 > Refactor GenArrayData.genCodeToCreateArrayData() method > --- > > Key: SPARK-25444 > URL: https://issues.apache.org/jira/browse/SPARK-25444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 3.0.0 > > > {{GenArrayData.genCodeToCreateArrayData()}} generated Java code to create a > temporary Java array to create {{ArrayData}}. It can be eliminated by using > {{ArrayData.createArrayData}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25442) Support STS to run in K8S deployment with spark deployment mode as cluster
[ https://issues.apache.org/jira/browse/SPARK-25442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25442: -- Affects Version/s: (was: 2.5.0) 3.0.0 > Support STS to run in K8S deployment with spark deployment mode as cluster > -- > > Key: SPARK-25442 > URL: https://issues.apache.org/jira/browse/SPARK-25442 > Project: Spark > Issue Type: Bug > Components: Kubernetes, SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Suryanarayana Garlapati >Priority: Major > > STS fails to start in Kubernetes deployments when the Spark deploy mode is > cluster. Support should be added to make it run in K8S deployments. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25457) IntegralDivide (div) should not always return long
[ https://issues.apache.org/jira/browse/SPARK-25457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25457: -- Affects Version/s: (was: 2.5.0) 3.0.0 > IntegralDivide (div) should not always return long > -- > > Key: SPARK-25457 > URL: https://issues.apache.org/jira/browse/SPARK-25457 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Major > Fix For: 3.0.0 > > > The operation {{div}} always returns long. This comes from Hive's behavior, > which differs from that of most other DBMSes (e.g. MySQL, Postgres), which > return the same datatype as the operands. > This JIRA tracks changing our return type and allowing users to re-enable > the old behavior using {{spark.sql.legacy.integralDivide.returnBigint}}. > I'll submit a PR for this soon. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
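The proposed behavior, `div` returning the operands' type instead of always long, with a flag to restore the legacy result type, can be modeled in a few lines. The flag name `spark.sql.legacy.integralDivide.returnBigint` is from the ticket; the helper function itself is purely illustrative.

```python
# Illustrative model of IntegralDivide: the result is the value plus its
# SQL result type, which follows the operands unless the legacy flag
# (spark.sql.legacy.integralDivide.returnBigint) forces bigint.
def integral_divide(a: int, b: int, operand_type: str = "int",
                    legacy_return_bigint: bool = False):
    value = a // b  # exact integral division (non-negative operands here)
    result_type = "bigint" if legacy_return_bigint else operand_type
    return value, result_type

# New behavior: int div int -> int, matching MySQL/Postgres conventions.
assert integral_divide(6, 3, "int") == (2, "int")

# Legacy Hive-style behavior remains reachable via the flag.
assert integral_divide(6, 3, "int", legacy_return_bigint=True) == (2, "bigint")
```

The value itself never changes; only the advertised result type does, which is why a legacy flag is enough to preserve old query schemas.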
[jira] [Updated] (SPARK-25475) Refactor all benchmark to save the result as a separate file
[ https://issues.apache.org/jira/browse/SPARK-25475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25475: -- Affects Version/s: (was: 2.5.0) 3.0.0 > Refactor all benchmark to save the result as a separate file > > > Key: SPARK-25475 > URL: https://issues.apache.org/jira/browse/SPARK-25475 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > This is an umbrella issue to refactor all benchmarks to use a common style > using main-method (instead of `test` method) and saving the result as a > separate file (instead of embedding as comments). This is not only for > consistency, but also for making the benchmark-automation easy. SPARK-25339 > is finished as a reference model. > *Completed* > - FilterPushdownBenchmark.scala (SPARK-25339) > *Candidates* > - AggregateBenchmark.scala > - AvroWriteBenchmark.scala (SPARK-24777) > - ColumnarBatchBenchmark.scala > - CompressionSchemeBenchmark.scala > - DataSourceReadBenchmark.scala > - DataSourceWriteBenchmark.scala (SPARK-24777) > - DatasetBenchmark.scala > - ExternalAppendOnlyUnsafeRowArrayBenchmark.scala > - HashBenchmark.scala > - HashByteArrayBenchmark.scala > - JoinBenchmark.scala > - KryoBenchmark.scala > - MiscBenchmark.scala > - ObjectHashAggregateExecBenchmark.scala > - OrcReadBenchmark.scala > - PrimitiveArrayBenchmark.scala > - SortBenchmark.scala > - SynthBenchmark.scala > - TPCDSQueryBenchmark.scala > - UDTSerializationBenchmark.scala > - UnsafeArrayDataBenchmark.scala > - UnsafeProjectionBenchmark.scala > - WideSchemaBenchmark.scala > Candidates will be reviewed and converted as a subtask of this JIRA. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
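The umbrella's target shape, a benchmark runnable via a main method that writes its result to a separate file instead of pasting it into source comments, can be sketched briefly. The structure and the result-file name are illustrative, not the actual Spark benchmark framework:

```python
import os
import tempfile
import time

def time_one(fn):
    """Wall-clock one invocation of fn, in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def run_case(name, fn, iters=3):
    """Time one benchmark case a few times and keep the best run."""
    best = min(time_one(fn) for _ in range(iters))
    return f"{name}: {best * 1000:.3f} ms"

def main(out_dir):
    # Main entry point: run every case and save the results to a
    # separate file instead of embedding them as source comments.
    lines = [run_case("sum-1M", lambda: sum(range(1_000_000)))]
    out_path = os.path.join(out_dir, "ExampleBenchmark-results.txt")
    with open(out_path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return out_path

result_file = main(tempfile.mkdtemp())
assert os.path.exists(result_file)
```

Keeping results in a file next to the benchmark is what makes the automation mentioned in the ticket easy: a CI job can rerun `main` and diff the file rather than parse comments out of source code.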
[jira] [Updated] (SPARK-25458) Support FOR ALL COLUMNS in ANALYZE TABLE
[ https://issues.apache.org/jira/browse/SPARK-25458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25458: -- Affects Version/s: (was: 2.5.0) 3.0.0 > Support FOR ALL COLUMNS in ANALYZE TABLE > - > > Key: SPARK-25458 > URL: https://issues.apache.org/jira/browse/SPARK-25458 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Dilip Biswal >Priority: Major > Fix For: 3.0.0 > > > Currently, to collect the statistics of all the columns, users need to > specify the names of all the columns when calling the command "ANALYZE TABLE > ... FOR COLUMNS...". This is not user friendly. Instead, we can introduce the > following SQL command to achieve it without specifying the column names. > {code:java} >ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR ALL COLUMNS; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25476) Refactor AggregateBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25476: -- Affects Version/s: (was: 2.5.0) 3.0.0 > Refactor AggregateBenchmark to use main method > -- > > Key: SPARK-25476 > URL: https://issues.apache.org/jira/browse/SPARK-25476 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org