[jira] [Updated] (SPARK-25602) SparkPlan.getByteArrayRdd should not consume the input when not necessary

2018-10-03 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25602:

Summary: SparkPlan.getByteArrayRdd should not consume the input when not 
necessary  (was: range metrics can be wrong if the result rows are not fully 
consumed)
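
The previous summary points at the underlying symptom: if getByteArrayRdd eagerly 
drains its input iterator just to serialize it, per-row SQL metrics (such as a Range 
operator's row count) are updated for rows the consumer never asks for. A toy sketch 
of that effect, outside Spark and with made-up names:

{code:scala}
// Toy illustration (not Spark code): an eager drain updates a per-row counter
// for every row even though the caller only needs a few, while a lazy iterator
// only counts what is actually pulled.
import java.util.concurrent.atomic.LongAdder

object MetricDrainDemo {
  def main(args: Array[String]): Unit = {
    val numOutputRows = new LongAdder                  // stand-in for a SQL metric
    def rows: Iterator[Int] =
      Iterator.range(0, 1000000).map { i => numOutputRows.increment(); i }

    // Eager: materialize everything before handing it to the caller.
    rows.toArray.iterator.take(5).foreach(_ => ())
    println(s"eager drain, metric = ${numOutputRows.sum()}")   // 1000000

    numOutputRows.reset()

    // Lazy: the caller pulls only what it needs, so the metric stays accurate.
    rows.take(5).foreach(_ => ())
    println(s"lazy pull, metric = ${numOutputRows.sum()}")     // 5
  }
}
{code}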

> SparkPlan.getByteArrayRdd should not consume the input when not necessary
> -
>
> Key: SPARK-25602
> URL: https://issues.apache.org/jira/browse/SPARK-25602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25601) Register Grouped aggregate UDF Vectorized UDFs for SQL Statement

2018-10-03 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25601.
--
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 3.0.0
   2.4.0

Fixed in https://github.com/apache/spark/pull/22620

> Register Grouped aggregate UDF Vectorized UDFs for SQL Statement
> 
>
> Key: SPARK-25601
> URL: https://issues.apache.org/jira/browse/SPARK-25601
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0, 3.0.0
>
>
> Make it possible to register grouped aggregate (vectorized) UDFs and then use 
> them in a SQL statement.
> For example,
> {code}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> @pandas_udf("integer", PandasUDFType.GROUPED_AGG)  # doctest: +SKIP
> def sum_udf(v):
>     return v.sum()
> spark.udf.register("sum_udf", sum_udf)  # doctest: +SKIP
> q = "SELECT sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2"
> spark.sql(q).show()
> +-----------+
> |sum_udf(v1)|
> +-----------+
> |          1|
> |          5|
> +-----------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None

2018-10-03 Thread Chongyuan Xiang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637657#comment-16637657
 ] 

Chongyuan Xiang edited comment on SPARK-25461 at 10/4/18 12:37 AM:
---

Hi all, thanks for looking into the issue! As a follow-up, I noticed that there 
were similar issues with casting to float as well. Here I reuse my example with 
the return type changed to FloatType: 

Script: 

 
{code:java}
import pandas as pd
import random
import pyspark
from pyspark.sql.functions import col, lit, pandas_udf

values = [None] * 3 + [1.0] * 17 + [2.0] * 600
random.shuffle(values)
pdf = pd.DataFrame({'A': values})
df = spark.createDataFrame(pdf)

@pandas_udf(returnType=pyspark.sql.types.FloatType())
def gt_2(column):
    return (column >= 2).where(column.notnull())

calculated_df = (df.select(['A'])
.withColumn('potential_bad_col', gt_2('A'))
)

calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) | 
(col("A").isNull()))

calculated_df.filter(col("A") == 2).show(30)
{code}
 

Output:
{code:java}
+---+-----------------+-----------+
|  A|potential_bad_col|correct_col|
+---+-----------------+-----------+
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
+---+-----------------+-----------+{code}
 


was (Author: xiangcy):
Hi all, thanks for looking into the issue! As a follow up, I noticed that there 
were similar issues with casting to float as well. Just reusing my example and 
changing the return type to be `FloatType`: 

Script: 

 
{code:java}
import pandas as pd
import random
import pyspark
from pyspark.sql.functions import col, lit, pandas_udf

values = [None] * 3 + [1.0] * 17 + [2.0] * 600
random.shuffle(values)
pdf = pd.DataFrame({'A': values})
df = spark.createDataFrame(pdf)

@pandas_udf(returnType=pyspark.sql.types.FloatType())
def gt_2(column):
    return (column >= 2).where(column.notnull())

calculated_df = (df.select(['A'])
.withColumn('potential_bad_col', gt_2('A'))
)

calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) | 
(col("A").isNull()))

calculated_df.filter(col("A") == 2).show(30)
{code}
 

Output:
{code:java}
+---+-----------------+-----------+
|  A|potential_bad_col|correct_col|
+---+-----------------+-----------+
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
+---+-----------------+-----------+{code}
 

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> 
>
> Key: SPARK-25461
> URL: https://issues.apache.org/jira/browse/SPARK-25461
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: I reproduced this issue by running pyspark locally on 
> mac:
> Spark version: 2.3.1 pre-built with Hadoop 2.7
> Python library versions: pyarrow==0.10.0, pandas==0.20.2
>Reporter: Chongyuan Xiang
>Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column 
> given column 'A'. When column 'A' contains None, the results look incorrect.
> Script: 
>  
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
> values = [None] * 3 + [1.0] * 17 + [2.0] * 600
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
>     return (column >= 2).where(column.notnull())
> calculated_df = (df.select(['A'])
> .withColumn('potential_bad_col', gt_2('A'))
> )
> calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) 
> | (col("A").isNull()))
> calculated_df.show()
> {code}
>  
> Output:
> {code:java}
> +---+-+---+
> | A|potential_bad_col|correct_col|
> +---+-+---+
> |2.0| false| true|
> |2.0| false| 

[jira] [Commented] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None

2018-10-03 Thread Chongyuan Xiang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637657#comment-16637657
 ] 

Chongyuan Xiang commented on SPARK-25461:
-

Hi all, thanks for looking into the issue! As a follow-up, I noticed that there 
were similar issues with casting to float as well. Here I reuse my example with 
the return type changed to `FloatType`: 

Script: 

 
{code:java}
import pandas as pd
import random
import pyspark
from pyspark.sql.functions import col, lit, pandas_udf

values = [None] * 3 + [1.0] * 17 + [2.0] * 600
random.shuffle(values)
pdf = pd.DataFrame({'A': values})
df = spark.createDataFrame(pdf)

@pandas_udf(returnType=pyspark.sql.types.FloatType())
def gt_2(column):
    return (column >= 2).where(column.notnull())

calculated_df = (df.select(['A'])
.withColumn('potential_bad_col', gt_2('A'))
)

calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) | 
(col("A").isNull()))

calculated_df.filter(col("A") == 2).show(30)
{code}
 

Output:
{code:java}
+---+-----------------+-----------+
|  A|potential_bad_col|correct_col|
+---+-----------------+-----------+
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              1.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
|2.0|              0.0|       true|
+---+-----------------+-----------+{code}
 

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> 
>
> Key: SPARK-25461
> URL: https://issues.apache.org/jira/browse/SPARK-25461
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: I reproduced this issue by running pyspark locally on 
> mac:
> Spark version: 2.3.1 pre-built with Hadoop 2.7
> Python library versions: pyarrow==0.10.0, pandas==0.20.2
>Reporter: Chongyuan Xiang
>Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column 
> given column 'A'. When column 'A' contains None, the results look incorrect.
> Script: 
>  
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
> values = [None] * 3 + [1.0] * 17 + [2.0] * 600
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
>     return (column >= 2).where(column.notnull())
> calculated_df = (df.select(['A'])
> .withColumn('potential_bad_col', gt_2('A'))
> )
> calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) 
> | (col("A").isNull()))
> calculated_df.show()
> {code}
>  
> Output:
> {code:java}
> +---+-+---+
> | A|potential_bad_col|correct_col|
> +---+-+---+
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |1.0| false| false|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> +---+-+---+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input 
> column does not contain None.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None

2018-10-03 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637626#comment-16637626
 ] 

Bryan Cutler edited comment on SPARK-25461 at 10/3/18 11:53 PM:


I filed ARROW-3428, which deals with the incorrect cast from float to bool


was (Author: bryanc):
I file ARROW-3428, which deals with the incorrect cast from float to bool

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> 
>
> Key: SPARK-25461
> URL: https://issues.apache.org/jira/browse/SPARK-25461
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: I reproduced this issue by running pyspark locally on 
> mac:
> Spark version: 2.3.1 pre-built with Hadoop 2.7
> Python library versions: pyarrow==0.10.0, pandas==0.20.2
>Reporter: Chongyuan Xiang
>Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column 
> given column 'A'. When column 'A' contains None, the results look incorrect.
> Script: 
>  
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
> values = [None] * 3 + [1.0] * 17 + [2.0] * 600
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
>     return (column >= 2).where(column.notnull())
> calculated_df = (df.select(['A'])
> .withColumn('potential_bad_col', gt_2('A'))
> )
> calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) 
> | (col("A").isNull()))
> calculated_df.show()
> {code}
>  
> Output:
> {code:java}
> +---+-+---+
> | A|potential_bad_col|correct_col|
> +---+-+---+
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |1.0| false| false|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> +---+-+---+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input 
> column does not contain None.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-10-03 Thread Steven Rand (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637630#comment-16637630
 ] 

Steven Rand commented on SPARK-25538:
-

Thanks all!

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Assignee: Marco Gaido
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.0
>
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe the 
> issue was introduced by SPARK-23713 because I can't reproduce it before that 
> commit, but I can reproduce it at that commit and later, as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions illustrating the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data that also reproduces the issue, and 
> will upload it if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}
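
As a rough cross-check on an affected build (a sketch, not from the ticket; it 
reuses the ticket's placeholder path {{hdfs:///data}}), one can compare a few 
counting paths that should all agree. Note that {{distinct}} is implemented as 
{{dropDuplicates}}, so the {{groupBy}} route is the more independent check:

{code:scala}
// Sketch for spark-shell: three ways of counting distinct rows that should
// all return the same number on a correct build.
import org.apache.spark.sql.functions.col

val df = spark.read.parquet("hdfs:///data")

val viaDistinct = df.distinct.count
val viaDropDup  = df.dropDuplicates().count
// Grouping by every column and counting the groups is an independent path.
val viaGroupBy  = df.groupBy(df.columns.map(col): _*).count().count()

println(s"distinct=$viaDistinct dropDuplicates=$viaDropDup groupBy=$viaGroupBy")
{code}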



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None

2018-10-03 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637626#comment-16637626
 ] 

Bryan Cutler commented on SPARK-25461:
--

I file ARROW-3428, which deals with the incorrect cast from float to bool

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> 
>
> Key: SPARK-25461
> URL: https://issues.apache.org/jira/browse/SPARK-25461
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: I reproduced this issue by running pyspark locally on 
> mac:
> Spark version: 2.3.1 pre-built with Hadoop 2.7
> Python library versions: pyarrow==0.10.0, pandas==0.20.2
>Reporter: Chongyuan Xiang
>Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column 
> given column 'A'. When column 'A' contains None, the results look incorrect.
> Script: 
>  
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
> values = [None] * 3 + [1.0] * 17 + [2.0] * 600
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
>     return (column >= 2).where(column.notnull())
> calculated_df = (df.select(['A'])
> .withColumn('potential_bad_col', gt_2('A'))
> )
> calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) 
> | (col("A").isNull()))
> calculated_df.show()
> {code}
>  
> Output:
> {code:java}
> +---+-+---+
> | A|potential_bad_col|correct_col|
> +---+-+---+
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |1.0| false| false|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> +---+-+---+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input 
> column does not contain None.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25586) toString method of GeneralizedLinearRegressionTrainingSummary runs in infinite loop throwing StackOverflowError

2018-10-03 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-25586:
---
Issue Type: Bug  (was: Improvement)

> toString method of GeneralizedLinearRegressionTrainingSummary runs in 
> infinite loop throwing StackOverflowError
> ---
>
> Key: SPARK-25586
> URL: https://issues.apache.org/jira/browse/SPARK-25586
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core
>Affects Versions: 2.3.0
>Reporter: Ankur Gupta
>Assignee: Ankur Gupta
>Priority: Minor
> Fix For: 3.0.0
>
>
> After the change in SPARK-25118, which enables spark-shell to run with the 
> default log level, test_glr_summary started failing with a StackOverflowError.
> Cause: ClosureCleaner calls logDebug on various objects, and when it is called 
> for GeneralizedLinearRegressionTrainingSummary it starts a Spark job that 
> runs into an infinite loop and fails with the exception below.
> {code}
> ==
> ERROR: test_glr_summary (pyspark.ml.tests.TrainingSummaryTest)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests.py", 
> line 1809, in test_glr_summary
> self.assertTrue(isinstance(s.aic, float))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/regression.py",
>  line 1781, in aic
> return self._call_java("aic")
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/wrapper.py",
>  line 55, in _call_java
> return _java2py(sc, m(*java_args))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py",
>  line 63, in deco
> return f(*a, **kw)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o31639.aic.
> : java.lang.StackOverflowError
>   at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
>   at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
>   at java.io.File.exists(File.java:819)
>   at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1245)
>   at sun.misc.URLClassPath$FileLoader.findResource(URLClassPath.java:1212)
>   at sun.misc.URLClassPath.findResource(URLClassPath.java:188)
>   at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
>   at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
>   at java.lang.ClassLoader.getResource(ClassLoader.java:1093)
>   at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
>   at java.lang.Class.getResourceAsStream(Class.java:2223)
>   at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:43)
>   at 
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:87)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:269)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2342)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:864)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:863)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:364)
>   at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:863)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:613)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> 

[jira] [Assigned] (SPARK-25637) SparkException: Could not find CoarseGrainedScheduler occurs during the application stop

2018-10-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25637:


Assignee: (was: Apache Spark)

> SparkException: Could not find CoarseGrainedScheduler occurs during the 
> application stop
> 
>
> Key: SPARK-25637
> URL: https://issues.apache.org/jira/browse/SPARK-25637
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Devaraj K
>Priority: Minor
>
> {code:xml}
> 2018-10-03 14:51:33 ERROR Inbox:91 - Ignoring error
> org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
> at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
> at 
> org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:140)
> at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:187)
> at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:528)
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:449)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:638)
> at 
> org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:201)
> at 
> org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:197)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> at 
> org.apache.spark.HeartbeatReceiver.org$apache$spark$HeartbeatReceiver$$expireDeadHosts(HeartbeatReceiver.scala:197)
> at 
> org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1.applyOrElse(HeartbeatReceiver.scala:120)
> at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:105)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
> at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> SPARK-14228 fixed these kinds of errors, but this still occurs while 
> performing reviveOffers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25637) SparkException: Could not find CoarseGrainedScheduler occurs during the application stop

2018-10-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25637:


Assignee: Apache Spark

> SparkException: Could not find CoarseGrainedScheduler occurs during the 
> application stop
> 
>
> Key: SPARK-25637
> URL: https://issues.apache.org/jira/browse/SPARK-25637
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Devaraj K
>Assignee: Apache Spark
>Priority: Minor
>
> {code:xml}
> 2018-10-03 14:51:33 ERROR Inbox:91 - Ignoring error
> org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
> at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
> at 
> org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:140)
> at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:187)
> at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:528)
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:449)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:638)
> at 
> org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:201)
> at 
> org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:197)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> at 
> org.apache.spark.HeartbeatReceiver.org$apache$spark$HeartbeatReceiver$$expireDeadHosts(HeartbeatReceiver.scala:197)
> at 
> org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1.applyOrElse(HeartbeatReceiver.scala:120)
> at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:105)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
> at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> SPARK-14228 fixed these kinds of errors, but this still occurs while 
> performing reviveOffers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25637) SparkException: Could not find CoarseGrainedScheduler occurs during the application stop

2018-10-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637604#comment-16637604
 ] 

Apache Spark commented on SPARK-25637:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/22625

> SparkException: Could not find CoarseGrainedScheduler occurs during the 
> application stop
> 
>
> Key: SPARK-25637
> URL: https://issues.apache.org/jira/browse/SPARK-25637
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Devaraj K
>Priority: Minor
>
> {code:xml}
> 2018-10-03 14:51:33 ERROR Inbox:91 - Ignoring error
> org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
> at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
> at 
> org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:140)
> at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:187)
> at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:528)
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:449)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:638)
> at 
> org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:201)
> at 
> org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:197)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> at 
> org.apache.spark.HeartbeatReceiver.org$apache$spark$HeartbeatReceiver$$expireDeadHosts(HeartbeatReceiver.scala:197)
> at 
> org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1.applyOrElse(HeartbeatReceiver.scala:120)
> at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:105)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
> at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> SPARK-14228 fixed these kinds of errors, but this still occurs while 
> performing reviveOffers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25637) SparkException: Could not find CoarseGrainedScheduler occurs during the application stop

2018-10-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637606#comment-16637606
 ] 

Apache Spark commented on SPARK-25637:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/22625

> SparkException: Could not find CoarseGrainedScheduler occurs during the 
> application stop
> 
>
> Key: SPARK-25637
> URL: https://issues.apache.org/jira/browse/SPARK-25637
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Devaraj K
>Priority: Minor
>
> {code:xml}
> 2018-10-03 14:51:33 ERROR Inbox:91 - Ignoring error
> org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
> at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
> at 
> org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:140)
> at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:187)
> at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:528)
> at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:449)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:638)
> at 
> org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:201)
> at 
> org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:197)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> at 
> org.apache.spark.HeartbeatReceiver.org$apache$spark$HeartbeatReceiver$$expireDeadHosts(HeartbeatReceiver.scala:197)
> at 
> org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1.applyOrElse(HeartbeatReceiver.scala:120)
> at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:105)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
> at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> SPARK-14228 fixed these kinds of errors, but this still occurs while 
> performing reviveOffers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25586) toString method of GeneralizedLinearRegressionTrainingSummary runs in infinite loop throwing StackOverflowError

2018-10-03 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637605#comment-16637605
 ] 

Marcelo Vanzin commented on SPARK-25586:


bq. This is not a bug

Actually it's a bug if you set your log level to DEBUG and happen to be using 
that class... regardless of the other change.
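
For illustration, a minimal sketch (hypothetical class, not the Spark source) of 
the cycle described in the ticket: a toString that launches a job whose closure 
captures the object will recurse once DEBUG logging makes ClosureCleaner 
stringify that object again.

{code:scala}
// Hypothetical repro shape; whether it actually recurses depends on the log
// level (DEBUG) and the Spark version, as discussed above.
import org.apache.spark.sql.SparkSession

class JobBackedSummary(spark: SparkSession) {
  private val offset = 1

  override def toString: String = {
    // The closure captures `this` via `offset`; with DEBUG logging enabled,
    // closure cleaning logs the captured object, which calls toString again,
    // which starts another job, and so on until a StackOverflowError.
    val n = spark.sparkContext.parallelize(1 to 10).map(_ + offset).count()
    s"JobBackedSummary(n=$n)"
  }
}
{code}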

> toString method of GeneralizedLinearRegressionTrainingSummary runs in 
> infinite loop throwing StackOverflowError
> ---
>
> Key: SPARK-25586
> URL: https://issues.apache.org/jira/browse/SPARK-25586
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core
>Affects Versions: 2.3.0
>Reporter: Ankur Gupta
>Assignee: Ankur Gupta
>Priority: Minor
> Fix For: 3.0.0
>
>
> After the change in SPARK-25118, which enables spark-shell to run with the 
> default log level, test_glr_summary started failing with a StackOverflowError.
> Cause: ClosureCleaner calls logDebug on various objects, and when it is called 
> for GeneralizedLinearRegressionTrainingSummary it starts a Spark job that 
> runs into an infinite loop and fails with the exception below.
> {code}
> ==
> ERROR: test_glr_summary (pyspark.ml.tests.TrainingSummaryTest)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests.py", 
> line 1809, in test_glr_summary
> self.assertTrue(isinstance(s.aic, float))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/regression.py",
>  line 1781, in aic
> return self._call_java("aic")
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/wrapper.py",
>  line 55, in _call_java
> return _java2py(sc, m(*java_args))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py",
>  line 63, in deco
> return f(*a, **kw)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o31639.aic.
> : java.lang.StackOverflowError
>   at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
>   at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
>   at java.io.File.exists(File.java:819)
>   at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1245)
>   at sun.misc.URLClassPath$FileLoader.findResource(URLClassPath.java:1212)
>   at sun.misc.URLClassPath.findResource(URLClassPath.java:188)
>   at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
>   at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
>   at java.lang.ClassLoader.getResource(ClassLoader.java:1093)
>   at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
>   at java.lang.Class.getResourceAsStream(Class.java:2223)
>   at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:43)
>   at 
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:87)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:269)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2342)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:864)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:863)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:364)
>   at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:863)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:613)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> 

[jira] [Resolved] (SPARK-25586) toString method of GeneralizedLinearRegressionTrainingSummary runs in infinite loop throwing StackOverflowError

2018-10-03 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25586.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22616
[https://github.com/apache/spark/pull/22616]

> toString method of GeneralizedLinearRegressionTrainingSummary runs in 
> infinite loop throwing StackOverflowError
> ---
>
> Key: SPARK-25586
> URL: https://issues.apache.org/jira/browse/SPARK-25586
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core
>Affects Versions: 2.3.0
>Reporter: Ankur Gupta
>Assignee: Ankur Gupta
>Priority: Minor
> Fix For: 3.0.0
>
>
> After the change in SPARK-25118, which enables spark-shell to run with the 
> default log level, test_glr_summary started failing with a StackOverflowError.
> Cause: ClosureCleaner calls logDebug on various objects, and when it is called 
> for GeneralizedLinearRegressionTrainingSummary it starts a Spark job that 
> runs into an infinite loop and fails with the exception below.
> {code}
> ==
> ERROR: test_glr_summary (pyspark.ml.tests.TrainingSummaryTest)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests.py", 
> line 1809, in test_glr_summary
> self.assertTrue(isinstance(s.aic, float))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/regression.py",
>  line 1781, in aic
> return self._call_java("aic")
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/wrapper.py",
>  line 55, in _call_java
> return _java2py(sc, m(*java_args))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py",
>  line 63, in deco
> return f(*a, **kw)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o31639.aic.
> : java.lang.StackOverflowError
>   at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
>   at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
>   at java.io.File.exists(File.java:819)
>   at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1245)
>   at sun.misc.URLClassPath$FileLoader.findResource(URLClassPath.java:1212)
>   at sun.misc.URLClassPath.findResource(URLClassPath.java:188)
>   at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
>   at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
>   at java.lang.ClassLoader.getResource(ClassLoader.java:1093)
>   at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
>   at java.lang.Class.getResourceAsStream(Class.java:2223)
>   at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:43)
>   at 
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:87)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:269)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2342)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:864)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:863)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:364)
>   at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:863)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:613)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   

[jira] [Assigned] (SPARK-25586) toString method of GeneralizedLinearRegressionTrainingSummary runs in infinite loop throwing StackOverflowError

2018-10-03 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25586:
--

Assignee: Ankur Gupta

> toString method of GeneralizedLinearRegressionTrainingSummary runs in 
> infinite loop throwing StackOverflowError
> ---
>
> Key: SPARK-25586
> URL: https://issues.apache.org/jira/browse/SPARK-25586
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core
>Affects Versions: 2.3.0
>Reporter: Ankur Gupta
>Assignee: Ankur Gupta
>Priority: Minor
> Fix For: 3.0.0
>
>
> After the change in SPARK-25118, which enables spark-shell to run with the 
> default log level, test_glr_summary started failing with a StackOverflowError.
> Cause: ClosureCleaner calls logDebug on various objects, and when it is called 
> for GeneralizedLinearRegressionTrainingSummary it starts a Spark job that 
> runs into an infinite loop and fails with the exception below.
> {code}
> ==
> ERROR: test_glr_summary (pyspark.ml.tests.TrainingSummaryTest)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests.py", 
> line 1809, in test_glr_summary
> self.assertTrue(isinstance(s.aic, float))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/regression.py",
>  line 1781, in aic
> return self._call_java("aic")
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/wrapper.py",
>  line 55, in _call_java
> return _java2py(sc, m(*java_args))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py",
>  line 63, in deco
> return f(*a, **kw)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o31639.aic.
> : java.lang.StackOverflowError
>   at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
>   at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
>   at java.io.File.exists(File.java:819)
>   at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1245)
>   at sun.misc.URLClassPath$FileLoader.findResource(URLClassPath.java:1212)
>   at sun.misc.URLClassPath.findResource(URLClassPath.java:188)
>   at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
>   at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
>   at java.lang.ClassLoader.getResource(ClassLoader.java:1093)
>   at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
>   at java.lang.Class.getResourceAsStream(Class.java:2223)
>   at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:43)
>   at 
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:87)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:269)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2342)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:864)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:863)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:364)
>   at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:863)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:613)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> 

[jira] [Created] (SPARK-25637) SparkException: Could not find CoarseGrainedScheduler occurs during the application stop

2018-10-03 Thread Devaraj K (JIRA)
Devaraj K created SPARK-25637:
-

 Summary: SparkException: Could not find CoarseGrainedScheduler 
occurs during the application stop
 Key: SPARK-25637
 URL: https://issues.apache.org/jira/browse/SPARK-25637
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.2
Reporter: Devaraj K


{code:xml}
2018-10-03 14:51:33 ERROR Inbox:91 - Ignoring error
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
at 
org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:140)
at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:187)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:528)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:449)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:638)
at 
org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:201)
at 
org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:197)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at 
org.apache.spark.HeartbeatReceiver.org$apache$spark$HeartbeatReceiver$$expireDeadHosts(HeartbeatReceiver.scala:197)
at 
org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1.applyOrElse(HeartbeatReceiver.scala:120)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:105)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}
SPARK-14228 fixed these kinds of errors, but this still occurs while 
performing reviveOffers.
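
A minimal sketch of the general mitigation idea, with hypothetical names and not 
the change proposed in https://github.com/apache/spark/pull/22625: treat the 
"Could not find CoarseGrainedScheduler" error as benign when it races with 
application stop, rather than surfacing it as an ERROR.

{code:scala}
// Hypothetical helper, for illustration only.
import org.apache.spark.SparkException

object ShutdownTolerantCall {
  /** Run `op`, ignoring the scheduler-endpoint-gone error seen during stop. */
  def ignoreIfSchedulerGone[T](op: => T): Option[T] =
    try Some(op)
    catch {
      case e: SparkException
          if Option(e.getMessage).exists(_.contains("Could not find CoarseGrainedScheduler")) =>
        None // the driver endpoint is already unregistered during shutdown
    }
}
{code}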



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23781) Merge YARN and Mesos token renewal code

2018-10-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637582#comment-16637582
 ] 

Apache Spark commented on SPARK-23781:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22624

> Merge YARN and Mesos token renewal code
> ---
>
> Key: SPARK-23781
> URL: https://issues.apache.org/jira/browse/SPARK-23781
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, YARN
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> With the fix for SPARK-23361, the code that handles delegation tokens in 
> Mesos and YARN ends up being very similar.
> We should refactor it so that both backends share the same code, which would 
> also make it easier for other cluster managers to reuse it.
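
As a rough sketch of the direction, with entirely hypothetical names (the real 
code lives elsewhere in Spark): a single renewal loop that each backend 
parameterizes, instead of two near-copies.

{code:scala}
// Hypothetical shared renewal loop; each backend implements the two hooks.
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.hadoop.security.Credentials

trait DelegationTokenRenewal {
  /** Obtain fresh tokens; returns the credentials and the next renewal time (epoch ms). */
  protected def obtainTokens(): (Credentials, Long)
  /** Ship the credentials to the driver / running executors. */
  protected def distributeTokens(creds: Credentials): Unit

  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  final def startRenewal(): Unit = scheduleNext(0L)

  private def scheduleNext(delayMs: Long): Unit =
    scheduler.schedule(new Runnable {
      override def run(): Unit = {
        val (creds, nextRenewalAt) = obtainTokens()
        distributeTokens(creds)
        scheduleNext(math.max(0L, nextRenewalAt - System.currentTimeMillis()))
      }
    }, delayMs, TimeUnit.MILLISECONDS)
}
{code}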



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23781) Merge YARN and Mesos token renewal code

2018-10-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637581#comment-16637581
 ] 

Apache Spark commented on SPARK-23781:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22624

> Merge YARN and Mesos token renewal code
> ---
>
> Key: SPARK-23781
> URL: https://issues.apache.org/jira/browse/SPARK-23781
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, YARN
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> With the fix for SPARK-23361, the code that handles delegation tokens in 
> Mesos and YARN ends up being very similar.
> We should refactor it so that both backends share the same code, which would 
> also make it easier for other cluster managers to reuse it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23781) Merge YARN and Mesos token renewal code

2018-10-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23781:


Assignee: Apache Spark

> Merge YARN and Mesos token renewal code
> ---
>
> Key: SPARK-23781
> URL: https://issues.apache.org/jira/browse/SPARK-23781
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, YARN
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Major
>
> With the fix for SPARK-23361, the code that handles delegation tokens in 
> Mesos and YARN ends up being very similar.
> We should refactor it so that both backends share the same code, which would 
> also make it easier for other cluster managers to reuse it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23781) Merge YARN and Mesos token renewal code

2018-10-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23781:


Assignee: (was: Apache Spark)

> Merge YARN and Mesos token renewal code
> ---
>
> Key: SPARK-23781
> URL: https://issues.apache.org/jira/browse/SPARK-23781
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, YARN
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> With the fix for SPARK-23361, the code that handles delegation tokens in 
> Mesos and YARN ends up being very similar.
> We should refactor that code so that both backends are sharing the same code, 
> which also would make it easier for other cluster managers to use that code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25005) Structured streaming doesn't support kafka transaction (creating empty offset with abort & markers)

2018-10-03 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637564#comment-16637564
 ] 

Shixiong Zhu commented on SPARK-25005:
--

[~qambard] Not sure about your question. If the Kafka consumer fetches nothing, 
it will not update the position.

And yes, if a partition is full of invisible messages, we have to wait for the 
timeout. I don't see any API to avoid this.

> Structured streaming doesn't support kafka transaction (creating empty offset 
> with abort & markers)
> ---
>
> Key: SPARK-25005
> URL: https://issues.apache.org/jira/browse/SPARK-25005
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Quentin Ambard
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.0
>
>
> Structured streaming can't consume kafka transaction. 
> We could try to apply SPARK-24720 (DStream) logic to Structured Streaming 
> source



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25005) Structured streaming doesn't support kafka transaction (creating empty offset with abort & markers)

2018-10-03 Thread Quentin Ambard (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637560#comment-16637560
 ] 

Quentin Ambard commented on SPARK-25005:


Ok, I see, great idea. And the consumer guarantees that the position won't be 
updated if poll returns an empty list for any reason?

Also, if a partition is full of invisible messages due to transaction aborts, 
we'll have to wait for the poll timeout every time (at least that's what I see 
in my tests). It could hurt throughput, especially if we have to wait for each 
partition. Not sure how we could solve that...

> Structured streaming doesn't support kafka transaction (creating empty offset 
> with abort & markers)
> ---
>
> Key: SPARK-25005
> URL: https://issues.apache.org/jira/browse/SPARK-25005
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Quentin Ambard
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.0
>
>
> Structured streaming can't consume kafka transaction. 
> We could try to apply SPARK-24720 (DStream) logic to Structured Streaming 
> source



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25005) Structured streaming doesn't support kafka transaction (creating empty offset with abort & markers)

2018-10-03 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637541#comment-16637541
 ] 

Shixiong Zhu commented on SPARK-25005:
--

[~qambard] If `poll` returns and the offset has changed, it means the Kafka 
consumer fetched something but all of the messages are invisible, so the 
consumer returns an empty result.

If `poll` returns but the offset doesn't change, it means Kafka fetched nothing 
before the timeout. In this case, we just throw "TimeoutException". Spark will 
retry the task or just fail the job. A large GC pause can cause a timeout, and 
the user should tune the configs to avoid this happening. We cannot do much in 
Spark.
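
For readers following the thread, here is a minimal standalone sketch of the 
distinction described above, using the plain Kafka consumer API (this is not 
Spark's actual KafkaDataConsumer code; the broker address, topic, partition and 
timeout are placeholder values):

{code}
import java.time.Duration
import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

// Minimal sketch, assuming a local broker and a topic written with transactions.
object PollSemanticsSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    // Only committed transactional data is visible; aborted data and
    // commit/abort markers are skipped by the consumer.
    props.put("isolation.level", "read_committed")

    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
    val tp = new TopicPartition("some-topic", 0) // placeholder topic/partition
    consumer.assign(Collections.singletonList(tp))

    val before = consumer.position(tp)
    val records = consumer.poll(Duration.ofSeconds(2)) // placeholder timeout
    val after = consumer.position(tp)

    if (records.isEmpty && after > before) {
      // Case 1: data was fetched but every record was invisible
      // (aborted transaction data or transaction markers): position advanced.
      println(s"skipped invisible records, position moved $before -> $after")
    } else if (records.isEmpty) {
      // Case 2: nothing fetched before the timeout, position unchanged;
      // this is the situation treated as a TimeoutException.
      println("poll timed out without fetching anything")
    } else {
      println(s"fetched ${records.count()} visible records")
    }
    consumer.close()
  }
}
{code}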

> Structured streaming doesn't support kafka transaction (creating empty offset 
> with abort & markers)
> ---
>
> Key: SPARK-25005
> URL: https://issues.apache.org/jira/browse/SPARK-25005
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Quentin Ambard
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.0
>
>
> Structured streaming can't consume kafka transaction. 
> We could try to apply SPARK-24720 (DStream) logic to Structured Streaming 
> source



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25005) Structured streaming doesn't support kafka transaction (creating empty offset with abort & markers)

2018-10-03 Thread Quentin Ambard (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637532#comment-16637532
 ] 

Quentin Ambard commented on SPARK-25005:


How do you tell the difference between data loss and data simply missing when 
.poll() doesn't return any value, [~zsxwing]? Correct me if I'm wrong, but you 
could lose data in this situation, no?

I think there is a third case here 
[https://github.com/zsxwing/spark/blob/ea804cfe840196519cc9444be9bedf03d10aa11a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala#L474]
 which is: something went wrong, data is available in Kafka but I failed to 
get it.
I've seen it happen when the max poll size is big with big messages and the 
heap is getting full. The messages exist, but the JVM lags and the consumer 
times out before getting them.

> Structured streaming doesn't support kafka transaction (creating empty offset 
> with abort & markers)
> ---
>
> Key: SPARK-25005
> URL: https://issues.apache.org/jira/browse/SPARK-25005
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Quentin Ambard
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.0
>
>
> Structured streaming can't consume kafka transaction. 
> We could try to apply SPARK-24720 (DStream) logic to Structured Streaming 
> source



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25636) spark-submit swallows the failure reason when there is an error connecting to master

2018-10-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25636:


Assignee: Apache Spark

> spark-submit swallows the failure reason when there is an error connecting to 
> master
> 
>
> Key: SPARK-25636
> URL: https://issues.apache.org/jira/browse/SPARK-25636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Devaraj K
>Assignee: Apache Spark
>Priority: Minor
>
> {code:xml}
> [apache-spark]$ ./bin/spark-submit --verbose --master spark://
> 
> Error: Exception thrown in awaitResult:
> Run with --help for usage help or --verbose for debug output
> {code}
> When spark-submit cannot connect to the master, the failure reason is not 
> shown. I think it should display the cause of the problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25636) spark-submit swallows the failure reason when there is an error connecting to master

2018-10-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637530#comment-16637530
 ] 

Apache Spark commented on SPARK-25636:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/22623

> spark-submit swallows the failure reason when there is an error connecting to 
> master
> 
>
> Key: SPARK-25636
> URL: https://issues.apache.org/jira/browse/SPARK-25636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Devaraj K
>Priority: Minor
>
> {code:xml}
> [apache-spark]$ ./bin/spark-submit --verbose --master spark://
> 
> Error: Exception thrown in awaitResult:
> Run with --help for usage help or --verbose for debug output
> {code}
> When spark-submit cannot connect to the master, the failure reason is not 
> shown. I think it should display the cause of the problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25636) spark-submit swallows the failure reason when there is an error connecting to master

2018-10-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25636:


Assignee: (was: Apache Spark)

> spark-submit swallows the failure reason when there is an error connecting to 
> master
> 
>
> Key: SPARK-25636
> URL: https://issues.apache.org/jira/browse/SPARK-25636
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Devaraj K
>Priority: Minor
>
> {code:xml}
> [apache-spark]$ ./bin/spark-submit --verbose --master spark://
> 
> Error: Exception thrown in awaitResult:
> Run with --help for usage help or --verbose for debug output
> {code}
> When spark-submit cannot connect to the master, the failure reason is not 
> shown. I think it should display the cause of the problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25636) spark-submit swallows the failure reason when there is an error connecting to master

2018-10-03 Thread Devaraj K (JIRA)
Devaraj K created SPARK-25636:
-

 Summary: spark-submit swallows the failure reason when there is an 
error connecting to master
 Key: SPARK-25636
 URL: https://issues.apache.org/jira/browse/SPARK-25636
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.2
Reporter: Devaraj K


{code:xml}
[apache-spark]$ ./bin/spark-submit --verbose --master spark://

Error: Exception thrown in awaitResult:
Run with --help for usage help or --verbose for debug output
{code}

When spark-submit cannot connect to the master, the failure reason is not 
shown. I think it should display the cause of the problem.
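
As a minimal sketch of the kind of change that would help (the helper name and 
call site are hypothetical, not the actual SparkSubmit code path), the error 
report could unwrap the cause chain before printing:

{code}
import scala.annotation.tailrec

object ErrorReportingSketch {

  // Walk the cause chain to the innermost, most informative exception.
  @tailrec
  def rootCause(t: Throwable): Throwable =
    if (t.getCause == null || t.getCause == t) t else rootCause(t.getCause)

  // Hypothetical reporting helper a submit client could call on failure.
  def reportError(e: Throwable): Unit = {
    val root = rootCause(e)
    // Print both the top-level message and the underlying cause, e.g. a
    // java.net.ConnectException when the master URL is unreachable.
    Console.err.println(s"Error: ${e.getMessage}")
    Console.err.println(s"Caused by: ${root.getClass.getName}: ${root.getMessage}")
  }
}
{code}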




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25635) Support selective direct encoding in native ORC write

2018-10-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25635:


Assignee: Dongjoon Hyun  (was: Apache Spark)

> Support selective direct encoding in native ORC write
> -
>
> Key: SPARK-25635
> URL: https://issues.apache.org/jira/browse/SPARK-25635
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Before ORC 1.5.3, `orc.dictionary.key.threshold` and 
> `hive.exec.orc.dictionary.key.size.threshold` are applied to all columns. 
> This is a big hurdle to enabling dictionary encoding.
> From ORC 1.5.3, `orc.column.encoding.direct` is added to enforce direct 
> encoding selectively in a column-wise manner. This issue aims to add that 
> feature by upgrading ORC from 1.5.2 to 1.5.3.
> The following are the patches in ORC 1.5.3; this feature is the only one 
> related to Spark directly.
> {code}
> ORC-406: ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts 
> multi-byte data (gopalv)
> ORC-403: [C++] Add checks to avoid invalid offsets in InputStream
> ORC-405. Remove calcite as a dependency from the benchmarks.
> ORC-375: Fix libhdfs on gcc7 by adding #include  two places.
> ORC-383: Parallel builds fails with ConcurrentModificationException
> ORC-382: Apache rat exclusions + add rat check to travis
> ORC-401: Fix incorrect quoting in specification.
> ORC-385. Change RecordReader to extend Closeable.
> ORC-384: [C++] fix memory leak when loading non-ORC files
> ORC-391: [c++] parseType does not accept underscore in the field name
> ORC-397. Allow selective disabling of dictionary encoding. Original patch was 
> by Mithun Radhakrishnan.
> ORC-389: Add ability to not decode Acid metadata columns
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25635) Support selective direct encoding in native ORC write

2018-10-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637506#comment-16637506
 ] 

Apache Spark commented on SPARK-25635:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22622

> Support selective direct encoding in native ORC write
> -
>
> Key: SPARK-25635
> URL: https://issues.apache.org/jira/browse/SPARK-25635
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Before ORC 1.5.3, `orc.dictionary.key.threshold` and 
> `hive.exec.orc.dictionary.key.size.threshold` are applied to all columns. 
> This is a big hurdle to enabling dictionary encoding.
> From ORC 1.5.3, `orc.column.encoding.direct` is added to enforce direct 
> encoding selectively in a column-wise manner. This issue aims to add that 
> feature by upgrading ORC from 1.5.2 to 1.5.3.
> The following are the patches in ORC 1.5.3; this feature is the only one 
> related to Spark directly.
> {code}
> ORC-406: ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts 
> multi-byte data (gopalv)
> ORC-403: [C++] Add checks to avoid invalid offsets in InputStream
> ORC-405. Remove calcite as a dependency from the benchmarks.
> ORC-375: Fix libhdfs on gcc7 by adding #include  two places.
> ORC-383: Parallel builds fails with ConcurrentModificationException
> ORC-382: Apache rat exclusions + add rat check to travis
> ORC-401: Fix incorrect quoting in specification.
> ORC-385. Change RecordReader to extend Closeable.
> ORC-384: [C++] fix memory leak when loading non-ORC files
> ORC-391: [c++] parseType does not accept underscore in the field name
> ORC-397. Allow selective disabling of dictionary encoding. Original patch was 
> by Mithun Radhakrishnan.
> ORC-389: Add ability to not decode Acid metadata columns
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25635) Support selective direct encoding in native ORC write

2018-10-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25635:


Assignee: Apache Spark  (was: Dongjoon Hyun)

> Support selective direct encoding in native ORC write
> -
>
> Key: SPARK-25635
> URL: https://issues.apache.org/jira/browse/SPARK-25635
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> Before ORC 1.5.3, `orc.dictionary.key.threshold` and 
> `hive.exec.orc.dictionary.key.size.threshold` are applied to all columns. 
> This is a big hurdle to enabling dictionary encoding.
> From ORC 1.5.3, `orc.column.encoding.direct` is added to enforce direct 
> encoding selectively in a column-wise manner. This issue aims to add that 
> feature by upgrading ORC from 1.5.2 to 1.5.3.
> The following are the patches in ORC 1.5.3; this feature is the only one 
> related to Spark directly.
> {code}
> ORC-406: ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts 
> multi-byte data (gopalv)
> ORC-403: [C++] Add checks to avoid invalid offsets in InputStream
> ORC-405. Remove calcite as a dependency from the benchmarks.
> ORC-375: Fix libhdfs on gcc7 by adding #include  two places.
> ORC-383: Parallel builds fails with ConcurrentModificationException
> ORC-382: Apache rat exclusions + add rat check to travis
> ORC-401: Fix incorrect quoting in specification.
> ORC-385. Change RecordReader to extend Closeable.
> ORC-384: [C++] fix memory leak when loading non-ORC files
> ORC-391: [c++] parseType does not accept underscore in the field name
> ORC-397. Allow selective disabling of dictionary encoding. Original patch was 
> by Mithun Radhakrishnan.
> ORC-389: Add ability to not decode Acid metadata columns
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25635) Support selective direct encoding in native ORC write

2018-10-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25635:
-

Assignee: Dongjoon Hyun

> Support selective direct encoding in native ORC write
> -
>
> Key: SPARK-25635
> URL: https://issues.apache.org/jira/browse/SPARK-25635
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Before ORC 1.5.3, `orc.dictionary.key.threshold` and 
> `hive.exec.orc.dictionary.key.size.threshold` are applied to all columns. 
> This is a big hurdle to enabling dictionary encoding.
> From ORC 1.5.3, `orc.column.encoding.direct` is added to enforce direct 
> encoding selectively in a column-wise manner. This issue aims to add that 
> feature by upgrading ORC from 1.5.2 to 1.5.3.
> The following are the patches in ORC 1.5.3; this feature is the only one 
> related to Spark directly.
> {code}
> ORC-406: ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts 
> multi-byte data (gopalv)
> ORC-403: [C++] Add checks to avoid invalid offsets in InputStream
> ORC-405. Remove calcite as a dependency from the benchmarks.
> ORC-375: Fix libhdfs on gcc7 by adding #include  two places.
> ORC-383: Parallel builds fails with ConcurrentModificationException
> ORC-382: Apache rat exclusions + add rat check to travis
> ORC-401: Fix incorrect quoting in specification.
> ORC-385. Change RecordReader to extend Closeable.
> ORC-384: [C++] fix memory leak when loading non-ORC files
> ORC-391: [c++] parseType does not accept underscore in the field name
> ORC-397. Allow selective disabling of dictionary encoding. Original patch was 
> by Mithun Radhakrishnan.
> ORC-389: Add ability to not decode Acid metadata columns
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25635) Support selective direct encoding in native ORC write

2018-10-03 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-25635:
-

 Summary: Support selective direct encoding in native ORC write
 Key: SPARK-25635
 URL: https://issues.apache.org/jira/browse/SPARK-25635
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


Before ORC 1.5.3, `orc.dictionary.key.threshold` and 
`hive.exec.orc.dictionary.key.size.threshold` are applied to all columns. This 
is a big hurdle to enabling dictionary encoding.

From ORC 1.5.3, `orc.column.encoding.direct` is added to enforce direct 
encoding selectively in a column-wise manner. This issue aims to add that 
feature by upgrading ORC from 1.5.2 to 1.5.3.

The following are the patches in ORC 1.5.3; this feature is the only one 
related to Spark directly.
{code}
ORC-406: ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts 
multi-byte data (gopalv)
ORC-403: [C++] Add checks to avoid invalid offsets in InputStream
ORC-405. Remove calcite as a dependency from the benchmarks.
ORC-375: Fix libhdfs on gcc7 by adding #include  two places.
ORC-383: Parallel builds fails with ConcurrentModificationException
ORC-382: Apache rat exclusions + add rat check to travis
ORC-401: Fix incorrect quoting in specification.
ORC-385. Change RecordReader to extend Closeable.
ORC-384: [C++] fix memory leak when loading non-ORC files
ORC-391: [c++] parseType does not accept underscore in the field name
ORC-397. Allow selective disabling of dictionary encoding. Original patch was 
by Mithun Radhakrishnan.
ORC-389: Add ability to not decode Acid metadata columns
{code}
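
As a rough illustration (not taken from the pull request), once the upgrade 
lands the new knob could be used from Spark's native ORC writer as below, 
assuming `orc.column.encoding.direct` is forwarded to the ORC writer as a 
regular write option; the DataFrame, column name, and output path are 
placeholders:

{code}
import org.apache.spark.sql.SparkSession

object DirectEncodingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("orc-direct-encoding-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy data: "id" has high cardinality, "category" is repetitive.
    val df = (1 to 100000)
      .map(i => (i.toString, s"cat-${i % 3}"))
      .toDF("id", "category")

    df.write
      .format("orc")
      // Force direct (non-dictionary) encoding for the high-cardinality
      // column only; "category" can still be dictionary encoded.
      .option("orc.column.encoding.direct", "id")
      .mode("overwrite")
      .save("/tmp/orc-direct-encoding-sketch") // placeholder output path
    spark.stop()
  }
}
{code}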



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25633) Performance Improvement for Drools Spark Jobs.

2018-10-03 Thread Koushik (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637485#comment-16637485
 ] 

Koushik commented on SPARK-25633:
-

Yes, we can connect at 11 AM ET tomorrow.

> Performance Improvement for Drools Spark Jobs.
> --
>
> Key: SPARK-25633
> URL: https://issues.apache.org/jira/browse/SPARK-25633
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: [link title|http:[link 
> title|http://example.com]//example.com][link title|http://example.com][link 
> title|http://example.com]
>Reporter: Koushik
>Priority: Major
> Fix For: 2.2.0
>
> Attachments: RTTA Performance Issue.pptx
>
>
> We have the below region-wise compute instances on the performance 
> environment. When we reduce the compute instances, we face performance 
> issues. We have already done code optimization.
>  
> |Region|Compute instances on performance environment|
> |MWSW|6|
> |SE|6|
> |W|6|
> |Total|*18*|
>  
> For the above combination, 98% of the data is processed within 30 seconds, 
> but when we reduce instances, performance degrades.
>  
> We would provide all additional details to the respective support team on 
> request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17895) Improve documentation of "rowsBetween" and "rangeBetween"

2018-10-03 Thread Antonio Pedro de Sousa Vieira (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637464#comment-16637464
 ] 

Antonio Pedro de Sousa Vieira commented on SPARK-17895:
---

These changes seem to have been applied only to the Scala docs; the SparkR and 
PySpark docs are still identical for the two methods. Should this be reopened?

> Improve documentation of "rowsBetween" and "rangeBetween"
> -
>
> Key: SPARK-17895
> URL: https://issues.apache.org/jira/browse/SPARK-17895
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SparkR, SQL
>Reporter: Weiluo Ren
>Assignee: Weiluo Ren
>Priority: Minor
> Fix For: 2.1.0
>
>
> This is an issue found by [~junyangq] when he was fixing SparkR docs.
> In WindowSpec we have two methods "rangeBetween" and "rowsBetween" (See 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala#L82]).
>  However, the description of "rangeBetween" does not clearly differentiate it 
> from "rowsBetween". Even though in 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L109]
>  we have pretty nice description for "RangeFrame" and "RowFrame" which are 
> used in "rangeBetween" and "rowsBetween", I cannot find them in the online 
> Spark scala api. 
> We could add small examples to the description of "rangeBetween" and 
> "rowsBetween" like
> {code}
> val df = Seq(1,1,2).toDF("id")
> df.withColumn("sum", sum('id) over Window.orderBy('id).rangeBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  4|
>  * |  1|  4|
>  * |  2|  2|
>  * +---+---+
> */
> df.withColumn("sum", sum('id) over Window.orderBy('id).rowsBetween(0,1)).show
> /**
>  * It shows
>  * +---+---+
>  * | id|sum|
>  * +---+---+
>  * |  1|  2|
>  * |  1|  3|
>  * |  2|  2|
>  * +---+---+
> */
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25633) Performance Improvement for Drools Spark Jobs.

2018-10-03 Thread Koushik (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koushik updated SPARK-25633:

Attachment: RTTA Performance Issue.pptx

> Performance Improvement for Drools Spark Jobs.
> --
>
> Key: SPARK-25633
> URL: https://issues.apache.org/jira/browse/SPARK-25633
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: [link title|http:[link 
> title|http://example.com]//example.com][link title|http://example.com][link 
> title|http://example.com]
>Reporter: Koushik
>Priority: Major
> Fix For: 2.2.0
>
> Attachments: RTTA Performance Issue.pptx
>
>
> We have the below region-wise compute instances on the performance 
> environment. When we reduce the compute instances, we face performance 
> issues. We have already done code optimization.
>  
> |Region|Compute instances on performance environment|
> |MWSW|6|
> |SE|6|
> |W|6|
> |Total|*18*|
>  
> For the above combination, 98% of the data is processed within 30 seconds, 
> but when we reduce instances, performance degrades.
>  
> We would provide all additional details to the respective support team on 
> request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25634) New Metrics in External Shuffle Service to help identify abusing application

2018-10-03 Thread Ye Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637436#comment-16637436
 ] 

Ye Zhou commented on SPARK-25634:
-

[~felixcheung]  [~vanzin]  [~tgraves]  [~irashid]  [~zsxwing] More comments? 
Thanks

> New Metrics in External Shuffle Service to help identify abusing application
> 
>
> Key: SPARK-25634
> URL: https://issues.apache.org/jira/browse/SPARK-25634
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Ye Zhou
>Priority: Minor
>
> We run Spark on YARN, and deploy Spark external shuffle service as part of 
> YARN NM aux service. External Shuffle Service is shared by all Spark 
> applications. SPARK-24355 enables the threads reservation to handle 
> non-ChunkFetchRequest. SPARK-21501 limits the memory usage for Guava Cache to 
> avoid OOM in the shuffle service, which could crash the NodeManager. Still, 
> some applications may generate a large amount of shuffle blocks, which can 
> heavily decrease the performance of some shuffle servers. When this abusive 
> behavior happens, it may further decrease the overall performance for other 
> applications if they happen to use the same shuffle servers. We have been 
> seeing issues like this in our cluster, but there is no way for us to figure 
> out which application is abusing the shuffle service.
> SPARK-18364 has enabled exposing shuffle service metrics to the Hadoop 
> Metrics System. It would be better if we could have the following metrics, 
> and also metrics broken down by applicationID:
> 1. *shuffle server on-heap memory consumption for caching shuffle indexes*
> 2. *breakdown of shuffle indexes caching memory consumption by local 
> executors*
> We can generate metrics when 
> ExternalShuffleBlockHandler-->getSortBasedShuffleBlockData is called, which 
> will trigger the cache load. We can roughly derive the metrics from the 
> shuffle index file size when entries are put into and evicted from the cache.
> 3. *shuffle server load for shuffle block fetch requests*
> 4. *breakdown of shuffle server block fetch requests load by remote executors*
> We can generate metrics in ExternalShuffleBlockHandler-->handleMessage when a 
> new OpenBlocks message is received.
> Open discussion for more metrics that could potentially influence the overall 
> shuffle service performance. 
> We can print out those metrics, broken down by applicationID, in the log, 
> since it is hard to define a fixed key and use a numerical value for this 
> kind of metric. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25634) New Metrics in External Shuffle Service to help identify abusing application

2018-10-03 Thread Ye Zhou (JIRA)
Ye Zhou created SPARK-25634:
---

 Summary: New Metrics in External Shuffle Service to help identify 
abusing application
 Key: SPARK-25634
 URL: https://issues.apache.org/jira/browse/SPARK-25634
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 2.4.0
Reporter: Ye Zhou


We run Spark on YARN, and deploy Spark external shuffle service as part of YARN 
NM aux service. External Shuffle Service is shared by all Spark applications. 
SPARK-24355 enables the threads reservation to handle non-ChunkFetchRequest. 
SPARK-21501 limits the memory usage for Guava Cache to avoid OOM in shuffle 
service, which could crash the NodeManager. Still, some applications may 
generate a large amount of shuffle blocks, which can heavily decrease the 
performance of some shuffle servers. When this abusive behavior happens, it may 
further decrease the overall performance for other applications if they happen 
to use the same shuffle servers. We have been seeing issues like this in our 
cluster, but there is no way for us to figure out which application is abusing 
the shuffle service.

SPARK-18364 has enabled exposing shuffle service metrics to the Hadoop Metrics 
System. It would be better if we could have the following metrics, and also 
metrics broken down by applicationID:

1. *shuffle server on-heap memory consumption for caching shuffle indexes*

2. *breakdown of shuffle indexes caching memory consumption by local executors*

We can generate metrics when 
ExternalShuffleBlockHandler-->getSortBasedShuffleBlockData is called, which 
will trigger the cache load. We can roughly derive the metrics from the shuffle 
index file size when entries are put into and evicted from the cache.

3. *shuffle server load for shuffle block fetch requests*

4. *breakdown of shuffle server block fetch requests load by remote executors*

We can generate metrics in ExternalShuffleBlockHandler-->handleMessage when a 
new OpenBlocks message is received.

Open discussion for more metrics that could potentially influence the overall 
shuffle service performance. 

We can print out those metrics, broken down by applicationID, in the log, since 
it is hard to define a fixed key and use a numerical value for this kind of 
metric. 
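
A rough sketch of what a per-application breakdown could look like, building on 
the Dropwizard MetricRegistry that the shuffle service metrics already use; the 
class and metric names here are hypothetical, not an existing Spark API:

{code}
import com.codahale.metrics.MetricRegistry

// Hypothetical helper for counting shuffle service load per application.
// The registry would be the one the external shuffle service already exposes.
class PerAppShuffleMetrics(registry: MetricRegistry) {

  // Called when an OpenBlocks message for `appId` arrives in handleMessage.
  def recordOpenBlocks(appId: String, numBlocks: Int): Unit = {
    registry.counter(MetricRegistry.name("openBlockRequestsByApp", appId)).inc()
    registry.counter(MetricRegistry.name("blocksRequestedByApp", appId)).inc(numBlocks)
  }

  // Called when a shuffle index file of `sizeBytes` is loaded into the cache.
  def recordIndexCacheLoad(appId: String, sizeBytes: Long): Unit = {
    registry.counter(MetricRegistry.name("indexCacheBytesByApp", appId)).inc(sizeBytes)
  }
}
{code}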



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25633) Performance Improvement for Drools Spark Jobs.

2018-10-03 Thread Koushik (JIRA)
Koushik created SPARK-25633:
---

 Summary: Performance Improvement for Drools Spark Jobs.
 Key: SPARK-25633
 URL: https://issues.apache.org/jira/browse/SPARK-25633
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
 Environment: [link title|http:[link 
title|http://example.com]//example.com][link title|http://example.com][link 
title|http://example.com]
Reporter: Koushik
 Fix For: 2.2.0


We have the below region-wise compute instances on the performance environment. 
When we reduce the compute instances, we face performance issues.

We have already done code optimization.

 
|Region|Compute instances on performance environment|
|MWSW|6|
|SE|6|
|W|6|
|Total|*18*|

 

For the above combination, 98% of the data is processed within 30 seconds, but 
when we reduce instances, performance degrades.

 

We would provide all additional details to the respective support team on 
request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25632) KafkaRDDSuite: compacted topic 2 min 5 sec.

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25632:
---

 Summary: KafkaRDDSuite: compacted topic 2 min 5 sec.
 Key: SPARK-25632
 URL: https://issues.apache.org/jira/browse/SPARK-25632
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.streaming.kafka010.KafkaRDDSuite.compacted topic

Took 2 min 5 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25631) KafkaRDDSuite: basic usage 2 min 4 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25631:
---

 Summary: KafkaRDDSuite: basic usage 2 min 4 sec
 Key: SPARK-25631
 URL: https://issues.apache.org/jira/browse/SPARK-25631
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li



org.apache.spark.streaming.kafka010.KafkaRDDSuite.basic usage

Took 2 min 4 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25630) HiveOrcHadoopFsRelationSuite: SPARK-8406: Avoids name collision while writing files 21 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25630:
---

 Summary: HiveOrcHadoopFsRelationSuite: SPARK-8406: Avoids name 
collision while writing files 21 sec
 Key: SPARK-25630
 URL: https://issues.apache.org/jira/browse/SPARK-25630
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.hive.orc.HiveOrcHadoopFsRelationSuite.SPARK-8406: Avoids 
name collision while writing files

Took 21 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25629) ParquetFilterSuite: filter pushdown - decimal 16 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25629:
---

 Summary: ParquetFilterSuite: filter pushdown - decimal 16 sec
 Key: SPARK-25629
 URL: https://issues.apache.org/jira/browse/SPARK-25629
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.execution.datasources.parquet.ParquetFilterSuite.filter 
pushdown - decimal

Took 16 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25628) DistributedSuite: recover from repeated node failures during shuffle-reduce 40 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25628:
---

 Summary: DistributedSuite: recover from repeated node failures 
during shuffle-reduce 40 seconds
 Key: SPARK-25628
 URL: https://issues.apache.org/jira/browse/SPARK-25628
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.DistributedSuite.recover from repeated node failures during 
shuffle-reduce 40 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25627) ContinuousStressSuite - 8 mins 13 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25627:
---

 Summary: ContinuousStressSuite - 8 mins 13 sec
 Key: SPARK-25627
 URL: https://issues.apache.org/jira/browse/SPARK-25627
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


ContinuousStressSuite - 8 mins 13 sec



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25626) HiveClientSuites: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 46 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25626:
---

 Summary: HiveClientSuites: getPartitionsByFilter returns all 
partitions when hive.metastore.try.direct.sql=false 46 sec
 Key: SPARK-25626
 URL: https://issues.apache.org/jira/browse/SPARK-25626
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


HiveClientSuite.2.3: getPartitionsByFilter returns all partitions when 
hive.metastore.try.direct.sql=false  46 sec  Passed
HiveClientSuite.2.2: getPartitionsByFilter returns all partitions when 
hive.metastore.try.direct.sql=false  45 sec  Passed
HiveClientSuite.2.1: getPartitionsByFilter returns all partitions when 
hive.metastore.try.direct.sql=false  42 sec  Passed
HiveClientSuite.2.0: getPartitionsByFilter returns all partitions when 
hive.metastore.try.direct.sql=false  39 sec  Passed
HiveClientSuite.1.2: getPartitionsByFilter returns all partitions when 
hive.metastore.try.direct.sql=false  37 sec  Passed
HiveClientSuite.1.1: getPartitionsByFilter returns all partitions when 
hive.metastore.try.direct.sql=false  36 sec  Passed



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25625) LogisticRegressionSuite.binary logistic regression with intercept with ElasticNet regularization - 33 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25625:
---

 Summary: LogisticRegressionSuite.binary logistic regression with 
intercept with ElasticNet regularization - 33 sec
 Key: SPARK-25625
 URL: https://issues.apache.org/jira/browse/SPARK-25625
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


LogisticRegressionSuite.binary logistic regression with intercept with 
ElasticNet regularization

Took 33 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25624) LogisticRegressionSuite.multinomial logistic regression with intercept with elasticnet regularization 56 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25624:
---

 Summary: LogisticRegressionSuite.multinomial logistic regression 
with intercept with elasticnet regularization 56 seconds
 Key: SPARK-25624
 URL: https://issues.apache.org/jira/browse/SPARK-25624
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.ml.classification.LogisticRegressionSuite.multinomial logistic 
regression with intercept with elasticnet regularization

Took 56 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25623) LogisticRegressionSuite: multinomial logistic regression with intercept with L1 regularization 1 min 10 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25623:
---

 Summary: LogisticRegressionSuite: multinomial logistic regression 
with intercept with L1 regularization 1 min 10 sec
 Key: SPARK-25623
 URL: https://issues.apache.org/jira/browse/SPARK-25623
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.ml.classification.LogisticRegressionSuite.multinomial logistic 
regression with intercept with L1 regularization

Took 1 min 10 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25622) BucketedReadWithHiveSupportSuite: read partitioning bucketed tables with bucket pruning filters - 42 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25622:
---

 Summary: BucketedReadWithHiveSupportSuite: read partitioning 
bucketed tables with bucket pruning filters - 42 seconds
 Key: SPARK-25622
 URL: https://issues.apache.org/jira/browse/SPARK-25622
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.sources.BucketedReadWithHiveSupportSuite.read partitioning 
bucketed tables with bucket pruning filters

Took 42 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25621) BucketedReadWithHiveSupportSuite: read partitioning bucketed tables having composite filters 45 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25621:
---

 Summary: BucketedReadWithHiveSupportSuite: read partitioning 
bucketed tables having composite filters 45 sec
 Key: SPARK-25621
 URL: https://issues.apache.org/jira/browse/SPARK-25621
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.sources.BucketedReadWithHiveSupportSuite.read partitioning 
bucketed tables having composite filters

Took 45 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25619) WithAggregationKinesisStreamSuite: split and merge shards in a stream 2 min 15 sec

2018-10-03 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25619:

Description: 
org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.split and 
merge shards in a stream 2 min 15 sec

org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite.split 
and merge shards in a stream 1 min 52 sec.


  was:
org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.split and 
merge shards in a stream 2 min 15 sec



> WithAggregationKinesisStreamSuite: split and merge shards in a stream 2 min 
> 15 sec
> --
>
> Key: SPARK-25619
> URL: https://issues.apache.org/jira/browse/SPARK-25619
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.split 
> and merge shards in a stream 2 min 15 sec
> org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite.split 
> and merge shards in a stream 1 min 52 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25620) WithAggregationKinesisStreamSuite: failure recovery 1 min 36 seconds

2018-10-03 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25620:

Description: 
org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.failure 
recovery

Took 1 min 36 sec.

org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite.failure 
recovery

Took 1 min 24 sec.

  was:
org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.failure 
recovery

Took 1 min 36 sec.


> WithAggregationKinesisStreamSuite: failure recovery 1 min 36 seconds
> 
>
> Key: SPARK-25620
> URL: https://issues.apache.org/jira/browse/SPARK-25620
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.failure 
> recovery
> Took 1 min 36 sec.
> org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite.failure
>  recovery
> Took 1 min 24 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-25582) Error in Spark logs when using the org.apache.spark:spark-sql_2.11:2.2.0 Java library

2018-10-03 Thread Thomas Brugiere (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Brugiere reopened SPARK-25582:
-

> Error in Spark logs when using the org.apache.spark:spark-sql_2.11:2.2.0 Java 
> library
> -
>
> Key: SPARK-25582
> URL: https://issues.apache.org/jira/browse/SPARK-25582
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: Thomas Brugiere
>Priority: Major
> Attachments: fileA.csv, fileB.csv, fileC.csv
>
>
> I have noticed an error that appears in the Spark logs when using the Spark 
> SQL library in a Java 8 project.
> When I run the code below with the attached files as input, I can see the 
> ERROR below in the application logs.
> I am using the *org.apache.spark:spark-sql_2.11:2.2.0* library in my Java 
> project
> Note that the same logic implemented with the Python API (pyspark) doesn't 
> produce any Exception like this.
> *Code*
> {code:java}
> SparkConf conf = new SparkConf().setAppName("SparkBug").setMaster("local");
> SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
> Dataset df_a = sparkSession.read().option("header", 
> true).csv("local/fileA.csv").dropDuplicates();
> Dataset df_b = sparkSession.read().option("header", 
> true).csv("local/fileB.csv").dropDuplicates();
> Dataset df_c = sparkSession.read().option("header", 
> true).csv("local/fileC.csv").dropDuplicates();
> String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", "colE", 
> "colF"};
> String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", "colE"};
> Dataset df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), "left");
> Dataset df_inventory_2 = df_inventory_1.join(df_c, 
> arrayToSeq(key_join_2), "left");
> df_inventory_2.show();
> {code}
> *Error message*
> {code:java}
> 18/10/01 09:58:07 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
>     at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
>     at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
>     at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
>     at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284)
>     at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391)
>     at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
>     at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958)
>     at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:393)
>     at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:385)
>     at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1286)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:385)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1285)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:825)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:411)
>     at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:390)
>     at 
> 

[jira] [Resolved] (SPARK-25582) Error in Spark logs when using the org.apache.spark:spark-sql_2.11:2.2.0 Java library

2018-10-03 Thread Thomas Brugiere (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Brugiere resolved SPARK-25582.
-
Resolution: Later

> Error in Spark logs when using the org.apache.spark:spark-sql_2.11:2.2.0 Java 
> library
> -
>
> Key: SPARK-25582
> URL: https://issues.apache.org/jira/browse/SPARK-25582
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: Thomas Brugiere
>Priority: Major
> Attachments: fileA.csv, fileB.csv, fileC.csv
>
>
> I have noticed an error that appears in the Spark logs when using the Spark 
> SQL library in a Java 8 project.
> When I run the code below with the attached files as input, I can see the 
> ERROR below in the application logs.
> I am using the *org.apache.spark:spark-sql_2.11:2.2.0* library in my Java 
> project
> Note that the same logic implemented with the Python API (pyspark) doesn't 
> produce any Exception like this.
> *Code*
> {code:java}
> SparkConf conf = new SparkConf().setAppName("SparkBug").setMaster("local");
> SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
> Dataset df_a = sparkSession.read().option("header", 
> true).csv("local/fileA.csv").dropDuplicates();
> Dataset df_b = sparkSession.read().option("header", 
> true).csv("local/fileB.csv").dropDuplicates();
> Dataset df_c = sparkSession.read().option("header", 
> true).csv("local/fileC.csv").dropDuplicates();
> String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", "colE", 
> "colF"};
> String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", "colE"};
> Dataset df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), "left");
> Dataset df_inventory_2 = df_inventory_1.join(df_c, 
> arrayToSeq(key_join_2), "left");
> df_inventory_2.show();
> {code}
> *Error message*
> {code:java}
> 18/10/01 09:58:07 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
>     at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
>     at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
>     at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
>     at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284)
>     at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391)
>     at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
>     at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958)
>     at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:393)
>     at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:385)
>     at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1286)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:385)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1285)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:825)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:411)
>     at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:390)
>     at 
> 

[jira] [Created] (SPARK-25620) WithAggregationKinesisStreamSuite: failure recovery 1 min 36 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25620:
---

 Summary: WithAggregationKinesisStreamSuite: failure recovery 1 min 
36 seconds
 Key: SPARK-25620
 URL: https://issues.apache.org/jira/browse/SPARK-25620
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.failure 
recovery

Took 1 min 36 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25619) WithAggregationKinesisStreamSuite: split and merge shards in a stream 2 min 15 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25619:
---

 Summary: WithAggregationKinesisStreamSuite: split and merge shards 
in a stream 2 min 15 sec
 Key: SPARK-25619
 URL: https://issues.apache.org/jira/browse/SPARK-25619
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.split and 
merge shards in a stream 2 min 15 sec




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25501) Kafka delegation token support

2018-10-03 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637268#comment-16637268
 ] 

Gabor Somogyi commented on SPARK-25501:
---

Yeah, it's posted on the dev list.

To answer your question, the token is not Structured Streaming specific, it's 
generic. Any code path that uses the Kafka source/sink in kafka-0-10-sql can take 
advantage of it.
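
To make that concrete, here is a minimal sketch of a plain batch job going through the same kafka-0-10-sql source and sink (this is the code path that would pick up delegation tokens once the proposal lands). The broker address and topic names are placeholders, not values from this ticket.

{code:scala}
// Hedged sketch: batch read and write through the kafka-0-10-sql connector.
// "broker:9092", "input-topic" and "output-topic" are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-batch-sketch").getOrCreate()

// Batch read: no streaming query involved, yet the Kafka connector is still in play.
val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "input-topic")
  .load()

// Batch write back to Kafka through the same connector.
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "output-topic")
  .save()
{code}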

> Kafka delegation token support
> --
>
> Key: SPARK-25501
> URL: https://issues.apache.org/jira/browse/SPARK-25501
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Priority: Major
>  Labels: SPIP
>
> In Kafka version 1.1, delegation token support was released. As Spark has updated 
> its Kafka client to 2.0.0, it is now possible to implement delegation token 
> support. Please see the description: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25618) KafkaContinuousSourceStressForDontFailOnDataLossSuite: stress test for failOnDataLoss=false 1 min 1 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25618:
---

 Summary: KafkaContinuousSourceStressForDontFailOnDataLossSuite: 
stress test for failOnDataLoss=false 1 min 1 sec
 Key: SPARK-25618
 URL: https://issues.apache.org/jira/browse/SPARK-25618
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.kafka010.KafkaContinuousSourceStressForDontFailOnDataLossSuite.stress
 test for failOnDataLoss=false 1 min 1 sec




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25617) KafkaContinuousSinkSuite: generic - write big data with small producer buffer 56 secs

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25617:
---

 Summary: KafkaContinuousSinkSuite: generic - write big data with 
small producer buffer 56 secs
 Key: SPARK-25617
 URL: https://issues.apache.org/jira/browse/SPARK-25617
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.kafka010.KafkaContinuousSinkSuite.generic - write big data 
with small producer buffer 56 seconds 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25616) KafkaSinkSuite: generic - write big data with small producer buffer 57 secs

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25616:
---

 Summary: KafkaSinkSuite: generic - write big data with small 
producer buffer 57 secs
 Key: SPARK-25616
 URL: https://issues.apache.org/jira/browse/SPARK-25616
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.kafka010.KafkaSinkSuite.generic - write big data with 
small producer buffer 57 secs



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25615) KafkaSinkSuite: streaming - write to non-existing topic 1 min

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25615:
---

 Summary: KafkaSinkSuite: streaming - write to non-existing topic 1 
min
 Key: SPARK-25615
 URL: https://issues.apache.org/jira/browse/SPARK-25615
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.kafka010.KafkaSinkSuite.streaming - write to non-existing 
topic 1 min



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25614) HiveSparkSubmitSuite: SPARK-18989: DESC TABLE should not fail with format class not found 38 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25614:
---

 Summary: HiveSparkSubmitSuite: SPARK-18989: DESC TABLE should not 
fail with format class not found 38 seconds
 Key: SPARK-25614
 URL: https://issues.apache.org/jira/browse/SPARK-25614
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-18989: DESC TABLE should 
not fail with format class not found 38 seconds




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25613) HiveSparkSubmitSuite: dir 1 min 3 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25613:
---

 Summary: HiveSparkSubmitSuite: dir 1 min 3 seconds
 Key: SPARK-25613
 URL: https://issues.apache.org/jira/browse/SPARK-25613
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.hive.HiveSparkSubmitSuite.dir 1 min 3 sec



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25612) CompressionCodecSuite: table-level compression is not set but session-level compressions 47 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25612:
---

 Summary: CompressionCodecSuite: table-level compression is not set 
but session-level compressions 47 seconds
 Key: SPARK-25612
 URL: https://issues.apache.org/jira/browse/SPARK-25612
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.hive.CompressionCodecSuite.table-level compression is not 
set but session-level compressions is set 47 seconds




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25611) CompressionCodecSuite: both table-level and session-level compression are set 2 min 20 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25611:
---

 Summary: CompressionCodecSuite: both table-level and session-level 
compression are set 2 min 20 sec
 Key: SPARK-25611
 URL: https://issues.apache.org/jira/browse/SPARK-25611
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.hive.CompressionCodecSuite.both table-level and 
session-level compression are set: 2 min 20 sec



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25610) DatasetCacheSuite: cache UDF result correctly 25 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25610:
---

 Summary: DatasetCacheSuite: cache UDF result correctly 25 seconds
 Key: SPARK-25610
 URL: https://issues.apache.org/jira/browse/SPARK-25610
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.DatasetCacheSuite.cache UDF result correctly 25 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25609) DataFrameSuite: SPARK-22226: splitExpressions should not generate codes beyond 64KB 49 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25609:
---

 Summary: DataFrameSuite: SPARK-22226: splitExpressions should not 
generate codes beyond 64KB 49 seconds
 Key: SPARK-25609
 URL: https://issues.apache.org/jira/browse/SPARK-25609
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.DataFrameSuite.SPARK-22226: splitExpressions should not 
generate codes beyond 64KB 49 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25501) Kafka delegation token support

2018-10-03 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637248#comment-16637248
 ] 

Thomas Graves commented on SPARK-25501:
---

The SPIP title has "Structured Streaming"; is there some reason it is limited 
to Structured Streaming rather than being a generic "get tokens from Kafka" that anyone 
can request? Perhaps I'm doing a batch job and want to read from Kafka.

> Kafka delegation token support
> --
>
> Key: SPARK-25501
> URL: https://issues.apache.org/jira/browse/SPARK-25501
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Priority: Major
>  Labels: SPIP
>
> In Kafka version 1.1, delegation token support was released. As Spark has updated 
> its Kafka client to 2.0.0, it is now possible to implement delegation token 
> support. Please see the description: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25608) HashAggregationQueryWithControlledFallbackSuite: multiple distinct multiple columns sets 38 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25608:
---

 Summary: HashAggregationQueryWithControlledFallbackSuite: multiple 
distinct multiple columns sets 38 seconds
 Key: SPARK-25608
 URL: https://issues.apache.org/jira/browse/SPARK-25608
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.hive.execution.HashAggregationQueryWithControlledFallbackSuite.multiple
 distinct multiple columns sets 38 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25607) HashAggregationQueryWithControlledFallbackSuite: single distinct column set 42 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25607:
---

 Summary: HashAggregationQueryWithControlledFallbackSuite: single 
distinct column set 42 seconds
 Key: SPARK-25607
 URL: https://issues.apache.org/jira/browse/SPARK-25607
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.hive.execution.HashAggregationQueryWithControlledFallbackSuite.single
 distinct column set 42 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25501) Kafka delegation token support

2018-10-03 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637240#comment-16637240
 ] 

Thomas Graves commented on SPARK-25501:
---

Did you post the SPIP to the dev list? I didn't see it go by, but I might have missed it.

> Kafka delegation token support
> --
>
> Key: SPARK-25501
> URL: https://issues.apache.org/jira/browse/SPARK-25501
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Priority: Major
>  Labels: SPIP
>
> In Kafka version 1.1, delegation token support was released. As Spark has updated 
> its Kafka client to 2.0.0, it is now possible to implement delegation token 
> support. Please see the description: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25606) DateExpressionsSuite: Hour 1 min

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25606:
---

 Summary: DateExpressionsSuite: Hour 1 min
 Key: SPARK-25606
 URL: https://issues.apache.org/jira/browse/SPARK-25606
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.catalyst.expressions.DateExpressionsSuite.Hour 1 min



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25605) CastSuite: cast string to timestamp 2 mins 31 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25605:
---

 Summary: CastSuite: cast string to timestamp 2 mins 31 sec
 Key: SPARK-25605
 URL: https://issues.apache.org/jira/browse/SPARK-25605
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.catalyst.expressions.CastSuite.cast string to timestamp 
took 2 min 31 secs



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25604) Reduce the overall time costs in Jenkins tests

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25604:
---

 Summary: Reduce the overall time costs in Jenkins tests 
 Key: SPARK-25604
 URL: https://issues.apache.org/jira/browse/SPARK-25604
 Project: Spark
  Issue Type: Umbrella
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


Currently, our Jenkins tests take almost 5 hours. To reduce the test time, 
below are my suggestions:
* split the tests into multiple individual Jenkins jobs;
* tune the confs in the test framework;
* for the slow test cases, rewrite the test cases or even optimize the 
source code to speed them up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25062) Clean up BlockLocations in FileStatus objects

2018-10-03 Thread Andrei Stankevich (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637089#comment-16637089
 ] 

Andrei Stankevich commented on SPARK-25062:
---

Hi [~dongjoon], yes, it is an improvement.

> Clean up BlockLocations in FileStatus objects
> -
>
> Key: SPARK-25062
> URL: https://issues.apache.org/jira/browse/SPARK-25062
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: andrzej.stankev...@gmail.com
>Priority: Major
>
> When Spark lists a collection of files, it does so on the driver or creates tasks 
> to list the files, depending on the number of files; see 
> [https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L170]
> If Spark creates tasks to list files, each task creates one FileStatus object 
> per file. Before sending the FileStatus to the driver, Spark converts it to a 
> SerializableFileStatus. On the driver side Spark turns the SerializableFileStatus 
> back into a FileStatus, and it also creates a BlockLocation object for each 
> FileStatus using 
>  
> {code:java}
> new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length) 
> {code}
>  
> After deserialization on the driver side, this BlockLocation lacks a lot of the 
> information that the original HdfsBlockLocation had.
>  
> If Spark does the listing on the driver side, each FileStatus object holds 
> HdfsBlockLocation objects, which carry a lot of info that Spark doesn't use. 
> Because of this, the FileStatus objects take more memory than if they were 
> created on the executor side.
>  
> Later Spark puts all these objects into _SharedInMemoryCache_, and that cache 
> takes 2.2x more memory if the files were listed on the driver side than if they 
> were listed on the executor side.
>  
> In our case _SharedInMemoryCache_ takes 125M when we do the scan on executors 
> and 270M when we do it on the driver. That is for about 19000 files.
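
A rough sketch of the gap described above: the executor path keeps only the four BlockLocation fields Spark actually reads, while the driver path keeps full HdfsBlockLocation objects. The helper below is a hypothetical illustration (not Spark code) of what slimming a driver-side listing entry down to the same shape could look like.

{code:scala}
// Hypothetical illustration only: rebuild a driver-side listing entry so that each
// block location keeps only names/hosts/offset/length, mirroring what the
// executor-side SerializableFileStatus round trip ends up with.
import org.apache.hadoop.fs.{BlockLocation, LocatedFileStatus}

def slimBlockLocations(status: LocatedFileStatus): LocatedFileStatus = {
  val slim = status.getBlockLocations.map { loc =>
    new BlockLocation(loc.getNames, loc.getHosts, loc.getOffset, loc.getLength)
  }
  new LocatedFileStatus(status, slim)
}
{code}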



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25538) incorrect row counts after distinct()

2018-10-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25538.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22602
[https://github.com/apache/spark/pull/22602]

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Assignee: Marco Gaido
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.0
>
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25538) incorrect row counts after distinct()

2018-10-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25538:
-

Assignee: Marco Gaido

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Assignee: Marco Gaido
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.0
>
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25603) `Projection` expression pushdown through `coalesce` and `limit`

2018-10-03 Thread DB Tsai (JIRA)
DB Tsai created SPARK-25603:
---

 Summary: `Projection` expression pushdown through `coalesce` and 
`limit`
 Key: SPARK-25603
 URL: https://issues.apache.org/jira/browse/SPARK-25603
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.1
Reporter: DB Tsai
Assignee: DB Tsai






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25602) range metrics can be wrong if the result rows are not fully consumed

2018-10-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636995#comment-16636995
 ] 

Apache Spark commented on SPARK-25602:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/22621

> range metrics can be wrong if the result rows are not fully consumed
> 
>
> Key: SPARK-25602
> URL: https://issues.apache.org/jira/browse/SPARK-25602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25602) range metrics can be wrong if the result rows are not fully consumed

2018-10-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25602:


Assignee: Apache Spark  (was: Wenchen Fan)

> range metrics can be wrong if the result rows are not fully consumed
> 
>
> Key: SPARK-25602
> URL: https://issues.apache.org/jira/browse/SPARK-25602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25602) range metrics can be wrong if the result rows are not fully consumed

2018-10-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636994#comment-16636994
 ] 

Apache Spark commented on SPARK-25602:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/22621

> range metrics can be wrong if the result rows are not fully consumed
> 
>
> Key: SPARK-25602
> URL: https://issues.apache.org/jira/browse/SPARK-25602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25602) range metrics can be wrong if the result rows are not fully consumed

2018-10-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25602:


Assignee: Wenchen Fan  (was: Apache Spark)

> range metrics can be wrong if the result rows are not fully consumed
> 
>
> Key: SPARK-25602
> URL: https://issues.apache.org/jira/browse/SPARK-25602
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList

2018-10-03 Thread Paul Praet (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636957#comment-16636957
 ] 

Paul Praet commented on SPARK-21402:


Still there in Spark 2.3.1.

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  |    |-- key: string
>  |    |-- value: struct (valueContainsNull = true)
>  |    |    |-- startTime: long (nullable = true)
>  |    |    |-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (plus setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
>     private String userId;
>     private Map<String, MyDTO> data;
>     private Long offset;
> }
> 
> public class MyDTO {
>     private long startTime;
>     private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want, and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> +--------------------------+------------------------+
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |                1498854000|              1498870800|
> +--------------------------+------------------------+
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a Spark issue. I would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25062) Clean up BlockLocations in FileStatus objects

2018-10-03 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636900#comment-16636900
 ] 

Dongjoon Hyun edited comment on SPARK-25062 at 10/3/18 12:43 PM:
-

Hi, [~stanand99]
According to your description, this is an improvement, isn't it?


was (Author: dongjoon):
Hi, [~petertoth].
According to your description, this is an improvement, isn't it?

> Clean up BlockLocations in FileStatus objects
> -
>
> Key: SPARK-25062
> URL: https://issues.apache.org/jira/browse/SPARK-25062
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: andrzej.stankev...@gmail.com
>Priority: Major
>
> When Spark lists a collection of files, it does so on the driver or creates tasks 
> to list the files, depending on the number of files; see 
> [https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L170]
> If Spark creates tasks to list files, each task creates one FileStatus object 
> per file. Before sending the FileStatus to the driver, Spark converts it to a 
> SerializableFileStatus. On the driver side Spark turns the SerializableFileStatus 
> back into a FileStatus, and it also creates a BlockLocation object for each 
> FileStatus using 
>  
> {code:java}
> new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length) 
> {code}
>  
> After deserialization on the driver side, this BlockLocation lacks a lot of the 
> information that the original HdfsBlockLocation had.
>  
> If Spark does the listing on the driver side, each FileStatus object holds 
> HdfsBlockLocation objects, which carry a lot of info that Spark doesn't use. 
> Because of this, the FileStatus objects take more memory than if they were 
> created on the executor side.
>  
> Later Spark puts all these objects into _SharedInMemoryCache_, and that cache 
> takes 2.2x more memory if the files were listed on the driver side than if they 
> were listed on the executor side.
>  
> In our case _SharedInMemoryCache_ takes 125M when we do the scan on executors 
> and 270M when we do it on the driver. That is for about 19000 files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25062) Clean up BlockLocations in FileStatus objects

2018-10-03 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636900#comment-16636900
 ] 

Dongjoon Hyun commented on SPARK-25062:
---

Hi, [~petertoth].
According to your description, this is an improvement, isn't it?

> Clean up BlockLocations in FileStatus objects
> -
>
> Key: SPARK-25062
> URL: https://issues.apache.org/jira/browse/SPARK-25062
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: andrzej.stankev...@gmail.com
>Priority: Major
>
> When Spark lists a collection of files, it does so on the driver or creates tasks 
> to list the files, depending on the number of files; see 
> [https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L170]
> If Spark creates tasks to list files, each task creates one FileStatus object 
> per file. Before sending the FileStatus to the driver, Spark converts it to a 
> SerializableFileStatus. On the driver side Spark turns the SerializableFileStatus 
> back into a FileStatus, and it also creates a BlockLocation object for each 
> FileStatus using 
>  
> {code:java}
> new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length) 
> {code}
>  
> After deserialization on the driver side, this BlockLocation lacks a lot of the 
> information that the original HdfsBlockLocation had.
>  
> If Spark does the listing on the driver side, each FileStatus object holds 
> HdfsBlockLocation objects, which carry a lot of info that Spark doesn't use. 
> Because of this, the FileStatus objects take more memory than if they were 
> created on the executor side.
>  
> Later Spark puts all these objects into _SharedInMemoryCache_, and that cache 
> takes 2.2x more memory if the files were listed on the driver side than if they 
> were listed on the executor side.
>  
> In our case _SharedInMemoryCache_ takes 125M when we do the scan on executors 
> and 270M when we do it on the driver. That is for about 19000 files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25062) Clean up BlockLocations in FileStatus objects

2018-10-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25062:
--
Issue Type: Improvement  (was: Bug)

> Clean up BlockLocations in FileStatus objects
> -
>
> Key: SPARK-25062
> URL: https://issues.apache.org/jira/browse/SPARK-25062
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: andrzej.stankev...@gmail.com
>Priority: Major
>
> When Spark lists a collection of files, it does so on the driver or creates tasks 
> to list the files, depending on the number of files; see 
> [https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L170]
> If Spark creates tasks to list files, each task creates one FileStatus object 
> per file. Before sending the FileStatus to the driver, Spark converts it to a 
> SerializableFileStatus. On the driver side Spark turns the SerializableFileStatus 
> back into a FileStatus, and it also creates a BlockLocation object for each 
> FileStatus using 
>  
> {code:java}
> new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length) 
> {code}
>  
> After deserialization on the driver side, this BlockLocation lacks a lot of the 
> information that the original HdfsBlockLocation had.
>  
> If Spark does the listing on the driver side, each FileStatus object holds 
> HdfsBlockLocation objects, which carry a lot of info that Spark doesn't use. 
> Because of this, the FileStatus objects take more memory than if they were 
> created on the executor side.
>  
> Later Spark puts all these objects into _SharedInMemoryCache_, and that cache 
> takes 2.2x more memory if the files were listed on the driver side than if they 
> were listed on the executor side.
>  
> In our case _SharedInMemoryCache_ takes 125M when we do the scan on executors 
> and 270M when we do it on the driver. That is for about 19000 files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25602) range metrics can be wrong if the result rows are not fully consumed

2018-10-03 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25602:
---

 Summary: range metrics can be wrong if the result rows are not 
fully consumed
 Key: SPARK-25602
 URL: https://issues.apache.org/jira/browse/SPARK-25602
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25436) Bump master branch version to 2.5.0-SNAPSHOT

2018-10-03 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636833#comment-16636833
 ] 

Dongjoon Hyun commented on SPARK-25436:
---

I updated the versions to 3.0.0 since, as of yesterday, we no longer have 2.5.0.
It may look weird because `Bump to 2.5.0-SNAPSHOT` is under the `3.0.0` version 
number.
However, it's necessary so that we don't lose this issue in the next release notes.

> Bump master branch version to 2.5.0-SNAPSHOT
> 
>
> Key: SPARK-25436
> URL: https://issues.apache.org/jira/browse/SPARK-25436
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 3.0.0
>
>
> This patch bumps the master branch version to `2.5.0-SNAPSHOT`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25436) Bump master branch version to 2.5.0-SNAPSHOT

2018-10-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25436:
--
Affects Version/s: (was: 2.5.0)
   3.0.0

> Bump master branch version to 2.5.0-SNAPSHOT
> 
>
> Key: SPARK-25436
> URL: https://issues.apache.org/jira/browse/SPARK-25436
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 3.0.0
>
>
> This patch bumps the master branch version to `2.5.0-SNAPSHOT`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16323) Avoid unnecessary cast when doing integral divide

2018-10-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16323:
--
Affects Version/s: (was: 2.5.0)
   3.0.0

> Avoid unnecessary cast when doing integral divide
> -
>
> Key: SPARK-16323
> URL: https://issues.apache.org/jira/browse/SPARK-16323
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Sean Zhong
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 3.0.0
>
>
> This is a follow-up of issue SPARK-15776.
> *Problem:*
> For the integer divide operator div:
> {code}
> scala> spark.sql("select 6 div 3").explain(true)
> ...
> == Analyzed Logical Plan ==
> CAST((6 / 3) AS BIGINT): bigint
> Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS CAST((6 / 
> 3) AS BIGINT)#5L]
> +- OneRowRelation$
> ...
> {code}
> For performance reasons, we should not do the unnecessary cast {{cast(xx as 
> double)}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata

2018-10-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25423:
--
Affects Version/s: (was: 2.5.0)
   3.0.0

> Output "dataFilters" in DataSourceScanExec.metadata
> ---
>
> Key: SPARK-25423
> URL: https://issues.apache.org/jira/browse/SPARK-25423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maryann Xue
>Assignee: Yuming Wang
>Priority: Trivial
>  Labels: starter
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25390) data source V2 API refactoring

2018-10-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25390:
--
Target Version/s: 3.0.0  (was: 2.5.0)

> data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently it's not very clear how we should abstract the data source v2 API. The 
> abstraction should be unified between batch and streaming, or at least be similar 
> with a well-defined difference between batch and streaming. The 
> abstraction should also include catalog/table.
> An example of the abstraction:
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}
> We should refactor the data source v2 API according to this abstraction.
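
To make the proposed layering concrete, here is a rough Scala sketch. Every trait and method name below is a hypothetical illustration of the abstraction above, not an actual DSv2 interface.

{code:scala}
// Hypothetical names only; this just mirrors the two chains:
//   batch:     catalog -> table -> scan
//   streaming: catalog -> table -> stream -> scan
trait CatalogLike {
  def loadTable(name: String): TableLike
}

trait TableLike {
  def newScan(): ScanLike        // batch path
  def newStream(): StreamLike    // streaming path
}

trait StreamLike {
  def newScan(startOffset: Long, endOffset: Long): ScanLike
}

trait ScanLike {
  def planInputPartitions(): Seq[AnyRef]
}
{code}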



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25390) data source V2 API refactoring

2018-10-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25390:
--
Affects Version/s: (was: 2.5.0)
   3.0.0

> data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently it's not very clear how we should abstract the data source v2 API. The 
> abstraction should be unified between batch and streaming, or at least be similar 
> with a well-defined difference between batch and streaming. The 
> abstraction should also include catalog/table.
> An example of the abstraction:
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}
> We should refactor the data source v2 API according to this abstraction.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25444) Refactor GenArrayData.genCodeToCreateArrayData() method

2018-10-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25444:
--
Affects Version/s: (was: 2.5.0)
   3.0.0

> Refactor GenArrayData.genCodeToCreateArrayData() method
> ---
>
> Key: SPARK-25444
> URL: https://issues.apache.org/jira/browse/SPARK-25444
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 3.0.0
>
>
> {{GenArrayData.genCodeToCreateArrayData()}} generated Java code that built a 
> temporary Java array in order to create {{ArrayData}}. The temporary array can be 
> eliminated by using {{ArrayData.createArrayData}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25442) Support STS to run in K8S deployment with spark deployment mode as cluster

2018-10-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25442:
--
Affects Version/s: (was: 2.5.0)
   3.0.0

> Support STS to run in K8S deployment with spark deployment mode as cluster
> --
>
> Key: SPARK-25442
> URL: https://issues.apache.org/jira/browse/SPARK-25442
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Suryanarayana Garlapati
>Priority: Major
>
> STS fails to start in Kubernetes deployments when the Spark deploy mode is 
> cluster. Support should be added to make it run in K8S deployments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25457) IntegralDivide (div) should not always return long

2018-10-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25457:
--
Affects Version/s: (was: 2.5.0)
   3.0.0

> IntegralDivide (div) should not always return long
> --
>
> Key: SPARK-25457
> URL: https://issues.apache.org/jira/browse/SPARK-25457
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> The operation {{div}} always returns long. This came from Hive's behavior, 
> which is different from that of most other DBMSs (e.g. MySQL, Postgres), 
> which return the same datatype as the operands.
> This JIRA tracks changing our return type and allowing the users to re-enable 
> the old behavior using {{spark.sql.legacy.integralDivide.returnBigint}}.
> I'll submit a PR for this soon.
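
A minimal spark-shell sketch of how the proposed flag could be exercised. Only the conf name comes from the description above; the resulting schemas in the comments are an assumption about the intended behavior.

{code:scala}
// Assumed behavior: with the legacy flag on, div keeps returning bigint;
// with it off, the result type would follow the operands (int here).
spark.conf.set("spark.sql.legacy.integralDivide.returnBigint", "true")
spark.sql("SELECT CAST(6 AS INT) div CAST(3 AS INT)").printSchema()   // expected: bigint

spark.conf.set("spark.sql.legacy.integralDivide.returnBigint", "false")
spark.sql("SELECT CAST(6 AS INT) div CAST(3 AS INT)").printSchema()   // expected: int
{code}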



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25475) Refactor all benchmark to save the result as a separate file

2018-10-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25475:
--
Affects Version/s: (was: 2.5.0)
   3.0.0

> Refactor all benchmark to save the result as a separate file
> 
>
> Key: SPARK-25475
> URL: https://issues.apache.org/jira/browse/SPARK-25475
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This is an umbrella issue to refactor all benchmarks to use a common style: a 
> main method (instead of a `test` method) and saving the result as a 
> separate file (instead of embedding it as comments). This is not only for 
> consistency, but also for making benchmark automation easy. SPARK-25339 
> has been finished as a reference model; a minimal sketch of the target shape 
> follows the candidate list below.
> *Completed*
> - FilterPushdownBenchmark.scala (SPARK-25339)
> *Candidates*
> - AggregateBenchmark.scala
> - AvroWriteBenchmark.scala (SPARK-24777)
> - ColumnarBatchBenchmark.scala
> - CompressionSchemeBenchmark.scala
> - DataSourceReadBenchmark.scala
> - DataSourceWriteBenchmark.scala (SPARK-24777)
> - DatasetBenchmark.scala
> - ExternalAppendOnlyUnsafeRowArrayBenchmark.scala
> - HashBenchmark.scala
> - HashByteArrayBenchmark.scala
> - JoinBenchmark.scala
> - KryoBenchmark.scala
> - MiscBenchmark.scala
> - ObjectHashAggregateExecBenchmark.scala
> - OrcReadBenchmark.scala
> - PrimitiveArrayBenchmark.scala
> - SortBenchmark.scala
> - SynthBenchmark.scala
> - TPCDSQueryBenchmark.scala
> - UDTSerializationBenchmark.scala
> - UnsafeArrayDataBenchmark.scala
> - UnsafeProjectionBenchmark.scala
> - WideSchemaBenchmark.scala
> Candidates will be reviewed and converted as a subtask of this JIRA.
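
As referenced above, a small, self-contained sketch of the target shape: a main-method entry point whose timings go to a separate result file. It deliberately avoids Spark's internal benchmark helpers; the object name, workload, and file paths are placeholders.

{code:scala}
// Placeholder benchmark: main method instead of a test method, results written
// to a file instead of being embedded as comments in the source.
import java.io.{File, PrintWriter}

object ExampleBenchmark {
  private def timeMs(body: => Unit): Long = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1000000
  }

  def main(args: Array[String]): Unit = {
    val elapsed = timeMs {
      (1 to 1000000).map(_ * 2L).sum   // stand-in workload
    }
    val dir = new File("benchmarks")
    dir.mkdirs()
    val out = new PrintWriter(new File(dir, "ExampleBenchmark-results.txt"))
    try out.println(s"map + sum over 1M ints: $elapsed ms")
    finally out.close()
  }
}
{code}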



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25458) Support FOR ALL COLUMNS in ANALYZE TABLE

2018-10-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25458:
--
Affects Version/s: (was: 2.5.0)
   3.0.0

> Support FOR ALL COLUMNS in ANALYZE TABLE 
> -
>
> Key: SPARK-25458
> URL: https://issues.apache.org/jira/browse/SPARK-25458
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, to collect the statistics of all the columns, users need to 
> specify the names of all the columns when calling the command "ANALYZE TABLE 
> ... FOR COLUMNS...". This is not user-friendly. Instead, we can introduce the 
> following SQL command to achieve it without specifying the column names.
> {code:java}
>ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR ALL COLUMNS;
> {code}
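
For contrast, a short spark-shell sketch of today's column-by-column form next to the proposed form; the table and column names are placeholders.

{code:scala}
// Existing syntax: every column has to be listed explicitly.
spark.sql("ANALYZE TABLE db.tbl COMPUTE STATISTICS FOR COLUMNS col1, col2, col3")

// Proposed syntax from this ticket: cover all columns without naming them.
spark.sql("ANALYZE TABLE db.tbl COMPUTE STATISTICS FOR ALL COLUMNS")
{code}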



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25476) Refactor AggregateBenchmark to use main method

2018-10-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25476:
--
Affects Version/s: (was: 2.5.0)
   3.0.0

> Refactor AggregateBenchmark to use main method
> --
>
> Key: SPARK-25476
> URL: https://issues.apache.org/jira/browse/SPARK-25476
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


