[jira] [Commented] (SPARK-19992) spark-submit on deployment-mode cluster

2017-03-23 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939867#comment-15939867
 ] 

Saisai Shao commented on SPARK-19992:
-

Sorry, I cannot give you useful suggestions without knowing your actual 
environment. Basically, to run Spark on YARN you don't have to configure 
anything beyond setting HADOOP_CONF_DIR in spark-env.sh; other than that, the 
default configuration should be enough.
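
For what it's worth, a minimal setup along those lines could look like the sketch below. The paths and the application jar name are illustrative assumptions, not taken from this report:

{noformat}
# conf/spark-env.sh -- usually the only YARN-specific setting that is required
export HADOOP_CONF_DIR=/usr/local/hadoop-2.6.0/etc/hadoop

# Submit with stock defaults (no spark.yarn.jars / yarn.application.classpath overrides),
# letting Spark upload its own jars to YARN.
./bin/spark-submit \
  --class spark.mongohadoop.testing3 \
  --master yarn \
  --deploy-mode cluster \
  --jars /home/ec2-user/jars/hgmongonew.jar,/home/ec2-user/jars/mongo-hadoop-spark-2.0.1.jar \
  /home/ec2-user/jars/your-application.jar
{noformat}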

You could also send your problem to the mailing list; there are likely users 
there who have hit the same problem before. JIRA is mainly used to track Spark 
development work, not to answer questions.

> spark-submit on deployment-mode cluster
> ---
>
> Key: SPARK-19992
> URL: https://issues.apache.org/jira/browse/SPARK-19992
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.2
> Environment: spark version 2.0.2
> hadoop version 2.6.0
>Reporter: narendra maru
>
> spark version 2.0.2
> hadoop version 2.6.0
> spark-submit command:
> "spark-submit --class spark.mongohadoop.testing3 --master yarn --deploy-mode 
> cluster --jars /home/ec2-user/jars/hgmongonew.jar, 
> /home/ec2-user/jars/mongo-hadoop-spark-2.0.1.jar"
> after adding the following in
> 1. Spark-default.conf
> spark.executor.extraJavaOptions -Dconfig.fuction.conf 
> spark.yarn.jars=local:/usr/local/spark-2.0.2-bin-hadoop2.6/yarn/*
> spark.eventLog.dir=hdfs://localhost:9000/user/spark/applicationHistory
> spark.eventLog.enabled=true
> 2. yarn-site.xml
> 
> yarn.application.classpath
> 
> /usr/local/hadoop-2.6.0/etc/hadoop,
> /usr/local/hadoop-2.6.0/,
> /usr/local/hadoop-2.6.0/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/common/,
> /usr/local/hadoop-2.6.0/share/hadoop/common/lib/
> /usr/local/hadoop-2.6.0/share/hadoop/hdfs/,
> /usr/local/hadoop-2.6.0/share/hadoop/hdfs/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/mapreduce/,
> /usr/local/hadoop-2.6.0/share/hadoop/mapreduce/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/tools/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/yarn/,
> /usr/local/hadoop-2.6.0/share/hadoop/yarn/lib/*,
> /usr/local/spark-2.0.2-bin-hadoop2.6/jars/spark-yarn_2.11-2.0.2.jar
> 
> 
> Error on log:-
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ApplicationMaster
> Error on terminal:-
> diagnostics: Application application_1489673977198_0002 failed 2 times due to 
> AM Container for appattempt_1489673977198_0002_02 exited with exitCode: 1 
> For more detailed output, check application tracking 
> page:http://bdg-hdp-sparkmaster:8088/proxy/application_1489673977198_0002/Then,
>  click on links to logs of each attempt.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19992) spark-submit on deployment-mode cluster

2017-03-23 Thread narendra maru (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939857#comment-15939857
 ] 

narendra maru commented on SPARK-19992:
---

Thanks Sean and Saisai for your replies.

I followed your instructions but I am still not getting a satisfactory result.

It would be great if you could help me out with some better suggestions.

> spark-submit on deployment-mode cluster
> ---
>
> Key: SPARK-19992
> URL: https://issues.apache.org/jira/browse/SPARK-19992
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.2
> Environment: spark version 2.0.2
> hadoop version 2.6.0
>Reporter: narendra maru
>
> spark version 2.0.2
> hadoop version 2.6.0
> spark-submit command:
> "spark-submit --class spark.mongohadoop.testing3 --master yarn --deploy-mode 
> cluster --jars /home/ec2-user/jars/hgmongonew.jar, 
> /home/ec2-user/jars/mongo-hadoop-spark-2.0.1.jar"
> after adding the following in
> 1. Spark-default.conf
> spark.executor.extraJavaOptions -Dconfig.fuction.conf 
> spark.yarn.jars=local:/usr/local/spark-2.0.2-bin-hadoop2.6/yarn/*
> spark.eventLog.dir=hdfs://localhost:9000/user/spark/applicationHistory
> spark.eventLog.enabled=true
> 2. yarn-site.xml
> 
> yarn.application.classpath
> 
> /usr/local/hadoop-2.6.0/etc/hadoop,
> /usr/local/hadoop-2.6.0/,
> /usr/local/hadoop-2.6.0/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/common/,
> /usr/local/hadoop-2.6.0/share/hadoop/common/lib/
> /usr/local/hadoop-2.6.0/share/hadoop/hdfs/,
> /usr/local/hadoop-2.6.0/share/hadoop/hdfs/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/mapreduce/,
> /usr/local/hadoop-2.6.0/share/hadoop/mapreduce/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/tools/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/yarn/,
> /usr/local/hadoop-2.6.0/share/hadoop/yarn/lib/*,
> /usr/local/spark-2.0.2-bin-hadoop2.6/jars/spark-yarn_2.11-2.0.2.jar
> 
> 
> Error on log:-
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ApplicationMaster
> Error on terminal:-
> diagnostics: Application application_1489673977198_0002 failed 2 times due to 
> AM Container for appattempt_1489673977198_0002_02 exited with exitCode: 1 
> For more detailed output, check application tracking 
> page:http://bdg-hdp-sparkmaster:8088/proxy/application_1489673977198_0002/Then,
>  click on links to logs of each attempt.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20073) Unexpected Cartesian product when using eqNullSafe in join with a derived table

2017-03-23 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939852#comment-15939852
 ] 

Takeshi Yamamuro commented on SPARK-20073:
--

yea, you need an alias for key;
{code}

scala> val origDf = Seq(("a", 0), ("b", 0), (null, 1)).toDF("key", "value")
scala> val aggDf = origDf.groupBy($"key".as("k")).agg(count("value"))
scala> aggDf.join(origDf, aggDf("k") <=> origDf("key")).show
+----+------------+----+-----+
|   k|count(value)| key|value|
+----+------------+----+-----+
|null|           1|null|    1|
|   b|           1|   b|    0|
|   a|           1|   a|    0|
+----+------------+----+-----+
{code}

> Unexpected Cartesian product when using eqNullSafe in join with a derived 
> table
> ---
>
> Key: SPARK-20073
> URL: https://issues.apache.org/jira/browse/SPARK-20073
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Everett Anderson
>  Labels: correctness
>
> It appears that if you try to join tables A and B when B is derived from A 
> and you use the eqNullSafe / <=> operator for the join condition, Spark 
> performs a Cartesian product.
> However, if you perform the join on tables of the same data when they don't 
> have a relationship, the expected non-Cartesian product join occurs.
> {noformat}
> // Create some fake data.
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.Dataset
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions
> val peopleRowsRDD = sc.parallelize(Seq(
> Row("Fred", 8, 1),
> Row("Fred", 8, 2),
> Row(null, 10, 3),
> Row(null, 10, 4),
> Row("Amy", 12, 5),
> Row("Amy", 12, 6)))
> 
> val peopleSchema = StructType(Seq(
> StructField("name", StringType, nullable = true),
> StructField("group", IntegerType, nullable = true),
> StructField("data", IntegerType, nullable = true)))
> 
> val people = spark.createDataFrame(peopleRowsRDD, peopleSchema)
> people.createOrReplaceTempView("people")
> scala> people.show
> +----+-----+----+
> |name|group|data|
> +----+-----+----+
> |Fred|    8|   1|
> |Fred|    8|   2|
> |null|   10|   3|
> |null|   10|   4|
> | Amy|   12|   5|
> | Amy|   12|   6|
> +----+-----+----+
> // Now create a derived table from that table. It doesn't matter much what.
> val variantCounts = spark.sql("select name, count(distinct(name, group, 
> data)) as variant_count from people group by name having variant_count > 1")
> variantCounts.show
> +----+-------------+
> |name|variant_count|
> +----+-------------+
> |Fred|            2|
> |null|            2|
> | Amy|            2|
> +----+-------------+
> // Now try an inner join using the regular equalTo that drops nulls. This 
> works fine.
> val innerJoinEqualTo = variantCounts.join(people, 
> variantCounts("name").equalTo(people("name")))
> innerJoinEqualTo.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> | Amy|            2| Amy|   12|   5|
> | Amy|            2| Amy|   12|   6|
> +----+-------------+----+-----+----+
> // Okay now lets switch to the <=> operator
> //
> // If you haven't set spark.sql.crossJoin.enabled=true, you'll get an error 
> like
> // "Cartesian joins could be prohibitively expensive and are disabled by 
> default. To explicitly enable them, please set spark.sql.crossJoin.enabled = 
> true;"
> //
> // if you have enabled them, you'll get the table below.
> //
> // However, we really don't want or expect a Cartesian product!
> val innerJoinSqlNullSafeEqOp = variantCounts.join(people, 
> variantCounts("name")<=>(people("name")))
> innerJoinSqlNullSafeEqOp.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> |Fred|            2|null|   10|   3|
> |Fred|            2|null|   10|   4|
> |Fred|            2| Amy|   12|   5|
> |Fred|            2| Amy|   12|   6|
> |null|            2|Fred|    8|   1|
> |null|            2|Fred|    8|   2|
> |null|            2|null|   10|   3|
> |null|            2|null|   10|   4|
> |null|            2| Amy|   12|   5|
> |null|            2| Amy|   12|   6|
> | Amy|            2|Fred|    8|   1|
> | Amy|            2|Fred|    8|   2|
> | Amy|            2|null|   10|   3|
> | Amy|            2|null|   10|   4|
> | Amy|            2| Amy|   12|   5|
> | Amy|2| Amy| 

[jira] [Updated] (SPARK-19612) Tests failing with timeout

2017-03-23 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19612:
---
Affects Version/s: (was: 2.1.1)
   2.2.0

> Tests failing with timeout
> --
>
> Key: SPARK-19612
> URL: https://issues.apache.org/jira/browse/SPARK-19612
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
> I've seen at least one recent test failure due to hitting the 250m timeout: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/
> Filing this JIRA to track this; if it happens repeatedly we should up the 
> timeout.
> cc [~shaneknapp]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19612) Tests failing with timeout

2017-03-23 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout reopened SPARK-19612:


This seems to be back: saw two recently:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75124
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75127

> Tests failing with timeout
> --
>
> Key: SPARK-19612
> URL: https://issues.apache.org/jira/browse/SPARK-19612
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
> I've seen at least one recent test failure due to hitting the 250m timeout: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/
> Filing this JIRA to track this; if it happens repeatedly we should up the 
> timeout.
> cc [~shaneknapp]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20073) Unexpected Cartesian product when using eqNullSafe in join with a derived table

2017-03-23 Thread Everett Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939844#comment-15939844
 ] 

Everett Anderson commented on SPARK-20073:
--

[~maropu]

Hi! Thanks for taking a look. What's the Scala equivalent of the necessary 
aliasing?

I've tried calling .toDF(column names) on both tables and assigning table 
aliases like this --

{noformat}
val replacedPeople = people.toDF(people.columns:_*).alias("replacedPeople")
val replacedVariantCounts = 
variantCounts.toDF(variantCounts.columns:_*).alias("replacedVariantCounts")
replacedVariantCounts.join(replacedPeople, 
replacedVariantCounts("name")<=>(replacedPeople("name"))).show
{noformat}

but that doesn't seem to work.


> Unexpected Cartesian product when using eqNullSafe in join with a derived 
> table
> ---
>
> Key: SPARK-20073
> URL: https://issues.apache.org/jira/browse/SPARK-20073
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Everett Anderson
>  Labels: correctness
>
> It appears that if you try to join tables A and B when B is derived from A 
> and you use the eqNullSafe / <=> operator for the join condition, Spark 
> performs a Cartesian product.
> However, if you perform the join on tables of the same data when they don't 
> have a relationship, the expected non-Cartesian product join occurs.
> {noformat}
> // Create some fake data.
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.Dataset
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions
> val peopleRowsRDD = sc.parallelize(Seq(
> Row("Fred", 8, 1),
> Row("Fred", 8, 2),
> Row(null, 10, 3),
> Row(null, 10, 4),
> Row("Amy", 12, 5),
> Row("Amy", 12, 6)))
> 
> val peopleSchema = StructType(Seq(
> StructField("name", StringType, nullable = true),
> StructField("group", IntegerType, nullable = true),
> StructField("data", IntegerType, nullable = true)))
> 
> val people = spark.createDataFrame(peopleRowsRDD, peopleSchema)
> people.createOrReplaceTempView("people")
> scala> people.show
> +----+-----+----+
> |name|group|data|
> +----+-----+----+
> |Fred|    8|   1|
> |Fred|    8|   2|
> |null|   10|   3|
> |null|   10|   4|
> | Amy|   12|   5|
> | Amy|   12|   6|
> +----+-----+----+
> // Now create a derived table from that table. It doesn't matter much what.
> val variantCounts = spark.sql("select name, count(distinct(name, group, 
> data)) as variant_count from people group by name having variant_count > 1")
> variantCounts.show
> +----+-------------+
> |name|variant_count|
> +----+-------------+
> |Fred|            2|
> |null|            2|
> | Amy|            2|
> +----+-------------+
> // Now try an inner join using the regular equalTo that drops nulls. This 
> works fine.
> val innerJoinEqualTo = variantCounts.join(people, 
> variantCounts("name").equalTo(people("name")))
> innerJoinEqualTo.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> | Amy|            2| Amy|   12|   5|
> | Amy|            2| Amy|   12|   6|
> +----+-------------+----+-----+----+
> // Okay now lets switch to the <=> operator
> //
> // If you haven't set spark.sql.crossJoin.enabled=true, you'll get an error 
> like
> // "Cartesian joins could be prohibitively expensive and are disabled by 
> default. To explicitly enable them, please set spark.sql.crossJoin.enabled = 
> true;"
> //
> // if you have enabled them, you'll get the table below.
> //
> // However, we really don't want or expect a Cartesian product!
> val innerJoinSqlNullSafeEqOp = variantCounts.join(people, 
> variantCounts("name")<=>(people("name")))
> innerJoinSqlNullSafeEqOp.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> |Fred|            2|null|   10|   3|
> |Fred|            2|null|   10|   4|
> |Fred|            2| Amy|   12|   5|
> |Fred|            2| Amy|   12|   6|
> |null|            2|Fred|    8|   1|
> |null|            2|Fred|    8|   2|
> |null|            2|null|   10|   3|
> |null|            2|null|   10|   4|
> |null|            2| Amy|   12|   5|
> |null|            2| Amy|   12|   6|
> | Amy|            2|Fred|    8|   1|
> | Amy|            2|Fred|    8|   2|
> | Amy|            2|null|   10|   3|
> | Amy|            2|null|   10|   4|
> | Amy|            2| Amy|   12|   5|
> | 

[jira] [Commented] (SPARK-20068) Twenty-two column coalesce has pool performance when codegen is open

2017-03-23 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939836#comment-15939836
 ] 

Takeshi Yamamuro commented on SPARK-20068:
--

If you can, you'd better use a newer Spark; I think we no longer do maintenance 
releases for v1.5 or older.

> Twenty-two column coalesce has pool performance when codegen is open
> 
>
> Key: SPARK-20068
> URL: https://issues.apache.org/jira/browse/SPARK-20068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: QQShu1
>
> When the query coalesces twenty-one columns the select performance is fine, but 
> with a twenty-two column coalesce the select performance is poor. 
> The 22-column coalesce select command I use is: 
> SELECT  
> SUM(COALESCE(col1,0)) AS CUSTOM1, 
> SUM(COALESCE(col2,0)) AS CUSTOM2, 
> SUM(COALESCE(col3,0)) AS CUSTOM3, 
> SUM(COALESCE(col4,0)) AS CUSTOM4, 
> SUM(COALESCE(col5,0)) AS CUSTOM5, 
> SUM(COALESCE(col6,0)) AS CUSTOM6, 
> SUM(COALESCE(col7,0)) AS CUSTOM7, 
> SUM(COALESCE(col8,0)) AS CUSTOM8, 
> SUM(COALESCE(col9,0)) AS CUSTOM9, 
> SUM(COALESCE(col10,0)) AS CUSTOM10, 
> SUM(COALESCE(col11,0)) AS CUSTOM11, 
> SUM(COALESCE(col12,0)) AS CUSTOM12, 
> SUM(COALESCE(col13,0)) AS CUSTOM13, 
> SUM(COALESCE(col14,0)) AS CUSTOM14, 
> SUM(COALESCE(col15,0)) AS CUSTOM15, 
> SUM(COALESCE(col16,0)) AS CUSTOM16, 
> SUM(COALESCE(col17,0)) AS CUSTOM17, 
> SUM(COALESCE(col18,0)) AS CUSTOM18, 
> SUM(COALESCE(col19,0)) AS CUSTOM19, 
> SUM(COALESCE(col20,0)) AS CUSTOM20, 
> SUM(COALESCE(col21,0)) AS CUSTOM21, 
> SUM(COALESCE(col22,0)) AS CUSTOM22
> FROM txt_table ;
> The 21-column select command is:
> SELECT  
> SUM(COALESCE(col1,0)) AS CUSTOM1, 
> SUM(COALESCE(col2,0)) AS CUSTOM2, 
> SUM(COALESCE(col3,0)) AS CUSTOM3, 
> SUM(COALESCE(col4,0)) AS CUSTOM4, 
> SUM(COALESCE(col5,0)) AS CUSTOM5, 
> SUM(COALESCE(col6,0)) AS CUSTOM6, 
> SUM(COALESCE(col7,0)) AS CUSTOM7, 
> SUM(COALESCE(col8,0)) AS CUSTOM8, 
> SUM(COALESCE(col9,0)) AS CUSTOM9, 
> SUM(COALESCE(col10,0)) AS CUSTOM10, 
> SUM(COALESCE(col11,0)) AS CUSTOM11, 
> SUM(COALESCE(col12,0)) AS CUSTOM12, 
> SUM(COALESCE(col13,0)) AS CUSTOM13, 
> SUM(COALESCE(col14,0)) AS CUSTOM14, 
> SUM(COALESCE(col15,0)) AS CUSTOM15, 
> SUM(COALESCE(col16,0)) AS CUSTOM16, 
> SUM(COALESCE(col17,0)) AS CUSTOM17, 
> SUM(COALESCE(col18,0)) AS CUSTOM18, 
> SUM(COALESCE(col19,0)) AS CUSTOM19, 
> SUM(COALESCE(col20,0)) AS CUSTOM20, 
> SUM(COALESCE(col21,0)) AS CUSTOM21
> FROM txt_table ;



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20073) Unexpected Cartesian product when using eqNullSafe in join with a derived table

2017-03-23 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939831#comment-15939831
 ] 

Takeshi Yamamuro commented on SPARK-20073:
--

If you use `===` instead of `<=>`, you get a warning message below;
{code}
scala> val aggDf = sql("""SELECT key, COUNT(value) FROM origTable GROUP BY 
key""")
scala> aggDf.join(origDf, aggDf("key") === origDf("key")).show
17/03/24 13:45:04 WARN Column: Constructing trivially true equals predicate, 
'key#86 = key#86'. Perhaps you need to use aliases.
{code}
So, we probably need to show the same warning in this case. Or, we might 
implicitly add an alias in this case (I'm not 100% sure we could though).

> Unexpected Cartesian product when using eqNullSafe in join with a derived 
> table
> ---
>
> Key: SPARK-20073
> URL: https://issues.apache.org/jira/browse/SPARK-20073
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Everett Anderson
>  Labels: correctness
>
> It appears that if you try to join tables A and B when B is derived from A 
> and you use the eqNullSafe / <=> operator for the join condition, Spark 
> performs a Cartesian product.
> However, if you perform the join on tables of the same data when they don't 
> have a relationship, the expected non-Cartesian product join occurs.
> {noformat}
> // Create some fake data.
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.Dataset
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions
> val peopleRowsRDD = sc.parallelize(Seq(
> Row("Fred", 8, 1),
> Row("Fred", 8, 2),
> Row(null, 10, 3),
> Row(null, 10, 4),
> Row("Amy", 12, 5),
> Row("Amy", 12, 6)))
> 
> val peopleSchema = StructType(Seq(
> StructField("name", StringType, nullable = true),
> StructField("group", IntegerType, nullable = true),
> StructField("data", IntegerType, nullable = true)))
> 
> val people = spark.createDataFrame(peopleRowsRDD, peopleSchema)
> people.createOrReplaceTempView("people")
> scala> people.show
> +----+-----+----+
> |name|group|data|
> +----+-----+----+
> |Fred|    8|   1|
> |Fred|    8|   2|
> |null|   10|   3|
> |null|   10|   4|
> | Amy|   12|   5|
> | Amy|   12|   6|
> +----+-----+----+
> // Now create a derived table from that table. It doesn't matter much what.
> val variantCounts = spark.sql("select name, count(distinct(name, group, 
> data)) as variant_count from people group by name having variant_count > 1")
> variantCounts.show
> +----+-------------+
> |name|variant_count|
> +----+-------------+
> |Fred|            2|
> |null|            2|
> | Amy|            2|
> +----+-------------+
> // Now try an inner join using the regular equalTo that drops nulls. This 
> works fine.
> val innerJoinEqualTo = variantCounts.join(people, 
> variantCounts("name").equalTo(people("name")))
> innerJoinEqualTo.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> | Amy|            2| Amy|   12|   5|
> | Amy|            2| Amy|   12|   6|
> +----+-------------+----+-----+----+
> // Okay now lets switch to the <=> operator
> //
> // If you haven't set spark.sql.crossJoin.enabled=true, you'll get an error 
> like
> // "Cartesian joins could be prohibitively expensive and are disabled by 
> default. To explicitly enable them, please set spark.sql.crossJoin.enabled = 
> true;"
> //
> // if you have enabled them, you'll get the table below.
> //
> // However, we really don't want or expect a Cartesian product!
> val innerJoinSqlNullSafeEqOp = variantCounts.join(people, 
> variantCounts("name")<=>(people("name")))
> innerJoinSqlNullSafeEqOp.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> |Fred|            2|null|   10|   3|
> |Fred|            2|null|   10|   4|
> |Fred|            2| Amy|   12|   5|
> |Fred|            2| Amy|   12|   6|
> |null|            2|Fred|    8|   1|
> |null|            2|Fred|    8|   2|
> |null|            2|null|   10|   3|
> |null|            2|null|   10|   4|
> |null|            2| Amy|   12|   5|
> |null|            2| Amy|   12|   6|
> | Amy|            2|Fred|    8|   1|
> | Amy|            2|Fred|    8|   2|
> | Amy|            2|null|   10|   3|
> | Amy|            2|null|   10|   4|
> | Amy|            2| Amy|   12|   5|
> | Amy|            2| Amy|   12|   6|
> +----+-------------+----+-----+----+

[jira] [Comment Edited] (SPARK-20073) Unexpected Cartesian product when using eqNullSafe in join with a derived table

2017-03-23 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939822#comment-15939822
 ] 

Takeshi Yamamuro edited comment on SPARK-20073 at 3/24/17 5:16 AM:
---

I think this is a known issue and you need to assign an alias in this case, 
like this:
{code}

scala> sql("""SET spark.sql.crossJoin.enabled = true""")
scala> val origDf = Seq(("a", 0), ("b", 0), (null, 1)).toDF("key", "value")
scala> origDf.createOrReplaceTempView("origTable")
scala> // Add an alias for `key`
scala> val aggDf = sql("""SELECT key AS k, COUNT(value) FROM origTable GROUP BY 
key""")

scala> aggDf.join(origDf, aggDf("k") <=> origDf("key")).show
+----+------------+----+-----+
|   k|count(value)| key|value|
+----+------------+----+-----+
|null|           1|null|    1|
|   b|           1|   b|    0|
|   a|           1|   a|    0|
+----+------------+----+-----+
{code}


was (Author: maropu):
I think this is the known issue and you need to assign an alias for name in 
`variantCounts`;
This is like;
{code}

scala> sql("""SET spark.sql.crossJoin.enabled = true""")
scala> val origDf = Seq(("a", 0), ("b", 0), (null, 1)).toDF("key", "value")
scala> origDf.createOrReplaceTempView("origTable")
scala> // Add an alias for `key`
scala> val aggDf = sql("""SELECT key AS k, COUNT(value) FROM origTable GROUP BY 
key""")

scala> aggDf.join(origDf, aggDf("k") <=> origDf("key")).show
+----+------------+----+-----+
|   k|count(value)| key|value|
+----+------------+----+-----+
|null|           1|null|    1|
|   b|           1|   b|    0|
|   a|           1|   a|    0|
+----+------------+----+-----+
{code}

> Unexpected Cartesian product when using eqNullSafe in join with a derived 
> table
> ---
>
> Key: SPARK-20073
> URL: https://issues.apache.org/jira/browse/SPARK-20073
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Everett Anderson
>  Labels: correctness
>
> It appears that if you try to join tables A and B when B is derived from A 
> and you use the eqNullSafe / <=> operator for the join condition, Spark 
> performs a Cartesian product.
> However, if you perform the join on tables of the same data when they don't 
> have a relationship, the expected non-Cartesian product join occurs.
> {noformat}
> // Create some fake data.
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.Dataset
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions
> val peopleRowsRDD = sc.parallelize(Seq(
> Row("Fred", 8, 1),
> Row("Fred", 8, 2),
> Row(null, 10, 3),
> Row(null, 10, 4),
> Row("Amy", 12, 5),
> Row("Amy", 12, 6)))
> 
> val peopleSchema = StructType(Seq(
> StructField("name", StringType, nullable = true),
> StructField("group", IntegerType, nullable = true),
> StructField("data", IntegerType, nullable = true)))
> 
> val people = spark.createDataFrame(peopleRowsRDD, peopleSchema)
> people.createOrReplaceTempView("people")
> scala> people.show
> +----+-----+----+
> |name|group|data|
> +----+-----+----+
> |Fred|    8|   1|
> |Fred|    8|   2|
> |null|   10|   3|
> |null|   10|   4|
> | Amy|   12|   5|
> | Amy|   12|   6|
> +----+-----+----+
> // Now create a derived table from that table. It doesn't matter much what.
> val variantCounts = spark.sql("select name, count(distinct(name, group, 
> data)) as variant_count from people group by name having variant_count > 1")
> variantCounts.show
> +----+-------------+
> |name|variant_count|
> +----+-------------+
> |Fred|            2|
> |null|            2|
> | Amy|            2|
> +----+-------------+
> // Now try an inner join using the regular equalTo that drops nulls. This 
> works fine.
> val innerJoinEqualTo = variantCounts.join(people, 
> variantCounts("name").equalTo(people("name")))
> innerJoinEqualTo.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> | Amy|            2| Amy|   12|   5|
> | Amy|            2| Amy|   12|   6|
> +----+-------------+----+-----+----+
> // Okay now lets switch to the <=> operator
> //
> // If you haven't set spark.sql.crossJoin.enabled=true, you'll get an error 
> like
> // "Cartesian joins could be prohibitively expensive and are disabled by 
> default. To explicitly enable them, please set spark.sql.crossJoin.enabled = 
> true;"
> //
> // if you have enabled them, you'll get the table below.
> //
> // 

[jira] [Comment Edited] (SPARK-20073) Unexpected Cartesian product when using eqNullSafe in join with a derived table

2017-03-23 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939822#comment-15939822
 ] 

Takeshi Yamamuro edited comment on SPARK-20073 at 3/24/17 5:15 AM:
---

I think this is a known issue and you need to assign an alias for name in 
`variantCounts`, like this:
{code}

scala> sql("""SET spark.sql.crossJoin.enabled = true""")
scala> val origDf = Seq(("a", 0), ("b", 0), (null, 1)).toDF("key", "value")
scala> origDf.createOrReplaceTempView("origTable")
scala> // Add an alias for `key`
scala> val aggDf = sql("""SELECT key AS k, COUNT(value) FROM origTable GROUP BY 
key""")

scala> aggDf.join(origDf, aggDf("k") <=> origDf("key")).show
+----+------------+----+-----+
|   k|count(value)| key|value|
+----+------------+----+-----+
|null|           1|null|    1|
|   b|           1|   b|    0|
|   a|           1|   a|    0|
+----+------------+----+-----+
{code}


was (Author: maropu):
I think this is the known issue and you need to assign an alias for name in 
`variantCounts`;
This is like;
{code}

scala> sql("""SET spark.sql.crossJoin.enabled = true""")

scala> val origDf = Seq(("a", 0), ("b", 0), (null, 1)).toDF("key", "value")
scala> origDf.createOrReplaceTempView("origTable")

scala> // Add an alias for `key`
scala> val aggDf = sql("""SELECT key AS k, COUNT(value) FROM origTable GROUP BY 
key""")

scala> aggDf.join(origDf, aggDf("k") <=> origDf("key")).show
+----+------------+----+-----+
|   k|count(value)| key|value|
+----+------------+----+-----+
|null|           1|null|    1|
|   b|           1|   b|    0|
|   a|           1|   a|    0|
+----+------------+----+-----+
{code}

> Unexpected Cartesian product when using eqNullSafe in join with a derived 
> table
> ---
>
> Key: SPARK-20073
> URL: https://issues.apache.org/jira/browse/SPARK-20073
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Everett Anderson
>  Labels: correctness
>
> It appears that if you try to join tables A and B when B is derived from A 
> and you use the eqNullSafe / <=> operator for the join condition, Spark 
> performs a Cartesian product.
> However, if you perform the join on tables of the same data when they don't 
> have a relationship, the expected non-Cartesian product join occurs.
> {noformat}
> // Create some fake data.
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.Dataset
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions
> val peopleRowsRDD = sc.parallelize(Seq(
> Row("Fred", 8, 1),
> Row("Fred", 8, 2),
> Row(null, 10, 3),
> Row(null, 10, 4),
> Row("Amy", 12, 5),
> Row("Amy", 12, 6)))
> 
> val peopleSchema = StructType(Seq(
> StructField("name", StringType, nullable = true),
> StructField("group", IntegerType, nullable = true),
> StructField("data", IntegerType, nullable = true)))
> 
> val people = spark.createDataFrame(peopleRowsRDD, peopleSchema)
> people.createOrReplaceTempView("people")
> scala> people.show
> +----+-----+----+
> |name|group|data|
> +----+-----+----+
> |Fred|    8|   1|
> |Fred|    8|   2|
> |null|   10|   3|
> |null|   10|   4|
> | Amy|   12|   5|
> | Amy|   12|   6|
> +----+-----+----+
> // Now create a derived table from that table. It doesn't matter much what.
> val variantCounts = spark.sql("select name, count(distinct(name, group, 
> data)) as variant_count from people group by name having variant_count > 1")
> variantCounts.show
> +----+-------------+
> |name|variant_count|
> +----+-------------+
> |Fred|            2|
> |null|            2|
> | Amy|            2|
> +----+-------------+
> // Now try an inner join using the regular equalTo that drops nulls. This 
> works fine.
> val innerJoinEqualTo = variantCounts.join(people, 
> variantCounts("name").equalTo(people("name")))
> innerJoinEqualTo.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> | Amy|            2| Amy|   12|   5|
> | Amy|            2| Amy|   12|   6|
> +----+-------------+----+-----+----+
> // Okay now lets switch to the <=> operator
> //
> // If you haven't set spark.sql.crossJoin.enabled=true, you'll get an error 
> like
> // "Cartesian joins could be prohibitively expensive and are disabled by 
> default. To explicitly enable them, please set spark.sql.crossJoin.enabled = 
> true;"
> //
> // if you have enabled them, you'll get the table 

[jira] [Commented] (SPARK-20073) Unexpected Cartesian product when using eqNullSafe in join with a derived table

2017-03-23 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939822#comment-15939822
 ] 

Takeshi Yamamuro commented on SPARK-20073:
--

I think this is a known issue and you need to assign an alias for name in 
`variantCounts`, like this:
{code}

scala> sql("""SET spark.sql.crossJoin.enabled = true""")

scala> val origDf = Seq(("a", 0), ("b", 0), (null, 1)).toDF("key", "value")
scala> origDf.createOrReplaceTempView("origTable")

scala> // Add an alias for `key`
scala> val aggDf = sql("""SELECT key AS k, COUNT(value) FROM origTable GROUP BY 
key""")

scala> aggDf.join(origDf, aggDf("k") <=> origDf("key")).show
+----+------------+----+-----+
|   k|count(value)| key|value|
+----+------------+----+-----+
|null|           1|null|    1|
|   b|           1|   b|    0|
|   a|           1|   a|    0|
+----+------------+----+-----+
{code}

> Unexpected Cartesian product when using eqNullSafe in join with a derived 
> table
> ---
>
> Key: SPARK-20073
> URL: https://issues.apache.org/jira/browse/SPARK-20073
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Everett Anderson
>  Labels: correctness
>
> It appears that if you try to join tables A and B when B is derived from A 
> and you use the eqNullSafe / <=> operator for the join condition, Spark 
> performs a Cartesian product.
> However, if you perform the join on tables of the same data when they don't 
> have a relationship, the expected non-Cartesian product join occurs.
> {noformat}
> // Create some fake data.
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.Dataset
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions
> val peopleRowsRDD = sc.parallelize(Seq(
> Row("Fred", 8, 1),
> Row("Fred", 8, 2),
> Row(null, 10, 3),
> Row(null, 10, 4),
> Row("Amy", 12, 5),
> Row("Amy", 12, 6)))
> 
> val peopleSchema = StructType(Seq(
> StructField("name", StringType, nullable = true),
> StructField("group", IntegerType, nullable = true),
> StructField("data", IntegerType, nullable = true)))
> 
> val people = spark.createDataFrame(peopleRowsRDD, peopleSchema)
> people.createOrReplaceTempView("people")
> scala> people.show
> +----+-----+----+
> |name|group|data|
> +----+-----+----+
> |Fred|    8|   1|
> |Fred|    8|   2|
> |null|   10|   3|
> |null|   10|   4|
> | Amy|   12|   5|
> | Amy|   12|   6|
> +----+-----+----+
> // Now create a derived table from that table. It doesn't matter much what.
> val variantCounts = spark.sql("select name, count(distinct(name, group, 
> data)) as variant_count from people group by name having variant_count > 1")
> variantCounts.show
> +----+-------------+
> |name|variant_count|
> +----+-------------+
> |Fred|            2|
> |null|            2|
> | Amy|            2|
> +----+-------------+
> // Now try an inner join using the regular equalTo that drops nulls. This 
> works fine.
> val innerJoinEqualTo = variantCounts.join(people, 
> variantCounts("name").equalTo(people("name")))
> innerJoinEqualTo.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> | Amy|            2| Amy|   12|   5|
> | Amy|            2| Amy|   12|   6|
> +----+-------------+----+-----+----+
> // Okay now lets switch to the <=> operator
> //
> // If you haven't set spark.sql.crossJoin.enabled=true, you'll get an error 
> like
> // "Cartesian joins could be prohibitively expensive and are disabled by 
> default. To explicitly enable them, please set spark.sql.crossJoin.enabled = 
> true;"
> //
> // if you have enabled them, you'll get the table below.
> //
> // However, we really don't want or expect a Cartesian product!
> val innerJoinSqlNullSafeEqOp = variantCounts.join(people, 
> variantCounts("name")<=>(people("name")))
> innerJoinSqlNullSafeEqOp.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> |Fred|            2|null|   10|   3|
> |Fred|            2|null|   10|   4|
> |Fred|            2| Amy|   12|   5|
> |Fred|            2| Amy|   12|   6|
> |null|            2|Fred|    8|   1|
> |null|            2|Fred|    8|   2|
> |null|            2|null|   10|   3|
> |null|            2|null|   10|   4|
> |null|            2| Amy|   12|   5|
> |null|2| 

[jira] [Commented] (SPARK-20068) Twenty-two column coalesce has pool performance when codegen is open

2017-03-23 Thread QQShu1 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939820#comment-15939820
 ] 

QQShu1 commented on SPARK-20068:


@Takeshi Yamamuro thanks for your answer. V2.1 doesn't have this issue.

> Twenty-two column coalesce has pool performance when codegen is open
> 
>
> Key: SPARK-20068
> URL: https://issues.apache.org/jira/browse/SPARK-20068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: QQShu1
>
> When the query coalesces twenty-one columns the select performance is fine, but 
> with a twenty-two column coalesce the select performance is poor. 
> The 22-column coalesce select command I use is: 
> SELECT  
> SUM(COALESCE(col1,0)) AS CUSTOM1, 
> SUM(COALESCE(col2,0)) AS CUSTOM2, 
> SUM(COALESCE(col3,0)) AS CUSTOM3, 
> SUM(COALESCE(col4,0)) AS CUSTOM4, 
> SUM(COALESCE(col5,0)) AS CUSTOM5, 
> SUM(COALESCE(col6,0)) AS CUSTOM6, 
> SUM(COALESCE(col7,0)) AS CUSTOM7, 
> SUM(COALESCE(col8,0)) AS CUSTOM8, 
> SUM(COALESCE(col9,0)) AS CUSTOM9, 
> SUM(COALESCE(col10,0)) AS CUSTOM10, 
> SUM(COALESCE(col11,0)) AS CUSTOM11, 
> SUM(COALESCE(col12,0)) AS CUSTOM12, 
> SUM(COALESCE(col13,0)) AS CUSTOM13, 
> SUM(COALESCE(col14,0)) AS CUSTOM14, 
> SUM(COALESCE(col15,0)) AS CUSTOM15, 
> SUM(COALESCE(col16,0)) AS CUSTOM16, 
> SUM(COALESCE(col17,0)) AS CUSTOM17, 
> SUM(COALESCE(col18,0)) AS CUSTOM18, 
> SUM(COALESCE(col19,0)) AS CUSTOM19, 
> SUM(COALESCE(col20,0)) AS CUSTOM20, 
> SUM(COALESCE(col21,0)) AS CUSTOM21, 
> SUM(COALESCE(col22,0)) AS CUSTOM22
> FROM txt_table ;
> The 21-column select command is:
> SELECT  
> SUM(COALESCE(col1,0)) AS CUSTOM1, 
> SUM(COALESCE(col2,0)) AS CUSTOM2, 
> SUM(COALESCE(col3,0)) AS CUSTOM3, 
> SUM(COALESCE(col4,0)) AS CUSTOM4, 
> SUM(COALESCE(col5,0)) AS CUSTOM5, 
> SUM(COALESCE(col6,0)) AS CUSTOM6, 
> SUM(COALESCE(col7,0)) AS CUSTOM7, 
> SUM(COALESCE(col8,0)) AS CUSTOM8, 
> SUM(COALESCE(col9,0)) AS CUSTOM9, 
> SUM(COALESCE(col10,0)) AS CUSTOM10, 
> SUM(COALESCE(col11,0)) AS CUSTOM11, 
> SUM(COALESCE(col12,0)) AS CUSTOM12, 
> SUM(COALESCE(col13,0)) AS CUSTOM13, 
> SUM(COALESCE(col14,0)) AS CUSTOM14, 
> SUM(COALESCE(col15,0)) AS CUSTOM15, 
> SUM(COALESCE(col16,0)) AS CUSTOM16, 
> SUM(COALESCE(col17,0)) AS CUSTOM17, 
> SUM(COALESCE(col18,0)) AS CUSTOM18, 
> SUM(COALESCE(col19,0)) AS CUSTOM19, 
> SUM(COALESCE(col20,0)) AS CUSTOM20, 
> SUM(COALESCE(col21,0)) AS CUSTOM21
> FROM txt_table ;



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19959) df[java.lang.Long].collect throws NullPointerException if df includes null

2017-03-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19959:
---

Assignee: Kazuaki Ishizaki

> df[java.lang.Long].collect throws NullPointerException if df includes null
> --
>
> Key: SPARK-19959
> URL: https://issues.apache.org/jira/browse/SPARK-19959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> The following program throws {{NullPointerException}} during the execution of 
> Java code generated by the wholestage codegen.
> {code:java}
> sparkContext.parallelize(Seq[java.lang.Long](0L, null, 2L), 1).toDF.collect
> {code}
> {code}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:37)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:394)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19959) df[java.lang.Long].collect throws NullPointerException if df includes null

2017-03-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19959.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.0.3
   2.1.1

Issue resolved by pull request 17302
[https://github.com/apache/spark/pull/17302]

> df[java.lang.Long].collect throws NullPointerException if df includes null
> --
>
> Key: SPARK-19959
> URL: https://issues.apache.org/jira/browse/SPARK-19959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
> Fix For: 2.1.1, 2.0.3, 2.2.0
>
>
> The following program throws {{NullPointerException}} during the execution of 
> Java code generated by the wholestage codegen.
> {code:java}
> sparkContext.parallelize(Seq[java.lang.Long](0L, null, 2L), 1).toDF.collect
> {code}
> {code}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:37)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:394)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption

2017-03-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17204:

Fix Version/s: 2.0.3

> Spark 2.0 off heap RDD persistence with replication factor 2 leads to 
> in-memory data corruption
> ---
>
> Key: SPARK-17204
> URL: https://issues.apache.org/jira/browse/SPARK-17204
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>Assignee: Michael Allman
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> We use the {{OFF_HEAP}} storage level extensively with great success. We've 
> tried off-heap storage with replication factor 2 and have always received 
> exceptions on the executor side very shortly after starting the job. For 
> example:
> {code}
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> or
> {code}
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> 

[jira] [Assigned] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted

2017-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20043:


Assignee: (was: Apache Spark)

> CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" 
> on ML random forest and decision. Only "gini" and "entropy" (in lower case) 
> are accepted
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zied Sellami
>  Labels: starter
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> ParamGrid to test "gini" and "entropy" impurity. CrossValidatorModel is not 
> able to load the saved model when the impurity is not written in lowercase; I 
> get an error from Spark: "impurity Gini (Entropy) not recognized".
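
A minimal sketch of the kind of pipeline involved, assuming a RandomForestClassifier and made-up paths (not taken from the report); sticking to the lowercase spellings is the workaround until the loader is fixed:

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}

val rf = new RandomForestClassifier()

// Workaround sketch: use the lowercase spellings, which the loader accepts.
val grid = new ParamGridBuilder()
  .addGrid(rf.impurity, Array("gini", "entropy"))
  .build()

val cv = new CrossValidator()
  .setEstimator(new Pipeline().setStages(Array(rf)))
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

// cv.fit(training).write.overwrite().save("/tmp/cv-model")  // saving works with "Gini"/"Entropy" too
// CrossValidatorModel.load("/tmp/cv-model")                 // loading is what rejects the capitalized values
{code}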



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted

2017-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939774#comment-15939774
 ] 

Apache Spark commented on SPARK-20043:
--

User 'facaiy' has created a pull request for this issue:
https://github.com/apache/spark/pull/17407

> CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" 
> on ML random forest and decision. Only "gini" and "entropy" (in lower case) 
> are accepted
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zied Sellami
>  Labels: starter
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> ParamGrid to test "gini" and "entropy" impurity. CrossValidatorModel is not 
> able to load the saved model when the impurity is not written in lowercase; I 
> get an error from Spark: "impurity Gini (Entropy) not recognized".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted

2017-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20043:


Assignee: Apache Spark

> CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" 
> on ML random forest and decision. Only "gini" and "entropy" (in lower case) 
> are accepted
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zied Sellami
>Assignee: Apache Spark
>  Labels: starter
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> ParamGrid to test "gini" and "entropy" impurity. CrossValidatorModel is not 
> able to load the saved model when the impurity is not written in lowercase; I 
> get an error from Spark: "impurity Gini (Entropy) not recognized".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20009) Use user-friendly DDL formats for defining a schema in user-facing APIs

2017-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20009:


Assignee: (was: Apache Spark)

> Use user-friendly DDL formats for defining a schema  in user-facing APIs
> 
>
> Key: SPARK-20009
> URL: https://issues.apache.org/jira/browse/SPARK-20009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>
> In https://issues.apache.org/jira/browse/SPARK-19830, we added a new API in the 
> DDL parser to convert a DDL string into a schema. Then, we can use DDL 
> formats in some existing APIs, e.g., functions.from_json 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3062.
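
A hypothetical sketch of the user-facing usage this would enable; the from_json overload taking a DDL-format schema string is the proposal here, not an existing 2.1 API, and the data is made up:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json

val spark = SparkSession.builder().appName("ddl-schema-sketch").getOrCreate()
import spark.implicits._

val df = Seq("""{"a": 1, "b": "x"}""").toDF("json")

// Today a user builds a StructType by hand (or passes a JSON-format schema string);
// with the proposed DDL form the schema is just "a INT, b STRING".
df.select(from_json($"json", "a INT, b STRING", Map.empty[String, String]).as("parsed")).show()
{code}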



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20009) Use user-friendly DDL formats for defining a schema in user-facing APIs

2017-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939748#comment-15939748
 ] 

Apache Spark commented on SPARK-20009:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17406

> Use user-friendly DDL formats for defining a schema  in user-facing APIs
> 
>
> Key: SPARK-20009
> URL: https://issues.apache.org/jira/browse/SPARK-20009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>
> In https://issues.apache.org/jira/browse/SPARK-19830, we added a new API in the 
> DDL parser to convert a DDL string into a schema. Then, we can use DDL 
> formats in some existing APIs, e.g., functions.from_json 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3062.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20009) Use user-friendly DDL formats for defining a schema in user-facing APIs

2017-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20009:


Assignee: Apache Spark

> Use user-friendly DDL formats for defining a schema  in user-facing APIs
> 
>
> Key: SPARK-20009
> URL: https://issues.apache.org/jira/browse/SPARK-20009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>
> In https://issues.apache.org/jira/browse/SPARK-19830, we added a new API in the 
> DDL parser to convert a DDL string into a schema. Then, we can use DDL 
> formats in some existing APIs, e.g., functions.from_json 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3062.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions

2017-03-23 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939625#comment-15939625
 ] 

Liang-Chi Hsieh commented on SPARK-14083:
-

[~maropu] Thanks! That's great!

> Analyze JVM bytecode and turn closures into Catalyst expressions
> 
>
> Key: SPARK-14083
> URL: https://issues.apache.org/jira/browse/SPARK-14083
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> One big advantage of the Dataset API is the type safety, at the cost of 
> performance due to heavy reliance on user-defined closures/lambdas. These 
> closures are typically slower than expressions because we have more 
> flexibility to optimize expressions (known data types, no virtual function 
> calls, etc). In many cases, it's actually not going to be very difficult to 
> look into the byte code of these closures and figure out what they are trying 
> to do. If we can understand them, then we can turn them directly into 
> Catalyst expressions for more optimized executions.
> Some examples are:
> {code}
> df.map(_.name)  // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), 
> lit(18))
> df.map(_.id + 1)  // equivalent to Add(col("id"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for byte code analysis 
> and use that to convert closures/lambdas into Catalyst expressions in order 
> to speed up Dataset execution. It is a little bit futuristic, but I believe 
> it is very doable. The framework should be easy to reason about (e.g. similar 
> to Catalyst).
> Note that there is a big emphasis on "small" and "easy to reason about". A patch 
> should be rejected if it is too complicated or difficult to reason about.
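For illustration, a minimal sketch of the two forms involved, assuming an active SparkSession {{spark}}; deriving the second form automatically from the bytecode of the first is the goal of this ticket, not an existing feature:

{code}
// Hedged illustration: the closure-based form is opaque to Catalyst, while the
// expression-based form can be optimized (known types, no virtual calls).
import org.apache.spark.sql.functions.col
import spark.implicits._

case class Person(name: String, age: Int)
val ds = Seq(Person("a", 20), Person("b", 15)).toDS()

val viaClosure    = ds.filter(_.age > 18)       // lambda: deserializes to Person objects
val viaExpression = ds.filter(col("age") > 18)  // Catalyst expression: GreaterThan(col("age"), lit(18))
{code}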



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20079) Re registration of AM hangs spark cluster in yarn-client mode

2017-03-23 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-20079:
---

 Summary: Re registration of AM hangs spark cluster in yarn-client 
mode
 Key: SPARK-20079
 URL: https://issues.apache.org/jira/browse/SPARK-20079
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.1.0
Reporter: Guoqiang Li



1. Start cluster

echo -e "sc.parallelize(1 to 2000).foreach(_ => Thread.sleep(1000))" | 
./bin/spark-shell  --master yarn-client --executor-cores 1 --conf 
spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true 
--conf spark.dynamicAllocation.maxExecutors=2 

2.  Kill the AM process when a stage is scheduled. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema

2017-03-23 Thread Nathan Howell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939572#comment-15939572
 ] 

Nathan Howell commented on SPARK-19641:
---

Please pick it up if you have cycles and want to take it over, otherwise I'll 
get to it later next week. Thanks!

> JSON schema inference in DROPMALFORMED mode produces incorrect schema
> -
>
> Key: SPARK-19641
> URL: https://issues.apache.org/jira/browse/SPARK-19641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nathan Howell
>
> In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no 
> columns. This occurs when one document contains a valid JSON value (such as a 
> string or number) and the other documents contain objects or arrays.
> When the default case in {{JsonInferSchema.compatibleRootType}} is reached 
> when merging a {{StringType}} and a {{StructType}} the resulting type will be 
> a {{StringType}}, which is then discarded because a {{StructType}} is 
> expected.
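A minimal reproduction sketch under the conditions described above, assuming an active SparkSession {{spark}} (the exact inferred schema depends on the Spark version):

{code}
// Hedged sketch: one document is a bare JSON number, the other an object.
// Per this report, DROPMALFORMED may infer an empty schema instead of {a: bigint}.
import spark.implicits._

val docs = Seq("""123""", """{"a": 1}""").toDS()
val df = spark.read
  .option("mode", "DROPMALFORMED")
  .json(docs)            // json(Dataset[String]) in recent 2.x; older versions take an RDD[String]

df.printSchema()
{code}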



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accept

2017-03-23 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939571#comment-15939571
 ] 

Yan Facai (颜发才) edited comment on SPARK-20043 at 3/24/17 2:15 AM:
--

The bug can be reproduced.

I'd like to work on it.


was (Author: facai):
The bug can be reproduced by:

```scala
  test("cross validation with decision tree") {
val dt = new DecisionTreeClassifier()
val dtParamMaps = new ParamGridBuilder()
  .addGrid(dt.impurity, Array("Gini", "Entropy"))
  .build()
val eval = new BinaryClassificationEvaluator
val cv = new CrossValidator()
  .setEstimator(dt)
  .setEstimatorParamMaps(dtParamMaps)
  .setEvaluator(eval)
  .setNumFolds(3)
val cvModel = cv.fit(dataset)

// copied model must have the same parent.
val cv2 = testDefaultReadWrite(cvModel, testParams = false)
  }
```

I'd like to work on it.

> CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" 
> on ML random forest and decision. Only "gini" and "entropy" (in lower case) 
> are accepted
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zied Sellami
>  Labels: starter
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> ParamGrid to test "gini" and "entropy" impurity. CrossValidatorModel is not 
> able to load the saved model when the impurity is not written in lowercase. I 
> obtain an error from Spark: "impurity Gini (Entropy) not recognized".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted

2017-03-23 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939571#comment-15939571
 ] 

Yan Facai (颜发才) commented on SPARK-20043:
-

The bug can be reproduced by:

```scala
  test("cross validation with decision tree") {
val dt = new DecisionTreeClassifier()
val dtParamMaps = new ParamGridBuilder()
  .addGrid(dt.impurity, Array("Gini", "Entropy"))
  .build()
val eval = new BinaryClassificationEvaluator
val cv = new CrossValidator()
  .setEstimator(dt)
  .setEstimatorParamMaps(dtParamMaps)
  .setEvaluator(eval)
  .setNumFolds(3)
val cvModel = cv.fit(dataset)

    // copied model must have the same parent.
val cv2 = testDefaultReadWrite(cvModel, testParams = false)
  }
```

I'd like to work on it.

> CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" 
> on ML random forest and decision. Only "gini" and "entropy" (in lower case) 
> are accepted
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zied Sellami
>  Labels: starter
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> ParamGrid to test "gini" and "entropy" impurity. CrossValidatorModel is not 
> able to load the saved model when the impurity is not written in lowercase. I 
> obtain an error from Spark: "impurity Gini (Entropy) not recognized".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema

2017-03-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939570#comment-15939570
 ] 

Hyukjin Kwon commented on SPARK-19641:
--

Not sure, just IMHO, but it does not sound like a super urgent one, because schema 
inference is not recommended in production, to my knowledge. I think it is fine to 
take your time :).

Otherwise, I am willing to pick up your commit and open a PR, which would be 
credited to you, if you want.

> JSON schema inference in DROPMALFORMED mode produces incorrect schema
> -
>
> Key: SPARK-19641
> URL: https://issues.apache.org/jira/browse/SPARK-19641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nathan Howell
>
> In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no 
> columns. This occurs when one document contains a valid JSON value (such as a 
> string or number) and the other documents contain objects or arrays.
> When the default case in {{JsonInferSchema.compatibleRootType}} is reached 
> when merging a {{StringType}} and a {{StructType}} the resulting type will be 
> a {{StringType}}, which is then discarded because a {{StructType}} is 
> expected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are

2017-03-23 Thread 颜发才

 [ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Facai (颜发才) updated SPARK-20043:

Comment: was deleted

(was: [~zsellami] could you give an example of your code?

I tried to reproduce the bug:
```scala
val dt = new DecisionTreeRegressor()

val paramMaps = new ParamGridBuilder()
.addGrid(dt.impurity, Array("Gini", "Entropy"))
.build()
```
however, an IllegalArgumentException is thrown because Gini is not a valid parameter.)

> CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" 
> on ML random forest and decision. Only "gini" and "entropy" (in lower case) 
> are accepted
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zied Sellami
>  Labels: starter
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> ParamGrid to test "gini" and "entropy" impurity. CrossValidatorModel is not 
> able to load the saved model when the impurity is not written in lowercase. I 
> obtain an error from Spark: "impurity Gini (Entropy) not recognized".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted

2017-03-23 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939560#comment-15939560
 ] 

Yan Facai (颜发才) commented on SPARK-20043:
-

[~zsellami] could you give an example of your code?

I tried to reproduce the bug:
```scala
val dt = new DecisionTreeRegressor()

val paramMaps = new ParamGridBuilder()
.addGrid(dt.impurity, Array("Gini", "Entropy"))
.build()
```
however, an IllegalArgumentException is thrown because Gini is not a valid parameter.

> CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" 
> on ML random forest and decision. Only "gini" and "entropy" (in lower case) 
> are accepted
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zied Sellami
>  Labels: starter
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> ParamGrid to test "gini" and "entropy" impurity. CrossValidatorModel is not 
> able to load the saved model when the impurity is not written in lowercase. I 
> obtain an error from Spark: "impurity Gini (Entropy) not recognized".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19970) Table owner should be USER instead of PRINCIPAL in kerberized clusters

2017-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939551#comment-15939551
 ] 

Apache Spark commented on SPARK-19970:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17405

> Table owner should be USER instead of PRINCIPAL in kerberized clusters
> --
>
> Key: SPARK-19970
> URL: https://issues.apache.org/jira/browse/SPARK-19970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.1.1, 2.2.0
>
>
> In the kerberized hadoop cluster, when Spark creates tables, the owners of 
> tables are filled with PRINCIPAL strings instead of USER names. This is 
> inconsistent with Hive and causes problems when using ROLE in Hive. We had 
> better fix this.
> *BEFORE*
> {code}
> scala> sql("create table t(a int)").show
> scala> sql("desc formatted t").show(false)
> ...
> |Owner:  |sp...@example.com   
>   |   |
> {code}
> *AFTER*
> {code}
> scala> sql("create table t(a int)").show
> scala> sql("desc formatted t").show(false)
> ...
> |Owner:  |spark | 
>   |
> {code}
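For reference, a small sketch of the principal-vs-user distinction the fix relies on, assuming Hadoop's {{UserGroupInformation}} is on the classpath (illustrative only, not the actual patch):

{code}
// Hedged illustration: in a kerberized cluster the login user's full name is the
// principal, while the short name is the OS-style user that Hive expects as the owner.
import org.apache.hadoop.security.UserGroupInformation

val ugi = UserGroupInformation.getCurrentUser
println(ugi.getUserName)       // e.g. the principal form, like the BEFORE output above
println(ugi.getShortUserName)  // e.g. spark, like the AFTER output above
{code}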



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19636) Feature parity for correlation statistics in MLlib

2017-03-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-19636.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17108
[https://github.com/apache/spark/pull/17108]

> Feature parity for correlation statistics in MLlib
> --
>
> Key: SPARK-19636
> URL: https://issues.apache.org/jira/browse/SPARK-19636
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
> Fix For: 2.2.0
>
>
> This ticket tracks porting the functionality of spark.mllib.Statistics.corr() 
> over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema

2017-03-23 Thread Nathan Howell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939547#comment-15939547
 ] 

Nathan Howell commented on SPARK-19641:
---

[~hyukjin.kwon], I'm super busy through next Tuesday. I can get it open it 
before then but probably won't have time to do any work on it until later in 
the week. Are you trying to get this in before the 2.2 branch?

> JSON schema inference in DROPMALFORMED mode produces incorrect schema
> -
>
> Key: SPARK-19641
> URL: https://issues.apache.org/jira/browse/SPARK-19641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nathan Howell
>
> In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no 
> columns. This occurs when one document contains a valid JSON value (such as a 
> string or number) and the other documents contain objects or arrays.
> When the default case in {{JsonInferSchema.compatibleRootType}} is reached 
> when merging a {{StringType}} and a {{StructType}} the resulting type will be 
> a {{StringType}}, which is then discarded because a {{StructType}} is 
> expected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted

2017-03-23 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939533#comment-15939533
 ] 

Yan Facai (颜发才) commented on SPARK-20043:
-

Perhaps it's better to convert the impurity value when the setter method is invoked.
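A self-contained sketch of that idea, purely illustrative (the real param lives on the tree estimators, so the actual fix would look different):

{code}
// Hedged sketch: normalize the impurity string once, when it is set, so that
// "Gini" and "gini" are treated the same and the value round-trips through save/load.
object ImpurityNormalizer {
  private val supported = Set("gini", "entropy")

  def normalize(value: String): String = {
    val lower = value.toLowerCase(java.util.Locale.ROOT)
    require(supported.contains(lower), s"impurity $value not recognized")
    lower
  }
}

// ImpurityNormalizer.normalize("Gini") == "gini"
{code}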

> CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" 
> on ML random forest and decision. Only "gini" and "entropy" (in lower case) 
> are accepted
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zied Sellami
>  Labels: starter
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> ParamGrid to test "gini" and "entropy" impurity. CrossValidatorModel is not 
> able to load the saved model when the impurity is not written in lowercase. I 
> obtain an error from Spark: "impurity Gini (Entropy) not recognized".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema

2017-03-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939530#comment-15939530
 ] 

Hyukjin Kwon commented on SPARK-19641:
--

[~NathanHowell], I just happened to revisit this. Are you going to open a PR 
with the change?

> JSON schema inference in DROPMALFORMED mode produces incorrect schema
> -
>
> Key: SPARK-19641
> URL: https://issues.apache.org/jira/browse/SPARK-19641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nathan Howell
>
> In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no 
> columns. This occurs when one document contains a valid JSON value (such as a 
> string or number) and the other documents contain objects or arrays.
> When the default case in {{JsonInferSchema.compatibleRootType}} is reached 
> when merging a {{StringType}} and a {{StructType}} the resulting type will be 
> a {{StringType}}, which is then discarded because a {{StructType}} is 
> expected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19625) Authorization Support(on all operations not only DDL) in Spark Sql version 2.1.0

2017-03-23 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19625.
--
Resolution: Duplicate

It sounds like a duplicate of SPARK-8321.

Both JIRAs that the PRs indicate are resolved as {{Won't Fix}}.

> Authorization Support(on all operations not only DDL) in Spark Sql version 
> 2.1.0
> 
>
> Key: SPARK-19625
> URL: https://issues.apache.org/jira/browse/SPARK-19625
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: 翟玉勇
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19567) Support some Schedulable variables immutability and access

2017-03-23 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19567.

   Resolution: Fixed
 Assignee: Eren Avsarogullari
Fix Version/s: 2.2.0

> Support some Schedulable variables immutability and access
> --
>
> Key: SPARK-19567
> URL: https://issues.apache.org/jira/browse/SPARK-19567
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
>Assignee: Eren Avsarogullari
>Priority: Minor
> Fix For: 2.2.0
>
>
> Support some Schedulable variables immutability and access
> Some Schedulable variables need refactoring for immutability and access 
> modifiers as follows:
> - from vars to vals (if there is no requirement): This is important to support 
> immutability as much as possible. 
> Sample => Pool: weight, minShare, priority, name and 
> taskSetSchedulingAlgorithm.
> - access modifiers: Especially, vars access needs to be restricted from other 
> parts of the codebase to prevent potential side effects.
> Sample => TaskSetManager: tasksSuccessful, totalResultSize, calculatedTasks 
> etc...



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit

2017-03-23 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939511#comment-15939511
 ] 

Kazuaki Ishizaki commented on SPARK-19372:
--

I implemented the code to take care of it, and am waiting for the review.

> Code generation for Filter predicate including many OR conditions exceeds JVM 
> method size limit 
> 
>
> Key: SPARK-19372
> URL: https://issues.apache.org/jira/browse/SPARK-19372
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Jay Pranavamurthi
> Attachments: wide400cols.csv
>
>
> For the attached csv file, the code below causes the exception 
> "org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" 
> grows beyond 64 KB
> Code:
> {code:borderStyle=solid}
>   val conf = new SparkConf().setMaster("local[1]")
>   val sqlContext = 
> SparkSession.builder().config(conf).getOrCreate().sqlContext
>   val dataframe =
> sqlContext
>   .read
>   .format("com.databricks.spark.csv")
>   .load("wide400cols.csv")
>   val filter = (0 to 399)
> .foldLeft(lit(false))((e, index) => 
> e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}"))
>   val filtered = dataframe.filter(filter)
>   filtered.show(100)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit

2017-03-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939480#comment-15939480
 ] 

Hyukjin Kwon commented on SPARK-19372:
--

I have seen this too before.

> Code generation for Filter predicate including many OR conditions exceeds JVM 
> method size limit 
> 
>
> Key: SPARK-19372
> URL: https://issues.apache.org/jira/browse/SPARK-19372
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Jay Pranavamurthi
> Attachments: wide400cols.csv
>
>
> For the attached csv file, the code below causes the exception 
> "org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" 
> grows beyond 64 KB
> Code:
> {code:borderStyle=solid}
>   val conf = new SparkConf().setMaster("local[1]")
>   val sqlContext = 
> SparkSession.builder().config(conf).getOrCreate().sqlContext
>   val dataframe =
> sqlContext
>   .read
>   .format("com.databricks.spark.csv")
>   .load("wide400cols.csv")
>   val filter = (0 to 399)
> .foldLeft(lit(false))((e, index) => 
> e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}"))
>   val filtered = dataframe.filter(filter)
>   filtered.show(100)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10849) Allow user to specify database column type for data frame fields when writing data to jdbc data sources.

2017-03-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-10849.
-
   Resolution: Fixed
 Assignee: Suresh Thalamati
Fix Version/s: 2.2.0

> Allow user to specify database column type for data frame fields when writing 
> data to jdbc data sources. 
> -
>
> Key: SPARK-10849
> URL: https://issues.apache.org/jira/browse/SPARK-10849
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Suresh Thalamati
>Assignee: Suresh Thalamati
>Priority: Minor
> Fix For: 2.2.0
>
>
> Mapping a data frame field type to a database column type is addressed to a large 
> extent by adding dialects, and by adding the maxlength option in SPARK-10101 to 
> set the VARCHAR length. 
> In some cases it is hard to determine the max supported VARCHAR size. For 
> example, DB2 z/OS VARCHAR size depends on the page size, and some databases 
> also have ROW SIZE limits for VARCHAR. Specifying a default CLOB for all String 
> columns will likely make reads/writes slow. 
> Allowing users to specify the database type corresponding to a data frame field 
> will be useful in cases where users want to fine-tune the mapping for one or two 
> fields and are fine with the default for all other fields. 
> I propose to make the following two properties available for users to set in 
> the data frame metadata when writing to JDBC data sources.
> database.column.type -- column type to use for create table.
> jdbc.column.type -- jdbc type to use for setting null values. 
> Example :
>   val secdf = sc.parallelize( Array(("Apple","Revenue ..."), 
> ("Google","Income:123213"))).toDF("name", "report")
>   val  metadataBuilder = new MetadataBuilder()
>   metadataBuilder.putString("database.column.type", "CLOB(100K)")
>   metadataBuilder.putLong("jdbc.type", java.sql.Types.CLOB)
>   val metadata = metadataBuilder.build()
>   val secReportDF = secdf.withColumn("report", col("report").as("report", 
> metadata))
>   secReportDF.write.jdbc("jdbc:mysql:///secdata", "reports", mysqlProps)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11300) Support for string length when writing to JDBC

2017-03-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-11300.
-
Resolution: Duplicate

> Support for string length when writing to JDBC
> --
>
> Key: SPARK-11300
> URL: https://issues.apache.org/jira/browse/SPARK-11300
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>
> Right now every StringType field is written to JDBC as TEXT.
> I'd like to have the option to write it as VARCHAR(size).
> Maybe we could use StringType(size)?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19496) to_date with format has weird behavior

2017-03-23 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939508#comment-15939508
 ] 

Josh Rosen commented on SPARK-19496:


Let's make sure to document this clearly in the release notes. I just spent a 
bunch of time debugging what turned out to be an invalid pattern which happened 
to work by accident before but now is returning {{null}}. Adding this to the 
release notes may save some pain when folks upgrade. 

> to_date with format has weird behavior
> --
>
> Key: SPARK-19496
> URL: https://issues.apache.org/jira/browse/SPARK-19496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Wenchen Fan
>Assignee: Song Jun
>  Labels: release-notes
> Fix For: 2.2.0
>
>
> Today, if we run
> {code}
> SELECT to_date('2015-07-22', 'yyyy-dd-MM')
> {code}
> it will result in `2016-10-07`, while running
> {code}
> SELECT to_date('2014-31-12')   # default format
> {code}
> will return null.
> This behavior is weird and we should check other systems like Hive to see if 
> this is expected.
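A small sketch of why the swapped pattern 'works' by accident, using plain Java date parsing to make the roll-over explicit (illustrative only; Spark's own parsing path differs):

{code}
// Hedged illustration: with pattern yyyy-dd-MM, "2015-07-22" parses as day 07 and
// month 22; lenient parsing rolls the extra months over, landing on 2016-10-07.
import java.text.SimpleDateFormat

val fmt = new SimpleDateFormat("yyyy-dd-MM")
fmt.setLenient(true)             // lenient is the default; made explicit here
println(fmt.parse("2015-07-22"))
{code}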



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19496) to_date with format has weird behavior

2017-03-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-19496:
---
Labels: release-notes  (was: )

> to_date with format has weird behavior
> --
>
> Key: SPARK-19496
> URL: https://issues.apache.org/jira/browse/SPARK-19496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Wenchen Fan
>Assignee: Song Jun
>  Labels: release-notes
> Fix For: 2.2.0
>
>
> Today, if we run
> {code}
> SELECT to_date('2015-07-22', 'yyyy-dd-MM')
> {code}
> it will result in `2016-10-07`, while running
> {code}
> SELECT to_date('2014-31-12')   # default format
> {code}
> will return null.
> This behavior is weird and we should check other systems like Hive to see if 
> this is expected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-10101) Spark JDBC writer mapping String to TEXT or VARCHAR

2017-03-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-10101:
-

> Spark JDBC writer mapping String to TEXT or VARCHAR
> ---
>
> Key: SPARK-10101
> URL: https://issues.apache.org/jira/browse/SPARK-10101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Rama Mullapudi
> Fix For: 2.2.0
>
>
> Currently the JDBC writer maps the String data type to TEXT on the database, but 
> VARCHAR is the ANSI SQL standard, and some older databases like Oracle, DB2, 
> Teradata, etc. do not support TEXT as a data type.
> Since VARCHAR needs a max length to be specified and different databases 
> support different max values, what will be the best way to implement this?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10101) Spark JDBC writer mapping String to TEXT or VARCHAR

2017-03-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-10101.
-
   Resolution: Duplicate
Fix Version/s: 2.2.0

> Spark JDBC writer mapping String to TEXT or VARCHAR
> ---
>
> Key: SPARK-10101
> URL: https://issues.apache.org/jira/browse/SPARK-10101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Rama Mullapudi
> Fix For: 2.2.0
>
>
> Currently the JDBC writer maps the String data type to TEXT on the database, but 
> VARCHAR is the ANSI SQL standard, and some older databases like Oracle, DB2, 
> Teradata, etc. do not support TEXT as a data type.
> Since VARCHAR needs a max length to be specified and different databases 
> support different max values, what will be the best way to implement this?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10101) Spark JDBC writer mapping String to TEXT or VARCHAR

2017-03-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-10101:

Comment: was deleted

(was: This has been resolved in the master. If you still hit any bug, please 
open a new JIRA.)

> Spark JDBC writer mapping String to TEXT or VARCHAR
> ---
>
> Key: SPARK-10101
> URL: https://issues.apache.org/jira/browse/SPARK-10101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Rama Mullapudi
> Fix For: 2.2.0
>
>
> Currently the JDBC writer maps the String data type to TEXT on the database, but 
> VARCHAR is the ANSI SQL standard, and some older databases like Oracle, DB2, 
> Teradata, etc. do not support TEXT as a data type.
> Since VARCHAR needs a max length to be specified and different databases 
> support different max values, what will be the best way to implement this?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19868) conflict TasksetManager lead to spark stopped

2017-03-23 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19868:
---
Target Version/s: 2.2.0

> conflict TasksetManager lead to spark stopped
> -
>
> Key: SPARK-19868
> URL: https://issues.apache.org/jira/browse/SPARK-19868
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: liujianhui
>
> ## Scenario
> A conflicting TaskSetManager throws an exception, which leads to the SparkContext 
> being stopped. Log: 
> {code}
> java.lang.IllegalStateException: more than one active taskSet for stage 
> 4571114: 4571114.2,4571114.1
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> The reason is that the resubmitted stage conflicts with the running 
> stage; the missing tasks of the stage are resubmitted because the zombie flag of 
> the TaskSetManager was set to true
> {code}
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Resubmitting
>  ShuffleMapStage 4571114 (map at MainApp.scala:73) because some of its tasks 
> had failed: 0
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Submitting
>  ShuffleMapStage 4571114 (MapPartitionsRDD[3719544] at map at 
> MainApp.scala:73), which has no missing parents
> {code}
> the executor which the shuffle task ran on was lost
> {code}
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:27.427][org.apache.spark.scheduler.DAGScheduler]Ignoring
>  possibly bogus ShuffleMapTask(4571114, 0) completion from executor 4
> {code}
> the time when the task set finished and the stage was resubmitted
> {code}
> handleSuccessfuleTask
> [INFO][task-result-getter-2][2017-03-03+22:16:29.999][org.apache.spark.scheduler.TaskSchedulerImpl]Removed
>  TaskSet 4571114.1, whose tasks have all completed, from pool 
> resubmit stage
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.549][org.apache.spark.scheduler.TaskSchedulerImpl]Adding
>  task set 4571114.2 with 1 tasks
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20078) Mesos executor configurability for task name and labels

2017-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939436#comment-15939436
 ] 

Apache Spark commented on SPARK-20078:
--

User 'kalvinnchau' has created a pull request for this issue:
https://github.com/apache/spark/pull/17404

> Mesos executor configurability for task name and labels
> ---
>
> Key: SPARK-20078
> URL: https://issues.apache.org/jira/browse/SPARK-20078
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Kalvin Chau
>Priority: Minor
>
> Add the ability to configure the Mesos task name as well as to add labels to 
> the Mesos ExecutorInfo protobuf message.
> Currently all executors that are spun up are named Task X (where X is the 
> executor number). 
> For centralized logging it would be nice to be able to have "SparkJob1 X" as the 
> name, as well as allowing users to add any labels they would want.
> In this PR I chose "k1:v1,k2:v2" as the format, with colons separating key and 
> value and commas listing out more than one.
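A minimal sketch of parsing that format, purely illustrative and not the code in the linked pull request:

{code}
// Hedged sketch: turn "k1:v1,k2:v2" into a Map of labels; colons separate key and
// value, commas separate entries (malformed entries are not handled here).
def parseLabels(spec: String): Map[String, String] =
  spec.split(",").filter(_.nonEmpty).map { kv =>
    val Array(key, value) = kv.split(":", 2)
    key -> value
  }.toMap

// parseLabels("k1:v1,k2:v2") == Map("k1" -> "v1", "k2" -> "v2")
{code}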



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20078) Mesos executor configurability for task name and labels

2017-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20078:


Assignee: (was: Apache Spark)

> Mesos executor configurability for task name and labels
> ---
>
> Key: SPARK-20078
> URL: https://issues.apache.org/jira/browse/SPARK-20078
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Kalvin Chau
>Priority: Minor
>
> Add in the ability to configure the mesos task name as well as add labels to 
> the Mesos ExecutorInfo protobuf message.
> Currently all executors that are spun up are named Task X (where X is the 
> executor number). 
> For centralized logging it would be nice to be able to have SparkJob1 X then 
> Name, as well as allowing users to add any labels they would want.
> In this PR I chose "k1:v1,k2:v2" as the format, colons separating key-value 
> and commas to list out more than one.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20078) Mesos executor configurability for task name and labels

2017-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20078:


Assignee: Apache Spark

> Mesos executor configurability for task name and labels
> ---
>
> Key: SPARK-20078
> URL: https://issues.apache.org/jira/browse/SPARK-20078
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Kalvin Chau
>Assignee: Apache Spark
>Priority: Minor
>
> Add in the ability to configure the mesos task name as well as add labels to 
> the Mesos ExecutorInfo protobuf message.
> Currently all executors that are spun up are named Task X (where X is the 
> executor number). 
> For centralized logging it would be nice to be able to have SparkJob1 X then 
> Name, as well as allowing users to add any labels they would want.
> In this PR I chose "k1:v1,k2:v2" as the format, colons separating key-value 
> and commas to list out more than one.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20078) Mesos executor configurability for task name and labels

2017-03-23 Thread Kalvin Chau (JIRA)
Kalvin Chau created SPARK-20078:
---

 Summary: Mesos executor configurability for task name and labels
 Key: SPARK-20078
 URL: https://issues.apache.org/jira/browse/SPARK-20078
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 2.1.0
Reporter: Kalvin Chau
Priority: Minor


Add in the ability to configure the mesos task name as well as add labels to 
the Mesos ExecutorInfo protobuf message.

Currently all executors that are spun up are named Task X (where X is the 
executor number). 

For centralized logging it would be nice to be able to have SparkJob1 X then 
Name, as well as allowing users to add any labels they would want.

In this PR I chose "k1:v1,k2:v2" as the format, colons separating key-value and 
commas to list out more than one.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown

2017-03-23 Thread Sasaki Toru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sasaki Toru updated SPARK-20050:

Description: 
I use Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' and 
call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as below:

{code}
val kafkaStream = KafkaUtils.createDirectStream[String, String](...)

kafkaStream.map { input =>
  "key: " + input.key.toString + " value: " + input.value.toString + " offset: " + input.offset.toString
}.foreachRDD { rdd =>
  rdd.foreach { input =>
    println(input)
  }
}

kafkaStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
{code}

Some records which were processed in the last batch before a graceful Streaming 
shutdown are reprocessed in the first batch after Spark Streaming restarts, as below:

* output of the first run of this application
{code}
key: null value: 1 offset: 101452472
key: null value: 2 offset: 101452473
key: null value: 3 offset: 101452474
key: null value: 4 offset: 101452475
key: null value: 5 offset: 101452476
key: null value: 6 offset: 101452477
key: null value: 7 offset: 101452478
key: null value: 8 offset: 101452479
key: null value: 9 offset: 101452480  // this is the last record before shutting down Spark Streaming gracefully
{code}

* output of the re-run of this application
{code}
key: null value: 7 offset: 101452478   // duplication
key: null value: 8 offset: 101452479   // duplication
key: null value: 9 offset: 101452480   // duplication
key: null value: 10 offset: 101452481
{code}

This may be because the offsets specified in commitAsync are committed at the 
beginning of the next batch.


  was:
I use Kafka 0.10 DirectStream with properties 'enable.auto.commit=false' and 
call 'DirectKafkaInputDStream#commitAsync' finally in each batches,  such below

{code}
val kafkaStream = KafkaUtils.createDirectStream[String, String](...)

kafkaStream.map { input =>
  "key: " + input.key.toString + " value: " + input.value.toString + " offset: 
" + input.offset.toString
  }.foreachRDD { rdd =>
rdd.foreach { input =>
println(input)
  }
}

kafkaStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
{code}

Some records which processed in the last batch before Streaming graceful 
shutdown reprocess in the first batch after Spark Streaming restart, such below

* output first run of this application
{code}
key: null value: 1 offset: 101452472
key: null value: 2 offset: 101452473
key: null value: 3 offset: 101452474
key: null value: 4 offset: 101452475
key: null value: 5 offset: 101452476
key: null value: 6 offset: 101452477
key: null value: 7 offset: 101452478
key: null value: 8 offset: 101452479
key: null value: 9 offset: 101452480
{code}

* output re-run of this application
{code}
key: null value: 7 offset: 101452478   // duplication
key: null value: 8 offset: 101452479   // duplication
key: null value: 9 offset: 101452480   // duplication
key: null value: 10 offset: 101452481
{code}

It may cause offsets specified in commitAsync will commit in the head of next 
batch.



> Kafka 0.10 DirectStream doesn't commit last processed batch's offset when 
> graceful shutdown
> ---
>
> Key: SPARK-20050
> URL: https://issues.apache.org/jira/browse/SPARK-20050
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Sasaki Toru
>
> I use Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' and 
> call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as 
> below
> {code}
> val kafkaStream = KafkaUtils.createDirectStream[String, String](...)
> kafkaStream.map { input =>
>   "key: " + input.key.toString + " value: " + input.value.toString + " 
> offset: " + input.offset.toString
>   }.foreachRDD { rdd =>
> rdd.foreach { input =>
> println(input)
>   }
> }
> kafkaStream.foreachRDD { rdd =>
>   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>   kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
> }
> {code}
> Some records which were processed in the last batch before a graceful Streaming 
> shutdown are reprocessed in the first batch after Spark Streaming restarts, as 
> below
> * output first run of this application
> {code}
> key: null value: 1 offset: 101452472
> key: null value: 2 offset: 101452473
> key: null value: 3 offset: 101452474
> key: null value: 4 offset: 101452475
> key: null value: 5 offset: 101452476
> key: null value: 6 offset: 101452477
> key: null value: 7 offset: 101452478
> key: null value: 8 offset: 101452479
> key: null value: 9 offset: 101452480  // 

[jira] [Updated] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown

2017-03-23 Thread Sasaki Toru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sasaki Toru updated SPARK-20050:

Description: 
I use Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' and 
call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as below:

{code}
val kafkaStream = KafkaUtils.createDirectStream[String, String](...)

kafkaStream.map { input =>
  "key: " + input.key.toString + " value: " + input.value.toString + " offset: " + input.offset.toString
}.foreachRDD { rdd =>
  rdd.foreach { input =>
    println(input)
  }
}

kafkaStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
{code}

Some records which were processed in the last batch before a graceful Streaming 
shutdown are reprocessed in the first batch after Spark Streaming restarts, as below:

* output of the first run of this application
{code}
key: null value: 1 offset: 101452472
key: null value: 2 offset: 101452473
key: null value: 3 offset: 101452474
key: null value: 4 offset: 101452475
key: null value: 5 offset: 101452476
key: null value: 6 offset: 101452477
key: null value: 7 offset: 101452478
key: null value: 8 offset: 101452479
key: null value: 9 offset: 101452480
{code}

* output of the re-run of this application
{code}
key: null value: 7 offset: 101452478   // duplication
key: null value: 8 offset: 101452479   // duplication
key: null value: 9 offset: 101452480   // duplication
key: null value: 10 offset: 101452481
{code}

This may be because the offsets specified in commitAsync are committed at the 
beginning of the next batch.


  was:
I use Kafka 0.10 DirectStream with properties 'enable.auto.commit=false' and 
call 'DirectKafkaInputDStream#commitAsync' finally in each batches,  such below

{code}
val kafkaStream = KafkaUtils.createDirectStream[String, String](...)

kafkaStream.map { input =>
  "key: " + input.key.toString + " value: " + input.value.toString + " offset: 
" + input.offset.toString
  }.foreachRDD { rdd =>
rdd.foreach { input =>
println(input)
  }
}

kafkaStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
{code}

Some records which processed in the last batch before Streaming graceful 
shutdown reprocess in the first batch after Spark Streaming restart.

It may cause offsets specified in commitAsync will commit in the head of next 
batch.



> Kafka 0.10 DirectStream doesn't commit last processed batch's offset when 
> graceful shutdown
> ---
>
> Key: SPARK-20050
> URL: https://issues.apache.org/jira/browse/SPARK-20050
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Sasaki Toru
>
> I use Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' and 
> call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as 
> below
> {code}
> val kafkaStream = KafkaUtils.createDirectStream[String, String](...)
> kafkaStream.map { input =>
>   "key: " + input.key.toString + " value: " + input.value.toString + " 
> offset: " + input.offset.toString
>   }.foreachRDD { rdd =>
> rdd.foreach { input =>
> println(input)
>   }
> }
> kafkaStream.foreachRDD { rdd =>
>   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>   kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
> }
> {code}
> Some records which were processed in the last batch before a graceful Streaming 
> shutdown are reprocessed in the first batch after Spark Streaming restarts, as 
> below
> * output first run of this application
> {code}
> key: null value: 1 offset: 101452472
> key: null value: 2 offset: 101452473
> key: null value: 3 offset: 101452474
> key: null value: 4 offset: 101452475
> key: null value: 5 offset: 101452476
> key: null value: 6 offset: 101452477
> key: null value: 7 offset: 101452478
> key: null value: 8 offset: 101452479
> key: null value: 9 offset: 101452480
> {code}
> * output re-run of this application
> {code}
> key: null value: 7 offset: 101452478   // duplication
> key: null value: 8 offset: 101452479   // duplication
> key: null value: 9 offset: 101452480   // duplication
> key: null value: 10 offset: 101452481
> {code}
> This may be because the offsets specified in commitAsync are committed at the 
> beginning of the next batch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2017-03-23 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939347#comment-15939347
 ] 

Xiao Li commented on SPARK-13333:
-

[~josephkb] In the SQL specification, the set operations merge columns by 
column position. However, it also provides another keyword, {{CORRESPONDING}}. 
If users specify that keyword with set operations, we will merge them by column 
names, regardless of their positions. Do you want this feature? I can do it in 
the next release, since Spark 2.2 will be released soon. 

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2017-03-23 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939323#comment-15939323
 ] 

Joseph K. Bradley commented on SPARK-13333:
---

[~smilegator] I wouldn't call that result "right."  That's definitely a bug 
unless the SQL specification is really messed up.

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7200) Tungsten test suites should fail if memory leak is detected

2017-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7200:
---

Assignee: Apache Spark

> Tungsten test suites should fail if memory leak is detected
> ---
>
> Key: SPARK-7200
> URL: https://issues.apache.org/jira/browse/SPARK-7200
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Tests
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> We should be able to detect whether there is unreturned memory after each 
> suite and fail the suite.
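One possible shape of such a check, sketched with a hypothetical {{outstandingAllocatedBytes}} accessor (the real hook would query Spark's memory manager; the names here are illustrative only):

{code}
// Hedged sketch: a ScalaTest mixin that fails a test if memory is still allocated
// after it finishes. `outstandingAllocatedBytes` is a hypothetical accessor.
import org.scalatest.{BeforeAndAfterEach, Suite}

trait MemoryLeakCheck extends BeforeAndAfterEach { self: Suite =>
  protected def outstandingAllocatedBytes(): Long

  override protected def afterEach(): Unit = {
    try super.afterEach()
    finally assert(outstandingAllocatedBytes() == 0L,
      s"memory leak detected: ${outstandingAllocatedBytes()} bytes not returned")
  }
}
{code}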



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7200) Tungsten test suites should fail if memory leak is detected

2017-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7200:
---

Assignee: (was: Apache Spark)

> Tungsten test suites should fail if memory leak is detected
> ---
>
> Key: SPARK-7200
> URL: https://issues.apache.org/jira/browse/SPARK-7200
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Tests
>Reporter: Reynold Xin
>
> We should be able to detect whether there is unreturned memory after each 
> suite and fail the suite if so.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7200) Tungsten test suites should fail if memory leak is detected

2017-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939318#comment-15939318
 ] 

Apache Spark commented on SPARK-7200:
-

User 'jsoltren' has created a pull request for this issue:
https://github.com/apache/spark/pull/17402

> Tungsten test suites should fail if memory leak is detected
> ---
>
> Key: SPARK-7200
> URL: https://issues.apache.org/jira/browse/SPARK-7200
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Tests
>Reporter: Reynold Xin
>
> We should be able to detect whether there is unreturned memory after each 
> suite and fail the suite if so.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19970) Table owner should be USER instead of PRINCIPAL in kerberized clusters

2017-03-23 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-19970:
---
Fix Version/s: 2.1.1

> Table owner should be USER instead of PRINCIPAL in kerberized clusters
> --
>
> Key: SPARK-19970
> URL: https://issues.apache.org/jira/browse/SPARK-19970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.1.1, 2.2.0
>
>
> In a kerberized Hadoop cluster, when Spark creates tables, the table owner is 
> filled with the PRINCIPAL string instead of the USER name. This is 
> inconsistent with Hive and causes problems when using ROLE in Hive. We should 
> fix this.
> *BEFORE*
> {code}
> scala> sql("create table t(a int)").show
> scala> sql("desc formatted t").show(false)
> ...
> |Owner:  |sp...@example.com   
>   |   |
> {code}
> *AFTER*
> {code}
> scala> sql("create table t(a int)").show
> scala> sql("desc formatted t").show(false)
> ...
> |Owner:  |spark | 
>   |
> {code}
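
For reference, a rough sketch of how the short user name relates to the 
principal via Hadoop's UserGroupInformation; this illustrates the mapping the 
fix needs, not the actual patch:

{code}
// Sketch only: inspect the current user's principal and short name.
import org.apache.hadoop.security.UserGroupInformation

val ugi = UserGroupInformation.getCurrentUser()
println(ugi.getUserName())       // on a kerberized cluster, the principal form
println(ugi.getShortUserName())  // the short user name, e.g. "spark"
{code}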



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20077) Documentation for ml.stats.Correlation

2017-03-23 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-20077:
--

 Summary: Documentation for ml.stats.Correlation
 Key: SPARK-20077
 URL: https://issues.apache.org/jira/browse/SPARK-20077
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.1.0
Reporter: Timothy Hunter


Now that (Pearson) correlations are available in spark.ml, we need to write 
some documentation to go along with this feature. For now, the documentation 
can simply adapt the examples from the existing unit tests.
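
A minimal sketch of the kind of snippet the new section could include, adapted 
from the style of the unit tests; the class and package names (e.g. 
org.apache.spark.ml.stat.Correlation) are assumed from the Scala side:

{code}
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val data = Seq(
  Vectors.dense(1.0, 0.0, 3.0),
  Vectors.dense(4.0, 5.0, 0.0),
  Vectors.dense(6.0, 7.0, 8.0)
).map(Tuple1.apply)

val df = spark.createDataFrame(data).toDF("features")

// Pearson is the default method; the result is a one-row DataFrame holding the matrix.
val Row(pearson: Matrix) = Correlation.corr(df, "features").head
println(pearson)
{code}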




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20076) Python interface for ml.stats.Correlation

2017-03-23 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-20076:
--

 Summary: Python interface for ml.stats.Correlation
 Key: SPARK-20076
 URL: https://issues.apache.org/jira/browse/SPARK-20076
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.1.0
Reporter: Timothy Hunter


The (Pearson) correlation statistics have been exposed through a DataFrame-based 
API in the Scala interface as part of SPARK-19636. We should now make them 
available in Python.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19876) Add OneTime trigger executor

2017-03-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-19876.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17219
[https://github.com/apache/spark/pull/17219]

> Add OneTime trigger executor
> 
>
> Key: SPARK-19876
> URL: https://issues.apache.org/jira/browse/SPARK-19876
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>Assignee: Tyson Condie
> Fix For: 2.2.0
>
>
> The goal is to add a new trigger executor that will process a single trigger 
> then stop. 
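
For context, a hedged sketch of how such a one-time trigger would be used from 
the DataStreamWriter side; the source, paths, and the exact API name 
(Trigger.Once()) are assumptions for illustration:

{code}
import org.apache.spark.sql.streaming.Trigger

// Illustrative source; any streaming DataFrame would do.
val input = spark.readStream.format("rate").load()

val query = input.writeStream
  .format("parquet")
  .option("path", "/tmp/one-time-output")             // hypothetical paths
  .option("checkpointLocation", "/tmp/one-time-ckpt")
  .trigger(Trigger.Once())                            // process available data once, then stop
  .start()

query.awaitTermination()
{code}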



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20075) Support classifier, packaging in Maven coordinates

2017-03-23 Thread Sean Owen (JIRA)
Sean Owen created SPARK-20075:
-

 Summary: Support classifier, packaging in Maven coordinates
 Key: SPARK-20075
 URL: https://issues.apache.org/jira/browse/SPARK-20075
 Project: Spark
  Issue Type: Improvement
  Components: Spark Shell, Spark Submit
Affects Versions: 2.1.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor


Currently, it's possible to add dependencies to an app using their Maven 
coordinates on the command line: {{group:artifact:version}}. However, Maven 
coordinates are really 5-dimensional: 
{{group:artifact:packaging:classifier:version}}. In some rare but real cases 
it's important to be able to specify the classifier, and while we're at it, why 
not try to support packaging too?

I have a WIP PR that I'll post soon.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18364) Expose metrics for YarnShuffleService

2017-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939194#comment-15939194
 ] 

Apache Spark commented on SPARK-18364:
--

User 'ash211' has created a pull request for this issue:
https://github.com/apache/spark/pull/17401

> Expose metrics for YarnShuffleService
> -
>
> Key: SPARK-18364
> URL: https://issues.apache.org/jira/browse/SPARK-18364
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.1
>Reporter: Steven Rand
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> ExternalShuffleService exposes metrics as of SPARK-16405. However, 
> YarnShuffleService does not.
> The work of instrumenting ExternalShuffleBlockHandler was already done in 
> SPARK-16405, so this JIRA is for creating a MetricsSystem in 
> YarnShuffleService similarly to how ExternalShuffleService already does it.
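
A rough sketch of the Dropwizard pattern being referred to, i.e. registering a 
MetricSet with a MetricRegistry; the metric name and the way its value is 
obtained are illustrative, not the actual YarnShuffleService code:

{code}
import java.util
import com.codahale.metrics.{Gauge, Metric, MetricRegistry, MetricSet}

// Illustrative MetricSet; a real implementation would read counters from the
// shuffle block handler rather than the placeholder function used here.
class ShuffleServiceMetrics(openBlockRequests: () => Long) extends MetricSet {
  override def getMetrics: util.Map[String, Metric] = {
    val metrics = new util.HashMap[String, Metric]()
    metrics.put("openBlockRequests", new Gauge[Long] {
      override def getValue: Long = openBlockRequests()
    })
    metrics
  }
}

val registry = new MetricRegistry()
registry.registerAll(new ShuffleServiceMetrics(() => 0L))  // 0L is a placeholder value
{code}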



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18364) Expose metrics for YarnShuffleService

2017-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18364:


Assignee: Apache Spark

> Expose metrics for YarnShuffleService
> -
>
> Key: SPARK-18364
> URL: https://issues.apache.org/jira/browse/SPARK-18364
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.1
>Reporter: Steven Rand
>Assignee: Apache Spark
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> ExternalShuffleService exposes metrics as of SPARK-16405. However, 
> YarnShuffleService does not.
> The work of instrumenting ExternalShuffleBlockHandler was already done in 
> SPARK-16405, so this JIRA is for creating a MetricsSystem in 
> YarnShuffleService similarly to how ExternalShuffleService already does it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18364) Expose metrics for YarnShuffleService

2017-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18364:


Assignee: (was: Apache Spark)

> Expose metrics for YarnShuffleService
> -
>
> Key: SPARK-18364
> URL: https://issues.apache.org/jira/browse/SPARK-18364
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.1
>Reporter: Steven Rand
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> ExternalShuffleService exposes metrics as of SPARK-16405. However, 
> YarnShuffleService does not.
> The work of instrumenting ExternalShuffleBlockHandler was already done in 
> SPARK-16405, so this JIRA is for creating a MetricsSystem in 
> YarnShuffleService similarly to how ExternalShuffleService already does it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications

2017-03-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939191#comment-15939191
 ] 

Marcelo Vanzin commented on SPARK-18085:


I'm also keeping the current activity at 
https://github.com/vanzin/spark/commits/shs-ng/HEAD so I don't need to 
constantly add new branch links here going forward.

> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20073) Unexpected Cartesian product when using eqNullSafe in join with a derived table

2017-03-23 Thread Everett Anderson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Everett Anderson updated SPARK-20073:
-
Description: 
It appears that if you try to join tables A and B, where B is derived from A, 
and you use the eqNullSafe / <=> operator for the join condition, Spark 
performs a Cartesian product.

However, if you perform the join on two tables that contain the same data but 
have no lineage relationship, the expected non-Cartesian-product join occurs.

{noformat}
// Create some fake data.

import org.apache.spark.sql.Row
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions

val peopleRowsRDD = sc.parallelize(Seq(
Row("Fred", 8, 1),
Row("Fred", 8, 2),
Row(null, 10, 3),
Row(null, 10, 4),
Row("Amy", 12, 5),
Row("Amy", 12, 6)))

val peopleSchema = StructType(Seq(
StructField("name", StringType, nullable = true),
StructField("group", IntegerType, nullable = true),
StructField("data", IntegerType, nullable = true)))

val people = spark.createDataFrame(peopleRowsRDD, peopleSchema)

people.createOrReplaceTempView("people")

scala> people.show
+----+-----+----+
|name|group|data|
+----+-----+----+
|Fred|    8|   1|
|Fred|    8|   2|
|null|   10|   3|
|null|   10|   4|
| Amy|   12|   5|
| Amy|   12|   6|
+----+-----+----+

// Now create a derived table from that table. It doesn't matter much what.
val variantCounts = spark.sql("select name, count(distinct(name, group, data)) 
as variant_count from people group by name having variant_count > 1")

variantCounts.show
+----+-------------+
|name|variant_count|
+----+-------------+
|Fred|            2|
|null|            2|
| Amy|            2|
+----+-------------+

// Now try an inner join using the regular equalTo that drops nulls. This works 
fine.

val innerJoinEqualTo = variantCounts.join(people, 
variantCounts("name").equalTo(people("name")))
innerJoinEqualTo.show

+----+-------------+----+-----+----+
|name|variant_count|name|group|data|
+----+-------------+----+-----+----+
|Fred|            2|Fred|    8|   1|
|Fred|            2|Fred|    8|   2|
| Amy|            2| Amy|   12|   5|
| Amy|            2| Amy|   12|   6|
+----+-------------+----+-----+----+

// Okay now lets switch to the <=> operator
//
// If you haven't set spark.sql.crossJoin.enabled=true, you'll get an error like
// "Cartesian joins could be prohibitively expensive and are disabled by 
default. To explicitly enable them, please set spark.sql.crossJoin.enabled = 
true;"
//
// if you have enabled them, you'll get the table below.
//
// However, we really don't want or expect a Cartesian product!

val innerJoinSqlNullSafeEqOp = variantCounts.join(people, 
variantCounts("name")<=>(people("name")))
innerJoinSqlNullSafeEqOp.show

+----+-------------+----+-----+----+
|name|variant_count|name|group|data|
+----+-------------+----+-----+----+
|Fred|            2|Fred|    8|   1|
|Fred|            2|Fred|    8|   2|
|Fred|            2|null|   10|   3|
|Fred|            2|null|   10|   4|
|Fred|            2| Amy|   12|   5|
|Fred|            2| Amy|   12|   6|
|null|            2|Fred|    8|   1|
|null|            2|Fred|    8|   2|
|null|            2|null|   10|   3|
|null|            2|null|   10|   4|
|null|            2| Amy|   12|   5|
|null|            2| Amy|   12|   6|
| Amy|            2|Fred|    8|   1|
| Amy|            2|Fred|    8|   2|
| Amy|            2|null|   10|   3|
| Amy|            2|null|   10|   4|
| Amy|            2| Amy|   12|   5|
| Amy|            2| Amy|   12|   6|
+----+-------------+----+-----+----+

// Okay, let's try to construct the exact same variantCount table manually
// so it has no relationship to the original.

val variantCountRowsRDD = sc.parallelize(Seq(
Row("Fred", 2),
Row(null, 2),
Row("Amy", 2)))

val variantCountSchema = StructType(Seq(
StructField("name", StringType, nullable = true),
StructField("variant_count", IntegerType, nullable = true)))

val manualVariantCounts = spark.createDataFrame(variantCountRowsRDD, 
variantCountSchema)

// Now perform the same join with the null-safe equals operator. This works and 
gives us the expected non-Cartesian product result.

val manualVarCountsInnerJoinSqlNullSafeEqOp = manualVariantCounts.join(people, 
manualVariantCounts("name")<=>(people("name")))
manualVarCountsInnerJoinSqlNullSafeEqOp.show

+----+-------------+----+-----+----+
|name|variant_count|name|group|data|
+----+-------------+----+-----+----+
|Fred|            2|Fred|    8|   1|
|Fred|            2|Fred|    8|   2|
| Amy|            2| Amy|   12|   5|
| Amy|            2| Amy|   12|   6|
|null|            2|null|   10|   3|
|null|            2|null|   10|   4|
+----+-------------+----+-----+----+

[jira] [Commented] (SPARK-10816) EventTime based sessionization

2017-03-23 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939156#comment-15939156
 ] 

Michael Armbrust commented on SPARK-10816:
--

Just a quick note for people interested in this topic.  The more advanced API 
that lets you do arbitrary grouped stateful operations with timeouts based on 
processing time or event time has been merged into master.  See [SPARK-19067] 
for more details.
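
For readers looking for the concrete shape of that API, a hedged sketch using 
mapGroupsWithState with a processing-time timeout; the event schema and source 
below are made up, and SPARK-19067 has the authoritative details:

{code}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._

case class Event(user: String, value: Long)
case class RunningCount(user: String, count: Long)

// Illustrative streaming source; any Dataset[Event] read as a stream would do.
val events = spark.readStream.format("rate").load()
  .selectExpr("cast(value % 10 as string) as user", "value")
  .as[Event]

val counts = events
  .groupByKey(_.user)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout) {
    (user: String, batch: Iterator[Event], state: GroupState[Long]) =>
      val updated = state.getOption.getOrElse(0L) + batch.size
      state.update(updated)
      state.setTimeoutDuration("30 minutes")  // expire state for idle keys
      RunningCount(user, updated)
  }
{code}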

> EventTime based sessionization
> --
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit

2017-03-23 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939072#comment-15939072
 ] 

Andrew Ash commented on SPARK-19372:


I've seen this as well on parquet files.

> Code generation for Filter predicate including many OR conditions exceeds JVM 
> method size limit 
> 
>
> Key: SPARK-19372
> URL: https://issues.apache.org/jira/browse/SPARK-19372
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Jay Pranavamurthi
> Attachments: wide400cols.csv
>
>
> For the attached csv file, the code below causes the exception 
> "org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" 
> grows beyond 64 KB".
> Code:
> {code:borderStyle=solid}
>   val conf = new SparkConf().setMaster("local[1]")
>   val sqlContext = 
> SparkSession.builder().config(conf).getOrCreate().sqlContext
>   val dataframe =
> sqlContext
>   .read
>   .format("com.databricks.spark.csv")
>   .load("wide400cols.csv")
>   val filter = (0 to 399)
> .foldLeft(lit(false))((e, index) => 
> e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}"))
>   val filtered = dataframe.filter(filter)
>   filtered.show(100)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20074) Make buffer size in unsafe external sorter configurable

2017-03-23 Thread Sital Kedia (JIRA)
Sital Kedia created SPARK-20074:
---

 Summary: Make buffer size in unsafe external sorter configurable
 Key: SPARK-20074
 URL: https://issues.apache.org/jira/browse/SPARK-20074
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.0.1
Reporter: Sital Kedia


Currently, it is hardcoded to 32kb, see - 
https://github.com/sitalkedia/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L123
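
As a sketch of what the configuration hook could look like on the SparkConf 
side (the key name below is hypothetical, not an existing Spark setting):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Hypothetical key; falls back to the currently hardcoded 32k when unset.
val bufferSizeBytes =
  conf.getSizeAsBytes("spark.unsafe.sorter.spill.reader.buffer.size", "32k")
println(s"Using a spill reader buffer of $bufferSizeBytes bytes")
{code}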



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown

2017-03-23 Thread Sasaki Toru (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15938988#comment-15938988
 ] 

Sasaki Toru edited comment on SPARK-20050 at 3/23/17 6:45 PM:
--

Thank you for your comment, but I'm afraid I don't understand your advice, sorry.

My point is that some offsets passed to {{commitAsync}} are never committed to 
Kafka. I believe the callback function is only invoked once a commit to Kafka 
has completed (successfully or not), so in this case the callback will not be 
invoked at all.

If I am wrong, please correct me, thanks.


was (Author: sasakitoa):
Thank you for your comment, but I can't understand your advice, sorry.

I want to say some offset set in {{commitAsync}} will not commit to Kafka.
I think callback function will invoke when committed to Kafka completely 
(success or failed),
so I think this function will not be invoked in this case.

If I am wrong, please correct, thanks.

> Kafka 0.10 DirectStream doesn't commit last processed batch's offset when 
> graceful shutdown
> ---
>
> Key: SPARK-20050
> URL: https://issues.apache.org/jira/browse/SPARK-20050
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Sasaki Toru
>
> I use the Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' 
> and call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as 
> shown below
> {code}
> val kafkaStream = KafkaUtils.createDirectStream[String, String](...)
> kafkaStream.map { input =>
>   "key: " + input.key.toString + " value: " + input.value.toString + " 
> offset: " + input.offset.toString
>   }.foreachRDD { rdd =>
> rdd.foreach { input =>
> println(input)
>   }
> }
> kafkaStream.foreachRDD { rdd =>
>   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>   kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
> }
> {code}
> Some records that were processed in the last batch before a graceful shutdown 
> of Streaming are reprocessed in the first batch after Spark Streaming restarts.
> The cause may be that offsets passed to commitAsync are only committed at the 
> head of the next batch.
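
One way to observe which offsets actually reach Kafka, relevant to the callback 
discussion above, is the commitAsync overload that takes an OffsetCommitCallback. 
A hedged sketch, assuming the stream is set up as in the snippet above:

{code}
import org.apache.kafka.clients.consumer.{OffsetAndMetadata, OffsetCommitCallback}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

kafkaStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges,
    new OffsetCommitCallback {
      override def onComplete(offsets: java.util.Map[TopicPartition, OffsetAndMetadata],
                              exception: Exception): Unit = {
        // Logs which offsets Kafka acknowledged, or the failure, per commit attempt.
        if (exception != null) println(s"Offset commit failed: $exception")
        else println(s"Offsets committed: $offsets")
      }
    })
}
{code}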



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown

2017-03-23 Thread Sasaki Toru (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15938988#comment-15938988
 ] 

Sasaki Toru commented on SPARK-20050:
-

Thank you for your comment, but I'm afraid I don't understand your advice, sorry.

My point is that some offsets passed to {{commitAsync}} are never committed to 
Kafka. I believe the callback function is only invoked once a commit to Kafka 
has completed (successfully or not), so in this case the callback will not be 
invoked at all.

If I am wrong, please correct me, thanks.

> Kafka 0.10 DirectStream doesn't commit last processed batch's offset when 
> graceful shutdown
> ---
>
> Key: SPARK-20050
> URL: https://issues.apache.org/jira/browse/SPARK-20050
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Sasaki Toru
>
> I use the Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' 
> and call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as 
> shown below
> {code}
> val kafkaStream = KafkaUtils.createDirectStream[String, String](...)
> kafkaStream.map { input =>
>   "key: " + input.key.toString + " value: " + input.value.toString + " 
> offset: " + input.offset.toString
>   }.foreachRDD { rdd =>
> rdd.foreach { input =>
> println(input)
>   }
> }
> kafkaStream.foreachRDD { rdd =>
>   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>   kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
> }
> {code}
> Some records that were processed in the last batch before a graceful shutdown 
> of Streaming are reprocessed in the first batch after Spark Streaming restarts.
> The cause may be that offsets passed to commitAsync are only committed at the 
> head of the next batch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19791) Add doc and example for fpgrowth

2017-03-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19791:
--
Target Version/s: 2.2.0

> Add doc and example for fpgrowth
> 
>
> Key: SPARK-19791
> URL: https://issues.apache.org/jira/browse/SPARK-19791
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> Add a new section for fpm
> Add Example for FPGrowth in scala and Java.
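
A minimal sketch of the kind of Scala example the new section could carry; the 
API names are assumed from org.apache.spark.ml.fpm.FPGrowth as added for 2.2, 
and the parameter values are arbitrary:

{code}
import org.apache.spark.ml.fpm.FPGrowth

val dataset = spark.createDataFrame(Seq(
  "1 2 5",
  "1 2 3 5",
  "1 2"
).map(t => Tuple1(t.split(" ")))).toDF("items")

val model = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
  .fit(dataset)

model.freqItemsets.show()        // frequent itemsets with their counts
model.associationRules.show()    // rules whose confidence clears the threshold
model.transform(dataset).show()  // consequents of matching rules per row
{code}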



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19791) Add doc and example for fpgrowth

2017-03-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19791:
--
Shepherd: Joseph K. Bradley

> Add doc and example for fpgrowth
> 
>
> Key: SPARK-19791
> URL: https://issues.apache.org/jira/browse/SPARK-19791
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>
> Add a new section for fpm
> Add Example for FPGrowth in scala and Java.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19791) Add doc and example for fpgrowth

2017-03-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-19791:
-

Assignee: yuhao yang

> Add doc and example for fpgrowth
> 
>
> Key: SPARK-19791
> URL: https://issues.apache.org/jira/browse/SPARK-19791
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>
> Add a new section for fpm
> Add Example for FPGrowth in scala and Java.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20073) Unexpected Cartesian product when using eqNullSafe in join with a derived table

2017-03-23 Thread Everett Anderson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Everett Anderson updated SPARK-20073:
-
Labels: correctness  (was: )

> Unexpected Cartesian product when using eqNullSafe in join with a derived 
> table
> ---
>
> Key: SPARK-20073
> URL: https://issues.apache.org/jira/browse/SPARK-20073
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Everett Anderson
>  Labels: correctness
>
> It appears that if you try to join tables A and B, where B is derived from A, 
> and you use the eqNullSafe / <=> operator for the join condition, Spark 
> performs a Cartesian product.
> However, if you perform the join on two tables that contain the same data but 
> have no lineage relationship, the expected non-Cartesian-product join occurs.
> {noformat}
> // Create some fake data.
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.Dataset
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions
> val peopleRowsRDD = sc.parallelize(Seq(
> Row("Fred", 8, 1),
> Row("Fred", 8, 2),
> Row(null, 10, 3),
> Row(null, 10, 4),
> Row("Amy", 12, 5),
> Row("Amy", 12, 6)))
> 
> val peopleSchema = StructType(Seq(
> StructField("name", StringType, nullable = true),
> StructField("group", IntegerType, nullable = true),
> StructField("data", IntegerType, nullable = true)))
> 
> val people = spark.createDataFrame(peopleRowsRDD, peopleSchema)
> people.createOrReplaceTempView("people")
> scala> people.show
> ++-++
> |name|group|data|
> ++-++
> |Fred|8|   1|
> |Fred|8|   2|
> |null|   10|   3|
> |null|   10|   4|
> | Amy|   12|   5|
> | Amy|   12|   6|
> ++-++
> // Now create a derived table from that table. It doesn't matter much what.
> val variantCounts = spark.sql("select name, count(distinct(name, group, 
> data)) as variant_count from people group by name having variant_count > 1")
> variantCounts.show
> ++-+  
>   
> |name|variant_count|
> ++-+
> |Fred|2|
> |null|2|
> | Amy|2|
> ++-+
> // Now try an inner join using the regular equalTo that drops nulls. This 
> works fine.
> val innerJoinEqualTo = variantCounts.join(people, 
> variantCounts("name").equalTo(people("name")))
> innerJoinEqualTo.show
> ++-++-++  
>   
> |name|variant_count|name|group|data|
> ++-++-++
> |Fred|2|Fred|8|   1|
> |Fred|2|Fred|8|   2|
> | Amy|2| Amy|   12|   5|
> | Amy|2| Amy|   12|   6|
> ++-++-++
> // Okay now lets switch to the <=> operator
> //
> // If you haven't set spark.sql.crossJoin.enabled=true, you'll get an error 
> like
> // "Cartesian joins could be prohibitively expensive and are disabled by 
> default. To explicitly enable them, please set spark.sql.crossJoin.enabled = 
> true;"
> //
> // if you have enabled them, you'll get the table below.
> //
> // However, we really don't want or expect a Cartesian product!
> val innerJoinSqlNullSafeEqOp = variantCounts.join(people, 
> variantCounts("name")<=>(people("name")))
> innerJoinSqlNullSafeEqOp.show
> ++-++-++  
>   
> |name|variant_count|name|group|data|
> ++-++-++
> |Fred|2|Fred|8|   1|
> |Fred|2|Fred|8|   2|
> |Fred|2|null|   10|   3|
> |Fred|2|null|   10|   4|
> |Fred|2| Amy|   12|   5|
> |Fred|2| Amy|   12|   6|
> |null|2|Fred|8|   1|
> |null|2|Fred|8|   2|
> |null|2|null|   10|   3|
> |null|2|null|   10|   4|
> |null|2| Amy|   12|   5|
> |null|2| Amy|   12|   6|
> | Amy|2|Fred|8|   1|
> | Amy|2|Fred|8|   2|
> | Amy|2|null|   10|   3|
> | Amy|2|null|   10|   4|
> | Amy|2| Amy|   12|   5|
> | Amy|2| Amy|   12|   6|
> ++-++-++
> // Okay, let's try to construct the exact same variantCount table manually
> // so it has no relationship to the original.
> val variantCountRowsRDD = sc.parallelize(Seq(
> Row("Fred", 2),
> Row(null, 2),
> Row("Amy", 2)))
> 
> val variantCountSchema = StructType(Seq(
> StructField("name", StringType, nullable = true),
> StructField("variant_count", IntegerType, nullable = true)))
> 
> val manualVariantCounts = 

[jira] [Commented] (SPARK-15176) Job Scheduling Within Application Suffers from Priority Inversion

2017-03-23 Thread Travis Hegner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15938813#comment-15938813
 ] 

Travis Hegner commented on SPARK-15176:
---

I'd like to have this feature available as well for a slightly different use 
case.

Currently we have an application that reconciles data from a relational 
database. It would be ideal to place any job within this app that touches the 
RDB into a resource capped pool ("maxShare" was what I instinctively looked 
for), to prevent overloading the remote RDB. The app has many other stages of 
data manipulations which would benefit from utilizing a larger share of the 
cluster, without the risk of hurting the RDB.

Limiting the number of executors and cores per executor works to prevent 
overloading the RDB, but it also limits the entire application, causing it to 
run slower and leaving potentially idle resources on the table. Currently, the 
only other option I can think of would be to split the RDB pull into a separate 
app that persists the data to disk temporarily. However, that seems like it 
would make the overall process take even more time.

Is anyone still working on this issue? Should I look at porting the existing PR 
to the 2.1.x branch?
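
For what it's worth, a sketch of the current (pre-maxShare) way to confine the 
RDB-touching jobs to a dedicated fair-scheduler pool; the pool name is an 
assumption, and the pool itself must be defined in the fair scheduler 
allocation file:

{code}
// Jobs submitted from this thread go to the "rdb" pool, which can be given a
// small weight/minShare in fairscheduler.xml. This only shapes scheduling; it
// is not the hard maxShare cap discussed above.
sc.setLocalProperty("spark.scheduler.pool", "rdb")

// ... run the jobs that touch the relational database ...

// Revert to the default pool for the rest of the application's stages.
sc.setLocalProperty("spark.scheduler.pool", null)
{code}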

> Job Scheduling Within Application Suffers from Priority Inversion
> -
>
> Key: SPARK-15176
> URL: https://issues.apache.org/jira/browse/SPARK-15176
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.1
>Reporter: Nick White
>
> Say I have two pools, and N cores in my cluster:
> * I submit a job to one, which has M >> N tasks
> * N of the M tasks are scheduled
> * I submit a job to the second pool - but none of its tasks get scheduled 
> until a task from the other pool finishes!
> This can lead to unbounded denial-of-service for the second pool - regardless 
> of `minShare` or `weight` settings. Ideally Spark would support a pre-emption 
> mechanism, or an upper bound on a pool's resource usage.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20073) Unexpected Cartesian product when using eqNullSafe in join with a derived table

2017-03-23 Thread Everett Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15938810#comment-15938810
 ] 

Everett Anderson commented on SPARK-20073:
--

With the local master in spark-shell and cross joins enabled, here are the 
query plans for the cases above:

{noformat}
innerJoinSqlNullSafeEqOp.explain
== Physical Plan ==
CartesianProduct
:- *Filter (variant_count#11L > 1)
:  +- *HashAggregate(keys=[name#3], functions=[count(distinct 
named_struct(name, name#3, group, group#4, data, data#5)#172)])
: +- Exchange hashpartitioning(name#3, 200)
:+- *HashAggregate(keys=[name#3], functions=[partial_count(distinct 
named_struct(name, name#3, group, group#4, data, data#5)#172)])
:   +- *HashAggregate(keys=[name#3, named_struct(name, name#3, group, 
group#4, data, data#5)#172], functions=[])
:  +- Exchange hashpartitioning(name#3, named_struct(name, name#3, 
group, group#4, data, data#5)#172, 200)
: +- *HashAggregate(keys=[name#3, named_struct(name, name#3, 
group, group#4, data, data#5) AS named_struct(name, name#3, group, group#4, 
data, data#5)#172], functions=[])
:+- Scan ExistingRDD[name#3,group#4,data#5]
+- Scan ExistingRDD[name#103,group#104,data#105]
{noformat}

vs

{noformat}
manualVarCountsInnerJoinSqlNullSafeEqOp.explain
== Physical Plan ==
*SortMergeJoin [coalesce(name#139, )], [coalesce(name#3, )], Inner, (name#139 
<=> name#3)
:- *Sort [coalesce(name#139, ) ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(coalesce(name#139, ), 200)
: +- Scan ExistingRDD[name#139,variant_count#140]
+- *Sort [coalesce(name#3, ) ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(coalesce(name#3, ), 200)
  +- Scan ExistingRDD[name#3,group#4,data#5]
{noformat}
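
Based on the second plan above, which joins on coalesced keys plus the 
null-safe condition, a hedged workaround is to spell that rewrite out by hand; 
whether the planner then avoids the Cartesian product is not guaranteed:

{code}
import org.apache.spark.sql.functions.{coalesce, lit}

// Join on coalesced names for the equi-join part, keeping <=> so that null
// keys still only match null keys, mirroring the SortMergeJoin plan above.
val workaroundJoin = variantCounts.join(
  people,
  coalesce(variantCounts("name"), lit("")) === coalesce(people("name"), lit("")) &&
    (variantCounts("name") <=> people("name")))

workaroundJoin.explain()
workaroundJoin.show()
{code}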



> Unexpected Cartesian product when using eqNullSafe in join with a derived 
> table
> ---
>
> Key: SPARK-20073
> URL: https://issues.apache.org/jira/browse/SPARK-20073
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Everett Anderson
>
> It appears that if you try to join tables A and B, where B is derived from A, 
> and you use the eqNullSafe / <=> operator for the join condition, Spark 
> performs a Cartesian product.
> However, if you perform the join on two tables that contain the same data but 
> have no lineage relationship, the expected non-Cartesian-product join occurs.
> {noformat}
> // Create some fake data.
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.Dataset
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions
> val peopleRowsRDD = sc.parallelize(Seq(
> Row("Fred", 8, 1),
> Row("Fred", 8, 2),
> Row(null, 10, 3),
> Row(null, 10, 4),
> Row("Amy", 12, 5),
> Row("Amy", 12, 6)))
> 
> val peopleSchema = StructType(Seq(
> StructField("name", StringType, nullable = true),
> StructField("group", IntegerType, nullable = true),
> StructField("data", IntegerType, nullable = true)))
> 
> val people = spark.createDataFrame(peopleRowsRDD, peopleSchema)
> people.createOrReplaceTempView("people")
> scala> people.show
> ++-++
> |name|group|data|
> ++-++
> |Fred|8|   1|
> |Fred|8|   2|
> |null|   10|   3|
> |null|   10|   4|
> | Amy|   12|   5|
> | Amy|   12|   6|
> ++-++
> // Now create a derived table from that table. It doesn't matter much what.
> val variantCounts = spark.sql("select name, count(distinct(name, group, 
> data)) as variant_count from people group by name having variant_count > 1")
> variantCounts.show
> ++-+  
>   
> |name|variant_count|
> ++-+
> |Fred|2|
> |null|2|
> | Amy|2|
> ++-+
> // Now try an inner join using the regular equalTo that drops nulls. This 
> works fine.
> val innerJoinEqualTo = variantCounts.join(people, 
> variantCounts("name").equalTo(people("name")))
> innerJoinEqualTo.show
> ++-++-++  
>   
> |name|variant_count|name|group|data|
> ++-++-++
> |Fred|2|Fred|8|   1|
> |Fred|2|Fred|8|   2|
> | Amy|2| Amy|   12|   5|
> | Amy|2| Amy|   12|   6|
> ++-++-++
> // Okay now lets switch to the <=> operator
> //
> // If you haven't set spark.sql.crossJoin.enabled=true, you'll get an error 
> like
> // "Cartesian joins could be prohibitively expensive and are disabled by 
> default. To explicitly enable them, please set spark.sql.crossJoin.enabled = 
> true;"
> //
> // if you have enabled them, you'll get the table below.

[jira] [Created] (SPARK-20073) Unexpected Cartesian product when using eqNullSafe in join with a derived table

2017-03-23 Thread Everett Anderson (JIRA)
Everett Anderson created SPARK-20073:


 Summary: Unexpected Cartesian product when using eqNullSafe in 
join with a derived table
 Key: SPARK-20073
 URL: https://issues.apache.org/jira/browse/SPARK-20073
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 2.1.0, 2.0.2
Reporter: Everett Anderson


It appears that if you try to join tables A and B, where B is derived from A, 
and you use the eqNullSafe / <=> operator for the join condition, Spark 
performs a Cartesian product.

However, if you perform the join on two tables that contain the same data but 
have no lineage relationship, the expected non-Cartesian-product join occurs.

{noformat}
// Create some fake data.

import org.apache.spark.sql.Row
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions

val peopleRowsRDD = sc.parallelize(Seq(
Row("Fred", 8, 1),
Row("Fred", 8, 2),
Row(null, 10, 3),
Row(null, 10, 4),
Row("Amy", 12, 5),
Row("Amy", 12, 6)))

val peopleSchema = StructType(Seq(
StructField("name", StringType, nullable = true),
StructField("group", IntegerType, nullable = true),
StructField("data", IntegerType, nullable = true)))

val people = spark.createDataFrame(peopleRowsRDD, peopleSchema)

people.createOrReplaceTempView("people")

scala> people.show
++-++
|name|group|data|
++-++
|Fred|8|   1|
|Fred|8|   2|
|null|   10|   3|
|null|   10|   4|
| Amy|   12|   5|
| Amy|   12|   6|
++-++

// Now create a derived table from that table. It doesn't matter much what.
val variantCounts = spark.sql("select name, count(distinct(name, group, data)) 
as variant_count from people group by name having variant_count > 1")

variantCounts.show
++-+
|name|variant_count|
++-+
|Fred|2|
|null|2|
| Amy|2|
++-+

// Now try an inner join using the regular equalTo that drops nulls. This works 
fine.

val innerJoinEqualTo = variantCounts.join(people, 
variantCounts("name").equalTo(people("name")))
innerJoinEqualTo.show

++-++-++
|name|variant_count|name|group|data|
++-++-++
|Fred|2|Fred|8|   1|
|Fred|2|Fred|8|   2|
| Amy|2| Amy|   12|   5|
| Amy|2| Amy|   12|   6|
++-++-++

// Okay now lets switch to the <=> operator
//
// If you haven't set spark.sql.crossJoin.enabled=true, you'll get an error like
// "Cartesian joins could be prohibitively expensive and are disabled by 
default. To explicitly enable them, please set spark.sql.crossJoin.enabled = 
true;"
//
// if you have enabled them, you'll get the table below.
//
// However, we really don't want or expect a Cartesian product!

val innerJoinSqlNullSafeEqOp = variantCounts.join(people, 
variantCounts("name")<=>(people("name")))
innerJoinSqlNullSafeEqOp.show

++-++-++
|name|variant_count|name|group|data|
++-++-++
|Fred|2|Fred|8|   1|
|Fred|2|Fred|8|   2|
|Fred|2|null|   10|   3|
|Fred|2|null|   10|   4|
|Fred|2| Amy|   12|   5|
|Fred|2| Amy|   12|   6|
|null|2|Fred|8|   1|
|null|2|Fred|8|   2|
|null|2|null|   10|   3|
|null|2|null|   10|   4|
|null|2| Amy|   12|   5|
|null|2| Amy|   12|   6|
| Amy|2|Fred|8|   1|
| Amy|2|Fred|8|   2|
| Amy|2|null|   10|   3|
| Amy|2|null|   10|   4|
| Amy|2| Amy|   12|   5|
| Amy|2| Amy|   12|   6|
++-++-++

// Okay, let's try to construct the exact same variantCount table manually
// so it has no relationship to the original.

val variantCountRowsRDD = sc.parallelize(Seq(
Row("Fred", 2),
Row(null, 2),
Row("Amy", 2)))

val variantCountSchema = StructType(Seq(
StructField("name", StringType, nullable = true),
StructField("variant_count", IntegerType, nullable = true)))

val manualVariantCounts = spark.createDataFrame(variantCountRowsRDD, 
variantCountSchema)

// Now perform the same join with the null-safe equals operator.
val manualVarCountsInnerJoinSqlNullSafeEqOp = manualVariantCounts.join(people, 
manualVariantCounts("name")<=>(people("name")))
manualVarCountsInnerJoinSqlNullSafeEqOp.show

++-++-++
|name|variant_count|name|group|data|
++-++-++
|Fred|2|Fred|8|   1|
|Fred|2|Fred|8|   2|
| Amy|2| 

[jira] [Resolved] (SPARK-20066) Add explicit SecurityManager(SparkConf) constructor for backwards compatibility with Java

2017-03-23 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover resolved SPARK-20066.
-
Resolution: Won't Fix

> Add explicit SecurityManager(SparkConf) constructor for backwards 
> compatibility with Java
> -
>
> Key: SPARK-20066
> URL: https://issues.apache.org/jira/browse/SPARK-20066
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Mark Grover
>
> SPARK-19520 added an optional argument (ioEncryptionKey) to the 
> SecurityManager class. And it has a default value, so life is great.
> However, that's not enough when invoking the class from Java. We didn't see 
> this before because the SecurityManager class is private to the spark package 
> and all the code that uses it is Scala.
> However, I have some code that was extending it, in Java, and that code 
> breaks because Java can't access that default value (more details 
> [here|http://stackoverflow.com/questions/13059528/instantiate-a-scala-class-from-java-and-use-the-default-parameters-of-the-const]).
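
A generic illustration of the Scala default-argument gap being described, with 
made-up class names rather than Spark's actual SecurityManager:

{code}
// Scala class with a defaulted second parameter.
class Widget(conf: String, ioEncryptionKey: Option[Array[Byte]] = None)

object FromScala {
  val w = new Widget("conf")  // compiles: scalac fills in the default
}

// From Java, only the two-argument constructor is visible, so
//   new Widget("conf");
// does not compile unless an explicit one-argument constructor (or another
// Java-friendly overload) is added, which is what this ticket proposed.
{code}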



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19591) Add sample weights to decision trees

2017-03-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19591:
--
Shepherd: Joseph K. Bradley

> Add sample weights to decision trees
> 
>
> Key: SPARK-19591
> URL: https://issues.apache.org/jira/browse/SPARK-19591
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Seth Hendrickson
>
> Add sample weights to decision trees



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19591) Add sample weights to decision trees

2017-03-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-19591:
-

Assignee: Seth Hendrickson

> Add sample weights to decision trees
> 
>
> Key: SPARK-19591
> URL: https://issues.apache.org/jira/browse/SPARK-19591
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>
> Add sample weights to decision trees



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19981) Sort-Merge join inserts shuffles when joining dataframes with aliased columns

2017-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19981:


Assignee: Apache Spark

> Sort-Merge join inserts shuffles when joining dataframes with aliased columns
> -
>
> Key: SPARK-19981
> URL: https://issues.apache.org/jira/browse/SPARK-19981
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Allen George
>Assignee: Apache Spark
>
> Performing a sort-merge join with two dataframes - each of which has the join 
> column aliased - causes Spark to insert an unnecessary shuffle.
> Consider the scala test code below, which should be equivalent to the 
> following SQL.
> {code:SQL}
> SELECT * FROM
>   (SELECT number AS aliased from df1) t1
> LEFT JOIN
>   (SELECT number AS aliased from df2) t2
> ON t1.aliased = t2.aliased
> {code}
> {code:scala}
> private case class OneItem(number: Long)
> private case class TwoItem(number: Long, value: String)
> test("join with aliases should not trigger shuffle") {
>   val df1 = sqlContext.createDataFrame(
> Seq(
>   OneItem(0),
>   OneItem(2),
>   OneItem(4)
> )
>   )
>   val partitionedDf1 = df1.repartition(10, col("number"))
>   partitionedDf1.createOrReplaceTempView("df1")
>   partitionedDf1.cache(); partitionedDf1.count()
>   
>   val df2 = sqlContext.createDataFrame(
> Seq(
>   TwoItem(0, "zero"),
>   TwoItem(2, "two"),
>   TwoItem(4, "four")
> )
>   )
>   val partitionedDf2 = df2.repartition(10, col("number"))
>   partitionedDf2.createOrReplaceTempView("df2")
>   partitionedDf2.cache(); partitionedDf2.count()
>   
>   val fromDf1 = sqlContext.sql("SELECT number from df1")
>   val fromDf2 = sqlContext.sql("SELECT number from df2")
>   val aliasedDf1 = fromDf1.select(col(fromDf1.columns.head) as "aliased")
>   val aliasedDf2 = fromDf2.select(col(fromDf2.columns.head) as "aliased")
>   aliasedDf1.join(aliasedDf2, Seq("aliased"), "left_outer") }
> {code}
> Both the SQL and the Scala code generate a query-plan where an extra exchange 
> is inserted before performing the sort-merge join. This exchange changes the 
> partitioning from {{HashPartitioning("number", 10)}} for each frame being 
> joined into {{HashPartitioning("aliased", 5)}}. I would have expected that 
> since it's a simple column aliasing, and both frames have exactly the same 
> partitioning that the initial frames.
> {noformat} 
> *Project [args=[aliased#267L]][outPart=PartitioningCollection(5, 
> hashpartitioning(aliased#267L, 5)%NONNULL,hashpartitioning(aliased#270L, 
> 5)%NONNULL)][outOrder=List(aliased#267L 
> ASC%NONNULL)][output=List(aliased#267:bigint%NONNULL)]
> +- *SortMergeJoin [args=[aliased#267L], [aliased#270L], 
> Inner][outPart=PartitioningCollection(5, hashpartitioning(aliased#267L, 
> 5)%NONNULL,hashpartitioning(aliased#270L, 
> 5)%NONNULL)][outOrder=List(aliased#267L 
> ASC%NONNULL)][output=ArrayBuffer(aliased#267:bigint%NONNULL, 
> aliased#270:bigint%NONNULL)]
>:- *Sort [args=[aliased#267L ASC], false, 0][outPart=HashPartitioning(5, 
> aliased#267:bigint%NONNULL)][outOrder=List(aliased#267L 
> ASC%NONNULL)][output=ArrayBuffer(aliased#267:bigint%NONNULL)]
>:  +- Exchange [args=hashpartitioning(aliased#267L, 
> 5)%NONNULL][outPart=HashPartitioning(5, 
> aliased#267:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(aliased#267:bigint%NONNULL)]
>: +- *Project [args=[number#198L AS 
> aliased#267L]][outPart=HashPartitioning(10, 
> number#198:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(aliased#267:bigint%NONNULL)]
>:+- InMemoryTableScan 
> [args=[number#198L]][outPart=HashPartitioning(10, 
> number#198:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(number#198:bigint%NONNULL)]
>:   :  +- InMemoryRelation [number#198L], true, 1, 
> StorageLevel(disk, memory, deserialized, 1 replicas), 
> false[Statistics(24,false)][output=List(number#198:bigint%NONNULL)]
>:   : :  +- Exchange [args=hashpartitioning(number#198L, 
> 10)%NONNULL][outPart=HashPartitioning(10, 
> number#198:bigint%NONNULL)][outOrder=List()][output=List(number#198:bigint%NONNULL)]
>:   : : +- LocalTableScan 
> [args=[number#198L]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(number#198:bigint%NONNULL)]
>+- *Sort [args=[aliased#270L ASC], false, 0][outPart=HashPartitioning(5, 
> aliased#270:bigint%NONNULL)][outOrder=List(aliased#270L 
> ASC%NONNULL)][output=ArrayBuffer(aliased#270:bigint%NONNULL)]
>   +- Exchange [args=hashpartitioning(aliased#270L, 
> 5)%NONNULL][outPart=HashPartitioning(5, 
> aliased#270:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(aliased#270:bigint%NONNULL)]
>  +- *Project [args=[number#223L AS 

[jira] [Commented] (SPARK-19981) Sort-Merge join inserts shuffles when joining dataframes with aliased columns

2017-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15938718#comment-15938718
 ] 

Apache Spark commented on SPARK-19981:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17400

> Sort-Merge join inserts shuffles when joining dataframes with aliased columns
> -
>
> Key: SPARK-19981
> URL: https://issues.apache.org/jira/browse/SPARK-19981
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Allen George
>
> Performing a sort-merge join with two dataframes - each of which has the join 
> column aliased - causes Spark to insert an unnecessary shuffle.
> Consider the scala test code below, which should be equivalent to the 
> following SQL.
> {code:SQL}
> SELECT * FROM
>   (SELECT number AS aliased from df1) t1
> LEFT JOIN
>   (SELECT number AS aliased from df2) t2
> ON t1.aliased = t2.aliased
> {code}
> {code:scala}
> private case class OneItem(number: Long)
> private case class TwoItem(number: Long, value: String)
> test("join with aliases should not trigger shuffle") {
>   val df1 = sqlContext.createDataFrame(
> Seq(
>   OneItem(0),
>   OneItem(2),
>   OneItem(4)
> )
>   )
>   val partitionedDf1 = df1.repartition(10, col("number"))
>   partitionedDf1.createOrReplaceTempView("df1")
>   partitionedDf1.cache(); partitionedDf1.count()
>   
>   val df2 = sqlContext.createDataFrame(
> Seq(
>   TwoItem(0, "zero"),
>   TwoItem(2, "two"),
>   TwoItem(4, "four")
> )
>   )
>   val partitionedDf2 = df2.repartition(10, col("number"))
>   partitionedDf2.createOrReplaceTempView("df2")
>   partitionedDf2.cache(); partitionedDf2.count()
>   
>   val fromDf1 = sqlContext.sql("SELECT number from df1")
>   val fromDf2 = sqlContext.sql("SELECT number from df2")
>   val aliasedDf1 = fromDf1.select(col(fromDf1.columns.head) as "aliased")
>   val aliasedDf2 = fromDf2.select(col(fromDf2.columns.head) as "aliased")
>   aliasedDf1.join(aliasedDf2, Seq("aliased"), "left_outer") }
> {code}
> Both the SQL and the Scala code generate a query-plan where an extra exchange 
> is inserted before performing the sort-merge join. This exchange changes the 
> partitioning from {{HashPartitioning("number", 10)}} for each frame being 
> joined into {{HashPartitioning("aliased", 5)}}. I would have expected that 
> since it's a simple column aliasing, and both frames have exactly the same 
> partitioning that the initial frames.
> {noformat} 
> *Project [args=[aliased#267L]][outPart=PartitioningCollection(5, 
> hashpartitioning(aliased#267L, 5)%NONNULL,hashpartitioning(aliased#270L, 
> 5)%NONNULL)][outOrder=List(aliased#267L 
> ASC%NONNULL)][output=List(aliased#267:bigint%NONNULL)]
> +- *SortMergeJoin [args=[aliased#267L], [aliased#270L], 
> Inner][outPart=PartitioningCollection(5, hashpartitioning(aliased#267L, 
> 5)%NONNULL,hashpartitioning(aliased#270L, 
> 5)%NONNULL)][outOrder=List(aliased#267L 
> ASC%NONNULL)][output=ArrayBuffer(aliased#267:bigint%NONNULL, 
> aliased#270:bigint%NONNULL)]
>:- *Sort [args=[aliased#267L ASC], false, 0][outPart=HashPartitioning(5, 
> aliased#267:bigint%NONNULL)][outOrder=List(aliased#267L 
> ASC%NONNULL)][output=ArrayBuffer(aliased#267:bigint%NONNULL)]
>:  +- Exchange [args=hashpartitioning(aliased#267L, 
> 5)%NONNULL][outPart=HashPartitioning(5, 
> aliased#267:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(aliased#267:bigint%NONNULL)]
>: +- *Project [args=[number#198L AS 
> aliased#267L]][outPart=HashPartitioning(10, 
> number#198:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(aliased#267:bigint%NONNULL)]
>:+- InMemoryTableScan 
> [args=[number#198L]][outPart=HashPartitioning(10, 
> number#198:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(number#198:bigint%NONNULL)]
>:   :  +- InMemoryRelation [number#198L], true, 1, 
> StorageLevel(disk, memory, deserialized, 1 replicas), 
> false[Statistics(24,false)][output=List(number#198:bigint%NONNULL)]
>:   : :  +- Exchange [args=hashpartitioning(number#198L, 
> 10)%NONNULL][outPart=HashPartitioning(10, 
> number#198:bigint%NONNULL)][outOrder=List()][output=List(number#198:bigint%NONNULL)]
>:   : : +- LocalTableScan 
> [args=[number#198L]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(number#198:bigint%NONNULL)]
>+- *Sort [args=[aliased#270L ASC], false, 0][outPart=HashPartitioning(5, 
> aliased#270:bigint%NONNULL)][outOrder=List(aliased#270L 
> ASC%NONNULL)][output=ArrayBuffer(aliased#270:bigint%NONNULL)]
>   +- Exchange [args=hashpartitioning(aliased#270L, 
> 5)%NONNULL][outPart=HashPartitioning(5, 
> 

[jira] [Assigned] (SPARK-19981) Sort-Merge join inserts shuffles when joining dataframes with aliased columns

2017-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19981:


Assignee: (was: Apache Spark)

> Sort-Merge join inserts shuffles when joining dataframes with aliased columns
> -
>
> Key: SPARK-19981
> URL: https://issues.apache.org/jira/browse/SPARK-19981
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Allen George
>
> Performing a sort-merge join with two dataframes - each of which has the join 
> column aliased - causes Spark to insert an unnecessary shuffle.
> Consider the scala test code below, which should be equivalent to the 
> following SQL.
> {code:SQL}
> SELECT * FROM
>   (SELECT number AS aliased from df1) t1
> LEFT JOIN
>   (SELECT number AS aliased from df2) t2
> ON t1.aliased = t2.aliased
> {code}
> {code:scala}
> private case class OneItem(number: Long)
> private case class TwoItem(number: Long, value: String)
> test("join with aliases should not trigger shuffle") {
>   val df1 = sqlContext.createDataFrame(
> Seq(
>   OneItem(0),
>   OneItem(2),
>   OneItem(4)
> )
>   )
>   val partitionedDf1 = df1.repartition(10, col("number"))
>   partitionedDf1.createOrReplaceTempView("df1")
>   partitionedDf1.cache(); partitionedDf1.count()
>   
>   val df2 = sqlContext.createDataFrame(
> Seq(
>   TwoItem(0, "zero"),
>   TwoItem(2, "two"),
>   TwoItem(4, "four")
> )
>   )
>   val partitionedDf2 = df2.repartition(10, col("number"))
>   partitionedDf2.createOrReplaceTempView("df2")
>   partitionedDf2.cache(); partitionedDf2.count()
>   
>   val fromDf1 = sqlContext.sql("SELECT number from df1")
>   val fromDf2 = sqlContext.sql("SELECT number from df2")
>   val aliasedDf1 = fromDf1.select(col(fromDf1.columns.head) as "aliased")
>   val aliasedDf2 = fromDf2.select(col(fromDf2.columns.head) as "aliased")
>   aliasedDf1.join(aliasedDf2, Seq("aliased"), "left_outer") }
> {code}
> Both the SQL and the Scala code generate a query-plan where an extra exchange 
> is inserted before performing the sort-merge join. This exchange changes the 
> partitioning from {{HashPartitioning("number", 10)}} for each frame being 
> joined into {{HashPartitioning("aliased", 5)}}. I would have expected that 
> since it's a simple column aliasing, and both frames have exactly the same 
> partitioning that the initial frames.
> {noformat} 
> *Project [args=[aliased#267L]][outPart=PartitioningCollection(5, 
> hashpartitioning(aliased#267L, 5)%NONNULL,hashpartitioning(aliased#270L, 
> 5)%NONNULL)][outOrder=List(aliased#267L 
> ASC%NONNULL)][output=List(aliased#267:bigint%NONNULL)]
> +- *SortMergeJoin [args=[aliased#267L], [aliased#270L], 
> Inner][outPart=PartitioningCollection(5, hashpartitioning(aliased#267L, 
> 5)%NONNULL,hashpartitioning(aliased#270L, 
> 5)%NONNULL)][outOrder=List(aliased#267L 
> ASC%NONNULL)][output=ArrayBuffer(aliased#267:bigint%NONNULL, 
> aliased#270:bigint%NONNULL)]
>:- *Sort [args=[aliased#267L ASC], false, 0][outPart=HashPartitioning(5, 
> aliased#267:bigint%NONNULL)][outOrder=List(aliased#267L 
> ASC%NONNULL)][output=ArrayBuffer(aliased#267:bigint%NONNULL)]
>:  +- Exchange [args=hashpartitioning(aliased#267L, 
> 5)%NONNULL][outPart=HashPartitioning(5, 
> aliased#267:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(aliased#267:bigint%NONNULL)]
>: +- *Project [args=[number#198L AS 
> aliased#267L]][outPart=HashPartitioning(10, 
> number#198:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(aliased#267:bigint%NONNULL)]
>:+- InMemoryTableScan 
> [args=[number#198L]][outPart=HashPartitioning(10, 
> number#198:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(number#198:bigint%NONNULL)]
>:   :  +- InMemoryRelation [number#198L], true, 1, 
> StorageLevel(disk, memory, deserialized, 1 replicas), 
> false[Statistics(24,false)][output=List(number#198:bigint%NONNULL)]
>:   : :  +- Exchange [args=hashpartitioning(number#198L, 
> 10)%NONNULL][outPart=HashPartitioning(10, 
> number#198:bigint%NONNULL)][outOrder=List()][output=List(number#198:bigint%NONNULL)]
>:   : : +- LocalTableScan 
> [args=[number#198L]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(number#198:bigint%NONNULL)]
>+- *Sort [args=[aliased#270L ASC], false, 0][outPart=HashPartitioning(5, 
> aliased#270:bigint%NONNULL)][outOrder=List(aliased#270L 
> ASC%NONNULL)][output=ArrayBuffer(aliased#270:bigint%NONNULL)]
>   +- Exchange [args=hashpartitioning(aliased#270L, 
> 5)%NONNULL][outPart=HashPartitioning(5, 
> aliased#270:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(aliased#270:bigint%NONNULL)]
>  +- *Project [args=[number#223L AS 
> 
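For anyone following along, here is a minimal sketch of how the extra shuffle can be 
checked programmatically from the repro above. It assumes the same 
{{aliasedDf1}}/{{aliasedDf2}} as in the test, and it matches physical-plan nodes by 
name rather than by class, since the internal class names differ across Spark versions.

{code:scala}
// Build the join from the repro and inspect its physical plan.
val joined = aliasedDf1.join(aliasedDf2, Seq("aliased"), "left_outer")
joined.explain()  // the report above shows Exchange nodes appearing under SortMergeJoin

// Count shuffle exchanges by node name to avoid depending on internal class names.
val numExchanges = joined.queryExecution.executedPlan.collect {
  case plan if plan.nodeName.contains("Exchange") => plan
}.size
println(s"Exchange nodes in the physical plan: $numExchanges")
{code}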

[jira] [Commented] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-23 Thread Shubham Chopra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15938676#comment-15938676
 ] 

Shubham Chopra commented on SPARK-19803:


Any feedback on the PR - https://github.com/apache/spark/pull/17325 ? 

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Shubham Chopra
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20037) impossible to set kafka offsets using kafka 0.10 and spark 2.0.0

2017-03-23 Thread Daniel Nuriyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Nuriyev updated SPARK-20037:
---
Attachment: offsets.png

> impossible to set kafka offsets using kafka 0.10 and spark 2.0.0
> 
>
> Key: SPARK-20037
> URL: https://issues.apache.org/jira/browse/SPARK-20037
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Daniel Nuriyev
> Attachments: offsets.png
>
>
> I use kafka 0.10.1 and java code with the following dependencies:
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka_2.11</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka-clients</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> The code tries to read a topic starting from specific offsets. 
> The topic has 4 partitions that start somewhere before 585000 and end after 
> 674000. So I wanted to read all partitions starting with 585000
> fromOffsets.put(new TopicPartition(topic, 0), 585000L);
> fromOffsets.put(new TopicPartition(topic, 1), 585000L);
> fromOffsets.put(new TopicPartition(topic, 2), 585000L);
> fromOffsets.put(new TopicPartition(topic, 3), 585000L);
> Using 5 second batches:
> jssc = new JavaStreamingContext(conf, Durations.seconds(5));
> The code immediately throws:
> Beginning offset 585000 is after the ending offset 584464 for topic 
> commerce_item_expectation partition 1
> It does not make sense, because 584464 is where this topic/partition starts, not where it ends.
> I use this as a base: 
> https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> But I use direct stream:
> KafkaUtils.createDirectStream(jssc,LocationStrategies.PreferConsistent(),
> ConsumerStrategies.Subscribe(
> topics, kafkaParams, fromOffsets
> )
> )
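One way to cross-check what the broker actually reports for each partition before 
building {{fromOffsets}} is sketched below. It assumes a Kafka 0.10.1+ client on the 
classpath; the broker address and topic name are placeholders taken from the report.

{code:scala}
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")  // placeholder broker address
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
val partitions = (0 until 4).map(p => new TopicPartition("commerce_item_expectation", p)).asJava
val beginnings = consumer.beginningOffsets(partitions).asScala
val ends = consumer.endOffsets(partitions).asScala
partitions.asScala.foreach { tp =>
  println(s"$tp: begins at ${beginnings(tp)}, ends at ${ends(tp)}")
}
consumer.close()
{code}

This makes the begin/end offsets visible on the client side, so a requested offset such 
as 585000 can be compared directly against what the broker returns.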



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20037) impossible to set kafka offsets using kafka 0.10 and spark 2.0.0

2017-03-23 Thread Daniel Nuriyev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15938600#comment-15938600
 ] 

Daniel Nuriyev commented on SPARK-20037:


This is an exception from partition 1 of another topic:
Beginning offset 290 is after the ending offset 2806790 for topic 
cimba_raw_inbox partition 1.
I am attaching a screenshot from Kafka Tool that shows offsets of that 
topic/partition

> impossible to set kafka offsets using kafka 0.10 and spark 2.0.0
> 
>
> Key: SPARK-20037
> URL: https://issues.apache.org/jira/browse/SPARK-20037
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Daniel Nuriyev
> Attachments: offsets.png
>
>
> I use kafka 0.10.1 and java code with the following dependencies:
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka_2.11</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka-clients</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> The code tries to read a topic starting from specific offsets. 
> The topic has 4 partitions that start somewhere before 585000 and end after 
> 674000. So I wanted to read all partitions starting with 585000
> fromOffsets.put(new TopicPartition(topic, 0), 585000L);
> fromOffsets.put(new TopicPartition(topic, 1), 585000L);
> fromOffsets.put(new TopicPartition(topic, 2), 585000L);
> fromOffsets.put(new TopicPartition(topic, 3), 585000L);
> Using 5 second batches:
> jssc = new JavaStreamingContext(conf, Durations.seconds(5));
> The code immediately throws:
> Beginning offset 585000 is after the ending offset 584464 for topic 
> commerce_item_expectation partition 1
> It does not make sense, because 584464 is where this topic/partition starts, not where it ends.
> I use this as a base: 
> https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> But I use direct stream:
> KafkaUtils.createDirectStream(jssc,LocationStrategies.PreferConsistent(),
> ConsumerStrategies.Subscribe(
> topics, kafkaParams, fromOffsets
> )
> )



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20072) Clarify ALS-WR documentation

2017-03-23 Thread chris snow (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15938477#comment-15938477
 ] 

chris snow commented on SPARK-20072:


Fair enough.  Though this did cause me some grief - I had read and re-read that 
paragraph a number of times before posting an email to the user group to verify 
if my understanding was correct.

I also appreciate that my suggested rewording could probably be improved a lot.

I fully understand if this ticket should be closed because it doesn't merit the 
effort of processing it.

> Clarify ALS-WR documentation
> 
>
> Key: SPARK-20072
> URL: https://issues.apache.org/jira/browse/SPARK-20072
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: chris snow
>Priority: Trivial
>
> https://www.mail-archive.com/user@spark.apache.org/msg62590.html
> The documentation for collaborative filtering is as follows:
> ===
> Scaling of the regularization parameter
> Since v1.1, we scale the regularization parameter lambda in solving
> each least squares problem by the number of ratings the user generated
> in updating user factors, or the number of ratings the product
> received in updating product factors.
> ===
> I find this description confusing, probably because I lack a detailed
> understanding of ALS. The wording suggests that the number of ratings
> changes ("generated", "received") while solving the least squares.
> This is how I think I should be interpreting the description:
> ===
> Since v1.1, we scale the regularization parameter lambda when solving
> each least squares problem.  When updating the user factors, we scale
> the regularization parameter by the total number of ratings from the
> user.  Similarly, when updating the product factors, we scale the
> regularization parameter by the total number of ratings for the
> product.
> ===



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20071) StringIndexer overflows Kryo serialization buffer when run on column with many long distinct values

2017-03-23 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15938475#comment-15938475
 ] 

Barry Becker commented on SPARK-20071:
--

Yes. I agree. I wanted to report the issue, but wasn't sure if it should be a 
bug or enhancement request. The main problem for me was that it took a fair bit 
of debugging to discover what the problem was. It might be nice to provide a 
warning or more info in the exception. I will find a way to work around it. 

> StringIndexer overflows Kryo serialization buffer when run on column with 
> many long distinct values
> ---
>
> Key: SPARK-20071
> URL: https://issues.apache.org/jira/browse/SPARK-20071
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>Priority: Minor
>
> I marked this as minor because there are workarounds.
> I have a 2 million row dataset with a string column that is mostly unique and 
> contains many very long values.
> Most of the values are between 1,000 and 40,000 characters long.
> I am using the KryoSerializer and increased spark.kryoserializer.buffer.max 
> to 256m. 
> If I try to run StringIndexer.fit on this column, I will get an OutOfMemory 
> exception or more likely a Buffer overflow error like 
> {code}
> org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. 
> Available: 0, required: 23. 
> To avoid this, increase spark.kryoserializer.buffer.max 
> value.org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315)
>  
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324) 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {code}
> This result is not that surprising given that we are trying to index a column 
> like this; however, I can think of some suggestions that would help avoid the 
> error and maybe help performance.
> These possible enhancements to StringIndexer might be hard, but I thought I 
> would suggest them anyway, just in case they are not.
> 1) Add param for Top N values. I know that StringIndexer gives lower indices 
> to the more commonly occurring values. It would be great if one could specify 
> that I only want to index the top N values and lump everything else into a 
> special "Other" value.
> 2) Add param for label length limit. Only consider the first L characters of 
> labels when doing the indexing.
> Either of these enhancements would work, but I suppose they can also be 
> implemented with additional work as steps preceding the indexer in the 
> pipeline. Perhaps topByKey could be used to replace the column with one that 
> has the top values and "Other" as suggested in 1).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20068) Twenty-two column coalesce has poor performance when codegen is enabled

2017-03-23 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15938471#comment-15938471
 ] 

Takeshi Yamamuro commented on SPARK-20068:
--

Have you tried v2.1? Does the latest release also have the same performance pitfall?
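If it helps the comparison, here is a rough sketch of an A/B run with codegen toggled; 
{{query22}} stands for the 22-column SELECT below, and the property name should be 
treated as an assumption and checked against SQLConf for the Spark version in use 
({{spark.sql.codegen}} on the 1.x line, {{spark.sql.codegen.wholeStage}} on 2.x).

{code:scala}
// Hypothetical A/B comparison: time the wide query with codegen off and on.
def timeIt(label: String)(body: => Unit): Unit = {
  val start = System.nanoTime()
  body
  println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
}

sqlContext.setConf("spark.sql.codegen", "false")   // property name assumed for 1.5.x
timeIt("22 columns, codegen off") { sqlContext.sql(query22).collect() }

sqlContext.setConf("spark.sql.codegen", "true")
timeIt("22 columns, codegen on") { sqlContext.sql(query22).collect() }
{code}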

> Twenty-two column coalesce has poor performance when codegen is enabled
> 
>
> Key: SPARK-20068
> URL: https://issues.apache.org/jira/browse/SPARK-20068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: QQShu1
>
> When the COALESCE query has twenty-one columns, the SELECT performance is good, but 
> with twenty-two columns the SELECT performance is poor. 
> The 22-column COALESCE SELECT command is: 
> SELECT  
> SUM(COALESCE(col1,0)) AS CUSTOM1, 
> SUM(COALESCE(col2,0)) AS CUSTOM2, 
> SUM(COALESCE(col3,0)) AS CUSTOM3, 
> SUM(COALESCE(col4,0)) AS CUSTOM4, 
> SUM(COALESCE(col5,0)) AS CUSTOM5, 
> SUM(COALESCE(col6,0)) AS CUSTOM6, 
> SUM(COALESCE(col7,0)) AS CUSTOM7, 
> SUM(COALESCE(col8,0)) AS CUSTOM8, 
> SUM(COALESCE(col9,0)) AS CUSTOM9, 
> SUM(COALESCE(col10,0)) AS CUSTOM10, 
> SUM(COALESCE(col11,0)) AS CUSTOM11, 
> SUM(COALESCE(col12,0)) AS CUSTOM12, 
> SUM(COALESCE(col13,0)) AS CUSTOM13, 
> SUM(COALESCE(col14,0)) AS CUSTOM14, 
> SUM(COALESCE(col15,0)) AS CUSTOM15, 
> SUM(COALESCE(col16,0)) AS CUSTOM16, 
> SUM(COALESCE(col17,0)) AS CUSTOM17, 
> SUM(COALESCE(col18,0)) AS CUSTOM18, 
> SUM(COALESCE(col19,0)) AS CUSTOM19, 
> SUM(COALESCE(col20,0)) AS CUSTOM20, 
> SUM(COALESCE(col21,0)) AS CUSTOM21, 
> SUM(COALESCE(col22,0)) AS CUSTOM22
> FROM txt_table ;
> The 21-column SELECT command is:
> SELECT  
> SUM(COALESCE(col1,0)) AS CUSTOM1, 
> SUM(COALESCE(col2,0)) AS CUSTOM2, 
> SUM(COALESCE(col3,0)) AS CUSTOM3, 
> SUM(COALESCE(col4,0)) AS CUSTOM4, 
> SUM(COALESCE(col5,0)) AS CUSTOM5, 
> SUM(COALESCE(col6,0)) AS CUSTOM6, 
> SUM(COALESCE(col7,0)) AS CUSTOM7, 
> SUM(COALESCE(col8,0)) AS CUSTOM8, 
> SUM(COALESCE(col9,0)) AS CUSTOM9, 
> SUM(COALESCE(col10,0)) AS CUSTOM10, 
> SUM(COALESCE(col11,0)) AS CUSTOM11, 
> SUM(COALESCE(col12,0)) AS CUSTOM12, 
> SUM(COALESCE(col13,0)) AS CUSTOM13, 
> SUM(COALESCE(col14,0)) AS CUSTOM14, 
> SUM(COALESCE(col15,0)) AS CUSTOM15, 
> SUM(COALESCE(col16,0)) AS CUSTOM16, 
> SUM(COALESCE(col17,0)) AS CUSTOM17, 
> SUM(COALESCE(col18,0)) AS CUSTOM18, 
> SUM(COALESCE(col19,0)) AS CUSTOM19, 
> SUM(COALESCE(col20,0)) AS CUSTOM20, 
> SUM(COALESCE(col21,0)) AS CUSTOM21
> FROM txt_table ;



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20036) impossible to read a whole kafka topic using kafka 0.10 and spark 2.0.0

2017-03-23 Thread Daniel Nuriyev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15938472#comment-15938472
 ] 

Daniel Nuriyev commented on SPARK-20036:


To provide more info, I am attaching the pom.xml and the code (with comments) that 
I used to narrow down the issue.
Debugging led me to KafkaUtils.fixKafkaParams, which replaces "earliest" with 
"none":
https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaUtils.scala
KafkaUtils.fixKafkaParams is called from the package-private class 
DirectKafkaInputDStream.
I do not know whether this is the reason.

The way to reproduce is to run the attached code against a topic that already 
has entries with offsets > 0. The problem is that no existing entries are read; 
only new entries are read.
I could consistently reproduce the problem.
The problem appeared when we upgraded the Kafka client from 0.8 to 0.10.0.
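For reference, here is a minimal sketch of the official streaming-kafka-0-10 
direct-stream setup, assuming an existing StreamingContext {{ssc}}; the broker address, 
group id and topic are placeholders.

{code:scala}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",           // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",                     // placeholder
  "auto.offset.reset" -> "earliest",                 // only consulted when the group has no committed offsets
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("myTopic"), kafkaParams)
)
stream.map(record => (record.key, record.value))
{code}

The comment on {{auto.offset.reset}} reflects standard Kafka consumer behaviour: once 
the consumer group has committed offsets, "earliest" is no longer consulted, which can 
look very much like the batches coming back empty.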

> impossible to read a whole kafka topic using kafka 0.10 and spark 2.0.0 
> 
>
> Key: SPARK-20036
> URL: https://issues.apache.org/jira/browse/SPARK-20036
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Daniel Nuriyev
> Attachments: Main.java, pom.xml
>
>
> I use kafka 0.10.1 and java code with the following dependencies:
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka_2.11</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka-clients</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> The code tries to read the whole topic using:
> kafkaParams.put("auto.offset.reset", "earliest");
> Using 5 second batches:
> jssc = new JavaStreamingContext(conf, Durations.seconds(5));
> Each batch returns empty.
> When I debugged the code, I noticed that KafkaUtils.fixKafkaParams is called, which 
> overrides "earliest" with "none".
> Whether this is related or not, when I used kafka 0.8 on the client with 
> kafka 0.10.1 on the server, I could read the whole topic.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20036) impossible to read a whole kafka topic using kafka 0.10 and spark 2.0.0

2017-03-23 Thread Daniel Nuriyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Nuriyev updated SPARK-20036:
---
Attachment: Main.java
pom.xml

> impossible to read a whole kafka topic using kafka 0.10 and spark 2.0.0 
> 
>
> Key: SPARK-20036
> URL: https://issues.apache.org/jira/browse/SPARK-20036
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Daniel Nuriyev
> Attachments: Main.java, pom.xml
>
>
> I use kafka 0.10.1 and java code with the following dependencies:
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka_2.11</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka-clients</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> The code tries to read the whole topic using:
> kafkaParams.put("auto.offset.reset", "earliest");
> Using 5 second batches:
> jssc = new JavaStreamingContext(conf, Durations.seconds(5));
> Each batch returns empty.
> When I debugged the code, I noticed that KafkaUtils.fixKafkaParams is called, which 
> overrides "earliest" with "none".
> Whether this is related or not, when I used kafka 0.8 on the client with 
> kafka 0.10.1 on the server, I could read the whole topic.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19716) Dataset should allow by-name resolution for struct type elements in array

2017-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15938438#comment-15938438
 ] 

Apache Spark commented on SPARK-19716:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17398

> Dataset should allow by-name resolution for struct type elements in array
> -
>
> Key: SPARK-19716
> URL: https://issues.apache.org/jira/browse/SPARK-19716
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>
> If we have a DataFrame with schema {{a: int, b: int, c: int}} and convert it 
> to a Dataset with {{case class Data(a: Int, c: Int)}}, it works: we extract the 
> `a` and `c` columns to build the Data.
> However, if the struct is inside an array, e.g. the schema is 
> {{arr: array<struct<...>>}}, and we want to convert it to a Dataset with 
> {{case class ComplexData(arr: Seq[Data])}}, we will fail. The reason is that, to 
> allow compatible types, e.g. converting {{a: int}} to {{case class A(a: Long)}}, 
> we add a cast for each field, except for struct type fields, because struct 
> types are flexible: the number of columns can mismatch. We should probably also 
> skip the cast for array and map types.
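A self-contained sketch of the two cases described above; {{Full}} is an assumed helper 
class used only to build the input frames, and the other names follow the description.

{code:scala}
import org.apache.spark.sql.SparkSession

case class Full(a: Int, b: Int, c: Int)   // source rows, with an extra column `b`
case class Data(a: Int, c: Int)
case class ComplexData(arr: Seq[Data])

val spark = SparkSession.builder().master("local[*]").appName("by-name").getOrCreate()
import spark.implicits._

// Flat struct: by-name resolution works, `a` and `c` are extracted and `b` is ignored.
val flat = Seq(Full(1, 2, 3)).toDF()
flat.as[Data].show()

// Struct nested in an array (arr: array<struct<a, b, c>>): this is the case that
// fails before the fix, because the element struct has more fields than Data.
val nested = Seq(Tuple1(Seq(Full(1, 2, 3)))).toDF("arr")
nested.as[ComplexData].show()
{code}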



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19716) Dataset should allow by-name resolution for struct type elements in array

2017-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19716:


Assignee: Apache Spark

> Dataset should allow by-name resolution for struct type elements in array
> -
>
> Key: SPARK-19716
> URL: https://issues.apache.org/jira/browse/SPARK-19716
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>
> If we have a DataFrame with schema {{a: int, b: int, c: int}} and convert it 
> to a Dataset with {{case class Data(a: Int, c: Int)}}, it works: we extract the 
> `a` and `c` columns to build the Data.
> However, if the struct is inside an array, e.g. the schema is 
> {{arr: array<struct<...>>}}, and we want to convert it to a Dataset with 
> {{case class ComplexData(arr: Seq[Data])}}, we will fail. The reason is that, to 
> allow compatible types, e.g. converting {{a: int}} to {{case class A(a: Long)}}, 
> we add a cast for each field, except for struct type fields, because struct 
> types are flexible: the number of columns can mismatch. We should probably also 
> skip the cast for array and map types.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19716) Dataset should allow by-name resolution for struct type elements in array

2017-03-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19716:


Assignee: (was: Apache Spark)

> Dataset should allow by-name resolution for struct type elements in array
> -
>
> Key: SPARK-19716
> URL: https://issues.apache.org/jira/browse/SPARK-19716
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>
> If we have a DataFrame with schema {{a: int, b: int, c: int}} and convert it 
> to a Dataset with {{case class Data(a: Int, c: Int)}}, it works: we extract the 
> `a` and `c` columns to build the Data.
> However, if the struct is inside an array, e.g. the schema is 
> {{arr: array<struct<...>>}}, and we want to convert it to a Dataset with 
> {{case class ComplexData(arr: Seq[Data])}}, we will fail. The reason is that, to 
> allow compatible types, e.g. converting {{a: int}} to {{case class A(a: Long)}}, 
> we add a cast for each field, except for struct type fields, because struct 
> types are flexible: the number of columns can mismatch. We should probably also 
> skip the cast for array and map types.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20072) Clarify ALS-WR documentation

2017-03-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15938415#comment-15938415
 ] 

Sean Owen commented on SPARK-20072:
---

I don't think those two wordings differ meaningfully? I think small suggestions 
are OK, but weigh the value vs. the overhead of processing these changes. This is 
pretty borderline.

> Clarify ALS-WR documentation
> 
>
> Key: SPARK-20072
> URL: https://issues.apache.org/jira/browse/SPARK-20072
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: chris snow
>Priority: Trivial
>
> https://www.mail-archive.com/user@spark.apache.org/msg62590.html
> The documentation for collaborative filtering is as follows:
> ===
> Scaling of the regularization parameter
> Since v1.1, we scale the regularization parameter lambda in solving
> each least squares problem by the number of ratings the user generated
> in updating user factors, or the number of ratings the product
> received in updating product factors.
> ===
> I find this description confusing, probably because I lack a detailed
> understanding of ALS. The wording suggests that the number of ratings
> changes ("generated", "received") while solving the least squares.
> This is how I think I should be interpreting the description:
> ===
> Since v1.1, we scale the regularization parameter lambda when solving
> each least squares problem.  When updating the user factors, we scale
> the regularization parameter by the total number of ratings from the
> user.  Similarly, when updating the product factors, we scale the
> regularization parameter by the total number of ratings for the
> product.
> ===



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20072) Clarify ALS-WR documentation

2017-03-23 Thread chris snow (JIRA)
chris snow created SPARK-20072:
--

 Summary: Clarify ALS-WR documentation
 Key: SPARK-20072
 URL: https://issues.apache.org/jira/browse/SPARK-20072
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 2.1.0
Reporter: chris snow
Priority: Trivial


https://www.mail-archive.com/user@spark.apache.org/msg62590.html

The documentation for collaborative filtering is as follows:

===
Scaling of the regularization parameter

Since v1.1, we scale the regularization parameter lambda in solving
each least squares problem by the number of ratings the user generated
in updating user factors, or the number of ratings the product
received in updating product factors.
===

I find this description confusing, probably because I lack a detailed
understanding of ALS. The wording suggests that the number of ratings
changes ("generated", "received") while solving the least squares.

This is how I think I should be interpreting the description:

===
Since v1.1, we scale the regularization parameter lambda when solving
each least squares problem.  When updating the user factors, we scale
the regularization parameter by the total number of ratings from the
user.  Similarly, when updating the product factors, we scale the
regularization parameter by the total number of ratings for the
product.
===
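For readers who prefer a formula to prose, the objective being described can be 
sketched as follows (notation is mine, not taken from the Spark docs): n_u is the 
number of ratings from user u, n_i is the number of ratings for product i, and each 
one scales lambda in the corresponding least-squares subproblem.

{noformat}
\min_{X,Y} \sum_{(u,i)\ \mathrm{observed}} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_u n_u \lVert x_u \rVert^2 + \sum_i n_i \lVert y_i \rVert^2 \right)
{noformat}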



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20071) StringIndexer overflows Kryo serialization buffer when run on column with many long distinct values

2017-03-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20071:
--
Issue Type: Improvement  (was: Bug)

Not a bug, right?
You can effect some of this yourself with preprocessing. Yes it's manual but it 
means you have full flexibility. Putting it in StringIndexer may not actually 
simplify things.
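A rough sketch of that kind of preprocessing, assuming a SparkSession {{spark}} and a 
DataFrame {{df}} with a string column "s"; all names and thresholds here are 
illustrative.

{code:scala}
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.StringIndexer

// 1) Keep only the N most frequent values and map everything else to "Other".
val topN = 1000
val topValues = df.groupBy("s").count()
  .orderBy(desc("count"))
  .limit(topN)
  .select("s")
  .collect()
  .map(_.getString(0))
  .toSet
val bTop = spark.sparkContext.broadcast(topValues)
val keepTop = udf((v: String) => if (v != null && bTop.value.contains(v)) v else "Other")

// 2) Optionally truncate very long labels before indexing.
val maxLen = 100
val prepped = df.withColumn("s_capped", substring(keepTop(col("s")), 1, maxLen))

val indexer = new StringIndexer().setInputCol("s_capped").setOutputCol("s_idx")
val indexed = indexer.fit(prepped).transform(prepped)
{code}

Whether the truncation step is acceptable depends on whether the long values are still 
distinguishable by their first {{maxLen}} characters.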

> StringIndexer overflows Kryo serialization buffer when run on column with 
> many long distinct values
> ---
>
> Key: SPARK-20071
> URL: https://issues.apache.org/jira/browse/SPARK-20071
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>Priority: Minor
>
> I marked this as minor because there are workarounds.
> I have a 2 million row dataset with a string column that is mostly unique and 
> contains many very long values.
> Most of the values are between 1,000 and 40,000 characters long.
> I am using the KryoSerializer and increased spark.kryoserializer.buffer.max 
> to 256m. 
> If I try to run StringIndexer.fit on this column, I will get an OutOfMemory 
> exception or more likely a Buffer overflow error like 
> {code}
> org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. 
> Available: 0, required: 23. 
> To avoid this, increase spark.kryoserializer.buffer.max 
> value.org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315)
>  
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324) 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {code}
> This result is not that surprising given that we are trying to index a column 
> like this; however, I can think of some suggestions that would help avoid the 
> error and maybe help performance.
> These possible enhancements to StringIndexer might be hard, but I thought I 
> would suggest them anyway, just in case they are not.
> 1) Add param for Top N values. I know that StringIndexer gives lower indices 
> to the more commonly occurring values. It would be great if one could specify 
> that I only want to index the top N values and lump everything else into a 
> special "Other" value.
> 2) Add param for label length limit. Only consider the first L characters of 
> labels when doing the indexing.
> Either of these enhancements would work, but I suppose they can also be 
> implemented with additional work as steps preceding the indexer in the 
> pipeline. Perhaps topByKey could be used to replace the column with one that 
> has the top values and "Other" as suggested in 1).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19728) PythonUDF with multiple parents shouldn't be pushed down when used as a predicate

2017-03-23 Thread Maciej Szymkiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz closed SPARK-19728.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

>  PythonUDF with multiple parents shouldn't be pushed down when used as a 
> predicate
> --
>
> Key: SPARK-19728
> URL: https://issues.apache.org/jira/browse/SPARK-19728
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
> Fix For: 2.2.0
>
>
> Prior to Spark 2.0 it was possible to use Python UDF output as a predicate:
> {code}
> from pyspark.sql.functions import udf
> from pyspark.sql.types import BooleanType
> df1 = sc.parallelize([(1, ), (2, )]).toDF(["col_a"])
> df2 = sc.parallelize([(2, ), (3, )]).toDF(["col_b"])
> pred = udf(lambda x, y: x == y, BooleanType())
> df1.join(df2).where(pred("col_a", "col_b")).show()
> {code}
> In Spark 2.0 this is no longer possible:
> {code}
> spark.conf.set("spark.sql.crossJoin.enabled", True)
> df1.join(df2).where(pred("col_a", "col_b")).show()
> ## ...
> ## Py4JJavaError: An error occurred while calling o731.showString.
> : java.lang.RuntimeException: Invalid PythonUDF (col_a#132L, 
> col_b#135L), requires attributes from more than one child.
> ## ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


