[jira] [Updated] (SPARK-20135) spark thriftserver2: no job running but cores not release on yarn

2017-03-28 Thread bruce xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bruce xu updated SPARK-20135:
-
Attachment: 0329-3.png
0329-2.png
0329-1.png

Cores and memory are not released for a long time when no job is running.

> spark thriftserver2: no job running but cores not release on yarn
> -
>
> Key: SPARK-20135
> URL: https://issues.apache.org/jira/browse/SPARK-20135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: spark 2.0.1 with hadoop 2.6.0 
>Reporter: bruce xu
> Attachments: 0329-1.png, 0329-2.png, 0329-3.png
>
>
> I enabled the executor dynamic allocation feature; however, it sometimes does 
> not work.
> I set the initial executor number to 50, but after the job finished the cores 
> and memory resources were not released.
> From the Spark web UI, the number of active jobs/running tasks/stages is 0, but 
> the executors page shows 1276 cores and 7288 active tasks.
> From the YARN web UI, the Thrift Server application has 639 running containers.
> This may be a bug.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20135) spark thriftserver2: no job running but cores not release on yarn

2017-03-28 Thread bruce xu (JIRA)
bruce xu created SPARK-20135:


 Summary: spark thriftserver2: no job running but cores not release 
on yarn
 Key: SPARK-20135
 URL: https://issues.apache.org/jira/browse/SPARK-20135
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
 Environment: spark 2.0.1 with hadoop 2.6.0 
Reporter: bruce xu


I enabled the executor dynamic allocation feature; however, it sometimes does not 
work.

I set the initial executor number to 50, but after the job finished the cores and 
memory resources were not released.

From the Spark web UI, the number of active jobs/running tasks/stages is 0, but the 
executors page shows 1276 cores and 7288 active tasks.

From the YARN web UI, the Thrift Server application has 639 running containers.

This may be a bug.
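
For context, a minimal sketch of the kind of Thrift Server launch being described. 
Only the initial executor count of 50 comes from the report above; the master and 
the remaining flags and values are assumptions about a typical dynamic-allocation 
setup on YARN:

{code}
# Illustrative launch; only the initial executor count (50) is taken from the
# report above, the other flags and values are assumptions.
sbin/start-thriftserver.sh \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=50 \
  --conf spark.dynamicAllocation.minExecutors=0 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s
{code}

With a setup like this, executors that stay idle longer than 
spark.dynamicAllocation.executorIdleTimeout should be released back to YARN, which 
is what does not appear to happen here.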







[jira] [Updated] (SPARK-20107) Add spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option to configuration.md

2017-03-28 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-20107:

Description: 
Setting {{spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2}} can 
speed up 
[HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121]
 for many output files. 

It can save {{11 minutes}} for 216869 output files:
{code:sql}
CREATE TABLE tmp.spark_20107 AS SELECT
  category_id,
  product_id,
  track_id,
  concat(
substr(ds, 3, 2),
substr(ds, 6, 2),
substr(ds, 9, 2)
  ) shortDate,
  CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' 
WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE 
'invalid action' END AS type
FROM
  tmp.user_action
WHERE
  ds > date_sub('2017-01-23', 730)
AND actiontype IN ('0','1','2','3');
{code}
{code}
$ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
216870
{code}

We should add this option to 
[configuration.md|http://spark.apache.org/docs/latest/configuration.html].

All Cloudera Hadoop 2.6.0-cdh5.4.0 or higher versions (see 
[cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433]
 and 
[cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0])
 and Apache Hadoop 2.7.0 or higher versions support this improvement.
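
For readers of this thread, a minimal sketch of how the option can be set from a 
Spark application (the application name is illustrative; the same value can also be 
passed to spark-submit or spark-sql with --conf). Any {{spark.hadoop.*}} entry is 
copied into the Hadoop {{Configuration}}, which is what makes this spelling work:

{code}
import org.apache.spark.sql.SparkSession

// Enable the v2 FileOutputCommitter algorithm for jobs run through this session.
// The app name is illustrative.
val spark = SparkSession.builder()
  .appName("commit-algorithm-v2")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()
{code}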

  was:
Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up 
[HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121]
 for many output files.

It can save {{11 minutes}} for 216869 output files:
{code:sql}
CREATE TABLE tmp.spark_20107 AS SELECT
  category_id,
  product_id,
  track_id,
  concat(
substr(ds, 3, 2),
substr(ds, 6, 2),
substr(ds, 9, 2)
  ) shortDate,
  CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' 
WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE 
'invalid action' END AS type
FROM
  tmp.user_action
WHERE
  ds > date_sub('2017-01-23', 730)
AND actiontype IN ('0','1','2','3');
{code}
{code}
$ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
216870
{code}


This improvement affects all Cloudera Hadoop cdh5-2.6.0_5.4.0 or higher 
versions (see 
[cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433]
 and 
[cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0])
 and Apache Hadoop 2.7.0 or higher versions.


> Add spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option to 
> configuration.md
> ---
>
> Key: SPARK-20107
> URL: https://issues.apache.org/jira/browse/SPARK-20107
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Trivial
>
> Setting {{spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2}} can 
> speed up 
> [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121]
>  for many output files. 
> It can save {{11 minutes}} for 216869 output files:
> {code:sql}
> CREATE TABLE tmp.spark_20107 AS SELECT
>   category_id,
>   product_id,
>   track_id,
>   concat(
> substr(ds, 3, 2),
> substr(ds, 6, 2),
> substr(ds, 9, 2)
>   ) shortDate,
>   CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' 
> WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE 
> 'invalid action' END AS type
> FROM
>   tmp.user_action
> WHERE
>   ds > date_sub('2017-01-23', 730)
> AND actiontype IN ('0','1','2','3');
> {code}
> {code}
> $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
> 216870
> {code}
> We should add this option to 
> [configuration.md|http://spark.apache.org/docs/latest/configuration.html].
> All Cloudera Hadoop 2.6.0-cdh5.4.0 or higher versions (see 
> [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433]
>  and 
> [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0])
>  and Apache Hadoop 2.7.0 or higher versions support this improvement.





[jira] [Assigned] (SPARK-20107) Add spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option to configuration.md

2017-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20107:
-

Assignee: Yuming Wang
Priority: Trivial  (was: Major)
 Summary: Add spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 
option to configuration.md  (was: Speed up 
HadoopMapReduceCommitProtocol#commitJob for many output files)

> Add spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option to 
> configuration.md
> ---
>
> Key: SPARK-20107
> URL: https://issues.apache.org/jira/browse/SPARK-20107
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Trivial
>
> Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up 
> [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121]
>  for many output files.
> It can save {{11 minutes}} for 216869 output files:
> {code:sql}
> CREATE TABLE tmp.spark_20107 AS SELECT
>   category_id,
>   product_id,
>   track_id,
>   concat(
> substr(ds, 3, 2),
> substr(ds, 6, 2),
> substr(ds, 9, 2)
>   ) shortDate,
>   CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' 
> WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE 
> 'invalid action' END AS type
> FROM
>   tmp.user_action
> WHERE
>   ds > date_sub('2017-01-23', 730)
> AND actiontype IN ('0','1','2','3');
> {code}
> {code}
> $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
> 216870
> {code}
> This improvement affects all Cloudera Hadoop cdh5-2.6.0_5.4.0 or higher 
> versions (see 
> [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433]
>  and 
> [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0])
>  and Apache Hadoop 2.7.0 or higher versions.






[jira] [Updated] (SPARK-20107) Speed up HadoopMapReduceCommitProtocol#commitJob for many output files

2017-03-28 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-20107:

Component/s: (was: SQL)

> Speed up HadoopMapReduceCommitProtocol#commitJob for many output files
> --
>
> Key: SPARK-20107
> URL: https://issues.apache.org/jira/browse/SPARK-20107
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>
> Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up 
> [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121]
>  for many output files.
> It can save {{11 minutes}} for 216869 output files:
> {code:sql}
> CREATE TABLE tmp.spark_20107 AS SELECT
>   category_id,
>   product_id,
>   track_id,
>   concat(
> substr(ds, 3, 2),
> substr(ds, 6, 2),
> substr(ds, 9, 2)
>   ) shortDate,
>   CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' 
> WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE 
> 'invalid action' END AS type
> FROM
>   tmp.user_action
> WHERE
>   ds > date_sub('2017-01-23', 730)
> AND actiontype IN ('0','1','2','3');
> {code}
> {code}
> $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
> 216870
> {code}
> This improvement affects all Cloudera Hadoop cdh5-2.6.0_5.4.0 or higher 
> versions (see 
> [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433]
>  and 
> [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0])
>  and Apache Hadoop 2.7.0 or higher versions.






[jira] [Updated] (SPARK-20107) Speed up HadoopMapReduceCommitProtocol#commitJob for many output files

2017-03-28 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-20107:

Component/s: Documentation

> Speed up HadoopMapReduceCommitProtocol#commitJob for many output files
> --
>
> Key: SPARK-20107
> URL: https://issues.apache.org/jira/browse/SPARK-20107
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>
> Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up 
> [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121]
>  for many output files.
> It can save {{11 minutes}} for 216869 output files:
> {code:sql}
> CREATE TABLE tmp.spark_20107 AS SELECT
>   category_id,
>   product_id,
>   track_id,
>   concat(
> substr(ds, 3, 2),
> substr(ds, 6, 2),
> substr(ds, 9, 2)
>   ) shortDate,
>   CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' 
> WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE 
> 'invalid action' END AS type
> FROM
>   tmp.user_action
> WHERE
>   ds > date_sub('2017-01-23', 730)
> AND actiontype IN ('0','1','2','3');
> {code}
> {code}
> $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
> 216870
> {code}
> This improvement affects all Cloudera Hadoop cdh5-2.6.0_5.4.0 or higher 
> versions (see 
> [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433]
>  and 
> [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0])
>  and Apache Hadoop 2.7.0 or higher versions.






[jira] [Assigned] (SPARK-20134) SQLMetrics.postDriverMetricUpdates to simplify driver side metric updates

2017-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20134:


Assignee: Reynold Xin  (was: Apache Spark)

> SQLMetrics.postDriverMetricUpdates to simplify driver side metric updates
> -
>
> Key: SPARK-20134
> URL: https://issues.apache.org/jira/browse/SPARK-20134
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> It is not super intuitive how to update SQLMetric on the driver side. This 
> patch introduces a new SQLMetrics.postDriverMetricUpdates function to do 
> that, and adds documentation to make it more obvious.






[jira] [Assigned] (SPARK-20134) SQLMetrics.postDriverMetricUpdates to simplify driver side metric updates

2017-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20134:


Assignee: Apache Spark  (was: Reynold Xin)

> SQLMetrics.postDriverMetricUpdates to simplify driver side metric updates
> -
>
> Key: SPARK-20134
> URL: https://issues.apache.org/jira/browse/SPARK-20134
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> It is not super intuitive how to update SQLMetric on the driver side. This 
> patch introduces a new SQLMetrics.postDriverMetricUpdates function to do 
> that, and adds documentation to make it more obvious.






[jira] [Commented] (SPARK-20134) SQLMetrics.postDriverMetricUpdates to simplify driver side metric updates

2017-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946525#comment-15946525
 ] 

Apache Spark commented on SPARK-20134:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17464

> SQLMetrics.postDriverMetricUpdates to simplify driver side metric updates
> -
>
> Key: SPARK-20134
> URL: https://issues.apache.org/jira/browse/SPARK-20134
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> It is not super intuitive how to update SQLMetric on the driver side. This 
> patch introduces a new SQLMetrics.postDriverMetricUpdates function to do 
> that, and adds documentation to make it more obvious.






[jira] [Created] (SPARK-20134) SQLMetrics.postDriverMetricUpdates to simplify driver side metric updates

2017-03-28 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-20134:
---

 Summary: SQLMetrics.postDriverMetricUpdates to simplify driver 
side metric updates
 Key: SPARK-20134
 URL: https://issues.apache.org/jira/browse/SPARK-20134
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.1.0
Reporter: Reynold Xin
Assignee: Reynold Xin


It is not super intuitive how to update SQLMetric on the driver side. This 
patch introduces a new SQLMetrics.postDriverMetricUpdates function to do that, 
and adds documentation to make it more obvious.
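
For orientation, a rough sketch of how a driver-side metric update might look once 
this helper exists. The exact signature of {{SQLMetrics.postDriverMetricUpdates}} 
is defined by the pull request, not by this issue text, and the use of 
{{SQLMetrics.createMetric}} and {{SQLExecution.EXECUTION_ID_KEY}} below is an 
assumption based on how driver-side metrics are usually wired up:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.SQLExecution
import org.apache.spark.sql.execution.metric.SQLMetrics

val spark = SparkSession.builder().appName("driver-metric-sketch").getOrCreate()
val sc = spark.sparkContext

// A metric that is only updated on the driver, e.g. time spent building a
// broadcast relation (illustrative name and value).
val collectTime = SQLMetrics.createMetric(sc, "time to collect")
collectTime.add(42L)

// Post the update for the current SQL execution (if any) so the SQL UI sees it.
val executionId = sc.getLocalProperty(SQLExecution.EXECUTION_ID_KEY)
if (executionId != null) {
  SQLMetrics.postDriverMetricUpdates(sc, executionId, Seq(collectTime))
}
{code}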







[jira] [Assigned] (SPARK-20131) Flaky Test: org.apache.spark.streaming.StreamingContextSuite

2017-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20131:


Assignee: (was: Apache Spark)

> Flaky Test: org.apache.spark.streaming.StreamingContextSuite
> 
>
> Key: SPARK-20131
> URL: https://issues.apache.org/jira/browse/SPARK-20131
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Takuya Ueshin
>Priority: Minor
>  Labels: flaky-test
>
> This test failed recently here:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/2861/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/SPARK_18560_Receiver_data_should_be_deserialized_properly_/
> Dashboard
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.StreamingContextSuite_name=SPARK-18560+Receiver+data+should+be+deserialized+properly.
> Error Message
> {code}
> latch.await(60L, SECONDS) was false
> {code}
> {code}
> org.scalatest.exceptions.TestFailedException: latch.await(60L, SECONDS) was 
> false
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply$mcV$sp(StreamingContextSuite.scala:837)
>   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810)
>   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingContextSuite.scala:44)
>   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
>   at 
> org.apache.spark.streaming.StreamingContextSuite.runTest(StreamingContextSuite.scala:44)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at 
> org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$run(StreamingContextSuite.scala:44)
>   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
>   at 
> 

[jira] [Commented] (SPARK-20131) Flaky Test: org.apache.spark.streaming.StreamingContextSuite

2017-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946522#comment-15946522
 ] 

Apache Spark commented on SPARK-20131:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17463

> Flaky Test: org.apache.spark.streaming.StreamingContextSuite
> 
>
> Key: SPARK-20131
> URL: https://issues.apache.org/jira/browse/SPARK-20131
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Takuya Ueshin
>Priority: Minor
>  Labels: flaky-test
>
> This test failed recently here:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/2861/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/SPARK_18560_Receiver_data_should_be_deserialized_properly_/
> Dashboard
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.StreamingContextSuite_name=SPARK-18560+Receiver+data+should+be+deserialized+properly.
> Error Message
> {code}
> latch.await(60L, SECONDS) was false
> {code}
> {code}
> org.scalatest.exceptions.TestFailedException: latch.await(60L, SECONDS) was 
> false
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply$mcV$sp(StreamingContextSuite.scala:837)
>   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810)
>   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingContextSuite.scala:44)
>   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
>   at 
> org.apache.spark.streaming.StreamingContextSuite.runTest(StreamingContextSuite.scala:44)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at 
> org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$run(StreamingContextSuite.scala:44)
>   at 

[jira] [Resolved] (SPARK-20093) Exception when Joining dataframe with another dataframe generated by applying groupBy transformation on original one

2017-03-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20093.
--
Resolution: Duplicate

^ It seems a duplicate of that to me as well. I am resolving this. Please 
reopen if anyone feels this is different.

> Exception when Joining dataframe with another dataframe generated by applying 
> groupBy transformation on original one
> 
>
> Key: SPARK-20093
> URL: https://issues.apache.org/jira/browse/SPARK-20093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.2.0
>Reporter: Hosur Narahari
>
> When we generate a DataFrame by grouping and then join it with the original 
> DataFrame on the aggregated column, we get an AnalysisException. Below I've 
> attached a piece of code and the resulting exception to reproduce it.
> Code:
> import org.apache.spark.sql.SparkSession
> object App {
>   lazy val spark = 
> SparkSession.builder.appName("Test").master("local").getOrCreate
>   def main(args: Array[String]): Unit = {
> test1
>   }
>   private def test1 {
> import org.apache.spark.sql.functions._
> val df = spark.createDataFrame(Seq(("M",172,60), ("M", 170, 60), ("F", 
> 155, 56), ("M", 160, 55), ("F", 150, 53))).toDF("gender", "height", "weight")
> val groupDF = df.groupBy("gender").agg(min("height").as("height"))
> groupDF.show()
> val out = groupDF.join(df, groupDF("height") <=> 
> df("height")).select(df("gender"), df("height"), df("weight"))
> out.show
>   }
> }
> When I ran above code, I got below exception:
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) height#8 missing from 
> height#19,height#30,gender#29,weight#31,gender#7 in operator !Join Inner, 
> (height#19 <=> height#8);;
> !Join Inner, (height#19 <=> height#8)
> :- Aggregate [gender#7], [gender#7, min(height#8) AS height#19]
> :  +- Project [_1#0 AS gender#7, _2#1 AS height#8, _3#2 AS weight#9]
> : +- LocalRelation [_1#0, _2#1, _3#2]
> +- Project [_1#0 AS gender#29, _2#1 AS height#30, _3#2 AS weight#31]
>+- LocalRelation [_1#0, _2#1, _3#2]
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:90)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:342)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:90)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:53)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2831)
>   at org.apache.spark.sql.Dataset.join(Dataset.scala:843)
>   at org.apache.spark.sql.Dataset.join(Dataset.scala:807)
>   at App$.test1(App.scala:17)
>   at App$.main(App.scala:9)
>   at App.main(App.scala)
> Could someone please look into this?






[jira] [Commented] (SPARK-20093) Exception when Joining dataframe with another dataframe generated by applying groupBy transformation on original one

2017-03-28 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946501#comment-15946501
 ] 

Takeshi Yamamuro commented on SPARK-20093:
--

It seems this issue is the same with SPARK-10925.

> Exception when Joining dataframe with another dataframe generated by applying 
> groupBy transformation on original one
> 
>
> Key: SPARK-20093
> URL: https://issues.apache.org/jira/browse/SPARK-20093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.2.0
>Reporter: Hosur Narahari
>
> When we generate a DataFrame by grouping and then join it with the original 
> DataFrame on the aggregated column, we get an AnalysisException. Below I've 
> attached a piece of code and the resulting exception to reproduce it.
> Code:
> import org.apache.spark.sql.SparkSession
> object App {
>   lazy val spark = 
> SparkSession.builder.appName("Test").master("local").getOrCreate
>   def main(args: Array[String]): Unit = {
> test1
>   }
>   private def test1 {
> import org.apache.spark.sql.functions._
> val df = spark.createDataFrame(Seq(("M",172,60), ("M", 170, 60), ("F", 
> 155, 56), ("M", 160, 55), ("F", 150, 53))).toDF("gender", "height", "weight")
> val groupDF = df.groupBy("gender").agg(min("height").as("height"))
> groupDF.show()
> val out = groupDF.join(df, groupDF("height") <=> 
> df("height")).select(df("gender"), df("height"), df("weight"))
> out.show
>   }
> }
> When I ran above code, I got below exception:
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) height#8 missing from 
> height#19,height#30,gender#29,weight#31,gender#7 in operator !Join Inner, 
> (height#19 <=> height#8);;
> !Join Inner, (height#19 <=> height#8)
> :- Aggregate [gender#7], [gender#7, min(height#8) AS height#19]
> :  +- Project [_1#0 AS gender#7, _2#1 AS height#8, _3#2 AS weight#9]
> : +- LocalRelation [_1#0, _2#1, _3#2]
> +- Project [_1#0 AS gender#29, _2#1 AS height#30, _3#2 AS weight#31]
>+- LocalRelation [_1#0, _2#1, _3#2]
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:90)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:342)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:90)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:53)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2831)
>   at org.apache.spark.sql.Dataset.join(Dataset.scala:843)
>   at org.apache.spark.sql.Dataset.join(Dataset.scala:807)
>   at App$.test1(App.scala:17)
>   at App$.main(App.scala:9)
>   at App.main(App.scala)
> Could someone please look into this?






[jira] [Commented] (SPARK-20128) MetricsSystem not always killed in SparkContext.stop()

2017-03-28 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946493#comment-15946493
 ] 

Saisai Shao commented on SPARK-20128:
-

Sorry I cannot access the logs. What I could see from the link provided above 
is:

{noformat}
[info] - internal accumulators in multiple stages (185 milliseconds)
3/24/17 2:02:19 PM =

-- Gauges --
master.aliveWorkers
3/24/17 2:22:19 PM =

-- Gauges --
master.aliveWorkers
3/24/17 2:42:19 PM =

-- Gauges --
master.aliveWorkers
3/24/17 3:02:19 PM =

-- Gauges --
master.aliveWorkers
3/24/17 3:22:19 PM =

-- Gauges --
master.aliveWorkers
3/24/17 3:42:19 PM =

-- Gauges --
master.aliveWorkers
3/24/17 4:02:19 PM =

-- Gauges --
master.aliveWorkers
3/24/17 4:22:19 PM =

-- Gauges --
master.aliveWorkers
3/24/17 4:42:19 PM =

-- Gauges --
master.aliveWorkers
3/24/17 5:02:19 PM =

-- Gauges --
master.aliveWorkers
3/24/17 5:22:19 PM =

-- Gauges --
{noformat}

From the console output, what I can see is that after the {{internal 
accumulators in multiple stages}} unit test finishes, the whole test run 
hangs and just prints some metrics information.

> MetricsSystem not always killed in SparkContext.stop()
> --
>
> Key: SPARK-20128
> URL: https://issues.apache.org/jira/browse/SPARK-20128
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>  Labels: flaky-test
>
> One Jenkins run failed due to the MetricsSystem never getting killed after a 
> failed test, which led that test to hang and the tests to timeout:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75176
> {noformat}
> 17/03/24 13:44:19.537 dag-scheduler-event-loop ERROR 
> DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting 
> down SparkContext
> java.lang.ArrayIndexOutOfBoundsException: -1
> at 
> org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:431)
> at 
> org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:430)
> at scala.Option.flatMap(Option.scala:171)
> at 
> org.apache.spark.MapOutputTrackerMaster.getEpochForMapOutput(MapOutputTracker.scala:430)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1298)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1731)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1689)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1678)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> 17/03/24 13:44:19.540 dispatcher-event-loop-11 INFO 
> MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
> 17/03/24 13:44:19.546 stop-spark-context INFO MemoryStore: MemoryStore cleared
> 17/03/24 13:44:19.546 stop-spark-context INFO BlockManager: BlockManager 
> stopped
> 17/03/24 13:44:19.546 stop-spark-context INFO BlockManagerMaster: 
> BlockManagerMaster stopped
> 17/03/24 13:44:19.546 dispatcher-event-loop-16 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully 
> 

[jira] [Commented] (SPARK-20128) MetricsSystem not always killed in SparkContext.stop()

2017-03-28 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946486#comment-15946486
 ] 

Imran Rashid commented on SPARK-20128:
--

Thanks [~jerryshao], that is helpful; it is definitely good to know this seems to 
be limited to unit tests.

But I still don't see how the Master isn't getting stopped.  We never see 
{{logInfo("Shutting down local Spark cluster.")}}, which means that we're 
never calling {{LocalSparkCluster.stop()}} in this case.

> MetricsSystem not always killed in SparkContext.stop()
> --
>
> Key: SPARK-20128
> URL: https://issues.apache.org/jira/browse/SPARK-20128
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>  Labels: flaky-test
>
> One Jenkins run failed due to the MetricsSystem never getting killed after a 
> failed test, which led that test to hang and the tests to timeout:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75176
> {noformat}
> 17/03/24 13:44:19.537 dag-scheduler-event-loop ERROR 
> DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting 
> down SparkContext
> java.lang.ArrayIndexOutOfBoundsException: -1
> at 
> org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:431)
> at 
> org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:430)
> at scala.Option.flatMap(Option.scala:171)
> at 
> org.apache.spark.MapOutputTrackerMaster.getEpochForMapOutput(MapOutputTracker.scala:430)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1298)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1731)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1689)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1678)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> 17/03/24 13:44:19.540 dispatcher-event-loop-11 INFO 
> MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
> 17/03/24 13:44:19.546 stop-spark-context INFO MemoryStore: MemoryStore cleared
> 17/03/24 13:44:19.546 stop-spark-context INFO BlockManager: BlockManager 
> stopped
> 17/03/24 13:44:19.546 stop-spark-context INFO BlockManagerMaster: 
> BlockManagerMaster stopped
> 17/03/24 13:44:19.546 dispatcher-event-loop-16 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully 
> stopped SparkContext
> 17/03/24 14:02:19.934 metrics-console-reporter-1-thread-1 ERROR 
> ScheduledReporter: RuntimeException thrown from ConsoleReporter#report. 
> Exception was suppressed.
> java.lang.NullPointerException
> at 
> org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:35)
> at 
> org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:34)
> at 
> com.codahale.metrics.ConsoleReporter.printGauge(ConsoleReporter.java:239)
> ...
> {noformat}
> Unfortunately I didn't save the entire test logs, but what happens is that the 
> initial IndexOutOfBoundsException is a real bug, which causes the 
> SparkContext to stop and the test to fail.  However, the MetricsSystem 
> somehow stays alive, and since it's not a daemon thread, it just hangs, and 
> every 20 minutes we get that NPE from within the metrics system as it tries to 
> report.
> I am totally perplexed at how this can happen; it looks like the metrics 
> system should always be stopped by the time we see
> {noformat}
> 17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully 
> stopped SparkContext
> {noformat}
> I don't think I've ever seen this in real Spark use, but it doesn't look 
> like something that is limited to tests, whatever the cause.






[jira] [Comment Edited] (SPARK-20128) MetricsSystem not always killed in SparkContext.stop()

2017-03-28 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946457#comment-15946457
 ] 

Saisai Shao edited comment on SPARK-20128 at 3/29/17 3:20 AM:
--

Here the exception is from MasterSource, which only exists in the Standalone 
Master, so I think it should not be related to SparkContext; maybe the Master is 
not cleanly stopped. --Also, as I remember, by default we do not enable the 
ConsoleReporter, so I am not sure how this could happen.--

It looks like we have a metrics property in the test resources; that's why the 
console sink is enabled in the unit tests.
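
For readers unfamiliar with the metrics configuration: an entry of the following 
shape in a metrics.properties file on the test classpath would enable a console 
sink that reports every 20 minutes, matching the cadence of the gauge dumps above. 
The property keys are standard Spark metrics settings; the concrete contents of the 
file in the test resources may differ.

{noformat}
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=20
*.sink.console.unit=minutes
{noformat}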


was (Author: jerryshao):
Here the exception is from MasterSource, which only exists in the Standalone 
Master, so I think it should not be related to SparkContext; maybe the Master is 
not cleanly stopped. Also, as I remember, by default we do not enable the 
ConsoleReporter, so I am not sure how this could happen.

> MetricsSystem not always killed in SparkContext.stop()
> --
>
> Key: SPARK-20128
> URL: https://issues.apache.org/jira/browse/SPARK-20128
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>  Labels: flaky-test
>
> One Jenkins run failed due to the MetricsSystem never getting killed after a 
> failed test, which led that test to hang and the tests to timeout:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75176
> {noformat}
> 17/03/24 13:44:19.537 dag-scheduler-event-loop ERROR 
> DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting 
> down SparkContext
> java.lang.ArrayIndexOutOfBoundsException: -1
> at 
> org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:431)
> at 
> org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:430)
> at scala.Option.flatMap(Option.scala:171)
> at 
> org.apache.spark.MapOutputTrackerMaster.getEpochForMapOutput(MapOutputTracker.scala:430)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1298)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1731)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1689)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1678)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> 17/03/24 13:44:19.540 dispatcher-event-loop-11 INFO 
> MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
> 17/03/24 13:44:19.546 stop-spark-context INFO MemoryStore: MemoryStore cleared
> 17/03/24 13:44:19.546 stop-spark-context INFO BlockManager: BlockManager 
> stopped
> 17/03/24 13:44:19.546 stop-spark-context INFO BlockManagerMaster: 
> BlockManagerMaster stopped
> 17/03/24 13:44:19.546 dispatcher-event-loop-16 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully 
> stopped SparkContext
> 17/03/24 14:02:19.934 metrics-console-reporter-1-thread-1 ERROR 
> ScheduledReporter: RuntimeException thrown from ConsoleReporter#report. 
> Exception was suppressed.
> java.lang.NullPointerException
> at 
> org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:35)
> at 
> org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:34)
> at 
> com.codahale.metrics.ConsoleReporter.printGauge(ConsoleReporter.java:239)
> ...
> {noformat}
> Unfortunately I didn't save the entire test logs, but what happens is that the 
> initial IndexOutOfBoundsException is a real bug, which causes the 
> SparkContext to stop and the test to fail.  However, the MetricsSystem 
> somehow stays alive, and since it's not a daemon thread, it just hangs, and 
> every 20 minutes we get that NPE from within the metrics system as it tries to 
> report.
> I am totally perplexed at how this can happen; it looks like the metrics 
> system should always be stopped by the time we see
> {noformat}
> 17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully 
> stopped SparkContext
> {noformat}
> I don't think I've ever seen this in real Spark use, but it doesn't look 
> like something that is limited to tests, whatever the cause.






[jira] [Commented] (SPARK-20128) MetricsSystem not always killed in SparkContext.stop()

2017-03-28 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946457#comment-15946457
 ] 

Saisai Shao commented on SPARK-20128:
-

Here the exception is from MasterSource, which only exists in the Standalone 
Master, so I think it should not be related to SparkContext; maybe the Master is 
not cleanly stopped. Also, as I remember, by default we do not enable the 
ConsoleReporter, so I am not sure how this could happen.

> MetricsSystem not always killed in SparkContext.stop()
> --
>
> Key: SPARK-20128
> URL: https://issues.apache.org/jira/browse/SPARK-20128
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>  Labels: flaky-test
>
> One Jenkins run failed due to the MetricsSystem never getting killed after a 
> failed test, which led that test to hang and the tests to timeout:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75176
> {noformat}
> 17/03/24 13:44:19.537 dag-scheduler-event-loop ERROR 
> DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting 
> down SparkContext
> java.lang.ArrayIndexOutOfBoundsException: -1
> at 
> org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:431)
> at 
> org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:430)
> at scala.Option.flatMap(Option.scala:171)
> at 
> org.apache.spark.MapOutputTrackerMaster.getEpochForMapOutput(MapOutputTracker.scala:430)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1298)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1731)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1689)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1678)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> 17/03/24 13:44:19.540 dispatcher-event-loop-11 INFO 
> MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
> 17/03/24 13:44:19.546 stop-spark-context INFO MemoryStore: MemoryStore cleared
> 17/03/24 13:44:19.546 stop-spark-context INFO BlockManager: BlockManager 
> stopped
> 17/03/24 13:44:19.546 stop-spark-context INFO BlockManagerMaster: 
> BlockManagerMaster stopped
> 17/03/24 13:44:19.546 dispatcher-event-loop-16 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully 
> stopped SparkContext
> 17/03/24 14:02:19.934 metrics-console-reporter-1-thread-1 ERROR 
> ScheduledReporter: RuntimeException thrown from ConsoleReporter#report. 
> Exception was suppressed.
> java.lang.NullPointerException
> at 
> org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:35)
> at 
> org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:34)
> at 
> com.codahale.metrics.ConsoleReporter.printGauge(ConsoleReporter.java:239)
> ...
> {noformat}
> Unfortunately I didn't save the entire test logs, but what happens is that the 
> initial IndexOutOfBoundsException is a real bug, which causes the 
> SparkContext to stop and the test to fail.  However, the MetricsSystem 
> somehow stays alive, and since it's not a daemon thread, it just hangs, and 
> every 20 minutes we get that NPE from within the metrics system as it tries to 
> report.
> I am totally perplexed at how this can happen; it looks like the metrics 
> system should always be stopped by the time we see
> {noformat}
> 17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully 
> stopped SparkContext
> {noformat}
> I don't think I've ever seen this in real Spark use, but it doesn't look 
> like something that is limited to tests, whatever the cause.






[jira] [Updated] (SPARK-20133) User guide for spark.ml.stat.ChiSquareTest

2017-03-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20133:
--
Description: Add new user guide section for spark.ml.stat, and document 
ChiSquareTest.  This may involve adding new example scripts.

> User guide for spark.ml.stat.ChiSquareTest
> --
>
> Key: SPARK-20133
> URL: https://issues.apache.org/jira/browse/SPARK-20133
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Add new user guide section for spark.ml.stat, and document ChiSquareTest.  
> This may involve adding new example scripts.
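
For reference, a minimal sketch of the Scala API such a guide section would 
document, modeled on org.apache.spark.ml.stat.ChiSquareTest as of 2.2. The data is 
illustrative and a SparkSession named {{spark}} is assumed to be in scope (as in 
spark-shell):

{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest
import spark.implicits._

// Toy data: a binary label and a two-element feature vector per row.
val data = Seq(
  (0.0, Vectors.dense(0.5, 10.0)),
  (0.0, Vectors.dense(1.5, 20.0)),
  (1.0, Vectors.dense(1.5, 30.0)),
  (0.0, Vectors.dense(3.5, 30.0)),
  (0.0, Vectors.dense(3.5, 40.0)),
  (1.0, Vectors.dense(3.5, 40.0))
).toDF("label", "features")

// Returns a single-row DataFrame with pValues, degreesOfFreedom and statistics.
val result = ChiSquareTest.test(data, "features", "label").head()
println(s"pValues = ${result.getAs[Vector](0)}")
{code}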






[jira] [Created] (SPARK-20133) User guide for spark.ml.stat.ChiSquareTest

2017-03-28 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-20133:
-

 Summary: User guide for spark.ml.stat.ChiSquareTest
 Key: SPARK-20133
 URL: https://issues.apache.org/jira/browse/SPARK-20133
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 2.2.0
Reporter: Joseph K. Bradley
Priority: Minor









[jira] [Resolved] (SPARK-20040) Python API for ml.stat.ChiSquareTest

2017-03-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-20040.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17421
[https://github.com/apache/spark/pull/17421]

> Python API for ml.stat.ChiSquareTest
> 
>
> Key: SPARK-20040
> URL: https://issues.apache.org/jira/browse/SPARK-20040
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Bago Amirbekian
> Fix For: 2.2.0
>
>
> Add PySpark wrapper for ChiSquareTest.  Note that it's currently called 
> ChiSquare, but I'm about to rename it to ChiSquareTest in [SPARK-20039]






[jira] [Assigned] (SPARK-20040) Python API for ml.stat.ChiSquareTest

2017-03-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-20040:
-

Assignee: Joseph K. Bradley

> Python API for ml.stat.ChiSquareTest
> 
>
> Key: SPARK-20040
> URL: https://issues.apache.org/jira/browse/SPARK-20040
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> Add PySpark wrapper for ChiSquareTest.  Note that it's currently called 
> ChiSquare, but I'm about to rename it to ChiSquareTest in [SPARK-20039]






[jira] [Assigned] (SPARK-20040) Python API for ml.stat.ChiSquareTest

2017-03-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-20040:
-

Assignee: Bago Amirbekian  (was: Joseph K. Bradley)

> Python API for ml.stat.ChiSquareTest
> 
>
> Key: SPARK-20040
> URL: https://issues.apache.org/jira/browse/SPARK-20040
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Bago Amirbekian
>
> Add PySpark wrapper for ChiSquareTest.  Note that it's currently called 
> ChiSquare, but I'm about to rename it to ChiSquareTest in [SPARK-20039]






[jira] [Comment Edited] (SPARK-20093) Exception when Joining dataframe with another dataframe generated by applying groupBy transformation on original one

2017-03-28 Thread Yong Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946375#comment-15946375
 ] 

Yong Zhang edited comment on SPARK-20093 at 3/29/17 1:34 AM:
-

This problem exists. It looks like if you switch the order of the join, then it 
works fine.

scala> spark.version
res16: String = 2.1.0

scala> groupDF.join(df, groupDF("height") === df("height")).show
org.apache.spark.sql.AnalysisException: resolved attribute(s) height#8 missing 
from height#181,gender#7,height#338,weight#339,gender#337 in operator !Join 
Inner, (height#181 = height#8);;
!Join Inner, (height#181 = height#8)
:- Aggregate [gender#7], [gender#7, min(height#8) AS height#181]
:  +- Project [_1#3 AS gender#7, _2#4 AS height#8, _3#5 AS weight#9]
: +- LocalRelation [_1#3, _2#4, _3#5]
+- Project [_1#3 AS gender#337, _2#4 AS height#338, _3#5 AS weight#339]
   +- LocalRelation [_1#3, _2#4, _3#5]

  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:337)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:830)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:796)
  ... 48 elided

scala> df.join(groupDF, groupDF("height") === df("height")).show
+--+--+--+--+--+
|gender|height|weight|gender|height|
+--+--+--+--+--+
| M|   160|55| M|   160|
| F|   150|53| F|   150|
+--+--+--+--+--+


was (Author: java8964):
This problem exists. It looks like if switch the order of join, then it works 
fine.

scala> spark.version
res16: String = 2.1.0

scala> groupDF.join(df, groupDF("height") === df("height")).show
org.apache.spark.sql.AnalysisException: resolved attribute(s) height#8 missing 
from height#181,gender#7,height#338,weight#339,gender#337 in operator !Join 
Inner, (height#181 = height#8);;
!Join Inner, (height#181 = height#8)
:- Aggregate [gender#7], [gender#7, min(height#8) AS height#181]
:  +- Project [_1#3 AS gender#7, _2#4 AS height#8, _3#5 AS weight#9]
: +- LocalRelation [_1#3, _2#4, _3#5]
+- Project [_1#3 AS gender#337, _2#4 AS height#338, _3#5 AS weight#339]
   +- LocalRelation [_1#3, _2#4, _3#5]

  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:337)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:830)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:796)
  ... 48 elided

scala> df.join(groupDF, groupDF("height") === df("height")).show
+--+--+--+--+--+
|gender|height|weight|gender|height|
+--+--+--+--+--+
| M|   160|55| M|   160|
| F|   150|53| F|   150|
+--+--+--+--+--+

> Exception when Joining dataframe with another dataframe generated by applying 
> groupBy transformation on original one
> 
>
> Key: SPARK-20093
> URL: https://issues.apache.org/jira/browse/SPARK-20093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.2.0
>Reporter: Hosur Narahari
>
> When we generate a 

[jira] [Commented] (SPARK-20093) Exception when Joining dataframe with another dataframe generated by applying groupBy transformation on original one

2017-03-28 Thread Yong Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946375#comment-15946375
 ] 

Yong Zhang commented on SPARK-20093:


This problem exists. It looks like switching the order of the join makes it 
work fine.

scala> spark.version
res16: String = 2.1.0

scala> groupDF.join(df, groupDF("height") === df("height")).show
org.apache.spark.sql.AnalysisException: resolved attribute(s) height#8 missing 
from height#181,gender#7,height#338,weight#339,gender#337 in operator !Join 
Inner, (height#181 = height#8);;
!Join Inner, (height#181 = height#8)
:- Aggregate [gender#7], [gender#7, min(height#8) AS height#181]
:  +- Project [_1#3 AS gender#7, _2#4 AS height#8, _3#5 AS weight#9]
: +- LocalRelation [_1#3, _2#4, _3#5]
+- Project [_1#3 AS gender#337, _2#4 AS height#338, _3#5 AS weight#339]
   +- LocalRelation [_1#3, _2#4, _3#5]

  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:337)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:830)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:796)
  ... 48 elided

scala> df.join(groupDF, groupDF("height") === df("height")).show
+--+--+--+--+--+
|gender|height|weight|gender|height|
+--+--+--+--+--+
| M|   160|55| M|   160|
| F|   150|53| F|   150|
+--+--+--+--+--+
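
Besides flipping the join order, the usual workaround for this kind of self-join 
ambiguity is to resolve the join condition by name through aliases rather than 
through the captured Column references; a sketch (not taken from the ticket):

{code}
import org.apache.spark.sql.functions.{col, min}

val df = spark.createDataFrame(Seq(("M", 172, 60), ("M", 170, 60), ("F", 155, 56),
  ("M", 160, 55), ("F", 150, 53))).toDF("gender", "height", "weight")
val groupDF = df.groupBy("gender").agg(min("height").as("height"))

// alias both sides and refer to columns by qualified name, so the analyzer
// resolves them against the join children instead of stale attribute ids
val out = groupDF.alias("g")
  .join(df.alias("d"), col("g.height") === col("d.height"))
  .select(col("d.gender"), col("d.height"), col("d.weight"))
out.show()
{code}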

> Exception when Joining dataframe with another dataframe generated by applying 
> groupBy transformation on original one
> 
>
> Key: SPARK-20093
> URL: https://issues.apache.org/jira/browse/SPARK-20093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.2.0
>Reporter: Hosur Narahari
>
> When we generate a dataframe by grouping, and then join the original dataframe 
> on the aggregated column, we get an AnalysisException. Below I've attached a 
> piece of code and the resulting exception to reproduce.
> Code:
> import org.apache.spark.sql.SparkSession
> object App {
>   lazy val spark = 
> SparkSession.builder.appName("Test").master("local").getOrCreate
>   def main(args: Array[String]): Unit = {
> test1
>   }
>   private def test1 {
> import org.apache.spark.sql.functions._
> val df = spark.createDataFrame(Seq(("M",172,60), ("M", 170, 60), ("F", 
> 155, 56), ("M", 160, 55), ("F", 150, 53))).toDF("gender", "height", "weight")
> val groupDF = df.groupBy("gender").agg(min("height").as("height"))
> groupDF.show()
> val out = groupDF.join(df, groupDF("height") <=> 
> df("height")).select(df("gender"), df("height"), df("weight"))
> out.show
>   }
> }
> When I ran above code, I got below exception:
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) height#8 missing from 
> height#19,height#30,gender#29,weight#31,gender#7 in operator !Join Inner, 
> (height#19 <=> height#8);;
> !Join Inner, (height#19 <=> height#8)
> :- Aggregate [gender#7], [gender#7, min(height#8) AS height#19]
> :  +- Project [_1#0 AS gender#7, _2#1 AS height#8, _3#2 AS weight#9]
> : +- LocalRelation [_1#0, _2#1, _3#2]
> +- Project [_1#0 AS gender#29, _2#1 AS height#30, _3#2 AS weight#31]
>+- LocalRelation [_1#0, _2#1, _3#2]
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:90)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:342)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
>   at 
> 

[jira] [Commented] (SPARK-16288) Implement inline table generating function

2017-03-28 Thread Guilherme Braccialli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946176#comment-15946176
 ] 

Guilherme Braccialli commented on SPARK-16288:
--

Is it possible to call this function directly from df.select(inline(field))? I 
could only make it work using df.selectExpr("inline(field)").
Thanks.
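
For what it's worth, in this version `inline` has no dedicated helper in 
`org.apache.spark.sql.functions`, so the usual options are `selectExpr` or 
wrapping the SQL expression with `expr`; a quick sketch (the DataFrame `df` and 
the column name `arrayOfStructs` are made up):

{code}
import org.apache.spark.sql.functions.expr

// SQL string form
df.selectExpr("inline(arrayOfStructs)")

// roughly equivalent while staying in the DataFrame API
df.select(expr("inline(arrayOfStructs)"))
{code}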

> Implement inline table generating function
> --
>
> Key: SPARK-16288
> URL: https://issues.apache.org/jira/browse/SPARK-16288
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20043) Decision Tree loader does not handle uppercase impurity param values

2017-03-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-20043.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 17407
[https://github.com/apache/spark/pull/17407]

> Decision Tree loader does not handle uppercase impurity param values
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Zied Sellami
>Assignee: Yan Facai (颜发才)
>  Labels: starter
> Fix For: 2.1.1, 2.2.0
>
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> ParamGrid to test "gini" and "entropy" impurity. CrossValidatorModel is not 
> able to load the saved model when the impurity values are not written in 
> lowercase. I obtain an error from Spark: "impurity Gini (Entropy) not recognized".
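
Until a release with the fix, a minimal sketch of the kind of grid that sidesteps 
the problem by keeping the impurity values lowercase (the estimator `dt` here is 
illustrative):

{code}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.tuning.ParamGridBuilder

val dt = new DecisionTreeClassifier()

// lowercase values round-trip through save/load without the
// "impurity Gini (Entropy) not recognized" error described above
val paramGrid = new ParamGridBuilder()
  .addGrid(dt.impurity, Array("gini", "entropy"))
  .build()
{code}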



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20043) Decision Tree loader does not handle uppercase impurity param values

2017-03-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-20043:
-

Assignee: Yan Facai (颜发才)

> Decision Tree loader does not handle uppercase impurity param values
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Zied Sellami
>Assignee: Yan Facai (颜发才)
>  Labels: starter
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> ParamGrid to test "gini" and "entropy" impurity. CrossValidatorModel is not 
> able to load the saved model when the impurity values are not written in 
> lowercase. I obtain an error from Spark: "impurity Gini (Entropy) not recognized".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20132) Add documentation for column string functions

2017-03-28 Thread Michael Patterson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946130#comment-15946130
 ] 

Michael Patterson commented on SPARK-20132:
---

I have a commit with the documentation: 
https://github.com/map222/spark/commit/ac91b654555f9a07021222f2f1a162634d81be5b

I will make a more formal PR tonight.

> Add documentation for column string functions
> -
>
> Key: SPARK-20132
> URL: https://issues.apache.org/jira/browse/SPARK-20132
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>  Labels: documentation, newbie
>
> Four Column string functions do not have documentation for PySpark:
> rlike
> like
> startswith
> endswith
> These functions are called through the _bin_op interface, which allows the 
> passing of a docstring. I have added docstrings with examples to each of the 
> four functions.
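
For reference, a short sketch of what the four functions do, shown here with the 
equivalent Scala Column operations (the DataFrame `df` and its `name` column are 
made up):

{code}
import org.apache.spark.sql.functions.col

// SQL LIKE pattern with % and _ wildcards
df.filter(col("name").like("Al%"))

// Java regular expression match
df.filter(col("name").rlike("^Al.*e$"))

// simple prefix / suffix tests
df.filter(col("name").startsWith("Al"))
df.filter(col("name").endsWith("ce"))
{code}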



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20132) Add documentation for column string functions

2017-03-28 Thread Michael Patterson (JIRA)
Michael Patterson created SPARK-20132:
-

 Summary: Add documentation for column string functions
 Key: SPARK-20132
 URL: https://issues.apache.org/jira/browse/SPARK-20132
 Project: Spark
  Issue Type: Documentation
  Components: PySpark, SQL
Affects Versions: 2.1.0
Reporter: Michael Patterson
Priority: Minor


Four Column string functions do not have documentation for PySpark:
rlike
like
startswith
endswith

These functions are called through the _bin_op interface, which allows the 
passing of a docstring. I have added docstrings with examples to each of the 
four functions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown

2017-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20050:


Assignee: (was: Apache Spark)

> Kafka 0.10 DirectStream doesn't commit last processed batch's offset when 
> graceful shutdown
> ---
>
> Key: SPARK-20050
> URL: https://issues.apache.org/jira/browse/SPARK-20050
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Sasaki Toru
>
> I use Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' and 
> call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as 
> below
> {code}
> val kafkaStream = KafkaUtils.createDirectStream[String, String](...)
> kafkaStream.map { input =>
>   "key: " + input.key.toString + " value: " + input.value.toString + " 
> offset: " + input.offset.toString
>   }.foreachRDD { rdd =>
> rdd.foreach { input =>
> println(input)
>   }
> }
> kafkaStream.foreachRDD { rdd =>
>   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>   kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
> }
> {code}
> Some records which were processed in the last batch before a graceful 
> shutdown are reprocessed in the first batch after Spark Streaming restarts, as 
> below
> * output first run of this application
> {code}
> key: null value: 1 offset: 101452472
> key: null value: 2 offset: 101452473
> key: null value: 3 offset: 101452474
> key: null value: 4 offset: 101452475
> key: null value: 5 offset: 101452476
> key: null value: 6 offset: 101452477
> key: null value: 7 offset: 101452478
> key: null value: 8 offset: 101452479
> key: null value: 9 offset: 101452480  // this is a last record before 
> shutdown Spark Streaming gracefully
> {code}
> * output re-run of this application
> {code}
> key: null value: 7 offset: 101452478   // duplication
> key: null value: 8 offset: 101452479   // duplication
> key: null value: 9 offset: 101452480   // duplication
> key: null value: 10 offset: 101452481
> {code}
> This may be because offsets specified in commitAsync are only committed at the 
> head of the next batch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown

2017-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20050:


Assignee: Apache Spark

> Kafka 0.10 DirectStream doesn't commit last processed batch's offset when 
> graceful shutdown
> ---
>
> Key: SPARK-20050
> URL: https://issues.apache.org/jira/browse/SPARK-20050
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Sasaki Toru
>Assignee: Apache Spark
>
> I use Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' and 
> call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as 
> below
> {code}
> val kafkaStream = KafkaUtils.createDirectStream[String, String](...)
> kafkaStream.map { input =>
>   "key: " + input.key.toString + " value: " + input.value.toString + " 
> offset: " + input.offset.toString
>   }.foreachRDD { rdd =>
> rdd.foreach { input =>
> println(input)
>   }
> }
> kafkaStream.foreachRDD { rdd =>
>   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>   kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
> }
> {code}
> Some records which were processed in the last batch before a graceful 
> shutdown are reprocessed in the first batch after Spark Streaming restarts, as 
> below
> * output first run of this application
> {code}
> key: null value: 1 offset: 101452472
> key: null value: 2 offset: 101452473
> key: null value: 3 offset: 101452474
> key: null value: 4 offset: 101452475
> key: null value: 5 offset: 101452476
> key: null value: 6 offset: 101452477
> key: null value: 7 offset: 101452478
> key: null value: 8 offset: 101452479
> key: null value: 9 offset: 101452480  // this is a last record before 
> shutdown Spark Streaming gracefully
> {code}
> * output re-run of this application
> {code}
> key: null value: 7 offset: 101452478   // duplication
> key: null value: 8 offset: 101452479   // duplication
> key: null value: 9 offset: 101452480   // duplication
> key: null value: 10 offset: 101452481
> {code}
> This may be because offsets specified in commitAsync are only committed at the 
> head of the next batch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown

2017-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946098#comment-15946098
 ] 

Apache Spark commented on SPARK-20050:
--

User 'sasakitoa' has created a pull request for this issue:
https://github.com/apache/spark/pull/17462

> Kafka 0.10 DirectStream doesn't commit last processed batch's offset when 
> graceful shutdown
> ---
>
> Key: SPARK-20050
> URL: https://issues.apache.org/jira/browse/SPARK-20050
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Sasaki Toru
>
> I use Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' and 
> call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as 
> below
> {code}
> val kafkaStream = KafkaUtils.createDirectStream[String, String](...)
> kafkaStream.map { input =>
>   "key: " + input.key.toString + " value: " + input.value.toString + " 
> offset: " + input.offset.toString
>   }.foreachRDD { rdd =>
> rdd.foreach { input =>
> println(input)
>   }
> }
> kafkaStream.foreachRDD { rdd =>
>   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>   kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
> }
> {code}
> Some records which were processed in the last batch before a graceful 
> shutdown are reprocessed in the first batch after Spark Streaming restarts, as 
> below
> * output first run of this application
> {code}
> key: null value: 1 offset: 101452472
> key: null value: 2 offset: 101452473
> key: null value: 3 offset: 101452474
> key: null value: 4 offset: 101452475
> key: null value: 5 offset: 101452476
> key: null value: 6 offset: 101452477
> key: null value: 7 offset: 101452478
> key: null value: 8 offset: 101452479
> key: null value: 9 offset: 101452480  // this is a last record before 
> shutdown Spark Streaming gracefully
> {code}
> * output re-run of this application
> {code}
> key: null value: 7 offset: 101452478   // duplication
> key: null value: 8 offset: 101452479   // duplication
> key: null value: 9 offset: 101452480   // duplication
> key: null value: 10 offset: 101452481
> {code}
> This may be because offsets specified in commitAsync are only committed at the 
> head of the next batch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20125) Dataset of type option of map does not work

2017-03-28 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20125.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> Dataset of type option of map does not work
> ---
>
> Key: SPARK-20125
> URL: https://issues.apache.org/jira/browse/SPARK-20125
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.1, 2.2.0
>
>
> A simple reproduce:
> {code}
> case class ABC(m: Option[scala.collection.Map[Int, Int]])
> val ds = Seq(ABC(Some(Map(1 -> 1)))).toDS()
> ds.collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)

2017-03-28 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-14536:

Fix Version/s: 2.1.1

> NPE in JDBCRDD when array column contains nulls (postgresql)
> 
>
> Key: SPARK-14536
> URL: https://issues.apache.org/jira/browse/SPARK-14536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jeremy Smith
>Assignee: Suresh Thalamati
>  Labels: NullPointerException
> Fix For: 2.1.1, 2.2.0
>
>
> At 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L453
>  it is assumed that the JDBC driver will definitely return a non-null `Array` 
> object from the call to `getArray`, and that in the event of a null array it 
> will return a non-null `Array` object with a null underlying array.  But as 
> you can see here 
> https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSet.java#L387
>  that isn't the case, at least for PostgreSQL.  This causes a 
> `NullPointerException` whenever an array column contains null values. It 
> seems like the PostgreSQL JDBC driver is probably doing the wrong thing, but 
> even so there should be a null check in JDBCRDD.  I'm happy to submit a PR if 
> that would be helpful.
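
For illustration only, a rough sketch of the kind of null guard being asked for 
here (not the actual JDBCRDD code; `rs` and `pos` are placeholders):

{code}
import java.sql.ResultSet

// None when either the java.sql.Array wrapper or its underlying array is null,
// instead of blindly calling .getArray and hitting a NullPointerException
def safeGetArray(rs: ResultSet, pos: Int): Option[AnyRef] =
  Option(rs.getArray(pos)).flatMap(a => Option(a.getArray))
{code}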



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point

2017-03-28 Thread Mathieu D (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945892#comment-15945892
 ] 

Mathieu D edited comment on SPARK-20082 at 3/28/17 8:39 PM:


[~yuhaoyan] would you mind having a look at this PR? Right now, I added an 
initialModel, supported only by the Online optimizer.

The implementation is inspired by the KMeans one. The initialModel is used as 
a replacement for the initial randomized matrix.

Regarding the EM optimizer, in the same way, we could use an existing model 
instead of a randomly weighted graph, by adding new doc vertices and new 
doc->term edges to the existing graph. But it's unclear to me how the new doc 
vertices should be weighted when added. Right now, for a new model, doc and 
term vertices are weighted randomly, with the same total weight on docs and 
terms. If I add new docs to an existing graph, how should the weights on this 
side be initialized?



was (Author: mathieude):
[~yuhaoyan] would you mind having a look to this PR ? Right now, I added an 
initialModel, suported only by the Online optimizer.

Regarding the EM optimizer, I could add new doc vertices and new doc->term 
edges to the existing graph. But it's unclear for me how the new doc vertices 
should be weighted when added. Right now for a new model, docs and terms 
vertices are weighted randomly, with the same total weight on docs and terms. 
If I add new docs to an existing graph, how to initialize the weights on this 
side ?

> Incremental update of LDA model, by adding initialModel as start point
> --
>
> Key: SPARK-20082
> URL: https://issues.apache.org/jira/browse/SPARK-20082
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Mathieu D
>
> Some mllib models support an initialModel to start from and update it 
> incrementally with new data.
> From what I understand of OnlineLDAOptimizer, it is possible to incrementally 
> update an existing model with batches of new documents.
> I suggest adding an initialModel as a starting point for LDA.
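
For comparison, the existing precedent in spark.mllib is KMeans' setInitialModel; 
roughly (paths and parameters here are illustrative):

{code}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// previously trained model used as the starting point (path is illustrative)
val oldModel = KMeansModel.load(sc, "/tmp/kmeans-model")

val newData = sc.textFile("/tmp/new-batch")  // path is illustrative
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))

// continue training from the old centroids instead of a random initialization
val updated = new KMeans()
  .setK(oldModel.k)
  .setInitialModel(oldModel)
  .run(newData)
{code}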



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point

2017-03-28 Thread Mathieu D (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945892#comment-15945892
 ] 

Mathieu D commented on SPARK-20082:
---

[~yuhaoyan] would you mind having a look at this PR? Right now, I added an 
initialModel only for the Online optimizer.

Regarding the EM optimizer, I could add new doc vertices and new doc->term 
edges to the existing graph. But it's unclear to me how the new doc vertices 
should be weighted when added. Right now, for a new model, doc and term 
vertices are weighted randomly, with the same total weight on docs and terms. 
If I add new docs to an existing graph, how should the weights on this side 
be initialized?

> Incremental update of LDA model, by adding initialModel as start point
> --
>
> Key: SPARK-20082
> URL: https://issues.apache.org/jira/browse/SPARK-20082
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Mathieu D
>
> Some mllib models support an initialModel to start from and update it 
> incrementally with new data.
> From what I understand of OnlineLDAOptimizer, it is possible to incrementally 
> update an existing model with batches of new documents.
> I suggest adding an initialModel as a starting point for LDA.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point

2017-03-28 Thread Mathieu D (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945892#comment-15945892
 ] 

Mathieu D edited comment on SPARK-20082 at 3/28/17 8:26 PM:


[~yuhaoyan] would you mind having a look at this PR? Right now, I added an 
initialModel only for the Online optimizer.

Regarding the EM optimizer, I could add new doc vertices and new doc->term 
edges to the existing graph. But it's unclear to me how the new doc vertices 
should be weighted when added. Right now, for a new model, doc and term 
vertices are weighted randomly, with the same total weight on docs and terms. 
If I add new docs to an existing graph, how should the weights on this side 
be initialized?


was (Author: mathieude):
[~yuhaoyan] would you mind having a look to this PR. Right now, I added an 
initialModel only for the Online optimizer.

Regarding the EM optimizer, I could add new doc vertices and new doc->term 
edges to the existing graph. But it's unclear for me how the new doc vertices 
should be weighted when added. Right now for a new model, docs and terms 
vertices are weighted randomly, with the same total weight on docs and terms. 
If I add new docs to an existing graph, how to initialize the weights on this 
side ?

> Incremental update of LDA model, by adding initialModel as start point
> --
>
> Key: SPARK-20082
> URL: https://issues.apache.org/jira/browse/SPARK-20082
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Mathieu D
>
> Some mllib models support an initialModel to start from and update it 
> incrementally with new data.
> From what I understand of OnlineLDAOptimizer, it is possible to incrementally 
> update an existing model with batches of new documents.
> I suggest adding an initialModel as a starting point for LDA.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point

2017-03-28 Thread Mathieu D (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945892#comment-15945892
 ] 

Mathieu D edited comment on SPARK-20082 at 3/28/17 8:27 PM:


[~yuhaoyan] would you mind having a look at this PR? Right now, I added an 
initialModel, supported only by the Online optimizer.

Regarding the EM optimizer, I could add new doc vertices and new doc->term 
edges to the existing graph. But it's unclear to me how the new doc vertices 
should be weighted when added. Right now, for a new model, doc and term 
vertices are weighted randomly, with the same total weight on docs and terms. 
If I add new docs to an existing graph, how should the weights on this side 
be initialized?


was (Author: mathieude):
[~yuhaoyan] would you mind having a look to this PR ? Right now, I added an 
initialModel only for the Online optimizer.

Regarding the EM optimizer, I could add new doc vertices and new doc->term 
edges to the existing graph. But it's unclear for me how the new doc vertices 
should be weighted when added. Right now for a new model, docs and terms 
vertices are weighted randomly, with the same total weight on docs and terms. 
If I add new docs to an existing graph, how to initialize the weights on this 
side ?

> Incremental update of LDA model, by adding initialModel as start point
> --
>
> Key: SPARK-20082
> URL: https://issues.apache.org/jira/browse/SPARK-20082
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Mathieu D
>
> Some mllib models support an initialModel to start from and update it 
> incrementally with new data.
> From what I understand of OnlineLDAOptimizer, it is possible to incrementally 
> update an existing model with batches of new documents.
> I suggest adding an initialModel as a starting point for LDA.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point

2017-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20082:


Assignee: (was: Apache Spark)

> Incremental update of LDA model, by adding initialModel as start point
> --
>
> Key: SPARK-20082
> URL: https://issues.apache.org/jira/browse/SPARK-20082
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Mathieu D
>
> Some mllib models support an initialModel to start from and update it 
> incrementally with new data.
> From what I understand of OnlineLDAOptimizer, it is possible to incrementally 
> update an existing model with batches of new documents.
> I suggest adding an initialModel as a starting point for LDA.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point

2017-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20082:


Assignee: Apache Spark

> Incremental update of LDA model, by adding initialModel as start point
> --
>
> Key: SPARK-20082
> URL: https://issues.apache.org/jira/browse/SPARK-20082
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Mathieu D
>Assignee: Apache Spark
>
> Some mllib models support an initialModel to start from and update it 
> incrementally with new data.
> From what I understand of OnlineLDAOptimizer, it is possible to incrementally 
> update an existing model with batches of new documents.
> I suggest adding an initialModel as a starting point for LDA.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point

2017-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945866#comment-15945866
 ] 

Apache Spark commented on SPARK-20082:
--

User 'mdespriee' has created a pull request for this issue:
https://github.com/apache/spark/pull/17461

> Incremental update of LDA model, by adding initialModel as start point
> --
>
> Key: SPARK-20082
> URL: https://issues.apache.org/jira/browse/SPARK-20082
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Mathieu D
>
> Some mllib models support an initialModel to start from and update it 
> incrementally with new data.
> From what I understand of OnlineLDAOptimizer, it is possible to incrementally 
> update an existing model with batches of new documents.
> I suggest adding an initialModel as a starting point for LDA.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16929) Speculation-related synchronization bottleneck in checkSpeculatableTasks

2017-03-28 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-16929:
-
Issue Type: Improvement  (was: Bug)

> Speculation-related synchronization bottleneck in checkSpeculatableTasks
> 
>
> Key: SPARK-16929
> URL: https://issues.apache.org/jira/browse/SPARK-16929
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Nicholas Brown
>Assignee: jin xing
> Fix For: 2.2.0
>
>
> Our cluster has been running slowly since I got speculation working. I looked 
> into it and noticed that stderr was saying some tasks were taking almost an 
> hour to run, even though the application logs on the nodes showed that the 
> task only took a minute or so to run.  Digging into the thread dump for the 
> master node, I noticed a number of threads are blocked, apparently by the 
> speculation thread.  
> At line 476 of TaskSchedulerImpl it grabs a lock on the TaskScheduler while 
> it looks through the tasks to see what needs to be rerun.  Unfortunately that 
> code loops through each of the tasks, so when you have even just a couple 
> hundred thousand tasks to run that can be prohibitively slow to run inside of 
> a synchronized block.  Once I disabled speculation, the job went back to 
> having acceptable performance.
> There are no comments around that lock indicating why it was added, and the 
> git history seems to have a couple refactorings so its hard to find where it 
> was added.  I'm tempted to believe it is the result of someone assuming that 
> an extra synchronized block never hurt anyone (in reality I've probably seen 
> just as many bugs caused by over-synchronization as by too little) as it looks too 
> broad to be actually guarding any potential concurrency issue.  But, since 
> concurrency issues can be tricky to reproduce (and yes, I understand that's 
> an extreme understatement) I'm not sure just blindly removing it without 
> being familiar with the history is necessarily safe.  
> Can someone look into this?  Or at least make a note in the documentation 
> that speculation should not be used with large clusters?
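
For anyone hitting this, the short-term mitigation mentioned above is simply 
leaving speculation off (its default); a sketch of making that explicit when 
building the context (the app name is illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// speculation is off by default; this just makes the choice explicit
val conf = new SparkConf()
  .setAppName("my-job")
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)
{code}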



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19868) conflict TasksetManager lead to spark stopped

2017-03-28 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout reassigned SPARK-19868:
--

Assignee: liujianhui

> conflict TasksetManager lead to spark stopped
> -
>
> Key: SPARK-19868
> URL: https://issues.apache.org/jira/browse/SPARK-19868
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: liujianhui
>Assignee: liujianhui
> Fix For: 2.2.0
>
>
> ##scenario
>  A conflicting TaskSetManager throws an exception, which leads to the 
> SparkContext being stopped. Log as follows: 
> {code}
> java.lang.IllegalStateException: more than one active taskSet for stage 
> 4571114: 4571114.2,4571114.1
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> The reason for that is that the resubmitting of the stage conflicts with the 
> running stage; the missing tasks of the stage should be resubmitted since the 
> zombie flag of the TaskSetManager is assigned true
> {code}
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Resubmitting
>  ShuffleMapStage 4571114 (map at MainApp.scala:73) because some of its tasks 
> had failed: 0
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Submitting
>  ShuffleMapStage 4571114 (MapPartitionsRDD[3719544] at map at 
> MainApp.scala:73), which has no missing parents
> {code}
> the executor which the shuffle task ran on was lost
> {code}
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:27.427][org.apache.spark.scheduler.DAGScheduler]Ignoring
>  possibly bogus ShuffleMapTask(4571114, 0) completion from executor 4
> {code}
> The times at which the task set finished and the stage was resubmitted:
> {code}
> handleSuccessfuleTask
> [INFO][task-result-getter-2][2017-03-03+22:16:29.999][org.apache.spark.scheduler.TaskSchedulerImpl]Removed
>  TaskSet 4571114.1, whose tasks have all completed, from pool 
> resubmit stage
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.549][org.apache.spark.scheduler.TaskSchedulerImpl]Adding
>  task set 4571114.2 with 1 tasks
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19868) conflict TasksetManager lead to spark stopped

2017-03-28 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19868.

   Resolution: Fixed
Fix Version/s: 2.2.0

> conflict TasksetManager lead to spark stopped
> -
>
> Key: SPARK-19868
> URL: https://issues.apache.org/jira/browse/SPARK-19868
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: liujianhui
> Fix For: 2.2.0
>
>
> ##scenario
>  A conflicting TaskSetManager throws an exception, which leads to the 
> SparkContext being stopped. Log as follows: 
> {code}
> java.lang.IllegalStateException: more than one active taskSet for stage 
> 4571114: 4571114.2,4571114.1
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> The reason for that is that the resubmitting of the stage conflicts with the 
> running stage; the missing tasks of the stage should be resubmitted since the 
> zombie flag of the TaskSetManager is assigned true
> {code}
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Resubmitting
>  ShuffleMapStage 4571114 (map at MainApp.scala:73) because some of its tasks 
> had failed: 0
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Submitting
>  ShuffleMapStage 4571114 (MapPartitionsRDD[3719544] at map at 
> MainApp.scala:73), which has no missing parents
> {code}
> the executor which the shuffle task ran on was lost
> {code}
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:27.427][org.apache.spark.scheduler.DAGScheduler]Ignoring
>  possibly bogus ShuffleMapTask(4571114, 0) completion from executor 4
> {code}
> The times at which the task set finished and the stage was resubmitted:
> {code}
> handleSuccessfuleTask
> [INFO][task-result-getter-2][2017-03-03+22:16:29.999][org.apache.spark.scheduler.TaskSchedulerImpl]Removed
>  TaskSet 4571114.1, whose tasks have all completed, from pool 
> resubmit stage
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.549][org.apache.spark.scheduler.TaskSchedulerImpl]Adding
>  task set 4571114.2 with 1 tasks
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20131) Flaky Test: org.apache.spark.streaming.StreamingContextSuite

2017-03-28 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-20131:
-

 Summary: Flaky Test: 
org.apache.spark.streaming.StreamingContextSuite
 Key: SPARK-20131
 URL: https://issues.apache.org/jira/browse/SPARK-20131
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.2.0
Reporter: Takuya Ueshin
Priority: Minor


This test failed recently here:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/2861/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/SPARK_18560_Receiver_data_should_be_deserialized_properly_/

Dashboard
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.StreamingContextSuite_name=SPARK-18560+Receiver+data+should+be+deserialized+properly.

Error Message
{code}
latch.await(60L, SECONDS) was false
{code}
{code}
org.scalatest.exceptions.TestFailedException: latch.await(60L, SECONDS) was 
false
  at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
  at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
  at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
  at 
org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply$mcV$sp(StreamingContextSuite.scala:837)
  at 
org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810)
  at 
org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810)
  at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
  at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
  at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
  at 
org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingContextSuite.scala:44)
  at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
  at 
org.apache.spark.streaming.StreamingContextSuite.runTest(StreamingContextSuite.scala:44)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
  at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
  at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
  at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
  at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
  at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
  at org.scalatest.Suite$class.run(Suite.scala:1424)
  at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
  at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
  at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
  at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
  at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
  at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
  at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
  at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
  at 
org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$run(StreamingContextSuite.scala:44)
  at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
  at 
org.apache.spark.streaming.StreamingContextSuite.run(StreamingContextSuite.scala:44)
  at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
  at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)
  at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1526)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at 

[jira] [Commented] (SPARK-19551) Theme for PySpark documenation could do with improving

2017-03-28 Thread Arthur Tacca (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945685#comment-15945685
 ] 

Arthur Tacca commented on SPARK-19551:
--

Thanks, I needed the reminder! In fact the person that generated their own 
build of the docs got back to me; I hope they don't mind me pasting what they 
said here:


I've compiled the documentation using Sphinx (ver 1.3.5). I have a foggy 
memory of this as it's been a while, but I recall I had to roll back to an older 
version of Sphinx to get copybutton.js to work properly - this is what allows 
us to toggle the ``>>>`` mark in the Python code 

- (example) 
http://takwatanabe.me/pyspark/generated/generated/mllib.classification.DenseVector.html#mllib.classification.DenseVector
- (js file) 
https://github.com/scipy/scipy-sphinx-theme/blob/master/_theme/scipy/static/js/copybutton.js

Otherwise, I simply just used the ``autosummary`` directive offered by Sphinx 
(http://www.sphinx-doc.org/en/stable/ext/autosummary.html). You can see how I 
used these in the *.rst files in 
https://github.com/wtak23/pyspark/tree/master/source.
For instance, in order to create the **entire** html subtrees from the link 
http://takwatanabe.me/pyspark/pyspark.ml.html , all I had to have was this rst 
file: 
https://raw.githubusercontent.com/wtak23/pyspark/master/source/pyspark.ml.rst

Once you have the PySpark directory included in your $PYTHONPATH envvar, you should 
be able to simply run ``make html`` using the Makefile from the GitHub branch 
below.

https://github.com/wtak23/pyspark/tree/master 



> Theme for PySpark documenation could do with improving
> --
>
> Key: SPARK-19551
> URL: https://issues.apache.org/jira/browse/SPARK-19551
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 2.1.0
>Reporter: Arthur Tacca
>Priority: Minor
>
> I have found the Python Spark documentation hard to navigate for two reasons:
> * Each page in the documentation is huge, because the whole of the 
> documentation is split up into only a few chunks.
> * The methods for each class are not listed in a short form, so the only way 
> to look through them is to browse past the full documentation for all methods 
> (including parameter lists, examples, etc.).
> This has irritated someone enough that they have done [their own build of the 
> pyspark documentation|http://takwatanabe.me/pyspark/index.html]. In 
> comparison to the official docs they are a delight to use. But of course it 
> is not clear whether they'll be kept up to date, which is why I'm asking here 
> that the official docs are improved. Perhaps that site could be used as 
> inspiration? I don't know much about these things, but it appears that the 
> main change they have made is to switch to the "read the docs" theme.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)

2017-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945673#comment-15945673
 ] 

Apache Spark commented on SPARK-14536:
--

User 'sureshthalamati' has created a pull request for this issue:
https://github.com/apache/spark/pull/17460

> NPE in JDBCRDD when array column contains nulls (postgresql)
> 
>
> Key: SPARK-14536
> URL: https://issues.apache.org/jira/browse/SPARK-14536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jeremy Smith
>Assignee: Suresh Thalamati
>  Labels: NullPointerException
> Fix For: 2.2.0
>
>
> At 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L453
>  it is assumed that the JDBC driver will definitely return a non-null `Array` 
> object from the call to `getArray`, and that in the event of a null array it 
> will return a non-null `Array` object with a null underlying array.  But as 
> you can see here 
> https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSet.java#L387
>  that isn't the case, at least for PostgreSQL.  This causes a 
> `NullPointerException` whenever an array column contains null values. It 
> seems like the PostgreSQL JDBC driver is probably doing the wrong thing, but 
> even so there should be a null check in JDBCRDD.  I'm happy to submit a PR if 
> that would be helpful.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20130) Flaky test: BlockManagerProactiveReplicationSuite

2017-03-28 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-20130:
--

 Summary: Flaky test: BlockManagerProactiveReplicationSuite
 Key: SPARK-20130
 URL: https://issues.apache.org/jira/browse/SPARK-20130
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.2.0
Reporter: Marcelo Vanzin


See following page:
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.storage.BlockManagerProactiveReplicationSuite_name=proactive+block+replication+-+5+replicas+-+4+block+manager+deletions

I also have seen it fail intermittently during local unit test runs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19995) Using real user to connect HiveMetastore in HiveClientImpl

2017-03-28 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19995.

   Resolution: Fixed
 Assignee: Saisai Shao
Fix Version/s: 2.2.0
   2.1.1

> Using real user to connect HiveMetastore in HiveClientImpl
> --
>
> Key: SPARK-19995
> URL: https://issues.apache.org/jira/browse/SPARK-19995
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Fix For: 2.1.1, 2.2.0
>
>
> If the user specifies "--proxy-user" in a kerberized environment with the Hive 
> catalog implementation, HiveClientImpl will try to connect to the Hive metastore 
> as the current user. Since we use the real user to do kinit, this makes the 
> connection fail. We should change this, as we did before in the YARN code, to 
> use the real user.
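
A rough sketch of the real-user pattern being suggested (purely illustrative, not 
the actual patch; `hiveConf` is assumed to exist):

{code}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient
import org.apache.hadoop.security.UserGroupInformation

// when running as a proxy user, getRealUser holds the kinit'ed identity
val ugi = UserGroupInformation.getCurrentUser
val realUser = Option(ugi.getRealUser).getOrElse(ugi)

// open the metastore connection as the real (Kerberos-authenticated) user
val client = realUser.doAs(new PrivilegedExceptionAction[HiveMetaStoreClient] {
  override def run(): HiveMetaStoreClient = new HiveMetaStoreClient(hiveConf)  // hiveConf assumed
})
{code}
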
> {noformat}
> ERROR TSaslTransport: SASL negotiation failure
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
>   at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:236)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
>   at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:166)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:188)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
>   at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:366)
>   at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:270)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.<init>(HiveExternalCatalog.scala:65)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
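
A rough sketch of the idea described above (my assumption of the intent, not 
the merged patch): run the metastore connection as the real (kinit'ed) user 
when the current user is a proxy user.

{code}
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.security.UserGroupInformation

// Run `body` as the real user behind a proxy user, falling back to the
// current user when no proxy user is involved.
def withRealUser[T](body: => T): T = {
  val currentUser = UserGroupInformation.getCurrentUser
  val realUser = Option(currentUser.getRealUser).getOrElse(currentUser)
  realUser.doAs(new PrivilegedExceptionAction[T] {
    override def run(): T = body
  })
}
{code}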
> 

[jira] [Commented] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-03-28 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1594#comment-1594
 ] 

Kazuaki Ishizaki commented on SPARK-20112:
--

[~MasterDDT] Thank you for preparing the additional information. The size of 
the hashed relation does not seem to be very large.

In these two cases, I cannot correlate the load instructions that caused the 
SIGSEGV with specific Java statements.

> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log, 
> hs_err_pid22870.log
>
>
> I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
> hs_err_pid and codegen file are attached (with query plans). It's not a 
> deterministic repro, but running a big query load, I eventually see it come 
> up within a few minutes.
> Here is some interesting repro information:
> - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't 
> repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I 
> think that means it's not an issue with the code-gen, but I can't figure out 
> what the difference in behavior is.
> - The broadcast joins in the plan are all small tables. I have 
> autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
> - As you can see from the plan, all the sources are cached memory tables. And 
> we partition/sort them all beforehand so it's always sort-merge joins or 
> broadcast joins (with small tables).
> {noformat}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  [thread 139872345896704 also had an error]
> SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
> linux-amd64 compressed oops)
> [thread 139872348002048 also had an error]# Problematic frame:
> # 
> J 28454 C1 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
>  (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
> {noformat}
> This kind of looks like https://issues.apache.org/jira/browse/SPARK-15822, 
> but that is marked fix in 2.0.0



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20129) JavaSparkContext should use SparkContext.getOrCreate

2017-03-28 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-20129:
-

 Summary: JavaSparkContext should use SparkContext.getOrCreate
 Key: SPARK-20129
 URL: https://issues.apache.org/jira/browse/SPARK-20129
 Project: Spark
  Issue Type: Improvement
  Components: Java API
Affects Versions: 2.1.0
Reporter: Xiangrui Meng


It should re-use an existing SparkContext if there is a live one.
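
A small sketch of the proposed behaviour (my reading of the intent, not the 
merged change): construct the JavaSparkContext on top of 
SparkContext.getOrCreate so that a live context is reused.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.api.java.JavaSparkContext

// Reuse an existing SparkContext if one is already running; otherwise
// create a new one from this configuration.
val conf = new SparkConf().setAppName("example").setMaster("local[*]")
val jsc = new JavaSparkContext(SparkContext.getOrCreate(conf))
{code}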



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20129) JavaSparkContext should use SparkContext.getOrCreate

2017-03-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-20129:
-

Assignee: Xiangrui Meng

> JavaSparkContext should use SparkContext.getOrCreate
> 
>
> Key: SPARK-20129
> URL: https://issues.apache.org/jira/browse/SPARK-20129
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> It should re-use an existing SparkContext if there is a live one.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-03-28 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945399#comment-15945399
 ] 

Mitesh edited comment on SPARK-20112 at 3/28/17 3:46 PM:
-

[~kiszk] I can try out Spark 2.0.3+ or 2.1. Actually, I disabled whole-stage 
codegen and I still see a failure on 2.0.2, but in a different place now, in 
{{HashJoin.advanceNext}}. I also uploaded the new hs_err_pid22870. The hashed 
relations are around 1-10 MB, but a few are 200 MB. The non-hashed cached 
tables are 1-100 GB.

{noformat}
17/03/27 22:15:59 DEBUG [Executor task launch worker-17] TaskMemoryManager: 
Task 152119 acquired 64.0 KB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395
SIGSEGV17/03/27 22:15:59 DEBUG [Executor task launch worker-17] 
TaskMemoryManager: Task 152119 acquired 64.0 MB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395
[thread 140369911781120 also had an error]
 (0xb) at pc=0x7fad1f7afc11, pid=22870, tid=140369909675776
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 
compressed oops)
# Problematic frame:
# J 25558 C2 
org.apache.spark.sql.execution.joins.HashJoin$$anonfun$outerJoin$1$$anon$1.advanceNext()Z
 (110 bytes) @ 0x7fad1f7afc11 [0x7fad1f7afb20+0xf1]
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /mnt/xvdb/spark/worker_dir/app-20170327213416-0005/14/hs_err_pid22870.log
17/03/27 22:15:59 DEBUG [Executor task launch worker-19] TaskMemoryManager: 
Task 152090 acquired 64.0 MB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@51de5289
 2502.591: [G1Ergonomics (Concurrent Cycles) request concurrent cycle 
initiation, reason: occupancy higher than threshold, occupancy: 7667187712 
bytes, allocation request: 160640 bytes, threshold: 8214124950 bytes (45.00 
%), source: concurrent humongous allocation]
[thread 140376087648000 also had an error]
[thread 140369903376128 also had an error]
#
{noformat}




was (Author: masterddt):
[~kiszk] I can try out spark 2.0.3+ or 2.1. Actually I disabled wholestage 
codegen and I do see a failure still on 2.0.2, but in a different place now in 
{{HashJoin.advanceNext}}. Also uploaded the new hs_err_pid22870. The hashed 
relations are around 1-10MB, but a few are 200MB.


{noformat}
17/03/27 22:15:59 DEBUG [Executor task launch worker-17] TaskMemoryManager: 
Task 152119 acquired 64.0 KB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395
SIGSEGV17/03/27 22:15:59 DEBUG [Executor task launch worker-17] 
TaskMemoryManager: Task 152119 acquired 64.0 MB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395
[thread 140369911781120 also had an error]
 (0xb) at pc=0x7fad1f7afc11, pid=22870, tid=140369909675776
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 
compressed oops)
# Problematic frame:
# J 25558 C2 
org.apache.spark.sql.execution.joins.HashJoin$$anonfun$outerJoin$1$$anon$1.advanceNext()Z
 (110 bytes) @ 0x7fad1f7afc11 [0x7fad1f7afb20+0xf1]
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /mnt/xvdb/spark/worker_dir/app-20170327213416-0005/14/hs_err_pid22870.log
17/03/27 22:15:59 DEBUG [Executor task launch worker-19] TaskMemoryManager: 
Task 152090 acquired 64.0 MB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@51de5289
 2502.591: [G1Ergonomics (Concurrent Cycles) request concurrent cycle 
initiation, reason: occupancy higher than threshold, occupancy: 7667187712 
bytes, allocation request: 160640 bytes, threshold: 8214124950 bytes (45.00 
%), source: concurrent humongous allocation]
[thread 140376087648000 also had an error]
[thread 140369903376128 also had an error]
#
{noformat}



> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log, 
> hs_err_pid22870.log
>
>
> I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
> hs_err_pid and codegen file are attached (with query plans). It's not a 
> 

[jira] [Comment Edited] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-03-28 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945399#comment-15945399
 ] 

Mitesh edited comment on SPARK-20112 at 3/28/17 3:46 PM:
-

[~kiszk] I can try out Spark 2.0.3+ or 2.1. Actually, I disabled whole-stage 
codegen and I still see a failure on 2.0.2, but in a different place now, in 
{{HashJoin.advanceNext}}. I also uploaded the new hs_err_pid22870. The hashed 
relations are around 1-10 MB, but a few are 200 MB.


{noformat}
17/03/27 22:15:59 DEBUG [Executor task launch worker-17] TaskMemoryManager: 
Task 152119 acquired 64.0 KB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395
SIGSEGV17/03/27 22:15:59 DEBUG [Executor task launch worker-17] 
TaskMemoryManager: Task 152119 acquired 64.0 MB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395
[thread 140369911781120 also had an error]
 (0xb) at pc=0x7fad1f7afc11, pid=22870, tid=140369909675776
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 
compressed oops)
# Problematic frame:
# J 25558 C2 
org.apache.spark.sql.execution.joins.HashJoin$$anonfun$outerJoin$1$$anon$1.advanceNext()Z
 (110 bytes) @ 0x7fad1f7afc11 [0x7fad1f7afb20+0xf1]
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /mnt/xvdb/spark/worker_dir/app-20170327213416-0005/14/hs_err_pid22870.log
17/03/27 22:15:59 DEBUG [Executor task launch worker-19] TaskMemoryManager: 
Task 152090 acquired 64.0 MB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@51de5289
 2502.591: [G1Ergonomics (Concurrent Cycles) request concurrent cycle 
initiation, reason: occupancy higher than threshold, occupancy: 7667187712 
bytes, allocation request: 160640 bytes, threshold: 8214124950 bytes (45.00 
%), source: concurrent humongous allocation]
[thread 140376087648000 also had an error]
[thread 140369903376128 also had an error]
#
{noformat}




was (Author: masterddt):
[~kiszk] I can try out spark 2.0.3+ or 2.1. Actually I disabled wholestage 
codegen and I do see a failure still on 2.0.2, but in a different place now in 
{{HashJoin.advanceNext}}. Also uploaded the new hs_err_pid22870. The hashed 
relations are around 1-10M, but a few are 200M.


{noformat}
17/03/27 22:15:59 DEBUG [Executor task launch worker-17] TaskMemoryManager: 
Task 152119 acquired 64.0 KB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395
SIGSEGV17/03/27 22:15:59 DEBUG [Executor task launch worker-17] 
TaskMemoryManager: Task 152119 acquired 64.0 MB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395
[thread 140369911781120 also had an error]
 (0xb) at pc=0x7fad1f7afc11, pid=22870, tid=140369909675776
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 
compressed oops)
# Problematic frame:
# J 25558 C2 
org.apache.spark.sql.execution.joins.HashJoin$$anonfun$outerJoin$1$$anon$1.advanceNext()Z
 (110 bytes) @ 0x7fad1f7afc11 [0x7fad1f7afb20+0xf1]
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /mnt/xvdb/spark/worker_dir/app-20170327213416-0005/14/hs_err_pid22870.log
17/03/27 22:15:59 DEBUG [Executor task launch worker-19] TaskMemoryManager: 
Task 152090 acquired 64.0 MB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@51de5289
 2502.591: [G1Ergonomics (Concurrent Cycles) request concurrent cycle 
initiation, reason: occupancy higher than threshold, occupancy: 7667187712 
bytes, allocation request: 160640 bytes, threshold: 8214124950 bytes (45.00 
%), source: concurrent humongous allocation]
[thread 140376087648000 also had an error]
[thread 140369903376128 also had an error]
#
{noformat}



> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log, 
> hs_err_pid22870.log
>
>
> I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
> hs_err_pid and codegen file are attached (with query plans). It's not a 
> deterministic repro, but running a big query load, I 

[jira] [Comment Edited] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-03-28 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945399#comment-15945399
 ] 

Mitesh edited comment on SPARK-20112 at 3/28/17 3:40 PM:
-

[~kiszk] I can try out Spark 2.0.3+ or 2.1. Actually, I disabled whole-stage 
codegen and I still see a failure on 2.0.2, but in a different place now, in 
{{HashJoin.advanceNext}}. I also uploaded the new hs_err_pid22870. The hashed 
relations are around 1-10 MB, but a few are 200 MB.


{noformat}
17/03/27 22:15:59 DEBUG [Executor task launch worker-17] TaskMemoryManager: 
Task 152119 acquired 64.0 KB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395
SIGSEGV17/03/27 22:15:59 DEBUG [Executor task launch worker-17] 
TaskMemoryManager: Task 152119 acquired 64.0 MB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395
[thread 140369911781120 also had an error]
 (0xb) at pc=0x7fad1f7afc11, pid=22870, tid=140369909675776
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 
compressed oops)
# Problematic frame:
# J 25558 C2 
org.apache.spark.sql.execution.joins.HashJoin$$anonfun$outerJoin$1$$anon$1.advanceNext()Z
 (110 bytes) @ 0x7fad1f7afc11 [0x7fad1f7afb20+0xf1]
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /mnt/xvdb/spark/worker_dir/app-20170327213416-0005/14/hs_err_pid22870.log
17/03/27 22:15:59 DEBUG [Executor task launch worker-19] TaskMemoryManager: 
Task 152090 acquired 64.0 MB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@51de5289
 2502.591: [G1Ergonomics (Concurrent Cycles) request concurrent cycle 
initiation, reason: occupancy higher than threshold, occupancy: 7667187712 
bytes, allocation request: 160640 bytes, threshold: 8214124950 bytes (45.00 
%), source: concurrent humongous allocation]
[thread 140376087648000 also had an error]
[thread 140369903376128 also had an error]
#
{noformat}




was (Author: masterddt):
[~kiszk] I can try out spark 2.0.3+ or 2.1. Actually I disabled wholestage 
codegen and I do see a failure still on 2.0.2, but in a different place now in 
{{HashJoin.advanceNext}}. Also uploaded the new hs_err_pid22870. The hashed 
relations are around 1-10M, but a few are 200M.

{noformat}
17/03/27 22:15:59 DEBUG [Executor task launch worker-17] TaskMemoryManager: 
Task 152119 acquired 64.0 KB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395
SIGSEGV17/03/27 22:15:59 DEBUG [Executor task launch worker-17] 
TaskMemoryManager: Task 152119 acquired 64.0 MB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395
[thread 140369911781120 also had an error]
 (0xb) at pc=0x7fad1f7afc11, pid=22870, tid=140369909675776
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 
compressed oops)
# Problematic frame:
# J 25558 C2 
org.apache.spark.sql.execution.joins.HashJoin$$anonfun$outerJoin$1$$anon$1.advanceNext()Z
 (110 bytes) @ 0x7fad1f7afc11 [0x7fad1f7afb20+0xf1]
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /mnt/xvdb/spark/worker_dir/app-20170327213416-0005/14/hs_err_pid22870.log
17/03/27 22:15:59 DEBUG [Executor task launch worker-19] TaskMemoryManager: 
Task 152090 acquired 64.0 MB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@51de5289
 2502.591: [G1Ergonomics (Concurrent Cycles) request concurrent cycle 
initiation, reason: occupancy higher than threshold, occupancy: 7667187712 
bytes, allocation request: 160640 bytes, threshold: 8214124950 bytes (45.00 
%), source: concurrent humongous allocation]
[thread 140376087648000 also had an error]
[thread 140369903376128 also had an error]
#
{noformat}



> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log, 
> hs_err_pid22870.log
>
>
> I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
> hs_err_pid and codegen file are attached (with query plans). It's not a 
> deterministic repro, but running a big query load, I 

[jira] [Commented] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-03-28 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945399#comment-15945399
 ] 

Mitesh commented on SPARK-20112:


[~kiszk] I can try out Spark 2.0.3+ or 2.1. Actually, I disabled whole-stage 
codegen and I still see a failure on 2.0.2, but in a different place now, in 
{{HashJoin.advanceNext}}. I also uploaded the new hs_err_pid22870. The hashed 
relations are around 1-10 MB, but a few are 200 MB.

{noformat}
17/03/27 22:15:59 DEBUG [Executor task launch worker-17] TaskMemoryManager: 
Task 152119 acquired 64.0 KB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395
SIGSEGV17/03/27 22:15:59 DEBUG [Executor task launch worker-17] 
TaskMemoryManager: Task 152119 acquired 64.0 MB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@2b60e395
[thread 140369911781120 also had an error]
 (0xb) at pc=0x7fad1f7afc11, pid=22870, tid=140369909675776
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 
compressed oops)
# Problematic frame:
# J 25558 C2 
org.apache.spark.sql.execution.joins.HashJoin$$anonfun$outerJoin$1$$anon$1.advanceNext()Z
 (110 bytes) @ 0x7fad1f7afc11 [0x7fad1f7afb20+0xf1]
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /mnt/xvdb/spark/worker_dir/app-20170327213416-0005/14/hs_err_pid22870.log
17/03/27 22:15:59 DEBUG [Executor task launch worker-19] TaskMemoryManager: 
Task 152090 acquired 64.0 MB for 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@51de5289
 2502.591: [G1Ergonomics (Concurrent Cycles) request concurrent cycle 
initiation, reason: occupancy higher than threshold, occupancy: 7667187712 
bytes, allocation request: 160640 bytes, threshold: 8214124950 bytes (45.00 
%), source: concurrent humongous allocation]
[thread 140376087648000 also had an error]
[thread 140369903376128 also had an error]
#
{noformat}
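
For reference, a minimal sketch of how whole-stage codegen can be switched off 
(as mentioned in the comment above), assuming a SparkSession is being built; 
the key is the standard {{spark.sql.codegen.wholeStage}} switch:

{code}
import org.apache.spark.sql.SparkSession

// Build a session with whole-stage codegen disabled, forcing the
// non-generated execution path.
val spark = SparkSession.builder()
  .appName("no-wholestage-codegen")
  .config("spark.sql.codegen.wholeStage", "false")
  .getOrCreate()
{code}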



> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log, 
> hs_err_pid22870.log
>
>
> I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
> hs_err_pid and codegen file are attached (with query plans). It's not a 
> deterministic repro, but running a big query load, I eventually see it come 
> up within a few minutes.
> Here is some interesting repro information:
> - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't 
> repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I 
> think that means it's not an issue with the code-gen, but I can't figure out 
> what the difference in behavior is.
> - The broadcast joins in the plan are all small tables. I have 
> autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
> - As you can see from the plan, all the sources are cached memory tables. And 
> we partition/sort them all beforehand so it's always sort-merge joins or 
> broadcast joins (with small tables).
> {noformat}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  [thread 139872345896704 also had an error]
> SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
> linux-amd64 compressed oops)
> [thread 139872348002048 also had an error]# Problematic frame:
> # 
> J 28454 C1 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
>  (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
> {noformat}
> This kind of looks like https://issues.apache.org/jira/browse/SPARK-15822, 
> but that is marked fix in 2.0.0



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-03-28 Thread Mitesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitesh updated SPARK-20112:
---
Attachment: hs_err_pid22870.log

> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log, 
> hs_err_pid22870.log
>
>
> I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
> hs_err_pid and codegen file are attached (with query plans). It's not a 
> deterministic repro, but running a big query load, I eventually see it come 
> up within a few minutes.
> Here is some interesting repro information:
> - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't 
> repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I 
> think that means it's not an issue with the code-gen, but I can't figure out 
> what the difference in behavior is.
> - The broadcast joins in the plan are all small tables. I have 
> autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
> - As you can see from the plan, all the sources are cached memory tables. And 
> we partition/sort them all beforehand so it's always sort-merge joins or 
> broadcast joins (with small tables).
> {noformat}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  [thread 139872345896704 also had an error]
> SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
> linux-amd64 compressed oops)
> [thread 139872348002048 also had an error]# Problematic frame:
> # 
> J 28454 C1 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
>  (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
> {noformat}
> This kind of looks like https://issues.apache.org/jira/browse/SPARK-15822, 
> but that is marked fix in 2.0.0



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20107) Speed up HadoopMapReduceCommitProtocol#commitJob for many output files

2017-03-28 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945379#comment-15945379
 ] 

Yuming Wang edited comment on SPARK-20107 at 3/28/17 3:37 PM:
--

OK, I will add {{spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version}} 
to the documentation.
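
A minimal example of how a user can opt in per application (sketch only; the 
{{spark.hadoop.*}} prefix forwards the setting to the underlying Hadoop 
configuration, and algorithm v2 trades some failure-case safety for commit 
speed):

{code}
import org.apache.spark.sql.SparkSession

// Opt in to the v2 file output committer algorithm for this application only.
val spark = SparkSession.builder()
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()
{code}

The same setting can also be passed on the command line with 
{{--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2}}.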


was (Author: q79969786):
OK, I will add {{spark.mapreduce.fileoutputcommitter.algorithm.version}} to 
document.

> Speed up HadoopMapReduceCommitProtocol#commitJob for many output files
> --
>
> Key: SPARK-20107
> URL: https://issues.apache.org/jira/browse/SPARK-20107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>
> Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up 
> [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121]
>  for many output files.
> It can speed up {{11 minutes}} for 216869 output files:
> {code:sql}
> CREATE TABLE tmp.spark_20107 AS SELECT
>   category_id,
>   product_id,
>   track_id,
>   concat(
> substr(ds, 3, 2),
> substr(ds, 6, 2),
> substr(ds, 9, 2)
>   ) shortDate,
>   CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' 
> WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE 
> 'invalid actio' END AS type
> FROM
>   tmp.user_action
> WHERE
>   ds > date_sub('2017-01-23', 730)
> AND actiontype IN ('0','1','2','3');
> {code}
> {code}
> $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
> 216870
> {code}
> This improvement can effect all cloudera's hadoop cdh5-2.6.0_5.4.0 higher 
> versions(see: 
> [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433]
>  and 
> [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0])
>  and apache's hadoop 2.7.0 higher versions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20109) Need a way to convert from IndexedRowMatrix to Dense Block Matrices

2017-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20109:


Assignee: (was: Apache Spark)

> Need a way to convert from IndexedRowMatrix to Dense Block Matrices
> ---
>
> Key: SPARK-20109
> URL: https://issues.apache.org/jira/browse/SPARK-20109
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: John Compitello
>
> The current implementation of toBlockMatrix on IndexedRowMatrix is 
> insufficient. It is implemented by first converting the IndexedRowMatrix to a 
> CoordinateMatrix, then converting that CoordinateMatrix to a BlockMatrix. Not 
> only is this slower than it needs to be, it also means that the created 
> BlockMatrix ends up being backed by instances of SparseMatrix, which a user 
> may not want. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20109) Need a way to convert from IndexedRowMatrix to Dense Block Matrices

2017-03-28 Thread John Compitello (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Compitello updated SPARK-20109:

Description: The current implementation of toBlockMatrix on 
IndexedRowMatrix is insufficient. It is implemented by first converting the 
IndexedRowMatrix to a CoordinateMatrix, then converting that CoordinateMatrix 
to a BlockMatrix. Not only is this slower than it needs to be, it also means 
that the created BlockMatrix ends up being backed by instances of SparseMatrix, 
which a user may not want. Users need an option to convert from 
IndexedRowMatrix to BlockMatrix that backs the BlockMatrix with local instances 
of DenseMatrix.   (was: The current implementation of toBlockMatrix on 
IndexedRowMatrix is insufficient. It is implemented by first converting the 
IndexedRowMatrix to a CoordinateMatrix, then converting that CoordinateMatrix 
to a BlockMatrix. Not only is this slower than it needs to be, it also means 
that the created BlockMatrix ends up being backed by instances of SparseMatrix, 
which a user may not want. )

> Need a way to convert from IndexedRowMatrix to Dense Block Matrices
> ---
>
> Key: SPARK-20109
> URL: https://issues.apache.org/jira/browse/SPARK-20109
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: John Compitello
>
> The current implementation of toBlockMatrix on IndexedRowMatrix is 
> insufficient. It is implemented by first converting the IndexedRowMatrix to a 
> CoordinateMatrix, then converting that CoordinateMatrix to a BlockMatrix. Not 
> only is this slower than it needs to be, it also means that the created 
> BlockMatrix ends up being backed by instances of SparseMatrix, which a user 
> may not want. Users need an option to convert from IndexedRowMatrix to 
> BlockMatrix that backs the BlockMatrix with local instances of DenseMatrix. 
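
Until such an option exists, a workaround sketch (my assumption, not the 
proposed API) is to densify the blocks produced by the existing conversion:

{code}
import org.apache.spark.mllib.linalg.{Matrix, SparseMatrix}
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, IndexedRowMatrix}

// Convert via the existing toBlockMatrix, then replace each sparse block
// with its dense equivalent. This keeps the slow conversion path but yields
// a BlockMatrix backed by DenseMatrix instances.
def toDenseBlockMatrix(m: IndexedRowMatrix,
                       rowsPerBlock: Int = 1024,
                       colsPerBlock: Int = 1024): BlockMatrix = {
  val sparseBacked = m.toBlockMatrix(rowsPerBlock, colsPerBlock)
  val denseBlocks = sparseBacked.blocks.mapValues {
    case sm: SparseMatrix => sm.toDense
    case other: Matrix    => other
  }
  new BlockMatrix(denseBlocks, rowsPerBlock, colsPerBlock)
}
{code}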



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20109) Need a way to convert from IndexedRowMatrix to Dense Block Matrices

2017-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945388#comment-15945388
 ] 

Apache Spark commented on SPARK-20109:
--

User 'johnc1231' has created a pull request for this issue:
https://github.com/apache/spark/pull/17459

> Need a way to convert from IndexedRowMatrix to Dense Block Matrices
> ---
>
> Key: SPARK-20109
> URL: https://issues.apache.org/jira/browse/SPARK-20109
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: John Compitello
>
> The current implementation of toBlockMatrix on IndexedRowMatrix is 
> insufficient. It is implemented by first converting the IndexedRowMatrix to a 
> CoordinateMatrix, then converting that CoordinateMatrix to a BlockMatrix. Not 
> only is this slower than it needs to be, it also means that the created 
> BlockMatrix ends up being backed by instances of SparseMatrix, which a user 
> may not want. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-03-28 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945389#comment-15945389
 ] 

Kazuaki Ishizaki commented on SPARK-20112:
--

SPARK-18745 fixed integer overflow issues in {{HashedRelation.scala}} caused by 
large data; that fix was merged after 2.0.2.
If the data is very large, would it be possible to try it with the latest 
branch-2.0?

> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log
>
>
> I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
> hs_err_pid and codegen file are attached (with query plans). It's not a 
> deterministic repro, but running a big query load, I eventually see it come 
> up within a few minutes.
> Here is some interesting repro information:
> - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't 
> repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I 
> think that means it's not an issue with the code-gen, but I can't figure out 
> what the difference in behavior is.
> - The broadcast joins in the plan are all small tables. I have 
> autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
> - As you can see from the plan, all the sources are cached memory tables. And 
> we partition/sort them all beforehand so it's always sort-merge joins or 
> broadcast joins (with small tables).
> {noformat}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  [thread 139872345896704 also had an error]
> SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
> linux-amd64 compressed oops)
> [thread 139872348002048 also had an error]# Problematic frame:
> # 
> J 28454 C1 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
>  (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
> {noformat}
> This kind of looks like https://issues.apache.org/jira/browse/SPARK-15822, 
> but that is marked fix in 2.0.0



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20109) Need a way to convert from IndexedRowMatrix to Dense Block Matrices

2017-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20109:


Assignee: Apache Spark

> Need a way to convert from IndexedRowMatrix to Dense Block Matrices
> ---
>
> Key: SPARK-20109
> URL: https://issues.apache.org/jira/browse/SPARK-20109
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: John Compitello
>Assignee: Apache Spark
>
> The current implementation of toBlockMatrix on IndexedRowMatrix is 
> insufficient. It is implemented by first converting the IndexedRowMatrix to a 
> CoordinateMatrix, then converting that CoordinateMatrix to a BlockMatrix. Not 
> only is this slower than it needs to be, it also means that the created 
> BlockMatrix ends up being backed by instances of SparseMatrix, which a user 
> may not want. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20109) Need a way to convert from IndexedRowMatrix to Dense Block Matrices

2017-03-28 Thread John Compitello (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Compitello updated SPARK-20109:

Summary: Need a way to convert from IndexedRowMatrix to Dense Block 
Matrices  (was: Need a way to convert from IndexedRowMatrix to Block)

> Need a way to convert from IndexedRowMatrix to Dense Block Matrices
> ---
>
> Key: SPARK-20109
> URL: https://issues.apache.org/jira/browse/SPARK-20109
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: John Compitello
>
> The current implementation of toBlockMatrix on IndexedRowMatrix is 
> insufficient. It is implemented by first converting the IndexedRowMatrix to a 
> CoordinateMatrix, then converting that CoordinateMatrix to a BlockMatrix. Not 
> only is this slower than it needs to be, it also means that the created 
> BlockMatrix ends up being backed by instances of SparseMatrix, which a user 
> may not want. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20107) Speed up HadoopMapReduceCommitProtocol#commitJob for many output files

2017-03-28 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945379#comment-15945379
 ] 

Yuming Wang commented on SPARK-20107:
-

OK, I will add {{spark.mapreduce.fileoutputcommitter.algorithm.version}} to 
document.

> Speed up HadoopMapReduceCommitProtocol#commitJob for many output files
> --
>
> Key: SPARK-20107
> URL: https://issues.apache.org/jira/browse/SPARK-20107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>
> Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up 
> [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121]
>  for many output files.
> It can speed up {{11 minutes}} for 216869 output files:
> {code:sql}
> CREATE TABLE tmp.spark_20107 AS SELECT
>   category_id,
>   product_id,
>   track_id,
>   concat(
> substr(ds, 3, 2),
> substr(ds, 6, 2),
> substr(ds, 9, 2)
>   ) shortDate,
>   CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' 
> WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE 
> 'invalid actio' END AS type
> FROM
>   tmp.user_action
> WHERE
>   ds > date_sub('2017-01-23', 730)
> AND actiontype IN ('0','1','2','3');
> {code}
> {code}
> $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
> 216870
> {code}
> This improvement can effect all cloudera's hadoop cdh5-2.6.0_5.4.0 higher 
> versions(see: 
> [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433]
>  and 
> [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0])
>  and apache's hadoop 2.7.0 higher versions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19981) Sort-Merge join inserts shuffles when joining dataframes with aliased columns

2017-03-28 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945166#comment-15945166
 ] 

Mitesh edited comment on SPARK-19981 at 3/28/17 3:21 PM:
-

As I mentioned on the PR, maybe it's better to fix it here?

https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41

Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as 
{{a#1 as newA#2}}.

Otherwise you have a similar problem with sorting. Here is a sort example with 
1 partition. I believe the extra sort on {{newA}} is unnecessary.

{code}
scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", 
"b").coalesce(1).sortWithinPartitions("a")
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int]

scala> val df2 = df1.selectExpr("a as newA", "b")
df2: org.apache.spark.sql.DataFrame = [newA: int, b: int]

scala> println(df1.join(df2, df1("a") === 
df2("newA")).queryExecution.executedPlan)

*SortMergeJoin [args=[a#37225], [newA#37232], 
Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, 
newA#37232:int%NONNULL, b#37243:int%NONNULL)]
:- *Sort [args=[a#37225 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)]
:  +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
: +- LocalTableScan [args=[a#37225, 
b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
+- *Sort [args=[newA#37232 ASC], false, 
0][outPart=SinglePartition][outOrder=List(newA#37232 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
   +- *Project [args=[a#37242 AS newA#37232, 
b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
  +- *Sort [args=[a#37242 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)]
 +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]
+- LocalTableScan [args=[a#37242, 
b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]
{code}
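
A hypothetical illustration of that suggestion (not Spark's actual Canonicalize 
logic): strip aliases during canonicalization so that {{a#1 as newA#2}} 
compares equal to {{a#1}}.

{code}
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression}

// Remove Alias nodes bottom-up so that aliased and unaliased references to
// the same attribute canonicalize to the same expression tree.
def stripAliases(e: Expression): Expression = e.transformUp {
  case Alias(child, _) => child
}
{code}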


was (Author: masterddt):
As I mentioned on the PR, maybe a better to fix it here:

https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41

Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as 
{{a#1 as newA#2}}.

Otherwise you have a similar problem with sorting. Here is a sort example with 
1 partition. I believe the extra sort on {{newA}} is unnecessary.

{code}
scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", 
"b").coalesce(1).sortWithinPartitions("a")
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int]

scala> val df2 = df1.selectExpr("a as newA", "b")
df2: org.apache.spark.sql.DataFrame = [newA: int, b: int]

scala> println(df1.join(df2, df1("a") === 
df2("newA")).queryExecution.executedPlan)

*SortMergeJoin [args=[a#37225], [newA#37232], 
Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, 
newA#37232:int%NONNULL, b#37243:int%NONNULL)]
:- *Sort [args=[a#37225 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)]
:  +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
: +- LocalTableScan [args=[a#37225, 
b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
+- *Sort [args=[newA#37232 ASC], false, 
0][outPart=SinglePartition][outOrder=List(newA#37232 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
   +- *Project [args=[a#37242 AS newA#37232, 
b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
  +- *Sort [args=[a#37242 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)]
 +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]
+- LocalTableScan [args=[a#37242, 
b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL,
 

[jira] [Resolved] (SPARK-20126) Remove HiveSessionState

2017-03-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20126.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17457
[https://github.com/apache/spark/pull/17457]

> Remove HiveSessionState
> ---
>
> Key: SPARK-20126
> URL: https://issues.apache.org/jira/browse/SPARK-20126
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.2.0
>
>
> After SPARK-20100 the added value of the HiveSessionState has become quite 
> limited. We should remove it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20124) Join reorder should keep the same order of final project attributes

2017-03-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20124.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17453
[https://github.com/apache/spark/pull/17453]

> Join reorder should keep the same order of final project attributes
> ---
>
> Key: SPARK-20124
> URL: https://issues.apache.org/jira/browse/SPARK-20124
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
> Fix For: 2.2.0
>
>
> The join reorder algorithm should keep exactly the same order of output 
> attributes in the top project. 
> For example, if the user wants to select a, b, c, then after reordering we 
> should still output a, b, c in the order specified by the user, instead of 
> b, a, c or some other order.
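
A small illustration of the invariant (hypothetical column names, local data 
only):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("reorder").getOrCreate()
import spark.implicits._

val t1 = Seq((1, "a1")).toDF("k", "a")
val t2 = Seq((1, "b1")).toDF("k", "b")
val t3 = Seq((1, "c1")).toDF("k", "c")

// Whatever join order the optimizer picks, the user-specified projection
// order must survive: the output columns should be a, b, c in that order.
val result = t1.join(t2, "k").join(t3, "k").select("a", "b", "c")
assert(result.columns.toSeq == Seq("a", "b", "c"))
{code}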



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20124) Join reorder should keep the same order of final project attributes

2017-03-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20124:
---

Assignee: Zhenhua Wang

> Join reorder should keep the same order of final project attributes
> ---
>
> Key: SPARK-20124
> URL: https://issues.apache.org/jira/browse/SPARK-20124
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.2.0
>
>
> The join reorder algorithm should keep exactly the same order of output 
> attributes in the top project. 
> For example, if the user wants to select a, b, c, then after reordering we 
> should still output a, b, c in the order specified by the user, instead of 
> b, a, c or some other order.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20107) Speed up HadoopMapReduceCommitProtocol#commitJob for many output files

2017-03-28 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945177#comment-15945177
 ] 

Imran Rashid commented on SPARK-20107:
--

From Marcelo's comment on the PR:

bq. We shouldn't set the default to this value. Algo v2 has different 
correctness characteristics from v1 - yeah, it's faster especially in FSes like 
S3, but it also more likely to lead to bad data when things fail.

bq. If a user is comfortable with the trade-offs they can set this in their own 
configuration.

It would be nice if you could turn this JIRA into documenting the option; 
otherwise it should be closed as "Won't Fix".

> Speed up HadoopMapReduceCommitProtocol#commitJob for many output files
> --
>
> Key: SPARK-20107
> URL: https://issues.apache.org/jira/browse/SPARK-20107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>
> Set {{mapreduce.fileoutputcommitter.algorithm.version=2}} to speed up 
> [HadoopMapReduceCommitProtocol#commitJob|https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121]
>  for many output files.
> It can speed up {{11 minutes}} for 216869 output files:
> {code:sql}
> CREATE TABLE tmp.spark_20107 AS SELECT
>   category_id,
>   product_id,
>   track_id,
>   concat(
> substr(ds, 3, 2),
> substr(ds, 6, 2),
> substr(ds, 9, 2)
>   ) shortDate,
>   CASE WHEN actiontype = '0' THEN 'browse' WHEN actiontype = '1' THEN 'fav' 
> WHEN actiontype = '2' THEN 'cart' WHEN actiontype = '3' THEN 'order' ELSE 
> 'invalid actio' END AS type
> FROM
>   tmp.user_action
> WHERE
>   ds > date_sub('2017-01-23', 730)
> AND actiontype IN ('0','1','2','3');
> {code}
> {code}
> $ hadoop fs -ls /user/hive/warehouse/tmp.db/spark_20107 | wc -l
> 216870
> {code}
> This improvement can effect all cloudera's hadoop cdh5-2.6.0_5.4.0 higher 
> versions(see: 
> [cloudera/hadoop-common@1c12361|https://github.com/cloudera/hadoop-common/commit/1c1236182304d4075276c00c4592358f428bc433]
>  and 
> [cloudera/hadoop-common@16b2de2|https://github.com/cloudera/hadoop-common/commit/16b2de27321db7ce2395c08baccfdec5562017f0])
>  and apache's hadoop 2.7.0 higher versions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19981) Sort-Merge join inserts shuffles when joining dataframes with aliased columns

2017-03-28 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945166#comment-15945166
 ] 

Mitesh edited comment on SPARK-19981 at 3/28/17 1:35 PM:
-

As I mentioned on the PR, maybe it's better to fix it here:

https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41

Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as 
{{a#1 as newA#2}}.

Otherwise you have a similar problem with sorting. Here is a sort example with 
1 partition. I believe the extra sort on {{newA}} is unnecessary.

{code}
scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", 
"b").coalesce(1).sortWithinPartitions("a")
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int]

scala> val df2 = df1.selectExpr("a as newA", "b")
df2: org.apache.spark.sql.DataFrame = [newA: int, b: int]

scala> println(df1.join(df2, df1("a") === 
df2("newA")).queryExecution.executedPlan)

*SortMergeJoin [args=[a#37225], [newA#37232], 
Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, 
newA#37232:int%NONNULL, b#37243:int%NONNULL)]
:- *Sort [args=[a#37225 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)]
:  +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
: +- LocalTableScan [args=[a#37225, 
b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
+- *Sort [args=[newA#37232 ASC], false, 
0][outPart=SinglePartition][outOrder=List(newA#37232 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
   +- *Project [args=[a#37242 AS newA#37232, 
b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
  +- *Sort [args=[a#37242 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)]
 +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]
+- LocalTableScan [args=[a#37242, 
b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]
{code}


was (Author: masterddt):
As I mentioned on the PR, maybe a better to fix it here:

https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41

Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as 
{{a#1 as newA#2}}.

Otherwise you have a similar problem with sorting. Here is a sort example with 
1 partition. I believe the extra sort on {{newA}} is unnecessary.

{code}
scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", 
"b").coalesce(1).sortWithinPartitions("a")
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int]

scala> val df2 = df1.selectExpr("a as newA", "b")
df2: org.apache.spark.sql.DataFrame = [newA: int, b: int]

scala> println(df1.join(df2, df1("a") === 
df2("newA")).queryExecution.executedPlan)
*SortMergeJoin [args=[a#37225], [newA#37232], 
Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, 
newA#37232:int%NONNULL, b#37243:int%NONNULL)]
:- *Sort [args=[a#37225 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)]
:  +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
: +- LocalTableScan [args=[a#37225, 
b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
+- *Sort [args=[newA#37232 ASC], false, 
0][outPart=SinglePartition][outOrder=List(newA#37232 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
   +- *Project [args=[a#37242 AS newA#37232, 
b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
  +- *Sort [args=[a#37242 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)]
 +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]
+- LocalTableScan [args=[a#37242, 
b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]

[jira] [Comment Edited] (SPARK-19981) Sort-Merge join inserts shuffles when joining dataframes with aliased columns

2017-03-28 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945166#comment-15945166
 ] 

Mitesh edited comment on SPARK-19981 at 3/28/17 1:34 PM:
-

As I mentioned on the PR, maybe a better way is to handle it here:

https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41

Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as 
{{a#1 as newA#2}}.

Otherwise you have a similar problem with sorting. Here is a sort example with 
1 partition. I believe the extra sort on {{newA}} is unnecessary.

{code}
scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", 
"b").coalesce(1).sortWithinPartitions("a")
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int]

scala> val df2 = df1.selectExpr("a as newA", "b")
df2: org.apache.spark.sql.DataFrame = [newA: int, b: int]

scala> println(df1.join(df2, df1("a") === 
df2("newA")).queryExecution.executedPlan)
*SortMergeJoin [args=[a#37225], [newA#37232], 
Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, 
newA#37232:int%NONNULL, b#37243:int%NONNULL)]
:- *Sort [args=[a#37225 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)]
:  +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
: +- LocalTableScan [args=[a#37225, 
b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
+- *Sort [args=[newA#37232 ASC], false, 
0][outPart=SinglePartition][outOrder=List(newA#37232 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
   +- *Project [args=[a#37242 AS newA#37232, 
b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
  +- *Sort [args=[a#37242 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)]
 +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]
+- LocalTableScan [args=[a#37242, 
b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]
{code}


was (Author: masterddt):
As I mentioned on the PR, this seems like it should be handled here:

https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41

Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as 
{{a#1 as newA#2}}.

Otherwise you have a similar problem with sorting. Here is a sort example with 
1 partition. I believe the extra sort on {{newA}} is unnecessary.

{code}
scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", 
"b").coalesce(1).sortWithinPartitions("a")
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int]

scala> val df2 = df1.selectExpr("a as newA", "b")
df2: org.apache.spark.sql.DataFrame = [newA: int, b: int]

scala> println(df1.join(df2, df1("a") === 
df2("newA")).queryExecution.executedPlan)
*SortMergeJoin [args=[a#37225], [newA#37232], 
Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, 
newA#37232:int%NONNULL, b#37243:int%NONNULL)]
:- *Sort [args=[a#37225 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)]
:  +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
: +- LocalTableScan [args=[a#37225, 
b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
+- *Sort [args=[newA#37232 ASC], false, 
0][outPart=SinglePartition][outOrder=List(newA#37232 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
   +- *Project [args=[a#37242 AS newA#37232, 
b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
  +- *Sort [args=[a#37242 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)]
 +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]
+- LocalTableScan [args=[a#37242, 
b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL,
 

[jira] [Comment Edited] (SPARK-19981) Sort-Merge join inserts shuffles when joining dataframes with aliased columns

2017-03-28 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945166#comment-15945166
 ] 

Mitesh edited comment on SPARK-19981 at 3/28/17 1:34 PM:
-

As I mentioned on the PR, this seems like it should be handled here:

https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41

Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as 
{{a#1 as newA#2}}.

Otherwise you have a similar problem with sorting. Here is a sort example with 
1 partition. I believe the extra sort on {{newA}} is unnecessary.

{code}
scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", 
"b").coalesce(1).sortWithinPartitions("a")
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int]

scala> val df2 = df1.selectExpr("a as newA", "b")
df2: org.apache.spark.sql.DataFrame = [newA: int, b: int]

scala> println(df1.join(df2, df1("a") === 
df2("newA")).queryExecution.executedPlan)
*SortMergeJoin [args=[a#37225], [newA#37232], 
Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, 
newA#37232:int%NONNULL, b#37243:int%NONNULL)]
:- *Sort [args=[a#37225 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)]
:  +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
: +- LocalTableScan [args=[a#37225, 
b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
+- *Sort [args=[newA#37232 ASC], false, 
0][outPart=SinglePartition][outOrder=List(newA#37232 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
   +- *Project [args=[a#37242 AS newA#37232, 
b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
  +- *Sort [args=[a#37242 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)]
 +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]
+- LocalTableScan [args=[a#37242, 
b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]
{code}


was (Author: masterddt):
As I mentioned on the PR, this seems like it should be handled here:

https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41

Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as 
{{a#1 as newA#2}}.

Otherwise you have a similar problem with sorting. Here is a sort example with 
1 partition. I believe the extra sort on {[newA}] is unnecessary.

{code}
scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", 
"b").coalesce(1).sortWithinPartitions("a")
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int]

scala> val df2 = df1.selectExpr("a as newA", "b")
df2: org.apache.spark.sql.DataFrame = [newA: int, b: int]

scala> println(df1.join(df2, df1("a") === 
df2("newA")).queryExecution.executedPlan)
*SortMergeJoin [args=[a#37225], [newA#37232], 
Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, 
newA#37232:int%NONNULL, b#37243:int%NONNULL)]
:- *Sort [args=[a#37225 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)]
:  +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
: +- LocalTableScan [args=[a#37225, 
b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
+- *Sort [args=[newA#37232 ASC], false, 
0][outPart=SinglePartition][outOrder=List(newA#37232 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
   +- *Project [args=[a#37242 AS newA#37232, 
b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
  +- *Sort [args=[a#37242 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)]
 +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]
+- LocalTableScan [args=[a#37242, 
b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL,
 

[jira] [Comment Edited] (SPARK-19981) Sort-Merge join inserts shuffles when joining dataframes with aliased columns

2017-03-28 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945166#comment-15945166
 ] 

Mitesh edited comment on SPARK-19981 at 3/28/17 1:34 PM:
-

As I mentioned on the PR, maybe a better to fix it here:

https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41

Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as 
{{a#1 as newA#2}}.

Otherwise you have a similar problem with sorting. Here is a sort example with 
1 partition. I believe the extra sort on {{newA}} is unnecessary.

{code}
scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", 
"b").coalesce(1).sortWithinPartitions("a")
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int]

scala> val df2 = df1.selectExpr("a as newA", "b")
df2: org.apache.spark.sql.DataFrame = [newA: int, b: int]

scala> println(df1.join(df2, df1("a") === 
df2("newA")).queryExecution.executedPlan)
*SortMergeJoin [args=[a#37225], [newA#37232], 
Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, 
newA#37232:int%NONNULL, b#37243:int%NONNULL)]
:- *Sort [args=[a#37225 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)]
:  +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
: +- LocalTableScan [args=[a#37225, 
b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
+- *Sort [args=[newA#37232 ASC], false, 
0][outPart=SinglePartition][outOrder=List(newA#37232 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
   +- *Project [args=[a#37242 AS newA#37232, 
b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
  +- *Sort [args=[a#37242 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)]
 +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]
+- LocalTableScan [args=[a#37242, 
b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]
{code}


was (Author: masterddt):
As I mentioned on the PR, maybe a better way is to handle it here:

https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41

Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as 
{{a#1 as newA#2}}.

Otherwise you have a similar problem with sorting. Here is a sort example with 
1 partition. I believe the extra sort on {{newA}} is unnecessary.

{code}
scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", 
"b").coalesce(1).sortWithinPartitions("a")
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int]

scala> val df2 = df1.selectExpr("a as newA", "b")
df2: org.apache.spark.sql.DataFrame = [newA: int, b: int]

scala> println(df1.join(df2, df1("a") === 
df2("newA")).queryExecution.executedPlan)
*SortMergeJoin [args=[a#37225], [newA#37232], 
Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, 
newA#37232:int%NONNULL, b#37243:int%NONNULL)]
:- *Sort [args=[a#37225 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)]
:  +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
: +- LocalTableScan [args=[a#37225, 
b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
+- *Sort [args=[newA#37232 ASC], false, 
0][outPart=SinglePartition][outOrder=List(newA#37232 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
   +- *Project [args=[a#37242 AS newA#37232, 
b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
  +- *Sort [args=[a#37242 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)]
 +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]
+- LocalTableScan [args=[a#37242, 
b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL,
 

[jira] [Comment Edited] (SPARK-19981) Sort-Merge join inserts shuffles when joining dataframes with aliased columns

2017-03-28 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945166#comment-15945166
 ] 

Mitesh edited comment on SPARK-19981 at 3/28/17 1:33 PM:
-

As I mentioned on the PR, this seems like it should be handled here:

https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41

Perhaps it should handle canonicalizing an alias so {{a#1}} is the same as 
{{a#1 as newA#2}}.

Otherwise you have a similar problem with sorting. Here is a sort example with 
1 partition. I believe the extra sort on {[newA}] is unnecessary.

{code}
scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", 
"b").coalesce(1).sortWithinPartitions("a")
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int]

scala> val df2 = df1.selectExpr("a as newA", "b")
df2: org.apache.spark.sql.DataFrame = [newA: int, b: int]

scala> println(df1.join(df2, df1("a") === 
df2("newA")).queryExecution.executedPlan)
*SortMergeJoin [args=[a#37225], [newA#37232], 
Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, 
newA#37232:int%NONNULL, b#37243:int%NONNULL)]
:- *Sort [args=[a#37225 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)]
:  +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
: +- LocalTableScan [args=[a#37225, 
b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
+- *Sort [args=[newA#37232 ASC], false, 
0][outPart=SinglePartition][outOrder=List(newA#37232 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
   +- *Project [args=[a#37242 AS newA#37232, 
b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
  +- *Sort [args=[a#37242 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)]
 +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]
+- LocalTableScan [args=[a#37242, 
b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]
{code}


was (Author: masterddt):
As I mentioned on the PR, this seems like it should be handled here:

https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41

Perhaps it should handle canonicalizing an alias so `a#1` is the same as `a#1 
as newA#2`.

Otherwise you have a similar problem with sorting. Here is a sort example with 
1 partition. I believe the extra sort on `newA` is unnecessary.

```
scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", 
"b").coalesce(1).sortWithinPartitions("a")
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int]

scala> val df2 = df1.selectExpr("a as newA", "b")
df2: org.apache.spark.sql.DataFrame = [newA: int, b: int]

scala> println(df1.join(df2, df1("a") === 
df2("newA")).queryExecution.executedPlan)
*SortMergeJoin [args=[a#37225], [newA#37232], 
Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, 
newA#37232:int%NONNULL, b#37243:int%NONNULL)]
:- *Sort [args=[a#37225 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)]
:  +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
: +- LocalTableScan [args=[a#37225, 
b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
+- *Sort [args=[newA#37232 ASC], false, 
0][outPart=SinglePartition][outOrder=List(newA#37232 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
   +- *Project [args=[a#37242 AS newA#37232, 
b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
  +- *Sort [args=[a#37242 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)]
 +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]
+- LocalTableScan [args=[a#37242, 
b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL,
 

[jira] [Commented] (SPARK-19981) Sort-Merge join inserts shuffles when joining dataframes with aliased columns

2017-03-28 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945166#comment-15945166
 ] 

Mitesh commented on SPARK-19981:


As I mentioned on the PR, this seems like it should be handled here:

https://github.com/maropu/spark/blob/b5d1038ed5d65a6ddec20ea6eef186d25fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L41

Perhaps it should handle canonicalizing an alias so `a#1` is the same as `a#1 
as newA#2`.

Otherwise you have a similar problem with sorting. Here is a sort example with 
1 partition. I believe the extra sort on `newA` is unnecessary.

```
scala> val df1 = Seq((1, 2), (3, 4)).toDF("a", 
"b").coalesce(1).sortWithinPartitions("a")
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int]

scala> val df2 = df1.selectExpr("a as newA", "b")
df2: org.apache.spark.sql.DataFrame = [newA: int, b: int]

scala> println(df1.join(df2, df1("a") === 
df2("newA")).queryExecution.executedPlan)
*SortMergeJoin [args=[a#37225], [newA#37232], 
Inner][outPart=PartitioningCollection(1, )][outOrder=List(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL, 
newA#37232:int%NONNULL, b#37243:int%NONNULL)]
:- *Sort [args=[a#37225 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37225 
ASC%NONNULL)][output=List(a#37225:int%NONNULL, b#37226:int%NONNULL)]
:  +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
: +- LocalTableScan [args=[a#37225, 
b#37226]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37225:int%NONNULL,
 b#37226:int%NONNULL)]
+- *Sort [args=[newA#37232 ASC], false, 
0][outPart=SinglePartition][outOrder=List(newA#37232 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
   +- *Project [args=[a#37242 AS newA#37232, 
b#37243]][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=ArrayBuffer(newA#37232:int%NONNULL, b#37243:int%NONNULL)]
  +- *Sort [args=[a#37242 ASC], false, 
0][outPart=SinglePartition][outOrder=ArrayBuffer(a#37242 
ASC%NONNULL)][output=List(a#37242:int%NONNULL, b#37243:int%NONNULL)]
 +- Coalesce 
[args=1][outPart=SinglePartition][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]
+- LocalTableScan [args=[a#37242, 
b#37243]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(a#37242:int%NONNULL,
 b#37243:int%NONNULL)]```
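
To make the suggestion concrete, here is a toy sketch of alias-stripping canonicalization. It deliberately does not use Catalyst classes; every name below is illustrative only, not Spark API:

{code}
// Toy expression tree: canonical form looks through aliases, so `a#1` and
// `a#1 AS newA#2` compare equal. This only mirrors the shape of the proposal.
sealed trait Expr { def canonicalized: Expr = this }
case class AttrRef(name: String, exprId: Int) extends Expr
case class Alias(child: Expr, name: String, exprId: Int) extends Expr {
  override def canonicalized: Expr = child.canonicalized
}

object ToyCanonicalizeDemo extends App {
  val a    = AttrRef("a", 1)
  val newA = Alias(a, "newA", 2)
  println(a.canonicalized == newA.canonicalized)  // prints: true
}
{code}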

> Sort-Merge join inserts shuffles when joining dataframes with aliased columns
> -
>
> Key: SPARK-19981
> URL: https://issues.apache.org/jira/browse/SPARK-19981
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Allen George
>
> Performing a sort-merge join with two dataframes - each of which has the join 
> column aliased - causes Spark to insert an unnecessary shuffle.
> Consider the scala test code below, which should be equivalent to the 
> following SQL.
> {code:SQL}
> SELECT * FROM
>   (SELECT number AS aliased from df1) t1
> LEFT JOIN
>   (SELECT number AS aliased from df2) t2
> ON t1.aliased = t2.aliased
> {code}
> {code:scala}
> private case class OneItem(number: Long)
> private case class TwoItem(number: Long, value: String)
> test("join with aliases should not trigger shuffle") {
>   val df1 = sqlContext.createDataFrame(
> Seq(
>   OneItem(0),
>   OneItem(2),
>   OneItem(4)
> )
>   )
>   val partitionedDf1 = df1.repartition(10, col("number"))
>   partitionedDf1.createOrReplaceTempView("df1")
>   partitionedDf1.cache(); partitionedDf1.count()
>   
>   val df2 = sqlContext.createDataFrame(
> Seq(
>   TwoItem(0, "zero"),
>   TwoItem(2, "two"),
>   TwoItem(4, "four")
> )
>   )
>   val partitionedDf2 = df2.repartition(10, col("number"))
>   partitionedDf2.createOrReplaceTempView("df2")
>   partitionedDf2.cache(); partitionedDf2.count()
>   
>   val fromDf1 = sqlContext.sql("SELECT number from df1")
>   val fromDf2 = sqlContext.sql("SELECT number from df2")
>   val aliasedDf1 = fromDf1.select(col(fromDf1.columns.head) as "aliased")
>   val aliasedDf2 = fromDf2.select(col(fromDf2.columns.head) as "aliased")
>   aliasedDf1.join(aliasedDf2, Seq("aliased"), "left_outer") }
> {code}
> Both the SQL and the Scala code generate a query-plan where an extra exchange 
> is inserted before performing the sort-merge join. This exchange changes the 
> partitioning from {{HashPartitioning("number", 10)}} for each frame being 
> joined into {{HashPartitioning("aliased", 5)}}. I would have expected that, 
> since this is a simple column aliasing and both frames have exactly the same 
> partitioning, the initial partitioning would be reused.
> {noformat} 
> *Project 

[jira] [Updated] (SPARK-20128) MetricsSystem not always killed in SparkContext.stop()

2017-03-28 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-20128:
-
Description: 
One Jenkins run failed due to the MetricsSystem never getting killed after a 
failed test, which led that test to hang and the tests to timeout:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75176

{noformat}
17/03/24 13:44:19.537 dag-scheduler-event-loop ERROR 
DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting 
down SparkContext
java.lang.ArrayIndexOutOfBoundsException: -1
at 
org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:431)
at 
org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:430)
at scala.Option.flatMap(Option.scala:171)
at 
org.apache.spark.MapOutputTrackerMaster.getEpochForMapOutput(MapOutputTracker.scala:430)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1298)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1731)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1689)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1678)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
17/03/24 13:44:19.540 dispatcher-event-loop-11 INFO 
MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/03/24 13:44:19.546 stop-spark-context INFO MemoryStore: MemoryStore cleared
17/03/24 13:44:19.546 stop-spark-context INFO BlockManager: BlockManager stopped
17/03/24 13:44:19.546 stop-spark-context INFO BlockManagerMaster: 
BlockManagerMaster stopped
17/03/24 13:44:19.546 dispatcher-event-loop-16 INFO 
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully 
stopped SparkContext
17/03/24 14:02:19.934 metrics-console-reporter-1-thread-1 ERROR 
ScheduledReporter: RuntimeException thrown from ConsoleReporter#report. 
Exception was suppressed.
java.lang.NullPointerException
at 
org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:35)
at 
org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:34)
at 
com.codahale.metrics.ConsoleReporter.printGauge(ConsoleReporter.java:239)
...
{noformat}

Unfortunately I didn't save the entire test logs, but what happens is that the 
initial ArrayIndexOutOfBoundsException is a real bug, which causes the 
SparkContext to stop and the test to fail. However, the MetricsSystem somehow 
stays alive, and since it's not a daemon thread, it just hangs, and every 20 
minutes we get that NPE from within the metrics system as it tries to report.

I am totally perplexed at how this can happen; it looks like the metrics system 
should always be stopped by the time we see

{noformat}
17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully 
stopped SparkContext
{noformat}

I don't think I've ever seen this in real Spark use, but whatever the cause, it 
doesn't look like something limited to tests.

  was:
One Jenkins run failed due to the MetricsSystem never getting killed after a 
failed test, which led that test to hang and the tests to timeout:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75176

{noformat}
17/03/24 13:44:19.537 dag-scheduler-event-loop ERROR 
DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting 
down SparkContext
java.lang.ArrayIndexOutOfBoundsException: -1
at 
org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:431)
at 
org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:430)
at scala.Option.flatMap(Option.scala:171)
at 
org.apache.spark.MapOutputTrackerMaster.getEpochForMapOutput(MapOutputTracker.scala:430)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1298)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1731)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1689)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1678)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
17/03/24 13:44:19.540 dispatcher-event-loop-11 INFO 
MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/03/24 13:44:19.546 stop-spark-context INFO MemoryStore: MemoryStore cleared
17/03/24 13:44:19.546 stop-spark-context INFO BlockManager: BlockManager stopped
17/03/24 13:44:19.546 

[jira] [Created] (SPARK-20128) MetricsSystem not always killed in SparkContext.stop()

2017-03-28 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-20128:


 Summary: MetricsSystem not always killed in SparkContext.stop()
 Key: SPARK-20128
 URL: https://issues.apache.org/jira/browse/SPARK-20128
 Project: Spark
  Issue Type: Test
  Components: Spark Core, Tests
Affects Versions: 2.2.0
Reporter: Imran Rashid


One Jenkins run failed due to the MetricsSystem never getting killed after a 
failed test, which led that test to hang and the tests to timeout:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75176

{noformat}
17/03/24 13:44:19.537 dag-scheduler-event-loop ERROR 
DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting 
down SparkContext
java.lang.ArrayIndexOutOfBoundsException: -1
at 
org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:431)
at 
org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:430)
at scala.Option.flatMap(Option.scala:171)
at 
org.apache.spark.MapOutputTrackerMaster.getEpochForMapOutput(MapOutputTracker.scala:430)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1298)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1731)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1689)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1678)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
17/03/24 13:44:19.540 dispatcher-event-loop-11 INFO 
MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/03/24 13:44:19.546 stop-spark-context INFO MemoryStore: MemoryStore cleared
17/03/24 13:44:19.546 stop-spark-context INFO BlockManager: BlockManager stopped
17/03/24 13:44:19.546 stop-spark-context INFO BlockManagerMaster: 
BlockManagerMaster stopped
17/03/24 13:44:19.546 dispatcher-event-loop-16 INFO 
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully 
stopped SparkContext
17/03/24 14:02:19.934 metrics-console-reporter-1-thread-1 ERROR 
ScheduledReporter: RuntimeException thrown from ConsoleReporter#report. 
Exception was suppressed.
java.lang.NullPointerException
at 
org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:35)
at 
org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:34)
at 
com.codahale.metrics.ConsoleReporter.printGauge(ConsoleReporter.java:239)
...
{noformat}

Unfortunately I didn't save the entire test logs, but what happens is that the 
initial ArrayIndexOutOfBoundsException is a real bug, which causes the 
SparkContext to stop and the test to fail. However, the MetricsSystem somehow 
stays alive, and since it's not a daemon thread, it just hangs, and every 20 
minutes we get that NPE from within the metrics system as it tries to report.

I am totally perplexed at how this can happen; it looks like the metrics system 
should always be stopped by the time we see

{noformat}
17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully 
stopped SparkContext
{noformat}

I don't think I've ever seen this in real Spark use, but whatever the cause, it 
doesn't look like something limited to tests.
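
The hang itself is ordinary JVM behaviour for non-daemon threads; a minimal sketch reproducing it outside Spark (plain JDK scheduling, nothing Spark-specific) is:

{code}
import java.util.concurrent.{Executors, TimeUnit}

object NonDaemonReporterDemo extends App {
  // The default thread factory creates non-daemon threads, so this executor
  // keeps the JVM alive after the main thread finishes.
  val reporter = Executors.newSingleThreadScheduledExecutor()
  reporter.scheduleAtFixedRate(new Runnable {
    def run(): Unit = println("reporting metrics ...")
  }, 0L, 20L, TimeUnit.MINUTES)

  // Without an explicit reporter.shutdown() the process never exits --
  // the same failure mode as a MetricsSystem that is never stopped.
}
{code}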



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20127) Minor code cleanup

2017-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20127:


Assignee: Apache Spark

> Minor code cleanup
> --
>
> Key: SPARK-20127
> URL: https://issues.apache.org/jira/browse/SPARK-20127
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Denis Bolshakov
>Assignee: Apache Spark
>Priority: Minor
>
> Intellij IDEA shows a bunch of messages while inspecting source code.
> So fixing the most explicit ones gives the following:
> - improving source code quality
> - getting involved in the contribution process



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20127) Minor code cleanup

2017-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20127:


Assignee: (was: Apache Spark)

> Minor code cleanup
> --
>
> Key: SPARK-20127
> URL: https://issues.apache.org/jira/browse/SPARK-20127
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Denis Bolshakov
>Priority: Minor
>
> Intellij IDEA shows a bunch of messages while inspecting source code.
> So fixing the most explicit ones gives the following:
> - improving source code quality
> - getting involved in the contribution process



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20127) Minor code cleanup

2017-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945107#comment-15945107
 ] 

Apache Spark commented on SPARK-20127:
--

User 'dbolshak' has created a pull request for this issue:
https://github.com/apache/spark/pull/17458

> Minor code cleanup
> --
>
> Key: SPARK-20127
> URL: https://issues.apache.org/jira/browse/SPARK-20127
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Denis Bolshakov
>Priority: Minor
>
> Intellij IDEA shows a bunch of messages while inspecting source code.
> So fixing the most explicit ones gives the following:
> - improving source code quality
> - getting involved in the contribution process



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20126) Remove HiveSessionState

2017-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20126:


Assignee: Herman van Hovell  (was: Apache Spark)

> Remove HiveSessionState
> ---
>
> Key: SPARK-20126
> URL: https://issues.apache.org/jira/browse/SPARK-20126
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> After SPARK-20100 the added value of the HiveSessionState has become quite 
> limited. We should remove it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20126) Remove HiveSessionState

2017-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20126:


Assignee: Apache Spark  (was: Herman van Hovell)

> Remove HiveSessionState
> ---
>
> Key: SPARK-20126
> URL: https://issues.apache.org/jira/browse/SPARK-20126
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>
> After SPARK-20100 the added value of the HiveSessionState has become quite 
> limited. We should remove it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20126) Remove HiveSessionState

2017-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945072#comment-15945072
 ] 

Apache Spark commented on SPARK-20126:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/17457

> Remove HiveSessionState
> ---
>
> Key: SPARK-20126
> URL: https://issues.apache.org/jira/browse/SPARK-20126
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> After SPARK-20100 the added value of the HiveSessionState has become quite 
> limited. We should remove it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20127) Minor code cleanup

2017-03-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945058#comment-15945058
 ] 

Sean Owen commented on SPARK-20127:
---

We use pull requests to suggest changes, but before you open one: I think most 
of those changes have intentionally not been made in the code so far.

> Minor code cleanup
> --
>
> Key: SPARK-20127
> URL: https://issues.apache.org/jira/browse/SPARK-20127
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Denis Bolshakov
>Priority: Minor
>
> Intellij IDEA shows a bunch of messages while inspecting source code.
> So fixing the most explicit ones gives the following:
> - improving source code quality
> - getting involved in the contribution process



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20127) Minor code cleanup

2017-03-28 Thread Denis Bolshakov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945051#comment-15945051
 ] 

Denis Bolshakov edited comment on SPARK-20127 at 3/28/17 12:32 PM:
---

I applied your first comments (just by reverting the changes).
The full patch can be found here:
https://github.com/dbolshak/spark/tree/SPARK-20127


was (Author: bolshakov.de...@gmail.com):
I applied you first comments (just by reverting changes).
Full patch could be found here
https://github.com/dbolshak/spark/tree/SPARK-20127

> Minor code cleanup
> --
>
> Key: SPARK-20127
> URL: https://issues.apache.org/jira/browse/SPARK-20127
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Denis Bolshakov
>Priority: Minor
>
> Intellij IDEA shows a bunch of messages while inspecting source code.
> So fixing the most explicit ones gives the following:
> - improving source code quality
> - getting involved in the contribution process



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20127) Minor code cleanup

2017-03-28 Thread Denis Bolshakov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945051#comment-15945051
 ] 

Denis Bolshakov commented on SPARK-20127:
-

I applied you first comments (just by reverting changes).
Full patch could be found here
https://github.com/dbolshak/spark/tree/SPARK-20127

> Minor code cleanup
> --
>
> Key: SPARK-20127
> URL: https://issues.apache.org/jira/browse/SPARK-20127
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Denis Bolshakov
>Priority: Minor
>
> Intellij IDEA shows a bunch of messages while inspecting source code.
> So fixing the most explicit ones gives the following:
> - improving source code quality
> - getting involved in the contribution process



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20127) Minor code cleanup

2017-03-28 Thread Denis Bolshakov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945014#comment-15945014
 ] 

Denis Bolshakov commented on SPARK-20127:
-

You can review the changes briefly here:
https://github.com/dbolshak/spark/commit/8a31173b436eb7d39f2423eccda7932239f8d5b6

Nothing else right now.

> Minor code cleanup
> --
>
> Key: SPARK-20127
> URL: https://issues.apache.org/jira/browse/SPARK-20127
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Denis Bolshakov
>Priority: Minor
>
> Intellij IDEA shows a bunch of messages while inspecting source code.
> So fixing the most explicit ones gives the following:
> - improving source code quality
> - getting involved in the contribution process



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20127) Minor code cleanup

2017-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20127:
--
Affects Version/s: (was: 2.3.0)
   2.1.0
   Labels:   (was: newbiee)
 Priority: Minor  (was: Major)

We don't generally assign issues at this stage. I would avoid changes that 
don't help performance or correctness. It may be better to discuss the kinds of 
things you would change here first.

> Minor code cleanup
> --
>
> Key: SPARK-20127
> URL: https://issues.apache.org/jira/browse/SPARK-20127
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Denis Bolshakov
>Priority: Minor
>
> Intellij IDEA shows a bunch of messages while inspecting source code.
> So fixing the most explicit ones gives the following:
> - improving source code quality
> - getting involved in the contribution process



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20127) Minor code cleanup

2017-03-28 Thread Denis Bolshakov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945002#comment-15945002
 ] 

Denis Bolshakov edited comment on SPARK-20127 at 3/28/17 11:46 AM:
---

Hello [~srowen], thanks for the quick feedback.

Could you please assign the issue to me? ))

I understand your point, and I've tried to exclude changes related to code 
style or otherwise disputable changes. I believe there are no changes that 
could have a performance impact.

Kind regards,
Denis


was (Author: bolshakov.de...@gmail.com):
Hello [~srowen], thanks for quick feedback.

Could you please assign issue to me? ))

I understand your point I've tried to exclude changes related to code style or 
dispute changes. I believe there are no changes that could have performance 
impact.

Kind regards,
Denis

> Minor code cleanup
> --
>
> Key: SPARK-20127
> URL: https://issues.apache.org/jira/browse/SPARK-20127
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Denis Bolshakov
>  Labels: newbiee
>
> Intellij IDEA shows a bunch of messages while inspecting source code.
> So fixing the most explicit ones gives the following:
> - improving source code quality
> - getting involved in the contribution process



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20127) Minor code cleanup

2017-03-28 Thread Denis Bolshakov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945002#comment-15945002
 ] 

Denis Bolshakov commented on SPARK-20127:
-

Hello [~srowen], thanks for quick feedback.

Could you please assign issue to me? ))

I understand your point I've tried to exclude changes related to code style or 
dispute changes. I believe there are no changes that could have performance 
impact.

Kind regards,
Denis

> Minor code cleanup
> --
>
> Key: SPARK-20127
> URL: https://issues.apache.org/jira/browse/SPARK-20127
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Denis Bolshakov
>  Labels: newbiee
>
> Intellij IDEA shows a bunch of messages while inspecting source code.
> So fixing the most explicit ones gives the following:
> - improving source code quality
> - getting involved in the contribution process



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20094) Should Prevent push down of IN subquery to Join operator

2017-03-28 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-20094.
---
   Resolution: Fixed
 Assignee: Zhenhua Wang
Fix Version/s: 2.2.0

> Should Prevent push down of IN subquery to Join operator
> 
>
> Key: SPARK-20094
> URL: https://issues.apache.org/jira/browse/SPARK-20094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.2.0
>
>
> ReorderJoin collects all predicates and tries to put them into the join 
> condition when creating an ordered join. If a predicate with an IN subquery 
> ends up in a join condition instead of a filter condition, 
> `RewritePredicateSubquery.rewriteExistentialExpr` fails to convert the 
> subquery to an ExistenceJoin, and thus results in an error.
> For example, TPC-DS q45 fails for this reason:
> {noformat}
> spark-sql> explain codegen
>  > SELECT
>  >   ca_zip,
>  >   ca_city,
>  >   sum(ws_sales_price)
>  > FROM web_sales, customer, customer_address, date_dim, item
>  > WHERE ws_bill_customer_sk = c_customer_sk
>  >   AND c_current_addr_sk = ca_address_sk
>  >   AND ws_item_sk = i_item_sk
>  >   AND (substr(ca_zip, 1, 5) IN
>  >   ('85669', '86197', '88274', '83405', '86475', '85392', '85460', 
> '80348', '81792')
>  >   OR
>  >   i_item_id IN (SELECT i_item_id
>  >   FROM item
>  >   WHERE i_item_sk IN (2, 3, 5, 7, 11, 13, 17, 19, 23, 29)
>  >   )
>  > )
>  >   AND ws_sold_date_sk = d_date_sk
>  >   AND d_qoy = 2 AND d_year = 2001
>  > GROUP BY ca_zip, ca_city
>  > ORDER BY ca_zip, ca_city
>  > LIMIT 100;
> 17/03/25 15:27:02 ERROR SparkSQLDriver: Failed in [explain codegen
>   
> SELECT
>   ca_zip,
>   ca_city,
>   sum(ws_sales_price)
> FROM web_sales, customer, customer_address, date_dim, item
> WHERE ws_bill_customer_sk = c_customer_sk
>   AND c_current_addr_sk = ca_address_sk
>   AND ws_item_sk = i_item_sk
>   AND (substr(ca_zip, 1, 5) IN
>   ('85669', '86197', '88274', '83405', '86475', '85392', '85460', '80348', 
> '81792')
>   OR
>   i_item_id IN (SELECT i_item_id
>   FROM item
>   WHERE i_item_sk IN (2, 3, 5, 7, 11, 13, 17, 19, 23, 29)
>   )
> )
>   AND ws_sold_date_sk = d_date_sk
>   AND d_qoy = 2 AND d_year = 2001
> GROUP BY ca_zip, ca_city
> ORDER BY ca_zip, ca_city
> LIMIT 100]
> java.lang.UnsupportedOperationException: Cannot evaluate expression: list#1 []
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:224)
>   at 
> org.apache.spark.sql.catalyst.expressions.ListQuery.doGenCode(subquery.scala:262)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101)
>   at 
> org.apache.spark.sql.catalyst.expressions.In$$anonfun$3.apply(predicates.scala:199)
>   at 
> org.apache.spark.sql.catalyst.expressions.In$$anonfun$3.apply(predicates.scala:199)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.expressions.In.doGenCode(predicates.scala:199)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101)
>   at 
> org.apache.spark.sql.catalyst.expressions.Or.doGenCode(predicates.scala:379)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101)
>   at 
> 
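
The failing pattern can be reduced to something much smaller than q45. A hypothetical sketch of the same shape (table and column names are made up, it assumes a local SparkSession, and it is not claimed to be a verified reproducer):

{code}
import org.apache.spark.sql.SparkSession

object InSubqueryJoinShape extends App {
  val spark = SparkSession.builder()
    .appName("in-subquery-join-shape")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  Seq((1, "85669"), (2, "00000")).toDF("id", "zip").createOrReplaceTempView("t1")
  Seq((1, 10), (2, 20)).toDF("id", "item_sk").createOrReplaceTempView("t2")
  Seq((10, "i10"), (20, "i20")).toDF("item_sk", "item_id").createOrReplaceTempView("item")

  // The OR of a literal IN list with an IN subquery is the kind of predicate
  // that ReorderJoin would fold into the join condition.
  spark.sql(
    """SELECT t1.id
      |FROM t1, t2
      |WHERE t1.id = t2.id
      |  AND (t1.zip IN ('85669', '86197')
      |       OR t2.item_sk IN (SELECT item_sk FROM item WHERE item_id = 'i10'))
    """.stripMargin).explain(true)

  spark.stop()
}
{code}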

[jira] [Commented] (SPARK-20127) Minor code cleanup

2017-03-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944984#comment-15944984
 ] 

Sean Owen commented on SPARK-20127:
---

I am a fan of improving code style and static inspection. However we have 
generally declined to do big-bang modifications to the code base where the 
change is purely a question of style. Otherwise I would have done this a long 
time ago :)

Before starting, have a look and see if any are correctness or significant 
performance issues, or small changes that would resolve a consistency problem. 
Mention them here first.

> Minor code cleanup
> --
>
> Key: SPARK-20127
> URL: https://issues.apache.org/jira/browse/SPARK-20127
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Denis Bolshakov
>  Labels: newbiee
>
> Intellij IDEA shows a bunch of messages while inspecting source code.
> So fixing the most explicit ones gives the following:
> - improving source code quality
> - getting involved in the contribution process



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20127) Minor code cleanup

2017-03-28 Thread Denis Bolshakov (JIRA)
Denis Bolshakov created SPARK-20127:
---

 Summary: Minor code cleanup
 Key: SPARK-20127
 URL: https://issues.apache.org/jira/browse/SPARK-20127
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Denis Bolshakov


IntelliJ IDEA reports a number of inspection warnings in the source code.

Fixing the most obvious ones would:
- improve source code quality
- make it easier for new contributors to get involved



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20126) Remove HiveSessionState

2017-03-28 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-20126:
-

 Summary: Remove HiveSessionState
 Key: SPARK-20126
 URL: https://issues.apache.org/jira/browse/SPARK-20126
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Herman van Hovell
Assignee: Herman van Hovell


After SPARK-20100, the added value of HiveSessionState has become quite 
limited. We should remove it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20123) $SPARK_HOME variable might have spaces in it(e.g. $SPARK_HOME=/home/spark build/spark), then build spark failed.

2017-03-28 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-20123:

Description: If the $SPARK_HOME or $FWDIR variable contains spaces, then building 
Spark with "./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.7 
-Phive -Phive-thriftserver -Pmesos -Pyarn" fails.  (was: If the $SPARK_HOME or 
$FWDIR variable contains spaces, then building Spark with "./dev/make-distribution.sh 
--r --name custom-spark --tgz -Psparkr -Phadoop-2.4 -Phive -Phive-thriftserver 
-Pmesos -Pyarn" fails.)

> $SPARK_HOME variable might have spaces in it(e.g. $SPARK_HOME=/home/spark 
> build/spark), then build spark failed.
> 
>
> Key: SPARK-20123
> URL: https://issues.apache.org/jira/browse/SPARK-20123
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: zuotingbing
>Priority: Minor
>
> If the $SPARK_HOME or $FWDIR variable contains spaces, then building Spark with 
> "./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.7 
> -Phive -Phive-thriftserver -Pmesos -Pyarn" fails.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14228) Lost executor of RPC disassociated, and occurs exception: Could not find CoarseGrainedScheduler or it has been stopped

2017-03-28 Thread Amitabh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944912#comment-15944912
 ] 

Amitabh commented on SPARK-14228:
-

Hi, can you specify the version you were working with? I have received the same 
error in 1.6.0

> Lost executor of RPC disassociated, and occurs exception: Could not find 
> CoarseGrainedScheduler or it has been stopped
> --
>
> Key: SPARK-14228
> URL: https://issues.apache.org/jira/browse/SPARK-14228
> Project: Spark
>  Issue Type: Bug
>Reporter: meiyoula
>
> When I start 1000 executors and then stop the process, it calls 
> SparkContext.stop to stop all executors. During this process, executors that 
> have already been killed lose their RPC connection with the driver and try to 
> reviveOffers, but cannot find CoarseGrainedScheduler because it has been stopped.
> {quote}
> 16/03/29 01:45:45 ERROR YarnScheduler: Lost executor 610 on 51-196-152-8: 
> remote Rpc client disassociated
> 16/03/29 01:45:45 ERROR Inbox: Ignoring error
> org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it 
> has been stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:161)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:173)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:398)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:314)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:482)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.removeExecutor(CoarseGrainedSchedulerBackend.scala:261)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$onDisconnected$1.apply(CoarseGrainedSchedulerBackend.scala:207)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$onDisconnected$1.apply(CoarseGrainedSchedulerBackend.scala:207)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.onDisconnected(CoarseGrainedSchedulerBackend.scala:207)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:144)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {quote}
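
To make the shutdown race described above concrete, here is a toy sketch. This is 
not Spark's RpcEnv or scheduler code; ToyDispatcher and all of its members are 
hypothetical, and it only illustrates one way a late message from an 
already-killed executor could be tolerated during shutdown instead of surfacing 
an error.

{code:scala}
import java.util.concurrent.atomic.AtomicBoolean

// Toy dispatcher, not Spark's code: illustrates the race between driver
// shutdown and late messages from killed executors.
final class ToyDispatcher {
  private val stopped = new AtomicBoolean(false)

  def stop(): Unit = stopped.set(true)

  // Strict send: mirrors the behaviour in the stack trace, where a message
  // arriving after stop() raises "Could not find CoarseGrainedScheduler".
  def send(msg: String): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("endpoint has been stopped")
    }
    println(s"delivered: $msg")
  }

  // Tolerant send: during shutdown, a late reviveOffers-style message is
  // dropped (and would normally just be logged) instead of raising an error.
  def sendIfRunning(msg: String): Unit = {
    if (stopped.get()) {
      println(s"dropped late message during shutdown: $msg")
    } else {
      println(s"delivered: $msg")
    }
  }
}

object ToyDispatcherDemo extends App {
  val dispatcher = new ToyDispatcher
  dispatcher.send("reviveOffers")          // normal operation: delivered
  dispatcher.stop()                        // driver begins shutting down
  dispatcher.sendIfRunning("reviveOffers") // late message: dropped, no error
}
{code}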



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10294) When Parquet writer's close method throws an exception, we will call close again and trigger a NPE

2017-03-28 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944887#comment-15944887
 ] 

Steve Loughran commented on SPARK-10294:


Consider it a failure in the exception logic: it tries to close streams and, 
if one is already fully or partially closed, an NPE is raised. The Parquet update may 
fix it inside Parquet, but the underlying problem still lurks. Robust exception cleanup 
would be resilient to it ever recurring.

Looking at the original cause, the S3 file was too big. That problem has gone away now; 
in tests, s3a will happily stream 80+ GB to a single blob.
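
A minimal sketch of the kind of defensive cleanup being suggested, assuming a 
plain java.io.Closeable writer rather than Parquet's actual writer API; 
RobustCleanup, closeQuietly, and writeWith are hypothetical names, not Spark or 
Parquet code.

{code:scala}
import java.io.Closeable

object RobustCleanup {

  // Close a resource at most once, and never let a secondary failure during
  // cleanup (such as the NPE seen here) replace the original exception.
  def closeQuietly(resource: Closeable): Unit = {
    if (resource != null) {
      try {
        resource.close()
      } catch {
        case _: Throwable => ()  // swallow: this is the cleanup path only
      }
    }
  }

  // Run a write body against a writer: on success, close() errors are
  // surfaced; on failure, the writer is closed quietly so the original write
  // error is the one that propagates to the caller.
  def writeWith[W <: Closeable](writer: W)(body: W => Unit): Unit = {
    var failed = false
    try {
      body(writer)
    } catch {
      case t: Throwable =>
        failed = true
        throw t
    } finally {
      if (failed) closeQuietly(writer) else writer.close()
    }
  }
}
{code}

The key design choice is that close() is only allowed to throw on the success 
path, so a failed write never triggers a second close on a half-closed stream.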

> When Parquet writer's close method throws an exception, we will call close 
> again and trigger a NPE
> --
>
> Key: SPARK-10294
> URL: https://issues.apache.org/jira/browse/SPARK-10294
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
> Attachments: screenshot-1.png
>
>
> When a task saves a large parquet file (larger than the S3 file size limit) 
> to S3, looks like we still call parquet writer's close twice and triggers NPE 
> reported in SPARK-7837. Eventually, job failed and I got NPE as the 
> exception. Actually, the real problem was that the file was too large for S3.
> {code}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1280)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1268)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1267)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1493)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1455)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1444)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1818)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1831)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1908)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:927)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:927)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>   

[jira] [Updated] (SPARK-20094) Should Prevent push down of IN subquery to Join operator

2017-03-28 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-20094:
-
Summary: Should Prevent push down of IN subquery to Join operator  (was: 
Putting predicate with IN subquery into join condition in ReorderJoin fails 
RewritePredicateSubquery.rewriteExistentialExpr)

> Should Prevent push down of IN subquery to Join operator
> 
>
> Key: SPARK-20094
> URL: https://issues.apache.org/jira/browse/SPARK-20094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>
> ReorderJoin collects all predicates and tries to put them into the join condition 
> when creating an ordered join. If a predicate with an IN subquery ends up in a join 
> condition instead of a filter condition, 
> `RewritePredicateSubquery.rewriteExistentialExpr` fails to convert the 
> subquery to an ExistenceJoin, and the query fails with an error (a sketch of the 
> kind of guard that would prevent this follows the quoted traceback below).
> For example, tpcds q45 fails due to the above reason:
> {noformat}
> spark-sql> explain codegen
>  > SELECT
>  >   ca_zip,
>  >   ca_city,
>  >   sum(ws_sales_price)
>  > FROM web_sales, customer, customer_address, date_dim, item
>  > WHERE ws_bill_customer_sk = c_customer_sk
>  >   AND c_current_addr_sk = ca_address_sk
>  >   AND ws_item_sk = i_item_sk
>  >   AND (substr(ca_zip, 1, 5) IN
>  >   ('85669', '86197', '88274', '83405', '86475', '85392', '85460', 
> '80348', '81792')
>  >   OR
>  >   i_item_id IN (SELECT i_item_id
>  >   FROM item
>  >   WHERE i_item_sk IN (2, 3, 5, 7, 11, 13, 17, 19, 23, 29)
>  >   )
>  > )
>  >   AND ws_sold_date_sk = d_date_sk
>  >   AND d_qoy = 2 AND d_year = 2001
>  > GROUP BY ca_zip, ca_city
>  > ORDER BY ca_zip, ca_city
>  > LIMIT 100;
> 17/03/25 15:27:02 ERROR SparkSQLDriver: Failed in [explain codegen
>   
> SELECT
>   ca_zip,
>   ca_city,
>   sum(ws_sales_price)
> FROM web_sales, customer, customer_address, date_dim, item
> WHERE ws_bill_customer_sk = c_customer_sk
>   AND c_current_addr_sk = ca_address_sk
>   AND ws_item_sk = i_item_sk
>   AND (substr(ca_zip, 1, 5) IN
>   ('85669', '86197', '88274', '83405', '86475', '85392', '85460', '80348', 
> '81792')
>   OR
>   i_item_id IN (SELECT i_item_id
>   FROM item
>   WHERE i_item_sk IN (2, 3, 5, 7, 11, 13, 17, 19, 23, 29)
>   )
> )
>   AND ws_sold_date_sk = d_date_sk
>   AND d_qoy = 2 AND d_year = 2001
> GROUP BY ca_zip, ca_city
> ORDER BY ca_zip, ca_city
> LIMIT 100]
> java.lang.UnsupportedOperationException: Cannot evaluate expression: list#1 []
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:224)
>   at 
> org.apache.spark.sql.catalyst.expressions.ListQuery.doGenCode(subquery.scala:262)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101)
>   at 
> org.apache.spark.sql.catalyst.expressions.In$$anonfun$3.apply(predicates.scala:199)
>   at 
> org.apache.spark.sql.catalyst.expressions.In$$anonfun$3.apply(predicates.scala:199)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.expressions.In.doGenCode(predicates.scala:199)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101)
>   at 
> org.apache.spark.sql.catalyst.expressions.Or.doGenCode(predicates.scala:379)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101)
>   at 
> 
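
To illustrate the direction the new summary points at (keep predicates that 
contain IN subqueries out of join conditions), here is a small self-contained 
sketch. It uses a toy expression model, not Catalyst's classes; Expr, 
InSubquery, ReorderJoinGuard, and the demo values are hypothetical stand-ins.

{code:scala}
// Toy expression model: just enough structure to show the idea.
sealed trait Expr { def children: Seq[Expr] = Nil }
case class Attr(name: String) extends Expr
case class InValues(value: Expr, list: Seq[Expr]) extends Expr {
  override def children: Seq[Expr] = value +: list
}
case class InSubquery(value: Expr) extends Expr {   // stands in for IN (SELECT ...)
  override def children: Seq[Expr] = Seq(value)
}
case class Or(left: Expr, right: Expr) extends Expr {
  override def children: Seq[Expr] = Seq(left, right)
}

object ReorderJoinGuard {
  // True if the predicate contains a subquery anywhere in its tree.
  def containsSubquery(e: Expr): Boolean =
    e.isInstanceOf[InSubquery] || e.children.exists(containsSubquery)

  // Split collected predicates: only subquery-free ones may go into the join
  // condition; the rest must stay in a Filter above the join so that the
  // existential rewrite can still handle them.
  def splitForJoin(predicates: Seq[Expr]): (Seq[Expr], Seq[Expr]) =
    predicates.partition(p => !containsSubquery(p))
}

object ReorderJoinGuardDemo extends App {
  val zipIn  = InValues(Attr("substr(ca_zip,1,5)"), Seq(Attr("'85669'"), Attr("'86197'")))
  val itemIn = InSubquery(Attr("i_item_id"))
  val mixed  = Or(zipIn, itemIn)                    // the q45-style OR predicate
  val (joinCond, keepInFilter) = ReorderJoinGuard.splitForJoin(Seq(zipIn, mixed))
  println(s"join condition: $joinCond")             // only the subquery-free predicate
  println(s"kept in filter: $keepInFilter")         // the OR with the IN subquery
}
{code}

Under this model, the mixed OR predicate from the q45-style query stays in the 
filter, so an existential rewrite of its subquery remains possible.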
