[jira] [Assigned] (SPARK-20719) Support LIMIT ALL

2017-05-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20719:


Assignee: Apache Spark  (was: Xiao Li)

> Support LIMIT ALL
> -
>
> Key: SPARK-20719
> URL: https://issues.apache.org/jira/browse/SPARK-20719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> `LIMIT ALL` is the same as omitting the `LIMIT` clause. It is supported by 
> both PostgreSQL and Presto. 






[jira] [Assigned] (SPARK-20719) Support LIMIT ALL

2017-05-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20719:


Assignee: Xiao Li  (was: Apache Spark)

> Support LIMIT ALL
> -
>
> Key: SPARK-20719
> URL: https://issues.apache.org/jira/browse/SPARK-20719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> `LIMIT ALL` is the same as omitting the `LIMIT` clause. It is supported by 
> both PostgreSQL and Presto. 






[jira] [Commented] (SPARK-20719) Support LIMIT ALL

2017-05-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007631#comment-16007631
 ] 

Apache Spark commented on SPARK-20719:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17960

> Support LIMIT ALL
> -
>
> Key: SPARK-20719
> URL: https://issues.apache.org/jira/browse/SPARK-20719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> `LIMIT ALL` is the same as omitting the `LIMIT` clause. It is supported by 
> both PostgreSQL and Presto. 






[jira] [Updated] (SPARK-20719) Support LIMIT ALL

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20719:

Description: `LIMIT ALL` is the same as omitting the `LIMIT` clause. It is 
supported by both PostgreSQL and Presto.  (was: `LIMIT ALL` is the same as 
omitting the `LIMIT` clause. It is supported by at least PostgreSQL and Presto.)

> Support LIMIT ALL
> -
>
> Key: SPARK-20719
> URL: https://issues.apache.org/jira/browse/SPARK-20719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> `LIMIT ALL` is the same as omitting the `LIMIT` clause. It is supported by 
> both PostgreSQL and Presto. 






[jira] [Created] (SPARK-20719) Support LIMIT ALL

2017-05-11 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20719:
---

 Summary: Support LIMIT ALL
 Key: SPARK-20719
 URL: https://issues.apache.org/jira/browse/SPARK-20719
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li
Assignee: Xiao Li


`LIMIT ALL` is the same as omitting the `LIMIT` clause. It is supported by at 
least PostgreSQL and Presto. 
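
As an illustration of the requested semantics (not part of the original report), a minimal Scala sketch; the view name {{t}} is an assumption, and the {{LIMIT ALL}} query only parses once this feature is implemented:

{code}
// Hypothetical sketch: once LIMIT ALL is supported, the two queries below
// should return the same rows.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("limit-all-sketch").master("local[*]").getOrCreate()
import spark.implicits._

Seq(1, 2, 3).toDF("value").createOrReplaceTempView("t")

val withoutLimit = spark.sql("SELECT value FROM t")           // no LIMIT clause
val withLimitAll = spark.sql("SELECT value FROM t LIMIT ALL") // should be equivalent

assert(withoutLimit.collect().toSet == withLimitAll.collect().toSet)
{code}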






[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-11 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007599#comment-16007599
 ] 

Marcelo Vanzin commented on SPARK-20608:


It looks like the configuration you're using for the {{hdfs}} command and the 
one you used for the {{spark-submit}} command are not the same. Check your 
environment variables.

> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> yarn.spark.access.namenodes only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode 
> and at least one standby namenode. 
> However, if yarn.spark.access.namenodes includes both active and standby 
> namenodes, the Spark application fails because the standby 
> namenode cannot be accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in 
> yarn.spark.access.namenodes would cause no harm, and my Spark application would 
> be able to survive a failover of the Hadoop namenode.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)






[jira] [Assigned] (SPARK-20665) Spark-sql, "Bround" and "Round" function return NULL

2017-05-11 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20665:
---

Assignee: liuxian

> Spark-sql, "Bround" and "Round" function return NULL
> 
>
> Key: SPARK-20665
> URL: https://issues.apache.org/jira/browse/SPARK-20665
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: liuxian
>Assignee: liuxian
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> >select bround(12.3, 2);
> >NULL
> For  this case, the expected result is 12.3, but it is null
> "Round" has the same problem:
> >select round(12.3, 2);
> >NULL






[jira] [Resolved] (SPARK-20665) Spark-sql, "Bround" and "Round" function return NULL

2017-05-11 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20665.
-
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.2.0
   2.0.3

Issue resolved by pull request 17906
[https://github.com/apache/spark/pull/17906]

> Spark-sql, "Bround" and "Round" function return NULL
> 
>
> Key: SPARK-20665
> URL: https://issues.apache.org/jira/browse/SPARK-20665
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: liuxian
> Fix For: 2.0.3, 2.2.0, 2.1.2
>
>
> >select bround(12.3, 2);
> >NULL
> For  this case, the expected result is 12.3, but it is null
> "Round" has the same problem:
> >select round(12.3, 2);
> >NULL






[jira] [Assigned] (SPARK-20718) FileSourceScanExec with different filter orders should be the same after canonicalization

2017-05-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20718:


Assignee: Apache Spark

> FileSourceScanExec with different filter orders should be the same after 
> canonicalization
> -
>
> Key: SPARK-20718
> URL: https://issues.apache.org/jira/browse/SPARK-20718
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>
> Since `constraints` in `QueryPlan` is a set, the order of filters can differ. 
> Usually this is ok because of canonicalization. However, in 
> `FileSourceScanExec`, its data filters and partition filters are sequences, 
> and their orders are not canonicalized. So `def sameResult` returns different 
> results for different orders of data/partition filters. This leads to, e.g. 
> different decisions for `ReuseExchange`, and thus results in unstable 
> performance.






[jira] [Assigned] (SPARK-20718) FileSourceScanExec with different filter orders should be the same after canonicalization

2017-05-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20718:


Assignee: (was: Apache Spark)

> FileSourceScanExec with different filter orders should be the same after 
> canonicalization
> -
>
> Key: SPARK-20718
> URL: https://issues.apache.org/jira/browse/SPARK-20718
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>
> Since `constraints` in `QueryPlan` is a set, the order of filters can differ. 
> Usually this is ok because of canonicalization. However, in 
> `FileSourceScanExec`, its data filters and partition filters are sequences, 
> and their orders are not canonicalized. So `def sameResult` returns different 
> results for different orders of data/partition filters. This leads to, e.g. 
> different decisions for `ReuseExchange`, and thus results in unstable 
> performance.






[jira] [Updated] (SPARK-20718) FileSourceScanExec with different filter orders should be the same after canonicalization

2017-05-11 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-20718:
-
Summary: FileSourceScanExec with different filter orders should be the same 
after canonicalization  (was: FileSourceScanExec with different filter orders 
should have the same result)

> FileSourceScanExec with different filter orders should be the same after 
> canonicalization
> -
>
> Key: SPARK-20718
> URL: https://issues.apache.org/jira/browse/SPARK-20718
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>
> Since `constraints` in `QueryPlan` is a set, the order of filters can differ. 
> Usually this is ok because of canonicalization. However, in 
> `FileSourceScanExec`, its data filters and partition filters are sequences, 
> and their orders are not canonicalized. So `def sameResult` returns different 
> results for different orders of data/partition filters. This leads to, e.g. 
> different decisions for `ReuseExchange`, and thus results in unstable 
> performance.






[jira] [Commented] (SPARK-20718) FileSourceScanExec with different filter orders should be the same after canonicalization

2017-05-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007582#comment-16007582
 ] 

Apache Spark commented on SPARK-20718:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/17959

> FileSourceScanExec with different filter orders should be the same after 
> canonicalization
> -
>
> Key: SPARK-20718
> URL: https://issues.apache.org/jira/browse/SPARK-20718
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>
> Since `constraints` in `QueryPlan` is a set, the order of filters can differ. 
> Usually this is ok because of canonicalization. However, in 
> `FileSourceScanExec`, its data filters and partition filters are sequences, 
> and their orders are not canonicalized. So `def sameResult` returns different 
> results for different orders of data/partition filters. This leads to, e.g. 
> different decisions for `ReuseExchange`, and thus results in unstable 
> performance.






[jira] [Assigned] (SPARK-20716) StateStore.abort() should not throw further exception

2017-05-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20716:


Assignee: Apache Spark  (was: Tathagata Das)

> StateStore.abort() should not throw further exception
> -
>
> Key: SPARK-20716
> URL: https://issues.apache.org/jira/browse/SPARK-20716
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
> Fix For: 2.2.0
>
>
> StateStore.abort() should do a best effort attempt to clean up temporary 
> resources. It should not throw errors.






[jira] [Assigned] (SPARK-20716) StateStore.abort() should not throw further exception

2017-05-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20716:


Assignee: Tathagata Das  (was: Apache Spark)

> StateStore.abort() should not throw further exception
> -
>
> Key: SPARK-20716
> URL: https://issues.apache.org/jira/browse/SPARK-20716
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.2.0
>
>
> StateStore.abort() should do a best effort attempt to clean up temporary 
> resources. It should not throw errors.






[jira] [Commented] (SPARK-20716) StateStore.abort() should not throw further exception

2017-05-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007581#comment-16007581
 ] 

Apache Spark commented on SPARK-20716:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/17958

> StateStore.abort() should not throw further exception
> -
>
> Key: SPARK-20716
> URL: https://issues.apache.org/jira/browse/SPARK-20716
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.2.0
>
>
> StateStore.abort() should do a best effort attempt to clean up temporary 
> resources. It should not throw errors.






[jira] [Created] (SPARK-20718) FileSourceScanExec with different filter orders should have the same result

2017-05-11 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-20718:


 Summary: FileSourceScanExec with different filter orders should 
have the same result
 Key: SPARK-20718
 URL: https://issues.apache.org/jira/browse/SPARK-20718
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Zhenhua Wang


Since `constraints` in `QueryPlan` is a set, the order of filters can differ. 
Usually this is ok because of canonicalization. However, in 
`FileSourceScanExec`, its data filters and partition filters are sequences, and 
their orders are not canonicalized. So `def sameResult` returns different 
results for different orders of data/partition filters. This leads to, e.g. 
different decisions for `ReuseExchange`, and thus results in unstable 
performance.
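
As a rough illustration of the idea (not the actual patch), an order-insensitive comparison can be obtained by sorting the filter sequences by some canonical key before comparing; the {{Filter}} case class and {{canonicalKey}} below are hypothetical stand-ins for Spark's expression canonicalization:

{code}
// Hypothetical sketch: compare two filter sequences irrespective of their order.
case class Filter(column: String, op: String, value: String) {
  // Stand-in for the canonical form of a predicate.
  def canonicalKey: String = s"$column $op $value"
}

def sameFilters(a: Seq[Filter], b: Seq[Filter]): Boolean =
  a.sortBy(_.canonicalKey) == b.sortBy(_.canonicalKey)

// Same predicates, different order: naive Seq equality is false, sorted comparison is true.
val f1 = Seq(Filter("a", ">", "1"), Filter("b", "=", "2"))
val f2 = Seq(Filter("b", "=", "2"), Filter("a", ">", "1"))
assert(f1 != f2)
assert(sameFilters(f1, f2))
{code}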






[jira] [Assigned] (SPARK-20399) Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string in parser

2017-05-11 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20399:
---

Assignee: Liang-Chi Hsieh

> Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string 
> in parser
> --
>
> Key: SPARK-20399
> URL: https://issues.apache.org/jira/browse/SPARK-20399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.2.0
>
>
> The new SQL parser introduced in Spark 2.0 seems to bring an issue 
> regarding regex pattern strings.
> The following code reproduces it:
> {code}
> val data = Seq("\u0020\u0021\u0023", "abc")
> val df = data.toDF()
> // 1st usage: works in 1.6
> // Let parser parse pattern string
> val rlike1 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")
> // 2nd usage: works in 1.6, 2.x
> // Call Column.rlike so the pattern string is a literal which doesn't go 
> through parser
> val rlike2 = df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$"))
> // In 2.x, we need to add backslashes to make the regex pattern parse correctly
> val rlike3 = df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'")
> {code}
> Due to the parser unescaping the SQL string, the first usage, which works in 1.6, 
> does not work in 2.0. To make it work, we need to add additional backslashes.
> It is quite odd that we can't use the same regex pattern string in the two 
> usages. I think we should not unescape the regex pattern string.






[jira] [Resolved] (SPARK-20399) Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string in parser

2017-05-11 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20399.
-
Resolution: Fixed

> Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string 
> in parser
> --
>
> Key: SPARK-20399
> URL: https://issues.apache.org/jira/browse/SPARK-20399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.2.0
>
>
> The new SQL parser introduced in Spark 2.0 seems to bring an issue 
> regarding regex pattern strings.
> The following code reproduces it:
> {code}
> val data = Seq("\u0020\u0021\u0023", "abc")
> val df = data.toDF()
> // 1st usage: works in 1.6
> // Let parser parse pattern string
> val rlike1 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")
> // 2nd usage: works in 1.6, 2.x
> // Call Column.rlike so the pattern string is a literal which doesn't go 
> through parser
> val rlike2 = df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$"))
> // In 2.x, we need to add backslashes to make the regex pattern parse correctly
> val rlike3 = df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'")
> {code}
> Due to the parser unescaping the SQL string, the first usage, which works in 1.6, 
> does not work in 2.0. To make it work, we need to add additional backslashes.
> It is quite odd that we can't use the same regex pattern string in the two 
> usages. I think we should not unescape the regex pattern string.






[jira] [Updated] (SPARK-20399) Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string in parser

2017-05-11 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-20399:

Fix Version/s: 2.2.0

> Can't use same regex pattern between 1.6 and 2.x due to unescaped sql string 
> in parser
> --
>
> Key: SPARK-20399
> URL: https://issues.apache.org/jira/browse/SPARK-20399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
> Fix For: 2.2.0
>
>
> The new SQL parser introduced in Spark 2.0 seems to bring an issue 
> regarding regex pattern strings.
> The following code reproduces it:
> {code}
> val data = Seq("\u0020\u0021\u0023", "abc")
> val df = data.toDF()
> // 1st usage: works in 1.6
> // Let parser parse pattern string
> val rlike1 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")
> // 2nd usage: works in 1.6, 2.x
> // Call Column.rlike so the pattern string is a literal which doesn't go 
> through parser
> val rlike2 = df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$"))
> // In 2.x, we need to add backslashes to make the regex pattern parse correctly
> val rlike3 = df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'")
> {code}
> Due to the parser unescaping the SQL string, the first usage, which works in 1.6, 
> does not work in 2.0. To make it work, we need to add additional backslashes.
> It is quite odd that we can't use the same regex pattern string in the two 
> usages. I think we should not unescape the regex pattern string.






[jira] [Assigned] (SPARK-20717) Tweak MapGroupsWithState update function behavior

2017-05-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20717:


Assignee: Apache Spark  (was: Tathagata Das)

> Tweak MapGroupsWithState update function behavior
> -
>
> Key: SPARK-20717
> URL: https://issues.apache.org/jira/browse/SPARK-20717
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
> Fix For: 2.2.0
>
>
> Timeout and state data are two independent entities and should be settable 
> independently. Therefore, in the same call of the user-defined function, one 
> should be able to set the timeout before initializing the state and also 
> after removing the state. Whether timeouts can be set or not should not 
> depend on the current state, and vice versa. 
> However, a limitation of the current implementation is that state cannot be 
> null while timeout is set. This is checked lazily after the function call has 
> completed.






[jira] [Assigned] (SPARK-20717) Tweak MapGroupsWithState update function behavior

2017-05-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20717:


Assignee: Tathagata Das  (was: Apache Spark)

> Tweak MapGroupsWithState update function behavior
> -
>
> Key: SPARK-20717
> URL: https://issues.apache.org/jira/browse/SPARK-20717
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.2.0
>
>
> Timeout and state data are two independent entities and should be settable 
> independently. Therefore, in the same call of the user-defined function, one 
> should be able to set the timeout before initializing the state and also 
> after removing the state. Whether timeouts can be set or not should not 
> depend on the current state, and vice versa. 
> However, a limitation of the current implementation is that state cannot be 
> null while timeout is set. This is checked lazily after the function call has 
> completed.






[jira] [Commented] (SPARK-20717) Tweak MapGroupsWithState update function behavior

2017-05-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007573#comment-16007573
 ] 

Apache Spark commented on SPARK-20717:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/17957

> Tweak MapGroupsWithState update function behavior
> -
>
> Key: SPARK-20717
> URL: https://issues.apache.org/jira/browse/SPARK-20717
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.2.0
>
>
> Timeout and state data are two independent entities and should be settable 
> independently. Therefore, in the same call of the user-defined function, one 
> should be able to set the timeout before initializing the state and also 
> after removing the state. Whether timeouts can be set or not should not 
> depend on the current state, and vice versa. 
> However, a limitation of the current implementation is that state cannot be 
> null while timeout is set. This is checked lazily after the function call has 
> completed.






[jira] [Updated] (SPARK-20717) Tweak MapGroupsWithState update function behavior

2017-05-11 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-20717:
--
Description: 
Timeout and state data are two independent entities and should be settable 
independently. Therefore, in the same call of the user-defined function, one 
should be able to set the timeout before initializing the state and also after 
removing the state. Whether timeouts can be set or not should not depend on the 
current state, and vice versa. 

However, a limitation of the current implementation is that state cannot be 
null while timeout is set. This is checked lazily after the function call has 
completed.



  was:
Timeout and state data are two independent entities and should be settable 
independently. Therefore, 

- In the same call of the user-defined function, one should be able to set the 
timeout before initializing the state. 
- Removing the state should not reset timeouts.

However, a limitation of the current implementation is that state cannot be 
null while a timeout is set by the time the function call is over. We should 
check this lazily.
 


> Tweak MapGroupsWithState update function behavior
> -
>
> Key: SPARK-20717
> URL: https://issues.apache.org/jira/browse/SPARK-20717
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.2.0
>
>
> Timeout and state data are two independent entities and should be settable 
> independently. Therefore, in the same call of the user-defined function, one 
> should be able to set the timeout before initializing the state and also 
> after removing the state. Whether timeouts can be set or not should not 
> depend on the current state, and vice versa. 
> However, a limitation of the current implementation is that state cannot be 
> null while timeout is set. This is checked lazily after the function call has 
> completed.






[jira] [Commented] (SPARK-20608) Standby namenodes should be allowed to included in yarn.spark.access.namenodes to support HDFS HA

2017-05-11 Thread Yuechen Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007561#comment-16007561
 ] 

Yuechen Chen commented on SPARK-20608:
--

I tried this solution, but ran into some problems.
I configured dfs.nameservices in hdfs-site.xml on my test machine, and the Hadoop 
client works: hdfs dfs -ls hdfs://mycluster/path
But with spark-submit it fails with the following exception.
17/05/12 10:33:57 INFO Client: Submitting application 
application_1487208985618_23772 to ResourceManager
17/05/12 10:33:59 INFO Client: Application report for 
application_1487208985618_23772 (state: FAILED)
17/05/12 10:33:59 INFO Client: 
 client token: N/A
 diagnostics: Unable to map logical nameservice URI 'hdfs://mycluster' 
to a NameNode. Local configuration does not have a failover proxy provider 
configured.
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
Does the same nameservice need to be configured in YARN as well, i.e. should the remote 
nameservice also be configured on the YARN resource manager?
I'm not clear about that.
Since putting the nameservice address in the configuration is the only recommended way to 
support HDFS HA, could someone solve this problem (if it is a bug) or add some examples to 
the Spark wiki?
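
For reference, a minimal sketch (my assumption of the standard HDFS client HA settings, not something from this thread) of what "a failover proxy provider configured" for a logical nameservice such as {{hdfs://mycluster}} looks like; host names and ports are placeholders, and in practice these keys usually live in hdfs-site.xml on every machine that needs to resolve the nameservice:

{code}
// Hypothetical sketch: standard HDFS client HA settings for a logical
// nameservice "mycluster". Host names and ports are placeholders.
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
conf.set("dfs.nameservices", "mycluster")
conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2")
conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode01:8020")
conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode02:8020")
conf.set("dfs.client.failover.proxy.provider.mycluster",
  "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
{code}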

> Standby namenodes should be allowed to included in 
> yarn.spark.access.namenodes to support HDFS HA
> -
>
> Key: SPARK-20608
> URL: https://issues.apache.org/jira/browse/SPARK-20608
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Yuechen Chen
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> If a Spark application needs to access remote namenodes, 
> yarn.spark.access.namenodes only needs to be configured in the spark-submit 
> script, and the Spark client (on YARN) will fetch HDFS credentials periodically.
> If a Hadoop cluster is configured for HA, there is one active namenode 
> and at least one standby namenode. 
> However, if yarn.spark.access.namenodes includes both active and standby 
> namenodes, the Spark application fails because the standby 
> namenode cannot be accessed by Spark (org.apache.hadoop.ipc.StandbyException).
> I think configuring standby namenodes in 
> yarn.spark.access.namenodes would cause no harm, and my Spark application would 
> be able to survive a failover of the Hadoop namenode.
> HA Examples:
> Spark-submit script: 
> yarn.spark.access.namenodes=hdfs://namenode01,hdfs://namenode02
> Spark Application Codes:
> dataframe.write.parquet(getActiveNameNode(...) + hdfsPath)






[jira] [Commented] (SPARK-18772) Parsing JSON with some NaN and Infinity values throws NumberFormatException

2017-05-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007556#comment-16007556
 ] 

Apache Spark commented on SPARK-18772:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17956

> Parsing JSON with some NaN and Infinity values throws NumberFormatException
> ---
>
> Key: SPARK-18772
> URL: https://issues.apache.org/jira/browse/SPARK-18772
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Nathan Howell
>Priority: Minor
>
> JacksonParser tests for infinite and NaN values in a way that is not 
> supported by the underlying float/double parser. For example, the input 
> string is always lowercased to check for {{-Infinity}} but the parser only 
> supports titlecased values. So a {{-infinitY}} will pass the test but fail 
> with a {{NumberFormatException}} when parsing. This exception is not caught 
> anywhere and the task ends up failing.
> A related issue is that the code checks for {{Inf}} but the parser only 
> supports the long form of {{Infinity}}.
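
To make the parser behavior described above concrete, a tiny Scala sketch (plain JDK double parsing, not the JacksonParser code itself):

{code}
// Double parsing is case-sensitive and only accepts the long form "Infinity".
"-Infinity".toDouble    // OK: -Infinity
"NaN".toDouble          // OK: NaN
// "-infinitY".toDouble // throws java.lang.NumberFormatException
// "-Inf".toDouble      // throws java.lang.NumberFormatException
{code}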






[jira] [Created] (SPARK-20717) Tweak MapGroupsWithState update function behavior

2017-05-11 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-20717:
-

 Summary: Tweak MapGroupsWithState update function behavior
 Key: SPARK-20717
 URL: https://issues.apache.org/jira/browse/SPARK-20717
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Tathagata Das
Assignee: Tathagata Das


Timeout and state data are two independent entities and should be settable 
independently. Therefore, 

- In the same call of the user-defined function, one should be able to set the 
timeout before initializing the state. 
- Removing the state should not reset timeouts.

However, a limitation of the current implementation is that state cannot be 
null while a timeout is set by the time the function call is over. We should 
check this lazily.
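
As an illustration of the update-function shape this is about (a hypothetical sketch, not code from the ticket), here is the pattern of removing state while still arming a timeout within one call; the {{Event}} type and the ten-minute duration are assumptions:

{code}
// Hypothetical sketch of a mapGroupsWithState update function that removes the
// state and still sets a timeout in the same call.
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(user: String, action: String)

def updateFunc(user: String, events: Iterator[Event], state: GroupState[Long]): String = {
  val evs = events.toSeq
  if (evs.exists(_.action == "logout")) {
    state.remove()                          // drop the per-user state ...
    state.setTimeoutDuration("10 minutes")  // ... but still arm a timeout
    s"$user: closed"
  } else {
    state.update(state.getOption.getOrElse(0L) + evs.size)
    state.setTimeoutDuration("10 minutes")
    s"$user: ${state.get} events"
  }
}

// Usage, assuming a streaming Dataset[Event] named `sessions`:
// sessions.groupByKey(_.user)
//   .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateFunc)
{code}

Per the description above, the current implementation rejects the first branch (state removed while a timeout is set) once the call completes.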
 






[jira] [Updated] (SPARK-20716) StateStore.abort() should not throw further exception

2017-05-11 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-20716:
--
Issue Type: Bug  (was: Sub-task)
Parent: (was: SPARK-19067)

> StateStore.abort() should not throw further exception
> -
>
> Key: SPARK-20716
> URL: https://issues.apache.org/jira/browse/SPARK-20716
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.2.0
>
>
> StateStore.abort() should do a best effort attempt to clean up temporary 
> resources. It should not throw errors.






[jira] [Commented] (SPARK-10408) Autoencoder

2017-05-11 Thread Jeremy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007535#comment-16007535
 ] 

Jeremy commented on SPARK-10408:


I mentioned this in the PR, but I also want to mention here that I'll do a code 
review within the next few days. 

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1) Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf]
> 2) Sparse autoencoder, i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3) Denoising autoencoder 
> 4) Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
>  
> 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11(3371–3408). 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf






[jira] [Created] (SPARK-20716) StateStore.abort() should not throw further exception

2017-05-11 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-20716:
-

 Summary: StateStore.abort() should not throw further exception
 Key: SPARK-20716
 URL: https://issues.apache.org/jira/browse/SPARK-20716
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Tathagata Das
Assignee: Tathagata Das


StateStore.abort() should do a best effort attempt to clean up temporary 
resources. It should not throw errors.
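
A minimal sketch of what best-effort clean-up means here (hypothetical, not the actual StateStore implementation): failures during abort are logged and swallowed rather than rethrown.

{code}
// Hypothetical sketch: a best-effort abort() that never propagates clean-up errors.
import scala.util.control.NonFatal

trait TempResource { def delete(): Unit }

class SketchStateStore(tempResources: Seq[TempResource]) {
  def abort(): Unit = {
    tempResources.foreach { r =>
      try r.delete()
      catch {
        case NonFatal(e) =>
          // Best effort: record the problem, never rethrow to the caller.
          System.err.println(s"Ignoring error while aborting state store: $e")
      }
    }
  }
}
{code}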






[jira] [Updated] (SPARK-20665) Spark-sql, "Bround" and "Round" function return NULL

2017-05-11 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-20665:

Affects Version/s: 2.0.0
   2.1.0

> Spark-sql, "Bround" and "Round" function return NULL
> 
>
> Key: SPARK-20665
> URL: https://issues.apache.org/jira/browse/SPARK-20665
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: liuxian
>
> >select bround(12.3, 2);
> >NULL
> For  this case, the expected result is 12.3, but it is null
> "Round" has the same problem:
> >select round(12.3, 2);
> >NULL






[jira] [Commented] (SPARK-13486) Move SQLConf into an internal package

2017-05-11 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007428#comment-16007428
 ] 

Reynold Xin commented on SPARK-13486:
-

Why is this troubling? SQLConf was previously package-private and could not be 
accessed from outside. After the move it is actually visible (with the package 
name {{internal}} signaling that it is an internal class).


> Move SQLConf into an internal package
> -
>
> Key: SPARK-13486
> URL: https://issues.apache.org/jira/browse/SPARK-13486
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> To improve project structure, it would be better if the top level packages 
> contain only public classes. For private ones such as SQLConf, we can move 
> them into org.apache.spark.sql.internal.






[jira] [Assigned] (SPARK-20715) MapStatuses shouldn't be redundantly stored in both ShuffleMapStage and MapOutputTracker

2017-05-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20715:


Assignee: Josh Rosen  (was: Apache Spark)

> MapStatuses shouldn't be redundantly stored in both ShuffleMapStage and 
> MapOutputTracker
> 
>
> Key: SPARK-20715
> URL: https://issues.apache.org/jira/browse/SPARK-20715
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Shuffle
>Affects Versions: 2.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Today the MapOutputTracker and ShuffleMapStage both maintain their own copies 
> of MapStatuses. This creates the potential for bugs in case these two pieces 
> of state become out of sync.
> I believe that we can improve our ability to reason about the code by storing 
> this information only in the MapOutputTracker. This can also help to reduce 
> driver memory consumption.
> I will provide more details in my PR, where I'll walk through the detailed 
> arguments as to why we can take these two different metadata tracking formats 
> and consolidate without loss of performance or correctness.






[jira] [Assigned] (SPARK-20715) MapStatuses shouldn't be redundantly stored in both ShuffleMapStage and MapOutputTracker

2017-05-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20715:


Assignee: Apache Spark  (was: Josh Rosen)

> MapStatuses shouldn't be redundantly stored in both ShuffleMapStage and 
> MapOutputTracker
> 
>
> Key: SPARK-20715
> URL: https://issues.apache.org/jira/browse/SPARK-20715
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Shuffle
>Affects Versions: 2.3.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Today the MapOutputTracker and ShuffleMapStage both maintain their own copies 
> of MapStatuses. This creates the potential for bugs in case these two pieces 
> of state become out of sync.
> I believe that we can improve our ability to reason about the code by storing 
> this information only in the MapOutputTracker. This can also help to reduce 
> driver memory consumption.
> I will provide more details in my PR, where I'll walk through the detailed 
> arguments as to why we can take these two different metadata tracking formats 
> and consolidate without loss of performance or correctness.






[jira] [Commented] (SPARK-20715) MapStatuses shouldn't be redundantly stored in both ShuffleMapStage and MapOutputTracker

2017-05-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007351#comment-16007351
 ] 

Apache Spark commented on SPARK-20715:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17955

> MapStatuses shouldn't be redundantly stored in both ShuffleMapStage and 
> MapOutputTracker
> 
>
> Key: SPARK-20715
> URL: https://issues.apache.org/jira/browse/SPARK-20715
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Shuffle
>Affects Versions: 2.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Today the MapOutputTracker and ShuffleMapStage both maintain their own copies 
> of MapStatuses. This creates the potential for bugs in case these two pieces 
> of state become out of sync.
> I believe that we can improve our ability to reason about the code by storing 
> this information only in the MapOutputTracker. This can also help to reduce 
> driver memory consumption.
> I will provide more details in my PR, where I'll walk through the detailed 
> arguments as to why we can take these two different metadata tracking formats 
> and consolidate without loss of performance or correctness.






[jira] [Created] (SPARK-20715) MapStatuses shouldn't be redundantly stored in both ShuffleMapStage and MapOutputTracker

2017-05-11 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-20715:
--

 Summary: MapStatuses shouldn't be redundantly stored in both 
ShuffleMapStage and MapOutputTracker
 Key: SPARK-20715
 URL: https://issues.apache.org/jira/browse/SPARK-20715
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler, Shuffle
Affects Versions: 2.3.0
Reporter: Josh Rosen
Assignee: Josh Rosen


Today the MapOutputTracker and ShuffleMapStage both maintain their own copies 
of MapStatuses. This creates the potential for bugs in case these two pieces of 
state become out of sync.

I believe that we can improve our ability to reason about the code by storing 
this information only in the MapOutputTracker. This can also help to reduce 
driver memory consumption.

I will provide more details in my PR, where I'll walk through the detailed 
arguments as to why we can take these two different metadata tracking formats 
and consolidate without loss of performance or correctness.






[jira] [Assigned] (SPARK-20714) Fix match error when watermark is set with timeout = no timeout / processing timeout

2017-05-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20714:


Assignee: Apache Spark  (was: Tathagata Das)

> Fix match error when watermark is set with timeout = no timeout / processing 
> timeout
> 
>
> Key: SPARK-20714
> URL: https://issues.apache.org/jira/browse/SPARK-20714
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
> Fix For: 2.2.0
>
>
> When watermark is set, and timeout conf is NoTimeout or ProcessingTimeTimeout 
> (both do not need the watermark), the query fails at runtime with the 
> following exception.
> {code}
> MatchException: 
> Some(org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate@1a9b798e)
>  (of class scala.Some)
> 
> org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec$$anonfun$doExecute$1.apply(FlatMapGroupsWithStateExec.scala:120)
> 
> org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec$$anonfun$doExecute$1.apply(FlatMapGroupsWithStateExec.scala:116)
> 
> org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:70)
> 
> org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:65)
> 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
> {code}






[jira] [Assigned] (SPARK-20714) Fix match error when watermark is set with timeout = no timeout / processing timeout

2017-05-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20714:


Assignee: Tathagata Das  (was: Apache Spark)

> Fix match error when watermark is set with timeout = no timeout / processing 
> timeout
> 
>
> Key: SPARK-20714
> URL: https://issues.apache.org/jira/browse/SPARK-20714
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.2.0
>
>
> When watermark is set, and timeout conf is NoTimeout or ProcessingTimeTimeout 
> (both do not need the watermark), the query fails at runtime with the 
> following exception.
> {code}
> MatchException: 
> Some(org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate@1a9b798e)
>  (of class scala.Some)
> 
> org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec$$anonfun$doExecute$1.apply(FlatMapGroupsWithStateExec.scala:120)
> 
> org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec$$anonfun$doExecute$1.apply(FlatMapGroupsWithStateExec.scala:116)
> 
> org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:70)
> 
> org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:65)
> 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
> {code}






[jira] [Commented] (SPARK-20714) Fix match error when watermark is set with timeout = no timeout / processing timeout

2017-05-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007306#comment-16007306
 ] 

Apache Spark commented on SPARK-20714:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/17954

> Fix match error when watermark is set with timeout = no timeout / processing 
> timeout
> 
>
> Key: SPARK-20714
> URL: https://issues.apache.org/jira/browse/SPARK-20714
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.2.0
>
>
> When watermark is set, and timeout conf is NoTimeout or ProcessingTimeTimeout 
> (both do not need the watermark), the query fails at runtime with the 
> following exception.
> {code}
> MatchException: 
> Some(org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate@1a9b798e)
>  (of class scala.Some)
> 
> org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec$$anonfun$doExecute$1.apply(FlatMapGroupsWithStateExec.scala:120)
> 
> org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec$$anonfun$doExecute$1.apply(FlatMapGroupsWithStateExec.scala:116)
> 
> org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:70)
> 
> org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:65)
> 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
> {code}






[jira] [Commented] (SPARK-13210) NPE in Sort

2017-05-11 Thread David McWhorter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007303#comment-16007303
 ] 

David McWhorter commented on SPARK-13210:
-

Is it worth reopening this issue, or creating a new one to track it?

> NPE in Sort
> ---
>
> Key: SPARK-13210
> URL: https://issues.apache.org/jira/browse/SPARK-13210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.6.1, 2.0.0
>
>
> When run TPCDS query Q78 with scale 10:
> {code}
> 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = 
> 268435456 bytes, TID = 143
> 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID 
> 143)
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
>   at org.apache.spark.scheduler.Task.run(Task.scala:81)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Created] (SPARK-20714) Fix match error when watermark is set with timeout = no timeout / processing timeout

2017-05-11 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-20714:
-

 Summary: Fix match error when watermark is set with timeout = no 
timeout / processing timeout
 Key: SPARK-20714
 URL: https://issues.apache.org/jira/browse/SPARK-20714
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Tathagata Das
Assignee: Tathagata Das


When a watermark is set and the timeout conf is NoTimeout or ProcessingTimeTimeout 
(neither of which needs the watermark), the query fails at runtime with the following 
exception.


{code}
MatchException: 
Some(org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate@1a9b798e)
 (of class scala.Some)

org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec$$anonfun$doExecute$1.apply(FlatMapGroupsWithStateExec.scala:120)

org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec$$anonfun$doExecute$1.apply(FlatMapGroupsWithStateExec.scala:116)

org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:70)

org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:65)

org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
{code}
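
For context, here is a minimal sketch of this failure class (illustrative names only, 
not the actual FlatMapGroupsWithStateExec code): a match over the optional watermark 
predicate that only covers the combinations the author expected, so a defined predicate 
together with a timeout that does not need it falls through as a MatchError.

{code}
// Hypothetical illustration of the bug class: the match assumes the watermark
// predicate is only present when event-time timeout is used, so a defined
// predicate combined with NoTimeout/ProcessingTimeTimeout is unmatched.
sealed trait Timeout
case object NoTimeout extends Timeout
case object ProcessingTimeTimeout extends Timeout
case object EventTimeTimeout extends Timeout

def filterLateRows(rows: Iterator[Int],
                   timeout: Timeout,
                   watermarkPredicate: Option[Int => Boolean]): Iterator[Int] =
  (timeout, watermarkPredicate) match {
    case (EventTimeTimeout, Some(p)) => rows.filter(p)
    case (_, None)                   => rows
    // missing: case (NoTimeout | ProcessingTimeTimeout, Some(_)) => rows
  }

// filterLateRows(Iterator(1, 2), NoTimeout, Some(_ > 0))  // scala.MatchError: Some(...)
// A fix is to simply ignore the predicate when the timeout does not need it:
//   case (_, Some(_)) => rows
{code}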



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError

2017-05-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-20666:
-
Target Version/s: 2.2.0

> Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError
> ---
>
> Key: SPARK-20666
> URL: https://issues.apache.org/jira/browse/SPARK-20666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core, SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Critical
>
> Seeing quite a bit of this on AppVeyor (i.e. Windows only), and it seems to 
> show up in other test runs too, but always only when running the ML tests.
> {code}
> Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 159454
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)
>   at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
> 1
> MLlib recommendation algorithms: Spark package found in SPARK_HOME: 
> C:\projects\spark\bin\..
> {code}
> {code}
> java.lang.IllegalStateException: SparkContext has been shutdown
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
>   at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
>   at 

[jira] [Updated] (SPARK-20483) Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20483:

Fix Version/s: (was: 2.2.1)
   2.2.0

> Mesos Coarse mode may starve other Mesos frameworks if max cores is not a 
> multiple of executor cores
> 
>
> Key: SPARK-20483
> URL: https://issues.apache.org/jira/browse/SPARK-20483
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.0.2, 2.1.0, 2.1.1
>Reporter: Davis Shepherd
>Assignee: Davis Shepherd
>Priority: Minor
> Fix For: 2.2.0
>
>
> if {{spark.cores.max = 10}} for example and {{spark.executor.cores = 4}}, 2 
> executors will get launched thus {{totalCoresAcquired = 8}}. All future Mesos 
> offers will not get tasks launched because 
> {{sc.conf.getInt("spark.executor.cores", ...) + totalCoresAcquired <= 
> maxCores}} will always evaluate to false.  However, in 
> {{handleMatchedOffers}} we check if {{totalCoresAcquired >= maxCores}} to 
> determine if we should decline the offer "for a configurable amount of time 
> to avoid starving other frameworks", and this will always evaluate to false 
> in the above scenario. This leaves the framework in a state of limbo where it 
> will never launch any new executors, but only decline offers for the Mesos 
> default of 5 seconds, thus starving other frameworks of offers.
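
As a hedged, simplified sketch of the two checks described above (illustrative names, not 
the actual MesosCoarseGrainedSchedulerBackend code), the limbo state looks like this:

{code}
val maxCores = 10           // spark.cores.max
val executorCores = 4       // spark.executor.cores
var totalCoresAcquired = 8  // two executors already launched

// Check used when launching executors: no further executor fits.
val canLaunchMore = totalCoresAcquired + executorCores <= maxCores   // false

// Check used to decide whether to decline offers for a long time:
val declineForLong = totalCoresAcquired >= maxCores                  // also false

// Result: the framework neither launches executors nor declines offers for
// the configurable period, so it keeps receiving and briefly holding offers.
// One possible fix (sketch): treat "no further executor can fit" like "at max".
val shouldDeclineForLong = totalCoresAcquired + executorCores > maxCores
{code}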



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14471) The alias created in SELECT could be used in GROUP BY and followed expressions

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-14471:

Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> The alias created in SELECT could be used in GROUP BY and followed expressions
> --
>
> Key: SPARK-14471
> URL: https://issues.apache.org/jira/browse/SPARK-14471
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Takeshi Yamamuro
> Fix For: 2.2.0
>
>
> This query should be able to run:
> {code}
> select a a1, a1 + 1 as b, count(1) from t group by a1
> {code}
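
For illustration, a sketch of what resolving the SELECT-list alias amounts to, assuming a 
table {{t}} with an integer column {{a}}: the alias in GROUP BY is substituted back to its 
defining expression.

{code}
// Equivalent query after substituting the alias a1 back to its definition:
spark.sql("SELECT a AS a1, a + 1 AS b, count(1) FROM t GROUP BY a")
{code}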



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20426) OneForOneStreamManager occupies too much memory.

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20426:

Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> OneForOneStreamManager occupies too much memory.
> 
>
> Key: SPARK-20426
> URL: https://issues.apache.org/jira/browse/SPARK-20426
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
> Fix For: 2.2.0
>
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Spark jobs are running on a yarn cluster in my warehouse. We enabled the 
> external shuffle service (*--conf spark.shuffle.service.enabled=true*). 
> Recently the NodeManager runs OOM now and then. Dumping heap memory, we find 
> that *OneForOneStreamManager*'s footprint is huge. The NodeManager is 
> configured with a 5G heap, while *OneForOneStreamManager* costs 2.5G and 
> there are 5503233 *FileSegmentManagedBuffer* objects. Are there any 
> suggestions to avoid this other than just increasing the NodeManager's 
> memory? Is it possible to stop *registerStream* in OneForOneStreamManager so 
> that we don't need to cache so much metadata (i.e. StreamState)?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20482) Resolving Casts is too strict on having time zone set

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20482:

Fix Version/s: (was: 2.2.1)
   2.2.0

> Resolving Casts is too strict on having time zone set
> -
>
> Key: SPARK-20482
> URL: https://issues.apache.org/jira/browse/SPARK-20482
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kris Mok
>Assignee: Kris Mok
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20471) Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled but vectorized hashmap is enabled.

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20471:

Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled 
> but vectorized hashmap is enabled.
> -
>
> Key: SPARK-20471
> URL: https://issues.apache.org/jira/browse/SPARK-20471
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: caoxuewen
>Assignee: caoxuewen
> Fix For: 2.2.0
>
>
> Remove the AggregateBenchmark test suite warning, such as '14:26:33.220 WARN 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec: Two level hashmap 
> is disabled but vectorized hashmap is enabled.'
> Unit tests: AggregateBenchmark.
> Change the 'ignore' function to the 'test' function.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20476) Exception between "create table as" and "get_json_object"

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20476:

Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> Exception between "create table as" and "get_json_object"
> -
>
> Key: SPARK-20476
> URL: https://issues.apache.org/jira/browse/SPARK-20476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> I encountered this problem when creating a table as select with 
> get_json_object from a table.
> This fails:
> {code}
> create table spark_json_object as
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> It is ok.
> {code}
> create table spark_json_object as
> select *
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> It is ok
> {code}
> select get_json_object(deliver_geojson,'$.')
> from dw.dw_prd_order where dt='2017-04-24' limit 10;
> {code}
> {code}
> 17/04/26 23:12:56 ERROR [hive.log(397) -- main]: error in initSerDe: 
> org.apache.hadoop.hive.serde2.SerDeException 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> org.apache.hadoop.hive.serde2.SerDeException: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 2 elements 
> while columns.types has 1 elements!
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:146)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:85)
> at 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
> at 
> org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
> at 
> org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
> at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:197)
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:699)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:455)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:309)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:454)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:237)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:199)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:248)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$lzycompute$1(CreateHiveTableAsSelectCommand.scala:72)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.metastoreRelation$1(CreateHiveTableAsSelectCommand.scala:48)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:91)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
> at org.apache.spark.sql.Dataset.(Dataset.scala:179)
> at 

[jira] [Updated] (SPARK-20047) Constrained Logistic Regression

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20047:

Fix Version/s: (was: 2.2.1)
   2.2.0

> Constrained Logistic Regression
> ---
>
> Key: SPARK-20047
> URL: https://issues.apache.org/jira/browse/SPARK-20047
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Yanbo Liang
> Fix For: 2.2.0
>
>
> For certain applications, such as stacked regressions, it is important to put 
> non-negative constraints on the regression coefficients. Also, if the ranges 
> of coefficients are known, it makes sense to constrain the coefficient search 
> space.
> Fitting generalized constrained regression models subject to Cβ ≤ b, where C ∈ 
> R^\{m×p\} and b ∈ R^\{m\} are predefined matrices and vectors that place a 
> set of m linear constraints on the coefficients, is very challenging, as 
> discussed extensively in the literature.
> However, for box constraints on the coefficients, the optimization is well 
> solved. For gradient descent, one can use projected gradient descent in the 
> primal by zeroing out the negative weights at each step. For LBFGS, an 
> extended version of it, LBFGS-B, can handle large-scale box optimization 
> efficiently. Unfortunately, for OWLQN there is no good efficient way to do 
> optimization with box constraints.
> As a result, in this work, we only implement constrained LR with box 
> constraints and without L1 regularization.
> Note that since we standardize the data in the training phase, the 
> coefficients seen in the optimization subroutine are in the scaled space; as 
> a result, we need to convert the box constraints into the scaled space.
> Users will be able to set the lower / upper bounds of each coefficient and 
> intercept.
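
A minimal sketch of that box-constraint conversion, assuming standardization divides each 
feature by its standard deviation (so a coefficient in the scaled space equals the raw 
coefficient times the feature's standard deviation); this is illustrative, not the actual 
ML implementation:

{code}
// Sketch only: converting user-supplied box constraints on raw coefficients
// into the standardized space used by the box-constrained optimizer
// (e.g. LBFGS-B). Assumes featureStd(i) > 0 for every feature.
def toScaledBounds(lower: Array[Double],
                   upper: Array[Double],
                   featureStd: Array[Double]): (Array[Double], Array[Double]) = {
  // beta_scaled = beta_raw * std, so the bounds scale the same way.
  val scaledLower = lower.zip(featureStd).map { case (l, std) => l * std }
  val scaledUpper = upper.zip(featureStd).map { case (u, std) => u * std }
  (scaledLower, scaledUpper)
}
{code}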



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20517) Download link in history server UI is not correct

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20517:

Fix Version/s: (was: 2.2.1)
   2.2.0

> Download link in history server UI is not correct
> -
>
> Key: SPARK-20517
> URL: https://issues.apache.org/jira/browse/SPARK-20517
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.1.2, 2.2.0
>
>
> The download link in history server UI is concatenated with:
> {code}
>class="btn btn-info btn-mini">Download
> {code}
> Here the {{num}} field represents the number of attempts, which does not 
> match the REST API. In the REST API, if the attempt id does not exist, the 
> {{num}} field should be empty; otherwise the {{num}} field should actually be 
> the {{attemptId}}.
> This leads to a "no such app" error rather than correctly downloading the 
> event log.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20549) java.io.CharConversionException: Invalid UTF-32 in JsonToStructs

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20549:

Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> java.io.CharConversionException: Invalid UTF-32 in JsonToStructs
> 
>
> Key: SPARK-20549
> URL: https://issues.apache.org/jira/browse/SPARK-20549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.2.0
>
>
> The same fix for SPARK-16548 needs to be applied for JsonToStructs



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20459) JdbcUtils throws IllegalStateException: Cause already initialized after getting SQLException

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20459:

Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> JdbcUtils throws IllegalStateException: Cause already initialized after 
> getting SQLException
> 
>
> Key: SPARK-20459
> URL: https://issues.apache.org/jira/browse/SPARK-20459
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.0.2, 2.1.0
>Reporter: Jessie Yu
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.2.0
>
>
> While testing some failure scenarios, JdbcUtils throws an 
> IllegalStateException instead of the expected SQLException:
> {code}
> scala> 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(prodtbl,url3,"DB2.D_ITEM_INFO",prop1)
>  
> 17/04/03 17:19:35 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)  
>   
> java.lang.IllegalStateException: Cause already initialized
>   
> .at java.lang.Throwable.setCause(Throwable.java:365)  
>   
> .at java.lang.Throwable.initCause(Throwable.java:341) 
>   
> .at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:241)
> .at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:300)
> .at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:299)
> .at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> .at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> .at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>  
> .at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>   
> .at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   
> .at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   
> .at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
> .at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153
> .at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628
> .at java.lang.Thread.run(Thread.java:785) 
>   
> {code}
> The code in JdbcUtils.savePartition has 
> {code}
> } catch {
>   case e: SQLException =>
> val cause = e.getNextException
> if (cause != null && e.getCause != cause) {
>   if (e.getCause == null) {
> e.initCause(cause)
>   } else {
> e.addSuppressed(cause)
>   }
> }
> {code}
> According to Throwable Java doc, {{initCause()}} throws an 
> {{IllegalStateException}} "if this throwable was created with 
> Throwable(Throwable) or Throwable(String,Throwable), or this method has 
> already been called on this throwable". The code does check whether {{cause}} 
> is {{null}} before initializing it. However, {{getCause()}} "returns the 
> cause of this throwable or null if the cause is nonexistent or unknown." In 
> other words, {{null}} is returned if {{cause}} already exists (which would 
> result in {{IllegalStateException}}) but is unknown. 
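
One defensive way to chain the next exception without tripping over an already-initialized 
(or unknown) cause is sketched below; this is an illustration, not necessarily what the 
actual fix does.

{code}
import java.sql.SQLException

// Sketch: attach e.getNextException without risking IllegalStateException.
def chainNextException(e: SQLException): Unit = {
  val next = e.getNextException
  if (next != null && next != e.getCause) {
    try {
      e.initCause(next)   // succeeds only if the cause was never initialized
    } catch {
      case _: IllegalStateException =>
        e.addSuppressed(next)   // cause already set (or set-but-unknown)
    }
  }
}
{code}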



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20534) Outer generators skip missing records if used alone

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20534:

Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> Outer generators skip missing records if used alone
> ---
>
> Key: SPARK-20534
> URL: https://issues.apache.org/jira/browse/SPARK-20534
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: master 814a61a867ded965433c944c90961df529ac83ab
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
> Fix For: 2.2.0
>
>
> Example data:
> {code}
> val df = Seq(
>   (1, Some("a" :: "b" :: "c" :: Nil)), 
>   (2, None), 
>   (3, Some("a" :: Nil)
> )).toDF("k", "vs")
> {code}
> Correct behavior if there are other expressions:
> {code}
> df.select($"k", explode_outer($"vs")).show
> // +---++
> // |  k| col|
> // +---++
> // |  1|   a|
> // |  1|   b|
> // |  1|   c|
> // |  2|null|
> // |  3|   a|
> // +---++
> df.select($"k", posexplode_outer($"vs")).show
> // +---+++
> // |  k| pos| col|
> // +---+++
> // |  1|   0|   a|
> // |  1|   1|   b|
> // |  1|   2|   c|
> // |  2|null|null|
> // |  3|   0|   a|
> // +---+++
> {code}
> Incorrect behavior if used alone:
> {code}
> df.select(explode_outer($"vs")).show
> // +---+
> // |col|
> // +---+
> // |  a|
> // |  b|
> // |  c|
> // |  a|
> // +---+
> df.select(posexplode_outer($"vs")).show
> // +---+---+
> // |pos|col|
> // +---+---+
> // |  0|  a|
> // |  1|  b|
> // |  2|  c|
> // |  0|  a|
> // +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20537) OffHeapColumnVector reallocation may not copy existing data

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20537:

Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> OffHeapColumnVector reallocation may not copy existing data
> ---
>
> Key: SPARK-20537
> URL: https://issues.apache.org/jira/browse/SPARK-20537
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.0
>
>
> As SPARK-20474 revealed, reallocation in {{OnHeapColumnVector}} may copy only 
> a part of the original storage.
> {{OffHeapColumnVector}} reallocation likewise copies data to the new storage 
> only up to {{elementsAppended}}. This variable is only updated by the 
> ColumnVector.appendX API, while ColumnVector.putX is more commonly used.
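
A simplified model of the bug pattern (illustrative only, not the actual 
OffHeapColumnVector code): values written via put are not counted by 
elementsAppended, so a later reallocation silently drops them.

{code}
class TinyVector(initialCapacity: Int) {
  private var data = new Array[Int](initialCapacity)
  var elementsAppended = 0                        // only appendInt bumps this

  def appendInt(v: Int): Unit = {
    reserve(elementsAppended + 1)
    data(elementsAppended) = v
    elementsAppended += 1
  }

  def putInt(i: Int, v: Int): Unit = data(i) = v  // does NOT bump elementsAppended

  def reserve(capacity: Int): Unit = if (capacity > data.length) {
    val newData = new Array[Int](capacity * 2)
    // Bug pattern: only the appended prefix survives reallocation; rows written
    // with putInt beyond elementsAppended are silently dropped.
    System.arraycopy(data, 0, newData, 0, elementsAppended)
    data = newData
  }

  def getInt(i: Int): Int = data(i)
}

// val v = new TinyVector(4)
// v.reserve(8); (0 until 8).foreach(i => v.putInt(i, i))
// v.reserve(20)   // reallocates but copies only elementsAppended == 0 entries
// v.getInt(3)     // 0 instead of 3
{code}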



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-11968:

Fix Version/s: (was: 2.2.1)
   2.2.0

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Peng Meng
> Fix For: 2.2.0
>
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20621) Delete deprecated config parameter in 'spark-env.sh'

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20621:

Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> Delete deprecated config parameter in 'spark-env.sh'
> 
>
> Key: SPARK-20621
> URL: https://issues.apache.org/jira/browse/SPARK-20621
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: coneyliu
>Assignee: coneyliu
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently, `spark.executor.instances` is deprecated in `spark-env.sh`, 
> because we suggest configuring it in `spark-defaults.conf` or another config 
> file. This parameter also has no effect even if you set it in 
> `spark-env.sh`, so remove it in this patch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20546) spark-class gets syntax error in posix mode

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20546:

Fix Version/s: (was: 2.2.1)
   2.2.0

> spark-class gets syntax error in posix mode
> ---
>
> Key: SPARK-20546
> URL: https://issues.apache.org/jira/browse/SPARK-20546
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.2
>Reporter: Jessie Yu
>Assignee: Jessie Yu
>Priority: Minor
> Fix For: 2.1.2, 2.2.0
>
>
> spark-class gets the following error when running in posix mode:
> {code}
> spark-class: line 78: syntax error near unexpected token `<'
> spark-class: line 78: `done < <(build_command "$@")'
> {code}
> \\
> It appears to be complaining about the process substitution: 
> {code}
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <(build_command "$@")
> {code}
> \\
> This can be reproduced by first turning on allexport then posix mode:
> {code}set -a -o posix {code}
> then run something like spark-shell which calls spark-class.
> \\
> The simplest fix is probably to always turn off posix mode in spark-class 
> before the while loop.
> \\
> This was previously reported in 
> [SPARK-8417|https://issues.apache.org/jira/browse/SPARK-8417], which was 
> closed as cannot-reproduce.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20421) Mark JobProgressListener (and related classes) as deprecated

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20421:

Fix Version/s: (was: 2.2.1)
   2.2.0

> Mark JobProgressListener (and related classes) as deprecated
> 
>
> Key: SPARK-20421
> URL: https://issues.apache.org/jira/browse/SPARK-20421
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.2.0
>
>
> This class (and others) were made {{@DeveloperApi}} as part of 
> https://github.com/apache/spark/pull/648. But as part of the work in 
> SPARK-18085, I plan to get rid of a lot of that code, so we should mark these 
> as deprecated in case anyone is still trying to use them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20558) clear InheritableThreadLocal variables in SparkContext when stopping it

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20558:

Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> clear InheritableThreadLocal variables in SparkContext when stopping it
> ---
>
> Key: SPARK-20558
> URL: https://issues.apache.org/jira/browse/SPARK-20558
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20596) Improve ALS recommend all test cases

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20596:

Fix Version/s: (was: 2.2.1)
   2.2.0

> Improve ALS recommend all test cases
> 
>
> Key: SPARK-20596
> URL: https://issues.apache.org/jira/browse/SPARK-20596
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>Priority: Minor
> Fix For: 2.2.0
>
>
> Existing test cases for `recommendForAllX` methods in SPARK-19535 test {{k}} 
> < num items and {{k}} = num items. Technically we should also test that {{k}} 
> > num items returns the same results as {{k}} = num items.
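
A sketch of the proposed extra check, assuming a fitted {{ALSModel}} named {{model}} and a 
test fixture where {{numItems}} is the number of distinct items; illustrative only, not 
the actual test code.

{code}
val atNumItems    = model.recommendForAllUsers(numItems).collect()
val aboveNumItems = model.recommendForAllUsers(numItems + 5).collect()
// Asking for more recommendations than there are items should not change the result.
assert(atNumItems.map(_.toString).sorted
  .sameElements(aboveNumItems.map(_.toString).sorted))
{code}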



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20587) Improve performance of ML ALS recommendForAll

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20587:

Fix Version/s: (was: 2.2.1)
   2.2.0

> Improve performance of ML ALS recommendForAll
> -
>
> Key: SPARK-20587
> URL: https://issues.apache.org/jira/browse/SPARK-20587
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
> Fix For: 2.2.0
>
>
> SPARK-11968 relates to excessive GC pressure from using the "blocked BLAS 3" 
> approach for generating top-k recommendations in 
> {{mllib.recommendation.MatrixFactorizationModel}}.
> The solution there is still based on blocking factors, but efficiently 
> computes the top-k elements *per block* first (using 
> {{BoundedPriorityQueue}}) and then computes the global top-k elements.
> This improves performance and GC pressure substantially for {{mllib}}'s ALS 
> model. The same approach is also a lot more efficient than the current 
> "crossJoin and score per-row" used in {{ml}}'s {{DataFrame}}-based method. 
> This adapts the solution in SPARK-11968 for {{DataFrame}}.
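
A very simplified sketch of the per-block top-k idea described above (not the actual ML 
implementation, and using a plain mutable.PriorityQueue as a stand-in for a bounded 
priority queue): keep only the k best (item, score) pairs per user within a block, then 
take the top-k of the concatenated per-block results.

{code}
import scala.collection.mutable

def topKPerBlock(scores: Iterator[(Int, (Int, Float))],   // (user, (item, score))
                 k: Int): Iterator[(Int, Seq[(Int, Float)])] = {
  val queues = mutable.Map.empty[Int, mutable.PriorityQueue[(Int, Float)]]
  scores.foreach { case (user, rec) =>
    // min-heap by score, so the head is always the weakest of the k kept so far
    val q = queues.getOrElseUpdate(
      user, mutable.PriorityQueue.empty(Ordering.by[(Int, Float), Float](-_._2)))
    q.enqueue(rec)
    if (q.size > k) q.dequeue()                  // evict the current minimum
  }
  queues.iterator.map { case (user, q) => (user, q.dequeueAll.reverse) }
}
{code}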



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20667) Cleanup the cataloged metadata after completing the package of sql/core and sql/hive

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20667:

Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> Cleanup the cataloged metadata after completing the package of sql/core and 
> sql/hive
> 
>
> Key: SPARK-20667
> URL: https://issues.apache.org/jira/browse/SPARK-20667
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> So far, we do not drop all the cataloged objects after each test package 
> finishes. Sometimes we hit strange test-case errors because a previous test 
> suite did not drop its tables/functions/databases. At a minimum, we can clean 
> up the environment when completing the packages of sql/core and sql/hive.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20627) Remove pip local version string (PEP440)

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20627:

Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> Remove pip local version string (PEP440)
> 
>
> Key: SPARK-20627
> URL: https://issues.apache.org/jira/browse/SPARK-20627
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.3.0
>Reporter: holdenk
>Assignee: holdenk
> Fix For: 2.1.2, 2.2.0
>
>
> In the make-distribution script we currently append the Hadoop version 
> string, but this makes uploading to PyPI difficult, and we don't cross-build 
> for multiple Hadoop versions anymore.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20548) Flaky Test: ReplSuite.newProductSeqEncoder with REPL defined class

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20548:

Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> Flaky Test:  ReplSuite.newProductSeqEncoder with REPL defined class
> ---
>
> Key: SPARK-20548
> URL: https://issues.apache.org/jira/browse/SPARK-20548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.2.0
>
>
> {{newProductSeqEncoder with REPL defined class}} in {{ReplSuite}} has been 
> failing non-deterministically over the last few days: 
> https://spark-tests.appspot.com/failed-tests
> https://spark.test.databricks.com/job/spark-master-test-sbt-hadoop-2.7/176/testReport/junit/org.apache.spark.repl/ReplSuite/newProductSeqEncoder_with_REPL_defined_class/history/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20615) SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector has a size greater than zero but no elements defined.

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20615:

Fix Version/s: (was: 2.2.1)
   2.2.0

> SparseVector.argmax throws IndexOutOfBoundsException when the sparse vector 
> has a size greater than zero but no elements defined.
> -
>
> Key: SPARK-20615
> URL: https://issues.apache.org/jira/browse/SPARK-20615
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Jon McLean
>Assignee: Jon McLean
>Priority: Minor
> Fix For: 2.1.2, 2.2.0
>
>
> org.apache.spark.ml.linalg.SparseVector.argmax throws an 
> IndexOutOfBoundsException when the vector size is greater than zero and no 
> values are defined. The toString() representation of such a vector is 
> "(10,[],[])". This is because the argmax function tries to get the value at 
> indexes(0) without checking the size of the array.
> Code inspection reveals that the mllib version of SparseVector should have 
> the same issue.
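
The report can be reproduced as below, followed by a guarded sketch of the intended 
behaviour. The sketch is simplified (it ignores the case where implicit zeros beat stored 
negative values, which a full fix must also handle) and assumes returning -1 for a 
zero-size vector.

{code}
import org.apache.spark.ml.linalg.Vectors

// Positive size, but no indices/values stored.
val v = Vectors.sparse(10, Array.empty[Int], Array.empty[Double])
// v.argmax   // IndexOutOfBoundsException in the affected versions

def safeArgmax(size: Int, indices: Array[Int], values: Array[Double]): Int =
  if (size == 0) -1                  // empty vector: no argmax
  else if (values.isEmpty) 0         // only implicit zeros: first index wins
  else indices(values.indexOf(values.max))
{code}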



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20373) Batch queries with 'Dataset/DataFrame.withWatermark()` does not execute

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20373:

Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> Batch queries with 'Dataset/DataFrame.withWatermark()` does not execute
> ---
>
> Key: SPARK-20373
> URL: https://issues.apache.org/jira/browse/SPARK-20373
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Tathagata Das
>Assignee: Genmao Yu
>Priority: Minor
> Fix For: 2.2.0
>
>
> Any Dataset/DataFrame batch query with the operation `withWatermark` does not 
> execute because the batch planner does not have any rule to explicitly handle 
> the EventTimeWatermark logical plan. The right solution is to simply remove 
> the plan node, as the watermark should not affect any batch query in any way.
> {code}
> from pyspark.sql.functions import *
> eventsDF = spark.createDataFrame([("2016-03-11 09:00:07", "dev1", 
> 123)]).toDF("eventTime", "deviceId", 
> "signal").select(col("eventTime").cast("timestamp").alias("eventTime"), 
> "deviceId", "signal")
> windowedCountsDF = \
>   eventsDF \
> .withWatermark("eventTime", "10 minutes") \
> .groupBy(
>   "deviceId",
>   window("eventTime", "5 minutes")) \
> .count()
> windowedCountsDF.collect()
> {code}
> This throws as an error 
> {code}
> java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark 
> eventTime#3762657: timestamp, interval 10 minutes
> +- Project [cast(_1#3762643 as timestamp) AS eventTime#3762657, _2#3762644 AS 
> deviceId#3762651]
>+- LogicalRDD [_1#3762643, _2#3762644, _3#3762645L]
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 

[jira] [Resolved] (SPARK-20373) Batch queries with 'Dataset/DataFrame.withWatermark()` does not execute

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20373.
-
Resolution: Fixed

> Batch queries with 'Dataset/DataFrame.withWatermark()` does not execute
> ---
>
> Key: SPARK-20373
> URL: https://issues.apache.org/jira/browse/SPARK-20373
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Tathagata Das
>Assignee: Genmao Yu
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
>
> Any Dataset/DataFrame batch query with the operation `withWatermark` does not 
> execute because the batch planner does not have any rule to explicitly handle 
> the EventTimeWatermark logical plan. The right solution is to simply remove 
> the plan node, as the watermark should not affect any batch query in any way.
> {code}
> from pyspark.sql.functions import *
> eventsDF = spark.createDataFrame([("2016-03-11 09:00:07", "dev1", 
> 123)]).toDF("eventTime", "deviceId", 
> "signal").select(col("eventTime").cast("timestamp").alias("eventTime"), 
> "deviceId", "signal")
> windowedCountsDF = \
>   eventsDF \
> .withWatermark("eventTime", "10 minutes") \
> .groupBy(
>   "deviceId",
>   window("eventTime", "5 minutes")) \
> .count()
> windowedCountsDF.collect()
> {code}
> This throws as an error 
> {code}
> java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark 
> eventTime#3762657: timestamp, interval 10 minutes
> +- Project [cast(_1#3762643 as timestamp) AS eventTime#3762657, _2#3762644 AS 
> deviceId#3762651]
>+- LogicalRDD [_1#3762643, _2#3762644, _3#3762645L]
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at 

[jira] [Updated] (SPARK-20686) PropagateEmptyRelation incorrectly handles aggregate without grouping expressions

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20686:

Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> PropagateEmptyRelation incorrectly handles aggregate without grouping 
> expressions
> -
>
> Key: SPARK-20686
> URL: https://issues.apache.org/jira/browse/SPARK-20686
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>  Labels: correctness
> Fix For: 2.1.2, 2.2.0
>
>
> The query
> {code}
> SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1
> {code}
> should return a single row of output because the subquery is an aggregate 
> without a group-by and thus should return a single row. However, Spark 
> incorrectly returns zero rows.
> This is caused by SPARK-16208, a patch which added an optimizer rule to 
> propagate EmptyRelation through operators. The logic for handling aggregates 
> is wrong: it checks whether aggregate expressions are non-empty for deciding 
> whether the output should be empty, whereas it should be checking grouping 
> expressions instead:
> An aggregate with non-empty group expression will return one output row per 
> group. If the input to the grouped aggregate is empty then all groups will be 
> empty and thus the output will be empty. It doesn't matter whether the SELECT 
> statement includes aggregate expressions since that won't affect the number 
> of output rows.
> If the grouping expressions are empty, however, then the aggregate will 
> always produce a single output row and thus we cannot propagate the 
> EmptyRelation.
> The current implementation is incorrect (since it returns a wrong answer) and 
> also misses an optimization opportunity by not propagating EmptyRelation in 
> the case where a grouped aggregate has aggregate expressions.
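
A self-contained sketch of the decision described above, using a toy plan model rather 
than the actual Catalyst classes: emptiness propagates through an aggregate only when it 
has grouping expressions.

{code}
sealed trait Plan { def producesNoRows: Boolean }
case class EmptyRelation() extends Plan { val producesNoRows = true }
case class Aggregate(groupingExprs: Seq[String],
                     aggregateExprs: Seq[String],
                     child: Plan) extends Plan {
  // A global aggregate (no grouping expressions) always returns exactly one row.
  val producesNoRows: Boolean = groupingExprs.nonEmpty && child.producesNoRows
}

// SELECT COUNT(*) WHERE FALSE -> one row, must NOT be replaced by an empty relation
assert(!Aggregate(Nil, Seq("count(1)"), EmptyRelation()).producesNoRows)
// Grouped aggregate over empty input -> zero rows, emptiness can be propagated
assert(Aggregate(Seq("x"), Seq("count(1)"), EmptyRelation()).producesNoRows)
{code}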



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17685) WholeStageCodegenExec throws IndexOutOfBoundsException

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17685:

Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> WholeStageCodegenExec throws IndexOutOfBoundsException
> --
>
> Key: SPARK-17685
> URL: https://issues.apache.org/jira/browse/SPARK-17685
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Minor
> Fix For: 2.1.2, 2.2.0
>
>
> The following SQL query reproduces this issue:
> {code:sql}
> CREATE TABLE tab1(int int, int2 int, str string);
> CREATE TABLE tab2(int int, int2 int, str string);
> INSERT INTO tab1 values(1,1,'str');
> INSERT INTO tab1 values(2,2,'str');
> INSERT INTO tab2 values(1,1,'str');
> INSERT INTO tab2 values(2,3,'str');
> SELECT
>   count(*)
> FROM
>   (
> SELECT t1.int, t2.int2 
> FROM (SELECT * FROM tab1 LIMIT 1310721) t1
> INNER JOIN (SELECT * FROM tab2 LIMIT 1310721) t2 
> ON (t1.int = t2.int AND t1.int2 = t2.int2)
>   ) t;
> {code}
> Exception thrown:
> {noformat}
> java.lang.IndexOutOfBoundsException: 1
>   at 
> scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.List.apply(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:64)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$createJoinKey$1.apply(SortMergeJoinExec.scala:334)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$createJoinKey$1.apply(SortMergeJoinExec.scala:334)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.createJoinKey(SortMergeJoinExec.scala:334)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.genScanner(SortMergeJoinExec.scala:369)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.doProduce(SortMergeJoinExec.scala:512)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.produce(SortMergeJoinExec.scala:35)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:30)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithoutKeys(HashAggregateExec.scala:215)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:143)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> 

[jira] [Updated] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12837:

Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.2.0
>
>
> Executing a SQL statement with a large number of partitions requires a large 
> amount of driver memory even when there are no requests to collect data back 
> to the driver.
> Here are the steps to reproduce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}
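A note for context: the limit that trips here, {{spark.driver.maxResultSize}}, is configurable at launch time. A minimal sketch follows; the value is only an illustrative assumption, not a tuning recommendation, and setting it to "0" removes the limit entirely.

{code:java}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Raise the cap on the total size of serialized task results kept at the driver.
// Illustrative value only; must be set before the SparkContext/SparkSession starts.
val conf = new SparkConf()
  .setAppName("max-result-size-sketch")
  .set("spark.driver.maxResultSize", "128m")

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}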



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20678) Ndv for columns not in filter condition should also be updated

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20678:

Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> Ndv for columns not in filter condition should also be updated
> --
>
> Key: SPARK-20678
> URL: https://issues.apache.org/jira/browse/SPARK-20678
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.2.0
>
>
> In filter estimation, we update column stats for the columns that appear in the 
> filter condition. However, if the number of rows decreases after the filter (i.e. 
> the overall selectivity is less than 1), we need to update (scale down) the 
> number of distinct values (NDV) for all columns, no matter whether they appear 
> in the filter condition or not.
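As a rough illustration of the scale-down described above (a sketch only, assuming NDV is reduced proportionally to the overall selectivity and rounded up so a non-empty output keeps at least one distinct value; this is not the actual estimator code):

{code:java}
// Hypothetical helper, not Spark's FilterEstimation: scale a column's number of
// distinct values (NDV) by the overall selectivity of the filter.
def scaleNdv(ndv: BigInt, selectivity: Double): BigInt = {
  require(selectivity >= 0.0 && selectivity <= 1.0, "selectivity must be in [0, 1]")
  val scaled = BigDecimal(ndv) * BigDecimal(selectivity)
  scaled.setScale(0, BigDecimal.RoundingMode.CEILING).toBigInt
}

scaleNdv(BigInt(1000), 0.25)  // -> 250, even for columns not referenced by the filter
{code}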



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20590) Map default input data source formats to inlined classes

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20590:

Fix Version/s: (was: 2.3.0)

> Map default input data source formats to inlined classes
> 
>
> Key: SPARK-20590
> URL: https://issues.apache.org/jira/browse/SPARK-20590
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sameer Agarwal
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> One of the common usability problems around reading data in Spark 
> (particularly CSV) is that there can often be a conflict between different 
> readers on the classpath.
> As an example, if someone launches a 2.x Spark shell with the spark-csv 
> package on the classpath, Spark currently fails in an extremely unfriendly way:
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> java.lang.RuntimeException: Multiple sources found for csv 
> (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, 
> com.databricks.spark.csv.DefaultSource15), please specify the fully qualified 
> class name.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
>   ... 48 elided
> {code}
> This JIRA proposes a simple way of fixing this error by always mapping 
> default input data source formats to inlined classes (that exist in Spark).
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}
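A minimal sketch of the proposed mapping (an illustration only; the real resolution logic lives in {{DataSource.lookupDataSource}} and is more involved):

{code:java}
// Hypothetical lookup table from default short names to the classes inlined in Spark.
val builtinSources: Map[String, String] = Map(
  "csv"     -> "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
  "json"    -> "org.apache.spark.sql.execution.datasources.json.JsonFileFormat",
  "parquet" -> "org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat"
)

// Prefer the built-in class for a default format; otherwise keep the user-supplied name.
def resolveFormat(name: String): String =
  builtinSources.getOrElse(name.toLowerCase, name)

resolveFormat("csv")  // always picks the inlined CSVFileFormat, avoiding the conflict
{code}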



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20630) Thread Dump link available in Executors tab irrespective of spark.ui.threadDumpsEnabled

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20630:

Fix Version/s: (was: 2.2.1)
   2.2.0

> Thread Dump link available in Executors tab irrespective of 
> spark.ui.threadDumpsEnabled
> ---
>
> Key: SPARK-20630
> URL: https://issues.apache.org/jira/browse/SPARK-20630
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Assignee: Alex Bozarth
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: spark-webui-executors-threadDump.png
>
>
> Irrespective of the {{spark.ui.threadDumpsEnabled}} property, the web UI's Executors 
> page displays a *Thread Dump* column with an active link (which does nothing, 
> though).
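For reference, a minimal sketch of disabling the feature via the property named above (illustrative configuration only; the point of this report is that the link should then be hidden rather than left clickable):

{code:java}
import org.apache.spark.SparkConf

// Turn off executor thread dumps in the web UI. With the bug described above,
// the "Thread Dump" column still shows an (inert) link even with this setting.
val conf = new SparkConf().set("spark.ui.threadDumpsEnabled", "false")
{code}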



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20600) KafkaRelation should be pretty printed in web UI (Details for Query)

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20600:

Fix Version/s: (was: 2.3.0)

> KafkaRelation should be pretty printed in web UI (Details for Query)
> 
>
> Key: SPARK-20600
> URL: https://issues.apache.org/jira/browse/SPARK-20600
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Assignee: Jacek Laskowski
>Priority: Trivial
> Fix For: 2.2.0
>
> Attachments: kafka-source-scan-webui.png
>
>
> Executing the following batch query gives the default stringified/internal 
> name of {{KafkaRelation}} in the web UI (under Details for Query), i.e. 
> http://localhost:4040/SQL/execution/?id=3 (<-- change the {{id}}). See the 
> attachment.
> {code}
> spark.
>   read.
>   format("kafka").
>   option("subscribe", "topic1").
>   option("kafka.bootstrap.servers", "localhost:9092").
>   load.
>   select('value cast "string").
>   write.
>   csv("fromkafka.csv")
> {code}
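A generic illustration of the requested behaviour (not Spark's actual {{KafkaRelation}} code): a relation controls how it is rendered in the plan details simply by overriding {{toString}}.

{code:java}
// Sketch only: a stand-in class showing the pretty-printing idea.
case class KafkaRelationSketch(topics: Seq[String]) {
  override def toString: String = s"KafkaRelation(topics=${topics.mkString(", ")})"
}

KafkaRelationSketch(Seq("topic1")).toString  // -> "KafkaRelation(topics=topic1)"
{code}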



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20688) correctly check analysis for scalar sub-queries

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20688:

Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> correctly check analysis for scalar sub-queries
> ---
>
> Key: SPARK-20688
> URL: https://issues.apache.org/jira/browse/SPARK-20688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.2, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20637) MappedRDD, FilteredRDD, etc. are still referenced in code comments

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20637:

Fix Version/s: (was: 2.2.1)
   2.2.0

> MappedRDD, FilteredRDD, etc. are still referenced in code comments
> --
>
> Key: SPARK-20637
> URL: https://issues.apache.org/jira/browse/SPARK-20637
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Michael Mior
>Assignee: Michael Mior
>Priority: Trivial
> Fix For: 2.2.0
>
>
> There are only a couple instances of this, but it would be helpful to have 
> things updated to current references.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20569:

Fix Version/s: (was: 2.2.1)

> RuntimeReplaceable functions accept invalid third parameter
> ---
>
> Key: SPARK-20569
> URL: https://issues.apache.org/jira/browse/SPARK-20569
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: liuxian
>Assignee: Wenchen Fan
>Priority: Trivial
> Fix For: 2.2.0, 2.3.0
>
>
> >select  Nvl(null,'1',3);
> >3
> The "Nvl" function has only two input parameters, so when three 
> parameters are supplied it should report: "Error in query: Invalid number of 
> arguments for function nvl".
> Functions such as "nvl2", "nullIf", and "IfNull" have a similar problem.
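A minimal way to reproduce the report from a Spark shell (sketch; on an affected version the extra argument is silently accepted, whereas the expected behaviour is an analysis error like the one quoted above):

{code:java}
// nvl is documented with two parameters; the third argument below should be rejected.
spark.sql("SELECT nvl(null, '1', 3)").show()
{code}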



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20569:

Fix Version/s: 2.2.1

> RuntimeReplaceable functions accept invalid third parameter
> ---
>
> Key: SPARK-20569
> URL: https://issues.apache.org/jira/browse/SPARK-20569
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: liuxian
>Assignee: Wenchen Fan
>Priority: Trivial
> Fix For: 2.2.0, 2.3.0
>
>
> >select  Nvl(null,'1',3);
> >3
> The "Nvl" function has only two input parameters, so when three 
> parameters are supplied it should report: "Error in query: Invalid number of 
> arguments for function nvl".
> Functions such as "nvl2", "nullIf", and "IfNull" have a similar problem.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20600) KafkaRelation should be pretty printed in web UI (Details for Query)

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20600:

Fix Version/s: (was: 2.2.1)
   2.2.0

> KafkaRelation should be pretty printed in web UI (Details for Query)
> 
>
> Key: SPARK-20600
> URL: https://issues.apache.org/jira/browse/SPARK-20600
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Assignee: Jacek Laskowski
>Priority: Trivial
> Fix For: 2.2.0, 2.3.0
>
> Attachments: kafka-source-scan-webui.png
>
>
> Executing the following batch query gives the default stringified/internal 
> name of {{KafkaRelation}} in the web UI (under Details for Query), i.e. 
> http://localhost:4040/SQL/execution/?id=3 (<-- change the {{id}}). See the 
> attachment.
> {code}
> spark.
>   read.
>   format("kafka").
>   option("subscribe", "topic1").
>   option("kafka.bootstrap.servers", "localhost:9092").
>   load.
>   select('value cast "string").
>   write.
>   csv("fromkafka.csv")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20685) BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20685:

Fix Version/s: (was: 2.3.0)

> BatchPythonEvaluation UDF evaluator fails for case of single UDF with 
> repeated argument
> ---
>
> Key: SPARK-20685
> URL: https://issues.apache.org/jira/browse/SPARK-20685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.1.2, 2.2.0
>
>
> There's a latent corner-case bug in PySpark UDF evaluation where executing a 
> stage with a single UDF that takes more than one argument _where an 
> argument is repeated_ will crash at execution with a confusing error.
> Here's a repro:
> {code}
> from pyspark.sql.types import *
> spark.catalog.registerFunction("add", lambda x, y: x + y, IntegerType())
> spark.sql("SELECT add(1, 1)").first()
> {code}
> This fails with
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 180, in main
> process()
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 175, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 107, in 
> func = lambda _, it: map(mapper, it)
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 93, in 
> mapper = lambda a: udf(*a)
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 71, in 
> return lambda *a: f(*a)
> TypeError: () takes exactly 2 arguments (1 given)
> {code}
> The problem was introduced by SPARK-14267: the code there has a fast path 
> for handling a batch UDF evaluation consisting of a single Python UDF, but 
> that branch incorrectly assumes that a single UDF won't have repeated 
> arguments and therefore skips the code for unpacking arguments from the input 
> row (whose schema may not necessarily match the UDF inputs).
> I have a simple fix for this which I'll submit now.
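A generic illustration of the unpacking the fast path skips (not Spark's evaluator code): when the same column backs several UDF arguments, the input row has fewer fields than the UDF expects, so each argument has to be projected by its offset.

{code:java}
// Sketch: row = Seq(1) holds one field, but the UDF's argument offsets are (0, 0),
// i.e. the same column used twice. Projecting by offset restores both arguments.
def unpackArgs(row: Seq[Any], offsets: Seq[Int]): Seq[Any] = offsets.map(row)

val add: (Int, Int) => Int = _ + _
val args = unpackArgs(Seq(1), Seq(0, 0)).map(_.asInstanceOf[Int])
add(args(0), args(1))  // -> 2, instead of an arity error
{code}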



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20685) BatchPythonEvaluation UDF evaluator fails for case of single UDF with repeated argument

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20685:

Fix Version/s: (was: 2.2.1)
   2.2.0

> BatchPythonEvaluation UDF evaluator fails for case of single UDF with 
> repeated argument
> ---
>
> Key: SPARK-20685
> URL: https://issues.apache.org/jira/browse/SPARK-20685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.1.2, 2.2.0
>
>
> There's a latent corner-case bug in PySpark UDF evaluation where executing a 
> stage with a single UDF that takes more than one argument _where an 
> argument is repeated_ will crash at execution with a confusing error.
> Here's a repro:
> {code}
> from pyspark.sql.types import *
> spark.catalog.registerFunction("add", lambda x, y: x + y, IntegerType())
> spark.sql("SELECT add(1, 1)").first()
> {code}
> This fails with
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 180, in main
> process()
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 175, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 107, in 
> func = lambda _, it: map(mapper, it)
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 93, in 
> mapper = lambda a: udf(*a)
>   File 
> "/Users/joshrosen/Documents/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 71, in 
> return lambda *a: f(*a)
> TypeError: () takes exactly 2 arguments (1 given)
> {code}
> The problem was introduced by SPARK-14267: the code there has a fast path 
> for handling a batch UDF evaluation consisting of a single Python UDF, but 
> that branch incorrectly assumes that a single UDF won't have repeated 
> arguments and therefore skips the code for unpacking arguments from the input 
> row (whose schema may not necessarily match the UDF inputs).
> I have a simple fix for this which I'll submit now.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20590) Map default input data source formats to inlined classes

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20590:

Fix Version/s: (was: 2.2.1)
   2.2.0

> Map default input data source formats to inlined classes
> 
>
> Key: SPARK-20590
> URL: https://issues.apache.org/jira/browse/SPARK-20590
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sameer Agarwal
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0, 2.3.0
>
>
> One of the common usability problems around reading data in Spark 
> (particularly CSV) is that there can often be a conflict between different 
> readers on the classpath.
> As an example, if someone launches a 2.x Spark shell with the spark-csv 
> package on the classpath, Spark currently fails in an extremely unfriendly way:
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> java.lang.RuntimeException: Multiple sources found for csv 
> (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, 
> com.databricks.spark.csv.DefaultSource15), please specify the fully qualified 
> class name.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
>   ... 48 elided
> {code}
> This JIRA proposes a simple way of fixing this error by always mapping 
> default input data source formats to inlined classes (that exist in Spark).
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20569:

Fix Version/s: (was: 2.2.1)
   2.2.0

> RuntimeReplaceable functions accept invalid third parameter
> ---
>
> Key: SPARK-20569
> URL: https://issues.apache.org/jira/browse/SPARK-20569
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: liuxian
>Assignee: Wenchen Fan
>Priority: Trivial
> Fix For: 2.2.0, 2.3.0
>
>
> >select  Nvl(null,'1',3);
> >3
> The "Nvl" function has only two input parameters, so when three 
> parameters are supplied it should report: "Error in query: Invalid number of 
> arguments for function nvl".
> Functions such as "nvl2", "nullIf", and "IfNull" have a similar problem.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20687) mllib.Matrices.fromBreeze may crash when converting from Breeze sparse matrix

2017-05-11 Thread Ignacio Bermudez Corrales (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ignacio Bermudez Corrales updated SPARK-20687:
--
Summary: mllib.Matrices.fromBreeze may crash when converting from Breeze 
sparse matrix  (was: mllib.Matrices.fromBreeze may crash when converting breeze 
CSCMatrix)

> mllib.Matrices.fromBreeze may crash when converting from Breeze sparse matrix
> -
>
> Key: SPARK-20687
> URL: https://issues.apache.org/jira/browse/SPARK-20687
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Ignacio Bermudez Corrales
>Priority: Minor
>
> Conversion of Breeze sparse matrices to Matrix is broken when the matrices are 
> the product of certain operations. I think this problem is caused by the update 
> method in Breeze CSCMatrix, which adds provisional zeros to the data for 
> efficiency.
> This bug is serious and may affect at least BlockMatrix addition and 
> subtraction:
> http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add/43883458#43883458
> The following code reproduces the bug (check test("breeze conversion bug")):
> https://github.com/ghoto/spark/blob/test-bug/CSCMatrixBreeze/mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala
> {code:title=MatricesSuite.scala|borderStyle=solid}
>   test("breeze conversion bug") {
> // (2, 0, 0)
> // (2, 0, 0)
> val mat1Brz = Matrices.sparse(2, 3, Array(0, 2, 2, 2), Array(0, 1), 
> Array(2, 2)).asBreeze
> // (2, 1E-15, 1E-15)
> // (2, 1E-15, 1E-15
> val mat2Brz = Matrices.sparse(2, 3, Array(0, 2, 4, 6), Array(0, 0, 0, 1, 
> 1, 1), Array(2, 1E-15, 1E-15, 2, 1E-15, 1E-15)).asBreeze
> // The following shouldn't break
> val t01 = mat1Brz - mat1Brz
> val t02 = mat2Brz - mat2Brz
> val t02Brz = Matrices.fromBreeze(t02)
> val t01Brz = Matrices.fromBreeze(t01)
> val t1Brz = mat1Brz - mat2Brz
> val t2Brz = mat2Brz - mat1Brz
> // The following ones should break
> val t1 = Matrices.fromBreeze(t1Brz)
> val t2 = Matrices.fromBreeze(t2Brz)
>   }
> {code}
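A possible workaround sketch (an assumption on my part, not the project's eventual fix): materialize the Breeze result densely and build the local matrix from that, which sidesteps the explicit zeros that trip up {{Matrices.fromBreeze}} for sparse inputs.

{code:java}
import breeze.linalg.{Matrix => BM}
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

// Copy the Breeze matrix element by element into a column-major array and build a
// dense local Matrix from it (trades memory for correctness on small matrices).
def fromBreezeDense(m: BM[Double]): Matrix = {
  val values = Array.tabulate(m.rows * m.cols) { k =>
    m(k % m.rows, k / m.rows)  // (row, column) in column-major order
  }
  Matrices.dense(m.rows, m.cols, values)
}
{code}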



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20512) SparkR 2.2 QA: Programming guide, migration guide, vignettes updates

2017-05-11 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16006951#comment-16006951
 ] 

Felix Cheung commented on SPARK-20512:
--

I did a QA pass on the R vignettes

> SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-20512
> URL: https://issues.apache.org/jira/browse/SPARK-20512
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> * Update R vignettes
> Note: This task is for large changes to the guides.  New features are handled 
> in [SPARK-18330].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20431) Support a DDL-formatted string in DataFrameReader.schema

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20431.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.3.0

> Support a DDL-formatted string in DataFrameReader.schema
> 
>
> Key: SPARK-20431
> URL: https://issues.apache.org/jira/browse/SPARK-20431
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.3.0
>
>
> This ticket targets supporting a DDL-formatted string in 
> `DataFrameReader.schema`. If we could specify a schema with a string, we would 
> not need to import `o.a.spark.sql.types._`.
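A sketch of the proposed usage (the file path and column names are made up for illustration):

{code:java}
// With a DDL-formatted string there is no need to build a StructType by hand.
val df = spark.read
  .schema("name STRING, age INT, scores ARRAY<DOUBLE>")
  .json("/path/to/people.json")
{code}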



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20600) KafkaRelation should be pretty printed in web UI (Details for Query)

2017-05-11 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20600.
--
   Resolution: Fixed
 Assignee: Jacek Laskowski
Fix Version/s: 2.3.0
   2.2.1

> KafkaRelation should be pretty printed in web UI (Details for Query)
> 
>
> Key: SPARK-20600
> URL: https://issues.apache.org/jira/browse/SPARK-20600
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Assignee: Jacek Laskowski
>Priority: Trivial
> Fix For: 2.2.1, 2.3.0
>
> Attachments: kafka-source-scan-webui.png
>
>
> Executing the following batch query gives the default stringified/internal 
> name of {{KafkaRelation}} in the web UI (under Details for Query), i.e. 
> http://localhost:4040/SQL/execution/?id=3 (<-- change the {{id}}). See the 
> attachment.
> {code}
> spark.
>   read.
>   format("kafka").
>   option("subscribe", "topic1").
>   option("kafka.bootstrap.servers", "localhost:9092").
>   load.
>   select('value cast "string").
>   write.
>   csv("fromkafka.csv")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20033) spark sql can not use hive permanent function

2017-05-11 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover resolved SPARK-20033.
-
Resolution: Not A Problem

Marking this JIRA as resolved, accordingly.

> spark sql can not use hive permanent function
> -
>
> Key: SPARK-20033
> URL: https://issues.apache.org/jira/browse/SPARK-20033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>
> {code}
> spark-sql> SELECT concat_all_ws('-', *) from det.result_set where 
> job_id='1028448' limit 10;
> Error in query: Undefined function: 'concat_all_ws'. This function is neither 
> a registered temporary function nor a permanent function registered in the 
> database 'default'.; line 1 pos 7
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20033) spark sql can not use hive permanent function

2017-05-11 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16006870#comment-16006870
 ] 

Mark Grover commented on SPARK-20033:
-

Reading the PR, it seems like this is not an issue. The related issue 
(allowing jars to be added from HDFS in Spark), SPARK-12868, was fixed in Spark 2.2.
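For illustration, registering a permanent function whose jar lives on HDFS looks roughly like the following (the class name and jar path are hypothetical placeholders; this relies on the HDFS-jar support mentioned above):

{code:java}
// Sketch only: the function class and jar location are placeholders.
spark.sql(
  """CREATE FUNCTION concat_all_ws
    |AS 'com.example.udf.ConcatAllWs'
    |USING JAR 'hdfs:///user/hive/udfs/concat_all_ws.jar'""".stripMargin)
{code}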

> spark sql can not use hive permanent function
> -
>
> Key: SPARK-20033
> URL: https://issues.apache.org/jira/browse/SPARK-20033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>
> {code}
> spark-sql> SELECT concat_all_ws('-', *) from det.result_set where 
> job_id='1028448' limit 10;
> Error in query: Undefined function: 'concat_all_ws'. This function is neither 
> a registered temporary function nor a permanent function registered in the 
> database 'default'.; line 1 pos 7
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20687) mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix

2017-05-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20687:


Assignee: (was: Apache Spark)

> mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix
> 
>
> Key: SPARK-20687
> URL: https://issues.apache.org/jira/browse/SPARK-20687
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Ignacio Bermudez Corrales
>Priority: Minor
>
> Conversion of Breeze sparse matrices to Matrix is broken when the matrices are 
> the product of certain operations. I think this problem is caused by the update 
> method in Breeze CSCMatrix, which adds provisional zeros to the data for 
> efficiency.
> This bug is serious and may affect at least BlockMatrix addition and 
> subtraction:
> http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add/43883458#43883458
> The following code reproduces the bug (check test("breeze conversion bug")):
> https://github.com/ghoto/spark/blob/test-bug/CSCMatrixBreeze/mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala
> {code:title=MatricesSuite.scala|borderStyle=solid}
>   test("breeze conversion bug") {
> // (2, 0, 0)
> // (2, 0, 0)
> val mat1Brz = Matrices.sparse(2, 3, Array(0, 2, 2, 2), Array(0, 1), 
> Array(2, 2)).asBreeze
> // (2, 1E-15, 1E-15)
> // (2, 1E-15, 1E-15
> val mat2Brz = Matrices.sparse(2, 3, Array(0, 2, 4, 6), Array(0, 0, 0, 1, 
> 1, 1), Array(2, 1E-15, 1E-15, 2, 1E-15, 1E-15)).asBreeze
> // The following shouldn't break
> val t01 = mat1Brz - mat1Brz
> val t02 = mat2Brz - mat2Brz
> val t02Brz = Matrices.fromBreeze(t02)
> val t01Brz = Matrices.fromBreeze(t01)
> val t1Brz = mat1Brz - mat2Brz
> val t2Brz = mat2Brz - mat1Brz
> // The following ones should break
> val t1 = Matrices.fromBreeze(t1Brz)
> val t2 = Matrices.fromBreeze(t2Brz)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20687) mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix

2017-05-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20687:


Assignee: Apache Spark

> mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix
> 
>
> Key: SPARK-20687
> URL: https://issues.apache.org/jira/browse/SPARK-20687
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Ignacio Bermudez Corrales
>Assignee: Apache Spark
>Priority: Minor
>
> Conversion of Breeze sparse matrices to Matrix is broken when the matrices are 
> the product of certain operations. I think this problem is caused by the update 
> method in Breeze CSCMatrix, which adds provisional zeros to the data for 
> efficiency.
> This bug is serious and may affect at least BlockMatrix addition and 
> subtraction:
> http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add/43883458#43883458
> The following code reproduces the bug (check test("breeze conversion bug")):
> https://github.com/ghoto/spark/blob/test-bug/CSCMatrixBreeze/mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala
> {code:title=MatricesSuite.scala|borderStyle=solid}
>   test("breeze conversion bug") {
> // (2, 0, 0)
> // (2, 0, 0)
> val mat1Brz = Matrices.sparse(2, 3, Array(0, 2, 2, 2), Array(0, 1), 
> Array(2, 2)).asBreeze
> // (2, 1E-15, 1E-15)
> // (2, 1E-15, 1E-15
> val mat2Brz = Matrices.sparse(2, 3, Array(0, 2, 4, 6), Array(0, 0, 0, 1, 
> 1, 1), Array(2, 1E-15, 1E-15, 2, 1E-15, 1E-15)).asBreeze
> // The following shouldn't break
> val t01 = mat1Brz - mat1Brz
> val t02 = mat2Brz - mat2Brz
> val t02Brz = Matrices.fromBreeze(t02)
> val t01Brz = Matrices.fromBreeze(t01)
> val t1Brz = mat1Brz - mat2Brz
> val t2Brz = mat2Brz - mat1Brz
> // The following ones should break
> val t1 = Matrices.fromBreeze(t1Brz)
> val t2 = Matrices.fromBreeze(t2Brz)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20687) mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix

2017-05-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16006804#comment-16006804
 ] 

Apache Spark commented on SPARK-20687:
--

User 'ghoto' has created a pull request for this issue:
https://github.com/apache/spark/pull/17940

> mllib.Matrices.fromBreeze may crash when converting breeze CSCMatrix
> 
>
> Key: SPARK-20687
> URL: https://issues.apache.org/jira/browse/SPARK-20687
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Ignacio Bermudez Corrales
>Priority: Minor
>
> Conversion of Breeze sparse matrices to Matrix is broken when the matrices are 
> the product of certain operations. I think this problem is caused by the update 
> method in Breeze CSCMatrix, which adds provisional zeros to the data for 
> efficiency.
> This bug is serious and may affect at least BlockMatrix addition and 
> subtraction:
> http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add/43883458#43883458
> The following code reproduces the bug (check test("breeze conversion bug")):
> https://github.com/ghoto/spark/blob/test-bug/CSCMatrixBreeze/mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala
> {code:title=MatricesSuite.scala|borderStyle=solid}
>   test("breeze conversion bug") {
> // (2, 0, 0)
> // (2, 0, 0)
> val mat1Brz = Matrices.sparse(2, 3, Array(0, 2, 2, 2), Array(0, 1), 
> Array(2, 2)).asBreeze
> // (2, 1E-15, 1E-15)
> // (2, 1E-15, 1E-15
> val mat2Brz = Matrices.sparse(2, 3, Array(0, 2, 4, 6), Array(0, 0, 0, 1, 
> 1, 1), Array(2, 1E-15, 1E-15, 2, 1E-15, 1E-15)).asBreeze
> // The following shouldn't break
> val t01 = mat1Brz - mat1Brz
> val t02 = mat2Brz - mat2Brz
> val t02Brz = Matrices.fromBreeze(t02)
> val t01Brz = Matrices.fromBreeze(t01)
> val t1Brz = mat1Brz - mat2Brz
> val t2Brz = mat2Brz - mat1Brz
> // The following ones should break
> val t1 = Matrices.fromBreeze(t1Brz)
> val t2 = Matrices.fromBreeze(t2Brz)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20416) Column names inconsistent for UDFs in SQL vs Dataset

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-20416.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

> Column names inconsistent for UDFs in SQL vs Dataset
> 
>
> Key: SPARK-20416
> URL: https://issues.apache.org/jira/browse/SPARK-20416
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.3.0
>
>
> As you can see below, the name of the columns in SQL vs Dataset is different.
> {code}
> scala> val timesTwoUDF = spark.udf.register("timesTwo", (x: Int) => x * 2)
> timesTwoUDF: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,IntegerType,Some(List(IntegerType)))
> scala> spark.sql("SELECT timesTwo(1)").show
> +---+
> |UDF:timesTwo(1)|
> +---+
> |  2|
> +---+
> scala> spark.range(1, 2).toDF("x").select(timesTwoUDF($"x")).show
> +--+
> |UDF(x)|
> +--+
> | 2|
> +--+
> {code}
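Until the naming is consistent, one workaround sketch is to name the column explicitly (building on the example above, so {{timesTwoUDF}} and the shell implicits are assumed to be in scope):

{code:java}
// Give the UDF column a stable, explicit name instead of relying on the default.
spark.range(1, 2).toDF("x")
  .select(timesTwoUDF($"x").alias("timesTwo(x)"))
  .show()
{code}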



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20416) Column names inconsistent for UDFs in SQL vs Dataset

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-20416:
---

Assignee: Xiao Li

> Column names inconsistent for UDFs in SQL vs Dataset
> 
>
> Key: SPARK-20416
> URL: https://issues.apache.org/jira/browse/SPARK-20416
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Assignee: Xiao Li
>Priority: Minor
> Fix For: 2.3.0
>
>
> As you can see below, the name of the columns in SQL vs Dataset is different.
> {code}
> scala> val timesTwoUDF = spark.udf.register("timesTwo", (x: Int) => x * 2)
> timesTwoUDF: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,IntegerType,Some(List(IntegerType)))
> scala> spark.sql("SELECT timesTwo(1)").show
> +---+
> |UDF:timesTwo(1)|
> +---+
> |  2|
> +---+
> scala> spark.range(1, 2).toDF("x").select(timesTwoUDF($"x")).show
> +--+
> |UDF(x)|
> +--+
> | 2|
> +--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20416) Column names inconsistent for UDFs in SQL vs Dataset

2017-05-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-20416:
---

Assignee: Takeshi Yamamuro  (was: Xiao Li)

> Column names inconsistent for UDFs in SQL vs Dataset
> 
>
> Key: SPARK-20416
> URL: https://issues.apache.org/jira/browse/SPARK-20416
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.3.0
>
>
> As you can see below, the name of the columns in SQL vs Dataset is different.
> {code}
> scala> val timesTwoUDF = spark.udf.register("timesTwo", (x: Int) => x * 2)
> timesTwoUDF: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,IntegerType,Some(List(IntegerType)))
> scala> spark.sql("SELECT timesTwo(1)").show
> +---+
> |UDF:timesTwo(1)|
> +---+
> |  2|
> +---+
> scala> spark.range(1, 2).toDF("x").select(timesTwoUDF($"x")).show
> +--+
> |UDF(x)|
> +--+
> | 2|
> +--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20709) spark-shell use proxy-user failed

2017-05-11 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-20709.

Resolution: Duplicate

> spark-shell use proxy-user failed
> -
>
> Key: SPARK-20709
> URL: https://issues.apache.org/jira/browse/SPARK-20709
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.1.0
>Reporter: fangfengbin
>
> The command is: spark-shell --master yarn-client --proxy-user leoB
> Exception thrown: failed to find any Kerberos tgt
> The log is:
> 17/05/11 15:56:21 DEBUG MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, 
> sampleName=Ops, always=false, type=DEFAULT, valueName=Time, value=[Rate of 
> successful kerberos logins and latency (milliseconds)])
> 17/05/11 15:56:21 DEBUG MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, 
> sampleName=Ops, always=false, type=DEFAULT, valueName=Time, value=[Rate of 
> failed kerberos logins and latency (milliseconds)])
> 17/05/11 15:56:21 DEBUG MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, 
> sampleName=Ops, always=false, type=DEFAULT, valueName=Time, value=[GetGroups])
> 17/05/11 15:56:21 DEBUG MetricsSystemImpl: UgiMetrics, User and group related 
> metrics
> 17/05/11 15:56:22 DEBUG Shell: setsid exited with exit code 0
> 17/05/11 15:56:22 DEBUG Groups:  Creating new Groups object
> 17/05/11 15:56:22 DEBUG NativeCodeLoader: Trying to load the custom-built 
> native-hadoop library...
> 17/05/11 15:56:22 DEBUG NativeCodeLoader: Loaded the native-hadoop library
> 17/05/11 15:56:22 DEBUG JniBasedUnixGroupsMapping: Using 
> JniBasedUnixGroupsMapping for Group resolution
> 17/05/11 15:56:22 DEBUG JniBasedUnixGroupsMappingWithFallback: Group mapping 
> impl=org.apache.hadoop.security.JniBasedUnixGroupsMapping
> 17/05/11 15:56:22 DEBUG Groups: Group mapping 
> impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback; 
> cacheTimeout=30; warningDeltaMs=5000
> 17/05/11 15:56:22 DEBUG UserGroupInformation: hadoop login
> 17/05/11 15:56:22 DEBUG UserGroupInformation: hadoop login commit
> 17/05/11 15:56:22 DEBUG UserGroupInformation: using kerberos 
> user:sp...@hadoop.com
> 17/05/11 15:56:22 DEBUG UserGroupInformation: Using user: "sp...@hadoop.com" 
> with name sp...@hadoop.com
> 17/05/11 15:56:22 DEBUG UserGroupInformation: User entry: "sp...@hadoop.com"
> 17/05/11 15:56:22 DEBUG UserGroupInformation: Assuming keytab is managed 
> externally since logged in from subject.
> 17/05/11 15:56:22 DEBUG UserGroupInformation: UGI loginUser:sp...@hadoop.com 
> (auth:KERBEROS)
> 17/05/11 15:56:22 DEBUG UserGroupInformation: Current time is 1494489382449
> 17/05/11 15:56:22 DEBUG UserGroupInformation: Next refresh is 1494541210600
> 17/05/11 15:56:22 DEBUG UserGroupInformation: PrivilegedAction as:leoB 
> (auth:PROXY) via sp...@hadoop.com (auth:KERBEROS) 
> from:org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> 17/05/11 15:56:29 WARN SparkConf: In Spark 1.0 and later spark.local.dir will 
> be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS 
> in mesos/standalone and LOCAL_DIRS in YARN).
> 17/05/11 15:56:56 WARN SessionState: load mapred-default.xml, HIVE_CONF_DIR 
> env not found!
> 17/05/11 15:56:56 ERROR TSaslTransport: SASL negotiation failure
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: 

[jira] [Resolved] (SPARK-19323) Upgrade breeze to 0.13

2017-05-11 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-19323.
-
Resolution: Duplicate

> Upgrade breeze to 0.13
> --
>
> Key: SPARK-19323
> URL: https://issues.apache.org/jira/browse/SPARK-19323
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: koert kuipers
>Priority: Minor
>
> SPARK-16494 upgraded Breeze to 0.12. This unfortunately brings in a new 
> dependency on an old version of shapeless (v2.0.0). Breeze 0.13 depends on a 
> newer shapeless. Breeze 0.13 is currently at RC1, so this will have to wait a bit.
> see discussion here:
> http://apache-spark-developers-list.1001551.n3.nabble.com/shapeless-in-spark-2-1-0-td20392.html
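Until the upgrade lands, a downstream build that needs a newer shapeless alongside Spark's Breeze 0.12 can try forcing the version itself; a build.sbt sketch with an illustrative version number (compatibility is not guaranteed):

{code:java}
// build.sbt (sketch): override the transitive shapeless version pulled in via Breeze 0.12.
dependencyOverrides += "com.chuusai" %% "shapeless" % "2.3.2"
{code}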



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19354) Killed tasks are getting marked as FAILED

2017-05-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16006549#comment-16006549
 ] 

Thomas Graves commented on SPARK-19354:
---

Just an FYI, I filed https://issues.apache.org/jira/browse/SPARK-20713 for the 
other issue I mentioned.

> Killed tasks are getting marked as FAILED
> -
>
> Key: SPARK-19354
> URL: https://issues.apache.org/jira/browse/SPARK-19354
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Devaraj K
>
> When we enable speculation, we can see multiple attempts running 
> for the same task when the first attempt's progress is slow. If any of the task 
> attempts succeeds, then the other attempts are killed; while being killed, 
> those attempts get marked as failed due to the error below. 
> We need to handle this error and mark the attempt as KILLED instead of FAILED.
> ||93  ||214   ||1 (speculative)   ||FAILED||ANY   ||1 / 
> xx.xx.xx.x2
> stdout
> stderr||2017/01/24 10:30:44   ||0.2 s ||0.0 B / 0 ||8.0 KB / 400  
> ||java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> node2/xx.xx.xx.x2; destination host is: node1:9000; 
> +details||
> {code:xml}
> 17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in 
> stage 1.0 (TID 214)
> 17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID 
> 214)
> java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> "stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000; 
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>   at com.sun.proxy.$Proxy17.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy18.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804)
>   at 
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
>   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1124)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88)
>   at org.apache.spark.scheduler.Task.run(Task.scala:114)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedByInterruptException
>   at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>   at 

[jira] [Created] (SPARK-20713) Speculative task that got CommitDenied exception shows up as failed

2017-05-11 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-20713:
-

 Summary: Speculative task that got CommitDenied exception shows up 
as failed
 Key: SPARK-20713
 URL: https://issues.apache.org/jira/browse/SPARK-20713
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: Thomas Graves


When running speculative tasks, you can end up getting a task failure on a 
speculative task (while the other task succeeded) because that task got a 
CommitDenied exception when really it was "killed" by the driver. It is a race 
between when the driver kills and when the executor tries to commit.

I think ideally we should fix up the task state here to be killed, because 
the fact that this task failed doesn't matter since the other speculative task 
succeeded. Tasks showing up as failures confuse the user and could make other 
scheduler cases harder.

This is somewhat related to SPARK-13343, where I think we should correctly 
account for speculative tasks. Only one of the 2 tasks really succeeded and 
committed, and the other should be marked differently.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19354) Killed tasks are getting marked as FAILED

2017-05-11 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves closed SPARK-19354.
-
Resolution: Duplicate

> Killed tasks are getting marked as FAILED
> -
>
> Key: SPARK-19354
> URL: https://issues.apache.org/jira/browse/SPARK-19354
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Devaraj K
>
> When we enable speculation, we can see multiple attempts running 
> for the same task when the first attempt's progress is slow. If any of the task 
> attempts succeeds, then the other attempts are killed; while being killed, 
> those attempts get marked as failed due to the error below. 
> We need to handle this error and mark the attempt as KILLED instead of FAILED.
> ||93  ||214   ||1 (speculative)   ||FAILED||ANY   ||1 / 
> xx.xx.xx.x2
> stdout
> stderr||2017/01/24 10:30:44   ||0.2 s ||0.0 B / 0 ||8.0 KB / 400  
> ||java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> node2/xx.xx.xx.x2; destination host is: node1:9000; 
> +details||
> {code:xml}
> 17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in 
> stage 1.0 (TID 214)
> 17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID 
> 214)
> java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> "stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000; 
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>   at com.sun.proxy.$Proxy17.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy18.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804)
>   at 
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
>   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1124)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88)
>   at org.apache.spark.scheduler.Task.run(Task.scala:114)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedByInterruptException
>   at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>   at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:659)
>   at 
> 

[jira] [Commented] (SPARK-19354) Killed tasks are getting marked as FAILED

2017-05-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16006464#comment-16006464
 ] 

Thomas Graves commented on SPARK-19354:
---

Thanks for pointing those out; that does fix this issue, so I will dup this to 
that one. Too bad they didn't pull that back to 2.1.

There is still one case where tasks show up as FAILED when killed, which is 
sometimes with TaskCommitDenied. It doesn't affect blacklisting, though, since 
TaskCommitDenied doesn't countTowardsTaskFailures. I'll look at this again and 
maybe file a separate JIRA for it if it seems like something we should fix.
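
For context, a simplified, self-contained sketch of what 
countTowardsTaskFailures means here (hypothetical names; not Spark's actual 
TaskEndReason hierarchy): some failure reasons, such as a denied commit from a 
killed speculative attempt, opt out of counting towards the per-task failure 
limit, so they don't feed blacklisting or stage aborts.

{code:scala}
// Hypothetical sketch only: illustrates a failure reason that does not count
// towards the task-failure limit. Names are illustrative, not Spark's.
object FailureAccountingSketch {
  sealed trait TaskFailedReason {
    def countTowardsTaskFailures: Boolean = true
  }
  final case class ExceptionFailure(description: String) extends TaskFailedReason
  final case class CommitDenied(jobId: Int, partitionId: Int, attemptNumber: Int)
      extends TaskFailedReason {
    // A denied commit usually means another attempt already committed, so it
    // is not treated as a "real" failure of this attempt.
    override def countTowardsTaskFailures: Boolean = false
  }

  // Only reasons that count towards task failures increment the counter.
  def recordFailure(failureCount: Int, reason: TaskFailedReason): Int =
    if (reason.countTowardsTaskFailures) failureCount + 1 else failureCount

  def main(args: Array[String]): Unit = {
    var failures = 0
    failures = recordFailure(failures, ExceptionFailure("java.io.IOException"))
    failures = recordFailure(failures, CommitDenied(0, 93, 1))
    println(failures)   // 1: the denied commit did not count
  }
}
{code}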

> Killed tasks are getting marked as FAILED
> -
>
> Key: SPARK-19354
> URL: https://issues.apache.org/jira/browse/SPARK-19354
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Reporter: Devaraj K
>
> When speculation is enabled, we can see multiple attempts running for the 
> same task when the first attempt is progressing slowly. If any one of the 
> attempts succeeds, the other attempts are killed; while being killed, those 
> attempts get marked as FAILED due to the error below. 
> We need to handle this error and mark the attempt as KILLED instead of FAILED.
> ||93  ||214   ||1 (speculative)   ||FAILED||ANY   ||1 / 
> xx.xx.xx.x2
> stdout
> stderr||2017/01/24 10:30:44   ||0.2 s ||0.0 B / 0 ||8.0 KB / 400  
> ||java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> node2/xx.xx.xx.x2; destination host is: node1:9000; 
> +details||
> {code:xml}
> 17/01/23 23:54:32 INFO Executor: Executor is trying to kill task 93.1 in 
> stage 1.0 (TID 214)
> 17/01/23 23:54:32 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 17/01/23 23:54:32 ERROR Executor: Exception in task 93.1 in stage 1.0 (TID 
> 214)
> java.io.IOException: Failed on local exception: 
> java.nio.channels.ClosedByInterruptException; Host Details : local host is: 
> "stobdtserver3/10.224.54.70"; destination host is: "stobdtserver2":9000; 
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>   at com.sun.proxy.$Proxy17.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy18.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:804)
>   at 
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
>   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1133)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1124)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:88)
>   at org.apache.spark.scheduler.Task.run(Task.scala:114)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> 
