[jira] [Created] (SPARK-26251) isnan function not picking non-numeric values

2018-12-02 Thread Kunal Rao (JIRA)
Kunal Rao created SPARK-26251:
-

 Summary: isnan function not picking non-numeric values
 Key: SPARK-26251
 URL: https://issues.apache.org/jira/browse/SPARK-26251
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Kunal Rao


// Run in spark-shell (otherwise import spark.implicits._ first for toDF):
import org.apache.spark.sql.functions._
List("po box 7896", "8907", "435435").toDF("rgid").filter(isnan(col("rgid"))).show

 

Expected: the filter should pick the row containing "po box 7896", since it is not a numeric value.
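For context, here is a minimal sketch (assuming spark-shell on Spark 2.4; column names and values are illustrative) contrasting what isnan matches, namely floating-point NaN values, with one way to flag strings that are not numeric by checking whether a cast to double yields null:

{code}
import org.apache.spark.sql.functions._

// isnan is true only for floating-point NaN values in a numeric column:
Seq(Double.NaN, 1.0, 2.5).toDF("x").filter(isnan(col("x"))).show()  // keeps only the NaN row

// A string that cannot be parsed as a number casts to null, not NaN,
// so a null check after casting flags the non-numeric rows:
Seq("po box 7896", "8907", "435435").toDF("rgid")
  .filter(col("rgid").cast("double").isNull)
  .show()  // keeps "po box 7896"
{code}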



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26249) Extension Points Enhancements to inject a rule in order and to add a batch

2018-12-02 Thread Sunitha Kambhampati (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunitha Kambhampati updated SPARK-26249:

Affects Version/s: 3.0.0  (was: 2.4.0)

> Extension Points Enhancements to inject a rule in order and to add a batch
> --
>
> Key: SPARK-26249
> URL: https://issues.apache.org/jira/browse/SPARK-26249
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Sunitha Kambhampati
>Priority: Major
>
> +Motivation:+
> Spark has an extension points API that allows third parties to extend Spark 
> with custom optimization rules. The current API does not allow fine-grained 
> control over when an optimization rule is exercised. In the current API, there 
> is no way to add a batch to the optimizer using the SparkSessionExtensions 
> API, similar to the postHocOptimizationBatches in SparkOptimizer.
> In our use cases, we have optimization rules that we want to add as 
> extensions to a batch in a specific order.
> +Proposal:+
> Add two new APIs to the existing extension points to allow more flexibility 
> for third-party users of Spark (a sketch follows below):
>  # Inject an optimizer rule into a batch, in order
>  # Inject an optimizer batch, in order
> The design spec is here:
> [https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]
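To make the proposal concrete, here is a minimal sketch of how a third-party extension registers an optimizer rule with SparkSessionExtensions today, together with the rough shape the two proposed calls could take. The methods injectOptimizerRuleInOrder and injectOptimizerBatch shown in comments are hypothetical names for the proposal, not existing Spark API.

{code}
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A trivial pass-through rule standing in for a third-party optimization.
case class MyRule(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit = {
    // Existing API: the rule is simply appended, with no control over which
    // optimizer batch it joins or where in that batch it runs.
    ext.injectOptimizerRule(MyRule)

    // Hypothetical shape of the proposed additions (names illustrative only):
    // ext.injectOptimizerRuleInOrder(MyRule, batch = "SomeBatch", after = "SomeRule")
    // ext.injectOptimizerBatch("MyBatch", Seq(MyRule), after = "SomeBatch")
  }
}
{code}

Such an extension is activated through the existing mechanism, e.g. --conf spark.sql.extensions=com.example.MyExtensions.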



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26250) Fail to run dataframe.R examples

2018-12-02 Thread Jean Pierre PIN (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean Pierre PIN updated SPARK-26250:

Description: 
I get an error=2 when running spark-submit examples/src/main/r/dataframe.R.
The script works in RStudio, but I've replaced the library(SparkR) line with this one:

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

I am at the top-level directory of the Spark installation, and the /bin directory 
is on the PATH environment variable, so spark-submit is found. The system is 
Windows 7 Ultimate 64-bit.

The error reads: Exception in thread "main" java.io.IOException: Cannot run program 
"Rscript": CreateProcess error=2, The system cannot find the file specified

I think this issue has been known for a long time, but I can't find any post about it.
 Thanks for your answer.

  was:
I get an error=2 running spark-submit examples/src/main/r/dataframe.R
the script is working with Rstudio but i've changed the library(SparkR) line  
with this one

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

i am at the top root directory of spark installation and the path variable for 
/bin is specified in the environment so spark-submit is found.  On system 
window 7 pro 64bits

read "main" java.io.IOException: Cannot run program "Rscript": CreateProcess 
error=2, The system cannot find the file specified

I think the issue is known for a long but i don't find any post.
Thanks for answer.


> Fail to run dataframe.R examples
> 
>
> Key: SPARK-26250
> URL: https://issues.apache.org/jira/browse/SPARK-26250
> Project: Spark
>  Issue Type: Test
>  Components: Examples
>Affects Versions: 2.4.0
>Reporter: Jean Pierre PIN
>Priority: Major
>
> I get an error=2 when running spark-submit examples/src/main/r/dataframe.R.
> The script works in RStudio, but I've replaced the library(SparkR) line with 
> this one:
> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
> I am at the top-level directory of the Spark installation, and the /bin 
> directory is on the PATH environment variable, so spark-submit is found. The 
> system is Windows 7 Ultimate 64-bit.
> The error reads: Exception in thread "main" java.io.IOException: Cannot run 
> program "Rscript": CreateProcess error=2, The system cannot find the file 
> specified
> I think this issue has been known for a long time, but I can't find any post 
> about it. Thanks for your answer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26250) Fail to run dataframe.R examples

2018-12-02 Thread Jean Pierre PIN (JIRA)
Jean Pierre PIN created SPARK-26250:
---

 Summary: Fail to run dataframe.R examples
 Key: SPARK-26250
 URL: https://issues.apache.org/jira/browse/SPARK-26250
 Project: Spark
  Issue Type: Test
  Components: Examples
Affects Versions: 2.4.0
Reporter: Jean Pierre PIN


I get an error=2 when running spark-submit examples/src/main/r/dataframe.R.
The script works in RStudio, but I've replaced the library(SparkR) line with this one:

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

I am at the top-level directory of the Spark installation, and the /bin directory 
is on the PATH environment variable, so spark-submit is found. The system is 
Windows 7 Pro 64-bit.

The error reads: Exception in thread "main" java.io.IOException: Cannot run program 
"Rscript": CreateProcess error=2, The system cannot find the file specified

I think this issue has been known for a long time, but I can't find any post about it.
Thanks for your answer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26228) OOM issue encountered when computing Gramian matrix

2018-12-02 Thread Chen Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706751#comment-16706751
 ] 

Chen Lin commented on SPARK-26228:
--

I have tried to set spark.driver.memory from 8g to 16g.

It doesn't work.

> OOM issue encountered when computing Gramian matrix 
> 
>
> Key: SPARK-26228
> URL: https://issues.apache.org/jira/browse/SPARK-26228
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Chen Lin
>Priority: Major
> Attachments: 1.jpeg
>
>
> {quote}/**
>  * Computes the Gramian matrix `A^T A`.
>   *
>  * @note This cannot be computed on matrices with more than 65535 columns.
>   */
> {quote}
> As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
> supports computing on matrices with no more than 65535 columns.
> However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
> when computing on matrices with 16000 columns.
> The root casue seems that the TreeAggregate writes a  very long buffer array 
> (16000*16000*8) which exceeds jvm limit(2^31 - 1).
> Does RowMatrix really supports computing on matrices with no more than 65535 
> columns?
> I doubt that computeGramianMatrix has a very serious performance issue.
> Do anyone has done some performance expriments before?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26228) OOM issue encountered when computing Gramian matrix

2018-12-02 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706721#comment-16706721
 ] 

shahid commented on SPARK-26228:


Could you please increase the driver memory and check?

> OOM issue encountered when computing Gramian matrix 
> 
>
> Key: SPARK-26228
> URL: https://issues.apache.org/jira/browse/SPARK-26228
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Chen Lin
>Priority: Major
> Attachments: 1.jpeg
>
>
> {quote}/**
>  * Computes the Gramian matrix `A^T A`.
>   *
>  * @note This cannot be computed on matrices with more than 65535 columns.
>   */
> {quote}
> As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
> supports computing on matrices with no more than 65535 columns.
> However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
> when computing on matrices with 16000 columns.
> The root casue seems that the TreeAggregate writes a  very long buffer array 
> (16000*16000*8) which exceeds jvm limit(2^31 - 1).
> Does RowMatrix really supports computing on matrices with no more than 65535 
> columns?
> I doubt that computeGramianMatrix has a very serious performance issue.
> Do anyone has done some performance expriments before?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26249) Extension Points Enhancements to inject a rule in order and to add a batch

2018-12-02 Thread Sunitha Kambhampati (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunitha Kambhampati updated SPARK-26249:

Description: 
+Motivation:+  

Spark has extension points API to allow third parties to extend Spark with 
custom optimization rules. The current API does not allow fine grain control on 
when the optimization rule will be exercised. In the current API,  there is no 
way to add a batch to the optimization using the SparkSessionExtensions API, 
similar to the postHocOptimizationBatches in SparkOptimizer.

In our use cases, we have optimization rules that we want to add as extensions 
to a batch in a specific order.

+Proposal:+ 

Add 2 new API's to the existing Extension Points to allow for more flexibility 
for third party users of Spark. 
 # Inject a optimizer rule to a batch in order 
 # Inject a optimizer batch in order

The design spec is here:

[https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]

  was:
+Motivation:+  

Spark has extension points API to allow third parties to extend Spark with 
custom optimization rules.  The current API does not allow fine grain control 
on when the optimization rule will be exercised.

In our use cases, we have optimization rules that we want to add as extensions 
to a batch in a specific order.

In the current API,  there is no way to add a batch to the optimization using 
the SparkSessionExtensions API, similar to the postHocOptimizationBatches in 
SparkOptimizer.

+Proposal:+ 

Add 2 new API's to the existing Extension Points to allow for more flexibility 
for third party users of Spark. 
 # Inject a optimizer rule to a batch in order 
 # Inject a optimizer batch in order

The design spec is here:

[https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]


> Extension Points Enhancements to inject a rule in order and to add a batch
> --
>
> Key: SPARK-26249
> URL: https://issues.apache.org/jira/browse/SPARK-26249
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sunitha Kambhampati
>Priority: Major
>
> +Motivation:+
> Spark has an extension points API that allows third parties to extend Spark 
> with custom optimization rules. The current API does not allow fine-grained 
> control over when an optimization rule is exercised. In the current API, there 
> is no way to add a batch to the optimizer using the SparkSessionExtensions 
> API, similar to the postHocOptimizationBatches in SparkOptimizer.
> In our use cases, we have optimization rules that we want to add as 
> extensions to a batch in a specific order.
> +Proposal:+
> Add two new APIs to the existing extension points to allow more flexibility 
> for third-party users of Spark.
>  # Inject an optimizer rule into a batch, in order
>  # Inject an optimizer batch, in order
> The design spec is here:
> [https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26249) Extension Points Enhancements to inject a rule in order and to add a batch

2018-12-02 Thread Sunitha Kambhampati (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706720#comment-16706720
 ] 

Sunitha Kambhampati commented on SPARK-26249:
-

I will post a PR soon.  

> Extension Points Enhancements to inject a rule in order and to add a batch
> --
>
> Key: SPARK-26249
> URL: https://issues.apache.org/jira/browse/SPARK-26249
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sunitha Kambhampati
>Priority: Major
>
> +Motivation:+  
> Spark has extension points API to allow third parties to extend Spark with 
> custom optimization rules.  The current API does not allow fine grain control 
> on when the optimization rule will be exercised.
> In our use cases, we have optimization rules that we want to add as 
> extensions to a batch in a specific order.
> In the current API,  there is no way to add a batch to the optimization using 
> the SparkSessionExtensions API, similar to the postHocOptimizationBatches in 
> SparkOptimizer.
> +Proposal:+ 
> Add 2 new API's to the existing Extension Points to allow for more 
> flexibility for third party users of Spark. 
>  # Inject a optimizer rule to a batch in order 
>  # Inject a optimizer batch in order
> The design spec is here:
> [https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26249) Extension Points Enhancements to inject a rule in order and to add a batch

2018-12-02 Thread Sunitha Kambhampati (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunitha Kambhampati updated SPARK-26249:

Description: 
+Motivation:+  

Spark has extension points API to allow third parties to extend Spark with 
custom optimization rules.  The current API does not allow fine grain control 
on when the optimization rule will be exercised.

In our use cases, we have optimization rules that we want to add as extensions 
to a batch in a specific order.

In the current API,  there is no way to add a batch to the optimization using 
the SparkSessionExtensions API, similar to the postHocOptimizationBatches in 
SparkOptimizer.

+Proposal:+ 

Add 2 new API's to the existing Extension Points to allow for more flexibility 
for third party users of Spark. 
 # Inject a optimizer rule to a batch in order 
 # Inject a optimizer batch in order

The design spec is here:

[https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]

  was:
+Motivation:+  

Spark has extension points API to allow third parties to extend Spark with 
custom optimization rules.  The current API does not allow fine grain control 
on when the optimization rule will be exercised.

In our use cases, we have optimization rules that we want to add as extensions 
to a batch in a specific order.

In the current API,  there is no way to add a batch to the optimization using 
the SparkSessionExtensions API, similar to the postHocOptimizationBatches in 
SparkOptimizer.

+Proposal:+ 

Add 2 new API's to the existing Extension Points to allow for more flexibility 
for third party users of Spark. 
 # Inject a optimizer rule to a batch in order 
 # Inject a optimizer batch in order

The design spec is 
[here|[https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]]


> Extension Points Enhancements to inject a rule in order and to add a batch
> --
>
> Key: SPARK-26249
> URL: https://issues.apache.org/jira/browse/SPARK-26249
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sunitha Kambhampati
>Priority: Major
>
> +Motivation:+  
> Spark has extension points API to allow third parties to extend Spark with 
> custom optimization rules.  The current API does not allow fine grain control 
> on when the optimization rule will be exercised.
> In our use cases, we have optimization rules that we want to add as 
> extensions to a batch in a specific order.
> In the current API,  there is no way to add a batch to the optimization using 
> the SparkSessionExtensions API, similar to the postHocOptimizationBatches in 
> SparkOptimizer.
> +Proposal:+ 
> Add 2 new API's to the existing Extension Points to allow for more 
> flexibility for third party users of Spark. 
>  # Inject a optimizer rule to a batch in order 
>  # Inject a optimizer batch in order
> The design spec is here:
> [https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26249) Extension Points Enhancements to inject a rule in order and to add a batch

2018-12-02 Thread Sunitha Kambhampati (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunitha Kambhampati updated SPARK-26249:

Description: 
+Motivation:+  

Spark has extension points API to allow third parties to extend Spark with 
custom optimization rules.  The current API does not allow fine grain control 
on when the optimization rule will be exercised.

In our use cases, we have optimization rules that we want to add as extensions 
to a batch in a specific order.

In the current API,  there is no way to add a batch to the optimization using 
the SparkSessionExtensions API, similar to the postHocOptimizationBatches in 
SparkOptimizer.

+Proposal:+ 

Add 2 new API's to the existing Extension Points to allow for more flexibility 
for third party users of Spark. 
 # Inject a optimizer rule to a batch in order 
 # Inject a optimizer batch in order

The design spec is 
[here|[https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]]

  was:
+Motivation:+  

Spark has extension points API to allow third parties to extend Spark with 
custom optimization rules.  The current API does not allow fine grain control 
on when the optimization rule will be exercised.

In our use cases, we have optimization rules that we want to add as extensions 
to a batch in a specific order.

In the current API,  there is no way to add a batch to the optimization using 
the SparkSessionExtensions API, similar to the postHocOptimizationBatches in 
SparkOptimizer.

+Proposal:+ 

Add 2 new API's to the existing Extension Points to allow for more flexibility 
for third party users of Spark. 
 # Inject a optimizer rule to a batch in order 
 # Inject a optimizer batch in order

The design spec is 
[here|[https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]]


> Extension Points Enhancements to inject a rule in order and to add a batch
> --
>
> Key: SPARK-26249
> URL: https://issues.apache.org/jira/browse/SPARK-26249
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sunitha Kambhampati
>Priority: Major
>
> +Motivation:+  
> Spark has extension points API to allow third parties to extend Spark with 
> custom optimization rules.  The current API does not allow fine grain control 
> on when the optimization rule will be exercised.
> In our use cases, we have optimization rules that we want to add as 
> extensions to a batch in a specific order.
> In the current API,  there is no way to add a batch to the optimization using 
> the SparkSessionExtensions API, similar to the postHocOptimizationBatches in 
> SparkOptimizer.
> +Proposal:+ 
> Add 2 new API's to the existing Extension Points to allow for more 
> flexibility for third party users of Spark. 
>  # Inject a optimizer rule to a batch in order 
>  # Inject a optimizer batch in order
> The design spec is 
> [here|[https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26249) Extension Points Enhancements to inject a rule in order and to add a batch

2018-12-02 Thread Sunitha Kambhampati (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunitha Kambhampati updated SPARK-26249:

Description: 
+Motivation:+  

Spark has extension points API to allow third parties to extend Spark with 
custom optimization rules.  The current API does not allow fine grain control 
on when the optimization rule will be exercised.

In our use cases, we have optimization rules that we want to add as extensions 
to a batch in a specific order.

In the current API,  there is no way to add a batch to the optimization using 
the SparkSessionExtensions API, similar to the postHocOptimizationBatches in 
SparkOptimizer.

+Proposal:+ 

Add 2 new API's to the existing Extension Points to allow for more flexibility 
for third party users of Spark. 
 # Inject a optimizer rule to a batch in order 
 # Inject a optimizer batch in order

The design spec is 
[here|[https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]]

  was:
+Motivation:+  

Spark has extension points API to allow third parties to extend Spark with 
custom optimization rules.  The current API does not allow fine grain control 
on when the optimization rule will be exercised.

In our use cases, we have optimization rules that we want to add as extensions 
to a batch in a specific order.

In the current API,  there is no way to add a batch to the optimization using 
the SparkSessionExtensions API, similar to the postHocOptimizationBatches in 
SparkOptimizer.

+Proposal:+ 

Add 2 new API's to the existing Extension Points to allow for more flexibility 
for third party users of Spark. 
 # Inject a optimizer rule to a batch in order 
 # Inject a optimizer batch in order


> Extension Points Enhancements to inject a rule in order and to add a batch
> --
>
> Key: SPARK-26249
> URL: https://issues.apache.org/jira/browse/SPARK-26249
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sunitha Kambhampati
>Priority: Major
>
> +Motivation:+  
> Spark has extension points API to allow third parties to extend Spark with 
> custom optimization rules.  The current API does not allow fine grain control 
> on when the optimization rule will be exercised.
> In our use cases, we have optimization rules that we want to add as 
> extensions to a batch in a specific order.
> In the current API,  there is no way to add a batch to the optimization using 
> the SparkSessionExtensions API, similar to the postHocOptimizationBatches in 
> SparkOptimizer.
> +Proposal:+ 
> Add 2 new API's to the existing Extension Points to allow for more 
> flexibility for third party users of Spark. 
>  # Inject a optimizer rule to a batch in order 
>  # Inject a optimizer batch in order
> The design spec is 
> [here|[https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26249) Extension Points Enhancements to inject a rule in order and to add a batch

2018-12-02 Thread Sunitha Kambhampati (JIRA)
Sunitha Kambhampati created SPARK-26249:
---

 Summary: Extension Points Enhancements to inject a rule in order 
and to add a batch
 Key: SPARK-26249
 URL: https://issues.apache.org/jira/browse/SPARK-26249
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Sunitha Kambhampati


+Motivation:+

Spark has an extension points API that allows third parties to extend Spark with 
custom optimization rules. The current API does not allow fine-grained control 
over when an optimization rule is exercised.

In our use cases, we have optimization rules that we want to add as extensions 
to a batch in a specific order.

In the current API, there is no way to add a batch to the optimizer using 
the SparkSessionExtensions API, similar to the postHocOptimizationBatches in 
SparkOptimizer.

+Proposal:+

Add two new APIs to the existing extension points to allow more flexibility 
for third-party users of Spark.
 # Inject an optimizer rule into a batch, in order
 # Inject an optimizer batch, in order



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26228) OOM issue encountered when computing Gramian matrix

2018-12-02 Thread Chen Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Lin updated SPARK-26228:
-
Attachment: 1.jpeg

> OOM issue encountered when computing Gramian matrix 
> 
>
> Key: SPARK-26228
> URL: https://issues.apache.org/jira/browse/SPARK-26228
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Chen Lin
>Priority: Major
> Attachments: 1.jpeg
>
>
> {quote}/**
>  * Computes the Gramian matrix `A^T A`.
>   *
>  * @note This cannot be computed on matrices with more than 65535 columns.
>   */
> {quote}
> As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
> supports computing on matrices with no more than 65535 columns.
> However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
> when computing on matrices with 16000 columns.
> The root casue seems that the TreeAggregate writes a  very long buffer array 
> (16000*16000*8) which exceeds jvm limit(2^31 - 1).
> Does RowMatrix really supports computing on matrices with no more than 65535 
> columns?
> I doubt that computeGramianMatrix has a very serious performance issue.
> Do anyone has done some performance expriments before?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26228) OOM issue encountered when computing Gramian matrix

2018-12-02 Thread Chen Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706706#comment-16706706
 ] 

Chen Lin commented on SPARK-26228:
--

Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2124)
at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1092)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.fold(RDD.scala:1086)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1131)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeGramianMatrix(RowMatrix.scala:123)
at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:345)
at org.apache.spark.mllib.stat.correlation.PearsonCorrelation$.computeCorrelationMatrix(PearsonCorrelation.scala:49)
at org.apache.spark.mllib.stat.correlation.Correlations$.corrMatrix(Correlation.scala:66)
at org.apache.spark.mllib.stat.Statistics$.corr(Statistics.scala:57)

> OOM issue encountered when computing Gramian matrix 
> 
>
> Key: SPARK-26228
> URL: https://issues.apache.org/jira/browse/SPARK-26228
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Chen Lin
>Priority: Major
> Attachments: 1.jpeg
>
>
> {quote}/**
>  * Computes the Gramian matrix `A^T A`.
>   *
>  * @note This cannot be computed on matrices with more than 65535 columns.
>   */
> {quote}
> As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
> supports computing on matrices with no more than 65535 columns.
> However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
> when computing on matrices with 16000 columns.
> The root casue seems that the TreeAggregate writes a  very long buffer array 
> (16000*16000*8) which exceeds jvm limit(2^31 - 1).
> Does RowMatrix really supports computing on matrices with no more than 65535 
> columns?
> I doubt that computeGramianMatrix has a very serious performance issue.
> Do anyone has done some performance expriments before?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26228) OOM issue encountered when computing Gramian matrix

2018-12-02 Thread Chen Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706701#comment-16706701
 ] 

Chen Lin commented on SPARK-26228:
--

[~shahid]

I have uploaded a screenshot of the log.

I suspect there are extra costs when writing a 16000*16000*8-byte array.
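As a quick arithmetic check on the sizes being discussed (a sketch only; it does not by itself establish where the oversized allocation comes from):

{code}
// Back-of-the-envelope check of the numbers quoted in this thread.
object GramianSizeCheck extends App {
  val n = 16000L
  val denseBytes  = n * n * 8            // 2,048,000,000 bytes for a full n x n double matrix
  val packedBytes = n * (n + 1) / 2 * 8  // ~1,024,064,000 bytes for a packed upper triangle
  val maxArrayLen = Int.MaxValue.toLong  // 2,147,483,647 = 2^31 - 1

  println(s"dense matrix bytes    = $denseBytes")
  println(s"packed triangle bytes = $packedBytes")
  println(s"max Java array length = $maxArrayLen")
  // So 16000 * 16000 * 8 alone does not exceed 2^31 - 1; whatever byte buffer the
  // serializer grows during closure cleaning (see the stack trace in the comment
  // above) has to be larger than the raw matrix to hit
  // "Requested array size exceeds VM limit".
}
{code}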

> OOM issue encountered when computing Gramian matrix 
> 
>
> Key: SPARK-26228
> URL: https://issues.apache.org/jira/browse/SPARK-26228
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Chen Lin
>Priority: Major
> Attachments: 1.jpeg
>
>
> {quote}/**
>  * Computes the Gramian matrix `A^T A`.
>   *
>  * @note This cannot be computed on matrices with more than 65535 columns.
>   */
> {quote}
> As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
> supports computing on matrices with no more than 65535 columns.
> However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
> when computing on matrices with 16000 columns.
> The root casue seems that the TreeAggregate writes a  very long buffer array 
> (16000*16000*8) which exceeds jvm limit(2^31 - 1).
> Does RowMatrix really supports computing on matrices with no more than 65535 
> columns?
> I doubt that computeGramianMatrix has a very serious performance issue.
> Do anyone has done some performance expriments before?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26228) OOM issue encountered when computing Gramian matrix

2018-12-02 Thread Chen Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Lin updated SPARK-26228:
-
Description: 
{quote}
/**
 * Computes the Gramian matrix `A^T A`.
 *
 * @note This cannot be computed on matrices with more than 65535 columns.
 */
{quote}
As the above doc comment of computeGramianMatrix in RowMatrix.scala says, it 
supports computing on matrices with no more than 65535 columns.

However, we find that it throws an OOM ("Requested array size exceeds VM limit") 
when computing on a matrix with 16000 columns.

The root cause seems to be that the treeAggregate writes a very long buffer array 
(16000*16000*8) which exceeds the JVM limit (2^31 - 1).

Does RowMatrix really support computing on matrices with no more than 65535 
columns?

I suspect that computeGramianMatrix has a very serious performance issue.

Has anyone done performance experiments on this before?

 

 

  was:
{quote}/**

 * Computes the Gramian matrix `A^T A`.
  *

 * @note This cannot be computed on matrices with more than 65535 columns.
  */
{quote}
As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
supports computing on matrices with no more than 65535 columns.

However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
when computing on matrices with 16000 columns.

The root casue seems that the TreeAggregate writes a  very long buffer array 
(16000*16000*8) which exceeds jvm limit(2^31 - 1).

Does RowMatrix really supports computing on matrices with no more than 65535 
columns?

I doubt that computeGramianMatrix has a very serious performance issue.

Do anyone has done some performance expriments before?


> OOM issue encountered when computing Gramian matrix 
> 
>
> Key: SPARK-26228
> URL: https://issues.apache.org/jira/browse/SPARK-26228
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Chen Lin
>Priority: Major
> Attachments: 1.jpeg
>
>
> {quote}/**
>  * Computes the Gramian matrix `A^T A`.
>   *
>  * @note This cannot be computed on matrices with more than 65535 columns.
>   */
> {quote}
> As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
> supports computing on matrices with no more than 65535 columns.
> However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
> when computing on matrices with 16000 columns.
> The root casue seems that the TreeAggregate writes a  very long buffer array 
> (16000*16000*8) which exceeds jvm limit(2^31 - 1).
> Does RowMatrix really supports computing on matrices with no more than 65535 
> columns?
> I doubt that computeGramianMatrix has a very serious performance issue.
> Do anyone has done some performance expriments before?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception

2018-12-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706691#comment-16706691
 ] 

Apache Spark commented on SPARK-26117:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23190

> use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception
> --
>
> Key: SPARK-26117
> URL: https://issues.apache.org/jira/browse/SPARK-26117
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.5.0
>Reporter: caoxuewen
>Assignee: caoxuewen
>Priority: Major
> Fix For: 3.0.0
>
>
> PR #20014 introduced SparkOutOfMemoryError to avoid killing the entire 
> executor when an OutOfMemoryError is thrown.
> So when acquiring memory via MemoryConsumer.allocatePage and catching the 
> resulting exception, throw SparkOutOfMemoryError instead of OutOfMemoryError.
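A schematic sketch of the pattern described above, not the actual patch: a hypothetical helper (the name allocateOrFailTask is made up for illustration) that rethrows an allocation failure as SparkOutOfMemoryError, so the failure is handled at task level rather than killing the whole executor.

{code}
import org.apache.spark.memory.SparkOutOfMemoryError

object AllocationGuard {
  // Callers that acquire memory (e.g. via MemoryConsumer.allocatePage) wrap the
  // call so that a bare OutOfMemoryError surfaces as SparkOutOfMemoryError.
  def allocateOrFailTask[T](requiredBytes: Long)(allocate: Long => T): T =
    try {
      allocate(requiredBytes)
    } catch {
      case _: OutOfMemoryError =>
        throw new SparkOutOfMemoryError(
          s"Unable to acquire $requiredBytes bytes of memory")
    }
}
{code}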



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception

2018-12-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706690#comment-16706690
 ] 

Apache Spark commented on SPARK-26117:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23190

> use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception
> --
>
> Key: SPARK-26117
> URL: https://issues.apache.org/jira/browse/SPARK-26117
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.5.0
>Reporter: caoxuewen
>Assignee: caoxuewen
>Priority: Major
> Fix For: 3.0.0
>
>
> PR #20014 introduced SparkOutOfMemoryError to avoid killing the entire 
> executor when an OutOfMemoryError is thrown.
> So when acquiring memory via MemoryConsumer.allocatePage and catching the 
> resulting exception, throw SparkOutOfMemoryError instead of OutOfMemoryError.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26198) Metadata serialize null values throw NPE

2018-12-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26198:
--
Fix Version/s: 2.4.1
   2.3.3

> Metadata serialize null values throw NPE
> 
>
> Key: SPARK-26198
> URL: https://issues.apache.org/jira/browse/SPARK-26198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Minor
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> How to reproduce this issue:
> {code}
> scala> val meta = new org.apache.spark.sql.types.MetadataBuilder().putNull("key").build().json
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196)
>   at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26228) OOM issue encountered when computing Gramian matrix

2018-12-02 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706667#comment-16706667
 ] 

shahid edited comment on SPARK-26228 at 12/3/18 5:25 AM:
-

Hi [~hibayesian], could you please share the full log of the error, if you 
have it. Thanks.

(BTW, 16000*16000*8 < 2^31 - 1.)


was (Author: shahid):
Hi [~hibayesian], could you please share the full log of the error, if you 
have. Thanks

> OOM issue encountered when computing Gramian matrix 
> 
>
> Key: SPARK-26228
> URL: https://issues.apache.org/jira/browse/SPARK-26228
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Chen Lin
>Priority: Major
>
> {quote}/**
>  * Computes the Gramian matrix `A^T A`.
>   *
>  * @note This cannot be computed on matrices with more than 65535 columns.
>   */
> {quote}
> As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
> supports computing on matrices with no more than 65535 columns.
> However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
> when computing on matrices with 16000 columns.
> The root casue seems that the TreeAggregate writes a  very long buffer array 
> (16000*16000*8) which exceeds jvm limit(2^31 - 1).
> Does RowMatrix really supports computing on matrices with no more than 65535 
> columns?
> I doubt that computeGramianMatrix has a very serious performance issue.
> Do anyone has done some performance expriments before?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26228) OOM issue encountered when computing Gramian matrix

2018-12-02 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706667#comment-16706667
 ] 

shahid commented on SPARK-26228:


Hi [~hibayesian], could you please share the full log of the error, if you 
have it. Thanks.

> OOM issue encountered when computing Gramian matrix 
> 
>
> Key: SPARK-26228
> URL: https://issues.apache.org/jira/browse/SPARK-26228
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Chen Lin
>Priority: Major
>
> {quote}/**
>  * Computes the Gramian matrix `A^T A`.
>   *
>  * @note This cannot be computed on matrices with more than 65535 columns.
>   */
> {quote}
> As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
> supports computing on matrices with no more than 65535 columns.
> However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
> when computing on matrices with 16000 columns.
> The root casue seems that the TreeAggregate writes a  very long buffer array 
> (16000*16000*8) which exceeds jvm limit(2^31 - 1).
> Does RowMatrix really supports computing on matrices with no more than 65535 
> columns?
> I doubt that computeGramianMatrix has a very serious performance issue.
> Do anyone has done some performance expriments before?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving

2018-12-02 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-26247:
-
Description: 
This ticket tracks an SPIP to improve model load time and model serving 
interfaces for online serving of Spark MLlib models.  The SPIP is here

[https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub]

 

The improvement opportunity exists in all versions of spark.  We developed our 
set of changes wrt version 2.1.0 and can port them forward to other versions 
(e.g., we have ported them forward to 2.3.2).

  was:
This ticket tracks an SPIP to improve model load time and model serving 
interfaces for online serving of Spark MLlib models.  The SPIP is here

[https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub]

 

The improvement opportunity exists in all versions of spark.  We developed our 
set of changes wrt version 2.1.0 and can port them forward to other versions 
(e.g., wehave ported them forward to 2.3.2).


> SPIP - ML Model Extension for no-Spark MLLib Online Serving
> ---
>
> Key: SPARK-26247
> URL: https://issues.apache.org/jira/browse/SPARK-26247
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Anne Holler
>Priority: Major
>  Labels: SPIP
>
> This ticket tracks an SPIP to improve model load time and model serving 
> interfaces for online serving of Spark MLlib models.  The SPIP is here
> [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub]
>  
> The improvement opportunity exists in all versions of spark.  We developed 
> our set of changes wrt version 2.1.0 and can port them forward to other 
> versions (e.g., we have ported them forward to 2.3.2).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving

2018-12-02 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-26247:
-
Target Version/s: 3.0.0  (was: 2.1.0)

> SPIP - ML Model Extension for no-Spark MLLib Online Serving
> ---
>
> Key: SPARK-26247
> URL: https://issues.apache.org/jira/browse/SPARK-26247
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Anne Holler
>Priority: Major
>  Labels: SPIP
>
> This ticket tracks an SPIP to improve model load time and model serving 
> interfaces for online serving of Spark MLlib models.  The SPIP is here
> [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub]
>  
> The improvement opportunity exists in all versions of spark.  We developed 
> our set of changes wrt version 2.1.0 and can port them forward to other 
> versions (e.g., wehave ported them forward to 2.3.2).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving

2018-12-02 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-26247:
-
Fix Version/s: (was: 2.1.0)

> SPIP - ML Model Extension for no-Spark MLLib Online Serving
> ---
>
> Key: SPARK-26247
> URL: https://issues.apache.org/jira/browse/SPARK-26247
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Anne Holler
>Priority: Major
>  Labels: SPIP
>
> This ticket tracks an SPIP to improve model load time and model serving 
> interfaces for online serving of Spark MLlib models.  The SPIP is here
> [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub]
>  
> The improvement opportunity exists in all versions of spark.  We developed 
> our set of changes wrt version 2.1.0 and can port them forward to other 
> versions (e.g., wehave ported them forward to 2.3.2).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26248) Infer date type from CSV

2018-12-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706494#comment-16706494
 ] 

Apache Spark commented on SPARK-26248:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23202

> Infer date type from CSV
> 
>
> Key: SPARK-26248
> URL: https://issues.apache.org/jira/browse/SPARK-26248
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, DateType cannot be inferred from CSV. To parse CSV strings, you 
> have to specify the schema explicitly if the CSV input contains dates. This 
> ticket aims to extend CSVInferSchema to support such inference.
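For context, a minimal sketch of the explicit-schema workaround that is needed today (the file path, column names, and date format are illustrative):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DateType, StringType, StructField, StructType}

object CsvDateWorkaround extends App {
  val spark = SparkSession.builder().appName("csv-date").master("local[*]").getOrCreate()

  // Until DateType can be inferred, the schema has to be supplied by hand;
  // otherwise the "birthday" column would be inferred as a plain string.
  val schema = StructType(Seq(
    StructField("name", StringType),
    StructField("birthday", DateType)))

  val df = spark.read
    .option("header", "true")
    .option("dateFormat", "yyyy-MM-dd")  // format used to parse the date column
    .schema(schema)
    .csv("people.csv")                   // illustrative path

  df.printSchema()
  spark.stop()
}
{code}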



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26248) Infer date type from CSV

2018-12-02 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-26248:
---
Summary: Infer date type from CSV  (was: Infer date type from JSON)

> Infer date type from CSV
> 
>
> Key: SPARK-26248
> URL: https://issues.apache.org/jira/browse/SPARK-26248
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, DateType cannot be inferred from CSV. To parse CSV string, you 
> have to specify schema explicitly if CSV input contains dates. This ticket 
> aims to extend CSVInferSchema to support such inferring.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26248) Infer date type from CSV

2018-12-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26248:


Assignee: (was: Apache Spark)

> Infer date type from CSV
> 
>
> Key: SPARK-26248
> URL: https://issues.apache.org/jira/browse/SPARK-26248
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, DateType cannot be inferred from CSV. To parse CSV string, you 
> have to specify schema explicitly if CSV input contains dates. This ticket 
> aims to extend CSVInferSchema to support such inferring.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26248) Infer date type from CSV

2018-12-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26248:


Assignee: Apache Spark

> Infer date type from CSV
> 
>
> Key: SPARK-26248
> URL: https://issues.apache.org/jira/browse/SPARK-26248
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, DateType cannot be inferred from CSV. To parse CSV string, you 
> have to specify schema explicitly if CSV input contains dates. This ticket 
> aims to extend CSVInferSchema to support such inferring.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26248) Infer date type from JSON

2018-12-02 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26248:
--

 Summary: Infer date type from JSON
 Key: SPARK-26248
 URL: https://issues.apache.org/jira/browse/SPARK-26248
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Currently, DateType cannot be inferred from CSV. To parse CSV string, you have 
to specify schema explicitly if CSV input contains dates. This ticket aims to 
extend CSVInferSchema to support such inferring.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving

2018-12-02 Thread Anne Holler (JIRA)
Anne Holler created SPARK-26247:
---

 Summary: SPIP - ML Model Extension for no-Spark MLLib Online 
Serving
 Key: SPARK-26247
 URL: https://issues.apache.org/jira/browse/SPARK-26247
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.1.0
Reporter: Anne Holler
 Fix For: 2.1.0


This ticket tracks an SPIP to improve model load time and model serving 
interfaces for online serving of Spark MLlib models.  The SPIP is here

[https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub]

 

The improvement opportunity exists in all versions of spark.  We developed our 
set of changes wrt version 2.1.0 and can port them forward to other versions 
(e.g., wehave ported them forward to 2.3.2).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26246) Infer date and timestamp types from JSON

2018-12-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706466#comment-16706466
 ] 

Apache Spark commented on SPARK-26246:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23201

> Infer date and timestamp types from JSON
> 
>
> Key: SPARK-26246
> URL: https://issues.apache.org/jira/browse/SPARK-26246
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, DateType and TimestampType cannot be inferred from JSON. To parse 
> JSON string, you have to specify schema explicitly if JSON input contains 
> dates or timestamps. This ticket aims to extend JsonInferSchema to support 
> such inferring.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26246) Infer date and timestamp types from JSON

2018-12-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26246:


Assignee: Apache Spark

> Infer date and timestamp types from JSON
> 
>
> Key: SPARK-26246
> URL: https://issues.apache.org/jira/browse/SPARK-26246
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, DateType and TimestampType cannot be inferred from JSON. To parse 
> JSON string, you have to specify schema explicitly if JSON input contains 
> dates or timestamps. This ticket aims to extend JsonInferSchema to support 
> such inferring.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26246) Infer date and timestamp types from JSON

2018-12-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26246:


Assignee: (was: Apache Spark)

> Infer date and timestamp types from JSON
> 
>
> Key: SPARK-26246
> URL: https://issues.apache.org/jira/browse/SPARK-26246
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, DateType and TimestampType cannot be inferred from JSON. To parse 
> JSON strings, you have to specify the schema explicitly if the JSON input contains 
> dates or timestamps. This ticket aims to extend JsonInferSchema to support 
> such inference.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26246) Infer date and timestamp types from JSON

2018-12-02 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26246:
--

 Summary: Infer date and timestamp types from JSON
 Key: SPARK-26246
 URL: https://issues.apache.org/jira/browse/SPARK-26246
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Currently, DateType and TimestampType cannot be inferred from JSON. To parse 
JSON strings, you have to specify the schema explicitly if the JSON input contains 
dates or timestamps. This ticket aims to extend JsonInferSchema to support such 
inference.
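
For context, a minimal spark-shell sketch (column names and the timestamp format 
below are illustrative) of the explicit schema that currently has to be supplied to 
get a TimestampType column out of JSON:

{code:scala}
import org.apache.spark.sql.types._
import spark.implicits._

// Without an explicit schema, "ts" is inferred as StringType.
val schema = new StructType()
  .add("id", LongType)
  .add("ts", TimestampType)

val df = spark.read
  .schema(schema)
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .json(Seq("""{"id": 1, "ts": "2018-12-02 10:30:00"}""").toDS)

df.printSchema()  // ts is read as timestamp
{code}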



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-26139) Support passing shuffle metrics to exchange operator

2018-12-02 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-26139:

Comment: was deleted

(was: User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/23128)

> Support passing shuffle metrics to exchange operator
> 
>
> Key: SPARK-26139
> URL: https://issues.apache.org/jira/browse/SPARK-26139
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> Due to the way Spark is architected (SQL is defined on top of the RDD API), 
> there are two separate metrics systems used in core vs SQL. Ideally, we'd want 
> to be able to get the shuffle metrics for each exchange operator 
> independently, e.g. blocks read, number of records.
>  
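
For orientation, a small spark-shell sketch (assuming Spark 2.4 internals; the class 
and metric names are from the SQL side) of how the SQL metrics already attached to an 
exchange can be inspected, which is distinct from the core shuffle read metrics this 
ticket proposes to surface:

{code:scala}
import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec

val df = spark.range(1000).repartition(4).groupBy("id").count()
df.collect()  // run the job so the metric values are populated

// Pull the exchange operators out of the physical plan and print their SQL metrics.
val exchanges = df.queryExecution.executedPlan.collect {
  case e: ShuffleExchangeExec => e
}
exchanges.foreach { e =>
  e.metrics.foreach { case (name, metric) => println(s"$name = ${metric.value}") }
}
{code}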



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26165) Date and Timestamp column expression is getting converted to string in less than/greater than filter query even though valid date/timestamp string literal is used in th

2018-12-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26165.
---
Resolution: Won't Fix

> Date and Timestamp column expression is getting converted to string in less 
> than/greater than filter query even though valid date/timestamp string 
> literal is used in the right side filter expression
> --
>
> Key: SPARK-26165
> URL: https://issues.apache.org/jira/browse/SPARK-26165
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sujith
>Priority: Major
> Attachments: image-2018-11-26-13-00-36-896.png, 
> image-2018-11-26-13-01-28-299.png, timestamp_filter_perf.PNG
>
>
> Date and Timestamp columns are getting converted to string in less than/greater 
> than filter queries even though a valid date/timestamp string literal that 
> contains a time, like '2018-03-18 12:39:40', is used on the right side. The 
> column is cast to string instead of the literal being cast to a timestamp.
>  
> scala> spark.sql("""explain extended SELECT username FROM orders WHERE 
> order_creation_date > '2017-02-26 13:45:12'""").show(false);
> +---
> |== Parsed Logical Plan ==
> 'Project ['username]
> +- 'Filter ('order_creation_date > 2017-02-26 13:45:12)
>  +- 'UnresolvedRelation `orders`
> == Analyzed Logical Plan ==
> username: string
> Project [username#59]
> +- Filter (cast(order_creation_date#60 as string) > 2017-02-26 13:45:12)
>  +- SubqueryAlias orders
>  +- HiveTableRelation `default`.`orders`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [username#59, 
> order_creation_date#60, amount#61]
> == Optimized Logical Plan ==
> Project [username#59]
> +- Filter (isnotnull(order_creation_date#60) && (cast(order_creation_date#60 
> as string) > 2017-02-26 13:45:12))
>  +- HiveTableRelation `default`.`orders`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [username#59, 
> order_creation_date#60, amount#61]
> == Physical Plan ==
> *(1) Project [username#59]
> +- *(1) Filter (isnotnull(order_creation_date#60) && 
> (cast(order_creation_date#60 as string) > 2017-02-26 13:45:12))
>  +- HiveTableScan [order_creation_date#60, username#59], HiveTableRelation 
> `default`.`orders`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [username#59, order_creation
> +
> -
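
One way to keep the comparison in the timestamp domain, as a hedged workaround sketch 
against the same orders table, is to cast the literal rather than letting the column 
be cast to string:

{code:scala}
// Cast the literal (not the column) so the filter compares timestamp values directly.
spark.sql("""
  SELECT username
  FROM orders
  WHERE order_creation_date > CAST('2017-02-26 13:45:12' AS TIMESTAMP)
""").show(false)
{code}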



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26193) Implement shuffle write metrics in SQL

2018-12-02 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706368#comment-16706368
 ] 

Reynold Xin commented on SPARK-26193:
-

Can we simplify it and add those metrics only to the same exchange operator as 
the read side?



> Implement shuffle write metrics in SQL
> --
>
> Key: SPARK-26193
> URL: https://issues.apache.org/jira/browse/SPARK-26193
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Yuanjian Li
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26193) Implement shuffle write metrics in SQL

2018-12-02 Thread Yuanjian Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706359#comment-16706359
 ] 

Yuanjian Li commented on SPARK-26193:
-

cc [~smilegator] [~cloud_fan] and [~rxin]: because the writer side of the shuffle 
metrics needs more changes than the reader side, I've added a sketch design and demo 
doc in this JIRA. I'll open a PR soon after you confirm that the implementation 
described in the doc is OK. Thanks :) 

> Implement shuffle write metrics in SQL
> --
>
> Key: SPARK-26193
> URL: https://issues.apache.org/jira/browse/SPARK-26193
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Yuanjian Li
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26198) Metadata serialize null values throw NPE

2018-12-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26198.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23164
[https://github.com/apache/spark/pull/23164]

> Metadata serialize null values throw NPE
> 
>
> Key: SPARK-26198
> URL: https://issues.apache.org/jira/browse/SPARK-26198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> How to reproduce this issue:
> {code}
> scala> val meta = new 
> org.apache.spark.sql.types.MetadataBuilder().putNull("key").build().json
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196)
>   at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26198) Metadata serialize null values throw NPE

2018-12-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26198:
-

Assignee: Yuming Wang

> Metadata serialize null values throw NPE
> 
>
> Key: SPARK-26198
> URL: https://issues.apache.org/jira/browse/SPARK-26198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> How to reproduce this issue:
> {code}
> scala> val meta = new 
> org.apache.spark.sql.types.MetadataBuilder().putNull("key").build().json
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196)
>   at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26034) Break large mllib/tests.py files into smaller files

2018-12-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706176#comment-16706176
 ] 

Apache Spark commented on SPARK-26034:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23200

> Break large mllib/tests.py files into smaller files
> ---
>
> Key: SPARK-26034
> URL: https://issues.apache.org/jira/browse/SPARK-26034
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26034) Break large mllib/tests.py files into smaller files

2018-12-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706175#comment-16706175
 ] 

Apache Spark commented on SPARK-26034:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23200

> Break large mllib/tests.py files into smaller files
> ---
>
> Key: SPARK-26034
> URL: https://issues.apache.org/jira/browse/SPARK-26034
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26033) Break large ml/tests.py files into smaller files

2018-12-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706174#comment-16706174
 ] 

Apache Spark commented on SPARK-26033:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23200

> Break large ml/tests.py files into smaller files
> 
>
> Key: SPARK-26033
> URL: https://issues.apache.org/jira/browse/SPARK-26033
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26242) Leading slash breaks proxying

2018-12-02 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-26242.
-
Resolution: Not A Problem

> Leading slash breaks proxying
> -
>
> Key: SPARK-26242
> URL: https://issues.apache.org/jira/browse/SPARK-26242
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Ryan Lovett
>Priority: Minor
>
> The WebUI prefixes "/" at the beginning of each link path (e.g. /jobs) which 
> breaks navigation when attempting to proxy the app at another URL. In my 
> case, a pyspark user creates a SparkContext within a JupyterHub-hosted 
> notebook and attempts to proxy it with nbserverproxy off of 
> address.of.jupyter.hub/user/proxy/4040/. Since WebUI sets the URLs of its 
> pages to begin with "/", the browser sends the user to 
> address.of.jupyter.hub/jobs, address.of.jupyter.hub/stages, etc.
>  
> Similar: 
> [https://github.com/mesosphere/spark/commit/ada99f1b3801e81db2e367f219377e93f5d32655|https://github.com/apache/spark/pull/11369]
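
A possible workaround sketch for this kind of setup, assuming the spark.ui.proxyBase 
setting is honored by the driver UI (the notebook proxy path below is illustrative):

{code:python}
from pyspark.sql import SparkSession

# Prefix the UI's link paths so that "/jobs" resolves under the proxied path
# behind nbserverproxy instead of the host root (path is illustrative).
spark = (SparkSession.builder
         .config("spark.ui.proxyBase", "/user/jovyan/proxy/4040")
         .getOrCreate())
{code}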



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26242) Leading slash breaks proxying

2018-12-02 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706157#comment-16706157
 ] 

Marco Gaido commented on SPARK-26242:
-

Let me close this. Please reopen only if you find issues. In the future, if you 
have questions, please send them to the mailing lists and open a JIRA only if you 
find incorrect behavior. Thanks.

> Leading slash breaks proxying
> -
>
> Key: SPARK-26242
> URL: https://issues.apache.org/jira/browse/SPARK-26242
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Ryan Lovett
>Priority: Minor
>
> The WebUI prefixes "/" at the beginning of each link path (e.g. /jobs) which 
> breaks navigation when attempting to proxy the app at another URL. In my 
> case, a pyspark user creates a SparkContext within a JupyterHub-hosted 
> notebook and attempts to proxy it with nbserverproxy off of 
> address.of.jupyter.hub/user/proxy/4040/. Since WebUI sets the URLs of its 
> pages to begin with "/", the browser sends the user to 
> address.of.jupyter.hub/jobs, address.of.jupyter.hub/stages, etc.
>  
> Similar: 
> [https://github.com/mesosphere/spark/commit/ada99f1b3801e81db2e367f219377e93f5d32655|https://github.com/apache/spark/pull/11369]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23899) Built-in SQL Function Improvement

2018-12-02 Thread Arseniy Tashoyan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706156#comment-16706156
 ] 

Arseniy Tashoyan commented on SPARK-23899:
--

What do you think about this one: SPARK-23693?

> Built-in SQL Function Improvement
> -
>
> Key: SPARK-23899
> URL: https://issues.apache.org/jira/browse/SPARK-23899
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
> Fix For: 2.4.0
>
>
> This umbrella JIRA is to improve compatibility with other data processing 
> systems, including Hive, Teradata, Presto, Postgres, MySQL, DB2, Oracle, and 
> MS SQL Server.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26080) Unable to run worker.py on Windows

2018-12-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26080:


Assignee: Hyukjin Kwon

> Unable to run worker.py on Windows
> --
>
> Key: SPARK-26080
> URL: https://issues.apache.org/jira/browse/SPARK-26080
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Windows 10 Education 64 bit
>Reporter: Hayden Jeune
>Assignee: Hyukjin Kwon
>Priority: Blocker
>
> Use of the resource module in Python means worker.py cannot run on a Windows 
> system. This package is only available in Unix-based environments.
> [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25]
> {code:python}
> textFile = sc.textFile("README.md")
> textFile.first()
> {code}
> When the above commands are run I receive the error 'worker failed to connect 
> back', and I can see an exception in the console coming from worker.py saying 
> 'ModuleNotFoundError: No module named resource'
> I do not really know enough about what I'm doing to fix this myself. 
> Apologies if there's something simple I'm missing here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26080) Unable to run worker.py on Windows

2018-12-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26080.
--
   Resolution: Fixed
Fix Version/s: 2.4.1
   3.0.0

Issue resolved by pull request 23055
[https://github.com/apache/spark/pull/23055]

> Unable to run worker.py on Windows
> --
>
> Key: SPARK-26080
> URL: https://issues.apache.org/jira/browse/SPARK-26080
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Windows 10 Education 64 bit
>Reporter: Hayden Jeune
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Fix For: 3.0.0, 2.4.1
>
>
> Use of the resource module in Python means worker.py cannot run on a Windows 
> system. This package is only available in Unix-based environments.
> [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25]
> {code:python}
> textFile = sc.textFile("README.md")
> textFile.first()
> {code}
> When the above commands are run I receive the error 'worker failed to connect 
> back', and I can see an exception in the console coming from worker.py saying 
> 'ModuleNotFoundError: No module named resource'
> I do not really know enough about what I'm doing to fix this myself. 
> Apologies if there's something simple I'm missing here.
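
The fix presumably makes the import optional; a minimal sketch of such a platform 
guard (the flag name is illustrative, not necessarily what the patch uses):

{code:python}
# 'resource' is POSIX-only, so guard the import instead of failing at module load.
try:
    import resource
    has_resource_module = True
except ImportError:
    resource = None
    has_resource_module = False

def report_memory_limit():
    # Only query RLIMIT_AS where the resource module actually exists.
    if has_resource_module:
        soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        print("address-space rlimit: soft=%s hard=%s" % (soft, hard))
{code}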



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26208) Empty dataframe does not roundtrip for csv with header

2018-12-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26208.
--
   Resolution: Fixed
 Assignee: Koert Kuipers
Fix Version/s: 3.0.0

Fixed in https://github.com/apache/spark/pull/23173

> Empty dataframe does not roundtrip for csv with header
> --
>
> Key: SPARK-26208
> URL: https://issues.apache.org/jira/browse/SPARK-26208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: master branch,
> commit 034ae305c33b1990b3c1a284044002874c343b4d,
> date:   Sun Nov 18 16:02:15 2018 +0800
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>Priority: Minor
> Fix For: 3.0.0
>
>
> When we write an empty part file for CSV with header=true, we fail to write the 
> header, and the result cannot be read back in. When header=true, a part file with 
> zero rows should still have a header.
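
A minimal spark-shell repro sketch (paths and column names are illustrative):

{code:scala}
import spark.implicits._

// Zero-row DataFrame with a known schema.
val empty = Seq.empty[(String, Int)].toDF("name", "age")
empty.write.option("header", "true").csv("/tmp/empty_csv")

// Without a header line in the empty part files, the schema cannot be
// recovered when the directory is read back with header=true.
spark.read.option("header", "true").csv("/tmp/empty_csv").printSchema()
{code}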



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26245) Add Float literal

2018-12-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706146#comment-16706146
 ] 

Apache Spark commented on SPARK-26245:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/23199

> Add Float literal
> -
>
> Key: SPARK-26245
> URL: https://issues.apache.org/jira/browse/SPARK-26245
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26245) Add Float literal

2018-12-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706145#comment-16706145
 ] 

Apache Spark commented on SPARK-26245:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/23199

> Add Float literal
> -
>
> Key: SPARK-26245
> URL: https://issues.apache.org/jira/browse/SPARK-26245
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26245) Add Float literal

2018-12-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26245:


Assignee: Apache Spark

> Add Float literal
> -
>
> Key: SPARK-26245
> URL: https://issues.apache.org/jira/browse/SPARK-26245
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26245) Add Float literal

2018-12-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26245:


Assignee: (was: Apache Spark)

> Add Float literal
> -
>
> Key: SPARK-26245
> URL: https://issues.apache.org/jira/browse/SPARK-26245
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26245) Add Float literal

2018-12-02 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-26245:
---

 Summary: Add Float literal
 Key: SPARK-26245
 URL: https://issues.apache.org/jira/browse/SPARK-26245
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org