[jira] [Commented] (SPARK-23382) The tables in the Spark Streaming UI need hide/show features when they contain many records.

2018-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359720#comment-16359720
 ] 

Apache Spark commented on SPARK-23382:
--

User 'guoxiaolongzte' has created a pull request for this issue:
https://github.com/apache/spark/pull/20570

> The tables in the Spark Streaming UI need hide/show features when they 
> contain many records.
> --
>
> Key: SPARK-23382
> URL: https://issues.apache.org/jira/browse/SPARK-23382
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> The tables in the Spark Streaming UI need hide/show features when they 
> contain many records.
> For the specific reasons, please refer to 
> https://issues.apache.org/jira/browse/SPARK-23024






[jira] [Assigned] (SPARK-23382) The tables in the Spark Streaming UI need hide/show features when they contain many records.

2018-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23382:


Assignee: (was: Apache Spark)

> The tables in the Spark Streaming UI need hide/show features when they 
> contain many records.
> --
>
> Key: SPARK-23382
> URL: https://issues.apache.org/jira/browse/SPARK-23382
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> The tables in the Spark Streaming UI need hide/show features when they 
> contain many records.
> For the specific reasons, please refer to 
> https://issues.apache.org/jira/browse/SPARK-23024






[jira] [Assigned] (SPARK-23382) The tables in the Spark Streaming UI need hide/show features when they contain many records.

2018-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23382:


Assignee: Apache Spark

> The tables in the Spark Streaming UI need hide/show features when they 
> contain many records.
> --
>
> Key: SPARK-23382
> URL: https://issues.apache.org/jira/browse/SPARK-23382
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Assignee: Apache Spark
>Priority: Minor
>
> The tables in the Spark Streaming UI need hide/show features when they 
> contain many records.
> For the specific reasons, please refer to 
> https://issues.apache.org/jira/browse/SPARK-23024






[jira] [Comment Edited] (SPARK-10912) Improve Spark metrics executor.filesystem

2018-02-10 Thread Harel Ben Attia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359810#comment-16359810
 ] 

Harel Ben Attia edited comment on SPARK-10912 at 2/11/18 6:25 AM:
--

We would really be glad to see this happen as well, without the need to change 
Spark's source code.

Also, externalizing the array of schemes to a configuration property in 
metrics.properties would be best (or automatically supporting each FileSystem 
scheme in use, obviously, but that might require bigger changes to the 
registration logic, so it is not strictly necessary).

btw, [~srowen] - the main benefit of getting this from Spark itself is that it 
provides the filesystem data on a per-executor/driver basis rather than 
aggregated, which allows for much better debugging and troubleshooting.


was (Author: harelba):
We would really be glad to see this happen as well, without the need to change 
Spark's source code.

Also, externalizing the array of schemes to a configuration property in 
metrics.properties would be best (or automatically supporting each FileSystem 
scheme in use, obviously, but that might require bigger changes to the 
registration logic, so it is not strictly necessary).

> Improve Spark metrics executor.filesystem
> -
>
> Key: SPARK-10912
> URL: https://issues.apache.org/jira/browse/SPARK-10912
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.5.0
>Reporter: Yongjia Wang
>Priority: Minor
> Attachments: s3a_metrics.patch
>
>
> org.apache.spark.executor.ExecutorSource has 2 filesystem metrics: "hdfs" and 
> "file". I started using s3 as the persistent storage with a Spark standalone 
> cluster in EC2, and s3 read/write metrics do not appear anywhere. The 'file' 
> metric appears to cover only the driver reading local files; it would also be 
> nice to report shuffle read/write metrics, so it can help with optimization.
> I think these 2 things (s3 and shuffle) are very useful and cover all the 
> missing information about Spark IO, especially for an s3 setup.






[jira] [Commented] (SPARK-10912) Improve Spark metrics executor.filesystem

2018-02-10 Thread Harel Ben Attia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359810#comment-16359810
 ] 

Harel Ben Attia commented on SPARK-10912:
-

We would really be glad to see this happen as well, without the need to change 
Spark's source code.

Also, externalizing the array of schemes to a configuration property in 
metrics.properties would be best (or automatically supporting each FileSystem 
scheme in use, obviously, but that might require bigger changes to the 
registration logic, so it is not strictly necessary).
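
For illustration only, here is a minimal sketch of what a configuration-driven 
version of this could look like: per-scheme filesystem gauges registered the way 
ExecutorSource registers "hdfs" and "file" today, with the scheme list supplied 
from outside. The class name and the idea of parsing the list out of 
metrics.properties are assumptions of this sketch, not an existing Spark API.

{code:scala}
// Sketch only: not Spark's actual ExecutorSource; the scheme list here is a plain
// constructor argument standing in for a hypothetical metrics.properties entry.
import scala.collection.JavaConverters._
import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.hadoop.fs.FileSystem

class ConfigurableFsMetrics(schemes: Seq[String]) {
  val metricRegistry = new MetricRegistry()

  // Hadoop keeps per-scheme FileSystem statistics for the current JVM.
  private def stats(scheme: String): Option[FileSystem.Statistics] =
    FileSystem.getAllStatistics.asScala.find(_.getScheme == scheme)

  private def register(scheme: String, name: String, f: FileSystem.Statistics => Long): Unit =
    metricRegistry.register(MetricRegistry.name("filesystem", scheme, name), new Gauge[Long] {
      override def getValue: Long = stats(scheme).map(f).getOrElse(0L)
    })

  // e.g. schemes = Seq("hdfs", "file", "s3a"), instead of the hard-coded pair.
  schemes.foreach { s =>
    register(s, "read_bytes", _.getBytesRead)
    register(s, "write_bytes", _.getBytesWritten)
    register(s, "read_ops", _.getReadOps.toLong)
    register(s, "write_ops", _.getWriteOps.toLong)
  }
}
{code}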

> Improve Spark metrics executor.filesystem
> -
>
> Key: SPARK-10912
> URL: https://issues.apache.org/jira/browse/SPARK-10912
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.5.0
>Reporter: Yongjia Wang
>Priority: Minor
> Attachments: s3a_metrics.patch
>
>
> org.apache.spark.executor.ExecutorSource has 2 filesystem metrics: "hdfs" and 
> "file". I started using s3 as the persistent storage with a Spark standalone 
> cluster in EC2, and s3 read/write metrics do not appear anywhere. The 'file' 
> metric appears to cover only the driver reading local files; it would also be 
> nice to report shuffle read/write metrics, so it can help with optimization.
> I think these 2 things (s3 and shuffle) are very useful and cover all the 
> missing information about Spark IO, especially for an s3 setup.






[jira] [Created] (SPARK-23382) The tables in the Spark Streaming UI need hide/show features when they contain many records.

2018-02-10 Thread guoxiaolongzte (JIRA)
guoxiaolongzte created SPARK-23382:
--

 Summary: The tables in the Spark Streaming UI need hide/show features 
when they contain many records.
 Key: SPARK-23382
 URL: https://issues.apache.org/jira/browse/SPARK-23382
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.4.0
Reporter: guoxiaolongzte


The tables in the Spark Streaming UI need hide/show features when they contain 
many records.

For the specific reasons, please refer to 
https://issues.apache.org/jira/browse/SPARK-23024






[jira] [Updated] (SPARK-23383) make-distribution.sh should exit with usage when detecting wrong options

2018-02-10 Thread Kent Yao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-23383:
-
Description: 

{code:java}
./dev/make-distribution.sh --name ne-1.0.0-SNAPSHOT xyz --tgz  -Phadoop-2.7
+++ dirname ./dev/make-distribution.sh
++ cd ./dev/..
++ pwd
+ SPARK_HOME=/Users/Kent/Documents/spark
+ DISTDIR=/Users/Kent/Documents/spark/dist
+ MAKE_TGZ=false
+ MAKE_PIP=false
+ MAKE_R=false
+ NAME=none
+ MVN=/Users/Kent/Documents/spark/build/mvn
+ ((  5  ))
+ case $1 in
+ NAME=ne-1.0.0-SNAPSHOT
+ shift
+ shift
+ ((  3  ))
+ case $1 in
+ break
+ '[' -z /Users/Kent/.jenv/candidates/java/current ']'
+ '[' -z /Users/Kent/.jenv/candidates/java/current ']'
++ command -v git
+ '[' /usr/local/bin/git ']'
++ git rev-parse --short HEAD
+ GITREV=98ea6a7
+ '[' '!' -z 98ea6a7 ']'
+ GITREVSTRING=' (git revision 98ea6a7)'
+ unset GITREV
++ command -v /Users/Kent/Documents/spark/build/mvn
+ '[' '!' /Users/Kent/Documents/spark/build/mvn ']'
++ /Users/Kent/Documents/spark/build/mvn help:evaluate 
-Dexpression=project.version xyz --tgz -Phadoop-2.7
++ grep -v INFO
++ tail -n 1
+ VERSION=' -X,--debug Produce execution debug 
output'
{code}

It would be better to report the invalid options and exit with the usage message.


  was:
```
./dev/make-distribution.sh --name ne-1.0.0-SNAPSHOT xyz --tgz  -Phadoop-2.7
+++ dirname ./dev/make-distribution.sh
++ cd ./dev/..
++ pwd
+ SPARK_HOME=/Users/Kent/Documents/spark
+ DISTDIR=/Users/Kent/Documents/spark/dist
+ MAKE_TGZ=false
+ MAKE_PIP=false
+ MAKE_R=false
+ NAME=none
+ MVN=/Users/Kent/Documents/spark/build/mvn
+ ((  5  ))
+ case $1 in
+ NAME=ne-1.0.0-SNAPSHOT
+ shift
+ shift
+ ((  3  ))
+ case $1 in
+ break
+ '[' -z /Users/Kent/.jenv/candidates/java/current ']'
+ '[' -z /Users/Kent/.jenv/candidates/java/current ']'
++ command -v git
+ '[' /usr/local/bin/git ']'
++ git rev-parse --short HEAD
+ GITREV=98ea6a7
+ '[' '!' -z 98ea6a7 ']'
+ GITREVSTRING=' (git revision 98ea6a7)'
+ unset GITREV
++ command -v /Users/Kent/Documents/spark/build/mvn
+ '[' '!' /Users/Kent/Documents/spark/build/mvn ']'
++ /Users/Kent/Documents/spark/build/mvn help:evaluate 
-Dexpression=project.version xyz --tgz -Phadoop-2.7
++ grep -v INFO
++ tail -n 1
+ VERSION=' -X,--debug Produce execution debug 
output'
```

It would be better to report the invalid options and exit with the usage message.


> make-distribution.sh should exit with usage when detecting wrong options
> 
>
> Key: SPARK-23383
> URL: https://issues.apache.org/jira/browse/SPARK-23383
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Kent Yao
>Priority: Minor
>
> {code:java}
> ./dev/make-distribution.sh --name ne-1.0.0-SNAPSHOT xyz --tgz  -Phadoop-2.7
> +++ dirname ./dev/make-distribution.sh
> ++ cd ./dev/..
> ++ pwd
> + SPARK_HOME=/Users/Kent/Documents/spark
> + DISTDIR=/Users/Kent/Documents/spark/dist
> + MAKE_TGZ=false
> + MAKE_PIP=false
> + MAKE_R=false
> + NAME=none
> + MVN=/Users/Kent/Documents/spark/build/mvn
> + ((  5  ))
> + case $1 in
> + NAME=ne-1.0.0-SNAPSHOT
> + shift
> + shift
> + ((  3  ))
> + case $1 in
> + break
> + '[' -z /Users/Kent/.jenv/candidates/java/current ']'
> + '[' -z /Users/Kent/.jenv/candidates/java/current ']'
> ++ command -v git
> + '[' /usr/local/bin/git ']'
> ++ git rev-parse --short HEAD
> + GITREV=98ea6a7
> + '[' '!' -z 98ea6a7 ']'
> + GITREVSTRING=' (git revision 98ea6a7)'
> + unset GITREV
> ++ command -v /Users/Kent/Documents/spark/build/mvn
> + '[' '!' /Users/Kent/Documents/spark/build/mvn ']'
> ++ /Users/Kent/Documents/spark/build/mvn help:evaluate 
> -Dexpression=project.version xyz --tgz -Phadoop-2.7
> ++ grep -v INFO
> ++ tail -n 1
> + VERSION=' -X,--debug Produce execution debug 
> output'
> {code}
> It would be better to report the invalid options and exit with the usage message.






[jira] [Updated] (SPARK-23340) Update ORC to 1.4.3

2018-02-10 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23340:
--
Description: 
This issue updates Apache ORC dependencies to 1.4.3 released on February 9th.
Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 more 
patches (https://s.apache.org/Fll8).

  was:ORC 1.4.2 is released on January 23rd. This release removes unnecessary 
dependencies.


>  Update ORC to 1.4.3
> 
>
> Key: SPARK-23340
> URL: https://issues.apache.org/jira/browse/SPARK-23340
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue updates Apache ORC dependencies to 1.4.3 released on February 9th.
> Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 
> more patches (https://s.apache.org/Fll8).






[jira] [Created] (SPARK-23385) Allow customized SparkUITabs to be added via SparkConf and loaded when creating SparkUI

2018-02-10 Thread Lantao Jin (JIRA)
Lantao Jin created SPARK-23385:
--

 Summary: Allow customized SparkUITabs to be added via SparkConf and 
loaded when creating SparkUI
 Key: SPARK-23385
 URL: https://issues.apache.org/jira/browse/SPARK-23385
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Lantao Jin


It would be nice if there were a mechanism that allowed customized SparkUITabs 
(like the built-in Jobs, Stages, Storage, Environment, and Executors tabs) to be 
registered through SparkConf settings. This would be more flexible when we need 
to display some special information in the UI, rather than adding built-in tabs 
one by one and waiting for the community to merge them.

I propose to introduce a new configuration option, spark.extraUITabs, that 
allows customized WebUITab to be specified in SparkConf and registered when 
SparkUI is created. Here is the proposed documentation for the new option:
{quote}
A comma-separated list of classes that implement SparkUITab; when initializing 
SparkUI, instances of these classes will be created and registered to the tabs 
array in SparkUI. If a class has a two-argument constructor that accepts a 
SparkUI and an AppStatusStore, that constructor will be called; otherwise, if it 
has a single-argument constructor that accepts a SparkUI, that one will be 
called; otherwise, a zero-argument constructor will be called. If no valid 
constructor can be found, the SparkUI creation will fail with an exception.
{quote}
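
As a purely illustrative sketch of the proposal (spark.extraUITabs does not 
exist today, SparkUITab is currently private[spark], and the class and package 
names below are made up), a tab using the two-argument constructor form 
described above might look like this:

{code:scala}
// Hypothetical example of the proposed mechanism; not an existing Spark API.
import org.apache.spark.SparkConf
import org.apache.spark.status.AppStatusStore
import org.apache.spark.ui.{SparkUI, SparkUITab}

// Two-argument constructor form: (SparkUI, AppStatusStore).
class JobSummaryTab(parent: SparkUI, store: AppStatusStore)
  extends SparkUITab(parent, "jobSummary") {
  // Pages would be attached here, e.g. attachPage(new JobSummaryPage(this, store))
}

// Registration via the proposed option; SparkUI would instantiate the class reflectively.
val conf = new SparkConf().set("spark.extraUITabs", "com.example.ui.JobSummaryTab")
{code}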








[jira] [Assigned] (SPARK-23385) Allow customized SparkUITabs to be added via SparkConf and loaded when creating SparkUI

2018-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23385:


Assignee: (was: Apache Spark)

> Allow customized SparkUITabs to be added via SparkConf and loaded when 
> creating SparkUI
> --
>
> Key: SPARK-23385
> URL: https://issues.apache.org/jira/browse/SPARK-23385
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Lantao Jin
>Priority: Major
>
> It would be nice if there were a mechanism that allowed customized SparkUITabs 
> (like the built-in Jobs, Stages, Storage, Environment, and Executors tabs) to 
> be registered through SparkConf settings. This would be more flexible when we 
> need to display some special information in the UI, rather than adding 
> built-in tabs one by one and waiting for the community to merge them.
> I propose to introduce a new configuration option, spark.extraUITabs, that 
> allows customized WebUITab to be specified in SparkConf and registered when 
> SparkUI is created. Here is the proposed documentation for the new option:
> {quote}
> A comma-separated list of classes that implement SparkUITab; when 
> initializing SparkUI, instances of these classes will be created and 
> registered to the tabs array in SparkUI. If a class has a two-argument 
> constructor that accepts a SparkUI and an AppStatusStore, that constructor 
> will be called; otherwise, if it has a single-argument constructor that 
> accepts a SparkUI, that one will be called; otherwise, a zero-argument 
> constructor will be called. If no valid constructor can be found, the 
> SparkUI creation will fail with an exception.
> {quote}






[jira] [Commented] (SPARK-23385) Allow customized SparkUITabs to be added via SparkConf and loaded when creating SparkUI

2018-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359825#comment-16359825
 ] 

Apache Spark commented on SPARK-23385:
--

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20574

> Allow customized SparkUITabs to be added via SparkConf and loaded when 
> creating SparkUI
> --
>
> Key: SPARK-23385
> URL: https://issues.apache.org/jira/browse/SPARK-23385
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Lantao Jin
>Priority: Major
>
> It would be nice if there were a mechanism that allowed customized SparkUITabs 
> (like the built-in Jobs, Stages, Storage, Environment, and Executors tabs) to 
> be registered through SparkConf settings. This would be more flexible when we 
> need to display some special information in the UI, rather than adding 
> built-in tabs one by one and waiting for the community to merge them.
> I propose to introduce a new configuration option, spark.extraUITabs, that 
> allows customized WebUITab to be specified in SparkConf and registered when 
> SparkUI is created. Here is the proposed documentation for the new option:
> {quote}
> A comma-separated list of classes that implement SparkUITab; when 
> initializing SparkUI, instances of these classes will be created and 
> registered to the tabs array in SparkUI. If a class has a two-argument 
> constructor that accepts a SparkUI and an AppStatusStore, that constructor 
> will be called; otherwise, if it has a single-argument constructor that 
> accepts a SparkUI, that one will be called; otherwise, a zero-argument 
> constructor will be called. If no valid constructor can be found, the 
> SparkUI creation will fail with an exception.
> {quote}






[jira] [Assigned] (SPARK-23385) Allow customized SparkUITabs to be added via SparkConf and loaded when creating SparkUI

2018-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23385:


Assignee: Apache Spark

> Allow customized SparkUITabs to be added via SparkConf and loaded when 
> creating SparkUI
> --
>
> Key: SPARK-23385
> URL: https://issues.apache.org/jira/browse/SPARK-23385
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Major
>
> It would be nice if there were a mechanism that allowed customized SparkUITabs 
> (like the built-in Jobs, Stages, Storage, Environment, and Executors tabs) to 
> be registered through SparkConf settings. This would be more flexible when we 
> need to display some special information in the UI, rather than adding 
> built-in tabs one by one and waiting for the community to merge them.
> I propose to introduce a new configuration option, spark.extraUITabs, that 
> allows customized WebUITab to be specified in SparkConf and registered when 
> SparkUI is created. Here is the proposed documentation for the new option:
> {quote}
> A comma-separated list of classes that implement SparkUITab; when 
> initializing SparkUI, instances of these classes will be created and 
> registered to the tabs array in SparkUI. If a class has a two-argument 
> constructor that accepts a SparkUI and an AppStatusStore, that constructor 
> will be called; otherwise, if it has a single-argument constructor that 
> accepts a SparkUI, that one will be called; otherwise, a zero-argument 
> constructor will be called. If no valid constructor can be found, the 
> SparkUI creation will fail with an exception.
> {quote}






[jira] [Assigned] (SPARK-23386) Enable direct application links before replay

2018-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23386:


Assignee: Apache Spark

> Enable direct application links before replay
> -
>
> Key: SPARK-23386
> URL: https://issues.apache.org/jira/browse/SPARK-23386
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>Assignee: Apache Spark
>Priority: Major
>
> In a deployment with tens of thousands of large event logs it may take *many 
> hours* until all logs are replayed. Most of our users reach the SHS by 
> clicking on a link in a client log when an error occurs. Direct links 
> currently don't work until the event log is processed in a replay thread. 
> This Jira proposes to link the appid to its event log already during the 
> scan, without a full replay. This makes on-demand retrievals available 
> almost immediately after the SHS starts.






[jira] [Assigned] (SPARK-23386) Enable direct application links before replay

2018-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23386:


Assignee: (was: Apache Spark)

> Enable direct application links before replay
> -
>
> Key: SPARK-23386
> URL: https://issues.apache.org/jira/browse/SPARK-23386
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>Priority: Major
>
> In a deployment with tens of thousands of large event logs it may take *many 
> hours* until all logs are replayed. Most of our users reach the SHS by 
> clicking on a link in a client log when an error occurs. Direct links 
> currently don't work until the event log is processed in a replay thread. 
> This Jira proposes to link the appid to its event log already during the 
> scan, without a full replay. This makes on-demand retrievals available 
> almost immediately after the SHS starts.






[jira] [Commented] (SPARK-23386) Enable direct application links before replay

2018-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359831#comment-16359831
 ] 

Apache Spark commented on SPARK-23386:
--

User 'gerashegalov' has created a pull request for this issue:
https://github.com/apache/spark/pull/20575

> Enable direct application links before replay
> -
>
> Key: SPARK-23386
> URL: https://issues.apache.org/jira/browse/SPARK-23386
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>Priority: Major
>
> In a deployment with tens of thousands of large event logs it may take *many 
> hours* until all logs are replayed. Most of our users reach the SHS by 
> clicking on a link in a client log when an error occurs. Direct links 
> currently don't work until the event log is processed in a replay thread. 
> This Jira proposes to link the appid to its event log already during the 
> scan, without a full replay. This makes on-demand retrievals available 
> almost immediately after the SHS starts.






[jira] [Commented] (SPARK-23381) Murmur3 hash generates a different value from other implementations

2018-02-10 Thread Shintaro Murakami (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359690#comment-16359690
 ] 

Shintaro Murakami commented on SPARK-23381:
---

FeatureHasher in MLlib uses Murmur3 to hash feature indices. If I make an online 
prediction in another environment, such as a C++ prediction server, the indices 
do not match and the model cannot predict correctly.
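
As a rough way to see the mismatch (a sketch, not the MLlib code; the seed 42 
and the test string are arbitrary), Spark's internal Murmur3 can be compared 
with Guava's on a byte array whose length is not a multiple of 4:

{code:scala}
// Sketch: compare Spark's internal Murmur3 with Guava's implementation.
import com.google.common.hash.Hashing
import org.apache.spark.unsafe.Platform
import org.apache.spark.unsafe.hash.Murmur3_x86_32

val bytes = "abc".getBytes("UTF-8")  // length 3, not a multiple of 4
val sparkHash = Murmur3_x86_32.hashUnsafeBytes(bytes, Platform.BYTE_ARRAY_OFFSET, bytes.length, 42)
val guavaHash = Hashing.murmur3_32(42).hashBytes(bytes).asInt()
println(s"spark=$sparkHash guava=$guavaHash")  // the two values can differ for such lengths
{code}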

> Murmur3 hash generates a different value from other implementations
> ---
>
> Key: SPARK-23381
> URL: https://issues.apache.org/jira/browse/SPARK-23381
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Shintaro Murakami
>Priority: Major
>
> Murmur3 hash generates a different value from the original and other 
> implementations (such as the Scala standard library and Guava) when the length 
> of a byte array is not a multiple of 4.






[jira] [Commented] (SPARK-23383) make-distribution.sh should exit with usage when detecting wrong options

2018-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359763#comment-16359763
 ] 

Apache Spark commented on SPARK-23383:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/20571

> make-distribution.sh should exit with usage when detecting wrong options
> 
>
> Key: SPARK-23383
> URL: https://issues.apache.org/jira/browse/SPARK-23383
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Kent Yao
>Priority: Minor
>
> {code:java}
> ./dev/make-distribution.sh --name ne-1.0.0-SNAPSHOT xyz --tgz  -Phadoop-2.7
> +++ dirname ./dev/make-distribution.sh
> ++ cd ./dev/..
> ++ pwd
> + SPARK_HOME=/Users/Kent/Documents/spark
> + DISTDIR=/Users/Kent/Documents/spark/dist
> + MAKE_TGZ=false
> + MAKE_PIP=false
> + MAKE_R=false
> + NAME=none
> + MVN=/Users/Kent/Documents/spark/build/mvn
> + ((  5  ))
> + case $1 in
> + NAME=ne-1.0.0-SNAPSHOT
> + shift
> + shift
> + ((  3  ))
> + case $1 in
> + break
> + '[' -z /Users/Kent/.jenv/candidates/java/current ']'
> + '[' -z /Users/Kent/.jenv/candidates/java/current ']'
> ++ command -v git
> + '[' /usr/local/bin/git ']'
> ++ git rev-parse --short HEAD
> + GITREV=98ea6a7
> + '[' '!' -z 98ea6a7 ']'
> + GITREVSTRING=' (git revision 98ea6a7)'
> + unset GITREV
> ++ command -v /Users/Kent/Documents/spark/build/mvn
> + '[' '!' /Users/Kent/Documents/spark/build/mvn ']'
> ++ /Users/Kent/Documents/spark/build/mvn help:evaluate 
> -Dexpression=project.version xyz --tgz -Phadoop-2.7
> ++ grep -v INFO
> ++ tail -n 1
> + VERSION=' -X,--debug Produce execution debug 
> output'
> {code}
> It would be better to report the invalid options and exit with the usage message.






[jira] [Assigned] (SPARK-23383) make-distribution.sh should exit with usage when detecting wrong options

2018-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23383:


Assignee: Apache Spark

> make-distribution.sh should exit with usage when detecting wrong options
> 
>
> Key: SPARK-23383
> URL: https://issues.apache.org/jira/browse/SPARK-23383
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Minor
>
> {code:java}
> ./dev/make-distribution.sh --name ne-1.0.0-SNAPSHOT xyz --tgz  -Phadoop-2.7
> +++ dirname ./dev/make-distribution.sh
> ++ cd ./dev/..
> ++ pwd
> + SPARK_HOME=/Users/Kent/Documents/spark
> + DISTDIR=/Users/Kent/Documents/spark/dist
> + MAKE_TGZ=false
> + MAKE_PIP=false
> + MAKE_R=false
> + NAME=none
> + MVN=/Users/Kent/Documents/spark/build/mvn
> + ((  5  ))
> + case $1 in
> + NAME=ne-1.0.0-SNAPSHOT
> + shift
> + shift
> + ((  3  ))
> + case $1 in
> + break
> + '[' -z /Users/Kent/.jenv/candidates/java/current ']'
> + '[' -z /Users/Kent/.jenv/candidates/java/current ']'
> ++ command -v git
> + '[' /usr/local/bin/git ']'
> ++ git rev-parse --short HEAD
> + GITREV=98ea6a7
> + '[' '!' -z 98ea6a7 ']'
> + GITREVSTRING=' (git revision 98ea6a7)'
> + unset GITREV
> ++ command -v /Users/Kent/Documents/spark/build/mvn
> + '[' '!' /Users/Kent/Documents/spark/build/mvn ']'
> ++ /Users/Kent/Documents/spark/build/mvn help:evaluate 
> -Dexpression=project.version xyz --tgz -Phadoop-2.7
> ++ grep -v INFO
> ++ tail -n 1
> + VERSION=' -X,--debug Produce execution debug 
> output'
> {code}
> It would be better to report the invalid options and exit with the usage message.






[jira] [Assigned] (SPARK-23383) make-distribution.sh should exit with usage when detecting wrong options

2018-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23383:


Assignee: (was: Apache Spark)

> make-distribution.sh should exit with usage when detecting wrong options
> 
>
> Key: SPARK-23383
> URL: https://issues.apache.org/jira/browse/SPARK-23383
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Kent Yao
>Priority: Minor
>
> {code:java}
> ./dev/make-distribution.sh --name ne-1.0.0-SNAPSHOT xyz --tgz  -Phadoop-2.7
> +++ dirname ./dev/make-distribution.sh
> ++ cd ./dev/..
> ++ pwd
> + SPARK_HOME=/Users/Kent/Documents/spark
> + DISTDIR=/Users/Kent/Documents/spark/dist
> + MAKE_TGZ=false
> + MAKE_PIP=false
> + MAKE_R=false
> + NAME=none
> + MVN=/Users/Kent/Documents/spark/build/mvn
> + ((  5  ))
> + case $1 in
> + NAME=ne-1.0.0-SNAPSHOT
> + shift
> + shift
> + ((  3  ))
> + case $1 in
> + break
> + '[' -z /Users/Kent/.jenv/candidates/java/current ']'
> + '[' -z /Users/Kent/.jenv/candidates/java/current ']'
> ++ command -v git
> + '[' /usr/local/bin/git ']'
> ++ git rev-parse --short HEAD
> + GITREV=98ea6a7
> + '[' '!' -z 98ea6a7 ']'
> + GITREVSTRING=' (git revision 98ea6a7)'
> + unset GITREV
> ++ command -v /Users/Kent/Documents/spark/build/mvn
> + '[' '!' /Users/Kent/Documents/spark/build/mvn ']'
> ++ /Users/Kent/Documents/spark/build/mvn help:evaluate 
> -Dexpression=project.version xyz --tgz -Phadoop-2.7
> ++ grep -v INFO
> ++ tail -n 1
> + VERSION=' -X,--debug Produce execution debug 
> output'
> {code}
> It would be better to report the invalid options and exit with the usage message.






[jira] [Updated] (SPARK-23340) Update ORC to 1.4.3

2018-02-10 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23340:
--
Summary:  Update ORC to 1.4.3  (was:  Update ORC to 1.4.2)

>  Update ORC to 1.4.3
> 
>
> Key: SPARK-23340
> URL: https://issues.apache.org/jira/browse/SPARK-23340
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> ORC 1.4.2 is released on January 23rd. This release removes unnecessary 
> dependencies.






[jira] [Created] (SPARK-23386) Enable direct application links before replay

2018-02-10 Thread Gera Shegalov (JIRA)
Gera Shegalov created SPARK-23386:
-

 Summary: Enable direct application links before replay
 Key: SPARK-23386
 URL: https://issues.apache.org/jira/browse/SPARK-23386
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 2.2.1
Reporter: Gera Shegalov


In a deployment with tens of thousands of large event logs it may take *many 
hours* until all logs are replayed. Most of our users reach the SHS by clicking 
on a link in a client log when an error occurs. Direct links currently don't 
work until the event log is processed in a replay thread. This Jira proposes to 
link the appid to its event log already during the scan, without a full replay. 
This makes on-demand retrievals available almost immediately after the SHS 
starts.






[jira] [Created] (SPARK-23383) make-distribution.sh should exit with usage when detecting wrong options

2018-02-10 Thread Kent Yao (JIRA)
Kent Yao created SPARK-23383:


 Summary: make-distribution.sh should exit with usage when detecting 
wrong options
 Key: SPARK-23383
 URL: https://issues.apache.org/jira/browse/SPARK-23383
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.2.1
Reporter: Kent Yao


```
./dev/make-distribution.sh --name ne-1.0.0-SNAPSHOT xyz --tgz  -Phadoop-2.7
+++ dirname ./dev/make-distribution.sh
++ cd ./dev/..
++ pwd
+ SPARK_HOME=/Users/Kent/Documents/spark
+ DISTDIR=/Users/Kent/Documents/spark/dist
+ MAKE_TGZ=false
+ MAKE_PIP=false
+ MAKE_R=false
+ NAME=none
+ MVN=/Users/Kent/Documents/spark/build/mvn
+ ((  5  ))
+ case $1 in
+ NAME=ne-1.0.0-SNAPSHOT
+ shift
+ shift
+ ((  3  ))
+ case $1 in
+ break
+ '[' -z /Users/Kent/.jenv/candidates/java/current ']'
+ '[' -z /Users/Kent/.jenv/candidates/java/current ']'
++ command -v git
+ '[' /usr/local/bin/git ']'
++ git rev-parse --short HEAD
+ GITREV=98ea6a7
+ '[' '!' -z 98ea6a7 ']'
+ GITREVSTRING=' (git revision 98ea6a7)'
+ unset GITREV
++ command -v /Users/Kent/Documents/spark/build/mvn
+ '[' '!' /Users/Kent/Documents/spark/build/mvn ']'
++ /Users/Kent/Documents/spark/build/mvn help:evaluate 
-Dexpression=project.version xyz --tgz -Phadoop-2.7
++ grep -v INFO
++ tail -n 1
+ VERSION=' -X,--debug Produce execution debug 
output'
```

It would be better to report the invalid options and exit with the usage message.






[jira] [Assigned] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2018-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17147:


Assignee: Apache Spark

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets 
> (i.e. Log Compaction)
> --
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>Assignee: Apache Spark
>Priority: Major
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset 
> will always be just an increment of 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.
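
A minimal sketch of the workaround described above (the helper name is 
illustrative; this is not the actual KafkaRDD/CachedKafkaConsumer code):

{code:scala}
// Advance to the offset after the record actually returned, instead of assuming
// the next offset is always the previous offset + 1; this tolerates the gaps
// left by log compaction.
import org.apache.kafka.clients.consumer.ConsumerRecord

def nextRequestOffset[K, V](record: ConsumerRecord[K, V]): Long =
  record.offset() + 1
{code}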






[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2018-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359786#comment-16359786
 ] 

Apache Spark commented on SPARK-17147:
--

User 'koeninger' has created a pull request for this issue:
https://github.com/apache/spark/pull/20572

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets 
> (i.e. Log Compaction)
> --
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>Priority: Major
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset 
> will always be just an increment of 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.






[jira] [Assigned] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2018-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17147:


Assignee: (was: Apache Spark)

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets 
> (i.e. Log Compaction)
> --
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>Priority: Major
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset 
> will always be just an increment of 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.






[jira] [Commented] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale

2018-02-10 Thread Harleen Singh Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359788#comment-16359788
 ] 

Harleen Singh Mann commented on SPARK-23370:


This is as far as I understand it:
 * JDBC driver: once we create the result set object using the JDBC driver, it 
will contain all the actual data as well as the metadata for the concerned DB 
table.
 * Query an additional table (all_tab_columns): this would entail creating 
another result set that captures the metadata for the concerned DB table as 
data (rows). Overhead:
 ** Connection: none, since it will use pooling.
 ** Retrieving the result: low impact, since we will push down the predicate to 
the DB to filter data only for the concerned table.

I believe the "all_tab_columns" table should be queried on the driver and the 
result broadcast to the executors. Does this make sense?

Can we get some input from someone else as well?
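
For illustration, a sketch of how the precision and scale could be read from 
ALL_TAB_COLUMNS on the driver (connection handling is omitted and nothing here 
is existing Spark code; note that a NULL DATA_PRECISION comes back as 0 through 
getInt, which is exactly the case the dialect would need to handle):

{code:scala}
import java.sql.Connection
import scala.collection.mutable

// Returns columnName -> (precision, scale) for one table, queried on the driver.
def numberPrecisionScale(conn: Connection, table: String): Map[String, (Int, Int)] = {
  val stmt = conn.prepareStatement(
    "SELECT column_name, data_precision, data_scale FROM all_tab_columns WHERE table_name = ?")
  stmt.setString(1, table)
  val rs = stmt.executeQuery()
  val result = mutable.Map[String, (Int, Int)]()
  while (rs.next()) {
    result(rs.getString(1)) = (rs.getInt(2), rs.getInt(3))
  }
  rs.close(); stmt.close()
  result.toMap
}
{code}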

> Spark receives a size of 0 for an Oracle Number field and defaults the field 
> type to be BigDecimal(30,10) instead of the actual precision and scale
> ---
>
> Key: SPARK-23370
> URL: https://issues.apache.org/jira/browse/SPARK-23370
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
> Environment: Spark 2.2
> Oracle 11g
> JDBC ojdbc6.jar
>Reporter: Harleen Singh Mann
>Priority: Minor
> Attachments: Oracle KB Document 1266785.pdf
>
>
> Currently, on a JDBC read Spark obtains the schema of a table using 
> {color:#654982} resultSet.getMetaData.getColumnType{color}
> This works 99.99% of the time, except when a column of Number type is added 
> to an Oracle table using an alter statement. This is essentially an Oracle 
> DB + JDBC bug that has been documented in the Oracle KB, and patches exist. 
> [oracle 
> KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html]
> {color:#ff}As a result of the above-mentioned issue, Spark receives a 
> size of 0 for the field and defaults the field type to be BigDecimal(30,10) 
> instead of what it actually should be. This is done in OracleDialect.scala. 
> This may cause issues in the downstream application, where relevant 
> information may be missed due to the changed precision and scale.{color}
> _The versions that are affected are:_ 
>  _JDBC - Version: 11.2.0.1 and later   [Release: 11.2 and later ]_
>  _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_  
> _[Release: 11.1 to 11.2]_ 
> +Proposed approach:+
> There is another way of fetching the schema information in Oracle: through 
> the all_tab_columns table. If we use this table to fetch the precision and 
> scale of the Number type, the above issue is mitigated.
>  
> {color:#14892c}{color:#f6c342}I can implement the changes, but require some 
> inputs on the approach from the gatekeepers here{color}.{color}
>  {color:#14892c}PS. This is also my first Jira issue and my first fork of 
> Spark, so I will need some guidance along the way. (Yes, I am a newbie to 
> this.) Thanks...{color}






[jira] [Commented] (SPARK-10697) Lift Calculation in Association Rule mining

2018-02-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359639#comment-16359639
 ] 

Sean Owen commented on SPARK-10697:
---

Yes, I think it's OK to add. Go ahead and propose a PR.

Lift is confidence, normalized by the prior probability of observing the 
consequent at all. Yes, it is the right tool when evaluating rules against each 
other for interestingness. It's a likelihood ratio.

Confidence is of interest when you know you have the antecedent (e.g. already 
added those items to a basket) and want to know about consequents. There the 
prior probability would be irrelevant.

You can compute lift from confidence, but it's extra work, so it does make some 
sense to compute it along the way.
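
For reference, the standard definition behind this, stated as a formula:

{code}
\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{confidence}(X \Rightarrow Y)}{\mathrm{support}(Y)} = \frac{P(X \cap Y)}{P(X)\,P(Y)}
{code}

A lift greater than 1 means the antecedent and consequent co-occur more often 
than they would if they were independent.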

> Lift Calculation in Association Rule mining
> ---
>
> Key: SPARK-10697
> URL: https://issues.apache.org/jira/browse/SPARK-10697
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Yashwanth Kumar
>Priority: Minor
>
> Lift is to be calculated for Association rule mining in 
> AssociationRules.scala under FPM.
> Lift is a measure of the performance of a  Association rules.
> Adding lift will help to compare the model efficiency.






[jira] [Created] (SPARK-23384) When no incomplete (or completed) applications are found, the last updated time is not formatted and the client's local time zone is not shown in the History Server web UI.

2018-02-10 Thread guoxiaolongzte (JIRA)
guoxiaolongzte created SPARK-23384:
--

 Summary: When no incomplete (or completed) applications are found, the 
last updated time is not formatted and the client's local time zone is not 
shown in the History Server web UI.
 Key: SPARK-23384
 URL: https://issues.apache.org/jira/browse/SPARK-23384
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.4.0
Reporter: guoxiaolongzte


When no incomplete (or completed) applications are found, the last updated time 
is not formatted and the client's local time zone is not shown in the History 
Server web UI. It is a bug.

fix before:

 

fix after:

 






[jira] [Updated] (SPARK-23384) When no incomplete (or completed) applications are found, the last updated time is not formatted and the client's local time zone is not shown in the History Server web UI.

2018-02-10 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-23384:
---
Attachment: 2.png
1.png

> When no incomplete (or completed) applications are found, the last updated 
> time is not formatted and the client's local time zone is not shown in the 
> History Server web UI.
> 
>
> Key: SPARK-23384
> URL: https://issues.apache.org/jira/browse/SPARK-23384
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> When no incomplete (or completed) applications are found, the last updated 
> time is not formatted and the client's local time zone is not shown in the 
> History Server web UI. It is a bug.
> fix before:
>  
> fix after:
>  






[jira] [Updated] (SPARK-23384) When no incomplete (or completed) applications are found, the last updated time is not formatted and the client's local time zone is not shown in the History Server web UI.

2018-02-10 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-23384:
---
Description: 
When no incomplete (or completed) applications are found, the last updated time 
is not formatted and the client's local time zone is not shown in the History 
Server web UI. It is a bug.

fix before: !1.png!

fix after:

!2.png!

 

  was:
When no incomplete (or completed) applications are found, the last updated time 
is not formatted and the client's local time zone is not shown in the History 
Server web UI. It is a bug.

fix before:

 

fix after:

 


> When no incomplete (or completed) applications are found, the last updated 
> time is not formatted and the client's local time zone is not shown in the 
> History Server web UI.
> 
>
> Key: SPARK-23384
> URL: https://issues.apache.org/jira/browse/SPARK-23384
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> When no incomplete (or completed) applications are found, the last updated 
> time is not formatted and the client's local time zone is not shown in the 
> History Server web UI. It is a bug.
> fix before: !1.png!
> fix after:
> !2.png!
>  






[jira] [Assigned] (SPARK-23384) When no incomplete (or completed) applications are found, the last updated time is not formatted and the client's local time zone is not shown in the History Server web UI.

2018-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23384:


Assignee: Apache Spark

> When no incomplete (or completed) applications are found, the last updated 
> time is not formatted and the client's local time zone is not shown in the 
> History Server web UI.
> 
>
> Key: SPARK-23384
> URL: https://issues.apache.org/jira/browse/SPARK-23384
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Assignee: Apache Spark
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> When no incomplete (or completed) applications are found, the last updated 
> time is not formatted and the client's local time zone is not shown in the 
> History Server web UI. It is a bug.
> fix before: !1.png!
> fix after:
> !2.png!
>  






[jira] [Commented] (SPARK-23384) When no incomplete (or completed) applications are found, the last updated time is not formatted and the client's local time zone is not shown in the History Server web UI.

2018-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359820#comment-16359820
 ] 

Apache Spark commented on SPARK-23384:
--

User 'guoxiaolongzte' has created a pull request for this issue:
https://github.com/apache/spark/pull/20573

> When no incomplete (or completed) applications are found, the last updated 
> time is not formatted and the client's local time zone is not shown in the 
> History Server web UI.
> 
>
> Key: SPARK-23384
> URL: https://issues.apache.org/jira/browse/SPARK-23384
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> When no incomplete (or completed) applications are found, the last updated 
> time is not formatted and the client's local time zone is not shown in the 
> History Server web UI. It is a bug.
> fix before: !1.png!
> fix after:
> !2.png!
>  






[jira] [Assigned] (SPARK-23384) When no incomplete (or completed) applications are found, the last updated time is not formatted and the client's local time zone is not shown in the History Server web UI.

2018-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23384:


Assignee: (was: Apache Spark)

> When no incomplete (or completed) applications are found, the last updated 
> time is not formatted and the client's local time zone is not shown in the 
> History Server web UI.
> 
>
> Key: SPARK-23384
> URL: https://issues.apache.org/jira/browse/SPARK-23384
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> When no incomplete (or completed) applications are found, the last updated 
> time is not formatted and the client's local time zone is not shown in the 
> History Server web UI. It is a bug.
> fix before: !1.png!
> fix after:
> !2.png!
>  






[jira] [Created] (SPARK-23380) Make toPandas fall back to Arrow optimization disabled when schema is mismatched

2018-02-10 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-23380:


 Summary: Make toPandas fall back to Arrow optimization disabled 
when schema is mismatched
 Key: SPARK-23380
 URL: https://issues.apache.org/jira/browse/SPARK-23380
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon


Seems we can check the schema ahead and fall back in toPandas.

Please see this case below:

{code}
df = spark.createDataFrame([[{'a': 1}]])

spark.conf.set("spark.sql.execution.arrow.enabled", "false")
df.toPandas()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df.toPandas()
{code}

{code}
...
py4j.protocol.Py4JJavaError: An error occurred while calling 
o42.collectAsArrowToPython.
...
java.lang.UnsupportedOperationException: Unsupported data type: 
map
{code}

In the case of {{createDataFrame}}, we fall back so that this at least works even 
though the optimisation is disabled.

{code}
df = spark.createDataFrame([[{'a': 1}]])
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
pdf = df.toPandas()
spark.createDataFrame(pdf).show()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.createDataFrame(pdf).show()
{code}

{code}
...
... UserWarning: Arrow will not be used in createDataFrame: Error inferring 
Arrow type ...
++
|  _1|
++
|[a -> 1]|
++
{code}







[jira] [Assigned] (SPARK-23380) Make toPandas fall back to Arrow optimization disabled when schema is not supported in the Arrow optimization

2018-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23380:


Assignee: Apache Spark

> Make toPandas fall back to Arrow optimization disabled when schema is not 
> supported in the Arrow optimization 
> --
>
> Key: SPARK-23380
> URL: https://issues.apache.org/jira/browse/SPARK-23380
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Seems we can check the schema ahead and fall back in toPandas.
> Please see this case below:
> {code}
> df = spark.createDataFrame([[{'a': 1}]])
> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
> df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> df.toPandas()
> {code}
> {code}
> ...
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o42.collectAsArrowToPython.
> ...
> java.lang.UnsupportedOperationException: Unsupported data type: 
> map
> {code}
> In the case of {{createDataFrame}}, we already fall back so that this at least 
> works even though the optimisation is disabled.
> {code}
> df = spark.createDataFrame([[{'a': 1}]])
> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
> pdf = df.toPandas()
> spark.createDataFrame(pdf).show()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> spark.createDataFrame(pdf).show()
> {code}
> {code}
> ...
> ... UserWarning: Arrow will not be used in createDataFrame: Error inferring 
> Arrow type ...
> ++
> |  _1|
> ++
> |[a -> 1]|
> ++
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23380) Make toPandas fall back to Arrow optimization disabled when schema is not supported in the Arrow optimization

2018-02-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23380:
-
Summary: Make toPandas fall back to Arrow optimization disabled when schema 
is not supported in the Arrow optimization   (was: Make toPandas fall back to 
Arrow optimization disabled when schema is mismatched)

> Make toPandas fall back to Arrow optimization disabled when schema is not 
> supported in the Arrow optimization 
> --
>
> Key: SPARK-23380
> URL: https://issues.apache.org/jira/browse/SPARK-23380
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Seems we can check the schema ahead and fall back in toPandas.
> Please see this case below:
> {code}
> df = spark.createDataFrame([[{'a': 1}]])
> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
> df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> df.toPandas()
> {code}
> {code}
> ...
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o42.collectAsArrowToPython.
> ...
> java.lang.UnsupportedOperationException: Unsupported data type: 
> map
> {code}
> In the case of {{createDataFrame}}, we already fall back so that this at least 
> works even though the optimisation is disabled.
> {code}
> df = spark.createDataFrame([[{'a': 1}]])
> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
> pdf = df.toPandas()
> spark.createDataFrame(pdf).show()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> spark.createDataFrame(pdf).show()
> {code}
> {code}
> ...
> ... UserWarning: Arrow will not be used in createDataFrame: Error inferring 
> Arrow type ...
> ++
> |  _1|
> ++
> |[a -> 1]|
> ++
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23377:


Assignee: Apache Spark

> Bucketizer with multiple columns persistence bug
> 
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Assignee: Apache Spark
>Priority: Major
>
> A Bucketizer with multiple input/output columns gets "inputCol" set to the 
> default value on a write -> read round trip, which causes it to throw an error 
> on transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, 
> Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has 
> the inputCols Param set for multi-column transform. The following Params are 
> not applicable and should not be set: outputCol.
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at 
> line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359340#comment-16359340
 ] 

Apache Spark commented on SPARK-23377:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/20566

> Bucketizer with multiple columns persistence bug
> 
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> A Bucketizer with multiple input/output columns gets "inputCol" set to the 
> default value on a write -> read round trip, which causes it to throw an error 
> on transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, 
> Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has 
> the inputCols Param set for multi-column transform. The following Params are 
> not applicable and should not be set: outputCol.
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at 
> line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23377:


Assignee: (was: Apache Spark)

> Bucketizer with multiple columns persistence bug
> 
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> A Bucketizer with multiple input/output columns gets "inputCol" set to the 
> default value on a write -> read round trip, which causes it to throw an error 
> on transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, 
> Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has 
> the inputCols Param set for multi-column transform. The following Params are 
> not applicable and should not be set: outputCol.
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at 
> line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23380) Make toPandas fall back to Arrow optimization disabled when schema is not supported in the Arrow optimization

2018-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23380:


Assignee: (was: Apache Spark)

> Make toPandas fall back to Arrow optimization disabled when schema is not 
> supported in the Arrow optimization 
> --
>
> Key: SPARK-23380
> URL: https://issues.apache.org/jira/browse/SPARK-23380
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Seems we can check the schema ahead and fall back in toPandas.
> Please see this case below:
> {code}
> df = spark.createDataFrame([[{'a': 1}]])
> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
> df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> df.toPandas()
> {code}
> {code}
> ...
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o42.collectAsArrowToPython.
> ...
> java.lang.UnsupportedOperationException: Unsupported data type: 
> map
> {code}
> In the case of {{createDataFrame}}, we already fall back so that this at least 
> works even though the optimisation is disabled.
> {code}
> df = spark.createDataFrame([[{'a': 1}]])
> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
> pdf = df.toPandas()
> spark.createDataFrame(pdf).show()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> spark.createDataFrame(pdf).show()
> {code}
> {code}
> ...
> ... UserWarning: Arrow will not be used in createDataFrame: Error inferring 
> Arrow type ...
> ++
> |  _1|
> ++
> |[a -> 1]|
> ++
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23380) Make toPandas fall back to Arrow optimization disabled when schema is not supported in the Arrow optimization

2018-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359349#comment-16359349
 ] 

Apache Spark commented on SPARK-23380:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/20567

> Make toPandas fall back to Arrow optimization disabled when schema is not 
> supported in the Arrow optimization 
> --
>
> Key: SPARK-23380
> URL: https://issues.apache.org/jira/browse/SPARK-23380
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Seems we can check the schema ahead and fall back in toPandas.
> Please see this case below:
> {code}
> df = spark.createDataFrame([[{'a': 1}]])
> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
> df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> df.toPandas()
> {code}
> {code}
> ...
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o42.collectAsArrowToPython.
> ...
> java.lang.UnsupportedOperationException: Unsupported data type: 
> map
> {code}
> In the case of {{createDataFrame}}, we already fall back so that this at least 
> works even though the optimisation is disabled.
> {code}
> df = spark.createDataFrame([[{'a': 1}]])
> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
> pdf = df.toPandas()
> spark.createDataFrame(pdf).show()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> spark.createDataFrame(pdf).show()
> {code}
> {code}
> ...
> ... UserWarning: Arrow will not be used in createDataFrame: Error inferring 
> Arrow type ...
> ++
> |  _1|
> ++
> |[a -> 1]|
> ++
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23310) Perf regression introduced by SPARK-21113

2018-02-10 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359429#comment-16359429
 ] 

Kazuaki Ishizaki commented on SPARK-23310:
--

got it, thanks

> Perf regression introduced by SPARK-21113
> -
>
> Key: SPARK-23310
> URL: https://issues.apache.org/jira/browse/SPARK-23310
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Yin Huai
>Assignee: Sital Kedia
>Priority: Blocker
> Fix For: 2.3.0
>
>
> While running all TPC-DS queries with SF set to 1000, we noticed that Q95 
> (https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q95.sql)
>  has noticeable regression (11%). After looking into it, we found that the 
> regression was introduced by SPARK-21113. Specially, ReadAheadInputStream 
> gets lock congestion. After setting 
> spark.unsafe.sorter.spill.read.ahead.enabled to false, the regression 
> disappears and the overall performance of all TPC-DS queries improves.
>  
> I am proposing that we set spark.unsafe.sorter.spill.read.ahead.enabled to 
> false by default for Spark 2.3 and re-enable it after addressing the lock 
> congestion issue. 
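
Until the default changes, the flag named in the report can be set explicitly
as a workaround, e.g.:

{code:python}
# Workaround sketch: disable the spill read-ahead buffer at session start-up.
# The config name is taken from the report above; "false" is the proposed default.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.unsafe.sorter.spill.read.ahead.enabled", "false")
         .getOrCreate())
{code}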



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23381) Murmur3 hash generates a different value from other implementations

2018-02-10 Thread Shintaro Murakami (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shintaro Murakami updated SPARK-23381:
--
Summary: Murmur3 hash generates a different value from other 
implementations  (was: Murmur3 hash generates a different value )

> Murmur3 hash generates a different value from other implementations
> ---
>
> Key: SPARK-23381
> URL: https://issues.apache.org/jira/browse/SPARK-23381
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Shintaro Murakami
>Priority: Major
>
> Murmur3 hash generates a different value from the original and other 
> implementations (such as the Scala standard library and Guava) when the length 
> of a byte array is not a multiple of 4.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23381) Murmur3 hash generates a different value

2018-02-10 Thread Shintaro Murakami (JIRA)
Shintaro Murakami created SPARK-23381:
-

 Summary: Murmur3 hash generates a different value 
 Key: SPARK-23381
 URL: https://issues.apache.org/jira/browse/SPARK-23381
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Shintaro Murakami


Murmur3 hash generates a different value from the original and other 
implementations (such as the Scala standard library and Guava) when the length 
of a byte array is not a multiple of 4.
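
A small reproduction sketch, assuming the third-party {{mmh3}} package as the
reference implementation (Spark's column-level {{hash}} function applies
Murmur3 with seed 42 to the UTF-8 bytes of a string):

{code:python}
# Sketch only: hash a 3-byte string (length not a multiple of 4) with Spark's
# Murmur3-based hash() and with a reference implementation. Per this report,
# the two values are expected to differ.
import mmh3  # assumed installed, e.g. via pip
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

word = "abc"  # UTF-8 length 3, not a multiple of 4
spark_hash = spark.range(1).select(F.hash(F.lit(word)).alias("h")).first()["h"]
reference_hash = mmh3.hash(word.encode("utf-8"), seed=42)

print(spark_hash, reference_hash)
{code}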



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23381) Murmur3 hash generates a different value from other implementations

2018-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23381:


Assignee: (was: Apache Spark)

> Murmur3 hash generates a different value from other implementations
> ---
>
> Key: SPARK-23381
> URL: https://issues.apache.org/jira/browse/SPARK-23381
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Shintaro Murakami
>Priority: Major
>
> Murmur3 hash generates a different value from the original and other 
> implementations (such as the Scala standard library and Guava) when the length 
> of a byte array is not a multiple of 4.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23381) Murmur3 hash generates a different value from other implementations

2018-02-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23381:


Assignee: Apache Spark

> Murmur3 hash generates a different value from other implementations
> ---
>
> Key: SPARK-23381
> URL: https://issues.apache.org/jira/browse/SPARK-23381
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Shintaro Murakami
>Assignee: Apache Spark
>Priority: Major
>
> Murmur3 hash generates a different value from the original and other 
> implementations (such as the Scala standard library and Guava) when the length 
> of a byte array is not a multiple of 4.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23381) Murmur3 hash generates a different value from other implementations

2018-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359445#comment-16359445
 ] 

Apache Spark commented on SPARK-23381:
--

User 'mrkm4ntr' has created a pull request for this issue:
https://github.com/apache/spark/pull/20568

> Murmur3 hash generates a different value from other implementations
> ---
>
> Key: SPARK-23381
> URL: https://issues.apache.org/jira/browse/SPARK-23381
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Shintaro Murakami
>Priority: Major
>
> Murmur3 hash generates a different value from the original and other 
> implementations (such as the Scala standard library and Guava) when the length 
> of a byte array is not a multiple of 4.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23240) PythonWorkerFactory issues unhelpful message when pyspark.daemon produces bogus stdout

2018-02-10 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359568#comment-16359568
 ] 

Bruce Robbins commented on SPARK-23240:
---

A little background. A Spark installation had a Python sitecustomize.py like 
this:

 
{code:python}
try:
    import flotilla
except ImportError as e:
    print e
{code}
 

(flotilla is not the real python module, I just use that as an example).

Because flotilla was not installed on the user's cluster, the first output in 
daemon's stdout was:

 
{noformat}
No module named flotilla{noformat}
 

In fact, this is what I get when I run pyspark.daemon with this 
sitecustomize.py installed:
{noformat}
bash-3.2$ python -m pyspark.daemon
python -m pyspark.daemon
No module named flotilla
^@^@\325{noformat}
Therefore, PythonWorkerFactory.startDaemon reads 'No m', or 0x4e6f206d or 
1315905645, as the port number.
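
The arithmetic is easy to verify: the first four bytes of that output, read as
a big-endian int the way the daemon port is read from stdout, give exactly that
bogus port number.

{code:python}
# Quick check of the numbers above.
import struct

port = struct.unpack(">i", b"No m")[0]
print(port, hex(port))  # 1315905645 0x4e6f206d
{code}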

Here's what happens when I run a pyspark action with the above sitecustomize.py 
installed:
{noformat}
>>> text_file = sc.textFile("/Users/bruce/ncdc_gsod").count()
odule named flotilla
18/02/10 09:44:27 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.IllegalArgumentException: port out of range:1315905645
 at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
 at java.net.InetSocketAddress.(InetSocketAddress.java:188)
 at java.net.Socket.(Socket.java:244){noformat}




 

> PythonWorkerFactory issues unhelpful message when pyspark.daemon produces 
> bogus stdout
> --
>
> Key: SPARK-23240
> URL: https://issues.apache.org/jira/browse/SPARK-23240
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Bruce Robbins
>Priority: Minor
>
> Environmental issues or site-local customizations (e.g., a sitecustomize.py 
> present in the python install directory) can interfere with daemon.py’s 
> output to stdout. PythonWorkerFactory produces unhelpful messages when this 
> happens, causing some head scratching before the actual issue is determined.
> Case #1: Extraneous data in pyspark.daemon’s stdout. In this case, 
> PythonWorkerFactory uses the output as the daemon’s port number and ends up 
> throwing an exception when creating the socket:
> {noformat}
> java.lang.IllegalArgumentException: port out of range:1819239265
>   at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
>   at java.net.InetSocketAddress.(InetSocketAddress.java:188)
>   at java.net.Socket.(Socket.java:244)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:78)
> {noformat}
> Case #2: No data in pyspark.daemon’s stdout. In this case, 
> PythonWorkerFactory throws an EOFException while reading from the Process 
> input stream.
> The second case is somewhat less mysterious than the first, because 
> PythonWorkerFactory also displays the stderr from the python process.
> When there is unexpected or missing output in pyspark.daemon’s stdout, 
> PythonWorkerFactory should say so.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22105) Dataframe has poor performance when computing on many columns with codegen

2018-02-10 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359420#comment-16359420
 ] 

Marco Gaido commented on SPARK-22105:
-

[~WeichenXu123] what is the number of rows in the dataset you tested? Maybe 
the time for generating/compiling the code is a significant overhead if we 
have little data.


> Dataframe has poor performance when computing on many columns with codegen
> --
>
> Key: SPARK-22105
> URL: https://issues.apache.org/jira/browse/SPARK-22105
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Priority: Minor
>
> Suppose we have a dataframe with many columns (e.g. 100 columns), each column 
> of DoubleType.
> We need to compute avg on each column, and we find that using dataframe avg 
> is much slower than using RDD.aggregate.
> I observed this issue in this PR (one-pass imputer):
> https://github.com/apache/spark/pull/18902
> I also wrote minimal test code to reproduce this issue, using `sum`:
> https://github.com/apache/spark/compare/master...WeichenXu123:aggr_test2?expand=1
> When we compute `sum` on 100 `DoubleType` columns, the dataframe version is 
> about 3x slower than `RDD.aggregate`, but if we only compute one column, the 
> dataframe version is much faster than `RDD.aggregate`.
> The likely reason is a defect in dataframe codegen: codegen inlines everything 
> and generates one large code block. When the column count is large (e.g. 100 
> columns), the generated code becomes too big, which causes the JVM to fail to 
> JIT-compile it and fall back to bytecode interpretation.
> This PR should address the issue:
> https://github.com/apache/spark/pull/19082
> But we need more performance tests against some ML code after the above PR is 
> merged, to check whether this issue is actually fixed.
> This JIRA is used to track this performance issue.
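
For reference, a sketch of the two code paths being compared (sizes and names
are illustrative; the 3x figure above was measured with the Scala API, so this
only shows the shape of the comparison, not the timings):

{code:python}
# Sketch only: wide aggregate over many DoubleType columns via the DataFrame
# API (whole-stage codegen) versus a single-pass RDD.aggregate.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
n_cols, n_rows = 100, 100000  # illustrative sizes

df = spark.range(n_rows).select(
    *[F.rand(seed=i).alias("c%d" % i) for i in range(n_cols)]).cache()
df.count()

# DataFrame path: one generated-code aggregate across all columns.
df.agg(*[F.sum("c%d" % i) for i in range(n_cols)]).collect()

# RDD path: accumulate per-column sums in a single pass.
sums = df.rdd.aggregate(
    [0.0] * n_cols,
    lambda acc, row: [a + row[i] for i, a in enumerate(acc)],
    lambda a, b: [x + y for x, y in zip(a, b)])
{code}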



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23374) Checkstyle/Scalastyle only work from top level build

2018-02-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23374.
---
Resolution: Not A Problem

Agree, that's the right way to invoke a target on one module in a multi-module 
project

> Checkstyle/Scalastyle only work from top level build
> 
>
> Key: SPARK-23374
> URL: https://issues.apache.org/jira/browse/SPARK-23374
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Rob Vesse
>Priority: Trivial
>
> The current Maven plugin definitions for Checkstyle/Scalastyle use fixed XML 
> configs for the style rule locations that are only valid relative to the top 
> level POM.  Therefore if you try and do a {{mvn verify}} in an individual 
> module you get the following error:
> {noformat}
> [ERROR] Failed to execute goal 
> org.scalastyle:scalastyle-maven-plugin:1.0.0:check (default) on project 
> spark-mesos_2.11: Failed during scalastyle execution: Unable to find 
> configuration file at location scalastyle-config.xml
> {noformat}
> As the paths are hardcoded in XML and don't use Maven properties, you can't 
> override these settings, so you can't style-check a single module; style 
> checking therefore requires a full-project {{mvn verify}}, which is not ideal.
> By introducing Maven properties for these two paths it would become possible 
> to run checks on a single module like so:
> {noformat}
> mvn verify -Dscalastyle.location=../scalastyle-config.xml
> {noformat}
> Obviously the override would need to vary depending on the specific module 
> you are trying to run it against, but this would be a relatively simple change 
> that would streamline dev workflows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23381) Murmur3 hash generates a different value from other implementations

2018-02-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359501#comment-16359501
 ] 

Sean Owen commented on SPARK-23381:
---

... what problem does this cause?

> Murmur3 hash generates a different value from other implementations
> ---
>
> Key: SPARK-23381
> URL: https://issues.apache.org/jira/browse/SPARK-23381
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Shintaro Murakami
>Priority: Major
>
> Murmur3 hash generates a different value from the original and other 
> implementations (such as the Scala standard library and Guava) when the length 
> of a byte array is not a multiple of 4.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23344) Add KMeans distanceMeasure param to PySpark

2018-02-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23344.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20520
[https://github.com/apache/spark/pull/20520]

> Add KMeans distanceMeasure param to PySpark
> ---
>
> Key: SPARK-23344
> URL: https://issues.apache.org/jira/browse/SPARK-23344
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> SPARK-22119 introduced a new parameter for KMeans, i.e. {{distanceMeasure}}. 
> We should also add it to the Python interface.
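
With this in place, the Python-side usage would presumably mirror the Scala API
along these lines (parameter value from SPARK-22119):

{code:python}
# Hedged usage sketch of the new parameter in PySpark; "euclidean" stays the
# default, "cosine" is the alternative added by SPARK-22119.
from pyspark.ml.clustering import KMeans

kmeans = KMeans(k=3, distanceMeasure="cosine")
{code}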



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23344) Add KMeans distanceMeasure param to PySpark

2018-02-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23344:
-

Assignee: Marco Gaido

> Add KMeans distanceMeasure param to PySpark
> ---
>
> Key: SPARK-23344
> URL: https://issues.apache.org/jira/browse/SPARK-23344
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> SPARK-22119 introduced a new parameter for KMeans, i.e. {{distanceMeasure}}. 
> We should also add it to the Python interface.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23360) SparkSession.createDataFrame timestamps can be incorrect with non-Arrow codepath

2018-02-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23360.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20559
[https://github.com/apache/spark/pull/20559]

> SparkSession.createDataFrame timestamps can be incorrect with non-Arrow 
> codepath
> 
>
> Key: SPARK-23360
> URL: https://issues.apache.org/jira/browse/SPARK-23360
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Major
> Fix For: 2.3.0
>
>
> {code:java}
> import datetime
> import pandas as pd
> import os
> dt = [datetime.datetime(2015, 10, 31, 22, 30)]
> pdf = pd.DataFrame({'time': dt})
> os.environ['TZ'] = 'America/New_York'
> df1 = spark.createDataFrame(pdf)
> df1.show()
> +---+
> |   time|
> +---+
> |2015-10-31 21:30:00|
> +---+
> {code}
> Seems to be related to this line here:
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1776]
> It appears to be an issue with "tzlocal()"
> Wrong:
> {code:java}
> from_tz = "America/New_York"
> to_tz = "tzlocal()"
> s.apply(lambda ts:  
> ts.tz_localize(from_tz,ambiguous=False).tz_convert(to_tz).tz_localize(None)
> if ts is not pd.NaT else pd.NaT)
> 0   2015-10-31 21:30:00
> Name: time, dtype: datetime64[ns]
> {code}
> Correct:
> {code:java}
> from_tz = "America/New_York"
> to_tz = "America/New_York"
> s.apply(
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
> if ts is not pd.NaT else pd.NaT)
> 0   2015-10-31 22:30:00
> Name: time, dtype: datetime64[ns]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23360) SparkSession.createDataFrame timestamps can be incorrect with non-Arrow codepath

2018-02-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-23360:


Assignee: Takuya Ueshin

> SparkSession.createDataFrame timestamps can be incorrect with non-Arrow 
> codepath
> 
>
> Key: SPARK-23360
> URL: https://issues.apache.org/jira/browse/SPARK-23360
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.3.0
>
>
> {code:java}
> import datetime
> import pandas as pd
> import os
> dt = [datetime.datetime(2015, 10, 31, 22, 30)]
> pdf = pd.DataFrame({'time': dt})
> os.environ['TZ'] = 'America/New_York'
> df1 = spark.createDataFrame(pdf)
> df1.show()
> +---+
> |   time|
> +---+
> |2015-10-31 21:30:00|
> +---+
> {code}
> Seems to be related to this line here:
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1776]
> It appears to be an issue with "tzlocal()"
> Wrong:
> {code:java}
> from_tz = "America/New_York"
> to_tz = "tzlocal()"
> s.apply(lambda ts:  
> ts.tz_localize(from_tz,ambiguous=False).tz_convert(to_tz).tz_localize(None)
> if ts is not pd.NaT else pd.NaT)
> 0   2015-10-31 21:30:00
> Name: time, dtype: datetime64[ns]
> {code}
> Correct:
> {code:java}
> from_tz = "America/New_York"
> to_tz = "America/New_York"
> s.apply(
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
> if ts is not pd.NaT else pd.NaT)
> 0   2015-10-31 22:30:00
> Name: time, dtype: datetime64[ns]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale

2018-02-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359500#comment-16359500
 ] 

Sean Owen commented on SPARK-23370:
---

Overhead for the applications that will use Oracle from Spark. You're proposing 
making all Oracle connections query a table for schema instead of getting it 
the usual way from the JDBC driver. What's the downside?

> Spark receives a size of 0 for an Oracle Number field and defaults the field 
> type to be BigDecimal(30,10) instead of the actual precision and scale
> ---
>
> Key: SPARK-23370
> URL: https://issues.apache.org/jira/browse/SPARK-23370
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
> Environment: Spark 2.2
> Oracle 11g
> JDBC ojdbc6.jar
>Reporter: Harleen Singh Mann
>Priority: Minor
> Attachments: Oracle KB Document 1266785.pdf
>
>
> Currently, on JDBC read, Spark obtains the schema of a table using 
> {color:#654982}resultSet.getMetaData.getColumnType{color}.
> This works 99.99% of the time, except when a Number column is added to an 
> Oracle table using an ALTER statement. This is essentially an Oracle 
> DB + JDBC bug that is documented in the Oracle KB, and patches exist. 
> [oracle 
> KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html]
> {color:#ff}As a result of the above mentioned issue, Spark receives a 
> size of 0 for the field and defaults the field type to be BigDecimal(30,10) 
> instead of what it actually should be. This is done in OracleDialect.scala. 
> This may cause issues in the downstream application where relevant 
> information may be missed due to the changed precision and scale.{color}
> _The versions that are affected are:_ 
>  _JDBC - Version: 11.2.0.1 and later   [Release: 11.2 and later ]_
>  _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_  
> _[Release: 11.1 to 11.2]_ 
> +Proposed approach:+
> There is another way of fetching the schema information in Oracle: Which is 
> through the all_tab_columns table. If we use this table to fetch the 
> precision and scale of the Number type, the above issue is mitigated.
>  
> {color:#14892c}{color:#f6c342}I can implement the changes, but require some 
> inputs on the approach from the gatekeepers here{color}.{color}
>  {color:#14892c}PS. This is also my first Jira issue and my first fork for 
> Spark, so I will need some guidance along the way. (Yes, I am a newbie to 
> this) Thanks...{color}
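
For illustration, the proposed lookup could be prototyped from the Spark side
roughly like this (URL, credentials and table name are placeholders, and the
Oracle JDBC driver is assumed to be on the classpath):

{code:python}
# Sketch only: read precision/scale for NUMBER columns from Oracle's
# all_tab_columns dictionary view instead of relying on JDBC result-set metadata.
cols = (spark.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
        .option("user", "someuser")
        .option("password", "somepassword")
        .option("dbtable", """(SELECT column_name, data_precision, data_scale
                                 FROM all_tab_columns
                                WHERE table_name = 'MY_TABLE') cols""")
        .load())
cols.show()
{code}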



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10697) Lift Calculation in Association Rule mining

2018-02-10 Thread Tristan Stevens (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359623#comment-16359623
 ] 

Tristan Stevens commented on SPARK-10697:
-

[~srowen] a big +1 from me to implementing this. Without Lift, it becomes very 
difficult to assess whether a rule is even worth looking at.

As an example, using the dataset from Wikipedia, we get the following output 
currently: 

{code:python}
from pyspark.ml.fpm import FPGrowth

df = spark.createDataFrame([
    (0, ["milk", "bread"]),
    (1, ["butter"]),
    (2, ["beer", "diapers"]),
    (3, ["milk", "bread", "butter"]),
    (4, ["bread"])
], ["id", "items"])

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.2, minConfidence=0.2)
model = fpGrowth.fit(df)

# Display frequent itemsets.
model.freqItemsets.show()
{code}

|items|freq|
|[milk]|2|
|[milk, butter]|1|
|[milk, butter, br...|1|
|[milk, bread]|2|
|[diapers]|1|
|[diapers, beer]|1|
|[bread]|3|
|[butter]|2|
|[butter, bread]|1|
|[beer]|1|


{code:python}
# Display generated association rules.
model.associationRules.show()
{code}
 
|antecedent|consequent|confidence|
|[milk]|[butter]|0.5|
|[milk]|[bread]|1.0|
|[milk, butter]|[bread]|1.0|
|[beer]|[diapers]|1.0|
|[bread]|[milk]|0.|
|[bread]|[butter]|0.|
|[milk, bread]|[butter]|0.5|
|[diapers]|[beer]|1.0|
|[butter, bread]|[milk]|1.0|
|[butter]|[milk]|0.5|
|[butter]|[bread]|0.5|


 However this misses the detail that milk->bread is much less interesting than 
diapers->beer. When we add in lift we get the following:
 
|antecedent|consequent|confidence|lift|
|[milk]|[butter]|0.5|1.25|
|[milk]|[bread]|1.0|1.|
|[milk, butter]|[bread]|1.0|1.|
|[beer]|[diapers]|1.0|5.0|
|[bread]|[milk]|0.|1.|
|[bread]|[butter]|0.|0.8333|
|[milk, bread]|[butter]|0.5|1.25|
|[diapers]|[beer]|1.0|5.0|
|[butter, bread]|[milk]|1.0|2.5|
|[butter]|[milk]|0.5|1.25|
|[butter]|[bread]|0.5|0.8333|



So the proposal would be to add Lift to the Rules class, calculated by
 {{lift( x => y ) = sup(x U y) / (sup( x ) * sup( y ))}}
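
As a reference, here is a minimal sketch (reusing the {{df}} and {{model}}
objects from the snippet above; names are illustrative) of deriving lift from
{{model.freqItemsets}} with that formula:

{code:python}
# Sketch only: lift(x => y) = sup(x U y) / (sup(x) * sup(y)), with relative
# supports taken from the frequent itemsets (freq / number of transactions).
n = df.count()
support = {frozenset(row["items"]): row["freq"] / float(n)
           for row in model.freqItemsets.collect()}

for rule in model.associationRules.collect():
    x = frozenset(rule["antecedent"])
    y = frozenset(rule["consequent"])
    lift = support[x | y] / (support[x] * support[y])
    print(sorted(x), "=>", sorted(y),
          "confidence:", rule["confidence"], "lift:", round(lift, 4))
{code}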

> Lift Calculation in Association Rule mining
> ---
>
> Key: SPARK-10697
> URL: https://issues.apache.org/jira/browse/SPARK-10697
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Yashwanth Kumar
>Priority: Minor
>
> Lift is to be calculated for Association rule mining in 
> AssociationRules.scala under FPM.
> Lift is a measure of the performance of a  Association rules.
> Adding lift will help to compare the model efficiency.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org