[jira] [Commented] (SPARK-23382) Spark Streaming UI tables should support hide/show when they contain many records
[ https://issues.apache.org/jira/browse/SPARK-23382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16359720#comment-16359720 ]

Apache Spark commented on SPARK-23382:
--------------------------------------

User 'guoxiaolongzte' has created a pull request for this issue:
https://github.com/apache/spark/pull/20570

> Spark Streaming UI tables should support hide/show when they contain many records
> ---------------------------------------------------------------------------------
>
> Key: SPARK-23382
> URL: https://issues.apache.org/jira/browse/SPARK-23382
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 2.4.0
> Reporter: guoxiaolongzte
> Priority: Minor
>
> The Spark Streaming UI tables should support hide/show (collapse/expand) controls when they contain many records.
> For the background, see https://issues.apache.org/jira/browse/SPARK-23024

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23382) Spark Streaming UI tables should support hide/show when they contain many records
[ https://issues.apache.org/jira/browse/SPARK-23382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23382:
------------------------------------

Assignee: (was: Apache Spark)

> Spark Streaming UI tables should support hide/show when they contain many records
> ---------------------------------------------------------------------------------
>
> Key: SPARK-23382
> URL: https://issues.apache.org/jira/browse/SPARK-23382
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 2.4.0
> Reporter: guoxiaolongzte
> Priority: Minor
>
> The Spark Streaming UI tables should support hide/show (collapse/expand) controls when they contain many records.
> For the background, see https://issues.apache.org/jira/browse/SPARK-23024
[jira] [Assigned] (SPARK-23382) Spark Streaming UI tables should support hide/show when they contain many records
[ https://issues.apache.org/jira/browse/SPARK-23382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23382:
------------------------------------

Assignee: Apache Spark

> Spark Streaming UI tables should support hide/show when they contain many records
> ---------------------------------------------------------------------------------
>
> Key: SPARK-23382
> URL: https://issues.apache.org/jira/browse/SPARK-23382
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 2.4.0
> Reporter: guoxiaolongzte
> Assignee: Apache Spark
> Priority: Minor
>
> The Spark Streaming UI tables should support hide/show (collapse/expand) controls when they contain many records.
> For the background, see https://issues.apache.org/jira/browse/SPARK-23024
[jira] [Comment Edited] (SPARK-10912) Improve Spark metrics executor.filesystem
[ https://issues.apache.org/jira/browse/SPARK-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16359810#comment-16359810 ]

Harel Ben Attia edited comment on SPARK-10912 at 2/11/18 6:25 AM:
-----------------------------------------------------------------

We would really be glad to see this happen as well, without the need to change Spark's source code. Also, externalizing the array to a configuration property in metrics.properties would be best (or auto-supporting each used FileSystem scheme, obviously, but that might require bigger changes to the registration logic, so it's not necessary).

btw, [~srowen] - the main benefit of getting this from Spark itself is that it provides the filesystem data per executor/driver rather than aggregated, allowing for much better debugging and troubleshooting.

was (Author: harelba):
We would really be glad to see this happen as well, without the need to change Spark's source code. Also, externalizing the array to a configuration property in metrics.properties would be best (or auto-supporting each used FileSystem scheme, obviously, but that might require bigger changes to the registration logic, so it's not necessary).

> Improve Spark metrics executor.filesystem
> -----------------------------------------
>
> Key: SPARK-10912
> URL: https://issues.apache.org/jira/browse/SPARK-10912
> Project: Spark
> Issue Type: Improvement
> Components: Deploy
> Affects Versions: 1.5.0
> Reporter: Yongjia Wang
> Priority: Minor
> Attachments: s3a_metrics.patch
>
> org.apache.spark.executor.ExecutorSource has two filesystem metrics: "hdfs" and "file". I started using S3 as the persistent storage with a Spark standalone cluster in EC2, and S3 read/write metrics do not appear anywhere. The 'file' metric appears to cover only the driver reading local files; it would be nice to also report shuffle read/write metrics, so they can help with optimization.
> I think these two things (S3 and shuffle) are very useful and cover all the missing information about Spark IO, especially for an S3 setup.
[jira] [Commented] (SPARK-10912) Improve Spark metrics executor.filesystem
[ https://issues.apache.org/jira/browse/SPARK-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16359810#comment-16359810 ]

Harel Ben Attia commented on SPARK-10912:
-----------------------------------------

We would really be glad to see this happen as well, without the need to change Spark's source code. Also, externalizing the array to a configuration property in metrics.properties would be best (or auto-supporting each used FileSystem scheme, obviously, but that might require bigger changes to the registration logic, so it's not necessary).

> Improve Spark metrics executor.filesystem
> -----------------------------------------
>
> Key: SPARK-10912
> URL: https://issues.apache.org/jira/browse/SPARK-10912
> Project: Spark
> Issue Type: Improvement
> Components: Deploy
> Affects Versions: 1.5.0
> Reporter: Yongjia Wang
> Priority: Minor
> Attachments: s3a_metrics.patch
>
> org.apache.spark.executor.ExecutorSource has two filesystem metrics: "hdfs" and "file". I started using S3 as the persistent storage with a Spark standalone cluster in EC2, and S3 read/write metrics do not appear anywhere. The 'file' metric appears to cover only the driver reading local files; it would be nice to also report shuffle read/write metrics, so they can help with optimization.
> I think these two things (S3 and shuffle) are very useful and cover all the missing information about Spark IO, especially for an S3 setup.
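The comment above proposes externalizing the hard-coded filesystem-scheme array to a configuration property. A minimal Python sketch of that idea follows; it is illustrative only — Spark's actual ExecutorSource is Scala and uses Dropwizard metrics, and the property name `spark.metrics.filesystem.schemes` is hypothetical.

```python
# Sketch: drive per-scheme filesystem gauges from a config property
# instead of a hard-coded {"hdfs", "file"} array. All names here are
# assumptions for illustration, not Spark's real API.

def register_filesystem_gauges(conf, registry, stats_for_scheme):
    """Register read/write gauges for every scheme listed in config.

    conf             -- mapping of config keys to string values
    registry         -- dict of metric name -> zero-arg callable (gauge)
    stats_for_scheme -- callable returning a stats dict for a scheme
    """
    schemes = [s.strip()
               for s in conf.get("spark.metrics.filesystem.schemes",
                                 "hdfs,file").split(",")
               if s.strip()]
    for scheme in schemes:
        for stat in ("read_bytes", "write_bytes", "read_ops", "write_ops"):
            name = f"filesystem.{scheme}.{stat}"
            # Bind loop variables as defaults so each gauge reads lazily
            # and stays current when the underlying stats change.
            registry[name] = (lambda sc=scheme, st=stat:
                              stats_for_scheme(sc).get(st, 0))
    return schemes
```

With this shape, adding s3a metrics becomes a configuration change (`spark.metrics.filesystem.schemes=hdfs,file,s3a`) rather than a source change, which is exactly the flexibility the comment asks for.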
[jira] [Created] (SPARK-23382) Spark Streaming UI tables should support hide/show when they contain many records
guoxiaolongzte created SPARK-23382:
-----------------------------------

Summary: Spark Streaming UI tables should support hide/show when they contain many records
Key: SPARK-23382
URL: https://issues.apache.org/jira/browse/SPARK-23382
Project: Spark
Issue Type: Improvement
Components: Web UI
Affects Versions: 2.4.0
Reporter: guoxiaolongzte

The Spark Streaming UI tables should support hide/show (collapse/expand) controls when they contain many records.
For the background, see https://issues.apache.org/jira/browse/SPARK-23024
[jira] [Updated] (SPARK-23383) make-distribution.sh should exit with usage on invalid options
[ https://issues.apache.org/jira/browse/SPARK-23383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao updated SPARK-23383:
-----------------------------
Description:
{code:java}
./dev/make-distribution.sh --name ne-1.0.0-SNAPSHOT xyz --tgz -Phadoop-2.7
+++ dirname ./dev/make-distribution.sh
++ cd ./dev/..
++ pwd
+ SPARK_HOME=/Users/Kent/Documents/spark
+ DISTDIR=/Users/Kent/Documents/spark/dist
+ MAKE_TGZ=false
+ MAKE_PIP=false
+ MAKE_R=false
+ NAME=none
+ MVN=/Users/Kent/Documents/spark/build/mvn
+ (( 5 ))
+ case $1 in
+ NAME=ne-1.0.0-SNAPSHOT
+ shift
+ shift
+ (( 3 ))
+ case $1 in
+ break
+ '[' -z /Users/Kent/.jenv/candidates/java/current ']'
+ '[' -z /Users/Kent/.jenv/candidates/java/current ']'
++ command -v git
+ '[' /usr/local/bin/git ']'
++ git rev-parse --short HEAD
+ GITREV=98ea6a7
+ '[' '!' -z 98ea6a7 ']'
+ GITREVSTRING=' (git revision 98ea6a7)'
+ unset GITREV
++ command -v /Users/Kent/Documents/spark/build/mvn
+ '[' '!' /Users/Kent/Documents/spark/build/mvn ']'
++ /Users/Kent/Documents/spark/build/mvn help:evaluate -Dexpression=project.version xyz --tgz -Phadoop-2.7
++ grep -v INFO
++ tail -n 1
+ VERSION=' -X,--debug Produce execution debug output'
{code}
It would be better to report the invalid option and exit with a usage message.

was:
```
./dev/make-distribution.sh --name ne-1.0.0-SNAPSHOT xyz --tgz -Phadoop-2.7
+++ dirname ./dev/make-distribution.sh
++ cd ./dev/..
++ pwd
+ SPARK_HOME=/Users/Kent/Documents/spark
+ DISTDIR=/Users/Kent/Documents/spark/dist
+ MAKE_TGZ=false
+ MAKE_PIP=false
+ MAKE_R=false
+ NAME=none
+ MVN=/Users/Kent/Documents/spark/build/mvn
+ (( 5 ))
+ case $1 in
+ NAME=ne-1.0.0-SNAPSHOT
+ shift
+ shift
+ (( 3 ))
+ case $1 in
+ break
+ '[' -z /Users/Kent/.jenv/candidates/java/current ']'
+ '[' -z /Users/Kent/.jenv/candidates/java/current ']'
++ command -v git
+ '[' /usr/local/bin/git ']'
++ git rev-parse --short HEAD
+ GITREV=98ea6a7
+ '[' '!' -z 98ea6a7 ']'
+ GITREVSTRING=' (git revision 98ea6a7)'
+ unset GITREV
++ command -v /Users/Kent/Documents/spark/build/mvn
+ '[' '!' /Users/Kent/Documents/spark/build/mvn ']'
++ /Users/Kent/Documents/spark/build/mvn help:evaluate -Dexpression=project.version xyz --tgz -Phadoop-2.7
++ grep -v INFO
++ tail -n 1
+ VERSION=' -X,--debug Produce execution debug output'
```
It would be better to report the invalid option and exit with a usage message.

> make-distribution.sh should exit with usage on invalid options
> --------------------------------------------------------------
>
> Key: SPARK-23383
> URL: https://issues.apache.org/jira/browse/SPARK-23383
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 2.2.1
> Reporter: Kent Yao
> Priority: Minor
>
> {code:java}
> ./dev/make-distribution.sh --name ne-1.0.0-SNAPSHOT xyz --tgz -Phadoop-2.7
> +++ dirname ./dev/make-distribution.sh
> ++ cd ./dev/..
> ++ pwd
> + SPARK_HOME=/Users/Kent/Documents/spark
> + DISTDIR=/Users/Kent/Documents/spark/dist
> + MAKE_TGZ=false
> + MAKE_PIP=false
> + MAKE_R=false
> + NAME=none
> + MVN=/Users/Kent/Documents/spark/build/mvn
> + (( 5 ))
> + case $1 in
> + NAME=ne-1.0.0-SNAPSHOT
> + shift
> + shift
> + (( 3 ))
> + case $1 in
> + break
> + '[' -z /Users/Kent/.jenv/candidates/java/current ']'
> + '[' -z /Users/Kent/.jenv/candidates/java/current ']'
> ++ command -v git
> + '[' /usr/local/bin/git ']'
> ++ git rev-parse --short HEAD
> + GITREV=98ea6a7
> + '[' '!' -z 98ea6a7 ']'
> + GITREVSTRING=' (git revision 98ea6a7)'
> + unset GITREV
> ++ command -v /Users/Kent/Documents/spark/build/mvn
> + '[' '!' /Users/Kent/Documents/spark/build/mvn ']'
> ++ /Users/Kent/Documents/spark/build/mvn help:evaluate -Dexpression=project.version xyz --tgz -Phadoop-2.7
> ++ grep -v INFO
> ++ tail -n 1
> + VERSION=' -X,--debug Produce execution debug output'
> {code}
> It would be better to report the invalid option and exit with a usage message.
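The trace above shows the bug: the stray token `xyz` falls through the option loop and is passed verbatim to Maven, which then corrupts the `VERSION` lookup. The fail-fast behavior the issue asks for can be sketched as follows (in Python for illustration; the real script is Bash, and the option set shown here is a simplified subset):

```python
import sys

USAGE = "usage: make-distribution.sh [--name NAME] [--tgz] <maven build options>"

def parse_args(argv):
    """Mimic the script's option loop, but reject bare tokens
    (like the stray 'xyz' in the report) instead of silently
    forwarding them to Maven."""
    opts = {"name": "none", "tgz": False}
    mvn_args = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "--name":
            i += 1                      # consume the option's value
            opts["name"] = argv[i]
        elif arg == "--tgz":
            opts["tgz"] = True
        elif arg.startswith("-"):       # assume a Maven flag, e.g. -Phadoop-2.7
            mvn_args.append(arg)
        else:                           # unrecognized bare token: fail fast
            print(f"Error: unrecognized option '{arg}'", file=sys.stderr)
            print(USAGE, file=sys.stderr)
            sys.exit(1)
        i += 1
    return opts, mvn_args
```

The key design point is the final `else` branch: anything that is neither a known option nor a plausible Maven flag stops the build immediately with a usage message, rather than producing the garbage `VERSION` seen in the trace.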
[jira] [Updated] (SPARK-23340) Update ORC to 1.4.3
[ https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-23340:
----------------------------------
Description: This issue updates the Apache ORC dependencies to 1.4.3, released on February 9th. The Apache ORC 1.4.2 release removed unnecessary dependencies, and 1.4.3 includes 5 more patches (https://s.apache.org/Fll8).

was: ORC 1.4.2 was released on January 23rd. This release removes unnecessary dependencies.

> Update ORC to 1.4.3
> -------------------
>
> Key: SPARK-23340
> URL: https://issues.apache.org/jira/browse/SPARK-23340
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 2.3.0
> Reporter: Dongjoon Hyun
> Priority: Major
>
> This issue updates the Apache ORC dependencies to 1.4.3, released on February 9th. The Apache ORC 1.4.2 release removed unnecessary dependencies, and 1.4.3 includes 5 more patches (https://s.apache.org/Fll8).
[jira] [Created] (SPARK-23385) Allow custom SparkUITabs to be added via SparkConf and loaded when creating the SparkUI
Lantao Jin created SPARK-23385:
-------------------------------

Summary: Allow custom SparkUITabs to be added via SparkConf and loaded when creating the SparkUI
Key: SPARK-23385
URL: https://issues.apache.org/jira/browse/SPARK-23385
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 2.2.1
Reporter: Lantao Jin

It would be nice if there were a mechanism to register custom SparkUITabs (alongside the built-in Jobs, Stages, Storage, Environment, and Executors tabs) through SparkConf settings. This would make it easier to display application-specific information in the UI, without adding built-in tabs one by one and waiting for the community to merge them.

I propose to introduce a new configuration option, spark.extraUITabs, that allows custom WebUITabs to be specified in SparkConf and registered when the SparkUI is created. Here is the proposed documentation for the new option:
{quote}
A comma-separated list of classes that extend SparkUITab. When the SparkUI is initialized, instances of these classes are created and registered in the SparkUI's tabs array. If a class has a two-argument constructor accepting a SparkUI and an AppStatusStore, that constructor is called; otherwise, if it has a single-argument constructor accepting a SparkUI, that constructor is called; otherwise, a zero-argument constructor is called. If no valid constructor can be found, SparkUI creation fails with an exception.
{quote}
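The constructor-resolution order in the proposal (two-argument, then one-argument, then zero-argument, else fail) can be sketched as below. This is an illustrative Python sketch with hypothetical names — the real implementation would use Scala/Java reflection and inspect parameter types rather than catching arity errors.

```python
def instantiate_tab(cls, spark_ui, status_store):
    """Try constructors from richest to simplest, as the proposal describes:
    (SparkUI, AppStatusStore) -> (SparkUI,) -> () -> error."""
    errors = []
    for args in ((spark_ui, status_store), (spark_ui,), ()):
        try:
            return cls(*args)
        except TypeError as e:          # wrong arity: fall back to next form
            errors.append(e)
    raise RuntimeError(
        f"no valid constructor found for {cls.__name__}: {errors}")
```

One caveat of this arity-probing shortcut: a constructor that itself raises TypeError internally would be wrongly skipped, which is why a reflection-based implementation would check declared parameter types instead of trial instantiation.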
[jira] [Assigned] (SPARK-23385) Allow custom SparkUITabs to be added via SparkConf and loaded when creating the SparkUI
[ https://issues.apache.org/jira/browse/SPARK-23385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23385:
------------------------------------

Assignee: (was: Apache Spark)

> Allow custom SparkUITabs to be added via SparkConf and loaded when creating the SparkUI
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-23385
> URL: https://issues.apache.org/jira/browse/SPARK-23385
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 2.2.1
> Reporter: Lantao Jin
> Priority: Major
>
> It would be nice if there were a mechanism to register custom SparkUITabs (alongside the built-in Jobs, Stages, Storage, Environment, and Executors tabs) through SparkConf settings. This would make it easier to display application-specific information in the UI, without adding built-in tabs one by one and waiting for the community to merge them.
> I propose to introduce a new configuration option, spark.extraUITabs, that allows custom WebUITabs to be specified in SparkConf and registered when the SparkUI is created. Here is the proposed documentation for the new option:
> {quote}
> A comma-separated list of classes that extend SparkUITab. When the SparkUI is initialized, instances of these classes are created and registered in the SparkUI's tabs array. If a class has a two-argument constructor accepting a SparkUI and an AppStatusStore, that constructor is called; otherwise, if it has a single-argument constructor accepting a SparkUI, that constructor is called; otherwise, a zero-argument constructor is called. If no valid constructor can be found, SparkUI creation fails with an exception.
> {quote}
[jira] [Commented] (SPARK-23385) Allow custom SparkUITabs to be added via SparkConf and loaded when creating the SparkUI
[ https://issues.apache.org/jira/browse/SPARK-23385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16359825#comment-16359825 ]

Apache Spark commented on SPARK-23385:
--------------------------------------

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20574

> Allow custom SparkUITabs to be added via SparkConf and loaded when creating the SparkUI
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-23385
> URL: https://issues.apache.org/jira/browse/SPARK-23385
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 2.2.1
> Reporter: Lantao Jin
> Priority: Major
>
> It would be nice if there were a mechanism to register custom SparkUITabs (alongside the built-in Jobs, Stages, Storage, Environment, and Executors tabs) through SparkConf settings. This would make it easier to display application-specific information in the UI, without adding built-in tabs one by one and waiting for the community to merge them.
> I propose to introduce a new configuration option, spark.extraUITabs, that allows custom WebUITabs to be specified in SparkConf and registered when the SparkUI is created. Here is the proposed documentation for the new option:
> {quote}
> A comma-separated list of classes that extend SparkUITab. When the SparkUI is initialized, instances of these classes are created and registered in the SparkUI's tabs array. If a class has a two-argument constructor accepting a SparkUI and an AppStatusStore, that constructor is called; otherwise, if it has a single-argument constructor accepting a SparkUI, that constructor is called; otherwise, a zero-argument constructor is called. If no valid constructor can be found, SparkUI creation fails with an exception.
> {quote}
[jira] [Assigned] (SPARK-23385) Allow custom SparkUITabs to be added via SparkConf and loaded when creating the SparkUI
[ https://issues.apache.org/jira/browse/SPARK-23385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23385:
------------------------------------

Assignee: Apache Spark

> Allow custom SparkUITabs to be added via SparkConf and loaded when creating the SparkUI
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-23385
> URL: https://issues.apache.org/jira/browse/SPARK-23385
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 2.2.1
> Reporter: Lantao Jin
> Assignee: Apache Spark
> Priority: Major
>
> It would be nice if there were a mechanism to register custom SparkUITabs (alongside the built-in Jobs, Stages, Storage, Environment, and Executors tabs) through SparkConf settings. This would make it easier to display application-specific information in the UI, without adding built-in tabs one by one and waiting for the community to merge them.
> I propose to introduce a new configuration option, spark.extraUITabs, that allows custom WebUITabs to be specified in SparkConf and registered when the SparkUI is created. Here is the proposed documentation for the new option:
> {quote}
> A comma-separated list of classes that extend SparkUITab. When the SparkUI is initialized, instances of these classes are created and registered in the SparkUI's tabs array. If a class has a two-argument constructor accepting a SparkUI and an AppStatusStore, that constructor is called; otherwise, if it has a single-argument constructor accepting a SparkUI, that constructor is called; otherwise, a zero-argument constructor is called. If no valid constructor can be found, SparkUI creation fails with an exception.
> {quote}
[jira] [Assigned] (SPARK-23386) Enable direct application links before replay
[ https://issues.apache.org/jira/browse/SPARK-23386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23386:
------------------------------------

Assignee: Apache Spark

> Enable direct application links before replay
> ---------------------------------------------
>
> Key: SPARK-23386
> URL: https://issues.apache.org/jira/browse/SPARK-23386
> Project: Spark
> Issue Type: Improvement
> Components: Deploy
> Affects Versions: 2.2.1
> Reporter: Gera Shegalov
> Assignee: Apache Spark
> Priority: Major
>
> In a deployment with tens of thousands of large event logs it may take *many hours* until all logs are replayed. Most of our users reach the SHS by clicking a link in a client log after an error. Direct links currently don't work until the event log has been processed by a replay thread. This JIRA proposes to link the appid to its event log already during the scan, without a full replay. This makes on-demand retrievals available almost immediately after the SHS starts.
[jira] [Assigned] (SPARK-23386) Enable direct application links before replay
[ https://issues.apache.org/jira/browse/SPARK-23386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23386:
------------------------------------

Assignee: (was: Apache Spark)

> Enable direct application links before replay
> ---------------------------------------------
>
> Key: SPARK-23386
> URL: https://issues.apache.org/jira/browse/SPARK-23386
> Project: Spark
> Issue Type: Improvement
> Components: Deploy
> Affects Versions: 2.2.1
> Reporter: Gera Shegalov
> Priority: Major
>
> In a deployment with tens of thousands of large event logs it may take *many hours* until all logs are replayed. Most of our users reach the SHS by clicking a link in a client log after an error. Direct links currently don't work until the event log has been processed by a replay thread. This JIRA proposes to link the appid to its event log already during the scan, without a full replay. This makes on-demand retrievals available almost immediately after the SHS starts.
[jira] [Commented] (SPARK-23386) Enable direct application links before replay
[ https://issues.apache.org/jira/browse/SPARK-23386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16359831#comment-16359831 ]

Apache Spark commented on SPARK-23386:
--------------------------------------

User 'gerashegalov' has created a pull request for this issue:
https://github.com/apache/spark/pull/20575

> Enable direct application links before replay
> ---------------------------------------------
>
> Key: SPARK-23386
> URL: https://issues.apache.org/jira/browse/SPARK-23386
> Project: Spark
> Issue Type: Improvement
> Components: Deploy
> Affects Versions: 2.2.1
> Reporter: Gera Shegalov
> Priority: Major
>
> In a deployment with tens of thousands of large event logs it may take *many hours* until all logs are replayed. Most of our users reach the SHS by clicking a link in a client log after an error. Direct links currently don't work until the event log has been processed by a replay thread. This JIRA proposes to link the appid to its event log already during the scan, without a full replay. This makes on-demand retrievals available almost immediately after the SHS starts.
[jira] [Commented] (SPARK-23381) Murmur3 hash generates a different value from other implementations
[ https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16359690#comment-16359690 ]

Shintaro Murakami commented on SPARK-23381:
-------------------------------------------

FeatureHasher in MLlib uses Murmur3 to hash feature indices. If I serve online predictions in another environment, such as a C++ prediction server, the indices do not match and predictions are incorrect.

> Murmur3 hash generates a different value from other implementations
> -------------------------------------------------------------------
>
> Key: SPARK-23381
> URL: https://issues.apache.org/jira/browse/SPARK-23381
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.1
> Reporter: Shintaro Murakami
> Priority: Major
>
> The Murmur3 hash generates a different value from the original and other implementations (such as the Scala standard library and Guava) when the length of the byte array is not a multiple of 4.
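Since the divergence appears only when the input length is not a multiple of 4, the difference must be in how the trailing bytes are folded in. The pure-Python sketch below is illustrative only (it is not Spark's actual Scala/Java code): it contrasts the reference Murmur3 x86_32 tail strategy, which accumulates the leftover bytes into a single k1 before one final mix, with a variant that runs a full mix round per leftover byte. The two agree exactly on 4-byte-aligned input and diverge otherwise, which matches the behavior in the report.

```python
# Illustrative Murmur3 x86_32 with two tail-handling strategies.
# Not Spark's real implementation; for demonstrating the divergence only.

def _mix_k1(k1):
    k1 = (k1 * 0xcc9e2d51) & 0xffffffff
    k1 = ((k1 << 15) | (k1 >> 17)) & 0xffffffff    # rotl32(k1, 15)
    return (k1 * 0x1b873593) & 0xffffffff

def _mix_h1(h1, k1):
    h1 ^= k1
    h1 = ((h1 << 13) | (h1 >> 19)) & 0xffffffff    # rotl32(h1, 13)
    return (h1 * 5 + 0xe6546b64) & 0xffffffff

def _fmix(h1, length):
    h1 ^= length
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85ebca6b) & 0xffffffff
    h1 ^= h1 >> 13
    h1 = (h1 * 0xc2b2ae35) & 0xffffffff
    return h1 ^ (h1 >> 16)

def _body(data, seed):
    """Mix all complete little-endian 4-byte blocks; return (h1, bytes used)."""
    h1 = seed
    n = len(data) // 4 * 4
    for i in range(0, n, 4):
        h1 = _mix_h1(h1, _mix_k1(int.from_bytes(data[i:i + 4], "little")))
    return h1, n

def murmur3_standard(data, seed=0):
    """Reference-style tail: fold remaining bytes into one k1, single mix."""
    h1, n = _body(data, seed)
    if len(data) > n:
        k1 = 0
        for i, b in enumerate(data[n:]):
            k1 ^= b << (8 * i)
        h1 ^= _mix_k1(k1)          # note: no full _mix_h1 round for the tail
    return _fmix(h1, len(data))

def murmur3_per_byte_tail(data, seed=0):
    """Divergent strategy: one full mix round per leftover byte."""
    h1, n = _body(data, seed)
    for b in data[n:]:
        h1 = _mix_h1(h1, _mix_k1(b))
    return _fmix(h1, len(data))
```

For aligned input the two functions execute identical code paths, so equality there is guaranteed; for unaligned input they compute genuinely different mixes, which is the kind of cross-implementation mismatch that breaks serving FeatureHasher indices from a non-Spark runtime.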
[jira] [Commented] (SPARK-23383) make-distribution.sh should exit with usage on invalid options
[ https://issues.apache.org/jira/browse/SPARK-23383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16359763#comment-16359763 ]

Apache Spark commented on SPARK-23383:
--------------------------------------

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/20571

> make-distribution.sh should exit with usage on invalid options
> --------------------------------------------------------------
>
> Key: SPARK-23383
> URL: https://issues.apache.org/jira/browse/SPARK-23383
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 2.2.1
> Reporter: Kent Yao
> Priority: Minor
>
> {code:java}
> ./dev/make-distribution.sh --name ne-1.0.0-SNAPSHOT xyz --tgz -Phadoop-2.7
> +++ dirname ./dev/make-distribution.sh
> ++ cd ./dev/..
> ++ pwd
> + SPARK_HOME=/Users/Kent/Documents/spark
> + DISTDIR=/Users/Kent/Documents/spark/dist
> + MAKE_TGZ=false
> + MAKE_PIP=false
> + MAKE_R=false
> + NAME=none
> + MVN=/Users/Kent/Documents/spark/build/mvn
> + (( 5 ))
> + case $1 in
> + NAME=ne-1.0.0-SNAPSHOT
> + shift
> + shift
> + (( 3 ))
> + case $1 in
> + break
> + '[' -z /Users/Kent/.jenv/candidates/java/current ']'
> + '[' -z /Users/Kent/.jenv/candidates/java/current ']'
> ++ command -v git
> + '[' /usr/local/bin/git ']'
> ++ git rev-parse --short HEAD
> + GITREV=98ea6a7
> + '[' '!' -z 98ea6a7 ']'
> + GITREVSTRING=' (git revision 98ea6a7)'
> + unset GITREV
> ++ command -v /Users/Kent/Documents/spark/build/mvn
> + '[' '!' /Users/Kent/Documents/spark/build/mvn ']'
> ++ /Users/Kent/Documents/spark/build/mvn help:evaluate -Dexpression=project.version xyz --tgz -Phadoop-2.7
> ++ grep -v INFO
> ++ tail -n 1
> + VERSION=' -X,--debug Produce execution debug output'
> {code}
> It would be better to report the invalid option and exit with a usage message.
[jira] [Assigned] (SPARK-23383) make-distribution.sh should exit with usage on invalid options
[ https://issues.apache.org/jira/browse/SPARK-23383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23383:
------------------------------------

Assignee: Apache Spark

> make-distribution.sh should exit with usage on invalid options
> --------------------------------------------------------------
>
> Key: SPARK-23383
> URL: https://issues.apache.org/jira/browse/SPARK-23383
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 2.2.1
> Reporter: Kent Yao
> Assignee: Apache Spark
> Priority: Minor
>
> {code:java}
> ./dev/make-distribution.sh --name ne-1.0.0-SNAPSHOT xyz --tgz -Phadoop-2.7
> +++ dirname ./dev/make-distribution.sh
> ++ cd ./dev/..
> ++ pwd
> + SPARK_HOME=/Users/Kent/Documents/spark
> + DISTDIR=/Users/Kent/Documents/spark/dist
> + MAKE_TGZ=false
> + MAKE_PIP=false
> + MAKE_R=false
> + NAME=none
> + MVN=/Users/Kent/Documents/spark/build/mvn
> + (( 5 ))
> + case $1 in
> + NAME=ne-1.0.0-SNAPSHOT
> + shift
> + shift
> + (( 3 ))
> + case $1 in
> + break
> + '[' -z /Users/Kent/.jenv/candidates/java/current ']'
> + '[' -z /Users/Kent/.jenv/candidates/java/current ']'
> ++ command -v git
> + '[' /usr/local/bin/git ']'
> ++ git rev-parse --short HEAD
> + GITREV=98ea6a7
> + '[' '!' -z 98ea6a7 ']'
> + GITREVSTRING=' (git revision 98ea6a7)'
> + unset GITREV
> ++ command -v /Users/Kent/Documents/spark/build/mvn
> + '[' '!' /Users/Kent/Documents/spark/build/mvn ']'
> ++ /Users/Kent/Documents/spark/build/mvn help:evaluate -Dexpression=project.version xyz --tgz -Phadoop-2.7
> ++ grep -v INFO
> ++ tail -n 1
> + VERSION=' -X,--debug Produce execution debug output'
> {code}
> It would be better to report the invalid option and exit with a usage message.
[jira] [Assigned] (SPARK-23383) make-distribution.sh should exit with usage on invalid options
[ https://issues.apache.org/jira/browse/SPARK-23383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23383:
------------------------------------

Assignee: (was: Apache Spark)

> make-distribution.sh should exit with usage on invalid options
> --------------------------------------------------------------
>
> Key: SPARK-23383
> URL: https://issues.apache.org/jira/browse/SPARK-23383
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 2.2.1
> Reporter: Kent Yao
> Priority: Minor
>
> {code:java}
> ./dev/make-distribution.sh --name ne-1.0.0-SNAPSHOT xyz --tgz -Phadoop-2.7
> +++ dirname ./dev/make-distribution.sh
> ++ cd ./dev/..
> ++ pwd
> + SPARK_HOME=/Users/Kent/Documents/spark
> + DISTDIR=/Users/Kent/Documents/spark/dist
> + MAKE_TGZ=false
> + MAKE_PIP=false
> + MAKE_R=false
> + NAME=none
> + MVN=/Users/Kent/Documents/spark/build/mvn
> + (( 5 ))
> + case $1 in
> + NAME=ne-1.0.0-SNAPSHOT
> + shift
> + shift
> + (( 3 ))
> + case $1 in
> + break
> + '[' -z /Users/Kent/.jenv/candidates/java/current ']'
> + '[' -z /Users/Kent/.jenv/candidates/java/current ']'
> ++ command -v git
> + '[' /usr/local/bin/git ']'
> ++ git rev-parse --short HEAD
> + GITREV=98ea6a7
> + '[' '!' -z 98ea6a7 ']'
> + GITREVSTRING=' (git revision 98ea6a7)'
> + unset GITREV
> ++ command -v /Users/Kent/Documents/spark/build/mvn
> + '[' '!' /Users/Kent/Documents/spark/build/mvn ']'
> ++ /Users/Kent/Documents/spark/build/mvn help:evaluate -Dexpression=project.version xyz --tgz -Phadoop-2.7
> ++ grep -v INFO
> ++ tail -n 1
> + VERSION=' -X,--debug Produce execution debug output'
> {code}
> It would be better to report the invalid option and exit with a usage message.
[jira] [Updated] (SPARK-23340) Update ORC to 1.4.3
[ https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-23340:
----------------------------------

Summary: Update ORC to 1.4.3 (was: Update ORC to 1.4.2)

> Update ORC to 1.4.3
> -------------------
>
> Key: SPARK-23340
> URL: https://issues.apache.org/jira/browse/SPARK-23340
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 2.3.0
> Reporter: Dongjoon Hyun
> Priority: Major
>
> ORC 1.4.2 was released on January 23rd. This release removes unnecessary dependencies.
[jira] [Created] (SPARK-23386) Enable direct application links before replay
Gera Shegalov created SPARK-23386: - Summary: Enable direct application links before replay Key: SPARK-23386 URL: https://issues.apache.org/jira/browse/SPARK-23386 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 2.2.1 Reporter: Gera Shegalov In a deployment with many tens of thousands of large event logs it may take *many hours* until all logs are replayed. Most of our users reach the SHS by clicking a link in a client log when an error occurs. Direct links currently don't work until the event log has been processed by a replay thread. This JIRA proposes mapping the appid to its event log already during the scan, without a full replay. This makes on-demand retrieval available almost immediately after SHS start. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
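The idea above, resolving an application ID from its event-log file name during the scan, before any replay, can be sketched as follows. This is a hypothetical illustration: the suffix list and naming scheme are assumptions for the sketch, not the SHS implementation.

```python
# Hypothetical sketch: recover an application ID from an event-log file name
# during the directory scan, so /history/<appId> links can resolve before the
# log is replayed. The suffix list below is an assumption for illustration.

KNOWN_SUFFIXES = (".inprogress", ".lz4", ".snappy", ".gz")

def app_id_from_log_name(name: str) -> str:
    """Strip in-progress and compression suffixes to recover the app ID."""
    stripped = True
    while stripped:
        stripped = False
        for suffix in KNOWN_SUFFIXES:
            if name.endswith(suffix):
                name = name[: -len(suffix)]
                stripped = True
    return name

def build_link_index(file_names):
    """Map app ID -> log file name, available right after the scan."""
    return {app_id_from_log_name(n): n for n in file_names}
```

With such an index, a request for a not-yet-replayed application could trigger an on-demand replay of just that one log instead of waiting for the full replay pass.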
[jira] [Created] (SPARK-23383) Make a distribution should exit with usage while detecting wrong options
Kent Yao created SPARK-23383: Summary: Make a distribution should exit with usage while detecting wrong options Key: SPARK-23383 URL: https://issues.apache.org/jira/browse/SPARK-23383 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.2.1 Reporter: Kent Yao ``` ./dev/make-distribution.sh --name ne-1.0.0-SNAPSHOT xyz --tgz -Phadoop-2.7 +++ dirname ./dev/make-distribution.sh ++ cd ./dev/.. ++ pwd + SPARK_HOME=/Users/Kent/Documents/spark + DISTDIR=/Users/Kent/Documents/spark/dist + MAKE_TGZ=false + MAKE_PIP=false + MAKE_R=false + NAME=none + MVN=/Users/Kent/Documents/spark/build/mvn + (( 5 )) + case $1 in + NAME=ne-1.0.0-SNAPSHOT + shift + shift + (( 3 )) + case $1 in + break + '[' -z /Users/Kent/.jenv/candidates/java/current ']' + '[' -z /Users/Kent/.jenv/candidates/java/current ']' ++ command -v git + '[' /usr/local/bin/git ']' ++ git rev-parse --short HEAD + GITREV=98ea6a7 + '[' '!' -z 98ea6a7 ']' + GITREVSTRING=' (git revision 98ea6a7)' + unset GITREV ++ command -v /Users/Kent/Documents/spark/build/mvn + '[' '!' /Users/Kent/Documents/spark/build/mvn ']' ++ /Users/Kent/Documents/spark/build/mvn help:evaluate -Dexpression=project.version xyz --tgz -Phadoop-2.7 ++ grep -v INFO ++ tail -n 1 + VERSION=' -X,--debug Produce execution debug output' ``` It would be better to report the invalid options and exit with a usage message. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
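The trace shows the root cause: the stray positional token `xyz` falls through the `case $1 in` loop and is forwarded to Maven, which then prints its own help text into `VERSION`. A minimal sketch of the proposed behavior, rejecting unrecognized tokens with a usage message, written in Python for illustration (the option names are a simplified subset, not the script's full interface):

```python
import sys

USAGE = ("usage: make-distribution.sh [--name <name>] [--tgz] "
         "<maven build options>")

def parse_args(argv):
    """Sketch of the bash `case $1 in` loop with an explicit error branch:
    unknown tokens that are not Maven options (i.e. do not start with '-')
    abort with a usage message instead of being forwarded to Maven."""
    opts = {"name": "none", "tgz": False}
    maven_args = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "--name":
            i += 1
            opts["name"] = argv[i]
        elif arg == "--tgz":
            opts["tgz"] = True
        elif arg.startswith("-"):
            maven_args.append(arg)  # forward -P..., -D..., etc. to Maven
        else:
            sys.exit(f"Error: unrecognized option '{arg}'\n{USAGE}")
        i += 1
    return opts, maven_args
```

With this check, `./dev/make-distribution.sh --name ne-1.0.0-SNAPSHOT xyz --tgz -Phadoop-2.7` would fail fast at `xyz` rather than producing a garbage `VERSION`.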
[jira] [Assigned] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17147: Assignee: Apache Spark > Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets > (i.e. Log Compaction) > -- > > Key: SPARK-17147 > URL: https://issues.apache.org/jira/browse/SPARK-17147 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: Robert Conrad >Assignee: Apache Spark >Priority: Major > > When Kafka does log compaction offsets often end up with gaps, meaning the > next requested offset will be frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359786#comment-16359786 ] Apache Spark commented on SPARK-17147: -- User 'koeninger' has created a pull request for this issue: https://github.com/apache/spark/pull/20572 > Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets > (i.e. Log Compaction) > -- > > Key: SPARK-17147 > URL: https://issues.apache.org/jira/browse/SPARK-17147 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: Robert Conrad >Priority: Major > > When Kafka does log compaction offsets often end up with gaps, meaning the > next requested offset will be frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17147: Assignee: (was: Apache Spark) > Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets > (i.e. Log Compaction) > -- > > Key: SPARK-17147 > URL: https://issues.apache.org/jira/browse/SPARK-17147 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 >Reporter: Robert Conrad >Priority: Major > > When Kafka does log compaction offsets often end up with gaps, meaning the > next requested offset will be frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
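The difference between the two offset-tracking strategies in the description can be modeled without Kafka at all. A sketch in plain Python, with records as `(offset, value)` pairs standing in for `ConsumerRecord`s; the function names are illustrative, not Spark's:

```python
def consume_assuming_consecutive(records, start_offset):
    """Old logic (nextOffset = offset + 1): breaks at the first gap that
    log compaction leaves in the offset sequence."""
    expected = start_offset
    out = []
    for offset, value in records:
        if offset != expected:
            raise ValueError(f"expected offset {expected}, got {offset}")
        out.append(value)
        expected += 1  # assumes offsets are always consecutive
    return out

def consume_using_record_offset(records, start_offset):
    """Fixed logic (nextOffset = record.offset + 1): tolerates gaps, since
    the next expected offset is derived from the record actually returned."""
    next_offset = start_offset
    out = []
    for offset, value in records:
        if offset < next_offset:
            raise ValueError("offsets must be monotonically increasing")
        out.append(value)
        next_offset = offset + 1  # advance from the offset actually seen
    return out
```

On a compacted stream like offsets `1, 3, 7`, the first function fails at offset `3` while the second consumes all records, which mirrors the proposed change in `CachedKafkaConsumer` and `KafkaRDD`.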
[jira] [Commented] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale
[ https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359788#comment-16359788 ] Harleen Singh Mann commented on SPARK-23370: As far as I understand: * JDBC driver: Once we create the result set object using the JDBC driver, it will contain all the actual data as well as the metadata for the concerned DB table. * Query additional table (all_tab_columns): This would entail creating another result set that will capture the metadata for the concerned DB table as data (rows). Overhead: ** Connection: None, since it will use pooling. ** Retrieving result: Low impact, since we will push down the predicate to the DB to filter data only for the concerned table. I believe that the "all_tab_columns" table should be queried on the driver and broadcast to the executors. Does this make sense? Can we get some inputs from someone else as well? > Spark receives a size of 0 for an Oracle Number field and defaults the field > type to be BigDecimal(30,10) instead of the actual precision and scale > --- > > Key: SPARK-23370 > URL: https://issues.apache.org/jira/browse/SPARK-23370 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 > Environment: Spark 2.2 > Oracle 11g > JDBC ojdbc6.jar >Reporter: Harleen Singh Mann >Priority: Minor > Attachments: Oracle KB Document 1266785.pdf > > > Currently, on JDBC read Spark obtains the schema of a table using > {color:#654982} resultSet.getMetaData.getColumnType{color} > This works 99.99% of the time, except when a column of Number type is added > on an Oracle table using the alter statement. This is essentially an Oracle > DB + JDBC bug that has been documented on Oracle KB, and patches exist.
> [oracle > KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html] > {color:#ff}As a result of the above-mentioned issue, Spark receives a > size of 0 for the field and defaults the field type to be BigDecimal(30,10) > instead of what it actually should be. This is done in OracleDialect.scala. > This may cause issues in the downstream application where relevant > information may be missed due to the changed precision and scale.{color} > _The versions that are affected are:_ > _JDBC - Version: 11.2.0.1 and later [Release: 11.2 and later ]_ > _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_ > _[Release: 11.1 to 11.2]_ > +Proposed approach:+ > There is another way of fetching the schema information in Oracle: through > the all_tab_columns table. If we use this table to fetch the > precision and scale of the Number type, the above issue is mitigated. > > {color:#14892c}{color:#f6c342}I can implement the changes, but require some > inputs on the approach from the gatekeepers here{color}.{color} > {color:#14892c}PS. This is also my first Jira issue and my first fork for > Spark, so I will need some guidance along the way. (yes, I am a newbie to > this) Thanks...{color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
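A hedged sketch of the proposed mitigation: trust the JDBC-reported precision when it is non-zero, and otherwise look the column up in `all_tab_columns` instead of silently defaulting to `BigDecimal(30,10)`. The `query_fn` callable is a hypothetical stand-in for a real JDBC call; none of these names are Spark's API.

```python
# Sketch only: resolve (precision, scale) for an Oracle NUMBER column.
# Assumption: query_fn(sql, binds) executes the query and returns one row
# as a (data_precision, data_scale) tuple, or None if no row matched.

DEFAULT_PRECISION, DEFAULT_SCALE = 30, 10  # Spark's current fallback

def resolve_number_type(table, column, jdbc_precision, jdbc_scale, query_fn):
    if jdbc_precision > 0:
        # JDBC metadata is trustworthy; use it as-is.
        return jdbc_precision, jdbc_scale
    # Size 0 reported: consult the Oracle data dictionary instead.
    row = query_fn(
        "SELECT data_precision, data_scale FROM all_tab_columns "
        "WHERE table_name = :t AND column_name = :c",
        {"t": table.upper(), "c": column.upper()},
    )
    if row and row[0] is not None:
        return row[0], row[1] or 0
    # Last resort: keep the existing default behavior.
    return DEFAULT_PRECISION, DEFAULT_SCALE
```

As suggested in the comment above, such a dictionary lookup would run once on the driver, with its result broadcast to the executors.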
[jira] [Commented] (SPARK-10697) Lift Calculation in Association Rule mining
[ https://issues.apache.org/jira/browse/SPARK-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359639#comment-16359639 ] Sean Owen commented on SPARK-10697: --- Yes, I think it's OK to add. Go ahead and propose a PR. Lift is confidence, normalized for the prior probability of observing the antecedent at all. Yes it is the right tool when evaluating rules vs each other for interest. It's a likelihood ratio. Confidence is of interest when you know you have the antecedent (e.g. already added those items to a basket) and want to know about consequents. There the prior probability would be irrelevant. You can compute lift from confidence but it's extra work and so does make some sense to compute this along the way. > Lift Calculation in Association Rule mining > --- > > Key: SPARK-10697 > URL: https://issues.apache.org/jira/browse/SPARK-10697 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Yashwanth Kumar >Priority: Minor > > Lift is to be calculated for Association rule mining in > AssociationRules.scala under FPM. > Lift is a measure of the performance of a Association rules. > Adding lift will help to compare the model efficiency. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
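The relationship Sean describes, lift as confidence normalized by the prior probability of the consequent, reduces to a one-line formula. A small sketch, with supports expressed as fractions of all baskets:

```python
def confidence(support_ab, support_a):
    """P(B|A): the fraction of baskets containing the antecedent A that
    also contain the consequent B."""
    return support_ab / support_a

def lift(support_ab, support_a, support_b):
    """Confidence normalized by the prior P(B). lift > 1 means A and B
    co-occur more often than they would if independent; lift = 1 means
    the rule carries no information beyond the base rate of B."""
    return confidence(support_ab, support_a) / support_b

# Example: A appears in 40% of baskets, B in 25%, A and B together in 20%.
# confidence(A => B) = 0.20 / 0.40 = 0.5
# lift(A => B)       = 0.5 / 0.25  = 2.0
```

This also shows why lift is cheap to add alongside the existing confidence computation: it only needs the consequent's support as one extra input.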
[jira] [Created] (SPARK-23384) When it has no incomplete(completed) applications found, the last updated time is not formatted and client local time zone is not show in history server web ui.
guoxiaolongzte created SPARK-23384: -- Summary: When it has no incomplete(completed) applications found, the last updated time is not formatted and client local time zone is not show in history server web ui. Key: SPARK-23384 URL: https://issues.apache.org/jira/browse/SPARK-23384 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.4.0 Reporter: guoxiaolongzte When it has no incomplete(completed) applications found, the last updated time is not formatted and client local time zone is not show in history server web ui. It is a bug. fix before: fix after: -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23384) When it has no incomplete(completed) applications found, the last updated time is not formatted and client local time zone is not show in history server web ui.
[ https://issues.apache.org/jira/browse/SPARK-23384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolongzte updated SPARK-23384: --- Attachment: 2.png 1.png > When it has no incomplete(completed) applications found, the last updated > time is not formatted and client local time zone is not show in history > server web ui. > > > Key: SPARK-23384 > URL: https://issues.apache.org/jira/browse/SPARK-23384 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Priority: Minor > Attachments: 1.png, 2.png > > > When it has no incomplete(completed) applications found, the last updated > time is not formatted and client local time zone is not show in history > server web ui. It is a bug. > fix before: > > fix after: > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23384) When it has no incomplete(completed) applications found, the last updated time is not formatted and client local time zone is not show in history server web ui.
[ https://issues.apache.org/jira/browse/SPARK-23384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolongzte updated SPARK-23384: --- Description: When it has no incomplete(completed) applications found, the last updated time is not formatted and client local time zone is not show in history server web ui. It is a bug. fix before: !1.png! fix after: !2.png! was: When it has no incomplete(completed) applications found, the last updated time is not formatted and client local time zone is not show in history server web ui. It is a bug. fix before: fix after: > When it has no incomplete(completed) applications found, the last updated > time is not formatted and client local time zone is not show in history > server web ui. > > > Key: SPARK-23384 > URL: https://issues.apache.org/jira/browse/SPARK-23384 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Priority: Minor > Attachments: 1.png, 2.png > > > When it has no incomplete(completed) applications found, the last updated > time is not formatted and client local time zone is not show in history > server web ui. It is a bug. > fix before: !1.png! > fix after: > !2.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23384) When it has no incomplete(completed) applications found, the last updated time is not formatted and client local time zone is not show in history server web ui.
[ https://issues.apache.org/jira/browse/SPARK-23384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23384: Assignee: Apache Spark > When it has no incomplete(completed) applications found, the last updated > time is not formatted and client local time zone is not show in history > server web ui. > > > Key: SPARK-23384 > URL: https://issues.apache.org/jira/browse/SPARK-23384 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Assignee: Apache Spark >Priority: Minor > Attachments: 1.png, 2.png > > > When it has no incomplete(completed) applications found, the last updated > time is not formatted and client local time zone is not show in history > server web ui. It is a bug. > fix before: !1.png! > fix after: > !2.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23384) When it has no incomplete(completed) applications found, the last updated time is not formatted and client local time zone is not show in history server web ui.
[ https://issues.apache.org/jira/browse/SPARK-23384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359820#comment-16359820 ] Apache Spark commented on SPARK-23384: -- User 'guoxiaolongzte' has created a pull request for this issue: https://github.com/apache/spark/pull/20573 > When it has no incomplete(completed) applications found, the last updated > time is not formatted and client local time zone is not show in history > server web ui. > > > Key: SPARK-23384 > URL: https://issues.apache.org/jira/browse/SPARK-23384 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Priority: Minor > Attachments: 1.png, 2.png > > > When it has no incomplete(completed) applications found, the last updated > time is not formatted and client local time zone is not show in history > server web ui. It is a bug. > fix before: !1.png! > fix after: > !2.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23384) When it has no incomplete(completed) applications found, the last updated time is not formatted and client local time zone is not show in history server web ui.
[ https://issues.apache.org/jira/browse/SPARK-23384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23384: Assignee: (was: Apache Spark) > When it has no incomplete(completed) applications found, the last updated > time is not formatted and client local time zone is not show in history > server web ui. > > > Key: SPARK-23384 > URL: https://issues.apache.org/jira/browse/SPARK-23384 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Priority: Minor > Attachments: 1.png, 2.png > > > When it has no incomplete(completed) applications found, the last updated > time is not formatted and client local time zone is not show in history > server web ui. It is a bug. > fix before: !1.png! > fix after: > !2.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23380) Make toPandas fall back to Arrow optimization disabled when schema is mismatched
Hyukjin Kwon created SPARK-23380: Summary: Make toPandas fall back to Arrow optimization disabled when schema is mismatched Key: SPARK-23380 URL: https://issues.apache.org/jira/browse/SPARK-23380 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 2.3.0 Reporter: Hyukjin Kwon Seems we can check the schema ahead and fall back in toPandas. Please see this case below: {code} df = spark.createDataFrame([[{'a': 1}]]) spark.conf.set("spark.sql.execution.arrow.enabled", "false") df.toPandas() spark.conf.set("spark.sql.execution.arrow.enabled", "true") df.toPandas() {code} {code} ... py4j.protocol.Py4JJavaError: An error occurred while calling o42.collectAsArrowToPython. ... java.lang.UnsupportedOperationException: Unsupported data type: map{code} In case of {{createDataFrame}}, we fall back to make this at least working even though the optimisation is disabled. {code} df = spark.createDataFrame([[{'a': 1}]]) spark.conf.set("spark.sql.execution.arrow.enabled", "false") pdf = df.toPandas() spark.createDataFrame(pdf).show() spark.conf.set("spark.sql.execution.arrow.enabled", "true") spark.createDataFrame(pdf).show() {code} {code} ... ... UserWarning: Arrow will not be used in createDataFrame: Error inferring Arrow type ... ++ | _1| ++ |[a -> 1]| ++ {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
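The proposed pre-check can be sketched in plain Python. This is a simplified model, not PySpark's internals: schemas are `(name, type_name)` pairs, the two code paths are injected as callables, and the set of Arrow-unsupported type names is an assumption for illustration (the report above shows the map type is one of them).

```python
# Sketch: inspect the schema before attempting the Arrow path, and fall
# back to the plain collect-based conversion instead of raising.

UNSUPPORTED_FOR_ARROW = {"map"}  # assumption: type names Arrow rejects

def arrow_supported(schema):
    """schema: list of (field_name, type_name) pairs."""
    return all(t not in UNSUPPORTED_FOR_ARROW for _, t in schema)

def to_pandas(schema, arrow_enabled, arrow_path, fallback_path):
    """Use the Arrow path only when it is enabled AND the schema passes the
    pre-check; otherwise take the non-Arrow path (optionally warning the
    user), mirroring what createDataFrame already does."""
    if arrow_enabled and arrow_supported(schema):
        return arrow_path()
    return fallback_path()
```

Under this sketch, the map-typed DataFrame in the reproduction would quietly take the fallback path instead of surfacing a `Py4JJavaError` to the user.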
[jira] [Assigned] (SPARK-23380) Make toPandas fall back to Arrow optimization disabled when schema is not supported in the Arrow optimization
[ https://issues.apache.org/jira/browse/SPARK-23380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23380: Assignee: Apache Spark > Make toPandas fall back to Arrow optimization disabled when schema is not > supported in the Arrow optimization > -- > > Key: SPARK-23380 > URL: https://issues.apache.org/jira/browse/SPARK-23380 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Seems we can check the schema ahead and fall back in toPandas. > Please see this case below: > {code} > df = spark.createDataFrame([[{'a': 1}]]) > spark.conf.set("spark.sql.execution.arrow.enabled", "false") > df.toPandas() > spark.conf.set("spark.sql.execution.arrow.enabled", "true") > df.toPandas() > {code} > {code} > ... > py4j.protocol.Py4JJavaError: An error occurred while calling > o42.collectAsArrowToPython. > ... > java.lang.UnsupportedOperationException: Unsupported data type: > map> {code} > In case of {{createDataFrame}}, we fall back to make this at least working > even though the optimisation is disabled. > {code} > df = spark.createDataFrame([[{'a': 1}]]) > spark.conf.set("spark.sql.execution.arrow.enabled", "false") > pdf = df.toPandas() > spark.createDataFrame(pdf).show() > spark.conf.set("spark.sql.execution.arrow.enabled", "true") > spark.createDataFrame(pdf).show() > {code} > {code} > ... > ... UserWarning: Arrow will not be used in createDataFrame: Error inferring > Arrow type ... > ++ > | _1| > ++ > |[a -> 1]| > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23380) Make toPandas fall back to Arrow optimization disabled when schema is not supported in the Arrow optimization
[ https://issues.apache.org/jira/browse/SPARK-23380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23380: - Summary: Make toPandas fall back to Arrow optimization disabled when schema is not supported in the Arrow optimization (was: Make toPandas fall back to Arrow optimization disabled when schema is mismatched) > Make toPandas fall back to Arrow optimization disabled when schema is not > supported in the Arrow optimization > -- > > Key: SPARK-23380 > URL: https://issues.apache.org/jira/browse/SPARK-23380 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Seems we can check the schema ahead and fall back in toPandas. > Please see this case below: > {code} > df = spark.createDataFrame([[{'a': 1}]]) > spark.conf.set("spark.sql.execution.arrow.enabled", "false") > df.toPandas() > spark.conf.set("spark.sql.execution.arrow.enabled", "true") > df.toPandas() > {code} > {code} > ... > py4j.protocol.Py4JJavaError: An error occurred while calling > o42.collectAsArrowToPython. > ... > java.lang.UnsupportedOperationException: Unsupported data type: > map> {code} > In case of {{createDataFrame}}, we fall back to make this at least working > even though the optimisation is disabled. > {code} > df = spark.createDataFrame([[{'a': 1}]]) > spark.conf.set("spark.sql.execution.arrow.enabled", "false") > pdf = df.toPandas() > spark.createDataFrame(pdf).show() > spark.conf.set("spark.sql.execution.arrow.enabled", "true") > spark.createDataFrame(pdf).show() > {code} > {code} > ... > ... UserWarning: Arrow will not be used in createDataFrame: Error inferring > Arrow type ... > ++ > | _1| > ++ > |[a -> 1]| > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23377) Bucketizer with multiple columns persistence bug
[ https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23377: Assignee: Apache Spark > Bucketizer with multiple columns persistence bug > > > Key: SPARK-23377 > URL: https://issues.apache.org/jira/browse/SPARK-23377 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Assignee: Apache Spark >Priority: Major > > A Bucketizer with multiple input/output columns get "inputCol" set to the > default value on write -> read which causes it to throw an error on > transform. Here's an example. > {code:java} > import org.apache.spark.ml.feature._ > val splits = Array(Double.NegativeInfinity, 0, 10, 100, > Double.PositiveInfinity) > val bucketizer = new Bucketizer() > .setSplitsArray(Array(splits, splits)) > .setInputCols(Array("foo1", "foo2")) > .setOutputCols(Array("bar1", "bar2")) > val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2") > bucketizer.transform(data) > val path = "/temp/bucketrizer-persist-test" > bucketizer.write.overwrite.save(path) > val bucketizerAfterRead = Bucketizer.read.load(path) > println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol)) > // This line throws an error because "outputCol" is set > bucketizerAfterRead.transform(data) > {code} > And the trace: > {code:java} > java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has > the inputCols Param set for multi-column transform. The following Params are > not applicable and should not be set: outputCol. 
> at > org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300) > at > org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314) > at > org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189) > at > org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141) > at > line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23377) Bucketizer with multiple columns persistence bug
[ https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359340#comment-16359340 ] Apache Spark commented on SPARK-23377: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/20566 > Bucketizer with multiple columns persistence bug > > > Key: SPARK-23377 > URL: https://issues.apache.org/jira/browse/SPARK-23377 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Priority: Major > > A Bucketizer with multiple input/output columns get "inputCol" set to the > default value on write -> read which causes it to throw an error on > transform. Here's an example. > {code:java} > import org.apache.spark.ml.feature._ > val splits = Array(Double.NegativeInfinity, 0, 10, 100, > Double.PositiveInfinity) > val bucketizer = new Bucketizer() > .setSplitsArray(Array(splits, splits)) > .setInputCols(Array("foo1", "foo2")) > .setOutputCols(Array("bar1", "bar2")) > val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2") > bucketizer.transform(data) > val path = "/temp/bucketrizer-persist-test" > bucketizer.write.overwrite.save(path) > val bucketizerAfterRead = Bucketizer.read.load(path) > println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol)) > // This line throws an error because "outputCol" is set > bucketizerAfterRead.transform(data) > {code} > And the trace: > {code:java} > java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has > the inputCols Param set for multi-column transform. The following Params are > not applicable and should not be set: outputCol. 
> at > org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300) > at > org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314) > at > org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189) > at > org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141) > at > line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23377) Bucketizer with multiple columns persistence bug
[ https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23377: Assignee: (was: Apache Spark) > Bucketizer with multiple columns persistence bug > > > Key: SPARK-23377 > URL: https://issues.apache.org/jira/browse/SPARK-23377 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Priority: Major > > A Bucketizer with multiple input/output columns get "inputCol" set to the > default value on write -> read which causes it to throw an error on > transform. Here's an example. > {code:java} > import org.apache.spark.ml.feature._ > val splits = Array(Double.NegativeInfinity, 0, 10, 100, > Double.PositiveInfinity) > val bucketizer = new Bucketizer() > .setSplitsArray(Array(splits, splits)) > .setInputCols(Array("foo1", "foo2")) > .setOutputCols(Array("bar1", "bar2")) > val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2") > bucketizer.transform(data) > val path = "/temp/bucketrizer-persist-test" > bucketizer.write.overwrite.save(path) > val bucketizerAfterRead = Bucketizer.read.load(path) > println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol)) > // This line throws an error because "outputCol" is set > bucketizerAfterRead.transform(data) > {code} > And the trace: > {code:java} > java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has > the inputCols Param set for multi-column transform. The following Params are > not applicable and should not be set: outputCol. 
> at > org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300) > at > org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314) > at > org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189) > at > org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141) > at > line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-6079631:17) > {code}
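The check that throws here, {{checkSingleVsMultiColumnParams}}, rejects any single-column Param that is set alongside its multi-column counterpart; the persistence bug trips it because {{outputCol}} comes back set to a default after the write/read round trip. A minimal plain-Python sketch of that validation rule (function and param names are illustrative, not Spark's implementation):

```python
def check_single_vs_multi_column_params(set_params):
    """Reject single-column params when multi-column params are in use.

    `set_params` is the set of param names considered "set" -- including,
    in the buggy case, defaults restored on read."""
    multi = {"inputCols", "outputCols", "splitsArray"} & set_params
    single = {"inputCol", "outputCol", "splits"} & set_params
    if multi and single:
        raise ValueError(
            "Params set for multi-column transform; the following are "
            "not applicable and should not be set: %s" % sorted(single))

# A correctly persisted multi-column Bucketizer passes the check:
check_single_vs_multi_column_params({"inputCols", "outputCols", "splitsArray"})

# After the buggy write -> read, "outputCol" reappears and the check fails:
try:
    check_single_vs_multi_column_params(
        {"inputCols", "outputCols", "splitsArray", "outputCol"})
    raised = False
except ValueError:
    raised = True
assert raised
```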
[jira] [Assigned] (SPARK-23380) Make toPandas fall back to Arrow optimization disabled when schema is not supported in the Arrow optimization
[ https://issues.apache.org/jira/browse/SPARK-23380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23380: Assignee: (was: Apache Spark) > Make toPandas fall back to Arrow optimization disabled when schema is not > supported in the Arrow optimization > -- > > Key: SPARK-23380 > URL: https://issues.apache.org/jira/browse/SPARK-23380 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Seems we can check the schema ahead and fall back in toPandas. > Please see this case below: > {code} > df = spark.createDataFrame([[{'a': 1}]]) > spark.conf.set("spark.sql.execution.arrow.enabled", "false") > df.toPandas() > spark.conf.set("spark.sql.execution.arrow.enabled", "true") > df.toPandas() > {code} > {code} > ... > py4j.protocol.Py4JJavaError: An error occurred while calling > o42.collectAsArrowToPython. > ... > java.lang.UnsupportedOperationException: Unsupported data type: > map> {code} > In case of {{createDataFrame}}, we fall back to make this at least working > even though the optimisation is disabled. > {code} > df = spark.createDataFrame([[{'a': 1}]]) > spark.conf.set("spark.sql.execution.arrow.enabled", "false") > pdf = df.toPandas() > spark.createDataFrame(pdf).show() > spark.conf.set("spark.sql.execution.arrow.enabled", "true") > spark.createDataFrame(pdf).show() > {code} > {code} > ... > ... UserWarning: Arrow will not be used in createDataFrame: Error inferring > Arrow type ... > ++ > | _1| > ++ > |[a -> 1]| > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23380) Make toPandas fall back to Arrow optimization disabled when schema is not supported in the Arrow optimization
[ https://issues.apache.org/jira/browse/SPARK-23380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359349#comment-16359349 ] Apache Spark commented on SPARK-23380: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/20567 > Make toPandas fall back to Arrow optimization disabled when schema is not > supported in the Arrow optimization > -- > > Key: SPARK-23380 > URL: https://issues.apache.org/jira/browse/SPARK-23380 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Seems we can check the schema ahead and fall back in toPandas. > Please see this case below: > {code} > df = spark.createDataFrame([[{'a': 1}]]) > spark.conf.set("spark.sql.execution.arrow.enabled", "false") > df.toPandas() > spark.conf.set("spark.sql.execution.arrow.enabled", "true") > df.toPandas() > {code} > {code} > ... > py4j.protocol.Py4JJavaError: An error occurred while calling > o42.collectAsArrowToPython. > ... > java.lang.UnsupportedOperationException: Unsupported data type: > map> {code} > In case of {{createDataFrame}}, we fall back to make this at least working > even though the optimisation is disabled. > {code} > df = spark.createDataFrame([[{'a': 1}]]) > spark.conf.set("spark.sql.execution.arrow.enabled", "false") > pdf = df.toPandas() > spark.createDataFrame(pdf).show() > spark.conf.set("spark.sql.execution.arrow.enabled", "true") > spark.createDataFrame(pdf).show() > {code} > {code} > ... > ... UserWarning: Arrow will not be used in createDataFrame: Error inferring > Arrow type ... > ++ > | _1| > ++ > |[a -> 1]| > ++ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
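The fallback the issue proposes can be sketched as a schema pre-check: use the Arrow path only when every column type is convertible, otherwise drop to the plain path instead of raising mid-conversion. Plain Python (the type names and the supported set are illustrative, not the actual Arrow type-support list):

```python
# Types the Arrow path handles in this sketch; the map type is the case
# that fails in the report above.
ARROW_SUPPORTED = {"int", "bigint", "double", "string", "timestamp"}

def to_pandas(schema_types, arrow_path, fallback_path):
    """Use the Arrow conversion only if every column type is supported;
    otherwise fall back up front instead of failing partway through."""
    if all(t in ARROW_SUPPORTED for t in schema_types):
        return arrow_path()
    return fallback_path()

result = to_pandas(["bigint", "map<string,bigint>"],
                   arrow_path=lambda: "arrow",
                   fallback_path=lambda: "non-arrow")
assert result == "non-arrow"
```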
[jira] [Commented] (SPARK-23310) Perf regression introduced by SPARK-21113
[ https://issues.apache.org/jira/browse/SPARK-23310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359429#comment-16359429 ] Kazuaki Ishizaki commented on SPARK-23310: -- got it, thanks > Perf regression introduced by SPARK-21113 > - > > Key: SPARK-23310 > URL: https://issues.apache.org/jira/browse/SPARK-23310 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Yin Huai >Assignee: Sital Kedia >Priority: Blocker > Fix For: 2.3.0 > > > While running all TPC-DS queries with SF set to 1000, we noticed that Q95 > (https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q95.sql) > has noticeable regression (11%). After looking into it, we found that the > regression was introduced by SPARK-21113. Specially, ReadAheadInputStream > gets lock congestion. After setting > spark.unsafe.sorter.spill.read.ahead.enabled set to false, the regression > disappear and the overall performance of all TPC-DS queries has improved. > > I am proposing that we set spark.unsafe.sorter.spill.read.ahead.enabled to > false by default for Spark 2.3 and re-enable it after addressing the lock > congestion issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23381) Murmur3 hash generates a different value from other implementations
[ https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shintaro Murakami updated SPARK-23381: -- Summary: Murmur3 hash generates a different value from other implementations (was: Murmur3 hash generates a different value ) > Murmur3 hash generates a different value from other implementations > --- > > Key: SPARK-23381 > URL: https://issues.apache.org/jira/browse/SPARK-23381 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Shintaro Murakami >Priority: Major > > Murmur3 hash generates a different value from the original and other > implementations (like Scala standard library and Guava or so) when the length > of a bytes array is not multiple of 4. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23381) Murmur3 hash generates a different value
Shintaro Murakami created SPARK-23381: - Summary: Murmur3 hash generates a different value Key: SPARK-23381 URL: https://issues.apache.org/jira/browse/SPARK-23381 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.1 Reporter: Shintaro Murakami Murmur3 hash generates a different value from the original and other implementations (like the Scala standard library and Guava) when the length of a byte array is not a multiple of 4.
[jira] [Assigned] (SPARK-23381) Murmur3 hash generates a different value from other implementations
[ https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23381: Assignee: (was: Apache Spark) > Murmur3 hash generates a different value from other implementations > --- > > Key: SPARK-23381 > URL: https://issues.apache.org/jira/browse/SPARK-23381 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Shintaro Murakami >Priority: Major > > Murmur3 hash generates a different value from the original and other > implementations (like Scala standard library and Guava or so) when the length > of a bytes array is not multiple of 4. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23381) Murmur3 hash generates a different value from other implementations
[ https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23381: Assignee: Apache Spark > Murmur3 hash generates a different value from other implementations > --- > > Key: SPARK-23381 > URL: https://issues.apache.org/jira/browse/SPARK-23381 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Shintaro Murakami >Assignee: Apache Spark >Priority: Major > > Murmur3 hash generates a different value from the original and other > implementations (like Scala standard library and Guava or so) when the length > of a bytes array is not multiple of 4. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23381) Murmur3 hash generates a different value from other implementations
[ https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359445#comment-16359445 ] Apache Spark commented on SPARK-23381: -- User 'mrkm4ntr' has created a pull request for this issue: https://github.com/apache/spark/pull/20568 > Murmur3 hash generates a different value from other implementations > --- > > Key: SPARK-23381 > URL: https://issues.apache.org/jira/browse/SPARK-23381 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Shintaro Murakami >Priority: Major > > Murmur3 hash generates a different value from the original and other > implementations (like Scala standard library and Guava or so) when the length > of a bytes array is not multiple of 4. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
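A common way Murmur3 ports diverge on lengths that are not a multiple of 4 is the handling of the 1-3 trailing "tail" bytes: Java bytes are signed, so an unmasked byte-to-int promotion sign-extends values >= 0x80. The sketch below is a reference Murmur3 x86 32-bit with an optional sign-extended-tail variant to illustrate the effect; it is not Spark's actual code, just a demonstration of how tail handling alone changes the hash:

```python
def murmur3_32(data: bytes, seed: int = 0, signed_tail: bool = False) -> int:
    """Murmur3 x86 32-bit. With signed_tail=True, tail bytes are
    sign-extended before mixing (as an unmasked Java byte-to-int would be)."""
    c1, c2, M = 0xCC9E2D51, 0x1B873593, 0xFFFFFFFF

    def rotl(x, r):
        return ((x << r) | (x >> (32 - r))) & M

    h = seed & M
    nblocks = len(data) // 4
    # Body: full 4-byte little-endian blocks.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (rotl((k * c1) & M, 15) * c2) & M
        h = (rotl(h ^ k, 13) * 5 + 0xE6546B64) & M
    # Tail: 1-3 leftover bytes, xor-ed into one word.
    k = 0
    for i, b in enumerate(data[4 * nblocks:]):
        if signed_tail and b >= 0x80:
            b -= 256                      # sign-extend like a Java byte
        k ^= (b << (8 * i)) & M
    if k:
        h ^= (rotl((k * c1) & M, 15) * c2) & M
    # Finalization (fmix32).
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & M
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & M
    return h ^ (h >> 16)

# Tail bytes < 0x80 are unaffected; bytes >= 0x80 change the result.
assert murmur3_32(b"\x01") == murmur3_32(b"\x01", signed_tail=True)
assert murmur3_32(b"\x80") != murmur3_32(b"\x80", signed_tail=True)
```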
[jira] [Commented] (SPARK-23240) PythonWorkerFactory issues unhelpful message when pyspark.daemon produces bogus stdout
[ https://issues.apache.org/jira/browse/SPARK-23240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359568#comment-16359568 ] Bruce Robbins commented on SPARK-23240: --- A little background. A Spark installation had a Python sitecustomize.py like this: {code:java} try: import flotilla except ImportError as e: print e{code} (flotilla is not the real python module, I just use that as an example). Because flotilla was not installed on the user's cluster, the first output in daemon's stdout was: {noformat} No module named flotilla{noformat} In fact, this is what I get when I run pyspark.daemon with this sitecustomize.py installed: {noformat} bash-3.2$ python -m pyspark.daemon python -m pyspark.daemon No module named flotilla ^@^@\325{noformat} Therefore, PythonWorkerFactory.startDaemon reads 'No m', or 0x4e6f206d or 1315905645, as the port number. Here's what happens when I run a pyspark action with the above sitecustomize.py installed: {noformat} >>> text_file = sc.textFile("/Users/bruce/ncdc_gsod").count() odule named flotilla 18/02/10 09:44:27 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1) java.lang.IllegalArgumentException: port out of range:1315905645 at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143) at java.net.InetSocketAddress.(InetSocketAddress.java:188) at java.net.Socket.(Socket.java:244){noformat} > PythonWorkerFactory issues unhelpful message when pyspark.daemon produces > bogus stdout > -- > > Key: SPARK-23240 > URL: https://issues.apache.org/jira/browse/SPARK-23240 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 >Reporter: Bruce Robbins >Priority: Minor > > Environmental issues or site-local customizations (i.e., sitecustomize.py > present in the python install directory) can interfere with daemon.py’s > output to stdout. PythonWorkerFactory produces unhelpful messages when this > happens, causing some head scratching before the actual issue is determined. 
> Case #1: Extraneous data in pyspark.daemon’s stdout. In this case, > PythonWorkerFactory uses the output as the daemon’s port number and ends up > throwing an exception when creating the socket: > {noformat} > java.lang.IllegalArgumentException: port out of range:1819239265 > at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143) > at java.net.InetSocketAddress.<init>(InetSocketAddress.java:188) > at java.net.Socket.<init>(Socket.java:244) > at > org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:78) > {noformat} > Case #2: No data in pyspark.daemon’s stdout. In this case, > PythonWorkerFactory throws an EOFException reading from the > Process input stream. > The second case is somewhat less mysterious than the first, because > PythonWorkerFactory also displays the stderr from the python process. > When there is unexpected or missing output in pyspark.daemon’s stdout, > PythonWorkerFactory should say so.
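The bogus port numbers in these traces are just leading stdout bytes read as a big-endian int. For the sitecustomize example above, the first four bytes of "No module named flotilla" decode to exactly the reported port (plain Python, no Spark needed):

```python
# PythonWorkerFactory reads a 4-byte big-endian int from the daemon's
# stdout as its port. If stdout starts with arbitrary text, those four
# bytes become the "port" number.
leading = b"No module named flotilla"[:4]   # b'No m'
port = int.from_bytes(leading, "big")
assert port == 0x4E6F206D == 1315905645     # the value in the trace above
```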
[jira] [Commented] (SPARK-22105) Dataframe has poor performance when computing on many columns with codegen
[ https://issues.apache.org/jira/browse/SPARK-22105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359420#comment-16359420 ] Marco Gaido commented on SPARK-22105: - [~WeichenXu123] which is the number of rows for the dataset you tested? Maybe the time for generating/compiling the code can be a significant overhead if we have few data > Dataframe has poor performance when computing on many columns with codegen > -- > > Key: SPARK-22105 > URL: https://issues.apache.org/jira/browse/SPARK-22105 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Priority: Minor > > Suppose we have a dataframe with many columns (e.g 100 columns), each column > is DoubleType. > And we need to compute avg on each column. We will find using dataframe avg > will be much slower than using RDD.aggregate. > I observe this issue from this PR: (One pass imputer) > https://github.com/apache/spark/pull/18902 > I also write a minimal testing code to reproduce this issue, I use computing > sum to reproduce this issue: > https://github.com/apache/spark/compare/master...WeichenXu123:aggr_test2?expand=1 > When we compute `sum` on 100 `DoubleType` columns, dataframe avg will be > about 3x slower than `RDD.aggregate`, but if we only compute one column, > dataframe avg will be much faster than `RDD.aggregate`. > The reason of this issue, should be the defact in dataframe codegen. Codegen > will inline everything and generate large code block. When the column number > is large (e.g 100 columns), the codegen size will be too large, which cause > jvm failed to JIT and fall back to byte code interpretation. > This PR should address this issue: > https://github.com/apache/spark/pull/19082 > But we need more performance test against some code in ML after above PR > merged, to check whether this issue is actually fixed. > This JIRA used to track this performance issue. 
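For context, the comparison in the issue is DataFrame {{avg}} (one generated expression per column, which blows past JIT limits at ~100 columns) versus a single pass with {{RDD.aggregate}} folding all columns into one accumulator. A plain-Python illustration of the single-pass pattern (no Spark; the data is made up):

```python
# Toy dataset: 10 rows x 100 double columns.
rows = [[float(i + j) for j in range(100)] for i in range(10)]

# Single pass over the data: fold each row into one accumulator of
# per-column partial sums (the RDD.aggregate style the imputer PR uses).
sums = [0.0] * 100
for row in rows:
    for c, v in enumerate(row):
        sums[c] += v

# Same result as 100 independent per-column passes.
assert sums == [sum(row[c] for row in rows) for c in range(100)]
```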
[jira] [Resolved] (SPARK-23374) Checkstyle/Scalastyle only work from top level build
[ https://issues.apache.org/jira/browse/SPARK-23374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23374. --- Resolution: Not A Problem Agree, that's the right way to invoke a target on one module in a multi-module project > Checkstyle/Scalastyle only work from top level build > > > Key: SPARK-23374 > URL: https://issues.apache.org/jira/browse/SPARK-23374 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.1 >Reporter: Rob Vesse >Priority: Trivial > > The current Maven plugin definitions for Checkstyle/Scalastyle use fixed XML > configs for the style rule locations that are only valid relative to the top > level POM. Therefore if you try and do a {{mvn verify}} in an individual > module you get the following error: > {noformat} > [ERROR] Failed to execute goal > org.scalastyle:scalastyle-maven-plugin:1.0.0:check (default) on project > spark-mesos_2.11: Failed during scalastyle execution: Unable to find > configuration file at location scalastyle-config.xml > {noformat} > As the paths are hardcoded in XML and don't use Maven properties you can't > override these settings so you can't style check a single module which makes > doing style checking require a full project {{mvn verify}} which is not ideal. > By introducing Maven properties for these two paths it would become possible > to run checks on a single module like so: > {noformat} > mvn verify -Dscalastyle.location=../scalastyle-config.xml > {noformat} > Obviously the override would need to vary depending on the specific module > you are trying to run it against but this would be a relatively simply change > that would streamline dev workflows -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
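The change the reporter sketches would amount to something like this in the parent POM (an illustrative fragment, not the actual Spark build files; plugin version and surrounding configuration elided):

```xml
<properties>
  <!-- Overridable from a submodule via -Dscalastyle.location=... -->
  <scalastyle.location>scalastyle-config.xml</scalastyle.location>
</properties>
...
<plugin>
  <groupId>org.scalastyle</groupId>
  <artifactId>scalastyle-maven-plugin</artifactId>
  <configuration>
    <configLocation>${scalastyle.location}</configLocation>
  </configuration>
</plugin>
```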
[jira] [Commented] (SPARK-23381) Murmur3 hash generates a different value from other implementations
[ https://issues.apache.org/jira/browse/SPARK-23381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359501#comment-16359501 ] Sean Owen commented on SPARK-23381: --- ... what problem does this cause? > Murmur3 hash generates a different value from other implementations > --- > > Key: SPARK-23381 > URL: https://issues.apache.org/jira/browse/SPARK-23381 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Shintaro Murakami >Priority: Major > > Murmur3 hash generates a different value from the original and other > implementations (like Scala standard library and Guava or so) when the length > of a bytes array is not multiple of 4. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23344) Add KMeans distanceMeasure param to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23344. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20520 [https://github.com/apache/spark/pull/20520] > Add KMeans distanceMeasure param to PySpark > --- > > Key: SPARK-23344 > URL: https://issues.apache.org/jira/browse/SPARK-23344 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Minor > Fix For: 2.4.0 > > > SPARK-22119 introduced a new parameter for KMeans, ie. {{distanceMeasure}}. > We should add it also to the Python interface. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23344) Add KMeans distanceMeasure param to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-23344: - Assignee: Marco Gaido > Add KMeans distanceMeasure param to PySpark > --- > > Key: SPARK-23344 > URL: https://issues.apache.org/jira/browse/SPARK-23344 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Minor > Fix For: 2.4.0 > > > SPARK-22119 introduced a new parameter for KMeans, ie. {{distanceMeasure}}. > We should add it also to the Python interface. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
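For context on what the new parameter selects between: SPARK-22119's {{distanceMeasure}} chooses Euclidean or cosine distance for KMeans. A plain-Python illustration of the two measures (not the Spark ML implementation):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

# Parallel vectors of different magnitude: far apart in Euclidean terms,
# identical in cosine terms -- which is why the choice matters for KMeans.
a, b = [1.0, 0.0], [10.0, 0.0]
assert euclidean(a, b) == 9.0
assert abs(cosine_distance(a, b)) < 1e-12
```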
[jira] [Resolved] (SPARK-23360) SparkSession.createDataFrame timestamps can be incorrect with non-Arrow codepath
[ https://issues.apache.org/jira/browse/SPARK-23360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23360. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20559 [https://github.com/apache/spark/pull/20559] > SparkSession.createDataFrame timestamps can be incorrect with non-Arrow > codepath > > > Key: SPARK-23360 > URL: https://issues.apache.org/jira/browse/SPARK-23360 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Jin >Priority: Major > Fix For: 2.3.0 > > > {code:java} > import datetime > import pandas as pd > import os > dt = [datetime.datetime(2015, 10, 31, 22, 30)] > pdf = pd.DataFrame({'time': dt}) > os.environ['TZ'] = 'America/New_York' > df1 = spark.createDataFrame(pdf) > df1.show() > +---+ > | time| > +---+ > |2015-10-31 21:30:00| > +---+ > {code} > Seems to related to this line here: > [https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1776] > It appears to be an issue with "tzlocal()" > Wrong: > {code:java} > from_tz = "America/New_York" > to_tz = "tzlocal()" > s.apply(lambda ts: > ts.tz_localize(from_tz,ambiguous=False).tz_convert(to_tz).tz_localize(None) > if ts is not pd.NaT else pd.NaT) > 0 2015-10-31 21:30:00 > Name: time, dtype: datetime64[ns] > {code} > Correct: > {code:java} > from_tz = "America/New_York" > to_tz = "America/New_York" > s.apply( > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > if ts is not pd.NaT else pd.NaT) > 0 2015-10-31 22:30:00 > Name: time, dtype: datetime64[ns] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
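The correct semantics shown in the report are localize-in-session-zone, then convert. With Python's stdlib {{zoneinfo}} (no pandas needed), the example timestamp round-trips like this — a sketch of the intended behaviour, assuming the America/New_York session zone from the report:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# 2015-10-31 22:30 interpreted in the session zone (EDT, UTC-4 on that date).
naive = datetime(2015, 10, 31, 22, 30)
local = naive.replace(tzinfo=ZoneInfo("America/New_York"))

utc = local.astimezone(ZoneInfo("UTC"))
assert (utc.month, utc.day, utc.hour, utc.minute) == (11, 1, 2, 30)

# Interpreting the same wall-clock time in the wrong zone (the effect of
# the bad "tzlocal()" zone string) shifts the value by the offset difference.
wrong = naive.replace(tzinfo=ZoneInfo("UTC")).astimezone(
    ZoneInfo("America/New_York"))
assert (wrong.hour, wrong.minute) == (18, 30)   # off by the -4h offset
```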
[jira] [Assigned] (SPARK-23360) SparkSession.createDataFrame timestamps can be incorrect with non-Arrow codepath
[ https://issues.apache.org/jira/browse/SPARK-23360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23360: Assignee: Takuya Ueshin > SparkSession.createDataFrame timestamps can be incorrect with non-Arrow > codepath > > > Key: SPARK-23360 > URL: https://issues.apache.org/jira/browse/SPARK-23360 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Jin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.3.0 > > > {code:java} > import datetime > import pandas as pd > import os > dt = [datetime.datetime(2015, 10, 31, 22, 30)] > pdf = pd.DataFrame({'time': dt}) > os.environ['TZ'] = 'America/New_York' > df1 = spark.createDataFrame(pdf) > df1.show() > +---+ > | time| > +---+ > |2015-10-31 21:30:00| > +---+ > {code} > Seems to related to this line here: > [https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1776] > It appears to be an issue with "tzlocal()" > Wrong: > {code:java} > from_tz = "America/New_York" > to_tz = "tzlocal()" > s.apply(lambda ts: > ts.tz_localize(from_tz,ambiguous=False).tz_convert(to_tz).tz_localize(None) > if ts is not pd.NaT else pd.NaT) > 0 2015-10-31 21:30:00 > Name: time, dtype: datetime64[ns] > {code} > Correct: > {code:java} > from_tz = "America/New_York" > to_tz = "America/New_York" > s.apply( > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > if ts is not pd.NaT else pd.NaT) > 0 2015-10-31 22:30:00 > Name: time, dtype: datetime64[ns] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale
[ https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359500#comment-16359500 ] Sean Owen commented on SPARK-23370: --- Overhead for the applications that will use Oracle from Spark. You're proposing making all Oracle connections query a table for schema instead of getting it the usual way from the JDBC driver. What's the downside? > Spark receives a size of 0 for an Oracle Number field and defaults the field > type to be BigDecimal(30,10) instead of the actual precision and scale > --- > > Key: SPARK-23370 > URL: https://issues.apache.org/jira/browse/SPARK-23370 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 > Environment: Spark 2.2 > Oracle 11g > JDBC ojdbc6.jar >Reporter: Harleen Singh Mann >Priority: Minor > Attachments: Oracle KB Document 1266785.pdf > > > Currently, on JDBC read Spark obtains the schema of a table using > resultSet.getMetaData.getColumnType. > This works 99.99% of the time, except when a column of Number type is added > to an Oracle table using an alter statement. This is essentially an Oracle > DB + JDBC bug that has been documented on Oracle KB and for which patches exist. > [oracle > KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html]
> As a result of the above-mentioned issue, Spark receives a > size of 0 for the field and defaults the field type to be BigDecimal(30,10) > instead of what it actually should be. This is done in OracleDialect.scala. > This may cause issues in downstream applications, where relevant > information may be missed due to the changed precision and scale. > _The versions that are affected are:_ > _JDBC - Version: 11.2.0.1 and later [Release: 11.2 and later]_ > _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_ > _[Release: 11.1 to 11.2]_ > +Proposed approach:+ > There is another way of fetching the schema information in Oracle: > through the all_tab_columns table. If we use this table to fetch the > precision and scale of the Number type, the above issue is mitigated. > > I can implement the changes, but require some > inputs on the approach from the gatekeepers here. > PS. This is also my first Jira issue and my first fork of > Spark, so I will need some guidance along the way. (Yes, I am a newbie to > this.) Thanks...
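The proposed alternative reads precision and scale from Oracle's {{all_tab_columns}} data-dictionary view, along these lines (owner and table names are placeholders):

```sql
-- Precision/scale as stored in Oracle's data dictionary, independent of
-- the JDBC getMetaData bug described above.
SELECT column_name, data_type, data_precision, data_scale
FROM   all_tab_columns
WHERE  owner = 'MY_SCHEMA'
  AND  table_name = 'MY_TABLE';
```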
[jira] [Commented] (SPARK-10697) Lift Calculation in Association Rule mining
[ https://issues.apache.org/jira/browse/SPARK-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359623#comment-16359623 ] Tristan Stevens commented on SPARK-10697: - [~srowen] a big +1 from me to implementing this. Without lift, it becomes very difficult to assess whether a rule is even worth looking at. As an example, using the dataset from Wikipedia, we currently get the following output:
{code:python}
from pyspark.ml.fpm import FPGrowth

df = spark.createDataFrame([
    (0, ["milk", "bread"]),
    (1, ["butter"]),
    (2, ["beer", "diapers"]),
    (3, ["milk", "bread", "butter"]),
    (4, ["bread"])
], ["id", "items"])

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.2, minConfidence=0.2)
model = fpGrowth.fit(df)

# Display frequent itemsets.
model.freqItemsets.show()
{code}
|items|freq|
|[milk]|2|
|[milk, butter]|1|
|[milk, butter, br...|1|
|[milk, bread]|2|
|[diapers]|1|
|[diapers, beer]|1|
|[bread]|3|
|[butter]|2|
|[butter, bread]|1|
|[beer]|1|

{code:python}
# Display generated association rules.
model.associationRules.show()
{code}
|antecedent|consequent|confidence|
|[milk]|[butter]|0.5|
|[milk]|[bread]|1.0|
|[milk, butter]|[bread]|1.0|
|[beer]|[diapers]|1.0|
|[bread]|[milk]|0.6667|
|[bread]|[butter]|0.3333|
|[milk, bread]|[butter]|0.5|
|[diapers]|[beer]|1.0|
|[butter, bread]|[milk]|1.0|
|[butter]|[milk]|0.5|
|[butter]|[bread]|0.5|

However this misses the detail that milk->bread is much less interesting than diapers->beer.
When we add in lift we get the following:
|antecedent|consequent|confidence|lift|
|[milk]|[butter]|0.5|1.25|
|[milk]|[bread]|1.0|1.6667|
|[milk, butter]|[bread]|1.0|1.6667|
|[beer]|[diapers]|1.0|5.0|
|[bread]|[milk]|0.6667|1.6667|
|[bread]|[butter]|0.3333|0.8333|
|[milk, bread]|[butter]|0.5|1.25|
|[diapers]|[beer]|1.0|5.0|
|[butter, bread]|[milk]|1.0|2.5|
|[butter]|[milk]|0.5|1.25|
|[butter]|[bread]|0.5|0.8333|

So the proposal would be to add Lift to the Rules class, calculated by {{lift( x => y ) = sup(x U y) / (sup( x ) * sup( y ))}} > Lift Calculation in Association Rule mining > --- > > Key: SPARK-10697 > URL: https://issues.apache.org/jira/browse/SPARK-10697 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Yashwanth Kumar >Priority: Minor > > Lift is to be calculated for Association rule mining in > AssociationRules.scala under FPM. > Lift is a measure of the performance of Association rules. > Adding lift will help to compare model efficiency.
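The proposed formula can be checked against the tables above with a few lines of plain Python over the same five baskets (exact arithmetic via {{Fraction}}; this is a verification sketch, not the MLlib implementation):

```python
from fractions import Fraction

baskets = [{"milk", "bread"}, {"butter"}, {"beer", "diapers"},
           {"milk", "bread", "butter"}, {"bread"}]
n = len(baskets)

def sup(items):
    # Relative support: fraction of baskets containing all of `items`.
    return Fraction(sum(items <= b for b in baskets), n)

def lift(antecedent, consequent):
    # lift( x => y ) = sup(x U y) / (sup( x ) * sup( y ))
    return sup(antecedent | consequent) / (sup(antecedent) * sup(consequent))

# diapers -> beer is far more "interesting" than milk -> bread, even
# though both rules have confidence 1.0.
assert lift({"beer"}, {"diapers"}) == 5
assert lift({"milk"}, {"bread"}) == Fraction(5, 3)
assert lift({"butter", "bread"}, {"milk"}) == Fraction(5, 2)
```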