[jira] [Commented] (SPARK-28898) SQL Configuration should be mentioned under Spark SQL in User Guide

2019-08-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919210#comment-16919210
 ] 

Hyukjin Kwon commented on SPARK-28898:
--

BTW, I wonder if there's a good way to automatically generate the doc from 
{{SQLConf.scala}}
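
For illustration, here is a rough sketch of what such generation could look like, driven by the same {{SET -v}} output mentioned in the issue rather than by parsing {{SQLConf.scala}} itself. The markdown layout is made up, and the sketch assumes {{SET -v}} returns (key, value, meaning) rows:

{code:scala}
import org.apache.spark.sql.SparkSession

object GenerateSqlConfDoc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-conf-doc")
      .master("local[*]")
      .getOrCreate()

    // "SET -v" lists every defined SQL configuration together with its
    // current/default value and its description.
    val rows = spark.sql("SET -v").collect()

    // Render a simple markdown table that could be pasted into the user guide.
    val header = Seq(
      "| Property Name | Default | Meaning |",
      "| --- | --- | --- |")
    val body = rows
      .filter(_.getString(0).startsWith("spark.sql."))
      .sortBy(_.getString(0))
      .map(r => s"| ${r.getString(0)} | ${r.getString(1)} | ${r.getString(2)} |")

    (header ++ body).foreach(println)
    spark.stop()
  }
}
{code}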

> SQL Configuration should be mentioned under Spark SQL in User Guide
> ---
>
> Key: SPARK-28898
> URL: https://issues.apache.org/jira/browse/SPARK-28898
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> Currently, the end user can only discover the full list of the spark.sql.* configurations 
> by running SET -v.
> Unless the end user is familiar with the Spark code, there is no way to browse and use the 
> complete set of SQL configurations.
> I feel the entire list of SQL configurations should be documented in the 3.0 user guide, like 
> [Spark-Streaming|https://spark.apache.org/docs/latest/configuration.html#spark-streaming]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28898) SQL Configuration should be mentioned under Spark SQL in User Guide

2019-08-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919209#comment-16919209
 ] 

Hyukjin Kwon commented on SPARK-28898:
--

I think we don't need to list the entire set, but it might be worth noting some 
public and important configurations.

> SQL Configuration should be mentioned under Spark SQL in User Guide
> ---
>
> Key: SPARK-28898
> URL: https://issues.apache.org/jira/browse/SPARK-28898
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> Currently, the end user can only discover the full list of the spark.sql.* configurations 
> by running SET -v.
> Unless the end user is familiar with the Spark code, there is no way to browse and use the 
> complete set of SQL configurations.
> I feel the entire list of SQL configurations should be documented in the 3.0 user guide, like 
> [Spark-Streaming|https://spark.apache.org/docs/latest/configuration.html#spark-streaming]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28900) Test Pyspark, SparkR on JDK 11 with run-tests

2019-08-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919203#comment-16919203
 ] 

Hyukjin Kwon edited comment on SPARK-28900 at 8/30/19 5:28 AM:
---

Just a quick FYI: the {{spark-master-test-maven-hadoop-3.2}} builder would need to 
install pyarrow and pandas in the same way as the PR builder ... I believe this could 
be done in a separate ticket if it matters.
cc [~yumwang] since you reached out to me about this. Once they are properly 
installed, those skipped tests will be executed properly 
(https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-jdk-11/326/testReport/org.apache.spark.sql/SQLQueryTestSuite/)


was (Author: hyukjin.kwon):
Just quick FYI, {{spark-master-test-maven-hadoop-3.2]} builder would need to 
install pyarrow and pandas identically like PR builder ... I believe this could 
be done separately in a separate ticket if it matters
cc [~yumwang] since you reached out to me about this. Once they are properly 
installed, those skipped tests will be executed properly 
(https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-jdk-11/326/testReport/org.apache.spark.sql/SQLQueryTestSuite/)

> Test Pyspark, SparkR on JDK 11 with run-tests
> -
>
> Key: SPARK-28900
> URL: https://issues.apache.org/jira/browse/SPARK-28900
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Priority: Major
>
> Right now, we are testing JDK 11 with a Maven-based build, as in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2/
> It looks like _all_ of the Maven-based jobs 'manually' build and invoke 
> tests, and only run tests via Maven -- that is, they do not run Pyspark or 
> SparkR tests. The SBT-based builds do, because they use the {{dev/run-tests}} 
> script that is meant to be for this purpose.
> In fact, there seem to be a couple flavors of copy-pasted build configs. SBT 
> builds look like:
> {code}
> #!/bin/bash
> set -e
> # Configure per-build-executor Ivy caches to avoid SBT Ivy lock contention
> export HOME="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER"
> mkdir -p "$HOME"
> export SBT_OPTS="-Duser.home=$HOME -Dsbt.ivy.home=$HOME/.ivy2"
> export SPARK_VERSIONS_SUITE_IVY_PATH="$HOME/.ivy2"
> # Add a pre-downloaded version of Maven to the path so that we avoid the 
> flaky download step.
> export 
> PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH"
> git clean -fdx
> ./dev/run-tests
> {code}
> Maven builds looks like:
> {code}
> #!/bin/bash
> set -x
> set -e
> rm -rf ./work
> git clean -fdx
> # Generate random point for Zinc
> export ZINC_PORT
> ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)")
> # Use per-build-executor Ivy caches to avoid SBT Ivy lock contention:
> export 
> SPARK_VERSIONS_SUITE_IVY_PATH="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER/.ivy2"
> mkdir -p "$SPARK_VERSIONS_SUITE_IVY_PATH"
> # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental 
> compiler seems to
> # ignore our JAVA_HOME and use the system javac instead.
> export PATH="$JAVA_HOME/bin:$PATH"
> # Add a pre-downloaded version of Maven to the path so that we avoid the 
> flaky download step.
> export 
> PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH"
> MVN="build/mvn -DzincPort=$ZINC_PORT"
> set +e
> if [[ $HADOOP_PROFILE == hadoop-1 ]]; then
> # Note that there is no -Pyarn flag here for Hadoop 1:
> $MVN \
> -DskipTests \
> -P"$HADOOP_PROFILE" \
> -Dhadoop.version="$HADOOP_VERSION" \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> clean package
> retcode1=$?
> $MVN \
> -P"$HADOOP_PROFILE" \
> -Dhadoop.version="$HADOOP_VERSION" \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> --fail-at-end \
> test
> retcode2=$?
> else
> $MVN \
> -DskipTests \
> -P"$HADOOP_PROFILE" \
> -Pyarn \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> clean package
> retcode1=$?
> $MVN \
> -P"$HADOOP_PROFILE" \
> -Pyarn \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> --fail-at-end \
> test
> retcode2=$?
> fi
> if [[ $retcode1 -ne 0 || $retcode2 -ne 0 ]]; then
>   if [[ $retcode1 -ne 0 ]]; then
> echo "Packaging Spark with Maven failed"
>   fi
>   if [[ $retcode2 -ne 0 ]]; then
> echo "Testing Spark with Maven failed"
>   

[jira] [Commented] (SPARK-28900) Test Pyspark, SparkR on JDK 11 with run-tests

2019-08-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919203#comment-16919203
 ] 

Hyukjin Kwon commented on SPARK-28900:
--

Just a quick FYI: the {{spark-master-test-maven-hadoop-3.2}} builder would need to 
install pyarrow and pandas in the same way as the PR builder ... I believe this could 
be done in a separate ticket if it matters.
cc [~yumwang] since you reached out to me about this. Once they are properly 
installed, those skipped tests will be executed properly 
(https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-jdk-11/326/testReport/org.apache.spark.sql/SQLQueryTestSuite/)

> Test Pyspark, SparkR on JDK 11 with run-tests
> -
>
> Key: SPARK-28900
> URL: https://issues.apache.org/jira/browse/SPARK-28900
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Priority: Major
>
> Right now, we are testing JDK 11 with a Maven-based build, as in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2/
> It looks like _all_ of the Maven-based jobs 'manually' build and invoke 
> tests, and only run tests via Maven -- that is, they do not run Pyspark or 
> SparkR tests. The SBT-based builds do, because they use the {{dev/run-tests}} 
> script that is meant to be for this purpose.
> In fact, there seem to be a couple flavors of copy-pasted build configs. SBT 
> builds look like:
> {code}
> #!/bin/bash
> set -e
> # Configure per-build-executor Ivy caches to avoid SBT Ivy lock contention
> export HOME="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER"
> mkdir -p "$HOME"
> export SBT_OPTS="-Duser.home=$HOME -Dsbt.ivy.home=$HOME/.ivy2"
> export SPARK_VERSIONS_SUITE_IVY_PATH="$HOME/.ivy2"
> # Add a pre-downloaded version of Maven to the path so that we avoid the 
> flaky download step.
> export 
> PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH"
> git clean -fdx
> ./dev/run-tests
> {code}
> Maven builds looks like:
> {code}
> #!/bin/bash
> set -x
> set -e
> rm -rf ./work
> git clean -fdx
> # Generate random point for Zinc
> export ZINC_PORT
> ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)")
> # Use per-build-executor Ivy caches to avoid SBT Ivy lock contention:
> export 
> SPARK_VERSIONS_SUITE_IVY_PATH="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER/.ivy2"
> mkdir -p "$SPARK_VERSIONS_SUITE_IVY_PATH"
> # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental 
> compiler seems to
> # ignore our JAVA_HOME and use the system javac instead.
> export PATH="$JAVA_HOME/bin:$PATH"
> # Add a pre-downloaded version of Maven to the path so that we avoid the 
> flaky download step.
> export 
> PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH"
> MVN="build/mvn -DzincPort=$ZINC_PORT"
> set +e
> if [[ $HADOOP_PROFILE == hadoop-1 ]]; then
> # Note that there is no -Pyarn flag here for Hadoop 1:
> $MVN \
> -DskipTests \
> -P"$HADOOP_PROFILE" \
> -Dhadoop.version="$HADOOP_VERSION" \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> clean package
> retcode1=$?
> $MVN \
> -P"$HADOOP_PROFILE" \
> -Dhadoop.version="$HADOOP_VERSION" \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> --fail-at-end \
> test
> retcode2=$?
> else
> $MVN \
> -DskipTests \
> -P"$HADOOP_PROFILE" \
> -Pyarn \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> clean package
> retcode1=$?
> $MVN \
> -P"$HADOOP_PROFILE" \
> -Pyarn \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> --fail-at-end \
> test
> retcode2=$?
> fi
> if [[ $retcode1 -ne 0 || $retcode2 -ne 0 ]]; then
>   if [[ $retcode1 -ne 0 ]]; then
> echo "Packaging Spark with Maven failed"
>   fi
>   if [[ $retcode2 -ne 0 ]]; then
> echo "Testing Spark with Maven failed"
>   fi
>   exit 1
> fi
> {code}
> The PR builder (one of them at least) looks like:
> {code}
> #!/bin/bash
> set -e  # fail on any non-zero exit code
> set -x
> export AMPLAB_JENKINS=1
> export PATH="$PATH:/home/anaconda/envs/py3k/bin"
> # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental 
> compiler seems to
> # ignore our JAVA_HOME and use the system javac instead.
> export PATH="$JAVA_HOME/bin:$PATH"
> # Add a pre-downloaded version of Maven to the path so that we avoid the 
> flaky download step.
> export 
> 

[jira] [Issue Comment Deleted] (SPARK-28809) Document SHOW TABLE in SQL Reference

2019-08-29 Thread Shivu Sondur (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivu Sondur updated SPARK-28809:
-
Comment: was deleted

(was: [~dkbiswal]

I am not able to find a "SHOW TABLE" command in Spark.

The "SHOW TABLES" command is present.

Please recheck whether this Jira is valid.)

> Document SHOW TABLE in SQL Reference
> 
>
> Key: SPARK-28809
> URL: https://issues.apache.org/jira/browse/SPARK-28809
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28906) `bin/spark-submit --version` shows incorrect info

2019-08-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919190#comment-16919190
 ] 

Hyukjin Kwon commented on SPARK-28906:
--

cc [~kiszk] FYI

> `bin/spark-submit --version` shows incorrect info
> -
>
> Key: SPARK-28906
> URL: https://issues.apache.org/jira/browse/SPARK-28906
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, 
> 3.0.0, 2.4.3
>Reporter: Marcelo Vanzin
>Priority: Minor
> Attachments: image-2019-08-29-05-50-13-526.png
>
>
> Since Spark 2.3.1, `spark-submit` has shown incorrect information.
> {code}
> $ bin/spark-submit --version
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
>       /_/
> Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222
> Branch
> Compiled by user  on 2019-02-04T13:00:46Z
> Revision
> Url
> Type --help for more information.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28916) Generated SpecificSafeProjection.apply method grows beyond 64 KB when use SparkSQL

2019-08-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919188#comment-16919188
 ] 

Hyukjin Kwon commented on SPARK-28916:
--

Reproducer:

{code}
val df = spark.range(100).selectExpr((0 to 5000).map(i => s"id as field_$i"): 
_*)
df.createOrReplaceTempView("spark64kb")
val data = spark.sql("select * from spark64kb limit 10")
data.describe()
{code}

> Generated SpecificSafeProjection.apply method grows beyond 64 KB when use  
> SparkSQL
> ---
>
> Key: SPARK-28916
> URL: https://issues.apache.org/jira/browse/SPARK-28916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
>Reporter: MOBIN
>Priority: Major
>
> Can be reproduced by the following steps:
> 1. Create a table with 5000 fields
> 2. val data=spark.sql("select * from spark64kb limit 10");
> 3. data.describe()
> Then, the following error occurred:
> {code:java}
> WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 0, localhost, 
> executor 1): org.codehaus.janino.InternalCompilerException: failed to 
> compile: org.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method 
> "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection"
>  grows beyond 64 KB
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1298)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1376)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1373)
>   at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
>   at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.generate(GenerateMutableProjection.scala:44)
>   at 
> org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:385)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:96)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:95)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.generateProcessRow(AggregationIterator.scala:180)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.(AggregationIterator.scala:199)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.(SortBasedAggregationIterator.scala:40)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:86)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:77)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 

[jira] [Updated] (SPARK-28910) Prevent schema verification when connecting to in memory derby

2019-08-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28910:
-
Description: 
When {{hive.metastore.schema.verification=true}}, 
{{HiveUtils.newClientForExecution}} fails with 

{code}
19/08/14 13:26:55 WARN Hive: Failed to access metastore. This class should not 
accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:186)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:143)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:290)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForExecution(HiveUtils.scala:275)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.startWithContext(HiveThriftServer2.scala:58)
...
Caused by: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
{code}

This prevents Thriftserver from starting

  was:
When hive.metastore.schema.verification=true, HiveUtils.newClientForExecution 
fails with 
{{19/08/14 13:26:55 WARN Hive: Failed to access metastore. This class should 
not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:186)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:143)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:290)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForExecution(HiveUtils.scala:275)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.startWithContext(HiveThriftServer2.scala:58)
...
Caused by: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient}}


This prevents Thriftserver from starting


> Prevent schema verification when connecting to in memory derby
> --
>
> Key: SPARK-28910
> URL: https://issues.apache.org/jira/browse/SPARK-28910
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Juliusz Sompolski
>Priority: Major
>
> When {{hive.metastore.schema.verification=true}}, 
> {{HiveUtils.newClientForExecution}} fails with 
> {code}
> 19/08/14 13:26:55 WARN Hive: Failed to access metastore. This class should 
> not accessed in runtime.
> org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
> Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
> at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
> at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
> at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
> at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:186)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:143)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:290)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForExecution(HiveUtils.scala:275)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.startWithContext(HiveThriftServer2.scala:58)
> ...
> Caused by: java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
> {code}
> This prevents Thriftserver from starting



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28910) Prevent schema verification when connecting to in memory derby

2019-08-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28910:
-
Component/s: (was: Spark Core)
 SQL

> Prevent schema verification when connecting to in memory derby
> --
>
> Key: SPARK-28910
> URL: https://issues.apache.org/jira/browse/SPARK-28910
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Juliusz Sompolski
>Priority: Major
>
> When {{hive.metastore.schema.verification=true}}, 
> {{HiveUtils.newClientForExecution}} fails with 
> {code}
> 19/08/14 13:26:55 WARN Hive: Failed to access metastore. This class should 
> not accessed in runtime.
> org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
> Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
> at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
> at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
> at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
> at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:186)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:143)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:290)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForExecution(HiveUtils.scala:275)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.startWithContext(HiveThriftServer2.scala:58)
> ...
> Caused by: java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
> {code}
> This prevents Thriftserver from starting



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28861) Jetty property handling: java.lang.NumberFormatException: For input string: "unknown".

2019-08-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919183#comment-16919183
 ] 

Hyukjin Kwon commented on SPARK-28861:
--

ping [~Shrikhande]

> Jetty property handling: java.lang.NumberFormatException: For input string: 
> "unknown".
> --
>
> Key: SPARK-28861
> URL: https://issues.apache.org/jira/browse/SPARK-28861
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Submit
>Affects Versions: 2.4.3
>Reporter: Ketan
>Priority: Minor
>
> While processing data from certain files, a {{NumberFormatException}} was seen 
> in the logs. The processing itself was fine, but the following stacktrace was 
> observed:
> {code}
> {"time":"2019-08-16 
> 08:21:36,733","level":"DEBUG","class":"o.s.j.u.Jetty","message":"","thread":"Driver","appName":"app-name","appVersion":"APPLICATION_VERSION","type":"APPLICATION","errorCode":"ERROR_CODE","errorId":""}
> java.lang.NumberFormatException: For input string: "unknown".
> {code}
> On investigation it is found that in the class Jetty there is the following:
> {code}
> BUILD_TIMESTAMP = formatTimestamp(__buildProperties.getProperty("timestamp", 
> "unknown")); 
> {code}
> which indicates that the config should have the 'timestamp' property. If the 
> property is not there then the default value is set as 'unknown' and this 
> value causes the stacktrace to show up in the logs in our application. It has 
> no detrimental effect on the application as such but could be addressed.
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28912) MatchError exception in CheckpointWriteHandler

2019-08-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919182#comment-16919182
 ] 

Hyukjin Kwon commented on SPARK-28912:
--

It might be much easier for other people to write a test and investigate further 
if there's a self-contained reproducer, in case you're not going to open a PR.
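
In case it helps, a minimal sketch of what such a reproducer might look like, based only on the issue description (a checkpoint directory named "checkpoint-" plus digits); the socket source, batch interval, and state function are illustrative and untested against the affected versions:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Spark28912Repro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SPARK-28912-repro").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // The reported trigger: a checkpoint directory whose name is "checkpoint-"
    // followed by digits, e.g. "checkpoint-01".
    ssc.checkpoint("checkpoint-01")

    // A stateful operation forces checkpointing, so CheckpointWriteHandler runs.
    val counts = ssc.socketTextStream("localhost", 9999)
      .map(word => (word, 1))
      .updateStateByKey((values: Seq[Int], state: Option[Int]) =>
        Some(state.getOrElse(0) + values.sum))
    counts.print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(30000)
    ssc.stop()
  }
}
{code}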

> MatchError exception in CheckpointWriteHandler
> --
>
> Key: SPARK-28912
> URL: https://issues.apache.org/jira/browse/SPARK-28912
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.2
>Reporter: Aleksandr Kashkirov
>Priority: Minor
>
> Setting checkpoint directory name to "checkpoint-" plus some digits (e.g. 
> "checkpoint-01") results in the following error:
> {code:java}
> Exception in thread "pool-32-thread-1" scala.MatchError: 
> 0523a434-0daa-4ea6-a050-c4eb3c557d8c (of class java.lang.String) 
>  at 
> org.apache.spark.streaming.Checkpoint$.org$apache$spark$streaming$Checkpoint$$sortFunc$1(Checkpoint.scala:121)
>  
>  at 
> org.apache.spark.streaming.Checkpoint$$anonfun$getCheckpointFiles$1.apply(Checkpoint.scala:132)
>  
>  at 
> org.apache.spark.streaming.Checkpoint$$anonfun$getCheckpointFiles$1.apply(Checkpoint.scala:132)
>  
>  at scala.math.Ordering$$anon$9.compare(Ordering.scala:200) 
>  at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355) 
>  at java.util.TimSort.sort(TimSort.java:234) 
>  at java.util.Arrays.sort(Arrays.java:1438) 
>  at scala.collection.SeqLike$class.sorted(SeqLike.scala:648) 
>  at scala.collection.mutable.ArrayOps$ofRef.sorted(ArrayOps.scala:186) 
>  at scala.collection.SeqLike$class.sortWith(SeqLike.scala:601) 
>  at scala.collection.mutable.ArrayOps$ofRef.sortWith(ArrayOps.scala:186) 
>  at 
> org.apache.spark.streaming.Checkpoint$.getCheckpointFiles(Checkpoint.scala:132)
>  
>  at 
> org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:262)
>  
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  
>  at java.lang.Thread.run(Thread.java:748){code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28924) spark.jobGroup.id and spark.job.description is missing in Document

2019-08-29 Thread Sharanabasappa G Keriwaddi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919181#comment-16919181
 ] 

Sharanabasappa G Keriwaddi commented on SPARK-28924:


[~abhishek.akg], thanks. I will check this and raise a PR if required.

> spark.jobGroup.id and spark.job.description is missing in Document
> --
>
> Key: SPARK-28924
> URL: https://issues.apache.org/jira/browse/SPARK-28924
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.4.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> *spark.jobGroup.id* and *spark.job.description* need to be documented in the product 
> documentation.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28899) merge the testing in-memory v2 catalogs from catalyst and core

2019-08-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919180#comment-16919180
 ] 

Hyukjin Kwon commented on SPARK-28899:
--

Fixed in https://github.com/apache/spark/pull/25610

> merge the testing in-memory v2 catalogs from catalyst and core
> --
>
> Key: SPARK-28899
> URL: https://issues.apache.org/jira/browse/SPARK-28899
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28924) spark.jobGroup.id and spark.job.description is missing in Document

2019-08-29 Thread ABHISHEK KUMAR GUPTA (Jira)
ABHISHEK KUMAR GUPTA created SPARK-28924:


 Summary: spark.jobGroup.id and spark.job.description is missing in 
Document
 Key: SPARK-28924
 URL: https://issues.apache.org/jira/browse/SPARK-28924
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.4.3
Reporter: ABHISHEK KUMAR GUPTA


*spark.jobGroup.id* and *spark.job.description* need to be documented in the product 
documentation.
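
For reference, these two keys appear to be the local properties that {{SparkContext.setJobGroup}} attaches to submitted jobs; below is a small sketch of how they typically get populated (the group id and description values are made up):

{code:scala}
import org.apache.spark.sql.SparkSession

object JobGroupExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("job-group-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // setJobGroup stores a group id and description as thread-local properties;
    // all jobs submitted from this thread afterwards carry them (visible in the UI).
    sc.setJobGroup("nightly-report", "Aggregate daily metrics", interruptOnCancel = true)
    println(sc.range(0, 1000).sum())

    // The values can be read back through the local properties named in this issue.
    println(sc.getLocalProperty("spark.jobGroup.id"))     // nightly-report
    println(sc.getLocalProperty("spark.job.description")) // Aggregate daily metrics

    spark.stop()
  }
}
{code}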



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28918) from_utc_timestamp function is mistakenly considering DST for Brazil in 2019

2019-08-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919173#comment-16919173
 ] 

Hyukjin Kwon commented on SPARK-28918:
--

BTW {{from_utc_timestamp}} is deprecated as of Spark 3.0.

> from_utc_timestamp function is mistakenly considering DST for Brazil in 2019
> 
>
> Key: SPARK-28918
> URL: https://issues.apache.org/jira/browse/SPARK-28918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: I'm using Spark through Databricks
>Reporter: Luiz Hissashi da Rocha
>Priority: Minor
>
> I realized that *from_utc_timestamp* function is assuming that Brazil will 
> have DST in 2019 but it will not, unlike previous years. Because of that, 
> when I run the function below, instead of having "2019-11-14" (São Paulo is 
> UTC-3h), I still get "2019-11-15T00:18:01" wrongly (as if it was UTC-2h due 
> to DST).
> {code:java}
> // from_utc_timestamp("2019-11-15T02:18:01.000+", 'America/Sao_Paulo')
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28918) from_utc_timestamp function is mistakenly considering DST for Brazil in 2019

2019-08-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28918.
--
Resolution: Not A Problem

> from_utc_timestamp function is mistakenly considering DST for Brazil in 2019
> 
>
> Key: SPARK-28918
> URL: https://issues.apache.org/jira/browse/SPARK-28918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: I'm using Spark through Databricks
>Reporter: Luiz Hissashi da Rocha
>Priority: Minor
>
> I realized that *from_utc_timestamp* function is assuming that Brazil will 
> have DST in 2019 but it will not, unlike previous years. Because of that, 
> when I run the function below, instead of having "2019-11-14" (São Paulo is 
> UTC-3h), I still get "2019-11-15T00:18:01" wrongly (as if it was UTC-2h due 
> to DST).
> {code:java}
> // from_utc_timestamp("2019-11-15T02:18:01.000+", 'America/Sao_Paulo')
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28916) Generated SpecificSafeProjection.apply method grows beyond 64 KB when use SparkSQL

2019-08-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28916:
-
Target Version/s:   (was: 2.3.1, 2.4.3)

> Generated SpecificSafeProjection.apply method grows beyond 64 KB when use  
> SparkSQL
> ---
>
> Key: SPARK-28916
> URL: https://issues.apache.org/jira/browse/SPARK-28916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
>Reporter: MOBIN
>Priority: Major
>
> Can be reproduced by the following steps:
> 1. Create a table with 5000 fields
> 2. val data=spark.sql("select * from spark64kb limit 10");
> 3. data.describe()
> Then, the following error occurred:
> {code:java}
> WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 0, localhost, 
> executor 1): org.codehaus.janino.InternalCompilerException: failed to 
> compile: org.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method 
> "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection"
>  grows beyond 64 KB
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1298)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1376)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1373)
>   at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
>   at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.generate(GenerateMutableProjection.scala:44)
>   at 
> org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:385)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:96)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:95)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.generateProcessRow(AggregationIterator.scala:180)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.(AggregationIterator.scala:199)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.(SortBasedAggregationIterator.scala:40)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:86)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:77)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method 
> "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> 

[jira] [Created] (SPARK-28923) Deduplicate the codes 'multipartIdentifier' and 'identifierSeq'

2019-08-29 Thread Xianyin Xin (Jira)
Xianyin Xin created SPARK-28923:
---

 Summary: Deduplicate the codes 'multipartIdentifier' and 
'identifierSeq'
 Key: SPARK-28923
 URL: https://issues.apache.org/jira/browse/SPARK-28923
 Project: Spark
  Issue Type: Request
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xianyin Xin


In {{sqlbase.g4}}, {{multipartIdentifier}} and {{identifierSeq}} have the same 
functionality. We'd better deduplicate them.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28920) Set up java version for github workflow

2019-08-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28920.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25625
[https://github.com/apache/spark/pull/25625]

> Set up java version for github workflow
> ---
>
> Key: SPARK-28920
> URL: https://issues.apache.org/jira/browse/SPARK-28920
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> We added a Java matrix to the GitHub workflow. As we want to build with JDK 8/11, we 
> should set up the Java version for mvn.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28920) Set up java version for github workflow

2019-08-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28920:
-

Assignee: Liang-Chi Hsieh

> Set up java version for github workflow
> ---
>
> Key: SPARK-28920
> URL: https://issues.apache.org/jira/browse/SPARK-28920
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> We added a Java matrix to the GitHub workflow. As we want to build with JDK 8/11, we 
> should set up the Java version for mvn.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28919) Add more profiles for JDK8/11 build test

2019-08-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28919.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25624
[https://github.com/apache/spark/pull/25624]

> Add more profiles for JDK8/11 build test
> 
>
> Key: SPARK-28919
> URL: https://issues.apache.org/jira/browse/SPARK-28919
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> On Jenkins, we build with JDK 8 and test with JDK 8/11.
> We are testing the JDK 11 build via the GitHub workflow. This issue aims to add more 
> profiles.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28919) Add more profiles for JDK8/11 build test

2019-08-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28919:
-

Assignee: Dongjoon Hyun

> Add more profiles for JDK8/11 build test
> 
>
> Key: SPARK-28919
> URL: https://issues.apache.org/jira/browse/SPARK-28919
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> On Jenkins, we build with JDK 8 and test with JDK 8/11.
> We are testing the JDK 11 build via the GitHub workflow. This issue aims to add more 
> profiles.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28922) Safe Kafka parameter redaction

2019-08-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28922:
-

Assignee: Gabor Somogyi

> Safe Kafka parameter redaction
> --
>
> Key: SPARK-28922
> URL: https://issues.apache.org/jira/browse/SPARK-28922
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Minor
>
> At the moment, Kafka parameter redaction expects SparkEnv. SparkEnv exists in 
> normal queries, but several unit tests do not provide it in order to keep things 
> simple. As a result, such tests throw an exception similar to the following:



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28922) Safe Kafka parameter redaction

2019-08-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28922.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25621
[https://github.com/apache/spark/pull/25621]

> Safe Kafka parameter redaction
> --
>
> Key: SPARK-28922
> URL: https://issues.apache.org/jira/browse/SPARK-28922
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Minor
> Fix For: 3.0.0
>
>
> At the moment, Kafka parameter redaction expects SparkEnv. SparkEnv exists in 
> normal queries, but several unit tests do not provide it in order to keep things 
> simple. As a result, such tests throw an exception similar to the following:



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28922) Safe Kafka parameter redaction

2019-08-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28922:
--
Reporter: Gabor Somogyi  (was: Dongjoon Hyun)

> Safe Kafka parameter redaction
> --
>
> Key: SPARK-28922
> URL: https://issues.apache.org/jira/browse/SPARK-28922
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> At the moment, Kafka parameter redaction expects SparkEnv. SparkEnv exists in 
> normal queries, but several unit tests do not provide it in order to keep things 
> simple. As a result, such tests throw an exception similar to the following:



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28922) Safe Kafka parameter redaction

2019-08-29 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-28922:
-

 Summary: Safe Kafka parameter redaction
 Key: SPARK-28922
 URL: https://issues.apache.org/jira/browse/SPARK-28922
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


At the moment, Kafka parameter redaction expects SparkEnv. SparkEnv exists in normal 
queries, but several unit tests do not provide it in order to keep things simple. As a 
result, such tests throw an exception similar to the following:



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10)

2019-08-29 Thread Paul Schweigert (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Schweigert updated SPARK-28921:

Description: 
Spark jobs are failing on latest versions of Kubernetes when jobs attempt to 
provision executor pods (jobs like Spark-Pi that do not launch executors run 
without a problem):

 

Here's an example error message:

 
{code:java}
19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors from 
Kubernetes.
19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors from 
Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: HTTP 
403, Status: 403 - 
java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' 
at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) 
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) 
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) 
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)
{code}
 

Looks like the issue is caused by the internal master Kubernetes url not having 
the port specified:

[https://github.com/apache/spark/blob/master//resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L82:7]

 

Using the master with the port (443) seems to fix the problem.

 

  was:
Spark jobs are failing on latest versions of Kubernetes when jobs attempt to 
provision executor pods (jobs like Spark-Pi that do not launch executors run 
without a problem):

 

Here's an example error message:

 
{code:java}
19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors from 
Kubernetes.19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 
executors from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec 
Failure: HTTP 403, Status: 403 - java.net.ProtocolException: Expected HTTP 101 
response but was '403 Forbidden' at 
okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) at 
okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) at 
okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) at 
okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)
{code}
 

Looks like the issue is caused by the internal master Kubernetes url not having 
the port specified:

[https://github.com/apache/spark/blob/master//resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L82:7]

 

Using the master with the port (443) seems to fix the problem.

 


> Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10)
> -
>
> Key: SPARK-28921
> URL: https://issues.apache.org/jira/browse/SPARK-28921
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Paul Schweigert
>Priority: Minor
>
> Spark jobs are failing on latest versions of Kubernetes when jobs attempt to 
> provision executor pods (jobs like Spark-Pi that do not launch executors run 
> without a problem):
>  
> Here's an example error message:
>  
> {code:java}
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: 
> HTTP 403, Status: 403 - 
> java.net.ProtocolException: Expected HTTP 101 response but was '403 
> Forbidden' 
> at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) 
> at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) 
> at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) 
> at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> Looks like the issue is caused by the internal master Kubernetes url not 
> having the port specified:
> [https://github.com/apache/spark/blob/master//resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L82:7]
>  
> Using the master with the port (443) seems to fix the problem.
>  



--
This message 

[jira] [Created] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10)

2019-08-29 Thread Paul Schweigert (Jira)
Paul Schweigert created SPARK-28921:
---

 Summary: Spark jobs failing on latest versions of Kubernetes 
(1.15.3, 1.14.6, 1,13.10)
 Key: SPARK-28921
 URL: https://issues.apache.org/jira/browse/SPARK-28921
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.4.3
Reporter: Paul Schweigert


Spark jobs are failing on latest versions of Kubernetes when jobs attempt to 
provision executor pods (jobs like Spark-Pi that do not launch executors run 
without a problem):

 

Here's an example error message:

 
{code:java}
19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors from 
Kubernetes.19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 
executors from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec 
Failure: HTTP 403, Status: 403 - java.net.ProtocolException: Expected HTTP 101 
response but was '403 Forbidden' at 
okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) at 
okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) at 
okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) at 
okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)
{code}
 

Looks like the issue is caused by the internal master Kubernetes url not having 
the port specified:

[https://github.com/apache/spark/blob/master//resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L82:7]

 

Using the master with the port (443) seems to fix the problem.

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28843) Set OMP_NUM_THREADS to executor cores reduce Python memory consumption

2019-08-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28843.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25545
[https://github.com/apache/spark/pull/25545]

> Set OMP_NUM_THREADS to executor cores reduce Python memory consumption
> --
>
> Key: SPARK-28843
> URL: https://issues.apache.org/jira/browse/SPARK-28843
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> While testing hardware with more cores, we found that the amount of memory 
> required by PySpark applications increased and tracked the problem to 
> importing numpy. The numpy issue is 
> [https://github.com/numpy/numpy/issues/10455]
> NumPy uses OpenMP that starts a thread pool with the number of cores on the 
> machine (and does not respect cgroups). When we set this lower we see a 
> significant reduction in memory consumption.
> This parallelism setting should be set to the number of cores allocated to 
> the executor, not the number of cores available.
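A minimal sketch of the idea (the per-executor core count of 4 below is 
hypothetical): pass OMP_NUM_THREADS to the executor environment so NumPy/OpenMP 
sizes its thread pool to the allocated cores rather than to the whole machine.
{code:scala}
import org.apache.spark.SparkConf

val executorCores = 4  // hypothetical per-executor core allocation
val conf = new SparkConf()
  .set("spark.executor.cores", executorCores.toString)
  // Cap OpenMP (and hence NumPy) parallelism at the allocated cores.
  .setExecutorEnv("OMP_NUM_THREADS", executorCores.toString)
{code}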



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28843) Set OMP_NUM_THREADS to executor cores reduce Python memory consumption

2019-08-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28843:


Assignee: Ryan Blue

> Set OMP_NUM_THREADS to executor cores reduce Python memory consumption
> --
>
> Key: SPARK-28843
> URL: https://issues.apache.org/jira/browse/SPARK-28843
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>  Labels: release-notes
>
> While testing hardware with more cores, we found that the amount of memory 
> required by PySpark applications increased and tracked the problem to 
> importing numpy. The numpy issue is 
> [https://github.com/numpy/numpy/issues/10455]
> NumPy uses OpenMP that starts a thread pool with the number of cores on the 
> machine (and does not respect cgroups). When we set this lower we see a 
> significant reduction in memory consumption.
> This parallelism setting should be set to the number of cores allocated to 
> the executor, not the number of cores available.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28920) Set up java version for github workflow

2019-08-29 Thread Liang-Chi Hsieh (Jira)
Liang-Chi Hsieh created SPARK-28920:
---

 Summary: Set up java version for github workflow
 Key: SPARK-28920
 URL: https://issues.apache.org/jira/browse/SPARK-28920
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


We added a Java matrix to the GitHub workflow. Since we want to build with 
JDK8/11, we should set up the Java version for mvn.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28918) from_utc_timestamp function is mistakenly considering DST for Brazil in 2019

2019-08-29 Thread Shivu Sondur (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919045#comment-16919045
 ] 

Shivu Sondur commented on SPARK-28918:
--

[~hissashirocha]

Brazil's DST was removed on Sunday, 17 February 2019.

You need to update your JVM's timezone data with the TZUpdater tool. Refer to 
the following link: 
https://www.oracle.com/technetwork/java/javase/timezones-137583.html#tzu

> from_utc_timestamp function is mistakenly considering DST for Brazil in 2019
> 
>
> Key: SPARK-28918
> URL: https://issues.apache.org/jira/browse/SPARK-28918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: I'm using Spark through Databricks
>Reporter: Luiz Hissashi da Rocha
>Priority: Minor
>
> I realized that the *from_utc_timestamp* function assumes that Brazil will 
> have DST in 2019, but it will not, unlike previous years. Because of that, 
> when I run the function below, instead of getting "2019-11-14" (São Paulo is 
> UTC-3h), I still wrongly get "2019-11-15T00:18:01" (as if it were UTC-2h due 
> to DST).
> {code:java}
> // from_utc_timestamp("2019-11-15T02:18:01.000+", 'America/Sao_Paulo')
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28919) Add more profiles for JDK8/11 build test

2019-08-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28919:
--
Description: 
From Jenkins, we build with JDK8 and test with JDK8/11.
We are testing JDK11 build via GitHub workflow. This issue aims to add more 
profiles.

  was:
From Jenkins, we build with JDK8 and test with JDK8/11.
This issue aims to add JDK11 build test to GitHub workflow.


> Add more profiles for JDK8/11 build test
> 
>
> Key: SPARK-28919
> URL: https://issues.apache.org/jira/browse/SPARK-28919
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> From Jenkins, we build with JDK8 and test with JDK8/11.
> We are testing JDK11 build via GitHub workflow. This issue aims to add more 
> profiles.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28919) Add more profiles for JDK8/11 build test

2019-08-29 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-28919:
-

 Summary: Add more profiles for JDK8/11 build test
 Key: SPARK-28919
 URL: https://issues.apache.org/jira/browse/SPARK-28919
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


From Jenkins, we build with JDK8 and test with JDK8/11.
This issue aims to add JDK11 build test to GitHub workflow.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28918) from_utc_timestamp function is mistakenly considering DST for Brazil in 2019

2019-08-29 Thread Luiz Hissashi da Rocha (Jira)
Luiz Hissashi da Rocha created SPARK-28918:
--

 Summary: from_utc_timestamp function is mistakenly considering DST 
for Brazil in 2019
 Key: SPARK-28918
 URL: https://issues.apache.org/jira/browse/SPARK-28918
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
 Environment: I'm using Spark through Databricks
Reporter: Luiz Hissashi da Rocha


I realized that the *from_utc_timestamp* function assumes that Brazil will have 
DST in 2019, but it will not, unlike previous years. Because of that, when I run 
the function below, instead of getting "2019-11-14" (São Paulo is UTC-3h), I 
still wrongly get "2019-11-15T00:18:01" (as if it were UTC-2h due to DST).
{code:java}
// from_utc_timestamp("2019-11-15T02:18:01.000+", 'America/Sao_Paulo')
{code}
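A spark-shell sketch of the expected behaviour once the JVM's timezone data 
includes the 2019 Brazil change (the input below is the same UTC instant as in 
the report, just written without the offset suffix):
{code:scala}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("2019-11-15 02:18:01").toDF("ts_utc")
df.select(from_utc_timestamp($"ts_utc", "America/Sao_Paulo").as("ts_sp")).show(false)
// Expected with current tzdata: 2019-11-14 23:18:01 (UTC-3, no DST).
// A JVM with stale tzdata may wrongly return 2019-11-15 00:18:01 (UTC-2).
{code}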



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28899) merge the testing in-memory v2 catalogs from catalyst and core

2019-08-29 Thread Ryan Blue (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-28899.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> merge the testing in-memory v2 catalogs from catalyst and core
> --
>
> Key: SPARK-28899
> URL: https://issues.apache.org/jira/browse/SPARK-28899
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28760) Add end-to-end Kafka delegation token test

2019-08-29 Thread Marcelo Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-28760.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25477
[https://github.com/apache/spark/pull/25477]

> Add end-to-end Kafka delegation token test
> --
>
> Key: SPARK-28760
> URL: https://issues.apache.org/jira/browse/SPARK-28760
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming, Tests
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.0.0
>
>
> At the moment no end-to-end Kafka delegation token test exists, mainly 
> because a KDC has been missing from the testing side, so I've looked into 
> what possibilities there are. The most obvious choice is the MiniKDC inside 
> the Hadoop library, where Apache Kerby runs in the background. In this jira I 
> would like to add Kerby to the testing area and use it to cover 
> security-related features.
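A hedged sketch (not the merged test itself) of the MiniKDC approach mentioned 
above; Hadoop's MiniKdc, backed by Apache Kerby, provides a throwaway KDC for 
tests, and the principal and paths below are illustrative:
{code:scala}
import java.io.File
import org.apache.hadoop.minikdc.MiniKdc

val workDir = new File(System.getProperty("java.io.tmpdir"), "spark-minikdc")
workDir.mkdirs()
val kdc = new MiniKdc(MiniKdc.createConf(), workDir)
kdc.start()
try {
  val keytab = new File(workDir, "kafka.keytab")
  kdc.createPrincipal(keytab, "kafka/localhost")  // illustrative principal
  // ... exercise the Kafka delegation token flow against kdc.getRealm() ...
} finally {
  kdc.stop()
}
{code}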



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28760) Add end-to-end Kafka delegation token test

2019-08-29 Thread Marcelo Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-28760:
--

Assignee: Gabor Somogyi

> Add end-to-end Kafka delegation token test
> --
>
> Key: SPARK-28760
> URL: https://issues.apache.org/jira/browse/SPARK-28760
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming, Tests
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
>
> At the moment no end-to-end Kafka delegation token test exists, mainly 
> because a KDC has been missing from the testing side, so I've looked into 
> what possibilities there are. The most obvious choice is the MiniKDC inside 
> the Hadoop library, where Apache Kerby runs in the background. In this jira I 
> would like to add Kerby to the testing area and use it to cover 
> security-related features.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25874) Simplify abstractions in the K8S backend

2019-08-29 Thread Marcelo Vanzin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918813#comment-16918813
 ] 

Marcelo Vanzin commented on SPARK-25874:


I'm going to close this one; the only remaining task is for adding (internal 
developer) documentation, which is minor.

> Simplify abstractions in the K8S backend
> 
>
> Key: SPARK-25874
> URL: https://issues.apache.org/jira/browse/SPARK-25874
> Project: Spark
>  Issue Type: Umbrella
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> I spent some time recently re-familiarizing myself with the k8s backend, and 
> I think there is room for improvement. In the past, SPARK-22839 was added 
> which improved things a lot, but it is still hard to follow the code, and it 
> is still more complicated than it should be to add a new feature.
> I've worked on the main things that were bothering me and came up with these 
> changes:
> https://github.com/vanzin/spark/commits/k8s-simple
> Now that patch (first commit of the branch) is a little large, which makes it 
> hard to review and to properly assess what it is doing. So I plan to break it 
> down into a few steps that I will file as sub-tasks and send for review 
> independently.
> The commit message for that patch has a lot of the background of what I 
> changed and why. Since I plan to delete that branch after the work is done, 
> I'll paste it here:
> {noformat}
> There are two main changes happening here.
> (1) Simplify the KubernetesConf abstraction.
> The current code around KubernetesConf has a few drawbacks:
> - it uses composition (with a type parameter) for role-specific configuration
> - it breaks encapsulation of the user configuration, held in SparkConf, by
>   requiring that all the k8s-specific info is extracted from SparkConf before
>   the KubernetesConf object is created.
> - the above is usually done by parsing the SparkConf info into k8s-backend
>   types, which are then transformed into k8s requests.
> This ends up requiring a whole lot of code that is just not necessary.
> The type parameters make parameter and class declarations full of needless
> noise; the breakage of encapsulation makes the code that processes SparkConf
> and the code that builds the k8s descriptors live in different places, and
> the intermediate representation isn't adding much value.
> By using inheritance instead of the current model, role-specific
> specialization of certain config properties works simply by implementing some
> abstract methods of the base class (instead of breaking encapsulation), and
> there's no need anymore for parameterized types.
> By moving config processing to the code that actually transforms the config
> into k8s descriptors, a lot of intermediate boilerplate can be removed.
> This leads to...
> (2) Make all feature logic part of the feature step itself.
> Currently there's code in a lot of places to decide whether a feature
> should be enabled. There's code when parsing the configuration, building
> the custom intermediate representation in a way that is later used by
> different code in a builder class, which then decides whether feature A
> or feature B should be used.
> Instead, it's much cleaner to let a feature decide things for itself.
> If the config to enable feature A exists, it processes the config and
> sets up the necessary k8s descriptors. If it doesn't, the feature is
> a no-op.
> This simplifies the shared code that calls into the existing features
> a lot. And does not make the existing features any more complicated.
> As part of this I merged the different language binding feature steps
> into a single step. They are sort of related, in the sense that if
> one is applied the others shouldn't, and merging them makes the logic
> to implement that cleaner.
> The driver and executor builders are now also much simpler, since they
> have no logic about what steps to apply or not. The tests were removed
> because of that, and some new tests were added to the suites for
> specific features, to verify what the old builder suites were testing.
> On top of the above I made a few minor changes (in comparison):
> - KubernetesVolumeUtils was modified to just throw exceptions. The old
>   code tried to avoid throwing exceptions by collecting results in `Try`
>   objects. That was not achieving anything since all the callers would
>   just call `get` on those objects, and the first one with a failure
>   would just throw the exception. The new code achieves the same
>   behavior and is simpler.
> - A bunch of small things, mainly to bring the code in line with the
>   usual Spark code style. I also removed unnecessary mocking in tests,
>   unused imports, and unused configs and constants.
> - Added some basic tests for KerberosConfDriverFeatureStep.

[jira] [Resolved] (SPARK-25874) Simplify abstractions in the K8S backend

2019-08-29 Thread Marcelo Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25874.

Fix Version/s: 3.0.0
   Resolution: Done

> Simplify abstractions in the K8S backend
> 
>
> Key: SPARK-25874
> URL: https://issues.apache.org/jira/browse/SPARK-25874
> Project: Spark
>  Issue Type: Umbrella
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
> Fix For: 3.0.0
>
>
> I spent some time recently re-familiarizing myself with the k8s backend, and 
> I think there is room for improvement. In the past, SPARK-22839 was added 
> which improved things a lot, but it is still hard to follow the code, and it 
> is still more complicated than it should be to add a new feature.
> I've worked on the main things that were bothering me and came up with these 
> changes:
> https://github.com/vanzin/spark/commits/k8s-simple
> Now that patch (first commit of the branch) is a little large, which makes it 
> hard to review and to properly assess what it is doing. So I plan to break it 
> down into a few steps that I will file as sub-tasks and send for review 
> independently.
> The commit message for that patch has a lot of the background of what I 
> changed and why. Since I plan to delete that branch after the work is done, 
> I'll paste it here:
> {noformat}
> There are two main changes happening here.
> (1) Simplify the KubernetesConf abstraction.
> The current code around KubernetesConf has a few drawbacks:
> - it uses composition (with a type parameter) for role-specific configuration
> - it breaks encapsulation of the user configuration, held in SparkConf, by
>   requiring that all the k8s-specific info is extracted from SparkConf before
>   the KubernetesConf object is created.
> - the above is usually done by parsing the SparkConf info into k8s-backend
>   types, which are then transformed into k8s requests.
> This ends up requiring a whole lot of code that is just not necessary.
> The type parameters make parameter and class declarations full of needless
> noise; the breakage of encapsulation makes the code that processes SparkConf
> and the code that builds the k8s descriptors live in different places, and
> the intermediate representation isn't adding much value.
> By using inheritance instead of the current model, role-specific
> specialization of certain config properties works simply by implementing some
> abstract methods of the base class (instead of breaking encapsulation), and
> there's no need anymore for parameterized types.
> By moving config processing to the code that actually transforms the config
> into k8s descriptors, a lot of intermediate boilerplate can be removed.
> This leads to...
> (2) Make all feature logic part of the feature step itself.
> Currently there's code in a lot of places to decide whether a feature
> should be enabled. There's code when parsing the configuration, building
> the custom intermediate representation in a way that is later used by
> different code in a builder class, which then decides whether feature A
> or feature B should be used.
> Instead, it's much cleaner to let a feature decide things for itself.
> If the config to enable feature A exists, it processes the config and
> sets up the necessary k8s descriptors. If it doesn't, the feature is
> a no-op.
> This simplifies the shared code that calls into the existing features
> a lot. And does not make the existing features any more complicated.
> As part of this I merged the different language binding feature steps
> into a single step. They are sort of related, in the sense that if
> one is applied the others shouldn't, and merging them makes the logic
> to implement that cleaner.
> The driver and executor builders are now also much simpler, since they
> have no logic about what steps to apply or not. The tests were removed
> because of that, and some new tests were added to the suites for
> specific features, to verify what the old builder suites were testing.
> On top of the above I made a few minor changes (in comparison):
> - KubernetesVolumeUtils was modified to just throw exceptions. The old
>   code tried to avoid throwing exceptions by collecting results in `Try`
>   objects. That was not achieving anything since all the callers would
>   just call `get` on those objects, and the first one with a failure
>   would just throw the exception. The new code achieves the same
>   behavior and is simpler.
> - A bunch of small things, mainly to bring the code in line with the
>   usual Spark code style. I also removed unnecessary mocking in tests,
>   unused imports, and unused configs and constants.
> - Added some basic tests for KerberosConfDriverFeatureStep.
> Note that there may still be leftover intermediate 

[jira] [Resolved] (SPARK-27611) Redundant javax.activation dependencies in the Maven build

2019-08-29 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27611.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/24507

> Redundant javax.activation dependencies in the Maven build
> --
>
> Key: SPARK-27611
> URL: https://issues.apache.org/jira/browse/SPARK-27611
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
> Fix For: 3.0.0
>
>
> [PR #23890|https://github.com/apache/spark/pull/23890] introduced 
> {{org.glassfish.jaxb:jaxb-runtime:2.3.2}} as a runtime dependency. As an 
> unexpected side effect, {{jakarta.activation:jakarta.activation-api:1.2.1}} 
> was also pulled in as a transitive dependency. As a result, for the Maven 
> build, both of the following two jars can be found under 
> {{assembly/target/scala-2.12/jars}}:
> {noformat}
> activation-1.1.1.jar
> jakarta.activation-api-1.2.1.jar
> {noformat}
> Discussed this with [~srowen] offline and we agreed that we should probably 
> exclude the Jakarta one.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28807) Document SHOW DATABASES in SQL Reference.

2019-08-29 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-28807.
-
Fix Version/s: 3.0.0
 Assignee: Dilip Biswal
   Resolution: Fixed

> Document SHOW DATABASES in SQL Reference.
> -
>
> Key: SPARK-28807
> URL: https://issues.apache.org/jira/browse/SPARK-28807
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28917) Jobs can hang because of race of RDD.dependencies

2019-08-29 Thread Imran Rashid (Jira)
Imran Rashid created SPARK-28917:


 Summary: Jobs can hang because of race of RDD.dependencies
 Key: SPARK-28917
 URL: https://issues.apache.org/jira/browse/SPARK-28917
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.4.3, 2.3.3
Reporter: Imran Rashid


{{RDD.dependencies}} stores the precomputed cache value, but it is not 
thread-safe.  This can lead to a race where the value gets overwritten, but the 
DAGScheduler gets stuck in an inconsistent state.  In particular, this can 
happen when there is a race between the DAGScheduler event loop, and another 
thread (eg. a user thread, if there is multi-threaded job submission).


First, a job is submitted by the user, which then computes the result Stage and 
its parents:

https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L983

Which eventually makes a call to {{rdd.dependencies}}:

https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L519

At the same time, the user could also touch {{rdd.dependencies}} in another 
thread, which could overwrite the stored value because of the race.

Then the DAGScheduler checks the dependencies *again* later on in the job 
submission, via {{getMissingParentStages}}

https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1025

Because it will find new dependencies, it will create entirely different 
stages.  Now the job has some orphaned stages which will never run.

The symptoms of this are seeing disjoint sets of stages in the "Parents of 
final stage" and the "Missing parents" messages on job submission, as well as 
seeing repeated messages "Registered RDD X" for the same RDD id.  eg:

{noformat}
[INFO] 2019-08-15 23:22:31,570 org.apache.spark.SparkContext logInfo - Starting 
job: count at XXX.scala:462
...
[INFO] 2019-08-15 23:22:31,573 org.apache.spark.scheduler.DAGScheduler logInfo 
- Registering RDD 14 (repartition at XXX.scala:421)
...
...
[INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo 
- Got job 1 (count at XXX.scala:462) with 40 output partitions
[INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo 
- Final stage: ResultStage 5 (count at XXX.scala:462)
[INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo 
- Parents of final stage: List(ShuffleMapStage 4)
[INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler logInfo 
- Registering RDD 14 (repartition at XXX.scala:421)
[INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler logInfo 
- Missing parents: List(ShuffleMapStage 6)
{noformat}

Note that there is a similar issue w/ {{rdd.partitions}}. I don't see a way it 
could mess up the scheduler (it seems it's only used for 
{{rdd.partitions.length}}).  There is also an issue that {{rdd.storageLevel}} 
is read and cached in the scheduler, but it could be modified simultaneously by 
the user in another thread.  Similarly, I can't see a way it could affect the 
scheduler.

*WORKAROUND*:
(a) call {{rdd.dependencies}} while you know that RDD is only getting touched 
by one thread (eg. in the thread that created it, or before you submit multiple 
jobs touching that RDD from other threads). Then that value will get cached.
(b) don't submit jobs from multiple threads.
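A minimal spark-shell sketch of workaround (a), with a hypothetical RDD:
{code:scala}
// Touch rdd.dependencies from a single thread so the cached value is in place
// before multi-threaded job submission can race with the DAGScheduler.
val rdd = sc.parallelize(1 to 1000000).map(_ * 2).repartition(40)
rdd.dependencies  // forces the dependency list to be computed and cached here

// Only after that, submit jobs touching this RDD from other threads.
new Thread(() => println(rdd.count())).start()
new Thread(() => println(rdd.sum())).start()
{code}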



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28786) Document INSERT statement in SQL Reference.

2019-08-29 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-28786.
-
Fix Version/s: 3.0.0
 Assignee: Huaxin Gao
   Resolution: Fixed

> Document INSERT statement in SQL Reference.
> ---
>
> Key: SPARK-28786
> URL: https://issues.apache.org/jira/browse/SPARK-28786
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28917) Jobs can hang because of race of RDD.dependencies

2019-08-29 Thread Imran Rashid (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918764#comment-16918764
 ] 

Imran Rashid commented on SPARK-28917:
--

[~markhamstra] [~jiangxb1987] [~tgraves] [~Ngone51] I would appreciate your 
thoughts on this.  I think the bug I've described above is pretty clear.  
However, the part I'm wondering about a bit more is whether there is more 
mutability in RDD that could cause problems.

In the case of this that I have seen, I only know for sure that the user is 
calling {{rdd.cache()}} in another thread.  But I can't see how that would lead 
to the symptoms I describe above.  I don't know that they are doing anything in 
the user thread which would touch {{rdd.dependencies}}, but I also don't have 
full visibility into everything they are doing, so this still seems like the 
best explanation to me.

> Jobs can hang because of race of RDD.dependencies
> -
>
> Key: SPARK-28917
> URL: https://issues.apache.org/jira/browse/SPARK-28917
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Imran Rashid
>Priority: Major
>
> {{RDD.dependencies}} stores the precomputed cache value, but it is not 
> thread-safe.  This can lead to a race where the value gets overwritten, but 
> the DAGScheduler gets stuck in an inconsistent state.  In particular, this 
> can happen when there is a race between the DAGScheduler event loop, and 
> another thread (eg. a user thread, if there is multi-threaded job submission).
> First, a job is submitted by the user, which then computes the result Stage 
> and its parents:
> https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L983
> Which eventually makes a call to {{rdd.dependencies}}:
> https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L519
> At the same time, the user could also touch {{rdd.dependencies}} in another 
> thread, which could overwrite the stored value because of the race.
> Then the DAGScheduler checks the dependencies *again* later on in the job 
> submission, via {{getMissingParentStages}}
> https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1025
> Because it will find new dependencies, it will create entirely different 
> stages.  Now the job has some orphaned stages which will never run.
> The symptoms of this are seeing disjoint sets of stages in the "Parents of 
> final stage" and the "Missing parents" messages on job submission, as well as 
> seeing repeated messages "Registered RDD X" for the same RDD id.  eg:
> {noformat}
> [INFO] 2019-08-15 23:22:31,570 org.apache.spark.SparkContext logInfo - 
> Starting job: count at XXX.scala:462
> ...
> [INFO] 2019-08-15 23:22:31,573 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Registering RDD 14 (repartition at XXX.scala:421)
> ...
> ...
> [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Got job 1 (count at XXX.scala:462) with 40 output partitions
> [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Final stage: ResultStage 5 (count at XXX.scala:462)
> [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Parents of final stage: List(ShuffleMapStage 4)
> [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Registering RDD 14 (repartition at XXX.scala:421)
> [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Missing parents: List(ShuffleMapStage 6)
> {noformat}
> Note that there is a similar issue w/ {{rdd.partitions}}. I don't see a way 
> it could mess up the scheduler (it seems it's only used for 
> {{rdd.partitions.length}}).  There is also an issue that {{rdd.storageLevel}} 
> is read and cached in the scheduler, but it could be modified simultaneously 
> by the user in another thread.  Similarly, I can't see a way it could affect 
> the scheduler.
> *WORKAROUND*:
> (a) call {{rdd.dependencies}} while you know that RDD is only getting touched 
> by one thread (eg. in the thread that created it, or before you submit 
> multiple jobs touching that RDD from other threads). Then that value will get 
> cached.
> (b) don't submit jobs from multiple threads.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28916) Generated SpecificSafeProjection.apply method grows beyond 64 KB when use SparkSQL

2019-08-29 Thread MOBIN (Jira)
MOBIN created SPARK-28916:
-

 Summary: Generated SpecificSafeProjection.apply method grows 
beyond 64 KB when use  SparkSQL
 Key: SPARK-28916
 URL: https://issues.apache.org/jira/browse/SPARK-28916
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3, 2.3.1
Reporter: MOBIN


Can be reproduced by the following steps:

1. Create a table with 5000 fields

2. val data=spark.sql("select * from spark64kb limit 10");

3. data.describe()
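A spark-shell sketch of an equivalent repro, assuming a generated 5000-column 
dataset rather than an existing Hive table (column names c0..c4999 are made 
up):
{code:scala}
import org.apache.spark.sql.functions.lit

val wide = spark.range(10).select((0 until 5000).map(i => lit(i).as(s"c$i")): _*)
// describe() across thousands of columns can push the generated apply() past 64 KB.
wide.describe()
{code}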

Then, the following error occurred:
{code:java}
WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 0, localhost, 
executor 1): org.codehaus.janino.InternalCompilerException: failed to compile: 
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code 
of method "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection"
 grows beyond 64 KB
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1298)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1376)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1373)
  at 
org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
  at 
org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
  at 
org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
  at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
  at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
  at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
  at 
org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.generate(GenerateMutableProjection.scala:44)
  at 
org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:385)
  at 
org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:96)
  at 
org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:95)
  at 
org.apache.spark.sql.execution.aggregate.AggregationIterator.generateProcessRow(AggregationIterator.scala:180)
  at 
org.apache.spark.sql.execution.aggregate.AggregationIterator.(AggregationIterator.scala:199)
  at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.(SortBasedAggregationIterator.scala:40)
  at 
org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:86)
  at 
org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:77)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:121)
  at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.codehaus.janino.InternalCompilerException: Compiling 
"GeneratedClass": Code of method "apply(Ljava/lang/Object;)Ljava/lang/Object;" 
of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection"
 grows beyond 64 KB
  at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382)
  at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237)
  at 
org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
  at 
org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
  at 

[jira] [Updated] (SPARK-28915) The new keyword is not used when instantiating the WorkerOffer.

2019-08-29 Thread Jiaqi Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaqi Li updated SPARK-28915:
-
Description: 
WorkerOffer is a case class, so we don't need to use new when instantiating it, 
which makes code more concise.
{code:java}
private[spark]
case class WorkerOffer(
executorId: String,
host: String,
cores: Int,
address: Option[String] = None,
resources: Map[String, Buffer[String]] = Map.empty)
{code}

  was:`WorkerOffer` is a case class, so we don't need to use `new` when 
instantiating it, which makes code more concise.


> The new keyword is not used when instantiating the WorkerOffer.
> ---
>
> Key: SPARK-28915
> URL: https://issues.apache.org/jira/browse/SPARK-28915
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Jiaqi Li
>Priority: Minor
>
> WorkerOffer is a case class, so we don't need to use new when instantiating 
> it, which makes code more concise.
> {code:java}
> private[spark]
> case class WorkerOffer(
> executorId: String,
> host: String,
> cores: Int,
> address: Option[String] = None,
> resources: Map[String, Buffer[String]] = Map.empty)
> {code}
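Given the definition quoted above, a one-line sketch of the proposed style (the 
argument values are made up):
{code:scala}
// The case class's synthesized apply makes `new` redundant at call sites,
// e.g. instead of: new WorkerOffer("exec-1", "host-1", 4)
val offer = WorkerOffer("exec-1", "host-1", cores = 4)
{code}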



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28915) The new keyword is not used when instantiating the WorkerOffer.

2019-08-29 Thread Jiaqi Li (Jira)
Jiaqi Li created SPARK-28915:


 Summary: The new keyword is not used when instantiating the 
WorkerOffer.
 Key: SPARK-28915
 URL: https://issues.apache.org/jira/browse/SPARK-28915
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.3
Reporter: Jiaqi Li


`WorkerOffer` is a case class, so we don't need to use `new` when instantiating 
it, which makes code more concise.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28914) GMM Returning Clusters With Equal Probability

2019-08-29 Thread Nirek Sharma (Jira)
Nirek Sharma created SPARK-28914:


 Summary: GMM Returning Clusters With Equal Probability 
 Key: SPARK-28914
 URL: https://issues.apache.org/jira/browse/SPARK-28914
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 2.4.3
 Environment: 
https://gist.github.com/sharmanirek/172c9408e8393462ae54dfae83764413

Test Data & notebook: 

https://drive.google.com/open?id=1YEBsUJv9p2XaTQzsKI9GxLKVD7oe1tan
Reporter: Nirek Sharma


Fitting and transforming with a GMM yields all categories with equal 
probability. See the gist below, which demonstrates the code to reproduce the 
problem.
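A hedged spark-shell sketch of the kind of check being reported (the toy data 
and k=2 are made up; the actual repro lives in the linked gist and notebook):
{code:scala}
import org.apache.spark.ml.clustering.GaussianMixture
import org.apache.spark.ml.linalg.Vectors

val data = Seq(Vectors.dense(0.0, 0.1), Vectors.dense(0.2, 0.0),
               Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 9.0)).map(Tuple1.apply)
val df = spark.createDataFrame(data).toDF("features")
val model = new GaussianMixture().setK(2).setSeed(1L).fit(df)
model.transform(df).select("features", "probability").show(false)
// If the bug reproduces, every row's probability vector comes back ~[0.5, 0.5].
{code}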



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28913) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-08-29 Thread Qiang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Wang updated SPARK-28913:
---
Description: 
The stack trace is below:
{quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for task 
325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
java.lang.ArrayIndexOutOfBoundsException: 6741 at 
org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
 at 
org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
 at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
 at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
 at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
 at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
 at scala.collection.immutable.List.foreach(List.scala:381) at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) 
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
org.apache.spark.scheduler.Task.run(Task.scala:108) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745)
{quote}
This exception happened intermittently.  We also found that the AUC metric was 
not stable when evaluating the inner product of the user factors and the item 
factors with the same dataset and configuration: AUC varied from 0.60 to 0.67, 
which is not stable enough for a production environment.

Dataset size: ~12 billion ratings
Here is our code:
{code:java}
val hivedata = sc.sql(sqltext).select(id,dpid,score).coalesce(numPartitions)
val predataItem =  hivedata.rdd.map(r=>(r._1._1,(r._1._2,r._2.sum)))
  .groupByKey().zipWithIndex()
  .persist(StorageLevel.MEMORY_AND_DISK_SER)
val predataUser = 
predataItem.flatMap(r=>r._1._2.map(y=>(y._1,(r._2.toInt,y._2))))
  .aggregateByKey(zeroValueArr,numPartitions)((a,b)=> a += b,(a,b)=>a ++ 
b).map(r=>(r._1,r._2.toIterable))
  .zipWithIndex().persist(StorageLevel.MEMORY_AND_DISK_SER)
//x._2 is the item_id, y._1 is the user_id, y._2 is the rating
val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, 
y._2.toFloat)))
  .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER)

case class ALSData(user:Int, item:Int, rating:Float) extends Serializable
val ratingData = trainData.map(x => ALSData(x._1, x._2, x._3)).toDF()
val als = new ALS
val paramMap = ParamMap(als.alpha -> 25000).
  put(als.checkpointInterval, 5).
  put(als.implicitPrefs, true).
  put(als.itemCol, "item").
  put(als.maxIter, 60).
  put(als.nonnegative, false).
  

[jira] [Updated] (SPARK-28913) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-08-29 Thread Qiang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Wang updated SPARK-28913:
---
Summary: ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS 
for datasets  with 12 billion instances  (was: ArrayIndexOutOfBoundsException 
in ALS for datasets  with 12 billion instances)

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
>  with 12 billion instances
> 
>
> Key: SPARK-28913
> URL: https://issues.apache.org/jira/browse/SPARK-28913
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Assignee: Xiangrui Meng
>Priority: Major
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> This exception happened intermittently. 
> Dataset size: ~12 billion ratings
> Here is our code:
> {code:java}
> val hivedata = sc.sql(sqltext).select(id,dpid,score).coalesce(numPartitions)
> val predataItem =  hivedata.rdd.map(r=>(r._1._1,(r._1._2,r._2.sum)))
>   .groupByKey().zipWithIndex()
>   .persist(StorageLevel.MEMORY_AND_DISK_SER)
> val predataUser = 
> predataItem.flatMap(r=>r._1._2.map(y=>(y._1,(r._2.toInt,y._2))))
>   .aggregateByKey(zeroValueArr,numPartitions)((a,b)=> a += b,(a,b)=>a ++ 
> b).map(r=>(r._1,r._2.toIterable))
>   .zipWithIndex().persist(StorageLevel.MEMORY_AND_DISK_SER)
> //x._2 is 

[jira] [Updated] (SPARK-28913) ArrayIndexOutOfBoundsException in ALS for datasets with 12 billion instances

2019-08-29 Thread Qiang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Wang updated SPARK-28913:
---
Description: 
The stack trace is below:
{quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for task 
325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
java.lang.ArrayIndexOutOfBoundsException: 6741 at 
org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
 at 
org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
 at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
 at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
 at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
 at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
 at scala.collection.immutable.List.foreach(List.scala:381) at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) 
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
org.apache.spark.scheduler.Task.run(Task.scala:108) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745)
{quote}
This exception happened intermittently.

Dataset size: ~12 billion ratings
Here is our code:
{code:java}
val hivedata = sc.sql(sqltext).select(id,dpid,score).coalesce(numPartitions)
val predataItem =  hivedata.rdd.map(r=>(r._1._1,(r._1._2,r._2.sum)))
  .groupByKey().zipWithIndex()
  .persist(StorageLevel.MEMORY_AND_DISK_SER)
val predataUser = 
predataItem.flatMap(r=>r._1._2.map(y=>(y._1,(r._2.toInt,y._2))))
  .aggregateByKey(zeroValueArr,numPartitions)((a,b)=> a += b,(a,b)=>a ++ 
b).map(r=>(r._1,r._2.toIterable))
  .zipWithIndex().persist(StorageLevel.MEMORY_AND_DISK_SER)
//x._2 is the item_id, y._1 is the user_id, y._2 is the rating
val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, 
y._2.toFloat)))
  .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER)

case class ALSData(user:Int, item:Int, rating:Float) extends Serializable
val ratingData = trainData.map(x => ALSData(x._1, x._2, x._3)).toDF()
val als = new ALS
val paramMap = ParamMap(als.alpha -> 25000).
  put(als.checkpointInterval, 5).
  put(als.implicitPrefs, true).
  put(als.itemCol, "item").
  put(als.maxIter, 60).
  put(als.nonnegative, false).
  put(als.numItemBlocks, 600).
  put(als.numUserBlocks, 600).
  put(als.regParam, 4.5).
  put(als.rank, 25).
  put(als.userCol, "user")
als.fit(ratingData, paramMap)
{code}

  was:
The stack trace is below:
{quote}19/08/28 07:00:40 WARN 

[jira] [Updated] (SPARK-28913) ArrayIndexOutOfBoundsException in ALS for datasets with 12 billion instances

2019-08-29 Thread Qiang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Wang updated SPARK-28913:
---
Description: 
The stack trace is below:
{quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for task 
325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
java.lang.ArrayIndexOutOfBoundsException: 6741 at 
org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
 at 
org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
 at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
 at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
 at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
 at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
 at scala.collection.immutable.List.foreach(List.scala:381) at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) 
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
org.apache.spark.scheduler.Task.run(Task.scala:108) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745)
{quote}
This happened intermittently. Dataset size: ~12 billion ratings
Here is our code:
{code:java}

val hivedata = sc.sql(sqltext).select(id,dpid,score).coalesce(numPartitions)
val predataItem =  hivedata.rdd.map(r=>(r._1._1,(r._1._2,r._2.sum)))
  .groupByKey().zipWithIndex()
  .persist(StorageLevel.MEMORY_AND_DISK_SER)
val predataUser = 
predataItem.flatMap(r=>r._1._2.map(y=>(y._1,(r._2.toInt,y._2))))
  .aggregateByKey(zeroValueArr,numPartitions)((a,b)=> a += b,(a,b)=>a ++ 
b).map(r=>(r._1,r._2.toIterable))
  .zipWithIndex().persist(StorageLevel.MEMORY_AND_DISK_SER)
//x._2 is the item_id, y._1 is the user_id, y._2 is the rating
val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, 
y._2.toFloat)))
  .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER)

case class ALSData(user:Int, item:Int, rating:Float) extends Serializable
val ratingData = trainData.map(x => ALSData(x._1, x._2, x._3)).toDF()
val als = new ALS
val paramMap = ParamMap(als.alpha -> 25000).
  put(als.checkpointInterval, 5).
  put(als.implicitPrefs, true).
  put(als.itemCol, "item").
  put(als.maxIter, 60).
  put(als.nonnegative, false).
  put(als.numItemBlocks, 600).
  put(als.numUserBlocks, 600).
  put(als.regParam, 4.5).
  put(als.rank, 25).
  put(als.userCol, "user")
als.fit(ratingData, paramMap)
{code}

  was:
The stack trace is below:
{quote}19/08/28 07:00:40 WARN Executor task 

[jira] [Updated] (SPARK-28913) ArrayIndexOutOfBoundsException in ALS for datasets with 12 billion instances

2019-08-29 Thread Qiang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Wang updated SPARK-28913:
---
Summary: ArrayIndexOutOfBoundsException in ALS for datasets  with 12 
billion instances  (was: ArrayIndexOutOfBoundsException in ALS for Large 
datasets)

> ArrayIndexOutOfBoundsException in ALS for datasets  with 12 billion instances
> -
>
> Key: SPARK-28913
> URL: https://issues.apache.org/jira/browse/SPARK-28913
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Assignee: Xiangrui Meng
>Priority: Major
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745){quote}
> This happened after the dataset was sub-sampled. 
>  Dataset properties: ~12B ratings
>  Setup: 55 r3.8xlarge ec2 instances



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28913) ArrayIndexOutOfBoundsException in ALS for Large datasets

2019-08-29 Thread Qiang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Wang updated SPARK-28913:
---
Description: 
The stack trace is below:
{quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for task 
325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
java.lang.ArrayIndexOutOfBoundsException: 6741 at 
org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
 at 
org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
 at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
 at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
 at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
 at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
 at scala.collection.immutable.List.foreach(List.scala:381) at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) 
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
org.apache.spark.scheduler.Task.run(Task.scala:108) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745){quote}
This happened after the dataset was sub-sampled. 
 Dataset properties: ~12B ratings
 Setup: 55 r3.8xlarge ec2 instances

  was:
The stack trace is below:

{quote}
java.lang.ArrayIndexOutOfBoundsException: 2716

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)

org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)

org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)

org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)

org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)

org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)


[jira] [Updated] (SPARK-28913) ArrayIndexOutOfBoundsException in ALS for Large datasets

2019-08-29 Thread Qiang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Wang updated SPARK-28913:
---
Summary: ArrayIndexOutOfBoundsException in ALS for Large datasets  (was: 
CLONE - ArrayIndexOutOfBoundsException in ALS for Large datasets)

> ArrayIndexOutOfBoundsException in ALS for Large datasets
> 
>
> Key: SPARK-28913
> URL: https://issues.apache.org/jira/browse/SPARK-28913
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Assignee: Xiangrui Meng
>Priority: Major
>
> The stack trace is below:
> {quote}
> java.lang.ArrayIndexOutOfBoundsException: 2716
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> 
> org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> {quote}
> This happened after the dataset was sub-sampled. 
> Dataset properties: ~12B ratings
> Setup: 55 r3.8xlarge ec2 instances



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28913) CLONE - ArrayIndexOutOfBoundsException in ALS for Large datasets

2019-08-29 Thread Qiang Wang (Jira)
Qiang Wang created SPARK-28913:
--

 Summary: CLONE - ArrayIndexOutOfBoundsException in ALS for Large 
datasets
 Key: SPARK-28913
 URL: https://issues.apache.org/jira/browse/SPARK-28913
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0, 1.2.0
Reporter: Qiang Wang
Assignee: Xiangrui Meng


The stack trace is below:

{quote}
java.lang.ArrayIndexOutOfBoundsException: 2716

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)

org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)

org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)

org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)

org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)

org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)

scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
{quote}
This happened after the dataset was sub-sampled. 
Dataset properties: ~12B ratings
Setup: 55 r3.8xlarge ec2 instances



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28913) CLONE - ArrayIndexOutOfBoundsException in ALS for Large datasets

2019-08-29 Thread Qiang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Wang updated SPARK-28913:
---
Affects Version/s: (was: 1.2.0)
   (was: 1.1.0)
   2.2.1

> CLONE - ArrayIndexOutOfBoundsException in ALS for Large datasets
> 
>
> Key: SPARK-28913
> URL: https://issues.apache.org/jira/browse/SPARK-28913
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Assignee: Xiangrui Meng
>Priority: Major
>
> The stack trace is below:
> {quote}
> java.lang.ArrayIndexOutOfBoundsException: 2716
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> 
> org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> {quote}
> This happened after the dataset was sub-sampled. 
> Dataset properties: ~12B ratings
> Setup: 55 r3.8xlarge ec2 instances



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28912) MatchError exception in CheckpointWriteHandler

2019-08-29 Thread Aleksandr Kashkirov (Jira)
Aleksandr Kashkirov created SPARK-28912:
---

 Summary: MatchError exception in CheckpointWriteHandler
 Key: SPARK-28912
 URL: https://issues.apache.org/jira/browse/SPARK-28912
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.2, 2.3.0
Reporter: Aleksandr Kashkirov


Setting the checkpoint directory name to "checkpoint-" plus some digits (e.g. 
"checkpoint-01") results in the following error:
{code:java}
Exception in thread "pool-32-thread-1" scala.MatchError: 
0523a434-0daa-4ea6-a050-c4eb3c557d8c (of class java.lang.String) 
 at 
org.apache.spark.streaming.Checkpoint$.org$apache$spark$streaming$Checkpoint$$sortFunc$1(Checkpoint.scala:121)
 
 at 
org.apache.spark.streaming.Checkpoint$$anonfun$getCheckpointFiles$1.apply(Checkpoint.scala:132)
 
 at 
org.apache.spark.streaming.Checkpoint$$anonfun$getCheckpointFiles$1.apply(Checkpoint.scala:132)
 
 at scala.math.Ordering$$anon$9.compare(Ordering.scala:200) 
 at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355) 
 at java.util.TimSort.sort(TimSort.java:234) 
 at java.util.Arrays.sort(Arrays.java:1438) 
 at scala.collection.SeqLike$class.sorted(SeqLike.scala:648) 
 at scala.collection.mutable.ArrayOps$ofRef.sorted(ArrayOps.scala:186) 
 at scala.collection.SeqLike$class.sortWith(SeqLike.scala:601) 
 at scala.collection.mutable.ArrayOps$ofRef.sortWith(ArrayOps.scala:186) 
 at 
org.apache.spark.streaming.Checkpoint$.getCheckpointFiles(Checkpoint.scala:132) 
 at 
org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:262)
 
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
 at java.lang.Thread.run(Thread.java:748){code}
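
A minimal reproduction sketch (assumptions: local master and a socket source; 
the object and application names are illustrative, not taken from this report):
{code:java}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointNameRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-name-repro").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    // The directory name itself matches a "checkpoint-<digits>" pattern, which
    // appears to be what trips the MatchError in Checkpoint.getCheckpointFiles above.
    ssc.checkpoint("checkpoint-01")
    ssc.socketTextStream("localhost", 9999).count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}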



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28911) Unify Kafka source option pattern

2019-08-29 Thread wenxuanguan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wenxuanguan updated SPARK-28911:

Description: 
The pattern for data source options is camelCase, such as checkpointLocation, 
but some Kafka source options are dot-separated, such as fetchOffset.numRetries.
We can also distinguish the original Kafka options by their pattern, such as 
kafka.bootstrap.servers.
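
To illustrate the option styles that currently coexist (a sketch; the broker 
address and topic name are placeholders):
{code:java}
import org.apache.spark.sql.SparkSession

object KafkaOptionStyles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-option-styles").getOrCreate()
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092") // "kafka."-prefixed: passed through to the Kafka client
      .option("subscribe", "topic")                   // lower-case source option
      .option("startingOffsets", "earliest")          // camelCase source option
      .option("fetchOffset.numRetries", "3")          // dot-separated source option
      .load()
    df.printSchema()
  }
}
{code}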

> Unify Kafka source option pattern
> -
>
> Key: SPARK-28911
> URL: https://issues.apache.org/jira/browse/SPARK-28911
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: wenxuanguan
>Priority: Major
>
> The pattern for data source options is camelCase, such as checkpointLocation, 
> but some Kafka source options are dot-separated, such as 
> fetchOffset.numRetries.
> We can also distinguish the original Kafka options by their pattern, such as 
> kafka.bootstrap.servers.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28911) Unify Kafka source option pattern

2019-08-29 Thread wenxuanguan (Jira)
wenxuanguan created SPARK-28911:
---

 Summary: Unify Kafka source option pattern
 Key: SPARK-28911
 URL: https://issues.apache.org/jira/browse/SPARK-28911
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: wenxuanguan






--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28910) Prevent schema verification when connecting to in memory derby

2019-08-29 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-28910:
-

 Summary: Prevent schema verification when connecting to in memory 
derby
 Key: SPARK-28910
 URL: https://issues.apache.org/jira/browse/SPARK-28910
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.3
Reporter: Juliusz Sompolski


When hive.metastore.schema.verification=true, HiveUtils.newClientForExecution 
fails with 
{{19/08/14 13:26:55 WARN Hive: Failed to access metastore. This class should 
not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:186)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:143)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:290)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForExecution(HiveUtils.scala:275)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.startWithContext(HiveThriftServer2.scala:58)
...
Caused by: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient}}


This prevents the Thrift server from starting.
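
A rough sketch of how this code path is reached (assumed setup; the 
configuration key forwarding and the object name are illustrative):
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object ThriftServerWithVerification {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("thriftserver-with-schema-verification")
      // Forward the Hive setting that triggers the failure above.
      .config("spark.hadoop.hive.metastore.schema.verification", "true")
      .enableHiveSupport()
      .getOrCreate()
    // startWithContext is the entry point shown in the stack trace.
    HiveThriftServer2.startWithContext(spark.sqlContext)
  }
}
{code}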



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28909) ANSI SQL: delete and update does not support in Spark

2019-08-29 Thread ABHISHEK KUMAR GUPTA (Jira)
ABHISHEK KUMAR GUPTA created SPARK-28909:


 Summary: ANSI SQL: delete and update does not support in Spark
 Key: SPARK-28909
 URL: https://issues.apache.org/jira/browse/SPARK-28909
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: ABHISHEK KUMAR GUPTA


DELETE and UPDATE are supported in PostgreSQL:

create table emp_test(id int);
insert into emp_test values(100);
insert into emp_test values(200);
select * from emp_test;
*delete from emp_test where id=100;*
select * from emp_test;
*update emp_test set id=500 where id=200;*
select * from emp_test;
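
For context, the closest workaround in current Spark is to rewrite the data 
instead of mutating it in place. A hedged sketch (column and table names follow 
the example above; "emp_test_rewritten" is an illustrative target table):
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

object DeleteUpdateWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("delete-update-workaround")
      .enableHiveSupport()
      .getOrCreate()

    // "delete from emp_test where id=100": keep every row except id = 100.
    val afterDelete = spark.table("emp_test").filter(col("id") =!= 100)

    // "update emp_test set id=500 where id=200": conditional column rewrite.
    val afterUpdate = afterDelete
      .withColumn("id", when(col("id") === 200, 500).otherwise(col("id")))

    // Write to a separate table; overwriting the table being read is not allowed.
    afterUpdate.write.mode("overwrite").saveAsTable("emp_test_rewritten")
  }
}
{code}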



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28340) Noisy exceptions when tasks are killed: "DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file: java.nio.channels.ClosedByInterruptException

2019-08-29 Thread Saisai Shao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918386#comment-16918386
 ] 

Saisai Shao commented on SPARK-28340:
-

My main concern is that there may be other places that can throw this 
"ClosedByInterruptException" during task killing; it seems hard to track down 
all of them.

> Noisy exceptions when tasks are killed: "DiskBlockObjectWriter: Uncaught 
> exception while reverting partial writes to file: 
> java.nio.channels.ClosedByInterruptException"
> 
>
> Key: SPARK-28340
> URL: https://issues.apache.org/jira/browse/SPARK-28340
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Minor
>
> If a Spark task is killed while writing blocks to disk (due to intentional 
> job kills, automated killing of redundant speculative tasks, etc) then Spark 
> may log exceptions like
> {code:java}
> 19/07/10 21:31:08 ERROR storage.DiskBlockObjectWriter: Uncaught exception 
> while reverting partial writes to file /
> java.nio.channels.ClosedByInterruptException
>   at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>   at sun.nio.ch.FileChannelImpl.truncate(FileChannelImpl.java:372)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter$$anonfun$revertPartialWritesAndClose$2.apply$mcV$sp(DiskBlockObjectWriter.scala:218)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1369)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:214)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:237)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:105)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748){code}
> If {{BypassMergeSortShuffleWriter}} is being used then a single cancelled 
> task can result in hundreds of these stacktraces being logged.
> Here are some StackOverflow questions asking about this:
>  * [https://stackoverflow.com/questions/40027870/spark-jobserver-job-crash]
>  * 
> [https://stackoverflow.com/questions/50646953/why-is-java-nio-channels-closedbyinterruptexceptio-called-when-caling-multiple]
>  * 
> [https://stackoverflow.com/questions/41867053/java-nio-channels-closedbyinterruptexception-in-spark]
>  * 
> [https://stackoverflow.com/questions/56845041/are-closedbyinterruptexception-exceptions-expected-when-spark-speculation-kills]
>  
> Can we prevent this exception from occurring? If not, can we treat this 
> "expected exception" in a special manner to avoid log spam? My concern is 
> that the presence of large numbers of spurious exceptions is confusing to 
> users when they are inspecting Spark logs to diagnose other issues.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28908) Structured Streaming Kafka sink support Exactly-Once semantics

2019-08-29 Thread wenxuanguan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wenxuanguan updated SPARK-28908:

Affects Version/s: (was: 2.4.3)
   3.0.0

> Structured Streaming Kafka sink support Exactly-Once semantics
> --
>
> Key: SPARK-28908
> URL: https://issues.apache.org/jira/browse/SPARK-28908
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: wenxuanguan
>Priority: Major
> Attachments: Kafka Sink Exactly-Once Semantics Design Sketch.pdf
>
>
> Since Apache Kafka supports transactions as of 0.11.0.0, we can implement 
> Kafka sink exactly-once semantics with a transactional Kafka producer.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28908) Structured Streaming Kafka sink support Exactly-Once semantics

2019-08-29 Thread wenxuanguan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wenxuanguan updated SPARK-28908:

Attachment: Kafka Sink Exactly-Once Semantics Design Sketch.pdf

> Structured Streaming Kafka sink support Exactly-Once semantics
> --
>
> Key: SPARK-28908
> URL: https://issues.apache.org/jira/browse/SPARK-28908
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: wenxuanguan
>Priority: Major
> Attachments: Kafka Sink Exactly-Once Semantics Design Sketch.pdf
>
>
> Since Apache Kafka supports transactions as of 0.11.0.0, we can implement 
> Kafka sink exactly-once semantics with a transactional Kafka producer.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28908) Structured Streaming Kafka sink support Exactly-Once semantics

2019-08-29 Thread wenxuanguan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wenxuanguan updated SPARK-28908:

Attachment: Kafka Sink Exactly-Once Semantics Design Sketch.pdf

> Structured Streaming Kafka sink support Exactly-Once semantics
> --
>
> Key: SPARK-28908
> URL: https://issues.apache.org/jira/browse/SPARK-28908
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: wenxuanguan
>Priority: Major
>
> Since Apache Kafka supports transactions as of 0.11.0.0, we can implement 
> Kafka sink exactly-once semantics with a transactional Kafka producer.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28908) Structured Streaming Kafka sink support Exactly-Once semantics

2019-08-29 Thread wenxuanguan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wenxuanguan updated SPARK-28908:

Attachment: (was: Kafka Sink Exactly-Once Semantics Design Sketch.pdf)

> Structured Streaming Kafka sink support Exactly-Once semantics
> --
>
> Key: SPARK-28908
> URL: https://issues.apache.org/jira/browse/SPARK-28908
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: wenxuanguan
>Priority: Major
>
> Since Apache Kafka supports transactions as of 0.11.0.0, we can implement 
> Kafka sink exactly-once semantics with a transactional Kafka producer.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28908) Structured Streaming Kafka sink support Exactly-Once semantics

2019-08-29 Thread wenxuanguan (Jira)
wenxuanguan created SPARK-28908:
---

 Summary: Structured Streaming Kafka sink support Exactly-Once 
semantics
 Key: SPARK-28908
 URL: https://issues.apache.org/jira/browse/SPARK-28908
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.3
Reporter: wenxuanguan






--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28908) Structured Streaming Kafka sink support Exactly-Once semantics

2019-08-29 Thread wenxuanguan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wenxuanguan updated SPARK-28908:

Description: Since Apache Kafka supports transactions as of 0.11.0.0, we can 
implement Kafka sink exactly-once semantics with a transactional Kafka producer.
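
To illustrate the transactional producer API referenced in the description (a 
plain Kafka client sketch, not Spark sink code; the broker, topic, and 
transactional.id values are placeholders):
{code:java}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TransactionalProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "host:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("transactional.id", "spark-kafka-sink-task-0") // stable id per producer instance
    props.put("enable.idempotence", "true")

    val producer = new KafkaProducer[String, String](props)
    producer.initTransactions()
    try {
      producer.beginTransaction()
      producer.send(new ProducerRecord[String, String]("topic", "key", "value"))
      producer.commitTransaction() // records become visible to read_committed consumers atomically
    } catch {
      case e: Exception =>
        producer.abortTransaction()
        throw e
    } finally {
      producer.close()
    }
  }
}
{code}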

> Structured Streaming Kafka sink support Exactly-Once semantics
> --
>
> Key: SPARK-28908
> URL: https://issues.apache.org/jira/browse/SPARK-28908
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: wenxuanguan
>Priority: Major
>
> Since Apache Kafka supports transactions as of 0.11.0.0, we can implement 
> Kafka sink exactly-once semantics with a transactional Kafka producer.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28678) Specify that start index is 1-based in docstring of pyspark.sql.functions.slice

2019-08-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918371#comment-16918371
 ] 

Hyukjin Kwon commented on SPARK-28678:
--

It doesn't need to be assigned to anyone. If you open a PR, the issue will be 
assigned automatically.

> Specify that start index is 1-based in docstring of 
> pyspark.sql.functions.slice
> ---
>
> Key: SPARK-28678
> URL: https://issues.apache.org/jira/browse/SPARK-28678
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Sivam Pasupathipillai
>Priority: Trivial
>
> The start index parameter in pyspark.sql.functions.slice should be 1-based, 
> otherwise the method fails with an exception.
> This is not documented anywhere. Fix the docstring.
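
The 1-based behavior described above can also be seen from the Scala API (an 
illustrative sketch; this ticket itself only tracks the pyspark docstring fix):
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, slice}

object SliceStartIndex {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("slice-start-index")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(Seq(10, 20, 30, 40)).toDF("xs")
    // start = 1 refers to the first element; start = 0 fails at runtime.
    df.select(slice(col("xs"), 1, 2)).show() // [10, 20]
  }
}
{code}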



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28907) Review invalid usage of new Configuration()

2019-08-29 Thread Xianjin YE (Jira)
Xianjin YE created SPARK-28907:
--

 Summary: Review invalid usage of new Configuration()
 Key: SPARK-28907
 URL: https://issues.apache.org/jira/browse/SPARK-28907
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.3
Reporter: Xianjin YE


As [~LI,Xiao] mentioned in 
[https://github.com/apache/spark/pull/25002#pullrequestreview-279392994], there 
may be other incorrect usages of new Configuration() that lead to unexpected 
results.

This JIRA reviews those usages in Spark and replaces them where necessary.
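
To make the concern concrete, a hedged sketch of why a bare new Configuration() 
can surprise users (the property name is only an example):
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession

object HadoopConfUsage {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hadoop-conf-usage")
      .config("spark.hadoop.fs.s3a.connection.maximum", "200") // user-supplied override
      .getOrCreate()

    val standalone = new Configuration()                    // only sees *-site.xml on the classpath
    val fromSpark  = spark.sparkContext.hadoopConfiguration // includes spark.hadoop.* overrides

    println(standalone.get("fs.s3a.connection.maximum"))    // typically null
    println(fromSpark.get("fs.s3a.connection.maximum"))     // "200"
  }
}
{code}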



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28678) Specify that start index is 1-based in docstring of pyspark.sql.functions.slice

2019-08-29 Thread Ting Yang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918341#comment-16918341
 ] 

Ting Yang commented on SPARK-28678:
---

I'd like to work on this. Can you assign it to me, [~hyukjin.kwon]? Thanks.

> Specify that start index is 1-based in docstring of 
> pyspark.sql.functions.slice
> ---
>
> Key: SPARK-28678
> URL: https://issues.apache.org/jira/browse/SPARK-28678
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Sivam Pasupathipillai
>Priority: Trivial
>
> The start index parameter in pyspark.sql.functions.slice should be 1-based, 
> otherwise the method fails with an exception.
> This is not documented anywhere. Fix the docstring.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28340) Noisy exceptions when tasks are killed: "DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file: java.nio.channels.ClosedByInterruptException

2019-08-29 Thread Saisai Shao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918306#comment-16918306
 ] 

Saisai Shao commented on SPARK-28340:
-

We also saw a bunch of these exceptions in our production environment. It looks 
hard to prevent unless we stop using `interrupt`; maybe we can just skip 
logging such exceptions.
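
One way the "skip the logging" idea could look (a sketch only, not actual Spark 
code; revertQuietly and its parameters are hypothetical):
{code:java}
import java.nio.channels.ClosedByInterruptException
import org.slf4j.LoggerFactory

object QuietRevert {
  private val log = LoggerFactory.getLogger(getClass)

  // revert: the cleanup action (e.g. reverting partial writes);
  // taskKilled: whether the task was interrupted deliberately.
  def revertQuietly(revert: () => Unit, taskKilled: Boolean): Unit = {
    try revert()
    catch {
      case e: ClosedByInterruptException if taskKilled =>
        // Expected when the task thread is interrupted by a kill; not actionable.
        log.debug("Ignoring ClosedByInterruptException while reverting partial writes", e)
      case e: Exception =>
        log.error("Uncaught exception while reverting partial writes", e)
    }
  }
}
{code}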

> Noisy exceptions when tasks are killed: "DiskBlockObjectWriter: Uncaught 
> exception while reverting partial writes to file: 
> java.nio.channels.ClosedByInterruptException"
> 
>
> Key: SPARK-28340
> URL: https://issues.apache.org/jira/browse/SPARK-28340
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Minor
>
> If a Spark task is killed while writing blocks to disk (due to intentional 
> job kills, automated killing of redundant speculative tasks, etc) then Spark 
> may log exceptions like
> {code:java}
> 19/07/10 21:31:08 ERROR storage.DiskBlockObjectWriter: Uncaught exception 
> while reverting partial writes to file /
> java.nio.channels.ClosedByInterruptException
>   at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>   at sun.nio.ch.FileChannelImpl.truncate(FileChannelImpl.java:372)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter$$anonfun$revertPartialWritesAndClose$2.apply$mcV$sp(DiskBlockObjectWriter.scala:218)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1369)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:214)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:237)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:105)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748){code}
> If {{BypassMergeSortShuffleWriter}} is being used then a single cancelled 
> task can result in hundreds of these stacktraces being logged.
> Here are some StackOverflow questions asking about this:
>  * [https://stackoverflow.com/questions/40027870/spark-jobserver-job-crash]
>  * 
> [https://stackoverflow.com/questions/50646953/why-is-java-nio-channels-closedbyinterruptexceptio-called-when-caling-multiple]
>  * 
> [https://stackoverflow.com/questions/41867053/java-nio-channels-closedbyinterruptexception-in-spark]
>  * 
> [https://stackoverflow.com/questions/56845041/are-closedbyinterruptexception-exceptions-expected-when-spark-speculation-kills]
>  
> Can we prevent this exception from occurring? If not, can we treat this 
> "expected exception" in a special manner to avoid log spam? My concern is 
> that the presence of large numbers of spurious exceptions is confusing to 
> users when they are inspecting Spark logs to diagnose other issues.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org