[jira] [Commented] (SPARK-29022) SparkSQLCLI can not use 'ADD JAR' 's jar as Serder class

2019-09-16 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931116#comment-16931116
 ] 

angerszhu commented on SPARK-29022:
---

Since Spark 2.0, when we call HiveClientImpl#withHiveState, it sets the classloader 
of HiveClientImpl's state back to the original one on exit:
{code:java}
def withHiveState[A](f: => A): A = retryLocked {
  val original = Thread.currentThread().getContextClassLoader
  val originalConfLoader = state.getConf.getClassLoader
  // The classloader in clientLoader could be changed after addJar, always use the latest
  // classloader. We explicitly set the context class loader since "conf.setClassLoader" does
  // not do that, and the Hive client libraries may need to load classes defined by the client's
  // class loader.
  Thread.currentThread().setContextClassLoader(clientLoader.classLoader)
  state.getConf.setClassLoader(clientLoader.classLoader)
  // Set the thread local metastore client to the client associated with this HiveClientImpl.
  Hive.set(client)
  // Replace conf in the thread local Hive with current conf
  Hive.get(conf)
  // setCurrentSessionState will use the classLoader associated
  // with the HiveConf in `state` to override the context class loader of the current thread.
  shim.setCurrentSessionState(state)
  val ret = try f finally {
    state.getConf.setClassLoader(originalConfLoader)
    Thread.currentThread().setContextClassLoader(original)
    HiveCatalogMetrics.incrementHiveClientCalls(1)
  }
  ret
}
{code}
Before 2.0, it simply set the classloader of the state's conf to 
clientLoader.classLoader and did not set it back:

 
{code:java}
/**
 * Runs `f` with ThreadLocal session state and classloaders configured for this version of hive.
 */
def withHiveState[A](f: => A): A = retryLocked {
  val original = Thread.currentThread().getContextClassLoader
  // Set the thread local metastore client to the client associated with this HiveClientImpl.
  Hive.set(client)
  // The classloader in clientLoader could be changed after addJar, always use the latest
  // classloader
  state.getConf.setClassLoader(clientLoader.classLoader)
  // setCurrentSessionState will use the classLoader associated
  // with the HiveConf in `state` to override the context class loader of the current thread.
  shim.setCurrentSessionState(state)
  val ret = try f finally {
    Thread.currentThread().setContextClassLoader(original)
  }
  ret
}
{code}
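
To make the consequence concrete, here is a minimal, hypothetical sketch (not Spark source; only the jar path and SerDe class name come from the reproduction quoted below) of why a class loaded through {{clientLoader.classLoader}} becomes invisible once the original context classloader is restored:
{code:java}
import java.io.File
import java.net.URLClassLoader

object ClassLoaderResetSketch {
  def main(args: Array[String]): Unit = {
    val original = Thread.currentThread().getContextClassLoader
    // Stand-in for clientLoader.classLoader after ADD JAR: the hcatalog jar is visible here.
    val withJar = new URLClassLoader(
      Array(new File("/root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar").toURI.toURL),
      original)

    // Inside withHiveState the SerDe resolves fine through the jar-aware loader.
    Thread.currentThread().setContextClassLoader(withJar)
    println(Class.forName("org.apache.hive.hcatalog.data.JsonSerDe", false, withJar))

    // The finally block restores the original loader...
    Thread.currentThread().setContextClassLoader(original)

    // ...so a later lookup through the context classloader (roughly what
    // TableDesc.getDeserializerClass does) no longer sees the class and throws
    // ClassNotFoundException when the jar is not on the original classpath.
    Class.forName("org.apache.hive.hcatalog.data.JsonSerDe", false,
      Thread.currentThread().getContextClassLoader)
  }
}
{code}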

> SparkSQLCLI can not use 'ADD JAR' 's jar as Serder class
> 
>
> Key: SPARK-29022
> URL: https://issues.apache.org/jira/browse/SPARK-29022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: angerszhu
>Priority: Major
> Attachments: image-2019-09-17-13-54-50-896.png
>
>
> Spark SQL CLI can't use classes from jars added by SQL 'ADD JAR'
> {code:java}
> spark-sql> add jar 
> /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar;
> ADD JAR 
> /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar
> spark-sql> CREATE TABLE addJar(key string) ROW FORMAT SERDE 
> 'org.apache.hive.hcatalog.data.JsonSerDe';
> spark-sql> select * from addJar;
> 19/09/07 03:06:54 ERROR SparkSQLDriver: Failed in [select * from addJar]
> java.lang.RuntimeException: java.lang.ClassNotFoundException: 
> org.apache.hive.hcatalog.data.JsonSerDe
>   at 
> org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:79)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.addColumnMetadataToConf(HiveTableScanExec.scala:123)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf$lzycompute(HiveTableScanExec.scala:101)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf(HiveTableScanExec.scala:98)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader$lzycompute(HiveTableScanExec.scala:110)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader(HiveTableScanExec.scala:105)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.$anonfun$doExecute$1(HiveTableScanExec.scala:188)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2488)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:188)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:189)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:227)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> 

[jira] [Updated] (SPARK-29022) SparkSQLCLI can not use 'ADD JAR' 's jar as Serder class

2019-09-16 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29022:
--
Attachment: image-2019-09-17-13-54-50-896.png

> SparkSQLCLI can not use 'ADD JAR' 's jar as Serder class
> 
>
> Key: SPARK-29022
> URL: https://issues.apache.org/jira/browse/SPARK-29022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: angerszhu
>Priority: Major
> Attachments: image-2019-09-17-13-54-50-896.png
>
>
> Spark SQL CLI can't use classes from jars added by SQL 'ADD JAR'
> {code:java}
> spark-sql> add jar 
> /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar;
> ADD JAR 
> /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar
> spark-sql> CREATE TABLE addJar(key string) ROW FORMAT SERDE 
> 'org.apache.hive.hcatalog.data.JsonSerDe';
> spark-sql> select * from addJar;
> 19/09/07 03:06:54 ERROR SparkSQLDriver: Failed in [select * from addJar]
> java.lang.RuntimeException: java.lang.ClassNotFoundException: 
> org.apache.hive.hcatalog.data.JsonSerDe
>   at 
> org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:79)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.addColumnMetadataToConf(HiveTableScanExec.scala:123)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf$lzycompute(HiveTableScanExec.scala:101)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf(HiveTableScanExec.scala:98)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader$lzycompute(HiveTableScanExec.scala:110)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader(HiveTableScanExec.scala:105)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.$anonfun$doExecute$1(HiveTableScanExec.scala:188)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2488)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:188)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:189)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:227)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:224)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:185)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:329)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:378)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:408)
>   at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:52)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$4(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:367)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:272)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:920)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:179)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:202)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:89)
>   at 
> 

[jira] [Commented] (SPARK-28467) Tests failed if there are not enough executors up before running

2019-09-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1693#comment-1693
 ] 

Dongjoon Hyun commented on SPARK-28467:
---

Since this is not in `Closed` status, you can reopen it with that new 
information, [~kabhwan].

> Tests failed if there are not enough executors up before running
> 
>
> Key: SPARK-28467
> URL: https://issues.apache.org/jira/browse/SPARK-28467
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Minor
>
> We ran unit tests on an arm64 instance, and some tests failed because the 
> executors could not come up within the timeout of 1 ms:
> - test driver discovery under local-cluster mode *** FAILED ***
>   java.util.concurrent.TimeoutException: Can't find 1 executors before 1 
> milliseconds elapsed
>   at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:293)
>   at 
> org.apache.spark.SparkContextSuite.$anonfun$new$78(SparkContextSuite.scala:753)
>   at 
> org.apache.spark.SparkContextSuite.$anonfun$new$78$adapted(SparkContextSuite.scala:741)
>   at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161)
>   at 
> org.apache.spark.SparkContextSuite.$anonfun$new$77(SparkContextSuite.scala:741)
>   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   
> - test gpu driver resource files and discovery under local-cluster mode *** 
> FAILED ***
>   java.util.concurrent.TimeoutException: Can't find 1 executors before 1 
> milliseconds elapsed
>   at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:293)
>   at 
> org.apache.spark.SparkContextSuite.$anonfun$new$80(SparkContextSuite.scala:781)
>   at 
> org.apache.spark.SparkContextSuite.$anonfun$new$80$adapted(SparkContextSuite.scala:761)
>   at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161)
>   at 
> org.apache.spark.SparkContextSuite.$anonfun$new$79(SparkContextSuite.scala:761)
>   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
> Then we increased the timeout to 2 (or 3) and the tests passed. I found 
> there are earlier issues about increasing this timeout, see: 
> https://issues.apache.org/jira/browse/SPARK-7989 and 
> https://issues.apache.org/jira/browse/SPARK-10651 
> I think the timeout does not work well, and there seems to be no principle behind 
> the timeout setting. How can I fix this? Could I increase the timeout for these 
> two tests?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2019-09-16 Thread Rajiv Bandi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931110#comment-16931110
 ] 

Rajiv Bandi commented on SPARK-6305:


Hi All,

          I (at my work) am facing an issue where the logs of my Spark application 
randomly do not roll properly. Also, our security team is flagging log4j 1.x as 
a security risk. I have tried excluding log4j 1.x and including log4j 2 in the Spark 
libraries, but it did not work. Hence, I would like to know whether the Spark team is 
considering moving away from log4j 1.x. I am asking because I did not 
see any updates here in the last year.
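
For reference, a hedged sketch of what that exclusion attempt looks like, assuming an 
sbt build (artifact versions are placeholders); whether it works in practice is limited 
by the shaded Spark jars, which is what this ticket is about:
{code:java}
// build.sbt sketch (assumption: an sbt-based application build; versions are placeholders).
// Exclude the log4j 1.x binding pulled in through Spark and route slf4j to log4j 2.
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-sql" % "2.4.4")
    .exclude("log4j", "log4j")
    .exclude("org.slf4j", "slf4j-log4j12"),
  "org.apache.logging.log4j" % "log4j-api"        % "2.12.1",
  "org.apache.logging.log4j" % "log4j-core"       % "2.12.1",
  "org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.12.1"  // slf4j -> log4j 2 binding
)
{code}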

Thanks

Rajiv

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29022) SparkSQLCLI can not use 'ADD JAR' 's jar as Serder class

2019-09-16 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931094#comment-16931094
 ] 

Yuming Wang commented on SPARK-29022:
-

It works before Spark 2.0.

> SparkSQLCLI can not use 'ADD JAR' 's jar as Serder class
> 
>
> Key: SPARK-29022
> URL: https://issues.apache.org/jira/browse/SPARK-29022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Spark SQL CLI can't use classes from jars added by SQL 'ADD JAR'
> {code:java}
> spark-sql> add jar 
> /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar;
> ADD JAR 
> /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar
> spark-sql> CREATE TABLE addJar(key string) ROW FORMAT SERDE 
> 'org.apache.hive.hcatalog.data.JsonSerDe';
> spark-sql> select * from addJar;
> 19/09/07 03:06:54 ERROR SparkSQLDriver: Failed in [select * from addJar]
> java.lang.RuntimeException: java.lang.ClassNotFoundException: 
> org.apache.hive.hcatalog.data.JsonSerDe
>   at 
> org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:79)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.addColumnMetadataToConf(HiveTableScanExec.scala:123)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf$lzycompute(HiveTableScanExec.scala:101)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf(HiveTableScanExec.scala:98)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader$lzycompute(HiveTableScanExec.scala:110)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader(HiveTableScanExec.scala:105)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.$anonfun$doExecute$1(HiveTableScanExec.scala:188)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2488)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:188)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:189)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:227)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:224)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:185)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:329)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:378)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:408)
>   at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:52)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$4(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:367)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:272)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:920)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:179)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:202)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:89)
>   at 
> 

[jira] [Commented] (SPARK-29110) Add window.sql - Part 4

2019-09-16 Thread Sandeep Katta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931093#comment-16931093
 ] 

Sandeep Katta commented on SPARK-29110:
---

[~DylanGuedes] I see that some of the commands in the SQL file are not supported by 
Spark, so what is your expectation? Do we need to modify the SQL statements to fit 
Spark and then test?

> Add window.sql - Part 4
> ---
>
> Key: SPARK-29110
> URL: https://issues.apache.org/jira/browse/SPARK-29110
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
> Fix For: 3.0.0
>
>
> In this ticket, we plan to add the regression test cases of 
> [https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L912-L1259|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L912-L1259]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29022) SparkSQLCLI can not use 'ADD JAR' 's jar as Serder class

2019-09-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29022:

Affects Version/s: 2.0.2
   2.1.3
   2.2.3
   2.3.4
   2.4.4

> SparkSQLCLI can not use 'ADD JAR' 's jar as Serder class
> 
>
> Key: SPARK-29022
> URL: https://issues.apache.org/jira/browse/SPARK-29022
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Spark SQL CLI can't use classes from jars added by SQL 'ADD JAR'
> {code:java}
> spark-sql> add jar 
> /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar;
> ADD JAR 
> /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar
> spark-sql> CREATE TABLE addJar(key string) ROW FORMAT SERDE 
> 'org.apache.hive.hcatalog.data.JsonSerDe';
> spark-sql> select * from addJar;
> 19/09/07 03:06:54 ERROR SparkSQLDriver: Failed in [select * from addJar]
> java.lang.RuntimeException: java.lang.ClassNotFoundException: 
> org.apache.hive.hcatalog.data.JsonSerDe
>   at 
> org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:79)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.addColumnMetadataToConf(HiveTableScanExec.scala:123)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf$lzycompute(HiveTableScanExec.scala:101)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf(HiveTableScanExec.scala:98)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader$lzycompute(HiveTableScanExec.scala:110)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader(HiveTableScanExec.scala:105)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.$anonfun$doExecute$1(HiveTableScanExec.scala:188)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2488)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:188)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:189)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:227)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:224)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:185)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:329)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:378)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:408)
>   at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:52)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$4(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:367)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:272)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:920)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:179)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:202)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:89)
>   

[jira] [Updated] (SPARK-29022) SparkSQLCLI can not use 'ADD JAR' 's jar as Serder class

2019-09-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29022:

Parent: (was: SPARK-23710)
Issue Type: Bug  (was: Sub-task)

> SparkSQLCLI can not use 'ADD JAR' 's jar as Serder class
> 
>
> Key: SPARK-29022
> URL: https://issues.apache.org/jira/browse/SPARK-29022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Spark SQL CLI can't use classes from jars added by SQL 'ADD JAR'
> {code:java}
> spark-sql> add jar 
> /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar;
> ADD JAR 
> /root/.m2/repository/org/apache/hive/hcatalog/hive-hcatalog-core/2.3.6/hive-hcatalog-core-2.3.6.jar
> spark-sql> CREATE TABLE addJar(key string) ROW FORMAT SERDE 
> 'org.apache.hive.hcatalog.data.JsonSerDe';
> spark-sql> select * from addJar;
> 19/09/07 03:06:54 ERROR SparkSQLDriver: Failed in [select * from addJar]
> java.lang.RuntimeException: java.lang.ClassNotFoundException: 
> org.apache.hive.hcatalog.data.JsonSerDe
>   at 
> org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:79)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.addColumnMetadataToConf(HiveTableScanExec.scala:123)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf$lzycompute(HiveTableScanExec.scala:101)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf(HiveTableScanExec.scala:98)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader$lzycompute(HiveTableScanExec.scala:110)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopReader(HiveTableScanExec.scala:105)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.$anonfun$doExecute$1(HiveTableScanExec.scala:188)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2488)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:188)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:189)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:227)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:224)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:185)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:329)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:378)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:408)
>   at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:52)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$4(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:367)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:272)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:920)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:179)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:202)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:89)
>   at 
> 

[jira] [Issue Comment Deleted] (SPARK-29110) Add window.sql - Part 4

2019-09-16 Thread Sandeep Katta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Katta updated SPARK-29110:
--
Comment: was deleted

(was: [~DylanGuedes] thank you for this, soon I will raise the PR for this)

> Add window.sql - Part 4
> ---
>
> Key: SPARK-29110
> URL: https://issues.apache.org/jira/browse/SPARK-29110
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
> Fix For: 3.0.0
>
>
> In this ticket, we plan to add the regression test cases of 
> [https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L912-L1259|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L912-L1259]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29110) Add window.sql - Part 4

2019-09-16 Thread Sandeep Katta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931082#comment-16931082
 ] 

Sandeep Katta commented on SPARK-29110:
---

[~DylanGuedes] thank you for this, soon I will raise the PR for this

> Add window.sql - Part 4
> ---
>
> Key: SPARK-29110
> URL: https://issues.apache.org/jira/browse/SPARK-29110
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
> Fix For: 3.0.0
>
>
> In this ticket, we plan to add the regression test cases of 
> [https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L912-L1259|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L912-L1259]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29109) Add window.sql - Part 3

2019-09-16 Thread Udbhav Agrawal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931073#comment-16931073
 ] 

Udbhav Agrawal commented on SPARK-29109:


I will work on this.


> Add window.sql - Part 3
> ---
>
> Key: SPARK-29109
> URL: https://issues.apache.org/jira/browse/SPARK-29109
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
> Fix For: 3.0.0
>
>
> In this ticket, we plan to add the regression test cases of 
> [https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L553-L911|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L553-L911]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29114) Dataset.coalesce(10) throw ChunkFetchFailureException when original Dataset size is big

2019-09-16 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-29114:

Priority: Major  (was: Blocker)

> Dataset.coalesce(10) throw ChunkFetchFailureException when original 
> Dataset size is big
> 
>
> Key: SPARK-29114
> URL: https://issues.apache.org/jira/browse/SPARK-29114
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.3.0
>Reporter: ZhanxiongWang
>Priority: Major
>
> I create a Dataset df with 200 partitions. I applied for 100 executors for my 
> task, each executor with 1 core; driver memory is 8G and executor memory is 16G. 
> I call df.cache() before df.coalesce(10). When the {color:#de350b}Dataset size is 
> small{color}, the program works well. But when I {color:#de350b}increase{color} 
> the size of the Dataset, {color:#de350b}df.coalesce(10){color} throws 
> ChunkFetchFailureException.
> 19/09/17 08:26:44 INFO CoarseGrainedExecutorBackend: Got assigned task 210
> 19/09/17 08:26:44 INFO Executor: Running task 0.0 in stage 3.0 (TID 210)
> 19/09/17 08:26:44 INFO MapOutputTrackerWorker: Updating epoch to 1 and 
> clearing cache
> 19/09/17 08:26:44 INFO TorrentBroadcast: Started reading broadcast variable 
> 1003
> 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003_piece0 stored as 
> bytes in memory (estimated size 49.4 KB, free 3.8 GB)
> 19/09/17 08:26:44 INFO TorrentBroadcast: Reading broadcast variable 1003 took 
> 7 ms
> 19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003 stored as values in 
> memory (estimated size 154.5 KB, free 3.8 GB)
> 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_0 locally
> 19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_1 locally
> 19/09/17 08:26:44 INFO TransportClientFactory: Successfully created 
> connection to /100.76.29.130:54238 after 1 ms (0 ms spent in bootstraps)
> 19/09/17 08:26:46 ERROR RetryingBlockFetcher: Failed to fetch block 
> rdd_1005_18, and will not retry (0 retries)
> org.apache.spark.network.client.ChunkFetchFailureException: Failure while 
> fetching StreamChunkId\{streamId=69368607002, chunkIndex=0}: readerIndex: 0, 
> writerIndex: -2137154997 (expected: 0 <= readerIndex <= writerIndex <= 
> capacity(-2137154997))
>  at 
> org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:182)
>  at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278)
>  at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278)
>  at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278)
>  at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278)
>  at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:962)
>  at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:485)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:399)
>  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:371)
>  at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
>  at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
>  at java.lang.Thread.run(Thread.java:745)
> 19/09/17 08:26:46 WARN BlockManager: Failed to fetch block after 1 fetch 
> failures. Most recent failure cause:
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>  at 

[jira] [Commented] (SPARK-29108) Add window.sql - Part 2

2019-09-16 Thread pavithra ramachandran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931071#comment-16931071
 ] 

pavithra ramachandran commented on SPARK-29108:
---

I shall work on this.

> Add window.sql - Part 2
> ---
>
> Key: SPARK-29108
> URL: https://issues.apache.org/jira/browse/SPARK-29108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
> Fix For: 3.0.0
>
>
> In this ticket, we plan to add the regression test cases of 
> [https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L320-L562|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L320-L562]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29114) Dataset.coalesce(10) throw ChunkFetchFailureException when original Dataset size is big

2019-09-16 Thread ZhanxiongWang (Jira)
ZhanxiongWang created SPARK-29114:
-

 Summary: Dataset.coalesce(10) throw 
ChunkFetchFailureException when original Dataset size is big
 Key: SPARK-29114
 URL: https://issues.apache.org/jira/browse/SPARK-29114
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 2.3.0
Reporter: ZhanxiongWang


I create a Dataset df with 200 partitions. I applied for 100 executors for my task, 
each executor with 1 core; driver memory is 8G and executor memory is 16G. I call 
df.cache() before df.coalesce(10). When the {color:#de350b}Dataset size is 
small{color}, the program works well. But when I {color:#de350b}increase{color} 
the size of the Dataset, {color:#de350b}df.coalesce(10){color} throws 
ChunkFetchFailureException.
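
For clarity, a minimal sketch of the cache-then-coalesce pattern described above (the 
input path and session setup are placeholders, not taken from the actual job); the 
executor log follows below.
{code:java}
// Hedged repro sketch: cache a large Dataset with 200 partitions, then coalesce to 10.
// Only the pattern comes from the report; the input path is a placeholder.
import org.apache.spark.sql.SparkSession

object CoalesceRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("coalesce-repro").getOrCreate()

    val df = spark.read.parquet("/path/to/large/input")  // placeholder input
      .repartition(200)                                  // 200 partitions, as in the report
      .cache()

    df.count()               // materialize the cache
    df.coalesce(10).count()  // the step that fails with ChunkFetchFailureException
    spark.stop()
  }
}
{code}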

19/09/17 08:26:44 INFO CoarseGrainedExecutorBackend: Got assigned task 210
19/09/17 08:26:44 INFO Executor: Running task 0.0 in stage 3.0 (TID 210)
19/09/17 08:26:44 INFO MapOutputTrackerWorker: Updating epoch to 1 and clearing 
cache
19/09/17 08:26:44 INFO TorrentBroadcast: Started reading broadcast variable 1003
19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003_piece0 stored as bytes 
in memory (estimated size 49.4 KB, free 3.8 GB)
19/09/17 08:26:44 INFO TorrentBroadcast: Reading broadcast variable 1003 took 7 
ms
19/09/17 08:26:44 INFO MemoryStore: Block broadcast_1003 stored as values in 
memory (estimated size 154.5 KB, free 3.8 GB)
19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_0 locally
19/09/17 08:26:44 INFO BlockManager: Found block rdd_1005_1 locally
19/09/17 08:26:44 INFO TransportClientFactory: Successfully created connection 
to /100.76.29.130:54238 after 1 ms (0 ms spent in bootstraps)
19/09/17 08:26:46 ERROR RetryingBlockFetcher: Failed to fetch block 
rdd_1005_18, and will not retry (0 retries)
org.apache.spark.network.client.ChunkFetchFailureException: Failure while 
fetching StreamChunkId\{streamId=69368607002, chunkIndex=0}: readerIndex: 0, 
writerIndex: -2137154997 (expected: 0 <= readerIndex <= writerIndex <= 
capacity(-2137154997))
 at 
org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:182)
 at 
org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278)
 at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278)
 at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278)
 at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
 at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292)
 at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278)
 at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:962)
 at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:485)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:399)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:371)
 at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
 at java.lang.Thread.run(Thread.java:745)
19/09/17 08:26:46 WARN BlockManager: Failed to fetch block after 1 fetch 
failures. Most recent failure cause:
org.apache.spark.SparkException: Exception thrown in awaitResult: 
 at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
 at 
org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:115)
 at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:691)
 at 
org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:634)
 at org.apache.spark.storage.BlockManager.get(BlockManager.scala:747)
 at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:802)
 at 

[jira] [Updated] (SPARK-29104) Fix Flaky Test - PipedRDDSuite. stdin_writer_thread_should_be_exited_when_task_is_finished

2019-09-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29104:
--
Description: 
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6867/testReport/junit/org.apache.spark.rdd/PipedRDDSuite/stdin_writer_thread_should_be_exited_when_task_is_finished/



  was:
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.rdd/PipedRDDSuite/stdin_writer_thread_should_be_exited_when_task_is_finished/




> Fix Flaky Test - PipedRDDSuite. 
> stdin_writer_thread_should_be_exited_when_task_is_finished
> --
>
> Key: SPARK-29104
> URL: https://issues.apache.org/jira/browse/SPARK-29104
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6867/testReport/junit/org.apache.spark.rdd/PipedRDDSuite/stdin_writer_thread_should_be_exited_when_task_is_finished/



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29107) Add window.sql - Part 1

2019-09-16 Thread Sharanabasappa G Keriwaddi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931064#comment-16931064
 ] 

Sharanabasappa G Keriwaddi commented on SPARK-29107:


I will take up this Jira.

> Add window.sql - Part 1
> ---
>
> Key: SPARK-29107
> URL: https://issues.apache.org/jira/browse/SPARK-29107
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
> Fix For: 3.0.0
>
>
> In this ticket, we plan to add the regression test cases of 
> https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L1-L319



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2019-09-16 Thread Adrian Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931057#comment-16931057
 ] 

Adrian Wang commented on SPARK-13446:
-

Or you can just apply the patch from SPARK-27349 and recompile your Spark. Hope 
it works!

> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.2.0
>
>
> Spark provides the HiveContext class to read data from the Hive metastore directly, 
> but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has been released, 
> it would be better to upgrade to support Hive 2.0.0.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2019-09-16 Thread Adrian Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931055#comment-16931055
 ] 

Adrian Wang commented on SPARK-13446:
-

[~jpbordi] [~headcra6] I am using MySQL as the Hive metastore backend, keeping the 
1.2.1 Hive jars in my spark/jars directory without putting any additional Hive 
jars in there, and reading from a Hive 2.x metastore service just works fine.

```
hive-beeline-1.2.1.spark2.jar  hive-cli-1.2.1.spark2.jar  
hive-exec-1.2.1.spark2.jar  hive-jdbc-1.2.1.spark2.jar  
hive-metastore-1.2.1.spark2.jar
```

That is what `ls $SPARK_HOME/jars/hive-*` returns.
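
For anyone who cannot keep the bundled 1.2.1 jars, a hedged sketch of the 
configuration-based route: {{spark.sql.hive.metastore.version}} and 
{{spark.sql.hive.metastore.jars}} are standard Spark SQL options (the Hive 2.x 
metastore support tracked here landed in 2.2.0), but the version string below is 
only an example:
{code:java}
// Hedged sketch (e.g. in spark-shell): point Spark at a Hive 2.x metastore via
// configuration instead of replacing jars by hand. The version value is an example.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-2.x-metastore")
  .config("spark.sql.hive.metastore.version", "2.0.0")
  .config("spark.sql.hive.metastore.jars", "maven")  // or a classpath containing Hive 2.x jars
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
{code}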


> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.2.0
>
>
> Spark provides the HiveContext class to read data from the Hive metastore directly, 
> but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has been released, 
> it would be better to upgrade to support Hive 2.0.0.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-29112) Expose more details when ApplicationMaster reporter faces a fatal exception

2019-09-16 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-29112:
---
Comment: was deleted

(was: https://github.corp.ebay.com/carmel/spark/pull/551)

> Expose more details when ApplicationMaster reporter faces a fatal exception
> ---
>
> Key: SPARK-29112
> URL: https://issues.apache.org/jira/browse/SPARK-29112
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Lantao Jin
>Priority: Minor
>
> In the {{ApplicationMaster.Reporter}} thread, fatal exception information is 
> swallowed. It would be better to expose it.
> A Thrift server was shut down due to a fatal exception, but there was no useful 
> information in the log.
> {code}
> 19/09/16 06:59:54,498 INFO [Reporter] yarn.ApplicationMaster:54 : Final app 
> status: FAILED, exitCode: 12, (reason: Exception was thrown 1 time(s) from 
> Reporter thread.)
> 19/09/16 06:59:54,500 ERROR [Driver] thriftserver.HiveThriftServer2:91 : 
> Error starting HiveThriftServer2
> java.lang.InterruptedException: sleep interrupted
> at java.lang.Thread.sleep(Native Method)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:160)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:708)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29112) Expose more details when ApplicationMaster reporter faces a fatal exception

2019-09-16 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931049#comment-16931049
 ] 

Lantao Jin commented on SPARK-29112:


https://github.corp.ebay.com/carmel/spark/pull/551

> Expose more details when ApplicationMaster reporter faces a fatal exception
> ---
>
> Key: SPARK-29112
> URL: https://issues.apache.org/jira/browse/SPARK-29112
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Lantao Jin
>Priority: Minor
>
> In the {{ApplicationMaster.Reporter}} thread, fatal exception information is 
> swallowed. It would be better to expose it.
> A Thrift server was shut down due to a fatal exception, but there was no useful 
> information in the log.
> {code}
> 19/09/16 06:59:54,498 INFO [Reporter] yarn.ApplicationMaster:54 : Final app 
> status: FAILED, exitCode: 12, (reason: Exception was thrown 1 time(s) from 
> Reporter thread.)
> 19/09/16 06:59:54,500 ERROR [Driver] thriftserver.HiveThriftServer2:91 : 
> Error starting HiveThriftServer2
> java.lang.InterruptedException: sleep interrupted
> at java.lang.Thread.sleep(Native Method)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:160)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:708)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29113) Some annotation errors in ApplicationCache.scala

2019-09-16 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-29113:

Issue Type: Documentation  (was: Bug)

> Some annotation errors in ApplicationCache.scala
> 
>
> Key: SPARK-29113
> URL: https://issues.apache.org/jira/browse/SPARK-29113
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29113) Some annotation errors in ApplicationCache.scala

2019-09-16 Thread feiwang (Jira)
feiwang created SPARK-29113:
---

 Summary: Some annotation errors in ApplicationCache.scala
 Key: SPARK-29113
 URL: https://issues.apache.org/jira/browse/SPARK-29113
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: feiwang






--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29112) Expose more details when ApplicationMaster reporter faces a fatal exception

2019-09-16 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-29112:
--

 Summary: Expose more details when ApplicationMaster reporter faces 
a fatal exception
 Key: SPARK-29112
 URL: https://issues.apache.org/jira/browse/SPARK-29112
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.4.4, 3.0.0
Reporter: Lantao Jin


In the {{ApplicationMaster.Reporter}} thread, fatal exception information is 
swallowed. It's better to expose it.
A Thrift server was shut down due to a fatal exception, but the log contained no 
useful information.
{code}
19/09/16 06:59:54,498 INFO [Reporter] yarn.ApplicationMaster:54 : Final app 
status: FAILED, exitCode: 12, (reason: Exception was thrown 1 time(s) from 
Reporter thread.)
19/09/16 06:59:54,500 ERROR [Driver] thriftserver.HiveThriftServer2:91 : Error 
starting HiveThriftServer2
java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:160)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:708)
{code}
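
For illustration, a minimal Scala sketch of the kind of change this ticket asks for, assuming only that the reporter thread tracks the failure count and the last thrown exception; the object and method names below are hypothetical, not Spark's actual ApplicationMaster code:

{code:scala}
// Hypothetical helper: surface the last exception instead of swallowing it.
object ReporterDiagnostics {
  def reporterFailureReason(failureCount: Int, lastFailure: Throwable): String =
    s"Exception was thrown $failureCount time(s) from Reporter thread: " +
      s"${lastFailure.getClass.getName}: ${lastFailure.getMessage}"
}
{code}

With a reason string like this, the "Final app status: FAILED" log line above would carry the exception class and message rather than only the retry count.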



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29100) Codegen with switch in InSet expression causes compilation error

2019-09-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29100.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25806
[https://github.com/apache/spark/pull/25806]

> Codegen with switch in InSet expression causes compilation error
> 
>
> Key: SPARK-29100
> URL: https://issues.apache.org/jira/browse/SPARK-29100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> SPARK-26205 adds an optimization to InSet that generates a Java switch 
> condition for certain cases. When the given set is empty, it is possible that 
> codegen causes a compilation error:
>  
> [info] - SPARK-29100: InSet with empty input set *** FAILED *** (58 
> milliseconds)                                      
> [info]   Code generation of input[0, int, true] INSET () failed:              
>                                                           
> [info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
> "generated.java": Compiling "apply(java.lang.Object _i)"; 
> apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
> size 0, now 1                                                                 
>                                           
> [info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
> "generated.java": Compiling "apply(java.lang.Object _i)"; 
> apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
> size 0, now 1                                                                 
>                                           
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1308)
>                                                                               
>           
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1386)
>                
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1383)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28870) Snapshot old event log files to support compaction

2019-09-16 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-28870:
-
Description: 
This issue tracks the effort on compacting old event log files into snapshot 
and achieve both goals, 1) reduce overall event log file size 2) speed up 
replaying event logs. It also deals with cleaning event log files as snapshot 
will provide the safe way to clean up old event log files without losing 
ability to replay whole event logs.

This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled 
event log files. This issue will be also on top of SPARK-29111 as SPARK-29111 
adds the ability to snapshot/restore from/to KVStore.

  was:
This issue tracks the effort on compacting old event log files into snapshot 
and achieve both goals, 1) reduce overall event log file size 2) speed up 
replaying event logs. It also deals with cleaning event log files as snapshot 
will provide the safe way to clean up old event log files without losing 
ability to replay whole event logs.

This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled 
event log files.


> Snapshot old event log files to support compaction
> --
>
> Key: SPARK-28870
> URL: https://issues.apache.org/jira/browse/SPARK-28870
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> This issue tracks the effort on compacting old event log files into snapshot 
> and achieve both goals, 1) reduce overall event log file size 2) speed up 
> replaying event logs. It also deals with cleaning event log files as snapshot 
> will provide the safe way to clean up old event log files without losing 
> ability to replay whole event logs.
> This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled 
> event log files. This issue will be also on top of SPARK-29111 as SPARK-29111 
> adds the ability to snapshot/restore from/to KVStore.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28870) Snapshot old event log files to support compaction

2019-09-16 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-28870:
-
Description: 
This issue tracks the effort on compacting old event log files into snapshot 
and achieve both goals, 1) reduce overall event log file size 2) speed up 
replaying event logs. It also deals with cleaning event log files as snapshot 
will provide the safe way to clean up old event log files without losing 
ability to replay whole event logs.

This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled 
event log files. This issue will be also on top of SPARK-29111 as SPARK-29111 
will add the ability to snapshot/restore from/to KVStore.

  was:
This issue tracks the effort on compacting old event log files into snapshot 
and achieve both goals, 1) reduce overall event log file size 2) speed up 
replaying event logs. It also deals with cleaning event log files as snapshot 
will provide the safe way to clean up old event log files without losing 
ability to replay whole event logs.

This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled 
event log files. This issue will be also on top of SPARK-29111 as SPARK-29111 
adds the ability to snapshot/restore from/to KVStore.


> Snapshot old event log files to support compaction
> --
>
> Key: SPARK-28870
> URL: https://issues.apache.org/jira/browse/SPARK-28870
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> This issue tracks the effort on compacting old event log files into snapshot 
> and achieve both goals, 1) reduce overall event log file size 2) speed up 
> replaying event logs. It also deals with cleaning event log files as snapshot 
> will provide the safe way to clean up old event log files without losing 
> ability to replay whole event logs.
> This issue will be on top of SPARK-28869 as SPARK-28869 will create rolled 
> event log files. This issue will be also on top of SPARK-29111 as SPARK-29111 
> will add the ability to snapshot/restore from/to KVStore.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29111) Support snapshot/restore of KVStore

2019-09-16 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-29111:


 Summary: Support snapshot/restore of KVStore
 Key: SPARK-29111
 URL: https://issues.apache.org/jira/browse/SPARK-29111
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


This issue tracks the effort of supporting snapshot/restore from/to KVStore.

Note that this issue will not touch current behavior; a follow-up issue will 
leverage the output of this issue. This is to reduce the size of each PR.
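
A purely hypothetical sketch of what a snapshot/restore hook could look like, just to make the goal concrete; the trait name and method signatures below are assumptions and not the API the PR for this ticket will introduce:

{code:scala}
import java.io.{InputStream, OutputStream}

// Hypothetical shape only; the real interface is defined by the PR.
trait KVStoreSnapshotSupport {
  def snapshot(out: OutputStream): Unit
  def restore(in: InputStream): Unit
}
{code}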



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28467) Tests failed if there are not enough executors up before running

2019-09-16 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931024#comment-16931024
 ] 

Jungtaek Lim commented on SPARK-28467:
--

Looks like it is flaky in Amplab PR builder.

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/110681/testReport/]

Would it be better to file a new issue to track this?

> Tests failed if there are not enough executors up before running
> 
>
> Key: SPARK-28467
> URL: https://issues.apache.org/jira/browse/SPARK-28467
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Minor
>
> We ran unit tests on an arm64 instance, and some tests failed because the 
> executors could not come up within the timeout of 1 ms:
> - test driver discovery under local-cluster mode *** FAILED ***
>   java.util.concurrent.TimeoutException: Can't find 1 executors before 1 
> milliseconds elapsed
>   at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:293)
>   at 
> org.apache.spark.SparkContextSuite.$anonfun$new$78(SparkContextSuite.scala:753)
>   at 
> org.apache.spark.SparkContextSuite.$anonfun$new$78$adapted(SparkContextSuite.scala:741)
>   at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161)
>   at 
> org.apache.spark.SparkContextSuite.$anonfun$new$77(SparkContextSuite.scala:741)
>   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   
> - test gpu driver resource files and discovery under local-cluster mode *** 
> FAILED ***
>   java.util.concurrent.TimeoutException: Can't find 1 executors before 1 
> milliseconds elapsed
>   at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:293)
>   at 
> org.apache.spark.SparkContextSuite.$anonfun$new$80(SparkContextSuite.scala:781)
>   at 
> org.apache.spark.SparkContextSuite.$anonfun$new$80$adapted(SparkContextSuite.scala:761)
>   at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161)
>   at 
> org.apache.spark.SparkContextSuite.$anonfun$new$79(SparkContextSuite.scala:761)
>   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
> After we increased the timeout to 2 (or 3), the tests passed. I found that 
> there are earlier issues about increasing this timeout, see: 
> https://issues.apache.org/jira/browse/SPARK-7989 and 
> https://issues.apache.org/jira/browse/SPARK-10651 
> I think the timeout doesn't work well, and there seems to be no principle behind 
> the timeout setting. How can I fix this? Could I increase the timeout for these 
> two tests?
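
For illustration, a minimal Scala sketch of the workaround discussed in the description, assuming Spark's test-only helper org.apache.spark.TestUtils is on the classpath; the 30-second value is illustrative, not a recommended setting:

{code:scala}
import org.apache.spark.{SparkContext, TestUtils}

// Give slower machines (e.g. the arm64 instance above) more time to start executors.
def waitForExecutors(sc: SparkContext): Unit =
  TestUtils.waitUntilExecutorsUp(sc, 1, 30000)
{code}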



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29049) Rename DataSourceStrategy#normalizeFilters to DataSourceStrategy#normalizeAttrNames

2019-09-16 Thread Xianyin Xin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931017#comment-16931017
 ] 

Xianyin Xin commented on SPARK-29049:
-

[~hyukjin.kwon], I updated the description. PR 
[https://github.com/apache/spark/pull/25626|https://github.com/apache/spark/pull/25626]
 will use it to normalize the attributes in {{Expression}}s.

> Rename DataSourceStrategy#normalizeFilters to 
> DataSourceStrategy#normalizeAttrNames
> ---
>
> Key: SPARK-29049
> URL: https://issues.apache.org/jira/browse/SPARK-29049
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianyin Xin
>Priority: Minor
>
> DataSourceStrategy#normalizeFilters can also be used to normalize attributes 
> in {{Expression}}s, not limited to {{Filter}}s.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29049) Rename DataSourceStrategy#normalizeFilters to DataSourceStrategy#normalizeAttrNames

2019-09-16 Thread Xianyin Xin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyin Xin updated SPARK-29049:

Description: DataSourceStrategy#normalizeFilters can also be used to 
normalize attributes in {{Expression}}, not limit to {{Filter}}.  (was: 
DataSourceStrategy#normalizeFilters can also be used to normalize attributes in 
\{{Expression}} s, not limit to \{{Filter}}s.)

> Rename DataSourceStrategy#normalizeFilters to 
> DataSourceStrategy#normalizeAttrNames
> ---
>
> Key: SPARK-29049
> URL: https://issues.apache.org/jira/browse/SPARK-29049
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianyin Xin
>Priority: Minor
>
> DataSourceStrategy#normalizeFilters can also be used to normalize attributes 
> in {{Expression}}, not limit to {{Filter}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29049) Rename DataSourceStrategy#normalizeFilters to DataSourceStrategy#normalizeAttrNames

2019-09-16 Thread Xianyin Xin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyin Xin updated SPARK-29049:

Description: DataSourceStrategy#normalizeFilters can also be used to 
normalize attributes in {{Expression}}s, not limit to {{Filter}}s.  (was: 
DataSourceStrategy#normalizeFilters can also be used to normalize attributes in 
`Expression`s, not limit to `Filter`s.)

> Rename DataSourceStrategy#normalizeFilters to 
> DataSourceStrategy#normalizeAttrNames
> ---
>
> Key: SPARK-29049
> URL: https://issues.apache.org/jira/browse/SPARK-29049
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianyin Xin
>Priority: Minor
>
> DataSourceStrategy#normalizeFilters can also be used to normalize attributes 
> in {{Expression}}s, not limit to {{Filter}}s.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29049) Rename DataSourceStrategy#normalizeFilters to DataSourceStrategy#normalizeAttrNames

2019-09-16 Thread Xianyin Xin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyin Xin updated SPARK-29049:

Description: DataSourceStrategy#normalizeFilters can also be used to 
normalize attributes in {{Expression}} s, not limit to {{Filter}} s.  (was: 
DataSourceStrategy#normalizeFilters can also be used to normalize attributes in 
{{Expression}}s, not limit to {{Filter}}s.)

> Rename DataSourceStrategy#normalizeFilters to 
> DataSourceStrategy#normalizeAttrNames
> ---
>
> Key: SPARK-29049
> URL: https://issues.apache.org/jira/browse/SPARK-29049
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianyin Xin
>Priority: Minor
>
> DataSourceStrategy#normalizeFilters can also be used to normalize attributes 
> in {{Expression}} s, not limit to {{Filter}} s.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29049) Rename DataSourceStrategy#normalizeFilters to DataSourceStrategy#normalizeAttrNames

2019-09-16 Thread Xianyin Xin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyin Xin updated SPARK-29049:

Description: DataSourceStrategy#normalizeFilters can also be used to 
normalize attributes in \{{Expression}} s, not limit to \{{Filter}}s.  (was: 
DataSourceStrategy#normalizeFilters can also be used to normalize attributes in 
{{Expression}} s, not limit to {{Filter}} s.)

> Rename DataSourceStrategy#normalizeFilters to 
> DataSourceStrategy#normalizeAttrNames
> ---
>
> Key: SPARK-29049
> URL: https://issues.apache.org/jira/browse/SPARK-29049
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianyin Xin
>Priority: Minor
>
> DataSourceStrategy#normalizeFilters can also be used to normalize attributes 
> in \{{Expression}} s, not limit to \{{Filter}}s.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29049) Rename DataSourceStrategy#normalizeFilters to DataSourceStrategy#normalizeAttrNames

2019-09-16 Thread Xianyin Xin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyin Xin updated SPARK-29049:

Description: DataSourceStrategy#normalizeFilters can also be used to 
normalize attributes in `Expression`s, not limit to `Filter`s.

> Rename DataSourceStrategy#normalizeFilters to 
> DataSourceStrategy#normalizeAttrNames
> ---
>
> Key: SPARK-29049
> URL: https://issues.apache.org/jira/browse/SPARK-29049
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianyin Xin
>Priority: Minor
>
> DataSourceStrategy#normalizeFilters can also be used to normalize attributes 
> in `Expression`s, not limit to `Filter`s.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29008) Define an individual method for each common subexpression in HashAggregateExec

2019-09-16 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-29008.
--
Fix Version/s: 3.0.0
 Assignee: Takeshi Yamamuro
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/25710]

> Define an individual method for each common subexpression in HashAggregateExec
> --
>
> Key: SPARK-29008
> URL: https://issues.apache.org/jira/browse/SPARK-29008
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.0
>
>
> In the current master, the common subexpr elimination code in 
> HashAggregateExec is expanded in a single method: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L397]
> This method size can be too big for JIT, so we can define an individual 
> method for each common subexpression.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29047) use spark-submit Not a file: hdfs://

2019-09-16 Thread bailin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931013#comment-16931013
 ] 

bailin commented on SPARK-29047:


I solved it; it was a problem in my code. When generating and when reading the 
Parquet data, set 
config("spark.sql.sources.partitionColumnTypeInference.enabled", "false") on both 
sides.

 

> use  spark-submit  Not a file: hdfs://
> --
>
> Key: SPARK-29047
> URL: https://issues.apache.org/jira/browse/SPARK-29047
> Project: Spark
>  Issue Type: Question
>  Components: Deploy, Spark Submit
>Affects Versions: 2.1.3
> Environment: spark1.3 ;hadoop2.6
>Reporter: bailin
>Priority: Critical
> Attachments: hadfs file is already existed.png
>
>
> When I submit a Spark application: 
> {code:java}
> /spark-submit --class com.yto.log.SparkStatCleanJobYARN --name 
> TopNStatJobYARN --master yarn --executor-memory 1G  --num-executors 1 
> /home/hadoop/lib/sql-1.0-jar-with-dependencies.jar 
> hdfs://hadoop001:8020/clean 20161110{code}
> The HDFS file already exists, but it is not recognized: 
>  
> {code:java}
> 19/09/11 14:18:50 INFO mapred.FileInputFormat: Total input paths to process : 
> 219/09/11 14:18:50 INFO mapred.FileInputFormat: Total input paths to process 
> : 219/09/11 14:18:50 ERROR datasources.FileFormatWriter: Aborting job 
> null.java.io.IOException: Not a file: hdfs://hadoop001:8020/clean/day= at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits
> {code}
>  
>  
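
For illustration, a minimal Scala sketch of the workaround described in the comment above, assuming both the writing job and the reading job build their own SparkSession (the app name is illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession

// Disable partition column type inference; apply the same setting in the job
// that writes the Parquet data and in the job that reads it.
val spark = SparkSession.builder()
  .appName("partition-type-inference-off")
  .config("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
  .getOrCreate()
{code}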



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29108) Add window.sql - Part 2

2019-09-16 Thread Dylan Guedes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dylan Guedes updated SPARK-29108:
-
Description: In this ticket, we plan to add the regression test cases of 
[https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L320-L562|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L320-L562]
  (was: In this ticket, we plan to add the regression test cases of 
[https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L320-L562|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L1-L319])

> Add window.sql - Part 2
> ---
>
> Key: SPARK-29108
> URL: https://issues.apache.org/jira/browse/SPARK-29108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
> Fix For: 3.0.0
>
>
> In this ticket, we plan to add the regression test cases of 
> [https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L320-L562|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L320-L562]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29109) Add window.sql - Part 3

2019-09-16 Thread Dylan Guedes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dylan Guedes updated SPARK-29109:
-
Description: In this ticket, we plan to add the regression test cases of 
[https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L553-L911|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L553-L911]
  (was: In this ticket, we plan to add the regression test cases of 
[https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L553-L911|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L1-L319])

> Add window.sql - Part 3
> ---
>
> Key: SPARK-29109
> URL: https://issues.apache.org/jira/browse/SPARK-29109
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
> Fix For: 3.0.0
>
>
> In this ticket, we plan to add the regression test cases of 
> [https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L553-L911|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L553-L911]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29110) Add window.sql - Part 4

2019-09-16 Thread Dylan Guedes (Jira)
Dylan Guedes created SPARK-29110:


 Summary: Add window.sql - Part 4
 Key: SPARK-29110
 URL: https://issues.apache.org/jira/browse/SPARK-29110
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Dylan Guedes
 Fix For: 3.0.0


In this ticket, we plan to add the regression test cases of 
[https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L912-L1259|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L1-L319]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29110) Add window.sql - Part 4

2019-09-16 Thread Dylan Guedes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dylan Guedes updated SPARK-29110:
-
Description: In this ticket, we plan to add the regression test cases of 
[https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L912-L1259|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L912-L1259]
  (was: In this ticket, we plan to add the regression test cases of 
[https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L912-L1259|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L1-L319])

> Add window.sql - Part 4
> ---
>
> Key: SPARK-29110
> URL: https://issues.apache.org/jira/browse/SPARK-29110
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
> Fix For: 3.0.0
>
>
> In this ticket, we plan to add the regression test cases of 
> [https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L912-L1259|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L912-L1259]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29109) Add window.sql - Part 3

2019-09-16 Thread Dylan Guedes (Jira)
Dylan Guedes created SPARK-29109:


 Summary: Add window.sql - Part 3
 Key: SPARK-29109
 URL: https://issues.apache.org/jira/browse/SPARK-29109
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Dylan Guedes
 Fix For: 3.0.0


In this ticket, we plan to add the regression test cases of 
[https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L553-L911|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L1-L319]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29108) Add window.sql - Part 2

2019-09-16 Thread Dylan Guedes (Jira)
Dylan Guedes created SPARK-29108:


 Summary: Add window.sql - Part 2
 Key: SPARK-29108
 URL: https://issues.apache.org/jira/browse/SPARK-29108
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Dylan Guedes
 Fix For: 3.0.0


In this ticket, we plan to add the regression test cases of 
[https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L320-L562|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L1-L319]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29107) Add window.sql - Part 1

2019-09-16 Thread Dylan Guedes (Jira)
Dylan Guedes created SPARK-29107:


 Summary: Add window.sql - Part 1
 Key: SPARK-29107
 URL: https://issues.apache.org/jira/browse/SPARK-29107
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Dylan Guedes
 Fix For: 3.0.0


In this ticket, we plan to add the regression test cases of 
https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L1-L319



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29106) Add jenkins arm test for spark

2019-09-16 Thread huangtianhua (Jira)
huangtianhua created SPARK-29106:


 Summary: Add jenkins arm test for spark
 Key: SPARK-29106
 URL: https://issues.apache.org/jira/browse/SPARK-29106
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.4.4
Reporter: huangtianhua


Add ARM test jobs to AMPLab Jenkins. OpenLab will offer ARM instances to AMPLab 
to support ARM testing for Spark.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25216) Provide better error message when a column contains dot and needs backticks quote

2019-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25216.
--
Resolution: Duplicate

> Provide better error message when a column contains dot and needs backticks 
> quote
> -
>
> Key: SPARK-25216
> URL: https://issues.apache.org/jira/browse/SPARK-25216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Li Jin
>Priority: Major
>
> The current error message is often confusing; it does not tell a new Spark user 
> that a column name containing "." needs to be quoted with backticks. 
> For example, consider the following code:
> {code:java}
> spark.range(0, 1).toDF('a.b')['a.b']{code}
> the current message looks like:
> {code:java}
> Cannot resolve column name "a.b" among (a.b)
> {code}
> We could improve the error message to, e.g: 
> {code:java}
> Cannot resolve column name "a.b" among (a.b). Try adding backticks to the 
> column name, i.e., `a.b`;
> {code}
>  
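
For illustration, a minimal Scala equivalent of the situation above, assuming a local SparkSession (the app name and master are illustrative); the backticked form resolves, while the unquoted form is typically treated as nested field access and fails to resolve:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("backticks").getOrCreate()
val df = spark.range(0, 1).toDF("a.b")
df.select("`a.b`").show()   // resolves; df.select("a.b") typically fails to resolve
spark.stop()
{code}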



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26781) Additional exchange gets added

2019-09-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26781.
--
Resolution: Duplicate

> Additional exchange gets added 
> ---
>
> Key: SPARK-26781
> URL: https://issues.apache.org/jira/browse/SPARK-26781
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Karuppayya
>Priority: Major
>
> Consider three tables: a(id int), b(id int), c(id, int)
> query:  
> {code:java}
> select * from (select a.id as newid from a join b where a.id = b.id) temp 
> join c on temp.newid = c.id
> {code}
> Plan(physical plan: 
> org.apache.spark.sql.execution.QueryExecution#executedPlan):
>  
> {noformat}
> *(9) SortMergeJoin [newid#1L], [id#6L], Inner
> :- *(6) Sort [newid#1L ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(newid#1L, 200)
> : +- *(5) Project [id#2L AS newid#1L, name#3]
> : +- *(5) SortMergeJoin [id#2L], [id#4L], Inner
> : :- *(2) Sort [id#2L ASC NULLS FIRST], false, 0
> : : +- Exchange hashpartitioning(id#2L, 200)
> : : +- *(1) Project [id#2L, name#3]
> : : +- *(1) Filter isnotnull(id#2L)
> : : +- *(1) FileScan parquet a[id#2L,name#3] Batched: true, DataFilters: 
> [isnotnull(id#2L)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/tmp/spark/a], PartitionFilters: [], PushedFilters: 
> [IsNotNull(id)], ReadSchema: struct
> : +- *(4) Sort [id#4L ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(id#4L, 200)
> : +- *(3) Project [id#4L]
> : +- *(3) Filter isnotnull(id#4L)
> : +- *(3) FileScan parquet b[id#4L] Batched: true, DataFilters: 
> [isnotnull(id#4L)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/tmp/spark/b], PartitionFilters: [], PushedFilters: 
> [IsNotNull(id)], ReadSchema: struct
> +- *(8) Sort [id#6L ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(id#6L, 200)
> +- *(7) Project [id#6L, name#7]
> +- *(7) Filter isnotnull(id#6L)
> +- *(7) FileScan parquet \c[id#6L,name#7] Batched: true, DataFilters: 
> [isnotnull(id#6L)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/tmp/spark/c], PartitionFilters: [], PushedFilters: 
> [IsNotNull(id)], ReadSchema: struct{noformat}
>  
>  The exchange operator below stage 6 is not required since the data from 
> project is already partitioned based on id.
> An exchange gets added since the outputPartitioning of Project(5) is 
> HashPartitioning on id#2L whereas the requiredPartitioning of Sort(Stage 6) 
> is HashPartitioning on newid#1L, which is nothing but an alias of id#2L.
> The exchange operator would not be required in this case if we were able to 
> relate the attribute id#2L to the alias newid#1L that references it.
> This issue happens in TPC-DS benchmark query #2.  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29105) SHS may delete driver log file of in progress application

2019-09-16 Thread Marcelo Vanzin (Jira)
Marcelo Vanzin created SPARK-29105:
--

 Summary: SHS may delete driver log file of in progress application
 Key: SPARK-29105
 URL: https://issues.apache.org/jira/browse/SPARK-29105
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Marcelo Vanzin


There's an issue with how the SHS cleans driver logs that is similar to the 
problem of event logs: because the file size is not updated when you write to 
it, the SHS fails to detect activity and thus may delete the file while it's 
still being written to.

SPARK-24787 added a workaround in the SHS so that it can detect that situation 
for in-progress apps, replacing the previous solution which was too slow for 
event logs.

But that doesn't work for driver logs because they do not follow the same 
pattern (different file names for in-progress files), and thus would require 
the SHS to open the driver log files on every scan, which is expensive.

The old approach (using the {{hsync}} API) seems to be a good match for the 
driver logs, though, which don't slow down the listener bus like event logs do.
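
For illustration, a hedged sketch of the {{hsync}}-based approach in Scala, using Hadoop's FileSystem API (the path is illustrative; whether the file length visible to readers is updated also depends on filesystem-specific sync flags):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val out = fs.create(new Path("/tmp/driver-logs/app-1234.log"))  // illustrative path
out.write("driver log line\n".getBytes("UTF-8"))
// Persist buffered data so a reader such as the SHS can detect activity.
out.hsync()
out.close()
{code}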



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29104) Fix Flaky Test - PipedRDDSuite. stdin_writer_thread_should_be_exited_when_task_is_finished

2019-09-16 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-29104:
-

 Summary: Fix Flaky Test - PipedRDDSuite. 
stdin_writer_thread_should_be_exited_when_task_is_finished
 Key: SPARK-29104
 URL: https://issues.apache.org/jira/browse/SPARK-29104
 Project: Spark
  Issue Type: Test
  Components: Spark Core
Affects Versions: 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0, 3.0.0
Reporter: Dongjoon Hyun


- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.rdd/PipedRDDSuite/stdin_writer_thread_should_be_exited_when_task_is_finished/





--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26565) modify dev/create-release/release-build.sh to let jenkins build packages w/o publishing

2019-09-16 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26565.
---
Resolution: Not A Problem

> modify dev/create-release/release-build.sh to let jenkins build packages w/o 
> publishing
> ---
>
> Key: SPARK-26565
> URL: https://issues.apache.org/jira/browse/SPARK-26565
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: fine.png, no-idea.jpg
>
>
> about a year+ ago, we stopped publishing releases directly from jenkins...
> this means that the spark-\{branch}-packaging builds are failing due to gpg 
> signing failures, and i would like to update these builds to *just* perform 
> packaging.
> example:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]
> i propose to change dev/create-release/release-build.sh...
> when the script is called w/the 'package' option, add an {{if}} statement to 
> skip the following sections when run on jenkins:
> 1) gpg signing of the source tarball (lines 184-187)
> 2) gpg signing of the sparkR dist (lines 243-248)
> 3) gpg signing of the python dist (lines 256-261)
> 4) gpg signing of the regular binary dist (lines 264-271)
> 5) the svn push of the signed dists (lines 317-332)
>  
> -another, and probably much better, option is to nuke the 
> spark-\{branch}-packaging builds and create new ones that just build things 
> w/o touching this incredibly fragile shell scripting nightmare.-



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29103) CheckAnalysis for data source V2 ALTER TABLE ignores case sensitivity

2019-09-16 Thread Jose Torres (Jira)
Jose Torres created SPARK-29103:
---

 Summary: CheckAnalysis for data source V2 ALTER TABLE ignores case 
sensitivity
 Key: SPARK-29103
 URL: https://issues.apache.org/jira/browse/SPARK-29103
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Jose Torres


For each column referenced, we run

```val field = table.schema.findNestedField(fieldName, includeCollections = 
true)```

and fail analysis if the field isn't there. This check is always case-sensitive 
on column names, even if the underlying catalog is case insensitive, so it will 
sometimes throw on ALTER operations which the catalog supports.
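
For illustration, a hypothetical helper (not the actual CheckAnalysis code) showing a lookup driven by a resolver with the same shape as the analyzer's, i.e. (String, String) => Boolean, so case sensitivity can follow the catalog's behavior:

{code:scala}
import org.apache.spark.sql.types.{StructField, StructType}

// Hypothetical: find a top-level column using the given resolver.
def findTopLevelField(schema: StructType, name: String,
    resolver: (String, String) => Boolean): Option[StructField] =
  schema.fields.find(f => resolver(f.name, name))

// A case-insensitive resolver, for catalogs that ignore case.
val caseInsensitive: (String, String) => Boolean = _.equalsIgnoreCase(_)
{code}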



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25153) Improve error messages for columns with dots/periods

2019-09-16 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930896#comment-16930896
 ] 

Jeff Evans commented on SPARK-25153:


Opened a pull request for this (see link added by the bot).  Open to 
suggestions on the exact wording of the "suggestion", of course.

> Improve error messages for columns with dots/periods
> 
>
> Key: SPARK-25153
> URL: https://issues.apache.org/jira/browse/SPARK-25153
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
>
> When we fail to resolve a column name with a dot in it, and the column name 
> is present as a string literal, the error message could mention using 
> backticks to have the whole string treated as a single column name.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29100) Codegen with switch in InSet expression causes compilation error

2019-09-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29100:
--
Priority: Major  (was: Minor)

> Codegen with switch in InSet expression causes compilation error
> 
>
> Key: SPARK-29100
> URL: https://issues.apache.org/jira/browse/SPARK-29100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> SPARK-26205 adds an optimization to InSet that generates a Java switch 
> condition for certain cases. When the given set is empty, it is possible that 
> codegen causes a compilation error:
>  
> [info] - SPARK-29100: InSet with empty input set *** FAILED *** (58 
> milliseconds)                                      
> [info]   Code generation of input[0, int, true] INSET () failed:              
>                                                           
> [info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
> "generated.java": Compiling "apply(java.lang.Object _i)"; 
> apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
> size 0, now 1                                                                 
>                                           
> [info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
> "generated.java": Compiling "apply(java.lang.Object _i)"; 
> apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
> size 0, now 1                                                                 
>                                           
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1308)
>                                                                               
>           
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1386)
>                
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1383)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-29100) Codegen with switch in InSet expression causes compilation error

2019-09-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29100:
--
Comment: was deleted

(was: Since this is a prevention of the potential bug situation, I lower the 
priority from Major to Minor. However, thank you, [~viirya]. This is a nice bug 
fix.)

> Codegen with switch in InSet expression causes compilation error
> 
>
> Key: SPARK-29100
> URL: https://issues.apache.org/jira/browse/SPARK-29100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Minor
>
> SPARK-26205 adds an optimization to InSet that generates a Java switch 
> condition for certain cases. When the given set is empty, it is possible that 
> codegen causes a compilation error:
>  
> [info] - SPARK-29100: InSet with empty input set *** FAILED *** (58 
> milliseconds)                                      
> [info]   Code generation of input[0, int, true] INSET () failed:              
>                                                           
> [info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
> "generated.java": Compiling "apply(java.lang.Object _i)"; 
> apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
> size 0, now 1                                                                 
>                                           
> [info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
> "generated.java": Compiling "apply(java.lang.Object _i)"; 
> apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
> size 0, now 1                                                                 
>                                           
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1308)
>                                                                               
>           
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1386)
>                
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1383)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29100) Codegen with switch in InSet expression causes compilation error

2019-09-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930871#comment-16930871
 ] 

Dongjoon Hyun commented on SPARK-29100:
---

Since this is a prevention of a potential bug situation, I'm lowering the priority 
from Major to Minor. However, thank you, [~viirya]. This is a nice bug fix.

> Codegen with switch in InSet expression causes compilation error
> 
>
> Key: SPARK-29100
> URL: https://issues.apache.org/jira/browse/SPARK-29100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Minor
>
> SPARK-26205 adds an optimization to InSet that generates a Java switch 
> condition for certain cases. When the given set is empty, it is possible that 
> codegen causes a compilation error:
>  
> [info] - SPARK-29100: InSet with empty input set *** FAILED *** (58 
> milliseconds)                                      
> [info]   Code generation of input[0, int, true] INSET () failed:              
>                                                           
> [info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
> "generated.java": Compiling "apply(java.lang.Object _i)"; 
> apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
> size 0, now 1                                                                 
>                                           
> [info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
> "generated.java": Compiling "apply(java.lang.Object _i)"; 
> apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
> size 0, now 1                                                                 
>                                           
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1308)
>                                                                               
>           
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1386)
>                
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1383)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29100) Codegen with switch in InSet expression causes compilation error

2019-09-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29100:
--
Priority: Minor  (was: Major)

> Codegen with switch in InSet expression causes compilation error
> 
>
> Key: SPARK-29100
> URL: https://issues.apache.org/jira/browse/SPARK-29100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Minor
>
> SPARK-26205 adds an optimization to InSet that generates a Java switch 
> condition for certain cases. When the given set is empty, it is possible that 
> codegen causes a compilation error:
>  
> [info] - SPARK-29100: InSet with empty input set *** FAILED *** (58 
> milliseconds)                                      
> [info]   Code generation of input[0, int, true] INSET () failed:              
>                                                           
> [info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
> "generated.java": Compiling "apply(java.lang.Object _i)"; 
> apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
> size 0, now 1                                                                 
>                                           
> [info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
> "generated.java": Compiling "apply(java.lang.Object _i)"; 
> apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
> size 0, now 1                                                                 
>                                           
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1308)
>                                                                               
>           
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1386)
>                
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1383)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22381) Add StringParam that supports valid options

2019-09-16 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22381.
---
Resolution: Won't Fix

> Add StringParam that supports valid options
> ---
>
> Key: SPARK-22381
> URL: https://issues.apache.org/jira/browse/SPARK-22381
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> During testing for https://issues.apache.org/jira/browse/SPARK-22331, I found 
> it might be a good idea to include the possible options in a StringParam.
> A StringParam extends Param[String] and allows the user to specify the valid 
> options as an Array[String] (case insensitive).
> So far it can help achieve three goals:
> 1. Make the StringParam aware of its possible options and support native 
> validation.
> 2. StringParam can list the supported options when the user inputs a wrong value.
> 3. Allow automatic unit test coverage for case-insensitive String params.
> IMO it also decreases code redundancy.
> The StringParam is designed to be completely compatible with existing 
> Param[String], just adding the extra logic for supporting options, which 
> means we don't need to convert all Param[String] to StringParam until we feel 
> comfortable to do that.
> 
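
For illustration, a hypothetical sketch of the proposed class (not an existing Spark API); it assumes only the public Param[String] constructor that takes a validation function:

{code:scala}
import org.apache.spark.ml.param.Param

// Hypothetical StringParam: validates values against a case-insensitive option list.
class StringParam(parent: String, name: String, doc: String, val options: Array[String])
  extends Param[String](parent, name, doc,
    (value: String) => options.exists(_.equalsIgnoreCase(value)))
{code}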



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10408) Autoencoder

2019-09-16 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10408.
---
Resolution: Won't Fix

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
>Priority: Major
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1) Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
>  
> 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11(3371–3408). 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf






[jira] [Resolved] (SPARK-22111) OnlineLDAOptimizer should filter out empty documents beforehand

2019-09-16 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22111.
---
Resolution: Won't Fix

> OnlineLDAOptimizer should filter out empty documents beforehand 
> 
>
> Key: SPARK-22111
> URL: https://issues.apache.org/jira/browse/SPARK-22111
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Priority: Minor
>
> OnlineLDAOptimizer should filter out empty documents beforehand in order to 
> make corpusSize, batchSize, and nonEmptyDocsN all refer to the same filtered 
> corpus with all non-empty docs. 
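As a hedged sketch of the proposed change (not the actual optimizer code), the filtering itself is a one-liner over the corpus RDD of (docId, termCounts) pairs:

{code:scala}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch only: drop empty documents up front so that corpusSize, batchSize and
// nonEmptyDocsN are all computed over the same, already-filtered corpus.
def dropEmptyDocs(docs: RDD[(Long, Vector)]): RDD[(Long, Vector)] =
  docs.filter { case (_, termCounts) => termCounts.numNonzeros > 0 }
{code}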






[jira] [Resolved] (SPARK-24806) Brush up generated code so that JDK Java compilers can handle it

2019-09-16 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24806.
---
Resolution: Won't Fix

> Brush up generated code so that JDK Java compilers can handle it
> 
>
> Key: SPARK-24806
> URL: https://issues.apache.org/jira/browse/SPARK-24806
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takeshi Yamamuro
>Priority: Trivial
>







[jira] [Commented] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-16 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930835#comment-16930835
 ] 

Nicholas Chammas commented on SPARK-29102:
--

cc [~cloud_fan] and [~hyukjin.kwon]: I noticed your work and comments on the 
PRs for SPARK-28366, so you may be interested in this issue.

Does this idea make sense? Does it seem feasible in theory at least?

> Read gzipped file into multiple partitions without full gzip expansion on a 
> single-node
> ---
>
> Key: SPARK-29102
> URL: https://issues.apache.org/jira/browse/SPARK-29102
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, 
> SPARK-28366) and an ongoing pain point for users who must process such files 
> delivered from external parties who can't or won't break them up into smaller 
> files or compress them using a splittable compression format like bzip2.
> To deal with large gzipped files today, users must either load them via a 
> single task and then repartition the resulting RDD or DataFrame, or they must 
> launch a preprocessing step outside of Spark to split up the file or 
> recompress it using a splittable format. In either case, the user needs a 
> single host capable of holding the entire decompressed file.
> Spark can potentially a) spare new users the confusion over why only one task 
> is processing their gzipped data, and b) relieve new and experienced users 
> alike from needing to maintain infrastructure capable of decompressing a 
> large gzipped file on a single node, by directly loading gzipped files into 
> multiple partitions across the cluster.
> The rough idea is to have tasks divide a given gzipped file into ranges and 
> then have them all concurrently decompress the file, with each task throwing 
> away the data leading up to the target range. (This kind of partial 
> decompression is apparently [doable using standard Unix 
> utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
> doable in Spark too.)
> In this way multiple tasks can concurrently load a single gzipped file into 
> multiple partitions. Even though every task will need to unpack the file from 
> the beginning to the task's target range, and the stage will run no faster 
> than what it would take with Spark's current gzip loading behavior, this 
> nonetheless addresses the two problems called out above. Users no longer need 
> to load and then repartition gzipped files, and their infrastructure does not 
> need to decompress any large gzipped file on a single node.






[jira] [Created] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-16 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-29102:


 Summary: Read gzipped file into multiple partitions without full 
gzip expansion on a single-node
 Key: SPARK-29102
 URL: https://issues.apache.org/jira/browse/SPARK-29102
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.4.4
Reporter: Nicholas Chammas


Large gzipped files are a common stumbling block for new users (SPARK-5685, 
SPARK-28366) and an ongoing pain point for users who must process such files 
delivered from external parties who can't or won't break them up into smaller 
files or compress them using a splittable compression format like bzip2.

To deal with large gzipped files today, users must either load them via a 
single task and then repartition the resulting RDD or DataFrame, or they must 
launch a preprocessing step outside of Spark to split up the file or recompress 
it using a splittable format. In either case, the user needs a single host 
capable of holding the entire decompressed file.

Spark can potentially a) spare new users the confusion over why only one task 
is processing their gzipped data, and b) relieve new and experienced users 
alike from needing to maintain infrastructure capable of decompressing a large 
gzipped file on a single node, by directly loading gzipped files into multiple 
partitions across the cluster.

The rough idea is to have tasks divide a given gzipped file into ranges and 
then have them all concurrently decompress the file, with each task throwing 
away the data leading up to the target range. (This kind of partial 
decompression is apparently [doable using standard Unix 
utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
doable in Spark too.)

In this way multiple tasks can concurrently load a single gzipped file into 
multiple partitions. Even though every task will need to unpack the file from 
the beginning to the task's target range, and the stage will run no faster than 
what it would take with Spark's current gzip loading behavior, this nonetheless 
addresses the two problems called out above. Users no longer need to load and 
then repartition gzipped files, and their infrastructure does not need to 
decompress any large gzipped file on a single node.
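As a hedged illustration of the per-task idea (plain JVM I/O, not an actual Spark reader; the path and offsets are placeholders), each task could decompress from the start of the file and keep only the slice of uncompressed bytes in its assigned range:

{code:scala}
import java.io.{BufferedInputStream, ByteArrayOutputStream, FileInputStream}
import java.util.zip.GZIPInputStream

// Sketch only: decompress the gzip stream from the beginning and keep just the
// uncompressed bytes that fall in [start, end).
def readUncompressedRange(path: String, start: Long, end: Long): Array[Byte] = {
  val in = new GZIPInputStream(new BufferedInputStream(new FileInputStream(path)))
  val out = new ByteArrayOutputStream()
  try {
    var pos = 0L                        // uncompressed bytes consumed so far
    val buf = new Array[Byte](64 * 1024)
    var n = in.read(buf)
    while (n > 0 && pos < end) {
      // Keep only the part of this chunk that overlaps the target range.
      val from = math.max(start - pos, 0L)
      val until = math.min(end - pos, n.toLong)
      if (until > from) out.write(buf, from.toInt, (until - from).toInt)
      pos += n
      n = in.read(buf)
    }
    out.toByteArray
  } finally {
    in.close()
  }
}
{code}

A real implementation would wrap this in the data source's partition reader and decode records (e.g. lines) rather than raw bytes, but the cost profile is as described above: every task pays for decompressing the prefix before its range.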






[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-16 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930827#comment-16930827
 ] 

Liang-Chi Hsieh commented on SPARK-28927:
-

Regarding the unstable AUC issue: nondeterministic training data, even when it 
does not cause the ArrayIndexOutOfBoundsException, can also cause incorrect 
matching when computing the factors. I don't have evidence that this is the 
cause, but it is possible.

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
> with 12 billion instances
> ---
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Attachments: image-2019-09-02-11-55-33-596.png
>
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> This exception happened intermittently. We also found that the AUC metric was 
> not stable when evaluating the inner product of the user factors and the item 
> factors with the same dataset and configuration: AUC varied from 0.60 to 0.67, 
> which is not stable enough for a production environment. 
> Dataset size: ~12 billion ratings
>  Here is our code:
> {code:java}
> val hivedata = sc.sql(sqltext).select("id", "dpid", "score", "tag")
> 

[jira] [Created] (SPARK-29101) CSV datasource returns incorrect .count() from file with malformed records

2019-09-16 Thread Stuart White (Jira)
Stuart White created SPARK-29101:


 Summary: CSV datasource returns incorrect .count() from file with 
malformed records
 Key: SPARK-29101
 URL: https://issues.apache.org/jira/browse/SPARK-29101
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
Reporter: Stuart White


Spark 2.4 introduced a change to the way csv files are read.  See [Upgrading 
From Spark SQL 2.3 to 
2.4|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-23-to-24]
 for more details.

In that document, it states: _To restore the previous behavior, set 
spark.sql.csv.parser.columnPruning.enabled to false._

I am configuring Spark 2.4.4 as such, yet I'm still getting results 
inconsistent with pre-2.4.  For example:

Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
records, and one malformed record.

{noformat}
fruit,color,price,quantity
apple,red,1,3
banana,yellow,2,4
orange,orange,3,5
xxx
{noformat}
 
With Spark 2.1.1, if I call .count() on a DataFrame created from this file 
(using option DROPMALFORMED), "3" is returned.

{noformat}
(using Spark 2.1.1)
scala> spark.read.option("header", "true").option("mode", 
"DROPMALFORMED").csv("fruit.csv").count
19/09/16 14:28:01 WARN CSVRelation: Dropping malformed line: xxx
res1: Long = 3
{noformat}

With Spark 2.4.4, I set the "spark.sql.csv.parser.columnPruning.enabled" option 
to false to restore the pre-2.4 behavior for handling malformed records, then 
call .count() and "4" is returned.

{noformat}
(using spark 2.4.4)
scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)
scala> spark.read.option("header", "true").option("mode", 
"DROPMALFORMED").csv("fruit.csv").count
res1: Long = 4
{noformat}

So, using the *spark.sql.csv.parser.columnPruning.enabled* option did not 
actually restore the previous behavior.

How can I, using Spark 2.4+, get a count of the records in a .csv which 
excludes malformed records?
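Not an authoritative answer, but one workaround sometimes suggested (a hedged sketch; the exact behavior may differ across parser versions): parse in PERMISSIVE mode with an explicit corrupt-record column and count only the rows that parsed cleanly. Note that Spark disallows filtering on the corrupt-record column alone unless the parsed result is cached first, hence the .cache().

{code:scala}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Sketch of a possible workaround, not a fix for the reported behavior.
val schema = new StructType()
  .add("fruit", StringType)
  .add("color", StringType)
  .add("price", IntegerType)
  .add("quantity", IntegerType)
  .add("_corrupt_record", StringType)   // captures malformed lines in PERMISSIVE mode

val parsed = spark.read
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .csv("fruit.csv")
  .cache()   // needed before filtering on the corrupt-record column alone

// Count only the rows that parsed without being flagged as corrupt.
val goodCount = parsed.filter(col("_corrupt_record").isNull).count()
{code}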







[jira] [Commented] (SPARK-25721) maxRate configuration not being used in Kinesis receiver

2019-09-16 Thread Karthikeyan Ravi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930801#comment-16930801
 ] 

Karthikeyan Ravi commented on SPARK-25721:
--

Attaching screenshot  !Screen Shot 2019-09-16 at 12.27.25 PM.png!

> maxRate configuration not being used in Kinesis receiver
> 
>
> Key: SPARK-25721
> URL: https://issues.apache.org/jira/browse/SPARK-25721
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Zhaobo Yu
>Priority: Major
> Attachments: Screen Shot 2019-09-16 at 12.27.25 PM.png, 
> rate_violation.png
>
>
> In the onStart() function of KinesisReceiver class, the 
> KinesisClientLibConfiguration object is initialized in the following way, 
> val kinesisClientLibConfiguration = new KinesisClientLibConfiguration(
>  checkpointAppName,
>  streamName,
>  kinesisProvider,
>  dynamoDBCreds.map(_.provider).getOrElse(kinesisProvider),
>  cloudWatchCreds.map(_.provider).getOrElse(kinesisProvider),
>  workerId)
>  .withKinesisEndpoint(endpointUrl)
>  .withInitialPositionInStream(initialPositionInStream)
>  .withTaskBackoffTimeMillis(500)
>  .withRegionName(regionName)
>  
> As you can see, there is no withMaxRecords() in the initialization, so 
> KinesisClientLibConfiguration will set it to 10000 by default, since it is 
> hard-coded as follows:
> public static final int DEFAULT_MAX_RECORDS = 10000;
> In such a case, the receiver will not fulfill any maxRate setting we set if 
> it's less than 10k, worse still, it will cause 
> ProvisionedThroughputExceededException from Kinesis, especially when we 
> restart the streaming application. 
>  
> Attached  rate_violation.png, we have a spark streaming application that has 
> 40 receivers, which is set to consume 1 record per second. Within 5 minutes 
> the spark streaming application should take no more than 12k records/5 
> minutes (40*60*5 = 12k), but cloudwatch metrics shows it was consuming more 
> than that, which is almost at the rate of 22k records/5 minutes.
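For illustration, here is a hedged sketch of the kind of change this report points at (not a committed patch): pass an explicit cap to the KCL via withMaxRecords(). The maxRecordsPerFetch value is a placeholder, e.g. something derived from spark.streaming.receiver.maxRate and the fetch interval; the other identifiers are the ones from the KinesisReceiver snippet quoted above.

{code:scala}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration

// Sketch only: same initialization as quoted above, plus an explicit per-fetch cap.
val kinesisClientLibConfiguration = new KinesisClientLibConfiguration(
    checkpointAppName,
    streamName,
    kinesisProvider,
    dynamoDBCreds.map(_.provider).getOrElse(kinesisProvider),
    cloudWatchCreds.map(_.provider).getOrElse(kinesisProvider),
    workerId)
  .withKinesisEndpoint(endpointUrl)
  .withInitialPositionInStream(initialPositionInStream)
  .withTaskBackoffTimeMillis(500)
  .withRegionName(regionName)
  .withMaxRecords(maxRecordsPerFetch)   // placeholder: cap records per GetRecords call
{code}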






[jira] [Commented] (SPARK-25721) maxRate configuration not being used in Kinesis receiver

2019-09-16 Thread Karthikeyan Ravi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930803#comment-16930803
 ] 

Karthikeyan Ravi commented on SPARK-25721:
--

This problem is very similar to https://issues.apache.org/jira/browse/SPARK-18371, 
but that fix was applied mainly to the Kafka code. 

> maxRate configuration not being used in Kinesis receiver
> 
>
> Key: SPARK-25721
> URL: https://issues.apache.org/jira/browse/SPARK-25721
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Zhaobo Yu
>Priority: Major
> Attachments: Screen Shot 2019-09-16 at 12.27.25 PM.png, 
> rate_violation.png
>
>
> In the onStart() function of KinesisReceiver class, the 
> KinesisClientLibConfiguration object is initialized in the following way, 
> val kinesisClientLibConfiguration = new KinesisClientLibConfiguration(
>  checkpointAppName,
>  streamName,
>  kinesisProvider,
>  dynamoDBCreds.map(_.provider).getOrElse(kinesisProvider),
>  cloudWatchCreds.map(_.provider).getOrElse(kinesisProvider),
>  workerId)
>  .withKinesisEndpoint(endpointUrl)
>  .withInitialPositionInStream(initialPositionInStream)
>  .withTaskBackoffTimeMillis(500)
>  .withRegionName(regionName)
>  
> As you can see, there is no withMaxRecords() in the initialization, so 
> KinesisClientLibConfiguration will set it to 10000 by default, since it is 
> hard-coded as follows:
> public static final int DEFAULT_MAX_RECORDS = 10000;
> In such a case, the receiver will not fulfill any maxRate setting we set if 
> it's less than 10k, worse still, it will cause 
> ProvisionedThroughputExceededException from Kinesis, especially when we 
> restart the streaming application. 
>  
> Attached  rate_violation.png, we have a spark streaming application that has 
> 40 receivers, which is set to consume 1 record per second. Within 5 minutes 
> the spark streaming application should take no more than 12k records/5 
> minutes (40*60*5 = 12k), but cloudwatch metrics shows it was consuming more 
> than that, which is almost at the rate of 22k records/5 minutes.






[jira] [Updated] (SPARK-25721) maxRate configuration not being used in Kinesis receiver

2019-09-16 Thread Karthikeyan Ravi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthikeyan Ravi updated SPARK-25721:
-
Attachment: Screen Shot 2019-09-16 at 12.27.25 PM.png

> maxRate configuration not being used in Kinesis receiver
> 
>
> Key: SPARK-25721
> URL: https://issues.apache.org/jira/browse/SPARK-25721
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Zhaobo Yu
>Priority: Major
> Attachments: Screen Shot 2019-09-16 at 12.27.25 PM.png, 
> rate_violation.png
>
>
> In the onStart() function of KinesisReceiver class, the 
> KinesisClientLibConfiguration object is initialized in the following way, 
> val kinesisClientLibConfiguration = new KinesisClientLibConfiguration(
>  checkpointAppName,
>  streamName,
>  kinesisProvider,
>  dynamoDBCreds.map(_.provider).getOrElse(kinesisProvider),
>  cloudWatchCreds.map(_.provider).getOrElse(kinesisProvider),
>  workerId)
>  .withKinesisEndpoint(endpointUrl)
>  .withInitialPositionInStream(initialPositionInStream)
>  .withTaskBackoffTimeMillis(500)
>  .withRegionName(regionName)
>  
> As you can see, there is no withMaxRecords() in the initialization, so 
> KinesisClientLibConfiguration will set it to 10000 by default, since it is 
> hard-coded as follows:
> public static final int DEFAULT_MAX_RECORDS = 10000;
> In such a case, the receiver will not fulfill any maxRate setting we set if 
> it's less than 10k, worse still, it will cause 
> ProvisionedThroughputExceededException from Kinesis, especially when we 
> restart the streaming application. 
>  
> Attached  rate_violation.png, we have a spark streaming application that has 
> 40 receivers, which is set to consume 1 record per second. Within 5 minutes 
> the spark streaming application should take no more than 12k records/5 
> minutes (40*60*5 = 12k), but cloudwatch metrics shows it was consuming more 
> than that, which is almost at the rate of 22k records/5 minutes.






[jira] [Commented] (SPARK-25721) maxRate configuration not being used in Kinesis receiver

2019-09-16 Thread Karthikeyan Ravi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930796#comment-16930796
 ] 

Karthikeyan Ravi commented on SPARK-25721:
--

Hi team, any updates on this? We are seeing a very similar problem with 
Kinesis: a big batch, followed by batches with 0 records, and then another 
huge batch. Could you please share any updates or a workaround for this if 
you know of one?

> maxRate configuration not being used in Kinesis receiver
> 
>
> Key: SPARK-25721
> URL: https://issues.apache.org/jira/browse/SPARK-25721
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Zhaobo Yu
>Priority: Major
> Attachments: rate_violation.png
>
>
> In the onStart() function of KinesisReceiver class, the 
> KinesisClientLibConfiguration object is initialized in the following way, 
> val kinesisClientLibConfiguration = new KinesisClientLibConfiguration(
>  checkpointAppName,
>  streamName,
>  kinesisProvider,
>  dynamoDBCreds.map(_.provider).getOrElse(kinesisProvider),
>  cloudWatchCreds.map(_.provider).getOrElse(kinesisProvider),
>  workerId)
>  .withKinesisEndpoint(endpointUrl)
>  .withInitialPositionInStream(initialPositionInStream)
>  .withTaskBackoffTimeMillis(500)
>  .withRegionName(regionName)
>  
> As you can see, there is no withMaxRecords() in the initialization, so 
> KinesisClientLibConfiguration will set it to 10000 by default, since it is 
> hard-coded as follows:
> public static final int DEFAULT_MAX_RECORDS = 10000;
> In such a case, the receiver will not fulfill any maxRate setting we set if 
> it's less than 10k, worse still, it will cause 
> ProvisionedThroughputExceededException from Kinesis, especially when we 
> restart the streaming application. 
>  
> Attached  rate_violation.png, we have a spark streaming application that has 
> 40 receivers, which is set to consume 1 record per second. Within 5 minutes 
> the spark streaming application should take no more than 12k records/5 
> minutes (40*60*5 = 12k), but cloudwatch metrics shows it was consuming more 
> than that, which is almost at the rate of 22k records/5 minutes.






[jira] [Commented] (SPARK-26205) Optimize InSet expression for bytes, shorts, ints, dates

2019-09-16 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930787#comment-16930787
 ] 

Liang-Chi Hsieh commented on SPARK-26205:
-

[~cloud_fan]. I see now. Created SPARK-29100.

> Optimize InSet expression for bytes, shorts, ints, dates
> 
>
> Key: SPARK-26205
> URL: https://issues.apache.org/jira/browse/SPARK-26205
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
> Fix For: 3.0.0
>
>
> {{In}} expressions are compiled into a sequence of if-else statements, which 
> results in O\(n\) time complexity. {{InSet}} is an optimized version of 
> {{In}}, which is supposed to improve the performance if the number of 
> elements is big enough. However, {{InSet}} actually degrades the performance 
> in many cases due to various reasons (benchmarks were created in SPARK-26203 
> and solutions to the boxing problem are discussed in SPARK-26204).
> The main idea of this JIRA is to use Java {{switch}} statements to 
> significantly improve the performance of {{InSet}} expressions for bytes, 
> shorts, ints, dates. All {{switch}} statements are compiled into 
> {{tableswitch}} and {{lookupswitch}} bytecode instructions. We will have 
> O\(1\) time complexity if our case values are compact and {{tableswitch}} can 
> be used. Otherwise, {{lookupswitch}} will give us O\(log n\). Our local 
> benchmarks show that this logic is more than two times faster even on 500+ 
> elements than using primitive collections in {{InSet}} expressions. As Spark 
> is using Scala {{HashSet}} right now, the performance gain will be is even 
> bigger.
> See 
> [here|https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-3.html#jvms-3.10]
>  and 
> [here|https://stackoverflow.com/questions/10287700/difference-between-jvms-lookupswitch-and-tableswitch]
>  for more information.
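As a hedged, non-generated illustration of the trick (ordinary Scala here, rather than the Java that codegen emits), matches on integer literals compile to exactly these instructions:

{code:scala}
import scala.annotation.switch

// Sketch only: set membership as a switch. Compact keys compile to a tableswitch
// (O(1) dispatch); sparse keys to a lookupswitch (O(log n)). The @switch annotation
// makes scalac report an error if it cannot emit a switch instruction.
def inCompactSet(value: Int): Boolean = (value: @switch) match {
  case 1 => true
  case 2 => true
  case 3 => true
  case 4 => true
  case 5 => true
  case _ => false
}

def inSparseSet(value: Int): Boolean = (value: @switch) match {
  case 10     => true
  case 400    => true
  case 9000   => true
  case 123456 => true
  case _      => false
}
{code}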






[jira] [Assigned] (SPARK-29100) Codegen with switch in InSet expression causes compilation error

2019-09-16 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh reassigned SPARK-29100:
---

Assignee: Liang-Chi Hsieh

> Codegen with switch in InSet expression causes compilation error
> 
>
> Key: SPARK-29100
> URL: https://issues.apache.org/jira/browse/SPARK-29100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> SPARK-26205 adds an optimization to InSet that generates a Java switch 
> condition for certain cases. When the given set is empty, it is possible that 
> codegen causes a compilation error:
>  
> [info] - SPARK-29100: InSet with empty input set *** FAILED *** (58 
> milliseconds)                                      
> [info]   Code generation of input[0, int, true] INSET () failed:              
>                                                           
> [info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
> "generated.java": Compiling "apply(java.lang.Object _i)"; 
> apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
> size 0, now 1                                                                 
>                                           
> [info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
> "generated.java": Compiling "apply(java.lang.Object _i)"; 
> apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
> size 0, now 1                                                                 
>                                           
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1308)
>                                                                               
>           
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1386)
>                
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1383)






[jira] [Closed] (SPARK-23714) Add metrics for cached KafkaConsumer

2019-09-16 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi closed SPARK-23714.
-

> Add metrics for cached KafkaConsumer
> 
>
> Key: SPARK-23714
> URL: https://issues.apache.org/jira/browse/SPARK-23714
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Ted Yu
>Priority: Major
>
> SPARK-23623 added KafkaDataConsumer to avoid concurrent use of cached 
> KafkaConsumer.
> This JIRA is to add metrics for measuring the operations of the cache so that 
> users can gain insight into the caching solution.






[jira] [Resolved] (SPARK-23714) Add metrics for cached KafkaConsumer

2019-09-16 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-23714.
---
Resolution: Duplicate

Apache Commons Pool has been added, which provides JMX metrics.

> Add metrics for cached KafkaConsumer
> 
>
> Key: SPARK-23714
> URL: https://issues.apache.org/jira/browse/SPARK-23714
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Ted Yu
>Priority: Major
>
> SPARK-23623 added KafkaDataConsumer to avoid concurrent use of cached 
> KafkaConsumer.
> This JIRA is to add metrics for measuring the operations of the cache so that 
> users can gain insight into the caching solution.






[jira] [Updated] (SPARK-29100) Codegen with switch in InSet expression causes compilation error

2019-09-16 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-29100:

Description: 
SPARK-26205 adds an optimization to InSet that generates a Java switch condition 
for certain cases. When the given set is empty, it is possible that codegen 
causes a compilation error:

 

[info] - SPARK-29100: InSet with empty input set *** FAILED *** (58 
milliseconds)                                      

[info]   Code generation of input[0, int, true] INSET () failed:                
                                                        

[info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
"generated.java": Compiling "apply(java.lang.Object _i)"; 
apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
size 0, now 1                                                                   
                                        

[info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
"generated.java": Compiling "apply(java.lang.Object _i)"; 
apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
size 0, now 1                                                                   
                                        

[info]         at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1308)
                                                                                
        

[info]         at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1386)
               

[info]         at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1383)

  was:SPARK-26205 adds an optimization to InSet that generates Java switch 
condition for certain cases. When the given set is empty, it is possibly that 
codegen causes compilation error.


> Codegen with switch in InSet expression causes compilation error
> 
>
> Key: SPARK-29100
> URL: https://issues.apache.org/jira/browse/SPARK-29100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> SPARK-26205 adds an optimization to InSet that generates a Java switch 
> condition for certain cases. When the given set is empty, it is possible that 
> codegen causes a compilation error:
>  
> [info] - SPARK-29100: InSet with empty input set *** FAILED *** (58 
> milliseconds)                                      
> [info]   Code generation of input[0, int, true] INSET () failed:              
>                                                           
> [info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
> "generated.java": Compiling "apply(java.lang.Object _i)"; 
> apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
> size 0, now 1                                                                 
>                                           
> [info]   org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in 
> "generated.java": Compiling "apply(java.lang.Object _i)"; 
> apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous 
> size 0, now 1                                                                 
>                                           
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1308)
>                                                                               
>           
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1386)
>                
> [info]         at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1383)






[jira] [Created] (SPARK-29100) Codegen with switch in InSet expression causes compilation error

2019-09-16 Thread Liang-Chi Hsieh (Jira)
Liang-Chi Hsieh created SPARK-29100:
---

 Summary: Codegen with switch in InSet expression causes 
compilation error
 Key: SPARK-29100
 URL: https://issues.apache.org/jira/browse/SPARK-29100
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


SPARK-26205 adds an optimization to InSet that generates a Java switch condition 
for certain cases. When the given set is empty, it is possible that codegen 
causes a compilation error.






[jira] [Comment Edited] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time

2019-09-16 Thread Harichandan Pulagam (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930763#comment-16930763
 ] 

Harichandan Pulagam edited comment on SPARK-27648 at 9/16/19 6:40 PM:
--

Here's a code example that reproduces the above issue, using the Apache 
distribution of Spark 2.4.3, and local filesystem for checkpointing:

{noformat}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import java.sql.Timestamp

case class UserEvent(
   name: String,
   age: Int,
   eventTime: Timestamp)

val schema: StructType = StructType(
  StructField("timestamp", StringType, true) ::
  StructField("name", StringType, true) ::
  StructField("age", IntegerType, true) :: Nil)

val kafkaProperties: Map[String, String] =
   Map(
 "subscribe" -> "user",
 "startingOffsets" -> "earliest",
 "kafka.bootstrap.servers" -> )

val ds = spark.readStream
 .format("kafka")
 .options(kafkaProperties)
 .load
 .selectExpr("CAST (value AS STRING) AS JSON")
 .select(from_json(col("json"), schema).as("data"))
 .select(
   col("data.name").as("name"),
   col("data.age").as("age"),
   col("data.timestamp".as("timestamp"))
 .withColumn("eventTime", col("timestamp").cast(TimestampType))
 .drop("timestamp")
 .as[UserEvent]

val countDS =
  ds.withWatermark("eventTime", "1 minute")
.groupBy(
  window(
col("eventTime"), "1 minute", "30 seconds"),
col("age"))
.count

countDS.withColumn("topic", lit("user_out"))
  .selectExpr("topic", s"to_json(struct(window, age, count)) AS value")
  .writeStream
  .outputMode("append")
  .format("kafka")
  .trigger(Trigger.ProcessingTime("1 second"))
  .option("checkpointLocation", "file:///tmp")
  .option("kafka.bootstrap.servers", )
  .start
{noformat}


was (Author: harichandan):
Here's a code example that reproduces the above issue, using the Apache 
distribution of Spark 2.4.3, and local filesystem for checkpointing:

{noformat}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import java.sql.Timestamp

case class UserEvent(
   name: String,
   age: Int,
   eventTime: Timestamp)

val schema: StructType = StructType(
  StructField("timestamp", StringType, true) ::
  StructField("name", StringType, true) ::
  StructField("age", IntegerType, true) :: Nil)

val kafkaProperties: Map[String, String] =
   Map(
 "subscribe" -> "user",
 "startingOffsets" -> "earliest",
 "kafka.bootstrap.servers" -> )

val ds = spark.readStream
 .format("kafka")
 .options(kafkaProperties)
 .load
 .selectExpr("CAST (value AS STRING) AS JSON")
 .select(from_json(col("json"), schema).as("data"))
 .select(
   col("data.name").as("name"),
   col("data.age").as("age"),
   col("data.timestamp".as("timestamp"))
 .withColumn("eventTime", col("timestamp").cast(TimestampType))
 .drop("timestamp")
 .as[UserEvent]

val countDS =
  ds.withWatermark("eventTime", "1 minute")
.groupBy(
  window(
col("eventTime"), "1 minute", "30 seconds"),
col("age"))
.count

countDS.withColumn("topic", lit("user_out"))
  .selectExpr("topic", s"to_json(struct(name, age)) AS value")
  .writeStream
  .outputMode("append")
  .format("kafka")
  .trigger(Trigger.ProcessingTime("1 second"))
  .option("checkpointLocation", "file:///tmp")
  .option("kafka.bootstrap.servers", )
  .start
{noformat}

> In Spark2.4 Structured Streaming:The executor storage memory increasing over 
> time
> -
>
> Key: SPARK-27648
> URL: https://issues.apache.org/jira/browse/SPARK-27648
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: tommy duan
>Priority: Major
> Attachments: houragg(1).out, houragg_filter.csv, 
> houragg_with_state1_state2.csv, houragg_with_state1_state2.xlsx, 
> image-2019-05-09-17-51-14-036.png, image-2019-05-10-17-49-42-034.png, 
> image-2019-05-24-10-20-25-723.png, image-2019-05-27-10-10-30-460.png, 
> image-2019-06-02-19-43-21-652.png
>
>
> *Spark Program Code Business:*
>  Read the topic on kafka, aggregate the stream data sources, and then output 
> it to another topic line of kafka.
> *Problem Description:*
>  *1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
> overflow problems often occur (because of too many versions of state stored 
> in memory, this bug has been modified in spark 2.4).
> {code:java}
> /spark-submit \
> --conf “spark.yarn.executor.memoryOverhead=4096M”
> --num-executors 15 \
> --executor-memory 3G \
> --executor-cores 2 \
> --driver-memory 6G \{code}
> {code}
> Executor memory exceptions occur when running with this submit resource under 
> SPARK 2.2 and the 

[jira] [Resolved] (SPARK-23694) The staging directory should under hive.exec.stagingdir if we set hive.exec.stagingdir but not under the table directory

2019-09-16 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23694.
---
Resolution: Won't Fix

> The staging directory should under hive.exec.stagingdir if we set 
> hive.exec.stagingdir but not under the table directory 
> -
>
> Key: SPARK-23694
> URL: https://issues.apache.org/jira/browse/SPARK-23694
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yifeng Dong
>Priority: Major
>
> When we set hive.exec.stagingdir but not under the table directory, for 
> example: /tmp/hive-staging, I think the staging directory should under 
> /tmp/hive-staging, not under /tmp/ like /tmp/hive-staging_xxx






[jira] [Comment Edited] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time

2019-09-16 Thread Harichandan Pulagam (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930763#comment-16930763
 ] 

Harichandan Pulagam edited comment on SPARK-27648 at 9/16/19 6:38 PM:
--

Here's a code example that reproduces the above issue, using the Apache 
distribution of Spark 2.4.3, and local filesystem for checkpointing:

{noformat}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import java.sql.Timestamp

case class UserEvent(
   name: String,
   age: Int,
   eventTime: Timestamp)

val schema: StructType = StructType(
  StructField("timestamp", StringType, true) ::
  StructField("name", StringType, true) ::
  StructField("age", IntegerType, true) :: Nil)

val kafkaProperties: Map[String, String] =
   Map(
 "subscribe" -> "user",
 "startingOffsets" -> "earliest",
 "kafka.bootstrap.servers" -> )

val ds = spark.readStream
 .format("kafka")
 .options(kafkaProperties)
 .load
 .selectExpr("CAST (value AS STRING) AS JSON")
 .select(from_json(col("json"), schema).as("data"))
 .select(
   col("data.name").as("name"),
   col("data.age").as("age"),
   col("data.timestamp".as("timestamp"))
 .withColumn("eventTime", col("timestamp").cast(TimestampType))
 .drop("timestamp")
 .as[UserEvent]

val countDS =
  ds.withWatermark("eventTime", "1 minute")
.groupBy(
  window(
col("eventTime"), "1 minute", "30 seconds"),
col("age"))
.count

countDS.withColumn("topic", lit("user_out"))
  .selectExpr("topic", s"to_json(struct(name, age)) AS value")
  .writeStream
  .outputMode("append")
  .format("kafka")
  .trigger(Trigger.ProcessingTime("1 second"))
  .option("checkpointLocation", "file:///tmp")
  .option("kafka.bootstrap.servers", )
  .start
{noformat}


was (Author: harichandan):
Here's a code example that reproduces the above issue, using the Apache 
distribution of Spark 2.4.3, and local filesystem for checkpointing:

{noformat}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import java.sql.Timestamp

case class UserEvent(
   name: String,
   age: Int,
   eventTime: Timestamp)

val schema: StructType = StructType(
  StructField("timestamp", StringType, true) ::
  StructField("name", StringType, true) ::
  StructField("age", IntegerType, true) :: Nil)

val kafkaProperties: Map[String, String] =
   Map(
 "subscribe" -> "user",
 "startingOffsets" -> "earliest",
 "kafka.bootstrap.servers" -> )

val ds = spark.readStream
 .format("kafka")
 .options(kafkaProperties)
 .load
 .selectExpr("CAST (value AS STRING) AS JSON")
 .select(from_json(col("json"), schema).as("data"))
 .select(
   col("data.name").as("name"),
   col("data.age").as("age"),
   col("data.timestamp".as("timestamp"))
 .withColumn("eventTime", col("timestamp").cast(TimestampType))
 .drop("timestamp")
 .as[UserEvent]

val countDS =
  ds.withWatermark("eventTime", "1 minute")
.groupBy(
  window(
col("eventTime"), "1 minute", "30 seconds"),
col("organization_id"))
.count

countDS.withColumn("topic", lit("user_out"))
  .selectExpr("topic", s"to_json(struct(name, age)) AS value")
  .writeStream
  .outputMode("append")
  .format("kafka")
  .trigger(Trigger.ProcessingTime("1 second"))
  .option("checkpointLocation", "file:///tmp")
  .option("kafka.bootstrap.servers", )
  .start
{noformat}

> In Spark2.4 Structured Streaming:The executor storage memory increasing over 
> time
> -
>
> Key: SPARK-27648
> URL: https://issues.apache.org/jira/browse/SPARK-27648
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: tommy duan
>Priority: Major
> Attachments: houragg(1).out, houragg_filter.csv, 
> houragg_with_state1_state2.csv, houragg_with_state1_state2.xlsx, 
> image-2019-05-09-17-51-14-036.png, image-2019-05-10-17-49-42-034.png, 
> image-2019-05-24-10-20-25-723.png, image-2019-05-27-10-10-30-460.png, 
> image-2019-06-02-19-43-21-652.png
>
>
> *Spark Program Code Business:*
>  Read the topic on kafka, aggregate the stream data sources, and then output 
> it to another topic line of kafka.
> *Problem Description:*
>  *1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
> overflow problems often occur (because of too many versions of state stored 
> in memory, this bug has been modified in spark 2.4).
> {code:java}
> /spark-submit \
> --conf “spark.yarn.executor.memoryOverhead=4096M”
> --num-executors 15 \
> --executor-memory 3G \
> --executor-cores 2 \
> --driver-memory 6G \{code}
> {code}
> Executor memory exceptions occur when running with this submit resource under 
> SPARK 2.2 and 

[jira] [Commented] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time

2019-09-16 Thread Harichandan Pulagam (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930763#comment-16930763
 ] 

Harichandan Pulagam commented on SPARK-27648:
-

Here's a code example that reproduces the above issue, using the Apache 
distribution of Spark 2.4.3, and local filesystem for checkpointing:

{noformat}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import java.sql.Timestamp

case class UserEvent(
   name: String,
   age: Int,
   eventTime: Timestamp)

val schema: StructType = StructType(
  StructField("timestamp", StringType, true) ::
  StructField("name", StringType, true) ::
  StructField("age", IntegerType, true) :: Nil)

val kafkaProperties: Map[String, String] =
   Map(
 "subscribe" -> "user",
 "startingOffsets" -> "earliest",
 "kafka.bootstrap.servers" -> )

val ds = spark.readStream
 .format("kafka")
 .options(kafkaProperties)
 .load
 .selectExpr("CAST (value AS STRING) AS JSON")
 .select(from_json(col("json"), schema).as("data"))
 .select(
   col("data.name").as("name"),
   col("data.age").as("age"),
   col("data.timestamp".as("timestamp"))
 .withColumn("eventTime", col("timestamp").cast(TimestampType))
 .drop("timestamp")
 .as[UserEvent]

val countDS =
  ds.withWatermark("eventTime", "1 minute")
.groupBy(
  window(
col("eventTime"), "1 minute", "30 seconds"),
col("organization_id"))
.count

countDS.withColumn("topic", lit("user_out"))
  .selectExpr("topic", s"to_json(struct(name, age)) AS value")
  .writeStream
  .outputMode("append")
  .format("kafka")
  .trigger(Trigger.ProcessingTime("1 second"))
  .option("checkpointLocation", "file:///tmp")
  .option("kafka.bootstrap.servers", )
  .start
{noformat}

> In Spark2.4 Structured Streaming:The executor storage memory increasing over 
> time
> -
>
> Key: SPARK-27648
> URL: https://issues.apache.org/jira/browse/SPARK-27648
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: tommy duan
>Priority: Major
> Attachments: houragg(1).out, houragg_filter.csv, 
> houragg_with_state1_state2.csv, houragg_with_state1_state2.xlsx, 
> image-2019-05-09-17-51-14-036.png, image-2019-05-10-17-49-42-034.png, 
> image-2019-05-24-10-20-25-723.png, image-2019-05-27-10-10-30-460.png, 
> image-2019-06-02-19-43-21-652.png
>
>
> *Spark Program Code Business:*
>  Read the topic on kafka, aggregate the stream data sources, and then output 
> it to another topic line of kafka.
> *Problem Description:*
>  *1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
> overflow problems often occur (because of too many versions of state stored 
> in memory, this bug has been modified in spark 2.4).
> {code:java}
> /spark-submit \
> --conf “spark.yarn.executor.memoryOverhead=4096M”
> --num-executors 15 \
> --executor-memory 3G \
> --executor-cores 2 \
> --driver-memory 6G \{code}
> {code}
> Executor memory exceptions occur when running with this submit resource under 
> SPARK 2.2 and the normal running time does not exceed one day.
> The solution is to set the executor memory larger than before 
> {code:java}
>  My spark-submit script is as follows:
> /spark-submit\
> conf "spark. yarn. executor. memoryOverhead = 4096M"
> num-executors 15\
> executor-memory 46G\
> executor-cores 3\
> driver-memory 6G\
> ...{code}
> In this case, the spark program can be guaranteed to run stably for a long 
> time, and the executor storage memory is less than 10M (it has been running 
> stably for more than 20 days).
> *2) From the upgrade information of Spark 2.4, we can see that the problem of 
> large memory consumption of state storage has been solved in Spark 2.4.* 
>  So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
> and found that the use of memory was reduced.
>  But a problem arises, as the running time increases, the storage memory of 
> executor is growing (see Executors - > Storage Memory from the Spark on Yarn 
> Resource Manager UI).
>  This program has been running for 14 days (under SPARK 2.2, running with 
> this submit resource, the normal running time is not more than one day, 
> Executor memory abnormalities will occur).
>  The script submitted by the program under spark2.4 is as follows:
> {code:java}
> /spark-submit \
>  --conf “spark.yarn.executor.memoryOverhead=4096M”
>  --num-executors 15 \
>  --executor-memory 3G \
>  --executor-cores 2 \
>  --driver-memory 6G 
> {code}
> Under Spark 2.4, I counted the size of executor memory as time went by during 
> the running of the spark program:
> |Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
> 

[jira] [Resolved] (SPARK-24671) DataFrame length using a dunder/magic method in PySpark

2019-09-16 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24671.
---
Resolution: Won't Fix

> DataFrame length using a dunder/magic method in PySpark
> ---
>
> Key: SPARK-24671
> URL: https://issues.apache.org/jira/browse/SPARK-24671
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Ondrej Kokes
>Priority: Minor
>
> In Python, if a class implements a method called __len__, one can use the 
> builtin `len` function to get a length of an instance of said class, whatever 
> that means in its context. This is e.g. how you get the number of rows of a 
> pandas DataFrame.
> It should be straightforward to add this functionality to PySpark, because 
> df.count() is already implemented, so the patch I'm proposing is just two 
> lines of code (and two lines of tests). It's in this commit; I'll submit a PR 
> shortly.
> https://github.com/kokes/spark/commit/4d0afaf3cd046b11e8bae43dc00ddf4b1eb97732






[jira] [Resolved] (SPARK-19184) Improve numerical stability for method tallSkinnyQR.

2019-09-16 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19184.
---
Resolution: Won't Fix

> Improve numerical stability for method tallSkinnyQR.
> 
>
> Key: SPARK-19184
> URL: https://issues.apache.org/jira/browse/SPARK-19184
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Huamin Li
>Priority: Minor
>  Labels: None
>
> In method tallSkinnyQR, the final Q is calculated by A * inv(R) ([Github 
> Link|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L562]).
>  When the upper triangular matrix R is ill-conditioned, computing the inverse 
> of R can result in catastrophic cancellation. Instead, we should consider 
> using a forward solve for solving Q such that Q * R = A.
> I first create a 4 by 4 RowMatrix A = 
> (1,1,1,1; 0,1E-5,1,1; 0,0,1E-10,1; 0,0,0,1E-14), and then I apply the method 
> tallSkinnyQR to A to find a RowMatrix Q and a Matrix R such that A = Q*R. In this 
> case, A is ill-conditioned and so is R.
> See codes in Spark Shell:
> {code:none}
> import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
> import org.apache.spark.mllib.linalg.distributed.RowMatrix
> // Create RowMatrix A.
> val mat = Seq(Vectors.dense(1,1,1,1), Vectors.dense(0, 1E-5, 1,1), 
> Vectors.dense(0,0,1E-10,1), Vectors.dense(0,0,0,1E-14))
> val denseMat = new RowMatrix(sc.parallelize(mat, 2))
> // Apply tallSkinnyQR to A.
> val result = denseMat.tallSkinnyQR(true)
> // Print the calculated Q and R.
> result.Q.rows.collect.foreach(println)
> result.R
> // Calculate Q*R. Ideally, this should be close to A.
> val reconstruct = result.Q.multiply(result.R)
> reconstruct.rows.collect.foreach(println)
> // Calculate Q'*Q. Ideally, this should be close to the identity matrix.
> result.Q.computeGramianMatrix()
> System.exit(0)
> {code}
> it will output the following results:
> {code:none}
> scala> result.Q.rows.collect.foreach(println)
> [1.0,0.0,0.0,1.5416524685312E13]
> [0.0,0.,0.0,8011776.0]
> [0.0,0.0,1.0,0.0]
> [0.0,0.0,0.0,1.0]
> scala> result.R
> 1.0  1.0 1.0  1.0
> 0.0  1.0E-5  1.0  1.0
> 0.0  0.0 1.0E-10  1.0
> 0.0  0.0 0.0  1.0E-14
> scala> reconstruct.rows.collect.foreach(println)
> [1.0,1.0,1.0,1.15416524685312]
> [0.0,9.999E-6,0.,1.0008011776]
> [0.0,0.0,1.0E-10,1.0]
> [0.0,0.0,0.0,1.0E-14]
> scala> result.Q.computeGramianMatrix()
> 1.0 0.0 0.0  1.5416524685312E13
> 0.0 0.9998  0.0  8011775.9
> 0.0 0.0 1.0  0.0
> 1.5416524685312E13  8011775.9   0.0  2.3766923337289844E26
> {code}
> With forward solve for solving Q such that Q * R = A rather than computing 
> the inverse of R, it will output the following results instead:
> {code:none}
> scala> result.Q.rows.collect.foreach(println)
> [1.0,0.0,0.0,0.0]
> [0.0,1.0,0.0,0.0]
> [0.0,0.0,1.0,0.0]
> [0.0,0.0,0.0,1.0]
> scala> result.R
> 1.0  1.0 1.0  1.0
> 0.0  1.0E-5  1.0  1.0
> 0.0  0.0 1.0E-10  1.0
> 0.0  0.0 0.0  1.0E-14
> scala> reconstruct.rows.collect.foreach(println)
> [1.0,1.0,1.0,1.0]
> [0.0,1.0E-5,1.0,1.0]
> [0.0,0.0,1.0E-10,1.0]
> [0.0,0.0,0.0,1.0E-14]
> scala> result.Q.computeGramianMatrix()
> 1.0  0.0  0.0  0.0
> 0.0  1.0  0.0  0.0
> 0.0  0.0  1.0  0.0
> 0.0  0.0  0.0  1.0
> {code}
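To make the forward-solve idea above concrete, here is a minimal, purely illustrative sketch (not the proposed patch): each row q of Q can be recovered from the corresponding row a of A by solving q * R = a with forward substitution instead of forming inv(R). The column-major indexing assumes MLlib's DenseMatrix layout.
{code:scala}
// Illustrative only: solve q * R = a for one row a of A, where R (n x n) is
// upper triangular and stored column-major as in mllib.linalg.DenseMatrix.
def solveRowAgainstUpperTriangular(r: Array[Double], n: Int, a: Array[Double]): Array[Double] = {
  val q = new Array[Double](n)
  var j = 0
  while (j < n) {
    var s = a(j)
    var i = 0
    while (i < j) {
      s -= q(i) * r(i + j * n) // subtract already-solved components times R(i, j)
      i += 1
    }
    q(j) = s / r(j + j * n)    // divide by the diagonal entry R(j, j)
    j += 1
  }
  q
}
{code}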



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26524) If the application directory fails to be created on the SPARK_WORKER_DIR on some worker nodes (for example, bad disk or disk has no capacity), the application executor w

2019-09-16 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26524.
---
Resolution: Won't Fix

> If the application directory fails to be created on the SPARK_WORKER_DIR on 
> some worker nodes (for example, bad disk or disk has no capacity), the 
> application executor will be allocated indefinitely.
> --
>
> Key: SPARK-26524
> URL: https://issues.apache.org/jira/browse/SPARK-26524
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: hantiantian
>Priority: Major
>
> When the Spark worker is started, the worker dir is created successfully. By the 
> time the application is submitted, the disks backing the worker dirs on worker121 
> and worker122 are damaged.
> When a worker allocates an executor, it creates a working directory and a 
> temporary directory. If that creation fails, the executor allocation fails. 
> Because the application directory cannot be created under SPARK_WORKER_DIR on 
> worker121 and worker122, executors for the application are allocated (and fail) 
> indefinitely.
> 2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Removing 
> executor app-20190103154954-/5762 because it is FAILED
> 2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Launching 
> executor app-20190103154954-/5765 on worker 
> worker-20190103154858-worker121-37199
> 2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Removing 
> executor app-20190103154954-/5764 because it is FAILED
> 2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Launching 
> executor app-20190103154954-/5766 on worker 
> worker-20190103154920-worker122-41273
> 2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Removing 
> executor app-20190103154954-/5766 because it is FAILED
> 2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Launching 
> executor app-20190103154954-/5767 on worker 
> worker-20190103154920-worker122-41273
> 2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Removing 
> executor app-20190103154954-/5765 because it is FAILED
> 2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Launching 
> executor app-20190103154954-/5768 on worker 
> worker-20190103154858-worker121-37199
> ...
> I looked at the code and found that Spark does have some handling for executor 
> allocation failures. However, it only covers the case where the current 
> application has no executor in the RUNNING state:
> {code:scala}
> if (!normalExit
>     && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
>     && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
>   val execs = appInfo.executors.values
>   if (!execs.exists(_.state == ExecutorState.RUNNING)) {
>     logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
>       s"${appInfo.retryCount} times; removing it")
>     removeApplication(appInfo, ApplicationState.FAILED)
>   }
> }
> {code}
> The Master only judges whether a worker is usable based on the worker's free 
> resources:
> {code:scala}
> // Filter out workers that don't have enough resources to launch an executor
> val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
>   .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
>     worker.coresFree >= coresPerExecutor)
>   .sortBy(_.coresFree).reverse
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26929) table owner should use user instead of principal when use kerberos

2019-09-16 Thread Marcelo Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26929:
--

Assignee: hong dongdong

> table owner should use user instead of principal when use kerberos
> --
>
> Key: SPARK-26929
> URL: https://issues.apache.org/jira/browse/SPARK-26929
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>
> In a kerberos cluster, when spark-sql or beeline is used to create a table, the 
> owner is recorded as the full principal. The issue was fixed in SPARK-19970 but 
> reintroduced by SPARK-22846, so it occurs again. It causes some problems when 
> using roles, and this time both issues should be resolved together.
> Use org.apache.hadoop.hive.shims.Utils.getUGI directly to get 
> ugi.getShortUserName
> instead of conf.getUser, which returns the principal info.
> Code change
> {code:java}
> private val userName: String = try {
> val ugi = HiveUtils.getUGI
> ugi.getShortUserName
> } catch {
> case e: LoginException => throw new IOException(e)
> }
> {code}
> Before
> {code}
> scala> sql("create table t(a int)").show
>  scala> sql("desc formatted t").show(false)
>  ...
> |Owner:|sp...@example.com| |
> {code}
> After:
> {code}
>  scala> sql("create table t(a int)").show
>  scala> sql("desc formatted t").show(false)
>  ...
> |Owner:|spark| |
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26929) table owner should use user instead of principal when use kerberos

2019-09-16 Thread Marcelo Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26929.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 23952
[https://github.com/apache/spark/pull/23952]

> table owner should use user instead of principal when use kerberos
> --
>
> Key: SPARK-26929
> URL: https://issues.apache.org/jira/browse/SPARK-26929
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
> Fix For: 3.0.0
>
>
> In a kerberos cluster, when spark-sql or beeline is used to create a table, the 
> owner is recorded as the full principal. The issue was fixed in SPARK-19970 but 
> reintroduced by SPARK-22846, so it occurs again. It causes some problems when 
> using roles, and this time both issues should be resolved together.
> Use org.apache.hadoop.hive.shims.Utils.getUGI directly to get 
> ugi.getShortUserName
> instead of conf.getUser, which returns the principal info.
> Code change
> {code:java}
> private val userName: String = try {
> val ugi = HiveUtils.getUGI
> ugi.getShortUserName
> } catch {
> case e: LoginException => throw new IOException(e)
> }
> {code}
> Before
> {code}
> scala> sql("create table t(a int)").show
>  scala> sql("desc formatted t").show(false)
>  ...
> |Owner:|sp...@example.com| |
> {code}
> After:
> {code}
>  scala> sql("create table t(a int)").show
>  scala> sql("desc formatted t").show(false)
>  ...
> |Owner:|spark| |
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29099) org.apache.spark.sql.catalyst.catalog.CatalogTable.lastAccessTime is not set

2019-09-16 Thread Shixiong Zhu (Jira)
Shixiong Zhu created SPARK-29099:


 Summary: 
org.apache.spark.sql.catalyst.catalog.CatalogTable.lastAccessTime is not set
 Key: SPARK-29099
 URL: https://issues.apache.org/jira/browse/SPARK-29099
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.4
Reporter: Shixiong Zhu


I noticed that 
"org.apache.spark.sql.catalyst.catalog.CatalogTable.lastAccessTime" is always 0 
in my environment. It looks like Spark never updates this field in the metastore 
when reading a table. That is fine, considering the cost of updating it on every 
read would be high.

However, "Last Access" in "describe extended" then always shows "Thu Jan 01 
00:00:00 UTC 1970", which is confusing. Can we show something else to indicate 
that it's not set?
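A minimal, purely illustrative sketch of that suggestion (the placeholder string and helper name are assumptions, not the actual DESCRIBE code path):
{code:scala}
// Illustrative only: render a placeholder instead of the epoch when the
// metastore never populated lastAccessTime (it then stays at 0).
def formatLastAccess(lastAccessTimeMs: Long): String =
  if (lastAccessTimeMs <= 0) "UNKNOWN"
  else new java.util.Date(lastAccessTimeMs).toString
{code}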




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2019-09-16 Thread JP Bordenave (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930685#comment-16930685
 ] 

JP Bordenave edited comment on SPARK-13446 at 9/16/19 4:42 PM:
---

Thanks for the help. As a test, I removed hive-exec-2.3.6.jar from the spark/jars 
folder and from hdfs /spark-jars/,

but I got a NoClassDefFoundError for HiveException when I ran

spark.sql("show databases").show
{noformat}
Caused by: java.lang.reflect.InvocationTargetException: 
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at 
org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:189)
  ... 74 more
Caused by: java.lang.NoClassDefFoundError: 
org/apache/hadoop/hive/ql/metadata/HiveException
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.(HiveExternalCatalog.scala:71)
  ... 79 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.hive.ql.metadata.HiveException
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 80 more
 {noformat}


was (Author: jpbordi):
for test,  i remove hive-exec-2.3.6.jar  from spark/jars folder and  hdfs  
/spark-jars/

but i got a hivexception when i ran

spark.sql("show databases").show
{noformat}
Caused by: java.lang.reflect.InvocationTargetException: 
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at 
org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:189)
  ... 74 more
Caused by: java.lang.NoClassDefFoundError: 
org/apache/hadoop/hive/ql/metadata/HiveException
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.(HiveExternalCatalog.scala:71)
  ... 79 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.hive.ql.metadata.HiveException
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 80 more
 {noformat}

> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.2.0
>
>
> Spark provides the HiveContext class to read data from the Hive metastore 
> directly, but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has been 
> released, it would be better to upgrade to support Hive 2.0.0.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2019-09-16 Thread JP Bordenave (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930685#comment-16930685
 ] 

JP Bordenave edited comment on SPARK-13446 at 9/16/19 4:43 PM:
---

Thanks for the help. As a test, I removed the hive-exec-2.3.6.jar directly from 
the spark/jars folder and from hdfs /spark-jars/,

but I got a NoClassDefFoundError for HiveException when I ran

spark.sql("show databases").show
{noformat}
Caused by: java.lang.reflect.InvocationTargetException: 
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at 
org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:189)
  ... 74 more
Caused by: java.lang.NoClassDefFoundError: 
org/apache/hadoop/hive/ql/metadata/HiveException
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.(HiveExternalCatalog.scala:71)
  ... 79 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.hive.ql.metadata.HiveException
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 80 more
 {noformat}


was (Author: jpbordi):
thanks for help, for test,  i remove hive-exec-2.3.6.jar  from spark/jars 
folder and  hdfs  /spark-jars/

but i got a hivexception when i ran

spark.sql("show databases").show
{noformat}
Caused by: java.lang.reflect.InvocationTargetException: 
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at 
org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:189)
  ... 74 more
Caused by: java.lang.NoClassDefFoundError: 
org/apache/hadoop/hive/ql/metadata/HiveException
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.(HiveExternalCatalog.scala:71)
  ... 79 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.hive.ql.metadata.HiveException
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 80 more
 {noformat}

> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.2.0
>
>
> Spark provides the HiveContext class to read data from the Hive metastore 
> directly, but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has been 
> released, it would be better to upgrade to support Hive 2.0.0.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2019-09-16 Thread JP Bordenave (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930685#comment-16930685
 ] 

JP Bordenave edited comment on SPARK-13446 at 9/16/19 4:43 PM:
---

Thanks for your help. As a test, I removed the hive-exec-2.3.6.jar directly 
from the spark/jars folder and from hdfs /spark-jars/,

but I got a NoClassDefFoundError for HiveException when I ran

spark.sql("show databases").show
{noformat}
Caused by: java.lang.reflect.InvocationTargetException: 
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at 
org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:189)
  ... 74 more
Caused by: java.lang.NoClassDefFoundError: 
org/apache/hadoop/hive/ql/metadata/HiveException
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.(HiveExternalCatalog.scala:71)
  ... 79 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.hive.ql.metadata.HiveException
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 80 more
 {noformat}
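A quick way to check from a spark-shell which jar, if any, still provides the missing class; this is only an illustrative snippet, not part of the reported steps:
{code:scala}
// Illustrative only: locate the missing class on the current classpath.
val missing = "org.apache.hadoop.hive.ql.metadata.HiveException"
val url = Option(Thread.currentThread().getContextClassLoader
  .getResource(missing.replace('.', '/') + ".class"))
println(url.map(_.toString).getOrElse(s"$missing not found on the classpath"))
{code}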


was (Author: jpbordi):
thanks for help, for test,  i remove directly the hive-exec-2.3.6.jar  from 
spark/jars folder and  hdfs  /spark-jars/

but i got a hivexception when i ran

spark.sql("show databases").show
{noformat}
Caused by: java.lang.reflect.InvocationTargetException: 
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at 
org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:189)
  ... 74 more
Caused by: java.lang.NoClassDefFoundError: 
org/apache/hadoop/hive/ql/metadata/HiveException
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.(HiveExternalCatalog.scala:71)
  ... 79 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.hive.ql.metadata.HiveException
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 80 more
 {noformat}

> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.2.0
>
>
> Spark provides the HiveContext class to read data from the Hive metastore 
> directly, but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has been 
> released, it would be better to upgrade to support Hive 2.0.0.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2019-09-16 Thread JP Bordenave (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930685#comment-16930685
 ] 

JP Bordenave edited comment on SPARK-13446 at 9/16/19 4:41 PM:
---

As a test, I removed hive-exec-2.3.6.jar from the spark/jars folder and from 
hdfs /spark-jars/,

but I got a NoClassDefFoundError for HiveException when I ran

spark.sql("show databases").show
{noformat}
Caused by: java.lang.reflect.InvocationTargetException: 
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at 
org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:189)
  ... 74 more
Caused by: java.lang.NoClassDefFoundError: 
org/apache/hadoop/hive/ql/metadata/HiveException
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.(HiveExternalCatalog.scala:71)
  ... 79 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.hive.ql.metadata.HiveException
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 80 more
 {noformat}


was (Author: jpbordi):
i remove hive-exec from spark/jars and  hdfs  /spark-jars/

i got a hivexception
{noformat}
Caused by: java.lang.reflect.InvocationTargetException: 
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at 
org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:189)
  ... 74 more
Caused by: java.lang.NoClassDefFoundError: 
org/apache/hadoop/hive/ql/metadata/HiveException
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.(HiveExternalCatalog.scala:71)
  ... 79 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.hive.ql.metadata.HiveException
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 80 more
 {noformat}

> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.2.0
>
>
> Spark provides the HiveContext class to read data from the Hive metastore 
> directly, but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has been 
> released, it would be better to upgrade to support Hive 2.0.0.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional 

[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2019-09-16 Thread JP Bordenave (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930685#comment-16930685
 ] 

JP Bordenave commented on SPARK-13446:
--

I removed hive-exec from spark/jars and from hdfs /spark-jars/.

I got a NoClassDefFoundError for HiveException:
{noformat}
Caused by: java.lang.reflect.InvocationTargetException: 
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at 
org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:189)
  ... 74 more
Caused by: java.lang.NoClassDefFoundError: 
org/apache/hadoop/hive/ql/metadata/HiveException
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.(HiveExternalCatalog.scala:71)
  ... 79 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.hive.ql.metadata.HiveException
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 80 more
 {noformat}

> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.2.0
>
>
> Spark provides the HiveContext class to read data from the Hive metastore 
> directly, but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has been 
> released, it would be better to upgrade to support Hive 2.0.0.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29098) Test both ANSI mode and Spark mode

2019-09-16 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-29098:
--

 Summary: Test both ANSI mode and Spark mode
 Key: SPARK-29098
 URL: https://issues.apache.org/jira/browse/SPARK-29098
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


The PostgreSQL test cases improve the test coverage of Spark SQL.
There are SQL files that produce different results depending on whether the ANSI 
flags (spark.sql.failOnIntegralTypeOverflow, spark.sql.parser.ansi.enabled, etc.) 
are enabled.
We should run the tests against these SQL files in both ANSI mode and Spark mode.
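For illustration only (assuming the ANSI flag is settable per session; the query is a placeholder rather than one of the real SQL files), the idea is to execute the same input under both settings and compare the results:
{code:scala}
import org.apache.spark.sql.SparkSession

// Illustrative only: run the same statement with the ANSI-related flag on and off.
// The flag name comes from the description above; the query is just a placeholder.
val spark = SparkSession.builder().master("local[2]").appName("ansi-vs-spark-mode").getOrCreate()
Seq("true", "false").foreach { ansi =>
  spark.conf.set("spark.sql.parser.ansi.enabled", ansi)
  val rows = spark.sql("SELECT 1 + 1 AS two").collect()
  println(s"ansi=$ansi -> ${rows.mkString(", ")}")
}
spark.stop()
{code}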



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-16 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930658#comment-16930658
 ] 

Liang-Chi Hsieh edited comment on SPARK-28927 at 9/16/19 3:36 PM:
--

Because you are using 2.2.1: spark.sql.execution.sortBeforeRepartition was 
introduced in 2.4 to fix the shuffle + repartition issue on DataFrames, so in 
Spark 2.2.1 the nondeterministic behavior can still be triggered when a 
repartition call follows a shuffle. I noticed that you call repartition after 
you read data using a SQL query, so maybe your sqltext contains a shuffle 
operation.

You can try to checkpoint your training data before fitting the ALS model, to 
make the data deterministic.
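A minimal, purely illustrative sketch of that suggestion (the checkpoint directory, input source, and column names are placeholders):
{code:scala}
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

// Illustrative only: checkpoint the training data so the input to ALS is
// deterministic even if an upstream shuffle + repartition is not.
val spark = SparkSession.builder().getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/als-checkpoint")  // placeholder path
val ratings = spark.read.parquet("/path/to/ratings")        // placeholder source
  .repartition(2000)
  .checkpoint()                                             // materializes a stable copy
val model = new ALS()
  .setUserCol("user").setItemCol("item").setRatingCol("rating") // placeholder columns
  .fit(ratings)
{code}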


was (Author: viirya):
Because you are using 2.2.1, spark.sql.execution.sortBeforeRepartition is 
introduced to fix shuffle + repartition issue on Dataframe after 2.4. The 
nondeterministic behavior can be triggered when a repartition call following a 
shuffle. I noticed that you have repartition after you read data using a sql. 
Maybe your sqltext has a shuffle operation.

You can try to checkpoint your training data, before fitting ALS model, to make 
the data deterministic.

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
> with 12 billion instances
> ---
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Attachments: image-2019-09-02-11-55-33-596.png
>
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> 

[jira] [Comment Edited] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-16 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930658#comment-16930658
 ] 

Liang-Chi Hsieh edited comment on SPARK-28927 at 9/16/19 3:35 PM:
--

Because you are using 2.2.1: spark.sql.execution.sortBeforeRepartition was 
introduced in 2.4 to fix the shuffle + repartition issue on DataFrames. The 
nondeterministic behavior can be triggered when a repartition call follows a 
shuffle. I noticed that you call repartition after you read data using a SQL 
query, so maybe your sqltext contains a shuffle operation.

You can try to checkpoint your training data before fitting the ALS model, to 
make the data deterministic.


was (Author: viirya):
Because you are using 2.2.1, spark.sql.execution.sortBeforeRepartition is 
introduced to fix shuffle + repartition issue on Dataframe. The 
nondeterministic behavior can be triggered when a repartition call following a 
shuffle. I noticed that you have repartition after you read data using a sql. 
Maybe your sqltext has a shuffle operation.

You can try to checkpoint your training data, before fitting ALS model, to make 
the data deterministic.

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
> with 12 billion instances
> ---
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Attachments: image-2019-09-02-11-55-33-596.png
>
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> 

[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-16 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930658#comment-16930658
 ] 

Liang-Chi Hsieh commented on SPARK-28927:
-

Because you are using 2.2.1: spark.sql.execution.sortBeforeRepartition was 
introduced to fix the shuffle + repartition issue on DataFrames. The 
nondeterministic behavior can be triggered when a repartition call follows a 
shuffle. I noticed that you call repartition after you read data using a SQL 
query, so maybe your sqltext contains a shuffle operation.

You can try to checkpoint your training data before fitting the ALS model, to 
make the data deterministic.

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
> with 12 billion instances
> ---
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Attachments: image-2019-09-02-11-55-33-596.png
>
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> This exception happened sometimes.  And we also found that the AUC metric was 
> not stable when evaluating the inner product of the user factors and the item 
> factors with the same dataset and configuration. AUC varied from 0.60 to 0.67 
> which was not stable for 

[jira] [Updated] (SPARK-29070) Make SparkLauncher log full spark-submit command line

2019-09-16 Thread Jeff Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Evans updated SPARK-29070:
---
Description: 
{{org.apache.spark.launcher.SparkLauncher}} wraps a {{ProcessBuilder}}, and 
builds up a full command line to {{spark-submit}} using a builder pattern.  
When {{startApplication}} is finally called, a full command line is 
materialized out of all the options, then invoked via the {{ProcessBuilder}}.

In scenarios where another application is submitting to Spark, it would be 
extremely useful from a support and debugging standpoint to be able to see the 
full {{spark-submit}} command that is actually used (so that the same 
submission can be tested standalone, arguments tweaked, etc.).  Currently, the 
only way this gets captured is to {{stderr}} if the 
{{SPARK_PRINT_LAUNCH_COMMAND}} environment variable is set.  This is cumbersome 
in the context of an application that is wrapping Spark and already using the 
APIs.

I propose simply making {{SparkSubmit}} log the full command line it is about 
to launch, so that clients can see it directly in their log files, rather than 
having to capture and search through {{stderr}}.

  was:
{{org.apache.spark.launcher.SparkLauncher}} wraps a {{ProcessBuilder}}, and 
builds up a full command line to {{spark-submit}} using a builder pattern.  
When {{startApplication}} is finally called, a 

In scenarios where another application is submitting to Spark, it would be 
extremely useful from a support and debugging standpoint to be able to see the 
full {{spark-submit}} command that is actually used (so that the same 
submission can be tested standalone, arguments tweaked, etc.).  Currently, the 
only way this gets captured is to {{stderr}} if the 
{{SPARK_PRINT_LAUNCH_COMMAND}} environment variable is set.  This is cumbersome 
in the context of an application that is wrapping Spark and already using the 
APIs.

I propose simply adding a getter method to {{SparkSubmit}} that allows clients 
to retrieve what the full command line will be, so they can log this however 
they wish (or do anything else with it).


> Make SparkLauncher log full spark-submit command line
> -
>
> Key: SPARK-29070
> URL: https://issues.apache.org/jira/browse/SPARK-29070
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.4.5
>Reporter: Jeff Evans
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> {{org.apache.spark.launcher.SparkLauncher}} wraps a {{ProcessBuilder}}, and 
> builds up a full command line to {{spark-submit}} using a builder pattern.  
> When {{startApplication}} is finally called, a full command line is 
> materialized out of all the options, then invoked via the {{ProcessBuilder}}.
> In scenarios where another application is submitting to Spark, it would be 
> extremely useful from a support and debugging standpoint to be able to see 
> the full {{spark-submit}} command that is actually used (so that the same 
> submission can be tested standalone, arguments tweaked, etc.).  Currently, 
> the only way this gets captured is to {{stderr}} if the 
> {{SPARK_PRINT_LAUNCH_COMMAND}} environment variable is set.  This is 
> cumbersome in the context of an application that is wrapping Spark and 
> already using the APIs.
> I propose simply making {{SparkSubmit}} log the full command line it is about 
> to launch, so that clients can see it directly in their log files, rather 
> than having to capture and search through {{stderr}}.
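To make the current workaround concrete, here is an illustrative sketch (paths and class names are placeholders) of capturing the command line today by setting SPARK_PRINT_LAUNCH_COMMAND in the child environment and redirecting its stderr:
{code:scala}
import java.io.File
import java.util.Collections
import org.apache.spark.launcher.SparkLauncher

// Illustrative only: the launched child prints its full spark-submit command to
// stderr when SPARK_PRINT_LAUNCH_COMMAND is set, so capture stderr somewhere searchable.
val handle = new SparkLauncher(Collections.singletonMap("SPARK_PRINT_LAUNCH_COMMAND", "1"))
  .setAppResource("/path/to/app.jar")                   // placeholder
  .setMainClass("com.example.Main")                     // placeholder
  .setMaster("local[2]")
  .redirectError(new File("/tmp/spark-submit-cmd.log"))
  .startApplication()
{code}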



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29070) Make SparkLauncher log full spark-submit command line

2019-09-16 Thread Jeff Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Evans updated SPARK-29070:
---
Summary: Make SparkLauncher log full spark-submit command line  (was: Allow 
SparkLauncher to return full spark-submit command line)

> Make SparkLauncher log full spark-submit command line
> -
>
> Key: SPARK-29070
> URL: https://issues.apache.org/jira/browse/SPARK-29070
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.4.5
>Reporter: Jeff Evans
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> {{org.apache.spark.launcher.SparkLauncher}} wraps a {{ProcessBuilder}}, and 
> builds up a full command line to {{spark-submit}} using a builder pattern.  
> When {{startApplication}} is finally called, a 
> In scenarios where another application is submitting to Spark, it would be 
> extremely useful from a support and debugging standpoint to be able to see 
> the full {{spark-submit}} command that is actually used (so that the same 
> submission can be tested standalone, arguments tweaked, etc.).  Currently, 
> the only way this gets captured is to {{stderr}} if the 
> {{SPARK_PRINT_LAUNCH_COMMAND}} environment variable is set.  This is 
> cumbersome in the context of an application that is wrapping Spark and 
> already using the APIs.
> I propose simply adding a getter method to {{SparkSubmit}} that allows 
> clients to retrieve what the full command line will be, so they can log this 
> however they wish (or do anything else with it).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28917) Jobs can hang because of race of RDD.dependencies

2019-09-16 Thread Imran Rashid (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930616#comment-16930616
 ] 

Imran Rashid commented on SPARK-28917:
--

I finally got some more info about this case -- they are not using 
checkpointing, nor touching dependencies.  It seems things work consistently 
once they move the call to {{rdd.cache()}} before touching the RDD with 
multiple threads, so it could still be that caching alone is enough to mess 
this up somehow.

Just by coincidence, another entirely separate group is reporting what looks to 
be a very similar bug with submitting jobs from multiple threads.  It's not 
exactly the same, though -- it doesn't have the orphaned stages in the logs, 
but it does have the repeated RDD registration.

> Jobs can hang because of race of RDD.dependencies
> -
>
> Key: SPARK-28917
> URL: https://issues.apache.org/jira/browse/SPARK-28917
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Imran Rashid
>Priority: Major
>
> {{RDD.dependencies}} stores the precomputed cache value, but it is not 
> thread-safe.  This can lead to a race where the value gets overwritten, but 
> the DAGScheduler gets stuck in an inconsistent state.  In particular, this 
> can happen when there is a race between the DAGScheduler event loop, and 
> another thread (eg. a user thread, if there is multi-threaded job submission).
> First, a job is submitted by the user, which then computes the result Stage 
> and its parents:
> https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L983
> Which eventually makes a call to {{rdd.dependencies}}:
> https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L519
> At the same time, the user could also touch {{rdd.dependencies}} in another 
> thread, which could overwrite the stored value because of the race.
> Then the DAGScheduler checks the dependencies *again* later on in the job 
> submission, via {{getMissingParentStages}}
> https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1025
> Because it will find new dependencies, it will create entirely different 
> stages.  Now the job has some orphaned stages which will never run.
> The symptoms of this are seeing disjoint sets of stages in the "Parents of 
> final stage" and the "Missing parents" messages on job submission, as well as 
> seeing repeated messages "Registered RDD X" for the same RDD id.  eg:
> {noformat}
> [INFO] 2019-08-15 23:22:31,570 org.apache.spark.SparkContext logInfo - 
> Starting job: count at XXX.scala:462
> ...
> [INFO] 2019-08-15 23:22:31,573 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Registering RDD 14 (repartition at XXX.scala:421)
> ...
> ...
> [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Got job 1 (count at XXX.scala:462) with 40 output partitions
> [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Final stage: ResultStage 5 (count at XXX.scala:462)
> [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Parents of final stage: List(ShuffleMapStage 4)
> [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Registering RDD 14 (repartition at XXX.scala:421)
> [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Missing parents: List(ShuffleMapStage 6)
> {noformat}
> Note that there is a similar issue with {{rdd.partitions}}. I don't see a way 
> it could mess up the scheduler (it seems to be used only for 
> {{rdd.partitions.length}}).  There is also an issue that {{rdd.storageLevel}} 
> is read and cached in the scheduler while it could be modified simultaneously 
> by the user in another thread.  Similarly, I can't see a way it could affect 
> the scheduler.
> *WORKAROUND*:
> (a) call {{rdd.dependencies}} while you know the RDD is only being touched 
> by one thread (e.g. in the thread that created it, or before you submit 
> multiple jobs touching that RDD from other threads). That value will then be 
> cached; see the sketch below.
> (b) don't submit jobs from multiple threads.
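> A minimal sketch of workaround (a), assuming a live {{SparkContext}}; the RDD, 
> thread-pool size, and job bodies below are illustrative only:
> {code:scala}
> // Sketch of workaround (a): force rdd.dependencies (and rdd.partitions) in the
> // creating thread before any multi-threaded job submission.
> import java.util.concurrent.Executors
> 
> import scala.concurrent.duration.Duration
> import scala.concurrent.{Await, ExecutionContext, Future}
> 
> import org.apache.spark.SparkContext
> 
> def countFromManyThreads(sc: SparkContext): Seq[Long] = {
>   val rdd = sc.parallelize(1 to 100000).repartition(8)  // has a shuffle dependency
> 
>   rdd.dependencies   // computes and caches the dependencies while single-threaded
>   rdd.partitions     // same for partitions, which has a similar (if milder) issue
> 
>   implicit val ec: ExecutionContext =
>     ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))
>   val jobs = (1 to 4).map(_ => Future(rdd.count()))
>   jobs.map(f => Await.result(f, Duration.Inf))
> }
> {code}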



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27495) SPIP: Support Stage level resource configuration and scheduling

2019-09-16 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930600#comment-16930600
 ] 

Thomas Graves commented on SPARK-27495:
---

The SPIP vote passed - 
https://mail-archives.apache.org/mod_mbox/spark-dev/201909.mbox/%3ccactgibw44ggts5apnlkuqsurnwqdejeadeiwfvdwl7ic3eh...@mail.gmail.com%3e

> SPIP: Support Stage level resource configuration and scheduling
> ---
>
> Key: SPARK-27495
> URL: https://issues.apache.org/jira/browse/SPARK-27495
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
> Objectives:
>  # Allow users to specify task and executor resource requirements at the 
> stage level. 
>  # Spark will use the stage level requirements to acquire the necessary 
> resources/executors and schedule tasks based on the per stage requirements.
> Users often have different resource requirements for different stages of 
> their application, so they want to be able to configure resources at the stage 
> level. For instance, you have a single job with 2 stages. The first stage 
> does some ETL, which requires a lot of tasks, each with a small amount of 
> memory and a single core. Then you have a second stage where you feed that ETL 
> data into an ML algorithm. The second stage only requires a few executors, but 
> each executor needs a lot of memory, GPUs, and many cores.  This feature 
> allows the user to specify the task and executor resource requirements for 
> the ETL stage and then change them for the ML stage of the job. 
> Resources include cpu, memory (on heap, overhead, pyspark, and off heap), and 
> extra resources (GPU/FPGA/etc.). It has the potential to allow for other 
> things like limiting the number of tasks per stage, specifying other 
> parameters for things like shuffle, etc. Initially I would propose we only 
> support resources as they exist now, so that we aren't adding extra scheduling 
> concerns at this point: task resources would be cpu and other resources 
> (GPU, FPGA), and executor resources would be cpu, memory, and extra 
> resources (GPU, FPGA, etc.). Changing the executor resources will rely on 
> dynamic allocation being enabled.
> Main use cases:
>  # An ML use case where the user does ETL and feeds it into an ML algorithm, 
> using the RDD API. This should work with barrier scheduling as well, once 
> barrier scheduling supports dynamic allocation.
>  # This adds the framework/API for Spark's own internal use.  In the future 
> (not covered by this SPIP), Catalyst could control the stage level resources 
> as it finds the need to change them between stages for different optimizations. 
> For instance, with the new columnar plugin to the query planner we can insert 
> stages into the plan that change something running on the CPU in row 
> format to running on the GPU in columnar format. This API would allow the 
> planner to make sure the stages that run on the GPU get the corresponding GPU 
> resources they need to run. Another possible use case is allowing Catalyst to 
> add more optimizations so that the user doesn’t need to configure container 
> sizes at all. If the optimizer/planner can handle that for the user, everyone 
> wins.
> This SPIP focuses on the RDD API, but we don’t exclude the Dataset API. I 
> think the Dataset API will require more changes because it specifically hides 
> the RDD from users behind the plans, and Catalyst can optimize the plan and 
> insert things into it. The only way I’ve found to make this work with 
> the Dataset API would be modifying all the plans to be able to push the 
> resource requirements down to where the RDDs are created, which I believe 
> would be a lot of change.  If other people know better options, it would be 
> great to hear them.
> *Q2.* What problem is this proposal NOT designed to solve?
> The initial implementation is not going to add Dataset APIs.
> We are starting with allowing users to specify a specific set of 
> task/executor resources and plan to design it to be extendable, but the first 
> implementation will not support changing generic SparkConf configs, only a 
> specific, limited set of resources.
> This initial version will have a programmatic API for specifying the resource 
> requirements per stage; we can perhaps add the ability to define profiles in 
> the configs later if it's useful.
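> As a purely hypothetical sketch of what that programmatic API could look like 
> (the builder names, {{withResources}}, and the resource amounts below are 
> illustrative assumptions, not something this SPIP pins down):
> {code:scala}
> // Hypothetical sketch only: ResourceProfileBuilder, ExecutorResourceRequests,
> // TaskResourceRequests and RDD.withResources are illustrative names, not an
> // API defined by this SPIP.
> import org.apache.spark.SparkContext
> import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}
> 
> def etlThenTrain(sc: SparkContext): Array[Double] = {
>   // Stage 1: plain ETL, many small tasks on the default executors.
>   val etl = sc.parallelize(1 to 1000000).map(_.toDouble / 3.0)
> 
>   // Stage 2: a few large, GPU-equipped executors and GPU-aware tasks.
>   val mlProfile = new ResourceProfileBuilder()
>     .require(new ExecutorResourceRequests().cores(8).memory("24g").resource("gpu", 2))
>     .require(new TaskResourceRequests().cpus(4).resource("gpu", 1))
>     .build()
> 
>   etl.repartition(8)
>     .withResources(mlProfile)                  // the new resources apply from this stage on
>     .mapPartitions(it => Iterator(it.sum))     // stand-in for the ML step
>     .collect()
> }
> {code}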
> *Q3.* How is it done today, and what are the limits of current practice?
> Currently this is done either by having multiple Spark jobs or by requesting 
> containers with the max resources needed for any part of the job.  To do this 
> 

[jira] [Resolved] (SPARK-29072) Properly track shuffle write time with refactor

2019-09-16 Thread Imran Rashid (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-29072.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25780
[https://github.com/apache/spark/pull/25780]

> Properly track shuffle write time with refactor
> ---
>
> Key: SPARK-29072
> URL: https://issues.apache.org/jira/browse/SPARK-29072
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Matt Cheah
>Assignee: Matt Cheah
>Priority: Major
> Fix For: 3.0.0
>
>
> In SPARK-28209, SPARK-28570, and SPARK-28571, we moved all the shuffle 
> writers onto the new shuffle writer plugin API. However, we accidentally 
> lost time-tracking metrics for shuffle writes in the process, particularly 
> for UnsafeShuffleWriter when writing with streams (without transferTo), as 
> well as for SortShuffleWriter.
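> For context, the usual technique is to wrap the underlying output stream and 
> accumulate elapsed time into the write metrics. A rough, generic sketch follows 
> (not Spark's actual implementation; the metrics holder and class names are 
> assumptions):
> {code:scala}
> import java.io.OutputStream
> 
> // Rough, generic sketch of stream-based write-time tracking; the metrics
> // holder and class names are assumptions, not Spark's ShuffleWriteMetrics code.
> final class WriteMetricsSketch {
>   var writeTimeNanos: Long = 0L
> }
> 
> final class TimeTrackingOutputStreamSketch(metrics: WriteMetricsSketch, out: OutputStream)
>     extends OutputStream {
> 
>   override def write(b: Int): Unit = timed(out.write(b))
>   override def write(b: Array[Byte], off: Int, len: Int): Unit = timed(out.write(b, off, len))
>   override def flush(): Unit = timed(out.flush())
>   override def close(): Unit = timed(out.close())
> 
>   // Measure the wall-clock time of each call and add it to the running total.
>   private def timed[T](body: => T): T = {
>     val start = System.nanoTime()
>     try body finally metrics.writeTimeNanos += System.nanoTime() - start
>   }
> }
> {code}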



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29072) Properly track shuffle write time with refactor

2019-09-16 Thread Imran Rashid (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-29072:


Assignee: Matt Cheah

> Properly track shuffle write time with refactor
> ---
>
> Key: SPARK-29072
> URL: https://issues.apache.org/jira/browse/SPARK-29072
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Matt Cheah
>Assignee: Matt Cheah
>Priority: Major
>
> In SPARK-28209, SPARK-28570, and SPARK-28571, we moved all the shuffle 
> writers onto the new shuffle writer plugin API. However, we accidentally 
> lost time-tracking metrics for shuffle writes in the process, particularly 
> for UnsafeShuffleWriter when writing with streams (without transferTo), as 
> well as for SortShuffleWriter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29061) Prints bytecode statistics in debugCodegen

2019-09-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29061:
---

Assignee: Takeshi Yamamuro

> Prints bytecode statistics in debugCodegen
> --
>
> Key: SPARK-29061
> URL: https://issues.apache.org/jira/browse/SPARK-29061
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> This ticket aims to print bytecode statistics (max class bytecode size, 
> max method bytecode size, and max constant pool size) for generated classes 
> in the debug output of {{debugCodegen}}.
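> Once available, a quick way to see the output (a sketch assuming a spark-shell 
> session where {{spark}} is in scope; the query itself is arbitrary):
> {code:scala}
> // Sketch: inspect the generated code -- and, with this change, the bytecode
> // statistics -- for a query. Assumes a spark-shell session with `spark` in scope.
> import org.apache.spark.sql.execution.debug._
> 
> val df = spark.range(0, 1000).selectExpr("id % 7 AS k").groupBy("k").count()
> df.debugCodegen()   // prints each generated class, followed by its statistics
> {code}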



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29061) Prints bytecode statistics in debugCodegen

2019-09-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29061.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25766
[https://github.com/apache/spark/pull/25766]

> Prints bytecode statistics in debugCodegen
> --
>
> Key: SPARK-29061
> URL: https://issues.apache.org/jira/browse/SPARK-29061
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.0
>
>
> This ticket aims to print bytecode statistics (max class bytecode size, 
> max method bytecode size, and max constant pool size) for generated classes 
> in the debug output of {{debugCodegen}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


