[jira] [Resolved] (SPARK-26499) JdbcUtils.makeGetter does not handle ByteType

2018-12-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26499.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23400
[https://github.com/apache/spark/pull/23400]

> JdbcUtils.makeGetter does not handle ByteType
> -
>
> Key: SPARK-26499
> URL: https://issues.apache.org/jira/browse/SPARK-26499
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Thomas D'Silva
>Assignee: Thomas D'Silva
>Priority: Major
> Fix For: 3.0.0
>
>
> I am trying to use the DataSource V2 API to read from a JDBC source. While 
> using {{JdbcUtils.resultSetToSparkInternalRows}} to create an internal row 
> from a ResultSet that has a column of type TINYINT, I ran into the following 
> exception:
> {code:java}
> java.lang.IllegalArgumentException: Unsupported type tinyint
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter(JdbcUtils.scala:502)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters$1.apply(JdbcUtils.scala:379)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters$1.apply(JdbcUtils.scala:379)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters(JdbcUtils.scala:379)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.<init>(JdbcUtils.scala:340)
> {code}
> This happens because ByteType is not handled in {{JdbcUtils.makeGetter}}.
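For context, the following is a minimal, hedged sketch of the kind of branch that is missing (the actual fix is in the pull request above; the helper name and surrounding structure here are illustrative, not the exact JdbcUtils code). makeGetter maps a Catalyst data type to a function that reads the corresponding JDBC column, and the ByteType/TINYINT case simply needs to call ResultSet.getByte.

{code:java}
import java.sql.ResultSet
import org.apache.spark.sql.types._

// Illustrative only: a simplified getter factory showing where a ByteType
// branch fits. The real JdbcUtils.makeGetter builds getters that write the
// value into an InternalRow.
def makeSimpleGetter(dt: DataType): (ResultSet, Int) => Any = dt match {
  case ByteType    => (rs: ResultSet, pos: Int) => rs.getByte(pos)   // TINYINT
  case ShortType   => (rs: ResultSet, pos: Int) => rs.getShort(pos)
  case IntegerType => (rs: ResultSet, pos: Int) => rs.getInt(pos)
  case other       =>
    throw new IllegalArgumentException(s"Unsupported type $other")
}
{code}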






[jira] [Resolved] (SPARK-23458) Flaky test: OrcQuerySuite

2018-12-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-23458.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 3.0.0

I'm closing this issue since it was resolved as part of SPARK-26427 and I 
haven't hit this failure since. Please feel free to reopen it if you see 
another instance of this failure.

>  Flaky test: OrcQuerySuite
> --
>
> Key: SPARK-23458
> URL: https://issues.apache.org/jira/browse/SPARK-23458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
> Environment: AMPLab Jenkins
>Reporter: Marco Gaido
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> Sometimes we have UT failures with the following stacktrace:
> {code:java}
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01396221801 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcTest.eventually(OrcTest.scala:45)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcTest.eventually(OrcTest.scala:45)
>   at 
> org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcQuerySuite.afterEach(OrcQuerySuite.scala:583)
>   at 
> org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375)
>   at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454)
>   at org.scalatest.Status$class.withAfterEffect(Status.scala:375)
>   at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcQuerySuite.runTest(OrcQuerySuite.scala:583)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite$class.run(Suite.scala:1147)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: sbt.ForkMain$ForkError: java.lang.IllegalStateException: There are 
> 1 

[jira] [Resolved] (SPARK-23390) Flaky test: FileBasedDataSourceSuite

2018-12-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-23390.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

I'm closing this issue since it was resolved as part of SPARK-26427 and I 
haven't hit this failure since in our Jenkins environment. Please feel free 
to reopen it if there is another instance of this failure.

> Flaky test: FileBasedDataSourceSuite
> 
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Sameer Agarwal
>Assignee: Dongjoon Hyun
>Priority: Critical
> Fix For: 3.0.0
>
>
> *RECENT HISTORY*
> [http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.FileBasedDataSourceSuite_name=%28It+is+not+a+test+it+is+a+sbt.testing.SuiteSelector%29]
>  
> 
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code:java}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/]
> From a very quick look, these failures seem to be correlated with 
> [https://github.com/apache/spark/pull/20479] (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
> {code:java}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might just be a false correlation, the frequency of these 
> test failures has increased considerably in 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/]
>  after [https://github.com/apache/spark/pull/20562] (cc 
> [~feng...@databricks.com]) was merged.
> The following is Parquet leakage.
> {code:java}
> Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:538)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
> {code}
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/322/]
>  (May 3rd)
>  - 
> 
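For readers landing on these flaky-test reports, here is a hedged sketch of the mechanism behind the "possibly leaked file streams" message (names are illustrative; the real code lives in org.apache.spark.DebugFilesystem and the shared session's afterEach). Each stream opened through the debug filesystem records the stack trace that created it, and afterEach, wrapped in ScalaTest's eventually, fails if any stream is still open.

{code:java}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// Illustrative leak tracker, not the actual Spark test-support code.
object LeakTracker {
  private val openStreams = new ConcurrentHashMap[AnyRef, Throwable]()

  // Remember where each stream was opened.
  def addOpenStream(stream: AnyRef): Unit = openStreams.put(stream, new Throwable)

  // Forget streams that were closed properly.
  def removeOpenStream(stream: AnyRef): Unit = openStreams.remove(stream)

  // Called after each test: fail with the creating stack trace if anything
  // is still open, producing a message like the one in the reports above.
  def assertNoOpenStreams(): Unit = {
    if (!openStreams.isEmpty) {
      throw new IllegalStateException(
        s"There are ${openStreams.size} possibly leaked file streams.",
        openStreams.values.asScala.head)
    }
  }
}
{code}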

[jira] [Assigned] (SPARK-26499) JdbcUtils.makeGetter does not handle ByteType

2018-12-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26499:


Assignee: Thomas D'Silva

> JdbcUtils.makeGetter does not handle ByteType
> -
>
> Key: SPARK-26499
> URL: https://issues.apache.org/jira/browse/SPARK-26499
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Thomas D'Silva
>Assignee: Thomas D'Silva
>Priority: Major
> Fix For: 3.0.0
>
>
> I am trying to use the DataSource V2 API to read from a JDBC source. While 
> using {{JdbcUtils.resultSetToSparkInternalRows}} to create an internal row 
> from a ResultSet that has a column of type TINYINT, I ran into the following 
> exception:
> {code:java}
> java.lang.IllegalArgumentException: Unsupported type tinyint
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter(JdbcUtils.scala:502)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters$1.apply(JdbcUtils.scala:379)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters$1.apply(JdbcUtils.scala:379)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetters(JdbcUtils.scala:379)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.<init>(JdbcUtils.scala:340)
> {code}
> This happens because ByteType is not handled in {{JdbcUtils.makeGetter}}.






[jira] [Updated] (SPARK-25903) Flaky test: BarrierTaskContextSuite.throw exception on barrier() call timeout

2018-12-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25903:
--
Description: 
We hit this in our internal builds.

{noformat}
Expected exception org.apache.spark.SparkException to be thrown, but no 
exception was thrown
Stacktrace
  org.scalatest.exceptions.TestFailedException: Expected exception 
org.apache.spark.SparkException to be thrown, but no exception was thrown
  at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
  at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
  at org.scalatest.Assertions$class.intercept(Assertions.scala:822)
  at org.scalatest.FunSuite.intercept(FunSuite.scala:1560)
  at 
org.apache.spark.scheduler.BarrierTaskContextSuite$$anonfun$7.apply(BarrierTaskContextSuite.scala:94)
  at 
org.apache.spark.scheduler.BarrierTaskContextSuite$$anonfun$7.apply(BarrierTaskContextSuite.scala:76)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
{noformat}

The problem from the logs is that the first task to call {{barrier()}} took a 
while, and at that point the "slow" task was already running, so the sleep 
finishes before the 2s timeout runs out.

- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/169/

  was:
We hit this in our internal builds.

{noformat}
Expected exception org.apache.spark.SparkException to be thrown, but no 
exception was thrown
Stacktrace
  org.scalatest.exceptions.TestFailedException: Expected exception 
org.apache.spark.SparkException to be thrown, but no exception was thrown
  at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
  at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
  at org.scalatest.Assertions$class.intercept(Assertions.scala:822)
  at org.scalatest.FunSuite.intercept(FunSuite.scala:1560)
  at 
org.apache.spark.scheduler.BarrierTaskContextSuite$$anonfun$7.apply(BarrierTaskContextSuite.scala:94)
  at 
org.apache.spark.scheduler.BarrierTaskContextSuite$$anonfun$7.apply(BarrierTaskContextSuite.scala:76)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
{noformat}

The problem from the logs is that the first task to call {{barrier()}} took a 
while, and at that point the "slow" task was already running, so the sleep 
finishes before the 2s timeout runs out.



> Flaky test: BarrierTaskContextSuite.throw exception on barrier() call timeout
> -
>
> Key: SPARK-25903
> URL: https://issues.apache.org/jira/browse/SPARK-25903
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> We hit this in our internal builds.
> {noformat}
> Expected exception org.apache.spark.SparkException to be thrown, but no 
> exception was thrown
> Stacktrace
>   org.scalatest.exceptions.TestFailedException: Expected exception 
> org.apache.spark.SparkException to be thrown, but no exception was thrown
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at org.scalatest.Assertions$class.intercept(Assertions.scala:822)
>   at org.scalatest.FunSuite.intercept(FunSuite.scala:1560)
>   at 
> org.apache.spark.scheduler.BarrierTaskContextSuite$$anonfun$7.apply(BarrierTaskContextSuite.scala:94)
>   at 
> org.apache.spark.scheduler.BarrierTaskContextSuite$$anonfun$7.apply(BarrierTaskContextSuite.scala:76)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> {noformat}
> The problem from the logs is that the first task to call {{barrier()}} took a 
> while, and at that point the "slow" task was already running, so the sleep 
> finishes before the 2s timeout runs out.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/169/
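To make the timing issue concrete, here is a hedged sketch of the test's shape (illustrative, not the exact BarrierTaskContextSuite code; the sleep duration and the barrier timeout configuration are assumptions). One task sleeps before calling barrier(), so the other task's barrier() call is expected to time out; if that first barrier() call is itself delayed, the sleep can end before the timeout fires, no SparkException is thrown, and intercept fails exactly as in the stack trace above.

{code:java}
import org.apache.spark.{BarrierTaskContext, SparkException}

// Illustrative sketch of the flaky pattern (assumes a short, test-configured
// barrier timeout and an existing SparkContext `sc`).
val rdd = sc.parallelize(1 to 2, numSlices = 2)
val ex = intercept[SparkException] {   // expected: barrier() times out
  rdd.barrier().mapPartitions { iter =>
    val context = BarrierTaskContext.get()
    if (context.partitionId() == 0) {
      // The "slow" task. If the other task's barrier() call starts late, this
      // sleep can end before the timeout expires, no exception is thrown, and
      // the intercept above fails -- the flakiness described in this issue.
      Thread.sleep(5000)
    }
    context.barrier()
    iter
  }.collect()
}
{code}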






[jira] [Updated] (SPARK-25903) Flaky test: BarrierTaskContextSuite.throw exception on barrier() call timeout

2018-12-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25903:
--
Priority: Major  (was: Minor)

> Flaky test: BarrierTaskContextSuite.throw exception on barrier() call timeout
> -
>
> Key: SPARK-25903
> URL: https://issues.apache.org/jira/browse/SPARK-25903
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> We hit this in our internal builds.
> {noformat}
> Expected exception org.apache.spark.SparkException to be thrown, but no 
> exception was thrown
> Stacktrace
>   org.scalatest.exceptions.TestFailedException: Expected exception 
> org.apache.spark.SparkException to be thrown, but no exception was thrown
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at org.scalatest.Assertions$class.intercept(Assertions.scala:822)
>   at org.scalatest.FunSuite.intercept(FunSuite.scala:1560)
>   at 
> org.apache.spark.scheduler.BarrierTaskContextSuite$$anonfun$7.apply(BarrierTaskContextSuite.scala:94)
>   at 
> org.apache.spark.scheduler.BarrierTaskContextSuite$$anonfun$7.apply(BarrierTaskContextSuite.scala:76)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> {noformat}
> The problem from the logs is that the first task to call {{barrier()}} took a 
> while, and at that point the "slow" task was already running, so the sleep 
> finishes before the 2s timeout runs out.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/169/






[jira] [Updated] (SPARK-25903) Flaky test: BarrierTaskContextSuite.throw exception on barrier() call timeout

2018-12-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25903:
--
Affects Version/s: 3.0.0

> Flaky test: BarrierTaskContextSuite.throw exception on barrier() call timeout
> -
>
> Key: SPARK-25903
> URL: https://issues.apache.org/jira/browse/SPARK-25903
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> We hit this in our internal builds.
> {noformat}
> Expected exception org.apache.spark.SparkException to be thrown, but no 
> exception was thrown
> Stacktrace
>   org.scalatest.exceptions.TestFailedException: Expected exception 
> org.apache.spark.SparkException to be thrown, but no exception was thrown
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at org.scalatest.Assertions$class.intercept(Assertions.scala:822)
>   at org.scalatest.FunSuite.intercept(FunSuite.scala:1560)
>   at 
> org.apache.spark.scheduler.BarrierTaskContextSuite$$anonfun$7.apply(BarrierTaskContextSuite.scala:94)
>   at 
> org.apache.spark.scheduler.BarrierTaskContextSuite$$anonfun$7.apply(BarrierTaskContextSuite.scala:76)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> {noformat}
> The problem from the logs is that the first task to call {{barrier()}} took a 
> while, and at that point the "slow" task was already running, so the sleep 
> finishes before the 2s timeout runs out.






[jira] [Assigned] (SPARK-26514) Introduce a new conf to improve the task parallelism

2018-12-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26514:


Assignee: Apache Spark

> Introduce a new conf to improve the task parallelism
> 
>
> Key: SPARK-26514
> URL: https://issues.apache.org/jira/browse/SPARK-26514
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.4.0
>Reporter: zhoukang
>Assignee: Apache Spark
>Priority: Major
>
> Currently, we have a conf below.
> {code:java}
> private[spark] val CPUS_PER_TASK = 
> ConfigBuilder("spark.task.cpus").intConf.createWithDefault(1)
> {code}
> For some applications which are not CPU-intensive, we may want to let one CPU 
> run more than one task.
> Like:
> {code:java}
> private[spark] val TASKS_PER_CPU = 
> ConfigBuilder("spark.cpu.tasks").intConf.createWithDefault(1)
> {code}
> This can improve performance for some applications and also improve 
> resource utilization.
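For illustration, a hedged sketch of how such a setting could interact with the number of concurrent task slots on an executor (the function name and the scheduler integration are assumptions, not the proposed patch):

{code:java}
// Illustrative only: with spark.task.cpus = cpusPerTask and a hypothetical
// spark.cpu.tasks = tasksPerCpu, an executor with `cores` CPUs could run
// roughly this many tasks concurrently.
def taskSlots(cores: Int, cpusPerTask: Int, tasksPerCpu: Int): Int = {
  require(cores > 0 && cpusPerTask > 0 && tasksPerCpu > 0)
  (cores / cpusPerTask) * tasksPerCpu
}

// e.g. 8 cores, spark.task.cpus=1, spark.cpu.tasks=2 -> 16 concurrent tasks
taskSlots(8, 1, 2)
{code}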






[jira] [Assigned] (SPARK-26514) Introduce a new conf to improve the task parallelism

2018-12-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26514:


Assignee: (was: Apache Spark)

> Introduce a new conf to improve the task parallelism
> 
>
> Key: SPARK-26514
> URL: https://issues.apache.org/jira/browse/SPARK-26514
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.4.0
>Reporter: zhoukang
>Priority: Major
>
> Currently, we have a conf below.
> {code:java}
> private[spark] val CPUS_PER_TASK = 
> ConfigBuilder("spark.task.cpus").intConf.createWithDefault(1)
> {code}
> For some applications which are not CPU-intensive, we may want to let one CPU 
> run more than one task.
> Like:
> {code:java}
> private[spark] val TASKS_PER_CPU = 
> ConfigBuilder("spark.cpu.tasks").intConf.createWithDefault(1)
> {code}
> This can improve performance for some applications and also improve 
> resource utilization.






[jira] [Updated] (SPARK-26215) define reserved keywords after SQL standard

2018-12-31 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-26215:

Target Version/s: 3.0.0

> define reserved keywords after SQL standard
> ---
>
> Key: SPARK-26215
> URL: https://issues.apache.org/jira/browse/SPARK-26215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> There are 2 kinds of SQL keywords: reserved and non-reserved. Reserved 
> keywords can't be used as identifiers.
> In Spark SQL, we are too tolerant about non-reserved keywords. A lot of 
> keywords are non-reserved, and sometimes this causes ambiguity (IIRC we hit a 
> problem when improving the INTERVAL syntax).
> I think it would be better to just follow other databases or the SQL standard 
> to define reserved keywords, so that we don't need to think very hard about 
> how to avoid ambiguity.
> For reference: https://www.postgresql.org/docs/8.1/sql-keywords-appendix.html






[jira] [Created] (SPARK-26514) Introduce a new conf to improve the task parallelism

2018-12-31 Thread zhoukang (JIRA)
zhoukang created SPARK-26514:


 Summary: Introduce a new conf to improve the task parallelism
 Key: SPARK-26514
 URL: https://issues.apache.org/jira/browse/SPARK-26514
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0, 2.1.0
Reporter: zhoukang


Currently, we have a conf below.
{code:java}
private[spark] val CPUS_PER_TASK = 
ConfigBuilder("spark.task.cpus").intConf.createWithDefault(1)
{code}
For some applications which are not CPU-intensive, we may want to let one CPU 
run more than one task.
Like:
{code:java}
private[spark] val TASKS_PER_CPU = 
ConfigBuilder("spark.cpu.tasks").intConf.createWithDefault(1)
{code}
This can improve performance for some applications and also improve 
resource utilization.







[jira] [Commented] (SPARK-26511) java.lang.ClassCastException error when loading Spark MLlib model from parquet file

2018-12-31 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731507#comment-16731507
 ] 

Hyukjin Kwon commented on SPARK-26511:
--

Can you attach a self-runnable reproducer, please? That would make it easier to 
investigate.

> java.lang.ClassCastException error when loading Spark MLlib model from 
> parquet file
> ---
>
> Key: SPARK-26511
> URL: https://issues.apache.org/jira/browse/SPARK-26511
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: Amy Koh
>Priority: Major
> Attachments: repro.zip
>
>
> When I tried to load a decision tree model from a parquet file, the following 
> error is thrown. 
> {code:bash}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.mllib.tree.model.DecisionTreeModel.load. : 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 
> (TID 2, localhost, executor driver): java.lang.ClassCastException: class 
> java.lang.Double cannot be cast to class java.lang.Integer (java.lang.Double 
> and java.lang.Integer are in module java.base of loader 'bootstrap') at 
> scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101) at 
> org.apache.spark.sql.Row$class.getInt(Row.scala:223) at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(rows.scala:165) 
> at 
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$SplitData$.apply(DecisionTreeModel.scala:171)
>  at 
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$NodeData$.apply(DecisionTreeModel.scala:195)
>  at 
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$9.apply(DecisionTreeModel.scala:247)
>  at 
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$9.apply(DecisionTreeModel.scala:247)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
>  at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  at java.base/java.lang.Thread.run(Thread.java:834) Driver stacktrace: at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) 
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>  at scala.Option.foreach(Option.scala:257) at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2087) at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at 
> org.apache.spark.rdd.RDD.collect(RDD.scala:935) at 
> 

[jira] [Updated] (SPARK-26505) Catalog class Function is missing "database" field

2018-12-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26505:
-
Target Version/s:   (was: 3.0.0)

> Catalog class Function is missing "database" field
> --
>
> Key: SPARK-26505
> URL: https://issues.apache.org/jira/browse/SPARK-26505
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Devin Boyer
>Priority: Minor
>
> This change fell out of the review of 
> [https://github.com/apache/spark/pull/20658], which is the implementation of 
> https://issues.apache.org/jira/browse/SPARK-23488. The Scala Catalog class 
> [Function|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.catalog.Function]
>  contains a `database` attribute, while the [Python 
> version|https://github.com/apache/spark/blob/v2.4.0/python/pyspark/sql/catalog.py#L32]
>  does not.
>  
> To be consistent, it would likely be best to add the `database` attribute to 
> the Python class. This would be a breaking API change, though (as discussed 
> in [this PR 
> comment|https://github.com/apache/spark/pull/20658#issuecomment-368561007]),
>  so it would have to be made for Spark 3.0.0, the next major version where 
> breaking API changes can occur.






[jira] [Commented] (SPARK-26505) Catalog class Function is missing "database" field

2018-12-31 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731510#comment-16731510
 ] 

Hyukjin Kwon commented on SPARK-26505:
--

(Please avoid setting the target version; it is usually reserved for 
committers.)

> Catalog class Function is missing "database" field
> --
>
> Key: SPARK-26505
> URL: https://issues.apache.org/jira/browse/SPARK-26505
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Devin Boyer
>Priority: Minor
>
> This change fell out of the review of 
> [https://github.com/apache/spark/pull/20658], which is the implementation of 
> https://issues.apache.org/jira/browse/SPARK-23488. The Scala Catalog class 
> [Function|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.catalog.Function]
>  contains a `database` attribute, while the [Python 
> version|https://github.com/apache/spark/blob/v2.4.0/python/pyspark/sql/catalog.py#L32]
>  does not.
>  
> To be consistent, it would likely be best to add the `database` attribute to 
> the Python class. This would be a breaking API change, though (as discussed 
> in [this PR 
> comment|https://github.com/apache/spark/pull/20658#issuecomment-368561007]),
>  so it would have to be made for Spark 3.0.0, the next major version where 
> breaking API changes can occur.






[jira] [Resolved] (SPARK-26510) Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 'createOrReplaceTempView'

2018-12-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26510.
--
Resolution: Cannot Reproduce

This is fixed in the current master. It would be great if the JIRA that fixed 
this problem were identified and backported if applicable.

> Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 
> 'createOrReplaceTempView'
> --
>
> Key: SPARK-26510
> URL: https://issues.apache.org/jira/browse/SPARK-26510
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Hagai Attias
>Priority: Major
>
> It seems that there's a change of behaviour between 1.6 and 2.3 when caching 
> a Dataframe and saving it as a temp table. In 1.6, the following code 
> executed {{printUDF}} once. The equivalent code in 2.3 (or even the same 
> code, as is) executes it twice.
>  
> {code:java|title=RegisterTest_spark1.6.scala|borderStyle=solid}
>  
> val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
> val schema = StructType(StructField("num", IntegerType) :: Nil)
> val df1 = sql.createDataFrame(rdd, schema)
> df1.registerTempTable("data_table")
> sql.udf.register("printUDF", (x:Int) => {print(x)
>   x
> })
> val df2 = sql.sql("select printUDF(num) result from data_table").cache()
> df2.collect() //execute cache
> df2.registerTempTable("cached_table")
> val df3 = sql.sql("select result from cached_table")
> df3.collect()
> {code}
> {code:java|title=RegisterTest_spark2.3.scala|borderStyle=solid}
>  
> val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
> val schema = StructType(StructField("num", IntegerType) :: Nil)
> val df1 = session.createDataFrame(rdd, schema)
> df1.createOrReplaceTempView("data_table")
> session.udf.register("printUDF", (x:Int) => {print(x)
>   x
> })
> val df2 = session.sql("select printUDF(num) result from data_table").cache()
> df2.collect() //execute cache
> df2.createOrReplaceTempView("cached_table")
> val df3 = session.sql("select result from cached_table")
> df3.collect()
> {code}
>  
> 1.6 prints `123` while 2.3 prints `123123`, thus evaluating the dataframe 
> twice.
> I managed to mitigate this by skipping the temporary table and selecting 
> directly from the cached dataframe, but I was wondering if this is expected 
> behavior / a known issue.
>  
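For reference, a minimal sketch of the mitigation the reporter mentions (continuing the 2.3 snippet above and reusing its df2; illustrative, not a confirmed fix):

{code:java}
// Select directly from the cached DataFrame instead of going through the
// temp view, so printUDF is not evaluated a second time.
val df3 = df2.select("result")
df3.collect()
{code}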






[jira] [Commented] (SPARK-26512) Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?

2018-12-31 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731503#comment-16731503
 ] 

Hyukjin Kwon commented on SPARK-26512:
--

Please avoid setting Blocker priority; it is reserved for committers.

> Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?
> ---
>
> Key: SPARK-26512
> URL: https://issues.apache.org/jira/browse/SPARK-26512
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, YARN
>Affects Versions: 2.4.0
> Environment: operating system : Windows 10
> Spark Version : 2.4.0
> Hadoop Version : 2.8.3
>Reporter: Anubhav Jain
>Priority: Blocker
>  Labels: windows
> Attachments: log.png
>
>
> I have installed Hadoop version 2.8.3 in my Windows 10 environment and it is 
> working fine. Now when I try to install Apache Spark (version 2.4.0) with 
> YARN as the cluster manager, it does not work. When I try to submit a Spark 
> job using spark-submit for testing, it appears under the ACCEPTED tab in the 
> YARN UI and then fails.






[jira] [Commented] (SPARK-26512) Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?

2018-12-31 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731506#comment-16731506
 ] 

Hyukjin Kwon commented on SPARK-26512:
--

This looks related to a Netty version conflict.

> Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?
> ---
>
> Key: SPARK-26512
> URL: https://issues.apache.org/jira/browse/SPARK-26512
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, YARN
>Affects Versions: 2.4.0
> Environment: operating system : Windows 10
> Spark Version : 2.4.0
> Hadoop Version : 2.8.3
>Reporter: Anubhav Jain
>Priority: Major
>  Labels: windows
> Attachments: log.png
>
>
> I have installed Hadoop version 2.8.3 in my Windows 10 environment and it is 
> working fine. Now when I try to install Apache Spark (version 2.4.0) with 
> YARN as the cluster manager, it does not work. When I try to submit a Spark 
> job using spark-submit for testing, it appears under the ACCEPTED tab in the 
> YARN UI and then fails.






[jira] [Updated] (SPARK-26512) Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?

2018-12-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26512:
-
Flags:   (was: Important)

> Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?
> ---
>
> Key: SPARK-26512
> URL: https://issues.apache.org/jira/browse/SPARK-26512
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, YARN
>Affects Versions: 2.4.0
> Environment: operating system : Windows 10
> Spark Version : 2.4.0
> Hadoop Version : 2.8.3
>Reporter: Anubhav Jain
>Priority: Major
>  Labels: windows
> Attachments: log.png
>
>
> I have installed Hadoop version 2.8.3 in my Windows 10 environment and it is 
> working fine. Now when I try to install Apache Spark (version 2.4.0) with 
> YARN as the cluster manager, it does not work. When I try to submit a Spark 
> job using spark-submit for testing, it appears under the ACCEPTED tab in the 
> YARN UI and then fails.






[jira] [Updated] (SPARK-26512) Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?

2018-12-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26512:
-
Priority: Major  (was: Blocker)

> Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?
> ---
>
> Key: SPARK-26512
> URL: https://issues.apache.org/jira/browse/SPARK-26512
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, YARN
>Affects Versions: 2.4.0
> Environment: operating system : Windows 10
> Spark Version : 2.4.0
> Hadoop Version : 2.8.3
>Reporter: Anubhav Jain
>Priority: Major
>  Labels: windows
> Attachments: log.png
>
>
> I have installed Hadoop version 2.8.3 in my Windows 10 environment and it is 
> working fine. Now when I try to install Apache Spark (version 2.4.0) with 
> YARN as the cluster manager, it does not work. When I try to submit a Spark 
> job using spark-submit for testing, it appears under the ACCEPTED tab in the 
> YARN UI and then fails.






[jira] [Commented] (SPARK-26413) SPIP: RDD Arrow Support in Spark Core and PySpark

2018-12-31 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731500#comment-16731500
 ] 

Hyukjin Kwon commented on SPARK-26413:
--

I agree the usage can be limited for now; however, it looks like that's already 
what's going to happen if we add {{RDD[ArrowTable]}} <-> {{DataFrame}}.
I thought we were going to reuse this code path.

I vaguely assume that the API suggested in the JIRA and the code below are 
basically similar:

{code}
def arrowTableToRows(table: Iterator[ArrowTable]): Iterator[Row] = {
  val arrowTable = table.next()
  assert(!table.hasNext)
  // convert the Arrow table to rows
}

def rowsToArrowTable(rows: Iterator[Row]): Iterator[ArrowTable] = {
  // convert the rows to an Arrow table
  Iterator.single(...)
}
{code}

so that you can use:

{code}
val arrowRDD = df.rdd.mapPartitions(rowsToArrowTable)
val df2 = arrowRDD.mapPartitions(arrowTableToRows).toDF
{code}

Like:

{code}
val df = spark.range(100).repartition(10)
val rdd = df.rdd.mapPartitions(iter => Iterator.single(iter.toArray))
val df2 = rdd.mapPartitions(iter => iter.next.iterator).toDF
df2.show()
{code}

If the advantage of adding it as an RDD API is mainly to avoid extra RDD 
operations (correct me if I'm mistaken), I doubt we should add them as RDD 
APIs if this can be resolved by adding, for instance, {{arrowTableToRows}} 
and {{rowsToArrowTable}}.


> SPIP: RDD Arrow Support in Spark Core and PySpark
> -
>
> Key: SPARK-26413
> URL: https://issues.apache.org/jira/browse/SPARK-26413
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Richard Whitcomb
>Priority: Minor
>
> h2. Background and Motivation
> Arrow is becoming a standard interchange format for columnar structured 
> data. This is already true in Spark with the use of Arrow in the Pandas UDF 
> functions in the DataFrame API.
> However the current implementation of arrow in spark is limited to two use 
> cases.
>  * Pandas UDF that allows for operations on one or more columns in the 
> DataFrame API.
>  * Collect as Pandas which pulls back the entire dataset to the driver in a 
> Pandas Dataframe.
> What is still hard however is making use of all of the columns in a Dataframe 
> while staying distributed across the workers.  The only way to do this 
> currently is to drop down into RDDs and collect the rows into a dataframe. 
> However pickling is very slow and the collecting is expensive.
> The proposal is to extend spark in a way that allows users to operate on an 
> Arrow Table fully while still making use of Spark's underlying technology.  
> Some examples of possibilities with this new API. 
>  * Pass the Arrow Table with Zero Copy to PyTorch for predictions.
>  * Pass to Nvidia Rapids for an algorithm to be run on the GPU.
>  * Distribute data across many GPUs making use of the new Barriers API.
> h2. Target users and personas
> ML, data scientists, and future library authors.
> h2. Goals
>  * Conversion from any Dataset[Row] or PySpark Dataframe to RDD[Table]
>  * Conversion back from any RDD[Table] to Dataset[Row], RDD[Row], Pyspark 
> Dataframe
>  * Open the possibilities to tighter integration between Arrow/Pandas/Spark 
> especially at a library level.
> h2. Non-Goals
>  * Not creating a new API but instead using existing APIs.
> h2. Proposed API changes
> h3. Data Objects
> case class ArrowTable(schema: Schema, batches: Iterable[ArrowRecordBatch])
> h3. Dataset.scala
> {code:java}
> // Converts a Dataset to an RDD of Arrow Tables
> // Each RDD row is an Iterable of Arrow batches.
> def arrowRDD: RDD[ArrowTable]
>  
> // Utility Function to convert to RDD Arrow Table for PySpark
> private[sql] def javaToPythonArrow: JavaRDD[Array[Byte]]
> {code}
> h3. RDD.scala
> {code:java}
>  // Converts RDD[ArrowTable] to an Dataframe by inspecting the Arrow Schema
>  def arrowToDataframe(implicit ev: T <:< ArrowTable): Dataframe
>   
>  // Converts RDD[ArrowTable] to an RDD of Rows
>  def arrowToRDD(implicit ev: T <:< ArrowTable): RDD[Row]{code}
> h3. Serializers.py
> {code:java}
> # Serializer that takes a Serialized Arrow Tables and returns a pyarrow Table.
> class ArrowSerializer(FramedSerializer)
> {code}
> h3. RDD.py
> {code}
> # New RDD Class that has an RDD[ArrowTable] behind it and uses the new 
> ArrowSerializer instead of the normal Pickle Serializer
> class ArrowRDD(RDD){code}
>  
> h3. Dataframe.py
> {code}
> # New function that converts a PySpark DataFrame into an ArrowRDD
> def arrow(self):
> {code}
>  
> h2. Example API Usage
> h3. Pyspark
> {code}
> # Select a Single Column Using Pandas
> def map_table(arrow_table):
>   import pyarrow as pa
>   pdf = arrow_table.to_pandas()
>   pdf = pdf[['email']]
>   return pa.Table.from_pandas(pdf)
> # Convert to Arrow RDD, map over 

[jira] [Reopened] (SPARK-26339) Behavior of reading files that start with underscore is confusing

2018-12-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-26339:
--

> Behavior of reading files that start with underscore is confusing
> -
>
> Key: SPARK-26339
> URL: https://issues.apache.org/jira/browse/SPARK-26339
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Keiichi Hirobe
>Assignee: Keiichi Hirobe
>Priority: Minor
>
> The behavior of reading files that start with an underscore is as follows.
>  # spark.read (no schema) throws an exception whose message is confusing.
>  # spark.read (user-specified schema) successfully reads, but the content is 
> empty.
> Example files are as follows.
>  The same behavior occurred when I read JSON files.
> {code:bash}
> $ cat test.csv
> test1,10
> test2,20
> $ cp test.csv _test.csv
> $ ./bin/spark-shell  --master local[2]
> {code}
> Spark shell snippet for reproduction:
> {code:java}
> scala> val df=spark.read.csv("test.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string]
> scala> df.show()
> +-+---+
> |  _c0|_c1|
> +-+---+
> |test1| 10|
> |test2| 20|
> +-+---+
> scala> val df = spark.read.schema("test STRING, number INT").csv("test.csv")
> df: org.apache.spark.sql.DataFrame = [test: string, number: int]
> scala> df.show()
> +-+--+
> | test|number|
> +-+--+
> |test1|10|
> |test2|20|
> +-+--+
> scala> val df=spark.read.csv("_test.csv")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It 
> must be specified manually.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$13(DataSource.scala:185)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:185)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:231)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:219)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:625)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:478)
>   ... 49 elided
> scala> val df=spark.read.schema("test STRING, number INT").csv("_test.csv")
> df: org.apache.spark.sql.DataFrame = [test: string, number: int]
> scala> df.show()
> ++--+
> |test|number|
> ++--+
> ++--+
> {code}
> After reading some of the code, I noticed that Spark cannot read files that 
> start with an underscore. (I could not find any documentation about this 
> file name limitation.)
> The above behavior is not good, especially in the user-specified schema case, 
> I think.
> I suggest throwing an exception whose message is "Path does not exist" in 
> both cases.






[jira] [Updated] (SPARK-26339) Behavior of reading files that start with underscore is confusing

2018-12-31 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26339:
-
Fix Version/s: (was: 3.0.0)

> Behavior of reading files that start with underscore is confusing
> -
>
> Key: SPARK-26339
> URL: https://issues.apache.org/jira/browse/SPARK-26339
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Keiichi Hirobe
>Assignee: Keiichi Hirobe
>Priority: Minor
>
> The behavior of reading files that start with an underscore is as follows.
>  # spark.read (no schema) throws an exception whose message is confusing.
>  # spark.read (user-specified schema) successfully reads, but the content is 
> empty.
> Example files are as follows.
>  The same behavior occurred when I read JSON files.
> {code:bash}
> $ cat test.csv
> test1,10
> test2,20
> $ cp test.csv _test.csv
> $ ./bin/spark-shell  --master local[2]
> {code}
> Spark shell snippet for reproduction:
> {code:java}
> scala> val df=spark.read.csv("test.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string]
> scala> df.show()
> +-+---+
> |  _c0|_c1|
> +-+---+
> |test1| 10|
> |test2| 20|
> +-+---+
> scala> val df = spark.read.schema("test STRING, number INT").csv("test.csv")
> df: org.apache.spark.sql.DataFrame = [test: string, number: int]
> scala> df.show()
> +-+--+
> | test|number|
> +-+--+
> |test1|10|
> |test2|20|
> +-+--+
> scala> val df=spark.read.csv("_test.csv")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It 
> must be specified manually.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$13(DataSource.scala:185)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:185)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:231)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:219)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:625)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:478)
>   ... 49 elided
> scala> val df=spark.read.schema("test STRING, number INT").csv("_test.csv")
> df: org.apache.spark.sql.DataFrame = [test: string, number: int]
> scala> df.show()
> ++--+
> |test|number|
> ++--+
> ++--+
> {code}
> After reading some of the code, I noticed that Spark cannot read files that 
> start with an underscore. (I could not find any documentation about this 
> file name limitation.)
> The above behavior is not good, especially in the user-specified schema case, 
> I think.
> I suggest throwing an exception whose message is "Path does not exist" in 
> both cases.






[jira] [Updated] (SPARK-26513) Trigger GC on executor node idle

2018-12-31 Thread Sandish Kumar HN (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandish Kumar HN updated SPARK-26513:
-
Description: 
After going through the paper 
[https://www.computer.org/csdl/proceedings/hipc/2016/5411/00/07839705.pdf], we 
found reported performance improvements (Spark speedup).
Correct me if I'm wrong.
 *Stage:*
      On a large cluster, each stage has some executors, where a few 
executors finish a couple of tasks first and then wait for the remaining tasks 
of the stage, executed by different executor nodes in the cluster, to finish. 
A stage is only completed when all tasks in the current stage finish 
execution, and the next stage has to wait until all tasks of the current 
stage are completed.

Why don't we trigger GC when the executor node is waiting for remaining tasks 
to finish, i.e. when the executor is idle? The executor has to wait for the 
remaining tasks anyway, which can take at least a couple of seconds, while a 
GC would take at most around 300 ms.

I have proposed a small code snippet which triggers GC when the set of running 
tasks is empty and heap usage on the current executor node is more than a 
given threshold.
This could improve performance for long-running Spark jobs.

  was:
Correct me if I'm wrong.
*Stage:*
      On a large cluster, each stage would have some executors. were a few 
executors would finish a couple of tasks first and wait for whole stage or 
remaining tasks to finish which are executed by different executors nodes in a 
cluster. a stage will only be completed when all tasks in a current stage 
finish its execution. and the next stage execution has to wait till all tasks 
of the current stage are completed. 
 
why don't we trigger GC, when the executor node is waiting for remaining tasks 
to finish, or executor Idle? anyways executor has to wait for the remaining 
tasks to finish which can at least take a couple of seconds. why don't we 
trigger GC? which will max take <300ms
 
I have proposed a small code snippet which triggers GC when running tasks are 
empty and heap usage in current executor node is more than the given threshold.
This could improve performance for long-running spark job's. 


> Trigger GC on executor node idle
> 
>
> Key: SPARK-26513
> URL: https://issues.apache.org/jira/browse/SPARK-26513
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sandish Kumar HN
>Priority: Major
> Fix For: 3.0.0
>
>
> After going through this paper 
> [https://www.computer.org/csdl/proceedings/hipc/2016/5411/00/07839705.pdf] we 
> found performance improvments spark speedup with 
> Correct me if I'm wrong.
>  *Stage:*
>       On a large cluster, each stage would have some executors. were a few 
> executors would finish a couple of tasks first and wait for whole stage or 
> remaining tasks to finish which are executed by different executors nodes in 
> a cluster. a stage will only be completed when all tasks in a current stage 
> finish its execution. and the next stage execution has to wait till all tasks 
> of the current stage are completed. 
>  
> why don't we trigger GC, when the executor node is waiting for remaining 
> tasks to finish, or executor Idle? anyways executor has to wait for the 
> remaining tasks to finish which can at least take a couple of seconds. why 
> don't we trigger GC? which will max take <300ms
>  
> I have proposed a small code snippet which triggers GC when running tasks are 
> empty and heap usage in current executor node is more than the given 
> threshold.
> This could improve performance for long-running spark job's. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26513) Trigger GC on executor node idle

2018-12-31 Thread Sandish Kumar HN (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandish Kumar HN updated SPARK-26513:
-
Description: 
 

Correct me if I'm wrong.
 *Stage:*
      On a large cluster, each stage runs on several executors. A few executors 
finish their tasks first and then wait for the remaining tasks of the stage, 
which run on other executor nodes in the cluster. A stage is only complete once 
all of its tasks have finished, and the next stage cannot start until then.
 
Why don't we trigger a GC while an executor node is idle, waiting for the 
remaining tasks to finish? The executor has to wait anyway, which can take at 
least a couple of seconds, whereas a GC cycle takes at most about 300 ms.
 
I have proposed a small code snippet that triggers a GC when no tasks are 
running and the heap usage on the current executor node exceeds a given 
threshold. This could improve performance for long-running Spark jobs.

We referred to this paper 
[https://www.computer.org/csdl/proceedings/hipc/2016/5411/00/07839705.pdf] and 
found performance improvements in our long-running Spark batch jobs.

  was:
After going through this paper 
[https://www.computer.org/csdl/proceedings/hipc/2016/5411/00/07839705.pdf] we 
found performance improvements in our long running spark batch job's 

Correct me if I'm wrong.
 *Stage:*
      On a large cluster, each stage would have some executors. were a few 
executors would finish a couple of tasks first and wait for whole stage or 
remaining tasks to finish which are executed by different executors nodes in a 
cluster. a stage will only be completed when all tasks in a current stage 
finish its execution. and the next stage execution has to wait till all tasks 
of the current stage are completed. 
 
why don't we trigger GC, when the executor node is waiting for remaining tasks 
to finish, or executor Idle? anyways executor has to wait for the remaining 
tasks to finish which can at least take a couple of seconds. why don't we 
trigger GC? which will max take <300ms
 
I have proposed a small code snippet which triggers GC when running tasks are 
empty and heap usage in current executor node is more than the given threshold.
This could improve performance for long-running spark job's. 


> Trigger GC on executor node idle
> 
>
> Key: SPARK-26513
> URL: https://issues.apache.org/jira/browse/SPARK-26513
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sandish Kumar HN
>Priority: Major
> Fix For: 3.0.0
>
>
>  
> Correct me if I'm wrong.
>  *Stage:*
>       On a large cluster, each stage would have some executors. were a few 
> executors would finish a couple of tasks first and wait for whole stage or 
> remaining tasks to finish which are executed by different executors nodes in 
> a cluster. a stage will only be completed when all tasks in a current stage 
> finish its execution. and the next stage execution has to wait till all tasks 
> of the current stage are completed. 
>  
> why don't we trigger GC, when the executor node is waiting for remaining 
> tasks to finish, or executor Idle? anyways executor has to wait for the 
> remaining tasks to finish which can at least take a couple of seconds. why 
> don't we trigger GC? which will max take <300ms
>  
> I have proposed a small code snippet which triggers GC when running tasks are 
> empty and heap usage in current executor node is more than the given 
> threshold.
> This could improve performance for long-running spark job's. 
> we refered this paper 
> [https://www.computer.org/csdl/proceedings/hipc/2016/5411/00/07839705.pdf] we 
> found performance improvements in our long running spark batch job's .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26513) Trigger GC on executor node idle

2018-12-31 Thread Sandish Kumar HN (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandish Kumar HN updated SPARK-26513:
-
Description: 
 

Correct me if I'm wrong.
 *Stage:*
      On a large cluster, each stage runs on several executors. A few executors 
finish their tasks first and then wait for the remaining tasks of the stage, 
which run on other executor nodes in the cluster. A stage is only complete once 
all of its tasks have finished, and the next stage cannot start until then.
 
Why don't we trigger a GC while an executor node is idle, waiting for the 
remaining tasks to finish? The executor has to wait anyway, which can take at 
least a couple of seconds, whereas a GC cycle takes at most about 300 ms.
 
I have proposed a small code snippet that triggers a GC when no tasks are 
running and the heap usage on the current executor node exceeds a given 
threshold. This could improve performance for long-running Spark jobs.

We referred to this paper 
[https://www.computer.org/csdl/proceedings/hipc/2016/5411/00/07839705.pdf] and 
found performance improvements in our long-running Spark batch jobs.

  was:
 

Correct me if I'm wrong.
 *Stage:*
      On a large cluster, each stage would have some executors. were a few 
executors would finish a couple of tasks first and wait for whole stage or 
remaining tasks to finish which are executed by different executors nodes in a 
cluster. a stage will only be completed when all tasks in a current stage 
finish its execution. and the next stage execution has to wait till all tasks 
of the current stage are completed. 
 
why don't we trigger GC, when the executor node is waiting for remaining tasks 
to finish, or executor Idle? anyways executor has to wait for the remaining 
tasks to finish which can at least take a couple of seconds. why don't we 
trigger GC? which will max take <300ms
 
I have proposed a small code snippet which triggers GC when running tasks are 
empty and heap usage in current executor node is more than the given threshold.
This could improve performance for long-running spark job's. 

we refered this paper 
[https://www.computer.org/csdl/proceedings/hipc/2016/5411/00/07839705.pdf] we 
found performance improvements in our long running spark batch job's .


> Trigger GC on executor node idle
> 
>
> Key: SPARK-26513
> URL: https://issues.apache.org/jira/browse/SPARK-26513
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sandish Kumar HN
>Priority: Major
> Fix For: 3.0.0
>
>
>  
> Correct me if I'm wrong.
>  *Stage:*
>       On a large cluster, each stage would have some executors. were a few 
> executors would finish a couple of tasks first and wait for whole stage or 
> remaining tasks to finish which are executed by different executors nodes in 
> a cluster. a stage will only be completed when all tasks in a current stage 
> finish its execution. and the next stage execution has to wait till all tasks 
> of the current stage are completed. 
>  
> why don't we trigger GC, when the executor node is waiting for remaining 
> tasks to finish, or executor Idle? anyways executor has to wait for the 
> remaining tasks to finish which can at least take a couple of seconds. why 
> don't we trigger GC? which will max take <300ms
>  
> I have proposed a small code snippet which triggers GC when running tasks are 
> empty and heap usage in current executor node is more than the given 
> threshold.
> This could improve performance for long-running spark job's. 
> we referred this paper 
> [https://www.computer.org/csdl/proceedings/hipc/2016/5411/00/07839705.pdf] 
> and we found performance improvements in our long-running spark batch job's.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26513) Trigger GC on executor node idle

2018-12-31 Thread Sandish Kumar HN (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandish Kumar HN updated SPARK-26513:
-
Description: 
After going through this paper 
[https://www.computer.org/csdl/proceedings/hipc/2016/5411/00/07839705.pdf], we 
found performance improvements in our long-running Spark batch jobs.

Correct me if I'm wrong.
 *Stage:*
      On a large cluster, each stage runs on several executors. A few executors 
finish their tasks first and then wait for the remaining tasks of the stage, 
which run on other executor nodes in the cluster. A stage is only complete once 
all of its tasks have finished, and the next stage cannot start until then.
 
Why don't we trigger a GC while an executor node is idle, waiting for the 
remaining tasks to finish? The executor has to wait anyway, which can take at 
least a couple of seconds, whereas a GC cycle takes at most about 300 ms.
 
I have proposed a small code snippet that triggers a GC when no tasks are 
running and the heap usage on the current executor node exceeds a given 
threshold. This could improve performance for long-running Spark jobs.

  was:
After going through this paper 
[https://www.computer.org/csdl/proceedings/hipc/2016/5411/00/07839705.pdf] we 
found performance improvments spark speedup with 
Correct me if I'm wrong.
 *Stage:*
      On a large cluster, each stage would have some executors. were a few 
executors would finish a couple of tasks first and wait for whole stage or 
remaining tasks to finish which are executed by different executors nodes in a 
cluster. a stage will only be completed when all tasks in a current stage 
finish its execution. and the next stage execution has to wait till all tasks 
of the current stage are completed. 
 
why don't we trigger GC, when the executor node is waiting for remaining tasks 
to finish, or executor Idle? anyways executor has to wait for the remaining 
tasks to finish which can at least take a couple of seconds. why don't we 
trigger GC? which will max take <300ms
 
I have proposed a small code snippet which triggers GC when running tasks are 
empty and heap usage in current executor node is more than the given threshold.
This could improve performance for long-running spark job's. 


> Trigger GC on executor node idle
> 
>
> Key: SPARK-26513
> URL: https://issues.apache.org/jira/browse/SPARK-26513
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sandish Kumar HN
>Priority: Major
> Fix For: 3.0.0
>
>
> After going through this paper 
> [https://www.computer.org/csdl/proceedings/hipc/2016/5411/00/07839705.pdf] we 
> found performance improvements in our long running spark batch job's 
> Correct me if I'm wrong.
>  *Stage:*
>       On a large cluster, each stage would have some executors. were a few 
> executors would finish a couple of tasks first and wait for whole stage or 
> remaining tasks to finish which are executed by different executors nodes in 
> a cluster. a stage will only be completed when all tasks in a current stage 
> finish its execution. and the next stage execution has to wait till all tasks 
> of the current stage are completed. 
>  
> why don't we trigger GC, when the executor node is waiting for remaining 
> tasks to finish, or executor Idle? anyways executor has to wait for the 
> remaining tasks to finish which can at least take a couple of seconds. why 
> don't we trigger GC? which will max take <300ms
>  
> I have proposed a small code snippet which triggers GC when running tasks are 
> empty and heap usage in current executor node is more than the given 
> threshold.
> This could improve performance for long-running spark job's. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26470) Use ConfigEntry for hardcoded configs for eventLog category.

2018-12-31 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26470.
---
   Resolution: Fixed
 Assignee: Marco Gaido
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/23395

> Use ConfigEntry for hardcoded configs for eventLog category.
> 
>
> Key: SPARK-26470
> URL: https://issues.apache.org/jira/browse/SPARK-26470
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26513) Trigger GC on executor node idle

2018-12-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731435#comment-16731435
 ] 

Apache Spark commented on SPARK-26513:
--

User 'SandishKumarHN' has created a pull request for this issue:
https://github.com/apache/spark/pull/23401

> Trigger GC on executor node idle
> 
>
> Key: SPARK-26513
> URL: https://issues.apache.org/jira/browse/SPARK-26513
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sandish Kumar HN
>Priority: Major
> Fix For: 3.0.0
>
>
> Correct me if I'm wrong.
> *Stage:*
>       On a large cluster, each stage would have some executors. were a few 
> executors would finish a couple of tasks first and wait for whole stage or 
> remaining tasks to finish which are executed by different executors nodes in 
> a cluster. a stage will only be completed when all tasks in a current stage 
> finish its execution. and the next stage execution has to wait till all tasks 
> of the current stage are completed. 
>  
> why don't we trigger GC, when the executor node is waiting for remaining 
> tasks to finish, or executor Idle? anyways executor has to wait for the 
> remaining tasks to finish which can at least take a couple of seconds. why 
> don't we trigger GC? which will max take <300ms
>  
> I have proposed a small code snippet which triggers GC when running tasks are 
> empty and heap usage in current executor node is more than the given 
> threshold.
> This could improve performance for long-running spark job's. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26513) Trigger GC on executor node idle

2018-12-31 Thread Sandish Kumar HN (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731436#comment-16731436
 ] 

Sandish Kumar HN commented on SPARK-26513:
--

I have the pull request here [PR 
LINK|https://github.com/apache/spark/pull/23401]

> Trigger GC on executor node idle
> 
>
> Key: SPARK-26513
> URL: https://issues.apache.org/jira/browse/SPARK-26513
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sandish Kumar HN
>Priority: Major
> Fix For: 3.0.0
>
>
> Correct me if I'm wrong.
> *Stage:*
>       On a large cluster, each stage would have some executors. were a few 
> executors would finish a couple of tasks first and wait for whole stage or 
> remaining tasks to finish which are executed by different executors nodes in 
> a cluster. a stage will only be completed when all tasks in a current stage 
> finish its execution. and the next stage execution has to wait till all tasks 
> of the current stage are completed. 
>  
> why don't we trigger GC, when the executor node is waiting for remaining 
> tasks to finish, or executor Idle? anyways executor has to wait for the 
> remaining tasks to finish which can at least take a couple of seconds. why 
> don't we trigger GC? which will max take <300ms
>  
> I have proposed a small code snippet which triggers GC when running tasks are 
> empty and heap usage in current executor node is more than the given 
> threshold.
> This could improve performance for long-running spark job's. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-26513) Trigger GC on executor node idle

2018-12-31 Thread Sandish Kumar HN (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandish Kumar HN updated SPARK-26513:
-
Comment: was deleted

(was: I have the pull request here [PR 
LINK|https://github.com/apache/spark/pull/23401])

> Trigger GC on executor node idle
> 
>
> Key: SPARK-26513
> URL: https://issues.apache.org/jira/browse/SPARK-26513
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sandish Kumar HN
>Priority: Major
> Fix For: 3.0.0
>
>
> Correct me if I'm wrong.
> *Stage:*
>       On a large cluster, each stage would have some executors. were a few 
> executors would finish a couple of tasks first and wait for whole stage or 
> remaining tasks to finish which are executed by different executors nodes in 
> a cluster. a stage will only be completed when all tasks in a current stage 
> finish its execution. and the next stage execution has to wait till all tasks 
> of the current stage are completed. 
>  
> why don't we trigger GC, when the executor node is waiting for remaining 
> tasks to finish, or executor Idle? anyways executor has to wait for the 
> remaining tasks to finish which can at least take a couple of seconds. why 
> don't we trigger GC? which will max take <300ms
>  
> I have proposed a small code snippet which triggers GC when running tasks are 
> empty and heap usage in current executor node is more than the given 
> threshold.
> This could improve performance for long-running spark job's. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26513) Trigger GC on executor node idle

2018-12-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26513:


Assignee: Apache Spark

> Trigger GC on executor node idle
> 
>
> Key: SPARK-26513
> URL: https://issues.apache.org/jira/browse/SPARK-26513
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sandish Kumar HN
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.0.0
>
>
> Correct me if I'm wrong.
> *Stage:*
>       On a large cluster, each stage would have some executors. were a few 
> executors would finish a couple of tasks first and wait for whole stage or 
> remaining tasks to finish which are executed by different executors nodes in 
> a cluster. a stage will only be completed when all tasks in a current stage 
> finish its execution. and the next stage execution has to wait till all tasks 
> of the current stage are completed. 
>  
> why don't we trigger GC, when the executor node is waiting for remaining 
> tasks to finish, or executor Idle? anyways executor has to wait for the 
> remaining tasks to finish which can at least take a couple of seconds. why 
> don't we trigger GC? which will max take <300ms
>  
> I have proposed a small code snippet which triggers GC when running tasks are 
> empty and heap usage in current executor node is more than the given 
> threshold.
> This could improve performance for long-running spark job's. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26513) Trigger GC on executor node idle

2018-12-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26513:


Assignee: (was: Apache Spark)

> Trigger GC on executor node idle
> 
>
> Key: SPARK-26513
> URL: https://issues.apache.org/jira/browse/SPARK-26513
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sandish Kumar HN
>Priority: Major
> Fix For: 3.0.0
>
>
> Correct me if I'm wrong.
> *Stage:*
>       On a large cluster, each stage would have some executors. were a few 
> executors would finish a couple of tasks first and wait for whole stage or 
> remaining tasks to finish which are executed by different executors nodes in 
> a cluster. a stage will only be completed when all tasks in a current stage 
> finish its execution. and the next stage execution has to wait till all tasks 
> of the current stage are completed. 
>  
> why don't we trigger GC, when the executor node is waiting for remaining 
> tasks to finish, or executor Idle? anyways executor has to wait for the 
> remaining tasks to finish which can at least take a couple of seconds. why 
> don't we trigger GC? which will max take <300ms
>  
> I have proposed a small code snippet which triggers GC when running tasks are 
> empty and heap usage in current executor node is more than the given 
> threshold.
> This could improve performance for long-running spark job's. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26513) Trigger GC on executor node idle

2018-12-31 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731434#comment-16731434
 ] 

Apache Spark commented on SPARK-26513:
--

User 'SandishKumarHN' has created a pull request for this issue:
https://github.com/apache/spark/pull/23401

> Trigger GC on executor node idle
> 
>
> Key: SPARK-26513
> URL: https://issues.apache.org/jira/browse/SPARK-26513
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sandish Kumar HN
>Priority: Major
> Fix For: 3.0.0
>
>
> Correct me if I'm wrong.
> *Stage:*
>       On a large cluster, each stage would have some executors. were a few 
> executors would finish a couple of tasks first and wait for whole stage or 
> remaining tasks to finish which are executed by different executors nodes in 
> a cluster. a stage will only be completed when all tasks in a current stage 
> finish its execution. and the next stage execution has to wait till all tasks 
> of the current stage are completed. 
>  
> why don't we trigger GC, when the executor node is waiting for remaining 
> tasks to finish, or executor Idle? anyways executor has to wait for the 
> remaining tasks to finish which can at least take a couple of seconds. why 
> don't we trigger GC? which will max take <300ms
>  
> I have proposed a small code snippet which triggers GC when running tasks are 
> empty and heap usage in current executor node is more than the given 
> threshold.
> This could improve performance for long-running spark job's. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26513) Trigger GC on executor node idle

2018-12-31 Thread Sandish Kumar HN (JIRA)
Sandish Kumar HN created SPARK-26513:


 Summary: Trigger GC on executor node idle
 Key: SPARK-26513
 URL: https://issues.apache.org/jira/browse/SPARK-26513
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Sandish Kumar HN
 Fix For: 3.0.0


Correct me if I'm wrong.
*Stage:*
      On a large cluster, each stage runs on several executors. A few executors 
finish their tasks first and then wait for the remaining tasks of the stage, 
which run on other executor nodes in the cluster. A stage is only complete once 
all of its tasks have finished, and the next stage cannot start until then.
 
Why don't we trigger a GC while an executor node is idle, waiting for the 
remaining tasks to finish? The executor has to wait anyway, which can take at 
least a couple of seconds, whereas a GC cycle takes at most about 300 ms.
 
I have proposed a small code snippet that triggers a GC when no tasks are 
running and the heap usage on the current executor node exceeds a given 
threshold. This could improve performance for long-running Spark jobs.
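
As an illustration of the proposal, a minimal sketch of such a check might look 
like the following (hypothetical helper and parameter names; this is not the 
code from the linked pull request):
{code:java}
import java.lang.management.ManagementFactory

// Hypothetical idle-time GC trigger: request a collection only when no tasks
// are running and the heap is already mostly full, so the pause cannot delay
// any task on this executor.
def maybeTriggerIdleGc(runningTasks: Int, heapUsageThreshold: Double = 0.6): Unit = {
  val heap = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage
  if (heap.getMax > 0) {
    val usedFraction = heap.getUsed.toDouble / heap.getMax
    if (runningTasks == 0 && usedFraction > heapUsageThreshold) {
      System.gc()  // ask the JVM for a full collection while the executor is idle
    }
  }
}
{code}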



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26512) Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?

2018-12-31 Thread Anubhav Jain (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Jain updated SPARK-26512:
-
Description: I have installed Hadoop 2.8.3 in my Windows 10 environment and it 
works fine. However, when I install Apache Spark 2.4.0 with YARN as the cluster 
manager, it does not work: when I submit a Spark job with spark-submit for 
testing, it appears under the ACCEPTED tab in the YARN UI and then fails.  (was: 
I have installed Hadoop 2.8.3 in my Windows 10 environment and it works fine. 
However, when I install Apache Spark 2.4.0 with YARN as the cluster manager, it 
does not work: when I submit a Spark job with spark-submit for testing, it 
appears under the ACCEPTED tab in the YARN UI and then fails; see 
https://i.stack.imgur.com/OQtbV.png.)

> Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?
> ---
>
> Key: SPARK-26512
> URL: https://issues.apache.org/jira/browse/SPARK-26512
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, YARN
>Affects Versions: 2.4.0
> Environment: operating system : Windows 10
> Spark Version : 2.4.0
> Hadoop Version : 2.8.3
>Reporter: Anubhav Jain
>Priority: Blocker
>  Labels: windows
> Attachments: log.png
>
>
> I have installed Hadoop version 2.8.3 in my windows 10 environment and its 
> working fine. Now when i try to install Apache Spark(version 2.4.0) with yarn 
> as cluster manager and its not working. When i try to submit a spark job 
> using spark-submit for testing , so its coming under ACCEPTED tab in YARN UI 
> after that it fail



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26512) Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?

2018-12-31 Thread Anubhav Jain (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Jain updated SPARK-26512:
-
Attachment: log.png

> Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?
> ---
>
> Key: SPARK-26512
> URL: https://issues.apache.org/jira/browse/SPARK-26512
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, YARN
>Affects Versions: 2.4.0
> Environment: operating system : Windows 10
> Spark Version : 2.4.0
> Hadoop Version : 2.8.3
>Reporter: Anubhav Jain
>Priority: Blocker
>  Labels: windows
> Attachments: log.png
>
>
> I have installed Hadoop version 2.8.3 in my windows 10 environment and its 
> working fine. Now when i try to install Apache Spark(version 2.4.0) with yarn 
> as cluster manager and its not working. When i try to submit a spark job 
> using spark-submit for testing , so its coming under ACCEPTED tab in YARN UI 
> after that it fails; see https://i.stack.imgur.com/OQtbV.png.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26512) Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 10?

2018-12-31 Thread Anubhav Jain (JIRA)
Anubhav Jain created SPARK-26512:


 Summary: Spark 2.4.0 is not working with Hadoop 2.8.3 in windows 
10?
 Key: SPARK-26512
 URL: https://issues.apache.org/jira/browse/SPARK-26512
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Spark Shell, YARN
Affects Versions: 2.4.0
 Environment: operating system : Windows 10

Spark Version : 2.4.0

Hadoop Version : 2.8.3
Reporter: Anubhav Jain


I have installed Hadoop 2.8.3 in my Windows 10 environment and it works fine. 
However, when I install Apache Spark 2.4.0 with YARN as the cluster manager, it 
does not work: when I submit a Spark job with spark-submit for testing, it 
appears under the ACCEPTED tab in the YARN UI and then fails; see 
https://i.stack.imgur.com/OQtbV.png.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26495) Simplify SelectedField extractor

2018-12-31 Thread Herman van Hovell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-26495.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

> Simplify SelectedField extractor
> 
>
> Key: SPARK-26495
> URL: https://issues.apache.org/jira/browse/SPARK-26495
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Major
> Fix For: 3.0.0
>
>
> I was reading through the code of the {{SelectedField}} extractor and this is 
> overly complex. It contains a couple of pattern matches that are redundant.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26339) Behavior of reading files that start with underscore is confusing

2018-12-31 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26339.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23288
[https://github.com/apache/spark/pull/23288]

> Behavior of reading files that start with underscore is confusing
> -
>
> Key: SPARK-26339
> URL: https://issues.apache.org/jira/browse/SPARK-26339
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Keiichi Hirobe
>Assignee: Keiichi Hirobe
>Priority: Minor
> Fix For: 3.0.0
>
>
> The behavior when reading files that start with an underscore is as follows.
>  # spark.read (no schema) throws an exception whose message is confusing.
>  # spark.read (user-specified schema) successfully reads, but the content is 
> empty.
> Example files are as follows.
>  The same behavior occurred when I read JSON files.
> {code:bash}
> $ cat test.csv
> test1,10
> test2,20
> $ cp test.csv _test.csv
> $ ./bin/spark-shell  --master local[2]
> {code}
> Spark shell snippet for reproduction:
> {code:java}
> scala> val df=spark.read.csv("test.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string]
> scala> df.show()
> +-+---+
> |  _c0|_c1|
> +-+---+
> |test1| 10|
> |test2| 20|
> +-+---+
> scala> val df = spark.read.schema("test STRING, number INT").csv("test.csv")
> df: org.apache.spark.sql.DataFrame = [test: string, number: int]
> scala> df.show()
> +-+--+
> | test|number|
> +-+--+
> |test1|10|
> |test2|20|
> +-+--+
> scala> val df=spark.read.csv("_test.csv")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It 
> must be specified manually.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$13(DataSource.scala:185)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:185)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:231)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:219)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:625)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:478)
>   ... 49 elided
> scala> val df=spark.read.schema("test STRING, number INT").csv("_test.csv")
> df: org.apache.spark.sql.DataFrame = [test: string, number: int]
> scala> df.show()
> ++--+
> |test|number|
> ++--+
> ++--+
> {code}
> After reading some of the code, I noticed that Spark cannot read files whose 
> names start with an underscore (I could not find any documentation about this 
> file-name limitation).
> The behavior above is confusing, especially in the user-specified-schema case.
> I suggest throwing an exception with the message "Path does not exist" in both 
> cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26339) Behavior of reading files that start with underscore is confusing

2018-12-31 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26339:
-

Assignee: Keiichi Hirobe

> Behavior of reading files that start with underscore is confusing
> -
>
> Key: SPARK-26339
> URL: https://issues.apache.org/jira/browse/SPARK-26339
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Keiichi Hirobe
>Assignee: Keiichi Hirobe
>Priority: Minor
> Fix For: 3.0.0
>
>
> The behavior when reading files that start with an underscore is as follows.
>  # spark.read (no schema) throws an exception whose message is confusing.
>  # spark.read (user-specified schema) successfully reads, but the content is 
> empty.
> Example files are as follows.
>  The same behavior occurred when I read JSON files.
> {code:bash}
> $ cat test.csv
> test1,10
> test2,20
> $ cp test.csv _test.csv
> $ ./bin/spark-shell  --master local[2]
> {code}
> Spark shell snippet for reproduction:
> {code:java}
> scala> val df=spark.read.csv("test.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string]
> scala> df.show()
> +-+---+
> |  _c0|_c1|
> +-+---+
> |test1| 10|
> |test2| 20|
> +-+---+
> scala> val df = spark.read.schema("test STRING, number INT").csv("test.csv")
> df: org.apache.spark.sql.DataFrame = [test: string, number: int]
> scala> df.show()
> +-+--+
> | test|number|
> +-+--+
> |test1|10|
> |test2|20|
> +-+--+
> scala> val df=spark.read.csv("_test.csv")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It 
> must be specified manually.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$13(DataSource.scala:185)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:185)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:231)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:219)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:625)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:478)
>   ... 49 elided
> scala> val df=spark.read.schema("test STRING, number INT").csv("_test.csv")
> df: org.apache.spark.sql.DataFrame = [test: string, number: int]
> scala> df.show()
> ++--+
> |test|number|
> ++--+
> ++--+
> {code}
> After reading some of the code, I noticed that Spark cannot read files whose 
> names start with an underscore (I could not find any documentation about this 
> file-name limitation).
> The behavior above is confusing, especially in the user-specified-schema case.
> I suggest throwing an exception with the message "Path does not exist" in both 
> cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26511) java.lang.ClassCastException error when loading Spark MLlib model from parquet file

2018-12-31 Thread Amy Koh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amy Koh updated SPARK-26511:

Attachment: repro.zip

> java.lang.ClassCastException error when loading Spark MLlib model from 
> parquet file
> ---
>
> Key: SPARK-26511
> URL: https://issues.apache.org/jira/browse/SPARK-26511
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: Amy Koh
>Priority: Major
> Attachments: repro.zip
>
>
> When I tried to load a decision tree model from a Parquet file, the following 
> error was thrown. 
> {code:bash}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.mllib.tree.model.DecisionTreeModel.load. : 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 
> (TID 2, localhost, executor driver): java.lang.ClassCastException: class 
> java.lang.Double cannot be cast to class java.lang.Integer (java.lang.Double 
> and java.lang.Integer are in module java.base of loader 'bootstrap') at 
> scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101) at 
> org.apache.spark.sql.Row$class.getInt(Row.scala:223) at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(rows.scala:165) 
> at 
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$SplitData$.apply(DecisionTreeModel.scala:171)
>  at 
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$NodeData$.apply(DecisionTreeModel.scala:195)
>  at 
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$9.apply(DecisionTreeModel.scala:247)
>  at 
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$9.apply(DecisionTreeModel.scala:247)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
>  at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  at java.base/java.lang.Thread.run(Thread.java:834) Driver stacktrace: at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) 
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>  at scala.Option.foreach(Option.scala:257) at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2087) at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at 
> org.apache.spark.rdd.RDD.collect(RDD.scala:935) at 
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructTrees(DecisionTreeModel.scala:262)
>  at 
> 

[jira] [Updated] (SPARK-26501) Unexpected overriden of exitFn in SparkSubmitSuite

2018-12-31 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26501:
--
Priority: Minor  (was: Major)

> Unexpected overriden of exitFn in SparkSubmitSuite
> --
>
> Key: SPARK-26501
> URL: https://issues.apache.org/jira/browse/SPARK-26501
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: liupengcheng
>Priority: Minor
>
> When I ran SparkSubmitSuite of Spark 2.3.2 in the IntelliJ IDE, I found that 
> some tests do not pass when run one by one, although they pass when the whole 
> SparkSubmitSuite is run.
> Tests that fail when run separately:
>  
> {code:java}
> test("SPARK_CONF_DIR overrides spark-defaults.conf") {
>   forConfDir(Map("spark.executor.memory" -> "2.3g")) { path =>
> val unusedJar = TestUtils.createJarWithClasses(Seq.empty)
> val args = Seq(
>   "--class", SimpleApplicationTest.getClass.getName.stripSuffix("$"),
>   "--name", "testApp",
>   "--master", "local",
>   unusedJar.toString)
> val appArgs = new SparkSubmitArguments(args, Map("SPARK_CONF_DIR" -> 
> path))
> assert(appArgs.defaultPropertiesFile != null)
> assert(appArgs.defaultPropertiesFile.startsWith(path))
> assert(appArgs.propertiesFile == null)
> appArgs.executorMemory should be ("2.3g")
>   }
> }
> {code}
> Failure reason:
> {code:java}
> Error: Executor Memory cores must be a positive number
> Run with --help for usage help or --verbose for debug output
> {code}
>  
> After carefully checking the code, I found that the exitFn of SparkSubmit is 
> overridden by earlier tests via the testPrematureExit call.
> Although the test above was fixed by SPARK-22941, the overriding of exitFn 
> might cause other problems in the future.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26511) java.lang.ClassCastException error when loading Spark MLlib model from parquet file

2018-12-31 Thread Amy Koh (JIRA)
Amy Koh created SPARK-26511:
---

 Summary: java.lang.ClassCastException error when loading Spark 
MLlib model from parquet file
 Key: SPARK-26511
 URL: https://issues.apache.org/jira/browse/SPARK-26511
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.4.0
Reporter: Amy Koh


When I tried to load a decision tree model from a Parquet file, the following 
error was thrown. 

{code:bash}

Py4JJavaError: An error occurred while calling 
z:org.apache.spark.mllib.tree.model.DecisionTreeModel.load. : 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 
2, localhost, executor driver): java.lang.ClassCastException: class 
java.lang.Double cannot be cast to class java.lang.Integer (java.lang.Double 
and java.lang.Integer are in module java.base of loader 'bootstrap') at 
scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101) at 
org.apache.spark.sql.Row$class.getInt(Row.scala:223) at 
org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(rows.scala:165) at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$SplitData$.apply(DecisionTreeModel.scala:171)
 at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$NodeData$.apply(DecisionTreeModel.scala:195)
 at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$9.apply(DecisionTreeModel.scala:247)
 at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$9.apply(DecisionTreeModel.scala:247)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) 
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) 
at org.apache.spark.scheduler.Task.run(Task.scala:108) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 at java.base/java.lang.Thread.run(Thread.java:834) Driver stacktrace: at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
 at scala.Option.foreach(Option.scala:257) at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2022) at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2043) at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2062) at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2087) at 
org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) 
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at 
org.apache.spark.rdd.RDD.collect(RDD.scala:935) at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructTrees(DecisionTreeModel.scala:262)
 at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.load(DecisionTreeModel.scala:249)
 at 
org.apache.spark.mllib.tree.model.DecisionTreeModel$.load(DecisionTreeModel.scala:326)
 at 
org.apache.spark.mllib.tree.model.DecisionTreeModel.load(DecisionTreeModel.scala)
 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method) at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 

[jira] [Resolved] (SPARK-15967) Spark UI should show realtime value of storage memory instead of showing one static value all the time

2018-12-31 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15967.
---
Resolution: Won't Fix

> Spark UI should show realtime value of storage memory instead of showing one 
> static value all the time
> --
>
> Key: SPARK-15967
> URL: https://issues.apache.org/jira/browse/SPARK-15967
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Umesh K
>Priority: Minor
>
> As of Spark 1.6.x we have unified memory management, so the split between 
> execution and storage memory changes over time: if execution grows, it takes 
> memory from storage, and vice versa. However, the Spark UI shows a single 
> static storage value in the Storage tab, just as it did in versions <= 1.5.2. 
> Ideally the storage memory value should be refreshed to show its real-time 
> value in the Spark UI, so we can actually see the borrowing between execution 
> and storage happening.
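
For reference, the borrowing described above is governed by the unified-memory 
settings introduced in 1.6. A quick illustration (the values shown are the 
defaults for the 1.6 line, as an assumption; they may differ in later releases):
{code:java}
import org.apache.spark.SparkConf

// Illustration only: the unified-memory knobs the description refers to.
val conf = new SparkConf()
  .set("spark.memory.fraction", "0.75")        // heap share for execution + storage (1.6.x default)
  .set("spark.memory.storageFraction", "0.5")  // portion of that share protected from eviction
{code}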



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26504) Rope-wise dumping of Spark plans

2018-12-31 Thread Herman van Hovell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-26504.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

> Rope-wise dumping of Spark plans 
> -
>
> Key: SPARK-26504
> URL: https://issues.apache.org/jira/browse/SPARK-26504
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently, Spark plans are converted to strings via StringBuilderWriter, where 
> memory for the strings is allocated sequentially as elements of the plan are 
> added to the StringBuilder.
> The proposed improvement is a StringRope with two methods:
> 1. append(s: String): Unit - adds the string to an internal list and increases 
> the total size
> 2. toString: String - concatenates the list of strings into a single string
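
A minimal sketch of a rope like the one described (illustrative only; the names 
and details may differ from the actual change):
{code:java}
import scala.collection.mutable.ArrayBuffer

// Sketch of the described StringRope: appends are cheap list inserts, and the
// final string is materialized once into a buffer of exactly the right size.
class StringRope {
  private val parts = ArrayBuffer.empty[String]
  private var totalLength = 0

  def append(s: String): Unit = {
    parts += s
    totalLength += s.length
  }

  override def toString: String = {
    val sb = new java.lang.StringBuilder(totalLength)
    parts.foreach(p => sb.append(p))
    sb.toString
  }
}
{code}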



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26504) Rope-wise dumping of Spark plans

2018-12-31 Thread Herman van Hovell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reassigned SPARK-26504:
-

Assignee: Maxim Gekk

> Rope-wise dumping of Spark plans 
> -
>
> Key: SPARK-26504
> URL: https://issues.apache.org/jira/browse/SPARK-26504
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Currently, Spark plans are converted to strings via StringBuilderWriter, where 
> memory for the strings is allocated sequentially as elements of the plan are 
> added to the StringBuilder.
> The proposed improvement is a StringRope with two methods:
> 1. append(s: String): Unit - adds the string to an internal list and increases 
> the total size
> 2. toString: String - concatenates the list of strings into a single string



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19217) Offer easy cast from vector to array

2018-12-31 Thread Samik R (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731349#comment-16731349
 ] 

Samik R commented on SPARK-19217:
-

Any update on this? It still seems useful: I am trying to get a couple of values 
from a VectorUDT type.

> Offer easy cast from vector to array
> 
>
> Key: SPARK-19217
> URL: https://issues.apache.org/jira/browse/SPARK-19217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage (edit: at least as ORC) without 
> converting the vector columns to array columns, and there doesn't appear to be 
> an easy way to make that conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
> .select(
> col('features').cast('array').alias('features')
> ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?
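
For context, the UDF workaround (option 2 above) looks roughly like this in Scala; a minimal sketch, where the data, the {{features}} column name, and the local SparkSession are illustrative assumptions rather than part of the original report:

{code:java}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().master("local[*]").appName("vector-to-array").getOrCreate()
import spark.implicits._

// Hypothetical data standing in for the output of an ML stage.
val df = Seq((1, Vectors.dense(1.0, 2.0)), (2, Vectors.dense(3.0, 4.0)))
  .toDF("id", "features")

// cast() does not support vector -> array, so a UDF does the conversion.
val vecToArray = udf((v: Vector) => v.toArray)

val withArrays = df.withColumn("features", vecToArray(col("features")))
withArrays.printSchema() // features is now array<double>, so it can be written to ORC, etc.
{code}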



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26454) IllegalArgument Exception is Thrown while creating new UDF with JAR

2018-12-31 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731320#comment-16731320
 ] 

Hyukjin Kwon commented on SPARK-26454:
--

The error message says the jar is already registered. Are you somehow adding the 
jar multiple times somewhere? 


> IllegalArgument Exception is Thrown while creating new UDF with JAR
> -
>
> Key: SPARK-26454
> URL: https://issues.apache.org/jira/browse/SPARK-26454
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.2
>Reporter: Udbhav Agrawal
>Priority: Major
> Attachments: create_exception.txt
>
>
> 【Test steps】:
> 1. launch spark-shell
> 2. set role admin;
> 3. create a new function
>   CREATE FUNCTION Func AS 
> 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 
> 'hdfs:///tmp/super_udf/two_udfs.jar'
> 4. run a select on the function
> sql("select Func('2018-03-09')").show()
> 5. create a new UDF with the same JAR
>    sql("CREATE FUNCTION newFunc AS 
> 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 
> 'hdfs:///tmp/super_udf/two_udfs.jar'")
> 6. run a select on the newly created function
> sql("select newFunc('2018-03-09')").show()
> 【Output】:
> The function is created, but an IllegalArgumentException is thrown; the select 
> returns a result, but also with an IllegalArgumentException.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26510) Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 'createOrReplaceTempView'

2018-12-31 Thread Hagai Attias (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hagai Attias updated SPARK-26510:
-
Description: 
It seems that there's a change of behaviour between 1.6 and 2.3 when caching a 
Dataframe and saving it as a temp table. In 1.6, the following code executed 
{{printUDF}} once. The equivalent code in 2.3 (or even same as is) executes it 
twice.
 
{code:java|title=RegisterTest.scala|borderStyle=solid}
 
val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = sql.createDataFrame(rdd, schema)
df1.registerTempTable("data_table")

sql.udf.register("printUDF", (x:Int) => {
  print(x)
  x
})

val df2 = sql.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

df2.registerTempTable("cached_table")

val df3 = sql.sql("select result from cached_table")

df3.collect()
{code}
 
1.6 prints `123` while 2.3 prints `123123`, thus evaluating the dataframe 
twice. Managed to overcome by skipping the temporary table and selecting 
directly from the cached dataframe, but was wondering if that is an expected 
behavior / known issue.
 

  was:
It seems that there's a change of behaviour between 1.6 and 2.3 when caching a 
Dataframe and saving it as a temp table. In 1.6, the following code executed 
{{printUDF}} once. The equivalent code in 2.3 (or even same as is) executes it 
twice.
 
{code:java|title=RegisterTest.scala|borderStyle=solid}
 
val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = sql.createDataFrame(rdd, schema)
df1.registerTempTable("data_table")

sql.udf.register("printUDF", (x:Int) => {
  print(x)
  x
})

val df2 = sql.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

df2.registerTempTable("cached_table")

val df3 = sql.sql("select result from cached_table")

df3.collect()
{code}
 
1.6 prints 123 while 2.3 prints 123123, thus evaluating the dataframe twice. I 
managed to overcome by skipping the temporary table and selecting directly from 
the cached dataframe, but was wondering if that is an expected behavior / known 
issue.
 


> Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 
> 'createOrReplaceTempView'
> --
>
> Key: SPARK-26510
> URL: https://issues.apache.org/jira/browse/SPARK-26510
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Hagai Attias
>Priority: Major
>
> It seems that there's a change of behaviour between 1.6 and 2.3 when caching 
> a Dataframe and saving it as a temp table. In 1.6, the following code 
> executed {{printUDF}} once. The equivalent code in 2.3 (or even same as is) 
> executes it twice.
>  
> {code:java|title=RegisterTest.scala|borderStyle=solid}
>  
> val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
> val schema = StructType(StructField("num", IntegerType) :: Nil)
> val df1 = sql.createDataFrame(rdd, schema)
> df1.registerTempTable("data_table")
> sql.udf.register("printUDF", (x:Int) => {
>   print(x)
>   x
> })
> val df2 = sql.sql("select printUDF(num) result from data_table").cache()
> df2.collect() //execute cache
> df2.registerTempTable("cached_table")
> val df3 = sql.sql("select result from cached_table")
> df3.collect()
> {code}
>  
> 1.6 prints `123` while 2.3 prints `123123`, thus evaluating the dataframe 
> twice. Managed to overcome by skipping the temporary table and selecting 
> directly from the cached dataframe, but was wondering if that is an expected 
> behavior / known issue.
>  
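
For reference, the workaround described above (selecting from the cached DataFrame directly instead of registering it as a second temp view) looks roughly like this, reusing the names from the reporter's snippet; a minimal sketch of the mitigation, not a confirmed fix for the underlying re-evaluation:

{code:java}
// Cache and materialize the DataFrame once.
val df2 = session.sql("select printUDF(num) result from data_table").cache()
df2.collect() // executes printUDF and populates the cache

// Query the cached DataFrame directly instead of going through a temp view,
// so the cached plan is reused and printUDF is not evaluated a second time.
val df3 = df2.select("result")
df3.collect()
{code}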



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26510) Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 'createOrReplaceTempView'

2018-12-31 Thread Hagai Attias (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hagai Attias updated SPARK-26510:
-
Description: 
It seems that there's a change of behaviour between 1.6 and 2.3 when caching a 
Dataframe and saving it as a temp table. In 1.6, the following code executed 
{{printUDF}} once. The equivalent code in 2.3 (or even same as is) executes it 
twice.
 
{code:java|title=RegisterTest.scala|borderStyle=solid}
 
val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = sql.createDataFrame(rdd, schema)
df1.registerTempTable("data_table")

sql.udf.register("printUDF", new UDF().print _)

val df2 = sql.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

df2.registerTempTable("cached_table")

val df3 = sql.sql("select result from cached_table")

df3.collect()
{code}
 
1.6 prints 123 while 2.3 prints 123123, thus evaluating the dataframe twice. I 
managed to overcome by skipping the temporary table and selecting directly from 
the cached dataframe, but was wondering if that is an expected behavior / known 
issue.
 

  was:
It seems that there's a change of behaviour between 1.6 and 2.3 when caching a 
Dataframe and saving it as a temp table. In 1.6, the following code executed 
{{printUDF}} once. The equivalent code in 2.3 (or even same as is) executes it 
twice.
 
{code:java|title=RegisterTest.scala|borderStyle=solid}
 
val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = session.createDataFrame(rdd, schema)
df1.createOrReplaceTempView("data_table")

session.udf.register("printUDF",(x:Int) => {
 print(x)
 x
})

val df2 = session.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

val df3 = df2.select("result")

df3.collect(){code}
 
1.6 prints 123 while 2.3 prints 123123, thus evaluating the dataframe twice. I 
managed to overcome by skipping the temporary table and selecting directly from 
the cached dataframe, but was wondering if that is an expected behavior / known 
issue.
 


> Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 
> 'createOrReplaceTempView'
> --
>
> Key: SPARK-26510
> URL: https://issues.apache.org/jira/browse/SPARK-26510
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Hagai Attias
>Priority: Major
>
> It seems that there's a change of behaviour between 1.6 and 2.3 when caching 
> a Dataframe and saving it as a temp table. In 1.6, the following code 
> executed {{printUDF}} once. The equivalent code in 2.3 (or even same as is) 
> executes it twice.
>  
> {code:java|title=RegisterTest.scala|borderStyle=solid}
>  
> val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
> val schema = StructType(StructField("num", IntegerType) :: Nil)
> val df1 = sql.createDataFrame(rdd, schema)
> df1.registerTempTable("data_table")
> sql.udf.register("printUDF", new UDF().print _)
> val df2 = sql.sql("select printUDF(num) result from data_table").cache()
> df2.collect() //execute cache
> df2.registerTempTable("cached_table")
> val df3 = sql.sql("select result from cached_table")
> df3.collect()
> {code}
>  
> 1.6 prints 123 while 2.3 prints 123123, thus evaluating the dataframe twice. 
> I managed to overcome by skipping the temporary table and selecting directly 
> from the cached dataframe, but was wondering if that is an expected behavior 
> / known issue.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26510) Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 'createOrReplaceTempView'

2018-12-31 Thread Hagai Attias (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hagai Attias updated SPARK-26510:
-
Description: 
It seems that there's a change of behaviour between 1.6 and 2.3 when caching a 
Dataframe and saving it as a temp table. In 1.6, the following code executed 
{{printUDF}} once. The equivalent code in 2.3 (or even same as is) executes it 
twice.
 
{code:java|title=RegisterTest.scala|borderStyle=solid}
 
val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = session.createDataFrame(rdd, schema)
df1.createOrReplaceTempView("data_table")

session.udf.register("printUDF",(x:Int) => {
 print(x)
 x
})

val df2 = session.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

val df3 = df2.select("result")

df3.collect(){code}
 
1.6 prints 123 while 2.3 prints 123123, thus evaluating the dataframe twice. I 
managed to overcome by skipping the temporary table and selecting directly from 
the cached dataframe, but was wondering if that is an expected behavior / known 
issue.
 

  was:
It seems that there's a change of behaviour between 1.6 and 2.3 when caching a 
Dataframe and saving it as a temp table. In 1.6, the following code executed 
{{printUDF}} once. The equivalent code in 2.3 (or even same as is) executes it 
twice.
 
{code:java|title=RegisterTest.scala|borderStyle=solid}
 
val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = session.createDataFrame(rdd, schema)
df1.createOrReplaceTempView("data_table")

session.udf.register("printUDF", (x:Int) => {
 print(x)
 x
})

val df2 = session.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

val df3 = df2.select("result")

df3.collect(){code}
 
1.6 prints 123 while 2.3 prints 123123, thus evaluating the dataframe twice. I 
managed to overcome by skipping the temporary table and selecting directly from 
the cached dataframe, but was wondering if that is an expected behavior / known 
issue.
 


> Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 
> 'createOrReplaceTempView'
> --
>
> Key: SPARK-26510
> URL: https://issues.apache.org/jira/browse/SPARK-26510
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Hagai Attias
>Priority: Major
>
> It seems that there's a change of behaviour between 1.6 and 2.3 when caching 
> a Dataframe and saving it as a temp table. In 1.6, the following code 
> executed {{printUDF}} once. The equivalent code in 2.3 (or even same as is) 
> executes it twice.
>  
> {code:java|title=RegisterTest.scala|borderStyle=solid}
>  
> val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
> val schema = StructType(StructField("num", IntegerType) :: Nil)
> val df1 = session.createDataFrame(rdd, schema)
> df1.createOrReplaceTempView("data_table")
> session.udf.register("printUDF",(x:Int) => {
>  print(x)
>  x
> })
> val df2 = session.sql("select printUDF(num) result from data_table").cache()
> df2.collect() //execute cache
> val df3 = df2.select("result")
> df3.collect(){code}
>  
> 1.6 prints 123 while 2.3 prints 123123, thus evaluating the dataframe twice. 
> I managed to overcome by skipping the temporary table and selecting directly 
> from the cached dataframe, but was wondering if that is an expected behavior 
> / known issue.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26510) Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 'createOrReplaceTempView'

2018-12-31 Thread Hagai Attias (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hagai Attias updated SPARK-26510:
-
Description: 
It seems that there's a change of behaviour between 1.6 and 2.3 when caching a 
Dataframe and saving it as a temp table. In 1.6, the following code executed 
{{printUDF}} once. The equivalent code in 2.3 (or even same as is) executes it 
twice.
 
{code:java|title=RegisterTest_spark1.6.scala|borderStyle=solid}
 
val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = sql.createDataFrame(rdd, schema)
df1.registerTempTable("data_table")

sql.udf.register("printUDF", (x:Int) => {print(x)
  x
})

val df2 = sql.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

df2.registerTempTable("cached_table")

val df3 = sql.sql("select result from cached_table")

df3.collect()
{code}
{code:java|title=RegisterTest_spark2.3.scala|borderStyle=solid}
 
val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = session.createDataFrame(rdd, schema)
df1.createOrReplaceTempView("data_table")

session.udf.register("printUDF", (x:Int) => {print(x)
  x
})

val df2 = session.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

df2.createOrReplaceTempView("cached_table")

val df3 = session.sql("select result from cached_table")

df3.collect()
{code}
 

1.6 prints `123` while 2.3 prints `123123`, thus evaluating the dataframe twice.

Managed to mitigate by skipping the temporary table and selecting directly from 
the cached dataframe, but was wondering if that is an expected behavior / known 
issue.
 

  was:
It seems that there's a change of behaviour between 1.6 and 2.3 when caching a 
Dataframe and saving it as a temp table. In 1.6, the following code executed 
{{printUDF}} once. The equivalent code in 2.3 (or even same as is) executes it 
twice.
 
{code:java|title=RegisterTest_spark1.6.scala|borderStyle=solid}
 
val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = sql.createDataFrame(rdd, schema)
df1.registerTempTable("data_table")

sql.udf.register("printUDF", (x:Int) => {
  print(x)
  x
})

val df2 = sql.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

df2.registerTempTable("cached_table")

val df3 = sql.sql("select result from cached_table")

df3.collect()
{code}
{code:java|title=RegisterTest_spark2.3.scala|borderStyle=solid}
 
val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = session.createDataFrame(rdd, schema)
df1.createOrReplaceTempView("data_table")

session.udf.register("printUDF", new UDF().print _)

val df2 = session.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

df2.createOrReplaceTempView("cached_table")

val df3 = session.sql("select result from cached_table")

df3.collect()
{code}
 

1.6 prints `123` while 2.3 prints `123123`, thus evaluating the dataframe 
twice. Managed to overcome by skipping the temporary table and selecting 
directly from the cached dataframe, but was wondering if that is an expected 
behavior / known issue.
 


> Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 
> 'createOrReplaceTempView'
> --
>
> Key: SPARK-26510
> URL: https://issues.apache.org/jira/browse/SPARK-26510
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Hagai Attias
>Priority: Major
>
> It seems that there's a change of behaviour between 1.6 and 2.3 when caching 
> a Dataframe and saving it as a temp table. In 1.6, the following code 
> executed {{printUDF}} once. The equivalent code in 2.3 (or even same as is) 
> executes it twice.
>  
> {code:java|title=RegisterTest_spark1.6.scala|borderStyle=solid}
>  
> val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
> val schema = StructType(StructField("num", IntegerType) :: Nil)
> val df1 = sql.createDataFrame(rdd, schema)
> df1.registerTempTable("data_table")
> sql.udf.register("printUDF", (x:Int) => {print(x)
>   x
> })
> val df2 = sql.sql("select printUDF(num) result from data_table").cache()
> df2.collect() //execute cache
> df2.registerTempTable("cached_table")
> val df3 = sql.sql("select result from cached_table")
> df3.collect()
> {code}
> {code:java|title=RegisterTest_spark2.3.scala|borderStyle=solid}
>  
> val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
> val schema = 

[jira] [Updated] (SPARK-26510) Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 'createOrReplaceTempView'

2018-12-31 Thread Hagai Attias (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hagai Attias updated SPARK-26510:
-
Description: 
It seems that there's a change of behaviour between 1.6 and 2.3 when caching a 
Dataframe and saving it as a temp table. In 1.6, the following code executed 
{{printUDF}} once. The equivalent code in 2.3 (or even same as is) executes it 
twice.
 
{code:java|title=RegisterTest_spark1.6.scala|borderStyle=solid}
 
val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = sql.createDataFrame(rdd, schema)
df1.registerTempTable("data_table")

sql.udf.register("printUDF", (x:Int) => {
  print(x)
  x
})

val df2 = sql.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

df2.registerTempTable("cached_table")

val df3 = sql.sql("select result from cached_table")

df3.collect()
{code}
{code:java|title=RegisterTest_spark2.3.scala|borderStyle=solid}
 
val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = session.createDataFrame(rdd, schema)
df1.createOrReplaceTempView("data_table")

session.udf.register("printUDF", new UDF().print _)

val df2 = session.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

df2.createOrReplaceTempView("cached_table")

val df3 = session.sql("select result from cached_table")

df3.collect()
{code}
 

1.6 prints `123` while 2.3 prints `123123`, thus evaluating the dataframe 
twice. Managed to overcome by skipping the temporary table and selecting 
directly from the cached dataframe, but was wondering if that is an expected 
behavior / known issue.
 

  was:
It seems that there's a change of behaviour between 1.6 and 2.3 when caching a 
Dataframe and saving it as a temp table. In 1.6, the following code executed 
{{printUDF}} once. The equivalent code in 2.3 (or even same as is) executes it 
twice.
 
{code:java|title=RegisterTest.scala|borderStyle=solid}
 
val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = sql.createDataFrame(rdd, schema)
df1.registerTempTable("data_table")

sql.udf.register("printUDF", (x:Int) => {
  print(x)
  x
})

val df2 = sql.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

df2.registerTempTable("cached_table")

val df3 = sql.sql("select result from cached_table")

df3.collect()
{code}
 
1.6 prints `123` while 2.3 prints `123123`, thus evaluating the dataframe 
twice. Managed to overcome by skipping the temporary table and selecting 
directly from the cached dataframe, but was wondering if that is an expected 
behavior / known issue.
 


> Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 
> 'createOrReplaceTempView'
> --
>
> Key: SPARK-26510
> URL: https://issues.apache.org/jira/browse/SPARK-26510
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Hagai Attias
>Priority: Major
>
> It seems that there's a change of behaviour between 1.6 and 2.3 when caching 
> a Dataframe and saving it as a temp table. In 1.6, the following code 
> executed {{printUDF}} once. The equivalent code in 2.3 (or even same as is) 
> executes it twice.
>  
> {code:java|title=RegisterTest_spark1.6.scala|borderStyle=solid}
>  
> val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
> val schema = StructType(StructField("num", IntegerType) :: Nil)
> val df1 = sql.createDataFrame(rdd, schema)
> df1.registerTempTable("data_table")
> sql.udf.register("printUDF", (x:Int) => {
>   print(x)
>   x
> })
> val df2 = sql.sql("select printUDF(num) result from data_table").cache()
> df2.collect() //execute cache
> df2.registerTempTable("cached_table")
> val df3 = sql.sql("select result from cached_table")
> df3.collect()
> {code}
> {code:java|title=RegisterTest_spark2.3.scala|borderStyle=solid}
>  
> val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
> val schema = StructType(StructField("num", IntegerType) :: Nil)
> val df1 = session.createDataFrame(rdd, schema)
> df1.createOrReplaceTempView("data_table")
> session.udf.register("printUDF", new UDF().print _)
> val df2 = session.sql("select printUDF(num) result from data_table").cache()
> df2.collect() //execute cache
> df2.createOrReplaceTempView("cached_table")
> val df3 = session.sql("select result from cached_table")
> df3.collect()
> {code}
>  
> 1.6 prints `123` while 2.3 prints `123123`, thus evaluating the dataframe 
> twice. Managed to overcome by skipping the temporary table and selecting 
> directly from the 

[jira] [Updated] (SPARK-26510) Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 'createOrReplaceTempView'

2018-12-31 Thread Hagai Attias (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hagai Attias updated SPARK-26510:
-
Description: 
It seems that there's a change of behaviour between 1.6 and 2.3 when caching a 
Dataframe and saving it as a temp table. In 1.6, the following code executed 
{{printUDF}} once. The equivalent code in 2.3 (or even same as is) executes it 
twice.
 
{code:java|title=RegisterTest.scala|borderStyle=solid}
 
val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = sql.createDataFrame(rdd, schema)
df1.registerTempTable("data_table")

sql.udf.register("printUDF", (x:Int) => {
  print(x)
  x
})

val df2 = sql.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

df2.registerTempTable("cached_table")

val df3 = sql.sql("select result from cached_table")

df3.collect()
{code}
 
1.6 prints 123 while 2.3 prints 123123, thus evaluating the dataframe twice. I 
managed to overcome by skipping the temporary table and selecting directly from 
the cached dataframe, but was wondering if that is an expected behavior / known 
issue.
 

  was:
It seems that there's a change of behaviour between 1.6 and 2.3 when caching a 
Dataframe and saving it as a temp table. In 1.6, the following code executed 
{{printUDF}} once. The equivalent code in 2.3 (or even same as is) executes it 
twice.
 
{code:java|title=RegisterTest.scala|borderStyle=solid}
 
val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = sql.createDataFrame(rdd, schema)
df1.registerTempTable("data_table")

sql.udf.register("printUDF", new UDF().print _)

val df2 = sql.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

df2.registerTempTable("cached_table")

val df3 = sql.sql("select result from cached_table")

df3.collect()
{code}
 
1.6 prints 123 while 2.3 prints 123123, thus evaluating the dataframe twice. I 
managed to overcome by skipping the temporary table and selecting directly from 
the cached dataframe, but was wondering if that is an expected behavior / known 
issue.
 


> Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 
> 'createOrReplaceTempView'
> --
>
> Key: SPARK-26510
> URL: https://issues.apache.org/jira/browse/SPARK-26510
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Hagai Attias
>Priority: Major
>
> It seems that there's a change of behaviour between 1.6 and 2.3 when caching 
> a Dataframe and saving it as a temp table. In 1.6, the following code 
> executed {{printUDF}} once. The equivalent code in 2.3 (or even same as is) 
> executes it twice.
>  
> {code:java|title=RegisterTest.scala|borderStyle=solid}
>  
> val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
> val schema = StructType(StructField("num", IntegerType) :: Nil)
> val df1 = sql.createDataFrame(rdd, schema)
> df1.registerTempTable("data_table")
> sql.udf.register("printUDF", (x:Int) => {
>   print(x)
>   x
> })
> val df2 = sql.sql("select printUDF(num) result from data_table").cache()
> df2.collect() //execute cache
> df2.registerTempTable("cached_table")
> val df3 = sql.sql("select result from cached_table")
> df3.collect()
> {code}
>  
> 1.6 prints 123 while 2.3 prints 123123, thus evaluating the dataframe twice. 
> I managed to overcome by skipping the temporary table and selecting directly 
> from the cached dataframe, but was wondering if that is an expected behavior 
> / known issue.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26510) Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 'createOrReplaceTempView'

2018-12-31 Thread Hagai Attias (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hagai Attias updated SPARK-26510:
-
Description: 
It seems that there's a change of behaviour between 1.6 and 2.3 when caching a 
Dataframe and saving it as a temp table. In 1.6, the following code executed 
{{printUDF}} once. The equivalent code in 2.3 (or even same as is) executes it 
twice.
 
{code:java|title=RegisterTest.scala|borderStyle=solid}
 
val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = session.createDataFrame(rdd, schema)
df1.createOrReplaceTempView("data_table")

session.udf.register("printUDF", (x:Int) => {
 print(x)
 x
})

val df2 = session.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

val df3 = df2.select("result")

df3.collect(){code}
 
1.6 prints 123 while 2.3 prints 123123, thus evaluating the dataframe twice. I 
managed to overcome by skipping the temporary table and selecting directly from 
the cached dataframe, but was wondering if that is an expected behavior / known 
issue.
 

  was:
It seems that there's a change of behaviour between 1.6 and 2.3 when caching a 
Dataframe and saving it as a temp table. In 1.6, the following code executed 
{{printUDF}} once. The equivalent code in 2.3 (or even same as is) executes it 
twice.
 
{code:java|title=RegisterTest.scala|borderStyle=solid}
 
val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = session.createDataFrame(rdd, schema)
df1.createOrReplaceTempView("data_table")

session.udf.register("printUDF", (x:Int) => {
 print(x)
 x
})

val df2 = session.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

val df3 = df2.select("result")

df3.collect()
{code}

 
1.6 prints 123 while 2.3 prints 123123, thus evaluating the dataframe twice. I 
managed to overcome by skipping the temporary table and selecting directly from 
the cached dataframe, but was wondering if that is an expected behavior / known 
issue.
 


> Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 
> 'createOrReplaceTempView'
> --
>
> Key: SPARK-26510
> URL: https://issues.apache.org/jira/browse/SPARK-26510
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Hagai Attias
>Priority: Major
>
> It seems that there's a change of behaviour between 1.6 and 2.3 when caching 
> a Dataframe and saving it as a temp table. In 1.6, the following code 
> executed {{printUDF}} once. The equivalent code in 2.3 (or even same as is) 
> executes it twice.
>  
> {code:java|title=RegisterTest.scala|borderStyle=solid}
>  
> val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
> val schema = StructType(StructField("num", IntegerType) :: Nil)
> val df1 = session.createDataFrame(rdd, schema)
> df1.createOrReplaceTempView("data_table")
> session.udf.register("printUDF", (x:Int) => {
>  print(x)
>  x
> })
> val df2 = session.sql("select printUDF(num) result from data_table").cache()
> df2.collect() //execute cache
> val df3 = df2.select("result")
> df3.collect(){code}
>  
> 1.6 prints 123 while 2.3 prints 123123, thus evaluating the dataframe twice. 
> I managed to overcome by skipping the temporary table and selecting directly 
> from the cached dataframe, but was wondering if that is an expected behavior 
> / known issue.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26510) Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 'createOrReplaceTempView'

2018-12-31 Thread Hagai Attias (JIRA)
Hagai Attias created SPARK-26510:


 Summary: Spark 2.3 change of behavior (vs 1.6) when caching a 
dataframe and using 'createOrReplaceTempView'
 Key: SPARK-26510
 URL: https://issues.apache.org/jira/browse/SPARK-26510
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.3.0
Reporter: Hagai Attias


It seems that there's a change of behaviour between 1.6 and 2.3 when caching a 
Dataframe and saving it as a temp table. In 1.6, the following code executed 
{{printUDF}} once. The equivalent code in 2.3 (or even same as is) executes it 
twice.
 
{code:java|title=RegisterTest.scala|borderStyle=solid}
 
val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = session.createDataFrame(rdd, schema)
df1.createOrReplaceTempView("data_table")

session.udf.register("printUDF", (x:Int) => {
 print(x)
 x
})

val df2 = session.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

val df3 = df2.select("result")

df3.collect()
{code}
 
1.6 prints 123 while 2.3 prints 123123, thus evaluating the dataframe twice. I 
managed to overcome by skipping the temporary table and selecting directly from 
the cached dataframe, but was wondering if that is an expected behavior / known 
issue.
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26510) Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 'createOrReplaceTempView'

2018-12-31 Thread Hagai Attias (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hagai Attias updated SPARK-26510:
-
Description: 
It seems that there's a change of behaviour between 1.6 and 2.3 when caching a 
Dataframe and saving it as a temp table. In 1.6, the following code executed 
{{printUDF}} once. The equivalent code in 2.3 (or even same as is) executes it 
twice.
 
{code:java|title=RegisterTest.scala|borderStyle=solid}
 
val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = session.createDataFrame(rdd, schema)
df1.createOrReplaceTempView("data_table")

session.udf.register("printUDF", (x:Int) => {
 print(x)
 x
})

val df2 = session.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

val df3 = df2.select("result")

df3.collect()
{code}

 
1.6 prints 123 while 2.3 prints 123123, thus evaluating the dataframe twice. I 
managed to overcome by skipping the temporary table and selecting directly from 
the cached dataframe, but was wondering if that is an expected behavior / known 
issue.
 

  was:
It seems that there's a change of behaviour between 1.6 and 2.3 when caching a 
Dataframe and saving it as a temp table. In 1.6, the following code executed 
{{printUDF}} once. The equivalent code in 2.3 (or even same as is) executes it 
twice.
 
{code:java|title=RegisterTest.scala|borderStyle=solid}
 
val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = session.createDataFrame(rdd, schema)
df1.createOrReplaceTempView("data_table")

session.udf.register("printUDF", (x:Int) => {
 print(x)
 x
})

val df2 = session.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

val df3 = df2.select("result")

df3.collect()
{code}
 
1.6 prints 123 while 2.3 prints 123123, thus evaluating the dataframe twice. I 
managed to overcome by skipping the temporary table and selecting directly from 
the cached dataframe, but was wondering if that is an expected behavior / known 
issue.
 


> Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 
> 'createOrReplaceTempView'
> --
>
> Key: SPARK-26510
> URL: https://issues.apache.org/jira/browse/SPARK-26510
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Hagai Attias
>Priority: Major
>
> It seems that there's a change of behaviour between 1.6 and 2.3 when caching 
> a Dataframe and saving it as a temp table. In 1.6, the following code 
> executed {{printUDF}} once. The equivalent code in 2.3 (or even same as is) 
> executes it twice.
>  
> {code:java|title=RegisterTest.scala|borderStyle=solid}
>  
> val rdd = session.sparkContext.parallelize(Seq(1, 2, 3)).map(Row(_))
> val schema = StructType(StructField("num", IntegerType) :: Nil)
> val df1 = session.createDataFrame(rdd, schema)
> df1.createOrReplaceTempView("data_table")
> session.udf.register("printUDF", (x:Int) => {
>  print(x)
>  x
> })
> val df2 = session.sql("select printUDF(num) result from data_table").cache()
> df2.collect() //execute cache
> val df3 = df2.select("result")
> df3.collect()
>  
>  
> {code}
>  
> 1.6 prints 123 while 2.3 prints 123123, thus evaluating the dataframe twice. 
> I managed to overcome by skipping the temporary table and selecting directly 
> from the cached dataframe, but was wondering if that is an expected behavior 
> / known issue.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26509) Parquet DELTA_BYTE_ARRAY is not supported in Spark 2.x's Vectorized Reader

2018-12-31 Thread Filipe Gonzaga Miranda (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Filipe Gonzaga Miranda updated SPARK-26509:
---
Remaining Estimate: 40h  (was: 168h)
 Original Estimate: 40h  (was: 168h)

> Parquet DELTA_BYTE_ARRAY is not supported in Spark 2.x's Vectorized Reader
> --
>
> Key: SPARK-26509
> URL: https://issues.apache.org/jira/browse/SPARK-26509
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Filipe Gonzaga Miranda
>Priority: Major
>   Original Estimate: 40h
>  Remaining Estimate: 40h
>
> I get the exception below in Spark 2.4 when reading parquet files where some 
> columns are DELTA_BYTE_ARRAY encoded.
>  
> {code:java}
> java.lang.UnsupportedOperationException: Unsupported encoding: 
> DELTA_BYTE_ARRAY
>  
> {code}
>  
> If the property spark.sql.parquet.enableVectorizedReader is set to false, it 
> works.
> The parquet files were written with Parquet V2, and as far as I understand, 
> V2 is the version used in Spark 2.x.
> I did not find any property to change which Parquet version Spark uses (V1 or 
> V2).
> Is there any way to benefit from the Vectorized Reader? Or would this require 
> a new implementation to support this version? I would propose so.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26509) Parquet DELTA_BYTE_ARRAY is not supported in Spark 2.x's Vectorized Reader

2018-12-31 Thread Filipe Gonzaga Miranda (JIRA)
Filipe Gonzaga Miranda created SPARK-26509:
--

 Summary: Parquet DELTA_BYTE_ARRAY is not supported in Spark 2.x's 
Vectorized Reader
 Key: SPARK-26509
 URL: https://issues.apache.org/jira/browse/SPARK-26509
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.4.0
Reporter: Filipe Gonzaga Miranda


I get the exception below in Spark 2.4 when reading parquet files where some 
columns are DELTA_BYTE_ARRAY encoded.

 
{code:java}
java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
 
{code}
 

If the property spark.sql.parquet.enableVectorizedReader is set to false, it 
works.

The parquet files were written with Parquet V2, and as far as I understand, 
V2 is the version used in Spark 2.x.

I did not find any property to change which Parquet version Spark uses (V1 or 
V2).

Is there any way to benefit from the Vectorized Reader? Or would this require 
a new implementation to support this version? I would propose so.
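
For reference, the fallback mentioned above can be applied when building the session; a minimal sketch in which only the config key comes from this report, while the app name and file path are illustrative:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("read-parquet-v2")
  .master("local[*]")
  // Fall back to the non-vectorized Parquet reader so DELTA_BYTE_ARRAY columns can be read.
  .config("spark.sql.parquet.enableVectorizedReader", "false")
  .getOrCreate()

// Hypothetical path; any Parquet data written with V2 encodings would do.
val df = spark.read.parquet("/path/to/parquet_v2_data")
df.show()
{code}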



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26508) Address warning messages in Java by lgtm.com

2018-12-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26508:


Assignee: (was: Apache Spark)

> Address warning messages in Java by lgtm.com
> 
>
> Key: SPARK-26508
> URL: https://issues.apache.org/jira/browse/SPARK-26508
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> [lgtm.com|http://lgtm.com] provides automated code review of 
> Java/Python/JavaScript files for OSS projects.
> [Here|https://lgtm.com/projects/g/apache/spark/alerts/?mode=list=warning]
>  are warning messages regarding Apache Spark project.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26508) Address warning messages in Java by lgtm.com

2018-12-31 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26508:


Assignee: Apache Spark

> Address warning messages in Java by lgtm.com
> 
>
> Key: SPARK-26508
> URL: https://issues.apache.org/jira/browse/SPARK-26508
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>Priority: Minor
>
> [lgtm.com|http://lgtm.com] provides automated code review of 
> Java/Python/JavaScript files for OSS projects.
> [Here|https://lgtm.com/projects/g/apache/spark/alerts/?mode=list=warning]
>  are warning messages regarding Apache Spark project.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26508) Address warning messages in Java by lgtm.com

2018-12-31 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-26508:


 Summary: Address warning messages in Java by lgtm.com
 Key: SPARK-26508
 URL: https://issues.apache.org/jira/browse/SPARK-26508
 Project: Spark
  Issue Type: Improvement
  Components: Examples, Spark Core, SQL
Affects Versions: 3.0.0
Reporter: Kazuaki Ishizaki


[lgtm.com|http://lgtm.com] provides automated code review of 
Java/Python/JavaScript files for OSS projects.

[Here|https://lgtm.com/projects/g/apache/spark/alerts/?mode=list=warning]
 are warning messages regarding Apache Spark project.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26500) Add conf to support ignore hdfs data locality

2018-12-31 Thread EdisonWang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang resolved SPARK-26500.

Resolution: Not A Problem

> Add conf to support ignore hdfs data locality
> -
>
> Key: SPARK-26500
> URL: https://issues.apache.org/jira/browse/SPARK-26500
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Priority: Trivial
>
> When reading a large hive table/directory with thousands of files, it can 
> take up to several minutes or even hours to calculate data locality for each 
> split in the driver, while the executors sit idle.
> This situation is even worse when running in SparkThriftServer mode, because 
> handleJobSubmitted (which calls getPreferedLocation) runs in a single 
> thread, so one big SQL query will block all the following queries.
> At the same time, most companies' internal networks use gigabit (or faster) 
> network cards, so it is acceptable to read data without locality.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org