[jira] [Created] (SPARK-26951) Should not throw KryoException when root cause is IOexception

2019-02-20 Thread zhoukang (JIRA)
zhoukang created SPARK-26951:


 Summary: Should not throw KryoException when root cause is 
IOexception
 Key: SPARK-26951
 URL: https://issues.apache.org/jira/browse/SPARK-26951
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: zhoukang


The job will fail with the exception below:

{code:java}
Job aborted due to stage failure: Task 1576 in stage 97.0 failed 4 times, most 
recent failure: Lost task 1576.3 in stage 97.0 (TID 121949, xxx, executor 14): 
com.esotericsoftware.kryo.KryoException: java.io.IOException: Stream is 
corrupted. The lz4's magic number should be LZ4Block(4c5a34426c6f636b), but 
received buffer's head bytes is ().
{code}

{code:java}
Job aborted due to stage failure: Task 1576 in stage 97.0 failed 4 times, most 
recent failure: Lost task 1576.3 in stage 97.0 (TID 121949, xxx, executor 14): 
com.esotericsoftware.kryo.KryoException: java.io.IOException: Stream is 
corrupted. The lz4's magic number should be LZ4Block(4c5a34426c6f636b), but 
received buffer's head bytes is ().
at com.esotericsoftware.kryo.io.Input.fill(Input.java:166)
at com.esotericsoftware.kryo.io.Input.require(Input.java:196)
at com.esotericsoftware.kryo.io.Input.readVarInt(Input.java:373)
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:127)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:804)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:244)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:180)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Stream is corrupted. The lz4's magic number 
should be LZ4Block(4c5a34426c6f636b), but received buffer's head bytes is 
().
at 
org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:169)
at 
org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:127)
at com.esotericsoftware.kryo.io.Input.fill(Input.java:164)
... 19 more

Driver stacktrace:
{code}
For an IOException root cause, the task should be retried instead of failing with a KryoException.
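
A minimal sketch of the idea (my assumption, not an actual patch): unwrap the 
KryoException in the deserialization path and rethrow the root IOException, so the 
failure is treated as a retriable stream problem rather than a serializer error. The 
helper name below is hypothetical.

{code}
import java.io.IOException
import com.esotericsoftware.kryo.KryoException

// Hypothetical helper: if Kryo merely wrapped an IOException (e.g. a corrupted LZ4
// shuffle stream), surface the IOException so the caller can retry the fetch instead
// of failing the task with a KryoException.
def unwrapKryoIOException[T](body: => T): T = {
  try {
    body
  } catch {
    case e: KryoException if e.getCause.isInstanceOf[IOException] =>
      throw e.getCause.asInstanceOf[IOException]
  }
}
{code}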




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26950) Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values

2019-02-20 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-26950:
-

 Summary: Make RandomDataGenerator use Float.NaN or Double.NaN for 
all NaN values
 Key: SPARK-26950
 URL: https://issues.apache.org/jira/browse/SPARK-26950
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 2.3.4, 2.4.2, 3.0.0
Reporter: Dongjoon Hyun


Apache Spark uses the predefined `Float.NaN` and `Double.NaN` for NaN values, 
but there exist more NaN values with different binary representations.

{code}
scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array
res1: Array[Byte] = Array(127, -64, 0, 0)

scala> val x = java.lang.Float.intBitsToFloat(-6966608)
x: Float = NaN

scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array
res2: Array[Byte] = Array(-1, -107, -78, -80)
{code}

`RandomDataGenerator` generates these NaN values. That is fine in itself, but it 
causes `checkEvaluationWithUnsafeProjection` failures due to differences in the 
`UnsafeRow` binary representation. The following is an instance of the UT failure. 
This issue aims to fix this flakiness.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102528/testReport/

{code}
Failed
org.apache.spark.sql.avro.AvroCatalystDataConversionSuite.flat schema 
struct
 with seed -81044812370056695
{code}
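
A minimal sketch of the intended direction (my assumption, not the merged patch): 
have the generator collapse any NaN it produces to the canonical Float.NaN / 
Double.NaN so that UnsafeRow comparisons see a single binary representation.

{code}
// Hypothetical helpers: map every NaN bit pattern to the canonical constant.
def canonicalizeFloat(f: Float): Float =
  if (java.lang.Float.isNaN(f)) Float.NaN else f

def canonicalizeDouble(d: Double): Double =
  if (java.lang.Double.isNaN(d)) Double.NaN else d
{code}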



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26950) Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values

2019-02-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26950:


Assignee: Apache Spark

> Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values
> ---
>
> Key: SPARK-26950
> URL: https://issues.apache.org/jira/browse/SPARK-26950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.4, 2.4.2, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> Apache Spark uses the predefined `Float.NaN` and `Double.NaN` for NaN values, 
> but there exist more NaN values with different binary representations.
> {code}
> scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array
> res1: Array[Byte] = Array(127, -64, 0, 0)
> scala> val x = java.lang.Float.intBitsToFloat(-6966608)
> x: Float = NaN
> scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array
> res2: Array[Byte] = Array(-1, -107, -78, -80)
> {code}
> `RandomDataGenerator` generates these NaN values. That is fine in itself, but it 
> causes `checkEvaluationWithUnsafeProjection` failures due to differences in the 
> `UnsafeRow` binary representation. The following is an instance of the UT failure. 
> This issue aims to fix this flakiness.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102528/testReport/
> {code}
> Failed
> org.apache.spark.sql.avro.AvroCatalystDataConversionSuite.flat schema 
> struct
>  with seed -81044812370056695
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26950) Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values

2019-02-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26950:


Assignee: (was: Apache Spark)

> Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values
> ---
>
> Key: SPARK-26950
> URL: https://issues.apache.org/jira/browse/SPARK-26950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.4, 2.4.2, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Apache Spark uses the predefined `Float.NaN` and `Double.NaN` for NaN values, 
> but there exist more NaN values with different binary representations.
> {code}
> scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array
> res1: Array[Byte] = Array(127, -64, 0, 0)
> scala> val x = java.lang.Float.intBitsToFloat(-6966608)
> x: Float = NaN
> scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array
> res2: Array[Byte] = Array(-1, -107, -78, -80)
> {code}
> `RandomDataGenerator` generates these NaN values. That is fine in itself, but it 
> causes `checkEvaluationWithUnsafeProjection` failures due to differences in the 
> `UnsafeRow` binary representation. The following is an instance of the UT failure. 
> This issue aims to fix this flakiness.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102528/testReport/
> {code}
> Failed
> org.apache.spark.sql.avro.AvroCatalystDataConversionSuite.flat schema 
> struct
>  with seed -81044812370056695
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26825) Spark Structure Streaming job failing when submitted in cluster mode

2019-02-20 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773727#comment-16773727
 ] 

Jungtaek Lim commented on SPARK-26825:
--

A similar issue was reported (SPARK-19909) whose root cause looks the same, so I 
guess the current PR for this issue would resolve SPARK-19909 as well.

> Spark Structure Streaming job failing when submitted in cluster mode
> 
>
> Key: SPARK-26825
> URL: https://issues.apache.org/jira/browse/SPARK-26825
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Andre Araujo
>Priority: Major
>
> I have a structured streaming job that runs successfully when launched in 
> "client" mode. However, when launched in "cluster" mode it fails with the 
> following weird messages on the error log. Note that the path in the error 
> message is actually a local filesystem path that has been mistakenly prefixed 
> with a {{hdfs://}} scheme.
> {code}
> 19/02/01 12:53:14 ERROR streaming.StreamMetadata: Error writing stream 
> metadata StreamMetadata(68f9fb30-5853-49b4-b192-f1e0483e0d95) to 
> hdfs://ns1/data/yarn/nm/usercache/root/appcache/application_1548823131831_0160/container_1548823131831_0160_02_01/tmp/temporary-3789423a-6ded-4084-aab3-3b6301c34e07/metadataorg.apache.hadoop.security.AccessControlException:
>  Permission denied: user=root, access=WRITE, 
> inode="/":hdfs:supergroup:drwxr-xr-x
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:400)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:256)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:194)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1853)
> {code}
> I dug a little bit into this and here's what I think is going on:
> # When a new streaming query is created, the {{StreamingQueryManager}} 
> determines the checkpoint location 
> [here|https://github.com/apache/spark/blob/d811369ce23186cbb3208ad665e15408e13fea87/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#L216].
>  If neither the user nor the Spark conf specify a checkpoint location, the 
> location is returned by a call to {{Utils.createTempDir(namePrefix = 
> s"temporary").getCanonicalPath}}. 
>Here, I see two issues:
> #* The canonical path returned by {{Utils.createTempDir}} does *not* have a 
> scheme ({{hdfs://}} or {{file://}}), so it's ambiguous which type of 
> file system the path belongs to.
> #* Also note that the path returned by the {{Utils.createTempDir}} call is a 
> local path, not an HDFS path, unlike the paths returned by the other two 
> conditions. I executed {{Utils.createTempDir}} in a test job, both in cluster 
> and client modes, and the results are these:
> {code}
> *Client mode:*
> java.io.tmpdir=/tmp
> createTempDir(namePrefix = s"temporary") => 
> /tmp/temporary-c51f1466-fd50-40c7-b136-1f2f06672e25
> *Cluster mode:*
> java.io.tmpdir=/yarn/nm/usercache/root/appcache/application_154906473_0029/container_154906473_0029_01_01/tmp/
> createTempDir(namePrefix = s"temporary") => 
> /yarn/nm/usercache/root/appcache/application_154906473_0029/container_154906473_0029_01_01/tmp/temporary-47c13b28-14bd-4d1b-8acc-3e445948415e
> {code}
> # This temporary checkpoint location is then [passed to the 
> constructor|https://github.com/apache/spark/blob/d811369ce23186cbb3208ad665e15408e13fea87/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#L276]
>  of the {{MicroBatchExecution}} instance
> # This is the point where [{{resolvedCheckpointRoot}} is 
> calculated|https://github.com/apache/spark/blob/755f9c20761e3db900c6c2b202cd3d2c5bbfb7c0/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L89].
>  Here is where things start to break: since the path returned by 
> {{Utils.createTempDir}} doesn't have a scheme, and since HDFS is the default 
> filesystem, the code resolves the path as an HDFS path, rather than a 
> local one, as shown below:
> {code}
> scala> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.fs.Path
> scala> // value returned by the Utils.createTempDir method
> scala> val checkpointRoot = 
> "/yarn/nm/usercache/root/appcache/application_154906473_0029/container_154906473_0029_01_01/tmp/temporary-47c13b28-14bd-4d1b-8acc-3e445948415e"
> checkpointRoot: String = 
> /yarn/nm/usercache/root/appcache/application_154906473_0029/container_154906473_0029_01_01/tmp/temporary-47c13b28-14bd-4d1b-8acc-3e445948415e
> scala> val checkpointPath = new 

[jira] [Commented] (SPARK-24818) Ensure all the barrier tasks in the same stage are launched together

2019-02-20 Thread luzengxiang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773729#comment-16773729
 ] 

luzengxiang commented on SPARK-24818:
-

"cannot fulfill task locality requirements" keeps happening! 

> Ensure all the barrier tasks in the same stage are launched together
> 
>
> Key: SPARK-24818
> URL: https://issues.apache.org/jira/browse/SPARK-24818
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> When some executors/hosts are blacklisted, it may happen that only a part of 
> the tasks in the same barrier stage can be launched. We shall detect the case 
> and revert the allocated resource offers.
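
A rough sketch of the all-or-nothing check described above (shapes and names are my 
assumption, not the actual scheduler code):

{code}
// Illustrative types; not Spark's scheduler API.
case class TaskSlot(taskId: Long, executorId: String)

// Launch either every task of the barrier stage or none of them: if blacklisting
// leaves fewer slots than tasks, revert the tentative launches so the offers can be
// reused in a later scheduling round.
def commitBarrierLaunches(candidates: Seq[TaskSlot], numBarrierTasks: Int): Seq[TaskSlot] =
  if (candidates.size == numBarrierTasks) candidates else Seq.empty
{code}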



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26919) change maven default compile java home

2019-02-20 Thread daile (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

daile resolved SPARK-26919.
---
   Resolution: Done
Fix Version/s: 2.4.0

> change maven default compile java home
> --
>
> Key: SPARK-26919
> URL: https://issues.apache.org/jira/browse/SPARK-26919
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.1
>Reporter: daile
>Priority: Critical
> Fix For: 2.4.0
>
> Attachments: p1.png
>
>
> When I use "build/mvn -DskipTests clean package", the default Java home 
> configuration is "${java.home}". I tried the environments of macOS and Windows 
> and found that the default java.home is */jre, but the JRE does not provide the 
> javac compile command. So I think it can be replaced with the system environment 
> variable; with that change the build compiles successfully.
> !image-2019-02-19-10-25-02-872.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26425) Add more constraint checks in file streaming source to avoid checkpoint corruption

2019-02-20 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773705#comment-16773705
 ] 

Jungtaek Lim commented on SPARK-26425:
--

It seems no work has been done in 2 months, while the observations in the JIRA 
description make perfect sense to me.

[~tdas] Since you've assigned yourself, are you planning to work on this? If 
you're too busy to handle this, I could take it over.

> Add more constraint checks in file streaming source to avoid checkpoint 
> corruption
> --
>
> Key: SPARK-26425
> URL: https://issues.apache.org/jira/browse/SPARK-26425
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>
> Two issues observed in production. 
> - HDFSMetadataLog.getLatest() tries to read older versions when it is not 
> able to read the latest listed version file. Not sure why this was done but 
> this should not be done. If the latest listed file is not readable, then 
> something is horribly wrong and we should fail rather than report an older 
> version as that can completely corrupt the checkpoint directory. 
> - FileStreamSource should check whether adding a new batch to the 
> FileStreamSourceLog succeeded or not (similar to how StreamExecution checks 
> for the OffsetSeqLog)
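
A minimal sketch of the second check (my assumption, not the actual change), 
mirroring how StreamExecution treats the OffsetSeqLog: fail fast when the batch 
cannot be recorded in the source log. Names below are illustrative.

{code}
// `sourceLogAdd` stands in for the FileStreamSourceLog.add call, which reports
// whether the batch was actually written.
def commitBatchOrFail(sourceLogAdd: => Boolean, batchId: Long): Unit = {
  if (!sourceLogAdd) {
    // Fail fast instead of silently continuing with a checkpoint that no longer
    // matches what the source planned to process.
    throw new IllegalStateException(
      s"Concurrent update to the log: batch $batchId was written by another process.")
  }
}
{code}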



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26949) Prevent "purge" to remove needed batch files in CompactibleFileStreamLog

2019-02-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26949:


Assignee: Apache Spark

> Prevent "purge" to remove needed batch files in CompactibleFileStreamLog
> 
>
> Key: SPARK-26949
> URL: https://issues.apache.org/jira/browse/SPARK-26949
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Minor
>
> I've seen a couple of attempts (in open PRs; I've also tried it myself) that call 
> purge() on CompactibleFileStreamLog, but after looking at the codebase of 
> CompactibleFileStreamLog, I've realized that purging the latest compaction batch 
> would break CompactibleFileStreamLog's internal state and throw an 
> IllegalStateException.
> Given that CompactibleFileStreamLog maintains its batches and purges them 
> according to its configuration, it would be safer to just rely on 
> CompactibleFileStreamLog to purge and to prevent `purge` from being called 
> outside of CompactibleFileStreamLog.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26949) Prevent "purge" to remove needed batch files in CompactibleFileStreamLog

2019-02-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26949:


Assignee: (was: Apache Spark)

> Prevent "purge" to remove needed batch files in CompactibleFileStreamLog
> 
>
> Key: SPARK-26949
> URL: https://issues.apache.org/jira/browse/SPARK-26949
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> I've seen a couple of attempts (in open PRs; I've also tried it myself) that call 
> purge() on CompactibleFileStreamLog, but after looking at the codebase of 
> CompactibleFileStreamLog, I've realized that purging the latest compaction batch 
> would break CompactibleFileStreamLog's internal state and throw an 
> IllegalStateException.
> Given that CompactibleFileStreamLog maintains its batches and purges them 
> according to its configuration, it would be safer to just rely on 
> CompactibleFileStreamLog to purge and to prevent `purge` from being called 
> outside of CompactibleFileStreamLog.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26949) Prevent "purge" to remove needed batch files in CompactibleFileStreamLog

2019-02-20 Thread Jungtaek Lim (JIRA)
Jungtaek Lim created SPARK-26949:


 Summary: Prevent "purge" to remove needed batch files in 
CompactibleFileStreamLog
 Key: SPARK-26949
 URL: https://issues.apache.org/jira/browse/SPARK-26949
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


I've seen a couple of attempts (in open PRs; I've also tried it myself) that call 
purge() on CompactibleFileStreamLog, but after looking at the codebase of 
CompactibleFileStreamLog, I've realized that purging the latest compaction batch 
would break CompactibleFileStreamLog's internal state and throw an 
IllegalStateException.

Given that CompactibleFileStreamLog maintains its batches and purges them according 
to its configuration, it would be safer to just rely on 
CompactibleFileStreamLog to purge and to prevent `purge` from being called outside of 
CompactibleFileStreamLog.
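
A minimal sketch of how this could be enforced (my assumption, not necessarily the 
final change): reject external purge calls so that only the log's own 
compaction-driven cleanup can delete batch files.

{code}
// Illustrative class shape, not Spark's actual hierarchy.
abstract class MetadataLogSketch {
  def purge(thresholdBatchId: Long): Unit
}

class CompactibleLogSketch extends MetadataLogSketch {
  // External callers must not delete batch files that a later compaction or a
  // restarted query may still need.
  override def purge(thresholdBatchId: Long): Unit =
    throw new UnsupportedOperationException(
      s"purge($thresholdBatchId) is not supported; old batch files are removed " +
      "automatically according to the log's retention configuration.")
}
{code}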



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26946) Identifiers for multi-catalog Spark

2019-02-20 Thread Thincrs (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773632#comment-16773632
 ] 

Thincrs commented on SPARK-26946:
-

A user of thincrs has selected this issue. Deadline: Thu, Feb 28, 2019 3:27 AM

> Identifiers for multi-catalog Spark
> ---
>
> Key: SPARK-26946
> URL: https://issues.apache.org/jira/browse/SPARK-26946
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: John Zhuge
>Priority: Major
>
> Propose semantics for identifiers and a listing API to support multiple 
> catalogs.
> [~rdblue]'s SPIP: 
> [https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26643) Spark Hive throw an AnalysisException,when set table properties.But this AnalysisException contains one typo and one unsuited word.

2019-02-20 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26643.
---
Resolution: Not A Problem

> Spark Hive throw an AnalysisException,when set table properties.But this 
> AnalysisException contains one typo and one unsuited word.
> ---
>
> Key: SPARK-26643
> URL: https://issues.apache.org/jira/browse/SPARK-26643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: jiaan.geng
>Priority: Minor
>
> When I execute a DDL statement in spark-sql, it throws an AnalysisException as follows:
> {code:java}
> spark-sql> ALTER TABLE gja_test3 SET TBLPROPERTIES ('test' = 'test');
> org.apache.spark.sql.AnalysisException: Cannot persistent work.gja_test3 into 
> hive metastore as table property keys may not start with 'spark.sql.': 
> [spark.sql.partitionProvider];
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.verifyTableProperties(HiveExternalCatalog.scala:129){code}
> I found that the message of this exception contains one typo ("persistent") and 
> one unsuited word ("hive").
> So I think this analysis exception should change from
> {code:java}
> throw new AnalysisException(s"Cannot persistent ${table.qualifiedName} into 
> hive metastore " +
>  s"as table property keys may not start with '$SPARK_SQL_PREFIX': " +
>  invalidKeys.mkString("[", ", ", "]")){code}
> to
> {code:java}
> throw new AnalysisException(s"Cannot persist ${table.qualifiedName} into Hive 
> metastore " +
>  s"as table property keys may not start with '$SPARK_SQL_PREFIX': " +
>  invalidKeys.mkString("[", ", ", "]")){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26643) Spark Hive throw an AnalysisException,when set table properties.But this AnalysisException contains one typo and one unsuited word.

2019-02-20 Thread jiaan.geng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-26643:
---
Description: 
When I execute a DDL statement in spark-sql, it throws an AnalysisException as follows:
{code:java}
spark-sql> ALTER TABLE gja_test3 SET TBLPROPERTIES ('test' = 'test');
org.apache.spark.sql.AnalysisException: Cannot persistent work.gja_test3 into 
hive metastore as table property keys may not start with 'spark.sql.': 
[spark.sql.partitionProvider];
at 
org.apache.spark.sql.hive.HiveExternalCatalog.verifyTableProperties(HiveExternalCatalog.scala:129){code}
I found that the message of this exception contains one typo ("persistent") and 
one unsuited word ("hive").

So I think this analysis exception should change from
{code:java}
throw new AnalysisException(s"Cannot persistent ${table.qualifiedName} into 
hive metastore " +
 s"as table property keys may not start with '$SPARK_SQL_PREFIX': " +
 invalidKeys.mkString("[", ", ", "]")){code}
to
{code:java}
throw new AnalysisException(s"Cannot persist ${table.qualifiedName} into Hive 
metastore " +
 s"as table property keys may not start with '$SPARK_SQL_PREFIX': " +
 invalidKeys.mkString("[", ", ", "]")){code}

  was:
When I execute a DDL statement in spark-sql, it throws an AnalysisException as follows:
{code:java}
spark-sql> ALTER TABLE gja_test3 SET TBLPROPERTIES ('test' = 'test');
org.apache.spark.sql.AnalysisException: Cannot persistent work.gja_test3 into 
hive metastore as table property keys may not start with 'spark.sql.': 
[spark.sql.partitionProvider];
at 
org.apache.spark.sql.hive.HiveExternalCatalog.verifyTableProperties(HiveExternalCatalog.scala:129){code}
I found that the message of this exception contains two typos.

One is "persistent".

What is the function of the method named verifyTableProperties? I checked the 
comment of this method; it contains:
{code:java}
/**
* If the given table properties contains datasource properties, throw an 
exception. We will do
* this check when create or alter a table, i.e. when we try to write table 
metadata to Hive
* metastore.
*/
{code}
So I think this analysis exception should change from
{code:java}
throw new AnalysisException(s"Cannot persistent ${table.qualifiedName} into 
hive metastore " +
 s"as table property keys may not start with '$SPARK_SQL_PREFIX': " +
 invalidKeys.mkString("[", ", ", "]")){code}
to
{code:java}
throw new AnalysisException(s"Cannot persist ${table.qualifiedName} into Hive 
metastore " +
 s"as table property keys may start with '$SPARK_SQL_PREFIX': " +
 invalidKeys.mkString("[", ", ", "]")){code}


> Spark Hive throw an AnalysisException,when set table properties.But this 
> AnalysisException contains one typo and one unsuited word.
> ---
>
> Key: SPARK-26643
> URL: https://issues.apache.org/jira/browse/SPARK-26643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: jiaan.geng
>Priority: Minor
>
> When I execute a DDL statement in spark-sql, it throws an AnalysisException as follows:
> {code:java}
> spark-sql> ALTER TABLE gja_test3 SET TBLPROPERTIES ('test' = 'test');
> org.apache.spark.sql.AnalysisException: Cannot persistent work.gja_test3 into 
> hive metastore as table property keys may not start with 'spark.sql.': 
> [spark.sql.partitionProvider];
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.verifyTableProperties(HiveExternalCatalog.scala:129){code}
> I found that the message of this exception contains one typo ("persistent") and 
> one unsuited word ("hive").
> So I think this analysis exception should change from
> {code:java}
> throw new AnalysisException(s"Cannot persistent ${table.qualifiedName} into 
> hive metastore " +
>  s"as table property keys may not start with '$SPARK_SQL_PREFIX': " +
>  invalidKeys.mkString("[", ", ", "]")){code}
> to
> {code:java}
> throw new AnalysisException(s"Cannot persist ${table.qualifiedName} into Hive 
> metastore " +
>  s"as table property keys may not start with '$SPARK_SQL_PREFIX': " +
>  invalidKeys.mkString("[", ", ", "]")){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26643) Spark Hive throw an AnalysisException,when set table properties.But this AnalysisException contains one typo and one unsuited word.

2019-02-20 Thread jiaan.geng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-26643:
---
Summary: Spark Hive throw an AnalysisException,when set table 
properties.But this AnalysisException contains one typo and one unsuited word.  
(was: Spark Hive throw an AnalysisException,when set table properties.But this 
AnalysisException contains two typo.)

> Spark Hive throw an AnalysisException,when set table properties.But this 
> AnalysisException contains one typo and one unsuited word.
> ---
>
> Key: SPARK-26643
> URL: https://issues.apache.org/jira/browse/SPARK-26643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: jiaan.geng
>Priority: Minor
>
> When I execute a DDL statement in spark-sql, it throws an AnalysisException as follows:
> {code:java}
> spark-sql> ALTER TABLE gja_test3 SET TBLPROPERTIES ('test' = 'test');
> org.apache.spark.sql.AnalysisException: Cannot persistent work.gja_test3 into 
> hive metastore as table property keys may not start with 'spark.sql.': 
> [spark.sql.partitionProvider];
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.verifyTableProperties(HiveExternalCatalog.scala:129){code}
> I found that the message of this exception contains two typos.
> One is "persistent".
> What is the function of the method named verifyTableProperties? I checked the 
> comment of this method; it contains:
> {code:java}
> /**
> * If the given table properties contains datasource properties, throw an 
> exception. We will do
> * this check when create or alter a table, i.e. when we try to write table 
> metadata to Hive
> * metastore.
> */
> {code}
> So I think this analysis exception should change from
> {code:java}
> throw new AnalysisException(s"Cannot persistent ${table.qualifiedName} into 
> hive metastore " +
>  s"as table property keys may not start with '$SPARK_SQL_PREFIX': " +
>  invalidKeys.mkString("[", ", ", "]")){code}
> to
> {code:java}
> throw new AnalysisException(s"Cannot persist ${table.qualifiedName} into Hive 
> metastore " +
>  s"as table property keys may start with '$SPARK_SQL_PREFIX': " +
>  invalidKeys.mkString("[", ", ", "]")){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26643) Spark Hive throw an AnalysisException,when set table properties.But this AnalysisException contains two typo.

2019-02-20 Thread jiaan.geng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-26643:
---
Description: 
When I execute a DDL statement in spark-sql, it throws an AnalysisException as follows:
{code:java}
spark-sql> ALTER TABLE gja_test3 SET TBLPROPERTIES ('test' = 'test');
org.apache.spark.sql.AnalysisException: Cannot persistent work.gja_test3 into 
hive metastore as table property keys may not start with 'spark.sql.': 
[spark.sql.partitionProvider];
at 
org.apache.spark.sql.hive.HiveExternalCatalog.verifyTableProperties(HiveExternalCatalog.scala:129){code}
I found that the message of this exception contains two typos.

One is "persistent".

What is the function of the method named verifyTableProperties? I checked the 
comment of this method; it contains:
{code:java}
/**
* If the given table properties contains datasource properties, throw an 
exception. We will do
* this check when create or alter a table, i.e. when we try to write table 
metadata to Hive
* metastore.
*/
{code}
So I think this analysis exception should change from
{code:java}
throw new AnalysisException(s"Cannot persistent ${table.qualifiedName} into 
hive metastore " +
 s"as table property keys may not start with '$SPARK_SQL_PREFIX': " +
 invalidKeys.mkString("[", ", ", "]")){code}
to
{code:java}
throw new AnalysisException(s"Cannot persist ${table.qualifiedName} into Hive 
metastore " +
 s"as table property keys may start with '$SPARK_SQL_PREFIX': " +
 invalidKeys.mkString("[", ", ", "]")){code}

  was:
When I execute a DDL statement in spark-sql, it throws an AnalysisException as follows:
{code:java}
spark-sql> ALTER TABLE gja_test3 SET TBLPROPERTIES ('test' = 'test');
org.apache.spark.sql.AnalysisException: Cannot persistent work.gja_test3 into 
hive metastore as table property keys may not start with 'spark.sql.': 
[spark.sql.partitionProvider];
at 
org.apache.spark.sql.hive.HiveExternalCatalog.verifyTableProperties(HiveExternalCatalog.scala:129){code}
I found that the message of this exception is not correct.

What is the function of the method named verifyTableProperties? I checked the 
comment of this method; it contains:
{code:java}
/**
* If the given table properties contains datasource properties, throw an 
exception. We will do
* this check when create or alter a table, i.e. when we try to write table 
metadata to Hive
* metastore.
*/
{code}
So I think this analysis exception should change from
{code:java}
throw new AnalysisException(s"Cannot persistent ${table.qualifiedName} into 
hive metastore " +
 s"as table property keys may not start with '$SPARK_SQL_PREFIX': " +
 invalidKeys.mkString("[", ", ", "]")){code}
to
{code:java}
throw new AnalysisException(s"Cannot persist ${table.qualifiedName} into Hive 
metastore " +
 s"as table property keys may start with '$SPARK_SQL_PREFIX': " +
 invalidKeys.mkString("[", ", ", "]")){code}


> Spark Hive throw an AnalysisException,when set table properties.But this 
> AnalysisException contains two typo.
> -
>
> Key: SPARK-26643
> URL: https://issues.apache.org/jira/browse/SPARK-26643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: jiaan.geng
>Priority: Minor
>
> When I execute a DDL statement in spark-sql, it throws an AnalysisException as follows:
> {code:java}
> spark-sql> ALTER TABLE gja_test3 SET TBLPROPERTIES ('test' = 'test');
> org.apache.spark.sql.AnalysisException: Cannot persistent work.gja_test3 into 
> hive metastore as table property keys may not start with 'spark.sql.': 
> [spark.sql.partitionProvider];
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.verifyTableProperties(HiveExternalCatalog.scala:129){code}
> I found that the message of this exception contains two typos.
> One is "persistent".
> What is the function of the method named verifyTableProperties? I checked the 
> comment of this method; it contains:
> {code:java}
> /**
> * If the given table properties contains datasource properties, throw an 
> exception. We will do
> * this check when create or alter a table, i.e. when we try to write table 
> metadata to Hive
> * metastore.
> */
> {code}
> So I think this analysis exception should change from
> {code:java}
> throw new AnalysisException(s"Cannot persistent ${table.qualifiedName} into 
> hive metastore " +
>  s"as table property keys may not start with '$SPARK_SQL_PREFIX': " +
>  invalidKeys.mkString("[", ", ", "]")){code}
> to
> {code:java}
> throw new AnalysisException(s"Cannot persist ${table.qualifiedName} into Hive 
> metastore " +
>  s"as table property keys may start with '$SPARK_SQL_PREFIX': " +
>  invalidKeys.mkString("[", ", ", "]")){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SPARK-26643) Spark Hive throw an AnalysisException,when set table properties.But this AnalysisException contains two typo.

2019-02-20 Thread jiaan.geng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-26643:
---
Summary: Spark Hive throw an AnalysisException,when set table 
properties.But this AnalysisException contains two typo.  (was: Spark Hive 
throw an analysis exception,when set table properties.But this )

> Spark Hive throw an AnalysisException,when set table properties.But this 
> AnalysisException contains two typo.
> -
>
> Key: SPARK-26643
> URL: https://issues.apache.org/jira/browse/SPARK-26643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: jiaan.geng
>Priority: Minor
>
> When I execute a DDL statement in spark-sql, it throws an AnalysisException as follows:
> {code:java}
> spark-sql> ALTER TABLE gja_test3 SET TBLPROPERTIES ('test' = 'test');
> org.apache.spark.sql.AnalysisException: Cannot persistent work.gja_test3 into 
> hive metastore as table property keys may not start with 'spark.sql.': 
> [spark.sql.partitionProvider];
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.verifyTableProperties(HiveExternalCatalog.scala:129){code}
> I found that the message of this exception is not correct.
> What is the function of the method named verifyTableProperties? I checked the 
> comment of this method; it contains:
> {code:java}
> /**
> * If the given table properties contains datasource properties, throw an 
> exception. We will do
> * this check when create or alter a table, i.e. when we try to write table 
> metadata to Hive
> * metastore.
> */
> {code}
> So I think this analysis exception should change from
> {code:java}
> throw new AnalysisException(s"Cannot persistent ${table.qualifiedName} into 
> hive metastore " +
>  s"as table property keys may not start with '$SPARK_SQL_PREFIX': " +
>  invalidKeys.mkString("[", ", ", "]")){code}
> to
> {code:java}
> throw new AnalysisException(s"Cannot persist ${table.qualifiedName} into Hive 
> metastore " +
>  s"as table property keys may start with '$SPARK_SQL_PREFIX': " +
>  invalidKeys.mkString("[", ", ", "]")){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26643) Spark Hive throw an analysis exception,when set table properties.But this

2019-02-20 Thread jiaan.geng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-26643:
---
Summary: Spark Hive throw an analysis exception,when set table 
properties.But this   (was: Spark hive throwing an incorrect analysis 
exception,when set table properties.)

> Spark Hive throw an analysis exception,when set table properties.But this 
> --
>
> Key: SPARK-26643
> URL: https://issues.apache.org/jira/browse/SPARK-26643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: jiaan.geng
>Priority: Minor
>
> When I execute a DDL statement in spark-sql, it throws an AnalysisException as follows:
> {code:java}
> spark-sql> ALTER TABLE gja_test3 SET TBLPROPERTIES ('test' = 'test');
> org.apache.spark.sql.AnalysisException: Cannot persistent work.gja_test3 into 
> hive metastore as table property keys may not start with 'spark.sql.': 
> [spark.sql.partitionProvider];
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.verifyTableProperties(HiveExternalCatalog.scala:129){code}
> I found that the message of this exception is not correct.
> What is the function of the method named verifyTableProperties? I checked the 
> comment of this method; it contains:
> {code:java}
> /**
> * If the given table properties contains datasource properties, throw an 
> exception. We will do
> * this check when create or alter a table, i.e. when we try to write table 
> metadata to Hive
> * metastore.
> */
> {code}
> So I think this analysis exception should change from
> {code:java}
> throw new AnalysisException(s"Cannot persistent ${table.qualifiedName} into 
> hive metastore " +
>  s"as table property keys may not start with '$SPARK_SQL_PREFIX': " +
>  invalidKeys.mkString("[", ", ", "]")){code}
> to
> {code:java}
> throw new AnalysisException(s"Cannot persist ${table.qualifiedName} into Hive 
> metastore " +
>  s"as table property keys may start with '$SPARK_SQL_PREFIX': " +
>  invalidKeys.mkString("[", ", ", "]")){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24935) Problem with Executing Hive UDF's from Spark 2.2 Onwards

2019-02-20 Thread gavin hu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773574#comment-16773574
 ] 

gavin hu commented on SPARK-24935:
--

Hi [~pgandhi]  "A user of sketches library..." That's me! I'm so excited that 
the issue I reported a couple of months back has been fixed here. Kudos to you for 
the great work!

As you may know, many users including me are moving towards Spark 2.4.1; getting 
the fix there will definitely benefit a lot of people. Looking forward to 
that.

 

> Problem with Executing Hive UDF's from Spark 2.2 Onwards
> 
>
> Key: SPARK-24935
> URL: https://issues.apache.org/jira/browse/SPARK-24935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: Parth Gandhi
>Priority: Major
>
> A user of sketches library(https://github.com/DataSketches/sketches-hive) 
> reported an issue with HLL Sketch Hive UDAF that seems to be a bug in Spark 
> or Hive. Their code runs fine in 2.1 but has an issue from 2.2 onwards. For 
> more details on the issue, you can refer to the discussion in the 
> sketches-user list:
> [https://groups.google.com/forum/?utm_medium=email_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ]
>  
> On further debugging, we figured out that from 2.2 onwards, Spark's Hive UDAF 
> support provides partial aggregation and has removed the functionality 
> that supported complete-mode aggregation (refer to 
> https://issues.apache.org/jira/browse/SPARK-19060 and 
> https://issues.apache.org/jira/browse/SPARK-18186). Thus, instead of the 
> update method being called as expected, the merge method is called here 
> ([https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56)],
>  which throws the exception described in the forums above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26946) Identifiers for multi-catalog Spark

2019-02-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773556#comment-16773556
 ] 

Apache Spark commented on SPARK-26946:
--

User 'jzhuge' has created a pull request for this issue:
https://github.com/apache/spark/pull/23848

> Identifiers for multi-catalog Spark
> ---
>
> Key: SPARK-26946
> URL: https://issues.apache.org/jira/browse/SPARK-26946
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: John Zhuge
>Priority: Major
>
> Propose semantics for identifiers and a listing API to support multiple 
> catalogs.
> [~rdblue]'s SPIP: 
> [https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26946) Identifiers for multi-catalog Spark

2019-02-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26946:


Assignee: (was: Apache Spark)

> Identifiers for multi-catalog Spark
> ---
>
> Key: SPARK-26946
> URL: https://issues.apache.org/jira/browse/SPARK-26946
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: John Zhuge
>Priority: Major
>
> Propose semantics for identifiers and a listing API to support multiple 
> catalogs.
> [~rdblue]'s SPIP: 
> [https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26946) Identifiers for multi-catalog Spark

2019-02-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26946:


Assignee: Apache Spark

> Identifiers for multi-catalog Spark
> ---
>
> Key: SPARK-26946
> URL: https://issues.apache.org/jira/browse/SPARK-26946
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: John Zhuge
>Assignee: Apache Spark
>Priority: Major
>
> Propose semantics for identifiers and a listing API to support multiple 
> catalogs.
> [~rdblue]'s SPIP: 
> [https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26946) Identifiers for multi-catalog Spark

2019-02-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773553#comment-16773553
 ] 

Apache Spark commented on SPARK-26946:
--

User 'jzhuge' has created a pull request for this issue:
https://github.com/apache/spark/pull/23848

> Identifiers for multi-catalog Spark
> ---
>
> Key: SPARK-26946
> URL: https://issues.apache.org/jira/browse/SPARK-26946
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: John Zhuge
>Priority: Major
>
> Propose semantics for identifiers and a listing API to support multiple 
> catalogs.
> [~rdblue]'s SPIP: 
> [https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26948) vertex and edge rowkey upgrade and support multiple types?

2019-02-20 Thread daile (JIRA)
daile created SPARK-26948:
-

 Summary: vertex and edge rowkey upgrade and support multiple types?
 Key: SPARK-26948
 URL: https://issues.apache.org/jira/browse/SPARK-26948
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 2.4.0
Reporter: daile


Currently only Long is supported as the vertex id type, but most graph databases use 
strings as primary keys.
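
For context, GraphX's VertexId is a type alias for Long, so string keys currently 
have to be mapped onto Long ids by the application. A rough sketch of the usual 
workaround (my assumption, not part of this proposal) is below; hashes can collide, 
which is exactly why native support for other key types would help.

{code}
import org.apache.spark.graphx.VertexId
import scala.util.hashing.MurmurHash3

// Derive a 64-bit vertex id from a string key by combining two 32-bit hashes.
// The original string key should still be kept as a vertex attribute.
def toVertexId(key: String): VertexId = {
  val hi = MurmurHash3.stringHash(key).toLong << 32
  val lo = MurmurHash3.stringHash(key.reverse).toLong & 0xFFFFFFFFL
  hi | lo
}
{code}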



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24935) Problem with Executing Hive UDF's from Spark 2.2 Onwards

2019-02-20 Thread Parth Gandhi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773541#comment-16773541
 ] 

Parth Gandhi commented on SPARK-24935:
--

Hi [~zanderl], thank you for your comment. Will do my best to work on the 
reviews, rest depends on the reviewers:) 

> Problem with Executing Hive UDF's from Spark 2.2 Onwards
> 
>
> Key: SPARK-24935
> URL: https://issues.apache.org/jira/browse/SPARK-24935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: Parth Gandhi
>Priority: Major
>
> A user of sketches library(https://github.com/DataSketches/sketches-hive) 
> reported an issue with HLL Sketch Hive UDAF that seems to be a bug in Spark 
> or Hive. Their code runs fine in 2.1 but has an issue from 2.2 onwards. For 
> more details on the issue, you can refer to the discussion in the 
> sketches-user list:
> [https://groups.google.com/forum/?utm_medium=email_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ]
>  
> On further debugging, we figured out that from 2.2 onwards, Spark's Hive UDAF 
> support provides partial aggregation and has removed the functionality 
> that supported complete-mode aggregation (refer to 
> https://issues.apache.org/jira/browse/SPARK-19060 and 
> https://issues.apache.org/jira/browse/SPARK-18186). Thus, instead of the 
> update method being called as expected, the merge method is called here 
> ([https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56)],
>  which throws the exception described in the forums above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26946) Identifiers for multi-catalog Spark

2019-02-20 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-26946:
---
Component/s: (was: Spark Core)

> Identifiers for multi-catalog Spark
> ---
>
> Key: SPARK-26946
> URL: https://issues.apache.org/jira/browse/SPARK-26946
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: John Zhuge
>Priority: Major
>
> Propose semantics for identifiers and a listing API to support multiple 
> catalogs.
> [~rdblue]'s SPIP: 
> [https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24935) Problem with Executing Hive UDF's from Spark 2.2 Onwards

2019-02-20 Thread Zander Lichstein (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773533#comment-16773533
 ] 

Zander Lichstein commented on SPARK-24935:
--

Glad to see this has been fixed!  What are the chances this will get into 2.4.1?

> Problem with Executing Hive UDF's from Spark 2.2 Onwards
> 
>
> Key: SPARK-24935
> URL: https://issues.apache.org/jira/browse/SPARK-24935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: Parth Gandhi
>Priority: Major
>
> A user of sketches library(https://github.com/DataSketches/sketches-hive) 
> reported an issue with HLL Sketch Hive UDAF that seems to be a bug in Spark 
> or Hive. Their code runs fine in 2.1 but has an issue from 2.2 onwards. For 
> more details on the issue, you can refer to the discussion in the 
> sketches-user list:
> [https://groups.google.com/forum/?utm_medium=email_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ]
>  
> On further debugging, we figured out that from 2.2 onwards, Spark's Hive UDAF 
> support provides partial aggregation and has removed the functionality 
> that supported complete-mode aggregation (refer to 
> https://issues.apache.org/jira/browse/SPARK-19060 and 
> https://issues.apache.org/jira/browse/SPARK-18186). Thus, instead of the 
> update method being called as expected, the merge method is called here 
> ([https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56)],
>  which throws the exception described in the forums above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26824) Streaming queries may store checkpoint data in a wrong directory

2019-02-20 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-26824.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

> Streaming queries may store checkpoint data in a wrong directory
> 
>
> Key: SPARK-26824
> URL: https://issues.apache.org/jira/browse/SPARK-26824
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> When a user specifies a checkpoint location containing special characters that 
> need to be escaped in a path, the streaming query will store its checkpoint in the 
> wrong place. For example, if you use "/chk chk", the metadata will be stored 
> in "/chk%20chk". The file sink's "_spark_metadata" directory has the same issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k

2019-02-20 Thread Parth Gandhi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773457#comment-16773457
 ] 

Parth Gandhi edited comment on SPARK-26947 at 2/20/19 10:43 PM:


I am unable to attach the dummy dataset as the size of the data (90 MB) exceeds 
the maximum allowed size of 60 MB. I have attached it to Drive instead.

 

https://drive.google.com/file/d/1GlHQmwFD2VB9PUi5mDaXdZNXI50dnPYs/view?usp=sharing


was (Author: pgandhi):
I am unable to attach the dummy dataset as the size of the data (90 MB) exceeds 
the maximum allowed size of 60 MB.

> Pyspark KMeans Clustering job fails on large values of k
> 
>
> Key: SPARK-26947
> URL: https://issues.apache.org/jira/browse/SPARK-26947
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
> Attachments: clustering_app.py
>
>
> We recently had a case where a user's PySpark job running KMeans clustering 
> was failing for large values of k. I was able to reproduce the same issue 
> with a dummy dataset. I have attached the code as well as the data to the JIRA. 
> The Java stack trace is printed below:
>  
> {code:java}
> Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:3332)
>   at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
>   at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
>   at java.lang.StringBuilder.append(StringBuilder.java:202)
>   at py4j.Protocol.getOutputCommand(Protocol.java:328)
>   at py4j.commands.CallCommand.execute(CallCommand.java:81)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> Python:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 985, in send_command
> response = connection.send_command(command)
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> Traceback (most recent call last):
>   File "clustering_app.py", line 154, in <module>
> main(args)
>   File "clustering_app.py", line 145, in main
> run_clustering(sc, args.input_path, args.output_path, 
> args.num_clusters_list)
>   File "clustering_app.py", line 136, in run_clustering
> clustersTable, cluster_Centers = clustering(sc, documents, output_path, 
> k, max_iter)
>   File "clustering_app.py", line 68, in clustering
> cluster_Centers = km_model.clusterCenters()
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py",
>  line 337, in clusterCenters
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py",
>  line 55, in _call_java
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py",
>  line 109, in _java2py
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 336, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling 
> z:org.apache.spark.ml.python.MLSerDe.dumps
> {code}
> The command with which the application was launched is given below:
> {code:java}
> $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf 
> spark.executor.memory=20g --conf spark.driver.memory=20g --conf 
> spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf 
> spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g 
> ~/clustering_app.py --input_path hdfs:///user/username/part-v001x 
> --output_path hdfs:///user/username 

[jira] [Assigned] (SPARK-26892) saveAsTextFile throws NullPointerException when null row present

2019-02-20 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26892:
-

Assignee: liupengcheng

> saveAsTextFile throws NullPointerException  when null row present 
> --
>
> Key: SPARK-26892
> URL: https://issues.apache.org/jira/browse/SPARK-26892
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: liupengcheng
>Assignee: liupengcheng
>Priority: Major
>
> We encountered this problem in our production cluster; it can be reproduced 
> with the following code:
> {code:java}
> scala> sc.parallelize(Seq(1,null),1).saveAsTextFile("/tmp/foobar.dat")
> 19/02/15 21:39:17 ERROR Utils: Aborting task
> java.lang.NullPointerException
> at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$3(RDD.scala:1510)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> at 
> org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$executeTask$1(SparkHadoopWriter.scala:129)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1352)
> at 
> org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:127)
> at 
> org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:83)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:425)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1318)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:428)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
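
For reference, one possible user-level workaround sketch (not the actual fix) 
is to map nulls to a printable value before calling saveAsTextFile:
{code:java}
// Workaround sketch only; assumes the RDD may contain nulls.
sc.parallelize(Seq[Any](1, null), 1)
  .map(x => if (x == null) "null" else x.toString)
  .saveAsTextFile("/tmp/foobar_no_npe.dat")
{code}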



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26892) saveAsTextFile throws NullPointerException when null row present

2019-02-20 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26892.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23799
[https://github.com/apache/spark/pull/23799]

> saveAsTextFile throws NullPointerException  when null row present 
> --
>
> Key: SPARK-26892
> URL: https://issues.apache.org/jira/browse/SPARK-26892
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: liupengcheng
>Assignee: liupengcheng
>Priority: Major
> Fix For: 3.0.0
>
>
> We encountered this problem in our production cluster; it can be reproduced 
> with the following code:
> {code:java}
> scala> sc.parallelize(Seq(1,null),1).saveAsTextFile("/tmp/foobar.dat")
> 19/02/15 21:39:17 ERROR Utils: Aborting task
> java.lang.NullPointerException
> at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$3(RDD.scala:1510)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
> at 
> org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$executeTask$1(SparkHadoopWriter.scala:129)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1352)
> at 
> org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:127)
> at 
> org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:83)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:425)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1318)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:428)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k

2019-02-20 Thread Parth Gandhi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773457#comment-16773457
 ] 

Parth Gandhi commented on SPARK-26947:
--

I am unable to attach the dummy dataset, as the size of the data (90 MB) exceeds 
the maximum allowed size of 60 MB.

> Pyspark KMeans Clustering job fails on large values of k
> 
>
> Key: SPARK-26947
> URL: https://issues.apache.org/jira/browse/SPARK-26947
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
> Attachments: clustering_app.py
>
>
> We recently had a case where a user's PySpark job running KMeans clustering 
> was failing for large values of k. I was able to reproduce the same issue 
> with a dummy dataset. I have attached the code as well as the data to the 
> JIRA. The Java stack trace is printed below:
>  
> {code:java}
> Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:3332)
>   at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
>   at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
>   at java.lang.StringBuilder.append(StringBuilder.java:202)
>   at py4j.Protocol.getOutputCommand(Protocol.java:328)
>   at py4j.commands.CallCommand.execute(CallCommand.java:81)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> Python:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 985, in send_command
> response = connection.send_command(command)
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> Traceback (most recent call last):
>   File "clustering_app.py", line 154, in <module>
> main(args)
>   File "clustering_app.py", line 145, in main
> run_clustering(sc, args.input_path, args.output_path, 
> args.num_clusters_list)
>   File "clustering_app.py", line 136, in run_clustering
> clustersTable, cluster_Centers = clustering(sc, documents, output_path, 
> k, max_iter)
>   File "clustering_app.py", line 68, in clustering
> cluster_Centers = km_model.clusterCenters()
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py",
>  line 337, in clusterCenters
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py",
>  line 55, in _call_java
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py",
>  line 109, in _java2py
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 336, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling 
> z:org.apache.spark.ml.python.MLSerDe.dumps
> {code}
> The command with which the application was launched is given below:
> {code:java}
> $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf 
> spark.executor.memory=20g --conf spark.driver.memory=20g --conf 
> spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf 
> spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g 
> ~/clustering_app.py --input_path hdfs:///user/username/part-v001x 
> --output_path hdfs:///user/username --num_clusters_list 1
> {code}
> The input dataset is approximately 90 MB in size and the assigned heap memory 
> to both driver and executor is close to 20 GB. This only happens for large 
> values of k.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k

2019-02-20 Thread Parth Gandhi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Parth Gandhi updated SPARK-26947:
-
Attachment: clustering_app.py

> Pyspark KMeans Clustering job fails on large values of k
> 
>
> Key: SPARK-26947
> URL: https://issues.apache.org/jira/browse/SPARK-26947
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
> Attachments: clustering_app.py
>
>
> We recently had a case where a user's PySpark job running KMeans clustering 
> was failing for large values of k. I was able to reproduce the same issue 
> with a dummy dataset. I have attached the code as well as the data to the 
> JIRA. The Java stack trace is printed below:
>  
> {code:java}
> Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:3332)
>   at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
>   at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
>   at java.lang.StringBuilder.append(StringBuilder.java:202)
>   at py4j.Protocol.getOutputCommand(Protocol.java:328)
>   at py4j.commands.CallCommand.execute(CallCommand.java:81)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> Python:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 985, in send_command
> response = connection.send_command(command)
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> Traceback (most recent call last):
>   File "clustering_app.py", line 154, in <module>
> main(args)
>   File "clustering_app.py", line 145, in main
> run_clustering(sc, args.input_path, args.output_path, 
> args.num_clusters_list)
>   File "clustering_app.py", line 136, in run_clustering
> clustersTable, cluster_Centers = clustering(sc, documents, output_path, 
> k, max_iter)
>   File "clustering_app.py", line 68, in clustering
> cluster_Centers = km_model.clusterCenters()
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py",
>  line 337, in clusterCenters
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py",
>  line 55, in _call_java
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py",
>  line 109, in _java2py
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 336, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling 
> z:org.apache.spark.ml.python.MLSerDe.dumps
> {code}
> The command with which the application was launched is given below:
> {code:java}
> $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf 
> spark.executor.memory=20g --conf spark.driver.memory=20g --conf 
> spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf 
> spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g 
> ~/clustering_app.py --input_path hdfs:///user/username/part-v001x 
> --output_path hdfs:///user/username --num_clusters_list 1
> {code}
> The input dataset is approximately 90 MB in size and the assigned heap memory 
> to both driver and executor is close to 20 GB. This only happens for large 
> values of k.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k

2019-02-20 Thread Parth Gandhi (JIRA)
Parth Gandhi created SPARK-26947:


 Summary: Pyspark KMeans Clustering job fails on large values of k
 Key: SPARK-26947
 URL: https://issues.apache.org/jira/browse/SPARK-26947
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib, PySpark
Affects Versions: 2.4.0
Reporter: Parth Gandhi


We recently had a case where a user's PySpark job running KMeans clustering was 
failing for large values of k. I was able to reproduce the same issue with a 
dummy dataset. I have attached the code as well as the data to the JIRA. The 
Java stack trace is printed below:

 
{code:java}
Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at 
java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at 
java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
at java.lang.StringBuilder.append(StringBuilder.java:202)
at py4j.Protocol.getOutputCommand(Protocol.java:328)
at py4j.commands.CallCommand.execute(CallCommand.java:81)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
{code}
Python:
{code:java}
Traceback (most recent call last):
  File 
"/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
 line 1159, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File 
"/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
 line 985, in send_command
response = connection.send_command(command)
  File 
"/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
 line 1164, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Traceback (most recent call last):
  File "clustering_app.py", line 154, in <module>
main(args)
  File "clustering_app.py", line 145, in main
run_clustering(sc, args.input_path, args.output_path, 
args.num_clusters_list)
  File "clustering_app.py", line 136, in run_clustering
clustersTable, cluster_Centers = clustering(sc, documents, output_path, k, 
max_iter)
  File "clustering_app.py", line 68, in clustering
cluster_Centers = km_model.clusterCenters()
  File 
"/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py",
 line 337, in clusterCenters
  File 
"/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py",
 line 55, in _call_java
  File 
"/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py",
 line 109, in _java2py
  File 
"/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
 line 1257, in __call__
  File 
"/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py",
 line 63, in deco
  File 
"/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py",
 line 336, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling 
z:org.apache.spark.ml.python.MLSerDe.dumps
{code}
The command with which the application was launched is given below:
{code:java}
$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf 
spark.executor.memory=20g --conf spark.driver.memory=20g --conf 
spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf 
spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g 
~/clustering_app.py --input_path hdfs:///user/username/part-v001x --output_path 
hdfs:///user/username --num_clusters_list 1
{code}
The input dataset is approximately 90 MB in size and the assigned heap memory 
to both driver and executor is close to 20 GB. This only happens for large 
values of k.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26946) Identifiers for multi-catalog Spark

2019-02-20 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-26946:
---
Description: 
Propose semantics for identifiers and a listing API to support multiple 
catalogs.

[~rdblue]'s SPIP: 
[https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing]

  was:
Propose semantics for identifiers and a listing API to support multiple 
catalogs.

Ryan's SPIP: 
[https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing]


> Identifiers for multi-catalog Spark
> ---
>
> Key: SPARK-26946
> URL: https://issues.apache.org/jira/browse/SPARK-26946
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 2.3.2
>Reporter: John Zhuge
>Priority: Major
>
> Propose semantics for identifiers and a listing API to support multiple 
> catalogs.
> [~rdblue]'s SPIP: 
> [https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26858) Vectorized gapplyCollect, Arrow optimization in native R function execution

2019-02-20 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773381#comment-16773381
 ] 

Hyukjin Kwon commented on SPARK-26858:
--

Oh, I see. Sorry, there was a misunderstanding. I think you're right.

All of these approaches differ from the Python side, and the gapplyCollect 
workaround is pretty easy.

So I was wondering whether we need to fix it for now. What I tried is approach 
1; I haven't tried the other approaches.

Approach 1 looked like the most minimal and easy (but hacky) way. For 2 and 3, 
I am not sure, because they look quite different from the Python side and I 
guess they would need considerable code.

> Vectorized gapplyCollect, Arrow optimization in native R function execution
> ---
>
> Key: SPARK-26858
> URL: https://issues.apache.org/jira/browse/SPARK-26858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Unlike gapply, gapplyCollect requires additional ser/de steps because it can 
> omit the schema, and Spark SQL doesn't know the return type before actually 
> execution happens.
> In original code path, it's done via using binary schema. Once gapply is done 
> (SPARK-26761). we can mimic this approach in vectorized gapply to support 
> gapplyCollect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26877) Support user-level app staging directory in yarn mode when spark.yarn.stagingDir specified

2019-02-20 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26877.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23786
[https://github.com/apache/spark/pull/23786]

> Support user-level app staging directory in yarn mode when 
> spark.yarn.stagingDir specified
> --
>
> Key: SPARK-26877
> URL: https://issues.apache.org/jira/browse/SPARK-26877
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: liupengcheng
>Assignee: liupengcheng
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently, when running applications in yarn mode, the app staging directory 
> is controlled by the `spark.yarn.stagingDir` config if specified, and this 
> directory does not separate different users, which is sometimes inconvenient 
> for file and quota management.
> For example, users may have different quotas for their own app staging dir, 
> and they may also not want to use the home directory as the staging dir, 
> because it might be used mainly for private user files or data. The quota may 
> also differ between private data and the app staging dir.
> So I propose to add per-user subdirectories under this app staging dir.
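
An illustration of the proposed layout (the paths and the exact subdirectory 
naming are assumptions for illustration, not taken from the patch):
{code:java}
// Sketch only: how a shared staging root could be split per user.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.yarn.stagingDir", "hdfs:///shared/staging")
// current layout:  hdfs:///shared/staging/.sparkStaging/application_<id>
// proposed layout: hdfs:///shared/staging/<user>/.sparkStaging/application_<id>
{code}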



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26877) Support user-level app staging directory in yarn mode when spark.yarn.stagingDir specified

2019-02-20 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26877:
--

Assignee: liupengcheng

> Support user-level app staging directory in yarn mode when 
> spark.yarn.stagingDir specified
> --
>
> Key: SPARK-26877
> URL: https://issues.apache.org/jira/browse/SPARK-26877
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: liupengcheng
>Assignee: liupengcheng
>Priority: Minor
>
> Currently, when running applications in yarn mode, the app staging directory 
> is controlled by the `spark.yarn.stagingDir` config if specified, and this 
> directory does not separate different users, which is sometimes inconvenient 
> for file and quota management.
> For example, users may have different quotas for their own app staging dir, 
> and they may also not want to use the home directory as the staging dir, 
> because it might be used mainly for private user files or data. The quota may 
> also differ between private data and the app staging dir.
> So I propose to add per-user subdirectories under this app staging dir.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26729) Spark on Kubernetes tooling hardcodes default image names

2019-02-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26729:


Assignee: (was: Apache Spark)

> Spark on Kubernetes tooling hardcodes default image names
> -
>
> Key: SPARK-26729
> URL: https://issues.apache.org/jira/browse/SPARK-26729
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Rob Vesse
>Priority: Major
>
> Both when creating images with {{bin/docker-image-tool.sh}} and when running 
> the Kubernetes integration tests, the image names are hardcoded to {{spark}}, 
> {{spark-py}} and {{spark-r}}.
> If you are producing custom images in some other way (e.g. a CI/CD process 
> that doesn't use the script), or are required to use a different naming 
> convention due to company policy, e.g. prefixing with a vendor name (such as 
> {{apache-spark}}), then you can't directly create/test your images with the 
> desired names.
> You can of course simply re-tag the images with the desired names, but this 
> might not be possible in some CI/CD pipelines, especially if naming 
> conventions are being enforced at the registry level.
> It would be nice if the default image names were customisable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26729) Spark on Kubernetes tooling hardcodes default image names

2019-02-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26729:


Assignee: Apache Spark

> Spark on Kubernetes tooling hardcodes default image names
> -
>
> Key: SPARK-26729
> URL: https://issues.apache.org/jira/browse/SPARK-26729
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Rob Vesse
>Assignee: Apache Spark
>Priority: Major
>
> Both when creating images with {{bin/docker-image-tool.sh}} and when running 
> the Kubernetes integration tests, the image names are hardcoded to {{spark}}, 
> {{spark-py}} and {{spark-r}}.
> If you are producing custom images in some other way (e.g. a CI/CD process 
> that doesn't use the script), or are required to use a different naming 
> convention due to company policy, e.g. prefixing with a vendor name (such as 
> {{apache-spark}}), then you can't directly create/test your images with the 
> desired names.
> You can of course simply re-tag the images with the desired names, but this 
> might not be possible in some CI/CD pipelines, especially if naming 
> conventions are being enforced at the registry level.
> It would be nice if the default image names were customisable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26884) Let task acquire memory accurately when using spilled memory

2019-02-20 Thread Thincrs (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773286#comment-16773286
 ] 

Thincrs commented on SPARK-26884:
-

A user of thincrs has selected this issue. Deadline: Wed, Feb 27, 2019 7:01 PM

> Let task acquire memory accurately when using spilled memory
> 
>
> Key: SPARK-26884
> URL: https://issues.apache.org/jira/browse/SPARK-26884
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: SongYadong
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When a task can't get the required execution memory, it calls *spill()* on 
> its consumers to release more memory. After spilling, it tries again to 
> acquire the memory it still needs, which is *(required - got)*. That's not 
> accurate, because the memory actually *released* by spilling may not equal 
> the size it tried to spill.
> So it may be better to acquire memory in a more accurate size, as sketched 
> below:
> 1. when *released* >= needed, acquire the needed size (*required - got*)
> 2. when *released* < needed, acquire the released size.
>  
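
A small sketch of that acquisition logic (the variable names are assumed for 
illustration, not taken from the Spark code):
{code:java}
// Sketch of the proposed, more accurate re-acquisition after spilling.
def amountToAcquire(required: Long, got: Long, released: Long): Long = {
  val stillNeeded = required - got
  if (released >= stillNeeded) stillNeeded else released
}
{code}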



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26858) Vectorized gapplyCollect, Arrow optimization in native R function execution

2019-02-20 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773284#comment-16773284
 ] 

Bryan Cutler commented on SPARK-26858:
--

{quote}
(One other possibility I was thinking about batches without Schema is that we 
just send Arrow batch by Arrow batch, deserialize each batch to RecordBatch 
instance, and then construct an Arrow table, which is pretty different from 
Python side and hacky)
{quote}

The point from my previous comment is that you can't deserialize a RecordBatch 
and make a Table without a schema.

I think these options make the most sense:

1) Have the user define a schema beforehand, then you can just send serialized 
RecordBatches back to the driver.

2) Send a complete Arrow stream (schema + RecordBatches) from the executor, 
then merge streams in the driver JVM, discarding duplicate schemas, and send 
one final stream to R.

3) Same as (2), but instead of merging streams, send each separate stream 
through to the R driver, where they are read and concatenated into one Table. 
I'm not sure whether the Arrow R API supports this, though.

> Vectorized gapplyCollect, Arrow optimization in native R function execution
> ---
>
> Key: SPARK-26858
> URL: https://issues.apache.org/jira/browse/SPARK-26858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Unlike gapply, gapplyCollect requires additional ser/de steps because it can 
> omit the schema, and Spark SQL doesn't know the return type before actually 
> execution happens.
> In original code path, it's done via using binary schema. Once gapply is done 
> (SPARK-26761). we can mimic this approach in vectorized gapply to support 
> gapplyCollect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26946) Identifiers for multi-catalog Spark

2019-02-20 Thread John Zhuge (JIRA)
John Zhuge created SPARK-26946:
--

 Summary: Identifiers for multi-catalog Spark
 Key: SPARK-26946
 URL: https://issues.apache.org/jira/browse/SPARK-26946
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, SQL
Affects Versions: 2.3.2
Reporter: John Zhuge


Propose semantics for identifiers and a listing API to support multiple 
catalogs.

Ryan's SPIP: 
[https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22709) move config related infrastructure from Spark Core to a new module

2019-02-20 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773111#comment-16773111
 ] 

Wenchen Fan commented on SPARK-22709:
-

It's not about where we define configs; it's about whether we can use the 
config framework to define them. Currently the network module can't define 
configs and has to hardcode the config names, because it doesn't have the 
config framework.

> move config related infrastructure from Spark Core to a new module
> --
>
> Key: SPARK-22709
> URL: https://issues.apache.org/jira/browse/SPARK-22709
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Nowadays the config-related infrastructure is in the Spark Core module, and 
> we use this infrastructure to centralize Spark configs, e.g. 
> `org.apache.spark.internal.config`, SQLConf, etc.
> However, for modules that don't depend on Core, like the network module, we 
> don't have this infrastructure, and the configs are scattered in the code base.
> We should move the config infrastructure to a new module, spark-configs, so 
> that all other modules can use this infrastructure to centralize their 
> configs.
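
For context, a hedged sketch of the ConfigBuilder-style declaration this 
infrastructure provides (Spark-internal API; the config key and default here 
are made up for illustration):
{code:java}
// Sketch of the ConfigBuilder pattern used inside Spark core.
import java.util.concurrent.TimeUnit
import org.apache.spark.internal.config.ConfigBuilder

val EXAMPLE_TIMEOUT =
  ConfigBuilder("spark.network.exampleTimeout")
    .doc("Hypothetical entry showing how a module would declare a config.")
    .timeConf(TimeUnit.SECONDS)
    .createWithDefaultString("120s")
{code}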



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26945) Python streaming tests flaky while cleaning temp directories after StreamingQuery.stop

2019-02-20 Thread Alessandro Bellina (JIRA)
Alessandro Bellina created SPARK-26945:
--

 Summary: Python streaming tests flaky while cleaning temp 
directories after StreamingQuery.stop
 Key: SPARK-26945
 URL: https://issues.apache.org/jira/browse/SPARK-26945
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Alessandro Bellina


From the test code, it seems like the `shutil.rmtree` function is trying to 
delete a directory, but there's likely another thread adding entries to that 
directory, so when it gets to `os.rmdir(path)` it blows up. I think the test 
(and other streaming tests) should call `q.awaitTermination` after `q.stop`, 
before going on; a sketch follows the traceback below. I'll file a separate 
jira.
{noformat}
ERROR: test_query_manager_await_termination 
(pyspark.sql.tests.test_streaming.StreamingTests)
--
Traceback (most recent call last):
 File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests/test_streaming.py",
 line 259, in test_query_manager_await_termination
 shutil.rmtree(tmpPath)
 File "/home/anaconda/lib/python2.7/shutil.py", line 256, in rmtree
 onerror(os.rmdir, path, sys.exc_info())
 File "/home/anaconda/lib/python2.7/shutil.py", line 254, in rmtree
 os.rmdir(path)
OSError: [Errno 39] Directory not empty: 
'/home/jenkins/workspace/SparkPullRequestBuilder/python/target/072153bd-f981-47be-bda2-e2b657a16f65/tmp4WGp7n'{noformat}
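
A sketch of the suggested ordering, written in Scala for illustration (the 
original test is Python; the source, sink and temp path here are assumed):
{code:java}
// Sketch only: stop the query and wait for full termination before deleting temp dirs.
val tmpPath = java.nio.file.Files.createTempDirectory("streaming-test").toString
val query = spark.readStream.format("rate").load()
  .writeStream
  .format("console")
  .option("checkpointLocation", tmpPath)
  .start()

query.stop()
query.awaitTermination()   // make sure background threads are done writing
// ...only now remove tmpPath
{code}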



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26903) Remove the TimeZone cache

2019-02-20 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26903:
--
Docs Text: Prior to Spark 3, if an invalid timezone was specified to 
to_utc_timestamp and from_utc_timestamp, it would silently be interpreted as 
GMT. It now results in an exception.
   Labels: release-notes  (was: )

> Remove the TimeZone cache
> -
>
> Key: SPARK-26903
> URL: https://issues.apache.org/jira/browse/SPARK-26903
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>  Labels: release-notes
>
> The ZoneOffset class in JDK 8 caches zone offsets internally: 
> http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/time/ZoneOffset.java#l205
>  . No need to cache time zones in Spark. The ticket aims to remove 
> computedTimeZones from DateTimeUtils.
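
For illustration of the behavior change described in the docs text above (the 
zone name is deliberately invalid; the timestamp value is arbitrary):
{code:java}
// Sketch: invalid time zone passed to to_utc_timestamp.
spark.sql("SELECT to_utc_timestamp('2019-02-20 00:00:00', 'Bogus/Zone')").show()
// Spark 2.x: the invalid zone was silently treated as GMT.
// Spark 3.0, per the release note above: this now throws an exception.
{code}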



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26900) Simplify truncation to quarter of year

2019-02-20 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26900.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23808
[https://github.com/apache/spark/pull/23808]

> Simplify truncation to quarter of year
> --
>
> Key: SPARK-26900
> URL: https://issues.apache.org/jira/browse/SPARK-26900
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently, truncating a timestamp to the quarter of the year is performed 
> via truncation to month plus a separate quarter calculation. This can instead 
> be done by directly truncating the local date to the quarter:
> {code:java}
> LocalDate firstDayOfQuarter = date.with(IsoFields.DAY_OF_QUARTER, 1L);
> {code}
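
A tiny worked example of that call (the date is chosen arbitrarily):
{code:java}
import java.time.LocalDate
import java.time.temporal.IsoFields

val d = LocalDate.of(2019, 2, 20)
val firstDayOfQuarter = d.with(IsoFields.DAY_OF_QUARTER, 1L)   // 2019-01-01
{code}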



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins

2019-02-20 Thread Alessandro Bellina (JIRA)
Alessandro Bellina created SPARK-26944:
--

 Summary: Python unit-tests.log not available in artifacts for a 
build in Jenkins
 Key: SPARK-26944
 URL: https://issues.apache.org/jira/browse/SPARK-26944
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.0
Reporter: Alessandro Bellina


I had a PR where the Python unit tests failed. The tests point at the 
`/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, 
but it seems I can't get to that from the Jenkins UI (are all PRs writing to 
the same file?).

This Jira is to make it available under the artifacts for each build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26900) Simplify truncation to quarter of year

2019-02-20 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26900:
-

Assignee: Maxim Gekk

> Simplify truncation to quarter of year
> --
>
> Key: SPARK-26900
> URL: https://issues.apache.org/jira/browse/SPARK-26900
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Currently, truncating a timestamp to the quarter of the year is performed 
> via truncation to month plus a separate quarter calculation. This can instead 
> be done by directly truncating the local date to the quarter:
> {code:java}
> LocalDate firstDayOfQuarter = date.with(IsoFields.DAY_OF_QUARTER, 1L);
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22798) Add multiple column support to PySpark StringIndexer

2019-02-20 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22798.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23741
[https://github.com/apache/spark/pull/23741]

> Add multiple column support to PySpark StringIndexer
> 
>
> Key: SPARK-22798
> URL: https://issues.apache.org/jira/browse/SPARK-22798
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22798) Add multiple column support to PySpark StringIndexer

2019-02-20 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-22798:
-

Assignee: Huaxin Gao

> Add multiple column support to PySpark StringIndexer
> 
>
> Key: SPARK-22798
> URL: https://issues.apache.org/jira/browse/SPARK-22798
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22709) move config related infrastructure from Spark Core to a new module

2019-02-20 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773072#comment-16773072
 ] 

Sean Owen commented on SPARK-22709:
---

Saw this late, but I'm not sure about it. This means a new module that 
everything depends on has configs for everything. I think it's not necessarily 
a win over, say, having {{config/package.scala}} or equivalent for individual 
modules. The existing dependency graph ought to let dependent modules, where 
needed, see these module-specific configs.

> move config related infrastructure from Spark Core to a new module
> --
>
> Key: SPARK-22709
> URL: https://issues.apache.org/jira/browse/SPARK-22709
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Nowadays the config-related infrastructure is in the Spark Core module, and 
> we use this infrastructure to centralize Spark configs, e.g. 
> `org.apache.spark.internal.config`, SQLConf, etc.
> However, for modules that don't depend on Core, like the network module, we 
> don't have this infrastructure, and the configs are scattered in the code base.
> We should move the config infrastructure to a new module, spark-configs, so 
> that all other modules can use this infrastructure to centralize their 
> configs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22000) org.codehaus.commons.compiler.CompileException: toString method is not declared

2019-02-20 Thread Sergey Derugo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773064#comment-16773064
 ] 

Sergey Derugo commented on SPARK-22000:
---

I've got a similar issue. My sample code is attached.

> org.codehaus.commons.compiler.CompileException: toString method is not 
> declared
> ---
>
> Key: SPARK-22000
> URL: https://issues.apache.org/jira/browse/SPARK-22000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>Priority: Major
> Attachments: testcase.zip
>
>
> The error message says that toString is not declared on "value13", which is 
> of primitive "long" type in the generated code.
> I think value13 should be of boxed Long type.
> ==error message
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 70, Column 32: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 70, Column 32: A method named "toString" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> /* 033 */   private void apply1_2(InternalRow i) {
> /* 034 */
> /* 035 */
> /* 036 */ boolean isNull11 = i.isNullAt(1);
> /* 037 */ UTF8String value11 = isNull11 ? null : (i.getUTF8String(1));
> /* 038 */ boolean isNull10 = true;
> /* 039 */ java.lang.String value10 = null;
> /* 040 */ if (!isNull11) {
> /* 041 */
> /* 042 */   isNull10 = false;
> /* 043 */   if (!isNull10) {
> /* 044 */
> /* 045 */ Object funcResult4 = null;
> /* 046 */ funcResult4 = value11.toString();
> /* 047 */
> /* 048 */ if (funcResult4 != null) {
> /* 049 */   value10 = (java.lang.String) funcResult4;
> /* 050 */ } else {
> /* 051 */   isNull10 = true;
> /* 052 */ }
> /* 053 */
> /* 054 */
> /* 055 */   }
> /* 056 */ }
> /* 057 */ javaBean.setApp(value10);
> /* 058 */
> /* 059 */
> /* 060 */ boolean isNull13 = i.isNullAt(12);
> /* 061 */ long value13 = isNull13 ? -1L : (i.getLong(12));
> /* 062 */ boolean isNull12 = true;
> /* 063 */ java.lang.String value12 = null;
> /* 064 */ if (!isNull13) {
> /* 065 */
> /* 066 */   isNull12 = false;
> /* 067 */   if (!isNull12) {
> /* 068 */
> /* 069 */ Object funcResult5 = null;
> /* 070 */ funcResult5 = value13.toString();
> /* 071 */
> /* 072 */ if (funcResult5 != null) {
> /* 073 */   value12 = (java.lang.String) funcResult5;
> /* 074 */ } else {
> /* 075 */   isNull12 = true;
> /* 076 */ }
> /* 077 */
> /* 078 */
> /* 079 */   }
> /* 080 */ }
> /* 081 */ javaBean.setReasonCode(value12);
> /* 082 */
> /* 083 */   }



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9135) Filter fails when filtering with a method reference to overloaded method

2019-02-20 Thread Valeria Vasylieva (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-9135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773067#comment-16773067
 ] 

Valeria Vasylieva commented on SPARK-9135:
--

I would like to work on it

> Filter fails when filtering with a method reference to overloaded method
> 
>
> Key: SPARK-9135
> URL: https://issues.apache.org/jira/browse/SPARK-9135
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.4.0
>Reporter: Mateusz Michalowski
>Priority: Major
>
> Filter fails when filtering with a method reference to an overloaded method.
> In the example below we filter by Fruit::isRed, which is overridden by 
> Apple::isRed and Banana::isRed.
> {code}
> apples.filter(Fruit::isRed)
> bananas.filter(Fruit::isRed) //throws!
> {code}
> Spark will try to cast Apple::isRed to Banana::isRed - and then throw as a 
> result.
> However, if we filter the more generic RDD first, all works fine:
> {code}
> fruit.filter(Fruit::isRed)
> bananas.filter(Fruit::isRed) //works fine!
> {code}
> It also works well if we use a lambda instead of the method reference:
> {code}
> apples.filter(f -> f.isRed())
> bananas.filter(f -> f.isRed()) //works fine!
> {code} 
> I attach a test setup below:
> {code:java}
> package com.doggybites;
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.junit.After;
> import org.junit.Before;
> import org.junit.Test;
> import java.io.Serializable;
> import java.util.Arrays;
> import static org.hamcrest.CoreMatchers.equalTo;
> import static org.junit.Assert.assertThat;
> public class SparkTest {
> static abstract class Fruit implements Serializable {
> abstract boolean isRed();
> }
> static class Banana extends Fruit {
> @Override
> boolean isRed() {
> return false;
> }
> }
> static class Apple extends Fruit {
> @Override
> boolean isRed() {
> return true;
> }
> }
> private JavaSparkContext sparkContext;
> @Before
> public void setUp() throws Exception {
> SparkConf sparkConf = new 
> SparkConf().setAppName("test").setMaster("local[2]");
> sparkContext = new JavaSparkContext(sparkConf);
> }
> @After
> public void tearDown() throws Exception {
> sparkContext.stop();
> }
> private <T> JavaRDD<T> toRdd(T ... array) {
> return sparkContext.parallelize(Arrays.asList(array));
> }
> @Test
> public void filters_apples_and_bananas_with_method_reference() {
> JavaRDD<Apple> appleRdd = toRdd(new Apple());
> JavaRDD<Banana> bananaRdd = toRdd(new Banana());
> 
> long redAppleCount = appleRdd.filter(Fruit::isRed).count();
> long redBananaCount = bananaRdd.filter(Fruit::isRed).count();
> assertThat(redAppleCount, equalTo(1L));
> assertThat(redBananaCount, equalTo(0L));
> }
> }
> {code}
> The test above throws:
> {code}
> 15/07/17 14:10:04 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
> java.lang.ClassCastException: com.doggybites.SparkTest$Banana cannot be cast 
> to com.doggybites.SparkTest$Apple
>   at com.doggybites.SparkTest$$Lambda$2/976119300.call(Unknown Source)
>   at 
> org.apache.spark.api.java.JavaRDD$$anonfun$filter$1.apply(JavaRDD.scala:78)
>   at 
> org.apache.spark.api.java.JavaRDD$$anonfun$filter$1.apply(JavaRDD.scala:78)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1626)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1099)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 15/07/17 14:10:04 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 3, 
> localhost): java.lang.ClassCastException: com.doggybites.SparkTest$Banana 
> cannot be cast to com.doggybites.SparkTest$Apple
>   at com.doggybites.SparkTest$$Lambda$2/976119300.call(Unknown Source)
>   at 
> org.apache.spark.api.java.JavaRDD$$anonfun$filter$1.apply(JavaRDD.scala:78)
>   at 

[jira] [Updated] (SPARK-22000) org.codehaus.commons.compiler.CompileException: toString method is not declared

2019-02-20 Thread Sergey Derugo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Derugo updated SPARK-22000:
--
Attachment: testcase.zip

> org.codehaus.commons.compiler.CompileException: toString method is not 
> declared
> ---
>
> Key: SPARK-22000
> URL: https://issues.apache.org/jira/browse/SPARK-22000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>Priority: Major
> Attachments: testcase.zip
>
>
> The error message says that toString is not declared on "value13", which is 
> of primitive "long" type in the generated code.
> I think value13 should be of boxed Long type.
> ==error message
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 70, Column 32: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 70, Column 32: A method named "toString" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> /* 033 */   private void apply1_2(InternalRow i) {
> /* 034 */
> /* 035 */
> /* 036 */ boolean isNull11 = i.isNullAt(1);
> /* 037 */ UTF8String value11 = isNull11 ? null : (i.getUTF8String(1));
> /* 038 */ boolean isNull10 = true;
> /* 039 */ java.lang.String value10 = null;
> /* 040 */ if (!isNull11) {
> /* 041 */
> /* 042 */   isNull10 = false;
> /* 043 */   if (!isNull10) {
> /* 044 */
> /* 045 */ Object funcResult4 = null;
> /* 046 */ funcResult4 = value11.toString();
> /* 047 */
> /* 048 */ if (funcResult4 != null) {
> /* 049 */   value10 = (java.lang.String) funcResult4;
> /* 050 */ } else {
> /* 051 */   isNull10 = true;
> /* 052 */ }
> /* 053 */
> /* 054 */
> /* 055 */   }
> /* 056 */ }
> /* 057 */ javaBean.setApp(value10);
> /* 058 */
> /* 059 */
> /* 060 */ boolean isNull13 = i.isNullAt(12);
> /* 061 */ long value13 = isNull13 ? -1L : (i.getLong(12));
> /* 062 */ boolean isNull12 = true;
> /* 063 */ java.lang.String value12 = null;
> /* 064 */ if (!isNull13) {
> /* 065 */
> /* 066 */   isNull12 = false;
> /* 067 */   if (!isNull12) {
> /* 068 */
> /* 069 */ Object funcResult5 = null;
> /* 070 */ funcResult5 = value13.toString();
> /* 071 */
> /* 072 */ if (funcResult5 != null) {
> /* 073 */   value12 = (java.lang.String) funcResult5;
> /* 074 */ } else {
> /* 075 */   isNull12 = true;
> /* 076 */ }
> /* 077 */
> /* 078 */
> /* 079 */   }
> /* 080 */ }
> /* 081 */ javaBean.setReasonCode(value12);
> /* 082 */
> /* 083 */   }



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26943) Weird behaviour with `.cache()`

2019-02-20 Thread Will Uto (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Uto updated SPARK-26943:
-
Description: 
 
{code:java}
sdf.count(){code}
 

works fine. However:

 
{code:java}
sdf = sdf.cache()
sdf.count()

{code}
 does not, and produces error
{code:java}
Py4JJavaError: An error occurred while calling o314.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 75 in 
stage 8.0 failed 4 times, most recent failure: Lost task 75.3 in stage 8.0 (TID 
438, uat-datanode-02, executor 1): java.text.ParseException: Unparseable 
number: "(N/A)"
at java.text.NumberFormat.parse(NumberFormat.java:350)
{code}

  was:
 
{code:java}
sdf.count(){code}
 

works fine. However:

 
{code:java}
sdf = sdf.cache()
sdf.count()

{code}
 does not, and produces error
{code:java}
Py4JJavaError: An error occurred while calling o314.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 75 in 
stage 8.0 failed 4 times, most recent failure: Lost task 75.3 in stage 8.0 (TID 
438, uat-datanode-02.mint.ukho.gov.uk, executor 1): java.text.ParseException: 
Unparseable number: "(N/A)"
at java.text.NumberFormat.parse(NumberFormat.java:350)
{code}


> Weird behaviour with `.cache()`
> ---
>
> Key: SPARK-26943
> URL: https://issues.apache.org/jira/browse/SPARK-26943
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Will Uto
>Priority: Major
>
>  
> {code:java}
> sdf.count(){code}
>  
> works fine. However:
>  
> {code:java}
> sdf = sdf.cache()
> sdf.count()
> {code}
>  does not, and produces the error
> {code:java}
> Py4JJavaError: An error occurred while calling o314.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 75 
> in stage 8.0 failed 4 times, most recent failure: Lost task 75.3 in stage 8.0 
> (TID 438, uat-datanode-02, executor 1): java.text.ParseException: Unparseable 
> number: "(N/A)"
>   at java.text.NumberFormat.parse(NumberFormat.java:350)
> {code}
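
Since the original dataset is not available, here is a hedged Scala sketch (the 
column names and the number-parsing UDF are assumptions) of one way `.cache()` can 
surface an error that a plain `count()` never triggers: count() lets the optimizer 
prune the failing column, while caching materializes every column and forces the 
parse.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").appName("cache-parse-sketch").getOrCreate()
import spark.implicits._

// A column whose parsing only fails on some rows, e.g. "(N/A)".
val parseNum = udf((s: String) => java.text.NumberFormat.getInstance.parse(s).doubleValue)
val sdf = Seq("1.5", "2.0", "(N/A)").toDF("raw").withColumn("parsed", parseNum($"raw"))

sdf.count()          // may succeed: the unused "parsed" column can be pruned away
sdf.cache().count()  // materializing the cache evaluates every column and hits the ParseException
{code}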



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25810) Spark structured streaming logs auto.offset.reset=earliest even though startingOffsets is set to latest

2019-02-20 Thread Valeria Vasylieva (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773044#comment-16773044
 ] 

Valeria Vasylieva commented on SPARK-25810:
---

I suppose the cause is here: 
[KafkaSourceProvider:521|https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L517]
{code:java}
// Set to "earliest" to avoid exceptions. However, KafkaSource will fetch the initial
// offsets by itself instead of counting on KafkaConsumer.
.set(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
{code}
But I do not really think this should be fixed: Spark defines its own algorithm for 
offset checking and fetching in 
[KafkaSource|https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala],
 and thanks to it we can avoid Kafka errors on offsets that do not exist, etc.

[ConsumerConfig|https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerConfig.java]
 belongs to Kafka, so we cannot just change the logging here.
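
To double-check that the log line is only cosmetic, one can inspect the streaming 
query progress, which reports the offsets Spark actually consumed. A minimal sketch 
(broker address and topic name are taken from the issue description below; adjust 
for your setup):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("offset-check").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic1")
  .option("startingOffsets", "latest")
  .load()

val query = df.writeStream.format("console").start()

// After a trigger or two, the progress report shows the start offsets that were
// actually used, which should reflect "latest" despite the misleading log line.
query.recentProgress.foreach(p => p.sources.foreach(s => println(s.startOffset)))
{code}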

> Spark structured streaming logs auto.offset.reset=earliest even though 
> startingOffsets is set to latest
> ---
>
> Key: SPARK-25810
> URL: https://issues.apache.org/jira/browse/SPARK-25810
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: ANUJA BANTHIYA
>Priority: Trivial
>
> I have an issue when trying to read data from Kafka using Spark structured 
> streaming.
> Versions: spark-core_2.11 : 2.3.1, spark-sql_2.11 : 2.3.1, 
> spark-sql-kafka-0-10_2.11 : 2.3.1, kafka-client : 0.11.0.0
> The issue I am facing is that the Spark job always logs auto.offset.reset = 
> earliest during application startup, even though the latest option is specified 
> in the code.
> Code to reproduce: 
> {code:java}
> package com.informatica.exec
> import org.apache.spark.sql.SparkSession
> object kafkaLatestOffset {
>  def main(s: Array[String]) {
>  val spark = SparkSession
>  .builder()
>  .appName("Spark Offset basic example")
>  .master("local[*]")
>  .getOrCreate()
>  val df = spark
>  .readStream
>  .format("kafka")
>  .option("kafka.bootstrap.servers", "localhost:9092")
>  .option("subscribe", "topic1")
>  .option("startingOffsets", "latest")
>  .load()
>  val query = df.writeStream
>  .outputMode("complete")
>  .format("console")
>  .start()
>  query.awaitTermination()
>  }
> }
> {code}
>  
> As mentioned in the Structured Streaming doc, {{startingOffsets}} needs to be set 
> instead of auto.offset.reset.
> [https://spark.apache.org/docs/2.3.1/structured-streaming-kafka-integration.html]
>  * *auto.offset.reset*: Set the source option {{startingOffsets}} to specify 
> where to start instead. Structured Streaming manages which offsets are 
> consumed internally, rather than rely on the kafka Consumer to do it. This 
> will ensure that no data is missed when new topics/partitions are dynamically 
> subscribed. Note that {{startingOffsets}} only applies when a new streaming 
> query is started, and that resuming will always pick up from where the query 
> left off.
> At runtime, Kafka messages are picked up from the latest offset, so functionally 
> it is working as expected. Only the log is misleading, as it logs 
> auto.offset.reset = *earliest*.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26859) Fix field writer index bug in non-vectorized ORC deserializer

2019-02-20 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26859.
-
   Resolution: Fixed
Fix Version/s: 2.4.1
   3.0.0

Issue resolved by pull request 23766
[https://github.com/apache/spark/pull/23766]

> Fix field writer index bug in non-vectorized ORC deserializer
> -
>
> Key: SPARK-26859
> URL: https://issues.apache.org/jira/browse/SPARK-26859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ivan Vergiliev
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0, 2.4.1
>
>
> There is a bug in the ORC deserialization code that, when triggered, results 
> in completely wrong data being read. I've marked this as a Blocker as per the 
> docs in https://spark.apache.org/contributing.html as it's a data correctness 
> issue.
> The bug is triggered when the following set of conditions are all met:
> - the non-vectorized ORC reader is being used;
> - a schema is explicitly specified when reading the ORC file
> - the provided schema has columns not present in the ORC file, and these 
> columns are in the middle of the schema
> - the ORC file being read contains null values in the columns after the ones 
> added by the schema.
> When all of these are met:
> - the internal state of the ORC deserializer gets messed up, and, as a result
> - the null values from the ORC file end up being set on wrong columns, not 
> the one they're in, and
> - the old values from the null columns don't get cleared from the previous 
> record.
> Here's a concrete example. Let's consider the following DataFrame:
> {code:scala}
> val rdd = sparkContext.parallelize(Seq((1, 2, "abc"), (4, 5, "def"), 
> (8, 9, null)))
> val df = rdd.toDF("col1", "col2", "col3")
> {code}
> and the following schema:
> {code:scala}
> col1 int, col4 int, col2 int, col3 string
> {code}
> Notice the `col4 int` added in the middle that doesn't exist in the dataframe.
> Saving this dataframe to ORC and then reading it back with the specified 
> schema should result in reading the same values, with nulls for `col4`. 
> Instead, we get the following back:
> {code:java}
> [1,null,2,abc]
> [4,null,5,def]
> [8,null,null,def]
> {code}
> Notice how the `def` from the second record doesn't get properly cleared and 
> ends up in the third record as well; also, instead of `col2 = 9` in the last 
> record as expected, we get the null that should've been in column 3 instead.
> *Impact*
> When this issue is triggered, it results in completely wrong results being 
> read from the ORC file. The set of conditions under which it gets triggered 
> is somewhat narrow so the set of affected users is probably limited. There 
> are possibly also people that are affected but haven't realized it because 
> the conditions are so obscure.
> *Bug details*
> The issue is caused by calling `setNullAt` with a wrong index in 
> `OrcDeserializer.scala:deserialize()`. I have a fix that I'll send out for 
> review shortly.
> *Workaround*
> This bug is currently only triggered when new columns are added to the middle 
> of the schema. This means that it can be worked around by only adding new 
> columns at the end.
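
A self-contained sketch of the reproduction described above (the output path, the 
session setup, and the spark.sql.orc.enableVectorizedReader setting are assumptions 
added for completeness; check the config name on your version):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("orc-index-repro").getOrCreate()
import spark.implicits._

val df = spark.sparkContext
  .parallelize(Seq((1, 2, "abc"), (4, 5, "def"), (8, 9, null: String)))
  .toDF("col1", "col2", "col3")

df.write.mode("overwrite").orc("/tmp/spark-26859-repro")

// Force the non-vectorized reader and add `col4` in the middle of the schema,
// which is the combination described as triggering the wrong-index writes.
spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
spark.read
  .schema("col1 int, col4 int, col2 int, col3 string")
  .orc("/tmp/spark-26859-repro")
  .show()
{code}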



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26859) Fix field writer index bug in non-vectorized ORC deserializer

2019-02-20 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26859:
---

Assignee: Ivan Vergiliev

> Fix field writer index bug in non-vectorized ORC deserializer
> -
>
> Key: SPARK-26859
> URL: https://issues.apache.org/jira/browse/SPARK-26859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ivan Vergiliev
>Assignee: Ivan Vergiliev
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.1, 3.0.0
>
>
> There is a bug in the ORC deserialization code that, when triggered, results 
> in completely wrong data being read. I've marked this as a Blocker as per the 
> docs in https://spark.apache.org/contributing.html as it's a data correctness 
> issue.
> The bug is triggered when the following set of conditions are all met:
> - the non-vectorized ORC reader is being used;
> - a schema is explicitly specified when reading the ORC file
> - the provided schema has columns not present in the ORC file, and these 
> columns are in the middle of the schema
> - the ORC file being read contains null values in the columns after the ones 
> added by the schema.
> When all of these are met:
> - the internal state of the ORC deserializer gets messed up, and, as a result
> - the null values from the ORC file end up being set on wrong columns, not 
> the one they're in, and
> - the old values from the null columns don't get cleared from the previous 
> record.
> Here's a concrete example. Let's consider the following DataFrame:
> {code:scala}
> val rdd = sparkContext.parallelize(Seq((1, 2, "abc"), (4, 5, "def"), 
> (8, 9, null)))
> val df = rdd.toDF("col1", "col2", "col3")
> {code}
> and the following schema:
> {code:scala}
> col1 int, col4 int, col2 int, col3 string
> {code}
> Notice the `col4 int` added in the middle that doesn't exist in the dataframe.
> Saving this dataframe to ORC and then reading it back with the specified 
> schema should result in reading the same values, with nulls for `col4`. 
> Instead, we get the following back:
> {code:java}
> [1,null,2,abc]
> [4,null,5,def]
> [8,null,null,def]
> {code}
> Notice how the `def` from the second record doesn't get properly cleared and 
> ends up in the third record as well; also, instead of `col2 = 9` in the last 
> record as expected, we get the null that should've been in column 3 instead.
> *Impact*
> When this issue is triggered, it results in completely wrong results being 
> read from the ORC file. The set of conditions under which it gets triggered 
> is somewhat narrow so the set of affected users is probably limited. There 
> are possibly also people that are affected but haven't realized it because 
> the conditions are so obscure.
> *Bug details*
> The issue is caused by calling `setNullAt` with a wrong index in 
> `OrcDeserializer.scala:deserialize()`. I have a fix that I'll send out for 
> review shortly.
> *Workaround*
> This bug is currently only triggered when new columns are added to the middle 
> of the schema. This means that it can be worked around by only adding new 
> columns at the end.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26943) Weird behaviour with `.cache()`

2019-02-20 Thread Will Uto (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Uto updated SPARK-26943:
-
Description: 
 
{code:java}
sdf.count(){code}
 

works fine. However:

 
{code:java}
sdf = sdf.cache()
sdf.count()

{code}
 does not, and produces the error
{code:java}
Py4JJavaError: An error occurred while calling o314.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 75 in 
stage 8.0 failed 4 times, most recent failure: Lost task 75.3 in stage 8.0 (TID 
438, uat-datanode-02.mint.ukho.gov.uk, executor 1): java.text.ParseException: 
Unparseable number: "(N/A)"
at java.text.NumberFormat.parse(NumberFormat.java:350)
{code}

  was:
 
{code:java}
sdf.count(){code}
 

works fine. However:

 
{code:java}
sdf = sdf.cache()
sdf.count()

{code}
 


> Weird behaviour with `.cache()`
> ---
>
> Key: SPARK-26943
> URL: https://issues.apache.org/jira/browse/SPARK-26943
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Will Uto
>Priority: Major
>
>  
> {code:java}
> sdf.count(){code}
>  
> works fine. However:
>  
> {code:java}
> sdf = sdf.cache()
> sdf.count()
> {code}
>  does not, and produces the error
> {code:java}
> Py4JJavaError: An error occurred while calling o314.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 75 
> in stage 8.0 failed 4 times, most recent failure: Lost task 75.3 in stage 8.0 
> (TID 438, uat-datanode-02.mint.ukho.gov.uk, executor 1): 
> java.text.ParseException: Unparseable number: "(N/A)"
>   at java.text.NumberFormat.parse(NumberFormat.java:350)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26943) Weird behaviour with `.cache()`

2019-02-20 Thread Will Uto (JIRA)
Will Uto created SPARK-26943:


 Summary: Weird behaviour with `.cache()`
 Key: SPARK-26943
 URL: https://issues.apache.org/jira/browse/SPARK-26943
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.0
Reporter: Will Uto


 
{code:java}
sdf.count(){code}
 

works fine. However:

 
{code:java}
sdf = sdf.cache()
sdf.count()

{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22601) Data load is getting displayed successful on providing non existing hdfs file path

2019-02-20 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-22601:
-

Assignee: Sujith

> Data load is getting displayed successful on providing non existing hdfs file 
> path
> --
>
> Key: SPARK-22601
> URL: https://issues.apache.org/jira/browse/SPARK-22601
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sujith
>Assignee: Sujith
>Priority: Minor
> Fix For: 2.2.1
>
>
> Data load is reported as successful when a non-existing HDFS file path is 
> provided, whereas for a local path a proper error message is displayed.
> create table tb2 (a string, b int);
>  load data inpath 'hdfs://hacluster/data1.csv' into table tb2
> Note: data1.csv does not exist in HDFS.
> When a non-existing local file path is given, the error message below is 
> displayed:
> "LOAD DATA input path does not exist". Snapshots of the behaviour in Spark 2.1 
> and Spark 2.2 are attached.
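
A sketch of the validation idea (not necessarily how the merged fix works): resolve 
the LOAD DATA path against its own FileSystem and fail fast when it is missing. The 
method name and error text here are illustrative.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def assertLoadPathExists(pathString: String, hadoopConf: Configuration): Unit = {
  val path = new Path(pathString)
  val fs = path.getFileSystem(hadoopConf)  // works for hdfs://, file://, etc.
  if (!fs.exists(path)) {
    throw new IllegalArgumentException(s"LOAD DATA input path does not exist: $pathString")
  }
}
{code}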



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26869) UDF with struct requires to have _1 and _2 as struct field names

2019-02-20 Thread Valeria Vasylieva (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772988#comment-16772988
 ] 

Valeria Vasylieva commented on SPARK-26869:
---

[~anddonram] you are trying to treat a struct as a tuple in the UDF, but even if 
you try to use a Row or a case class, it will also fail, as that is not supported 
yet.

Try to look at [SPARK-12823|https://issues.apache.org/jira/browse/SPARK-12823], 
it seems to be related. Hope it helps.

 

> UDF with struct requires to have _1 and _2 as struct field names
> 
>
> Key: SPARK-26869
> URL: https://issues.apache.org/jira/browse/SPARK-26869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
> Environment: Ubuntu 18.04.1 LTS
>Reporter: Andrés Doncel Ramírez
>Priority: Minor
>
> When using a UDF which has a Seq of tuples as input, the struct field names 
> need to match "_1" and "_2". The following code illustrates this:
>  
> {code:java}
> val df = sc.parallelize(Array(
>   ("1",3.0),
>   ("2",4.5),
>   ("5",2.0)
> )
> ).toDF("c1","c2")
> val df1=df.agg(collect_list(struct("c1","c2")).as("c3"))
> // Changing column names to _1 and _2 when creating the struct
> val 
> df2=df.agg(collect_list(struct(col("c1").as("_1"),col("c2").as("_2"))).as("c3"))
> def takeUDF = udf({ (xs: Seq[(String, Double)]) =>
>   xs.take(2)
> })
> df1.printSchema
> df2.printSchema
> df1.withColumn("c4",takeUDF(col("c3"))).show() // this fails
> df2.withColumn("c4",takeUDF(col("c3"))).show() // this works
> {code}
> The first one returns the following exception:
> org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(c3)' due to data 
> type mismatch: argument 1 requires array<struct<_1:string,_2:double>> type, 
> however, '`c3`' is of array<struct<c1:string,c2:double>> type.;;
> While the second works as expected and prints the result.
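
For this particular take-first-N case there is also a UDF-free route, which 
sidesteps the field-name restriction because no Scala tuple encoder is involved. A 
sketch assuming Spark 2.4+, where the `slice` function is available (`df1` is the 
DataFrame defined above):

{code:scala}
import org.apache.spark.sql.functions.{col, slice}

// Take the first two struct elements without deserializing them into tuples,
// so the struct field names (c1/c2 vs _1/_2) no longer matter.
val df3 = df1.withColumn("c4", slice(col("c3"), 1, 2))
df3.show(false)
{code}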



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24211) Flaky test: StreamingOuterJoinSuite

2019-02-20 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-24211.
--
   Resolution: Fixed
 Assignee: Jungtaek Lim
Fix Version/s: 2.3.4

Resolved by https://github.com/apache/spark/pull/23757

> Flaky test: StreamingOuterJoinSuite
> ---
>
> Key: SPARK-24211
> URL: https://issues.apache.org/jira/browse/SPARK-24211
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Tests
>Affects Versions: 2.3.2
>Reporter: Dongjoon Hyun
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.3.4
>
>
> *windowed left outer join*
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/330/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/317/]
> *windowed right outer join*
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/334/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/328/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/371/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/345/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/366/]
> *left outer join with non-key condition violated*
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/337/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/366/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/386/]
> *left outer early state exclusion on left*
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/375]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/385/]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24211) Flaky test: StreamingOuterJoinSuite

2019-02-20 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772980#comment-16772980
 ] 

Takeshi Yamamuro commented on SPARK-24211:
--

I closed this because these test failures no longer seem to happen in the recent 
branch-2.3 test runs.

Thanks [~kabhwan]!

> Flaky test: StreamingOuterJoinSuite
> ---
>
> Key: SPARK-24211
> URL: https://issues.apache.org/jira/browse/SPARK-24211
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Tests
>Affects Versions: 2.3.2
>Reporter: Dongjoon Hyun
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.3.4
>
>
> *windowed left outer join*
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/330/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/317/]
> *windowed right outer join*
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/334/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/328/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/371/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/345/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/366/]
> *left outer join with non-key condition violated*
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/337/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/366/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/386/]
> *left outer early state exclusion on left*
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/375]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/385/]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26927) Race condition may cause dynamic allocation not working

2019-02-20 Thread liupengcheng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liupengcheng updated SPARK-26927:
-
Issue Type: Bug  (was: Improvement)

> Race condition may cause dynamic allocation not working
> ---
>
> Key: SPARK-26927
> URL: https://issues.apache.org/jira/browse/SPARK-26927
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.4.0
>Reporter: liupengcheng
>Priority: Major
> Attachments: Selection_042.jpg, Selection_043.jpg, Selection_044.jpg, 
> Selection_045.jpg, Selection_046.jpg
>
>
> Recently, we caught a bug that caused our production Spark Thrift Server to hang:
> There is a race condition in the ExecutorAllocationManager: the 
> `SparkListenerExecutorRemoved` event can be posted before the 
> `SparkListenerTaskStart` event, which leads to an incorrect `executorIds` set. 
> Then, when some executor idles, real executors are removed even though the 
> executor count equals `minNumExecutors`, because `newExecutorTotal` is computed 
> incorrectly (it may be greater than `minNumExecutors`). This can finally leave 
> zero available executors while a wrong number of executorIds is kept in memory.
> What's more, even the `SparkListenerTaskEnd` event cannot release the stale 
> `executorIds`, because a later idle event for these stale executors cannot 
> trigger their real removal: they were already removed and no longer exist in the 
> `executorDataMap` of `CoarseGrainedSchedulerBackend`.
> Logs:
> !Selection_042.jpg!
> !Selection_043.jpg!
> !Selection_044.jpg!
> !Selection_045.jpg!
> !Selection_046.jpg!  
> Event logs (showing the out-of-order events):
> {code:java}
> {"Event":"SparkListenerExecutorRemoved","Timestamp":1549936077543,"Executor 
> ID":"131","Removed Reason":"Container 
> container_e28_1547530852233_236191_02_000180 exited from explicit termination 
> request."}
> {"Event":"SparkListenerTaskStart","Stage ID":136689,"Stage Attempt 
> ID":0,"Task Info":{"Task ID":448048,"Index":2,"Attempt":0,"Launch 
> Time":1549936032872,"Executor 
> ID":"131","Host":"mb2-hadoop-prc-st474.awsind","Locality":"RACK_LOCAL", 
> "Speculative":false,"Getting Result Time":0,"Finish 
> Time":1549936032906,"Failed":false,"Killed":false,"Accumulables":[{"ID":12923945,"Name":"internal.metrics.executorDeserializeTime","Update":10,"Value":13,"Internal":true,"Count
>  Faile d 
> Values":true},{"ID":12923946,"Name":"internal.metrics.executorDeserializeCpuTime","Update":2244016,"Value":4286494,"Internal":true,"Count
>  Failed 
> Values":true},{"ID":12923947,"Name":"internal.metrics.executorRunTime","Update":20,"Val
>  ue":39,"Internal":true,"Count Failed 
> Values":true},{"ID":12923948,"Name":"internal.metrics.executorCpuTime","Update":13412614,"Value":26759061,"Internal":true,"Count
>  Failed Values":true},{"ID":12923949,"Name":"internal.metrics.resultS 
> ize","Update":3578,"Value":7156,"Internal":true,"Count Failed 
> Values":true},{"ID":12923954,"Name":"internal.metrics.peakExecutionMemory","Update":33816576,"Value":67633152,"Internal":true,"Count
>  Failed Values":true},{"ID":12923962,"Na 
> me":"internal.metrics.shuffle.write.bytesWritten","Update":1367,"Value":2774,"Internal":true,"Count
>  Failed 
> Values":true},{"ID":12923963,"Name":"internal.metrics.shuffle.write.recordsWritten","Update":23,"Value":45,"Internal":true,"Cou
>  nt Failed 
> Values":true},{"ID":12923964,"Name":"internal.metrics.shuffle.write.writeTime","Update":3259051,"Value":6858121,"Internal":true,"Count
>  Failed Values":true},{"ID":12921550,"Name":"number of output 
> rows","Update":"158","Value" :"289","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"},{"ID":12921546,"Name":"number of output 
> rows","Update":"23","Value":"45","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"},{"ID":12921547,"Name":"peak memo ry total 
> (min, med, 
> max)","Update":"33816575","Value":"67633149","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"},{"ID":12921541,"Name":"data size total (min, 
> med, max)","Update":"551","Value":"1077","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"}]}}
> {code}
>  
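
Not the actual Spark code or the eventual fix, just a hedged sketch of the 
defensive idea implied above (names such as executorIds and removedExecutors are 
illustrative, not the real ExecutorAllocationManager members): ignore a task start 
that arrives after the executor's removal, so the stale executor never re-enters 
the tracked set.

{code:scala}
import scala.collection.mutable

val executorIds = mutable.Set[String]()
val removedExecutors = mutable.Set[String]()

def onExecutorRemoved(executorId: String): Unit = {
  executorIds -= executorId
  removedExecutors += executorId
}

def onTaskStart(executorId: String): Unit = {
  // Guard against the out-of-order SparkListenerTaskStart described above.
  if (!removedExecutors.contains(executorId)) {
    executorIds += executorId
  }
}
{code}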



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24239) Flaky test: KafkaContinuousSourceSuite.subscribing topic by name from earliest offsets

2019-02-20 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772984#comment-16772984
 ] 

Takeshi Yamamuro commented on SPARK-24239:
--

I closed this because these test failures no longer seem to happen in the recent 
branch-2.3 test runs.

Thanks [~kabhwan]!

> Flaky test: KafkaContinuousSourceSuite.subscribing topic by name from 
> earliest offsets
> --
>
> Key: SPARK-24239
> URL: https://issues.apache.org/jira/browse/SPARK-24239
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Tests
>Affects Versions: 2.3.2
>Reporter: Dongjoon Hyun
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.3.4
>
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/360/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/353/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24239) Flaky test: KafkaContinuousSourceSuite.subscribing topic by name from earliest offsets

2019-02-20 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-24239.
--
   Resolution: Fixed
 Assignee: Jungtaek Lim
Fix Version/s: 2.3.4

Resolved by https://github.com/apache/spark/pull/23757

> Flaky test: KafkaContinuousSourceSuite.subscribing topic by name from 
> earliest offsets
> --
>
> Key: SPARK-24239
> URL: https://issues.apache.org/jira/browse/SPARK-24239
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Tests
>Affects Versions: 2.3.2
>Reporter: Dongjoon Hyun
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.3.4
>
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/360/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/353/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26942) spark v 2.3.2 test failure in hive module

2019-02-20 Thread ketan kunde (JIRA)
ketan kunde created SPARK-26942:
---

 Summary: spark v 2.3.2 test failure in hive module
 Key: SPARK-26942
 URL: https://issues.apache.org/jira/browse/SPARK-26942
 Project: Spark
  Issue Type: Test
  Components: Spark Core
Affects Versions: 2.3.2
 Environment: ub 16.04

8GB ram

2 core machine .. 

docker container
Reporter: ketan kunde


Hi,

I have built Spark 2.3.2 on a big endian system.
I am now executing the test cases in the Hive module.
I encounter an issue related to the ORC format on big endian while running the test 
"test statistics of LogicalRelation converted from Hive serde tables".
I want to know what the support for the ORC serde is on big endian systems and, if 
it is supported, what the workaround is to get this test fixed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26941) incorrect computation of maxNumExecutorFailures in ApplicationMaster for streaming

2019-02-20 Thread liupengcheng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liupengcheng updated SPARK-26941:
-
Summary: incorrect computation of maxNumExecutorFailures in 
ApplicationMaster for streaming   (was: maxNumExecutorFailures should be 
computed with spark.streaming.dynamicAllocation.maxExecutors in 
ApplicationMaster for streaming )

> incorrect computation of maxNumExecutorFailures in ApplicationMaster for 
> streaming 
> ---
>
> Key: SPARK-26941
> URL: https://issues.apache.org/jira/browse/SPARK-26941
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.1.0, 2.4.0
>Reporter: liupengcheng
>Priority: Major
>
> Currently, when streaming dynamic allocation is enabled for streaming 
> applications, the maxNumExecutorFailures in ApplicationMaster is still 
> computed with `spark.dynamicAllocation.maxExecutors`. 
> Actually, we should consider `spark.streaming.dynamicAllocation.maxExecutors` 
> instead.
> Related codes:
> {code:java}
> private val maxNumExecutorFailures = {
>   val effectiveNumExecutors =
> if (Utils.isStreamingDynamicAllocationEnabled(sparkConf)) {
>   sparkConf.get(STREAMING_DYN_ALLOCATION_MAX_EXECUTORS)
> } else if (Utils.isDynamicAllocationEnabled(sparkConf)) {
>   sparkConf.get(DYN_ALLOCATION_MAX_EXECUTORS)
> } else {
>   sparkConf.get(EXECUTOR_INSTANCES).getOrElse(0)
> }
>   // By default, effectiveNumExecutors is Int.MaxValue if dynamic allocation 
> is enabled. We need
>   // avoid the integer overflow here.
>   val defaultMaxNumExecutorFailures = math.max(3,
> if (effectiveNumExecutors > Int.MaxValue / 2) Int.MaxValue else (2 * 
> effectiveNumExecutors))
>   
> sparkConf.get(MAX_EXECUTOR_FAILURES).getOrElse(defaultMaxNumExecutorFailures)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26941) maxNumExecutorFailures should be computed with spark.streaming.dynamicAllocation.maxExecutors in ApplicationMaster for streaming

2019-02-20 Thread liupengcheng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liupengcheng updated SPARK-26941:
-
Component/s: YARN
Summary: maxNumExecutorFailures should be computed with 
spark.streaming.dynamicAllocation.maxExecutors in ApplicationMaster for 
streaming   (was: maxNumExecutorFailures should be computed with 
spark.streaming.dynamicAllocation.maxExecutors in streaming )

> maxNumExecutorFailures should be computed with 
> spark.streaming.dynamicAllocation.maxExecutors in ApplicationMaster for 
> streaming 
> -
>
> Key: SPARK-26941
> URL: https://issues.apache.org/jira/browse/SPARK-26941
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.1.0, 2.4.0
>Reporter: liupengcheng
>Priority: Major
>
> Currently, when streaming dynamic allocation is enabled for streaming 
> applications, the maxNumExecutorFailures in ApplicationMaster is still 
> computed with `spark.dynamicAllocation.maxExecutors`. 
> Actually, we should consider `spark.streaming.dynamicAllocation.maxExecutors` 
> instead.
> Related codes:
> {code:java}
> private val maxNumExecutorFailures = {
>   val effectiveNumExecutors =
> if (Utils.isStreamingDynamicAllocationEnabled(sparkConf)) {
>   sparkConf.get(STREAMING_DYN_ALLOCATION_MAX_EXECUTORS)
> } else if (Utils.isDynamicAllocationEnabled(sparkConf)) {
>   sparkConf.get(DYN_ALLOCATION_MAX_EXECUTORS)
> } else {
>   sparkConf.get(EXECUTOR_INSTANCES).getOrElse(0)
> }
>   // By default, effectiveNumExecutors is Int.MaxValue if dynamic allocation 
> is enabled. We need
>   // avoid the integer overflow here.
>   val defaultMaxNumExecutorFailures = math.max(3,
> if (effectiveNumExecutors > Int.MaxValue / 2) Int.MaxValue else (2 * 
> effectiveNumExecutors))
>   
> sparkConf.get(MAX_EXECUTOR_FAILURES).getOrElse(defaultMaxNumExecutorFailures)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26941) maxNumExecutorFailures should be computed with spark.streaming.dynamicAllocation.maxExecutors in streaming

2019-02-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26941:


Assignee: (was: Apache Spark)

> maxNumExecutorFailures should be computed with 
> spark.streaming.dynamicAllocation.maxExecutors in streaming 
> ---
>
> Key: SPARK-26941
> URL: https://issues.apache.org/jira/browse/SPARK-26941
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.4.0
>Reporter: liupengcheng
>Priority: Major
>
> Currently, when streaming dynamic allocation is enabled for streaming 
> applications, the maxNumExecutorFailures in ApplicationMaster is still 
> computed with `spark.dynamicAllocation.maxExecutors`. 
> Actually, we should consider `spark.streaming.dynamicAllocation.maxExecutors` 
> instead.
> Related codes:
> {code:java}
> private val maxNumExecutorFailures = {
>   val effectiveNumExecutors =
> if (Utils.isStreamingDynamicAllocationEnabled(sparkConf)) {
>   sparkConf.get(STREAMING_DYN_ALLOCATION_MAX_EXECUTORS)
> } else if (Utils.isDynamicAllocationEnabled(sparkConf)) {
>   sparkConf.get(DYN_ALLOCATION_MAX_EXECUTORS)
> } else {
>   sparkConf.get(EXECUTOR_INSTANCES).getOrElse(0)
> }
>   // By default, effectiveNumExecutors is Int.MaxValue if dynamic allocation 
> is enabled. We need
>   // avoid the integer overflow here.
>   val defaultMaxNumExecutorFailures = math.max(3,
> if (effectiveNumExecutors > Int.MaxValue / 2) Int.MaxValue else (2 * 
> effectiveNumExecutors))
>   
> sparkConf.get(MAX_EXECUTOR_FAILURES).getOrElse(defaultMaxNumExecutorFailures)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26941) maxNumExecutorFailures should be computed with spark.streaming.dynamicAllocation.maxExecutors in streaming

2019-02-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26941:


Assignee: Apache Spark

> maxNumExecutorFailures should be computed with 
> spark.streaming.dynamicAllocation.maxExecutors in streaming 
> ---
>
> Key: SPARK-26941
> URL: https://issues.apache.org/jira/browse/SPARK-26941
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.4.0
>Reporter: liupengcheng
>Assignee: Apache Spark
>Priority: Major
>
> Currently, when streaming dynamic allocation is enabled for streaming 
> applications, the maxNumExecutorFailures in ApplicationMaster is still 
> computed with `spark.dynamicAllocation.maxExecutors`. 
> Actually, we should consider `spark.streaming.dynamicAllocation.maxExecutors` 
> instead.
> Related codes:
> {code:java}
> private val maxNumExecutorFailures = {
>   val effectiveNumExecutors =
> if (Utils.isStreamingDynamicAllocationEnabled(sparkConf)) {
>   sparkConf.get(STREAMING_DYN_ALLOCATION_MAX_EXECUTORS)
> } else if (Utils.isDynamicAllocationEnabled(sparkConf)) {
>   sparkConf.get(DYN_ALLOCATION_MAX_EXECUTORS)
> } else {
>   sparkConf.get(EXECUTOR_INSTANCES).getOrElse(0)
> }
>   // By default, effectiveNumExecutors is Int.MaxValue if dynamic allocation 
> is enabled. We need
>   // avoid the integer overflow here.
>   val defaultMaxNumExecutorFailures = math.max(3,
> if (effectiveNumExecutors > Int.MaxValue / 2) Int.MaxValue else (2 * 
> effectiveNumExecutors))
>   
> sparkConf.get(MAX_EXECUTOR_FAILURES).getOrElse(defaultMaxNumExecutorFailures)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26941) maxNumExecutorFailures should be computed with spark.streaming.dynamicAllocation.maxExecutors in streaming

2019-02-20 Thread liupengcheng (JIRA)
liupengcheng created SPARK-26941:


 Summary: maxNumExecutorFailures should be computed with 
spark.streaming.dynamicAllocation.maxExecutors in streaming 
 Key: SPARK-26941
 URL: https://issues.apache.org/jira/browse/SPARK-26941
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0, 2.1.0
Reporter: liupengcheng


Currently, when streaming dynamic allocation is enabled for streaming 
applications, the maxNumExecutorFailures in ApplicationMaster is still computed 
with `spark.dynamicAllocation.maxExecutors`. 

Actually, we should consider `spark.streaming.dynamicAllocation.maxExecutors` 
instead.

Related codes:
{code:java}
private val maxNumExecutorFailures = {
  val effectiveNumExecutors =
if (Utils.isStreamingDynamicAllocationEnabled(sparkConf)) {
  sparkConf.get(STREAMING_DYN_ALLOCATION_MAX_EXECUTORS)
} else if (Utils.isDynamicAllocationEnabled(sparkConf)) {
  sparkConf.get(DYN_ALLOCATION_MAX_EXECUTORS)
} else {
  sparkConf.get(EXECUTOR_INSTANCES).getOrElse(0)
}
  // By default, effectiveNumExecutors is Int.MaxValue if dynamic allocation is 
enabled. We need
  // avoid the integer overflow here.
  val defaultMaxNumExecutorFailures = math.max(3,
if (effectiveNumExecutors > Int.MaxValue / 2) Int.MaxValue else (2 * 
effectiveNumExecutors))

  sparkConf.get(MAX_EXECUTOR_FAILURES).getOrElse(defaultMaxNumExecutorFailures)
{code}
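
A small worked example of the overflow guard in the snippet above (plain arithmetic, 
not Spark code; the values are illustrative):

{code:scala}
val effectiveNumExecutors = Int.MaxValue      // default when dynamic allocation sets no explicit max
val naiveDouble = 2 * effectiveNumExecutors   // overflows to -2
val guarded = math.max(3,
  if (effectiveNumExecutors > Int.MaxValue / 2) Int.MaxValue
  else 2 * effectiveNumExecutors)             // Int.MaxValue, as intended
println((naiveDouble, guarded))
{code}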



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22601) Data load is getting displayed successful on providing non existing hdfs file path

2019-02-20 Thread Sujith (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772921#comment-16772921
 ] 

Sujith commented on SPARK-22601:


*[gatorsmile|https://github.com/gatorsmile] [~srowen] please assign this JIRA to 
me, as the PR has already been merged. Thanks*

> Data load is getting displayed successful on providing non existing hdfs file 
> path
> --
>
> Key: SPARK-22601
> URL: https://issues.apache.org/jira/browse/SPARK-22601
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sujith
>Priority: Minor
> Fix For: 2.2.1
>
>
> Data load is reported as successful when a non-existing HDFS file path is 
> provided, whereas for a local path a proper error message is displayed.
> create table tb2 (a string, b int);
>  load data inpath 'hdfs://hacluster/data1.csv' into table tb2
> Note: data1.csv does not exist in HDFS.
> When a non-existing local file path is given, the error message below is 
> displayed:
> "LOAD DATA input path does not exist". Snapshots of the behaviour in Spark 2.1 
> and Spark 2.2 are attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation

2019-02-20 Thread Prashant Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772875#comment-16772875
 ] 

Prashant Sharma commented on SPARK-24432:
-

Any update on this work?

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Yinan Li
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation into the Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26901) Vectorized gapply should not prune columns

2019-02-20 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26901:
---

Assignee: Hyukjin Kwon

> Vectorized gapply should not prune columns
> --
>
> Key: SPARK-26901
> URL: https://issues.apache.org/jira/browse/SPARK-26901
> Project: Spark
>  Issue Type: Bug
>  Components: R, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently, if some columns can be pruned, the column pruning is pushed down 
> through {{FlatMapGroupsInRWithArrow}}.
> {code}
> explain(count(gapply(df,
>  "gear",
>  function(key, group) {
>data.frame(gear = key[[1]], disp = mean(group$disp))
>  },
>  structType("gear double, disp double"))), TRUE)
> {code}
> {code}
> *(4) HashAggregate(keys=[], functions=[count(1)], output=[count#64L])
> +- Exchange SinglePartition
>+- *(3) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#67L])
>   +- *(3) Project
>  +- FlatMapGroupsInRWithArrow [...]
> +- *(2) Sort [gear#9 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(gear#9, 200)
>   +- *(1) Project [gear#9]
>  +- *(1) Scan ExistingRDD 
> arrow[mpg#0,cyl#1,disp#2,hp#3,drat#4,wt#5,qsec#6,vs#7,am#8,gear#9,carb#10]
> {code}
> This causes corrupt values to be sent to R workers when the R native functions 
> are executed.
> {code}
>   c(5, 5, 5, 5, 5)
>   c(7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 
> 7.90505033345994e-323, 2.47032822920623e-323)
>   c(0, 0, 0, 0, 2.05578399548861e-314)
>   c(3.4483079184909e-313, 3.4483079184909e-313, 3.4483079184909e-313, 
> 5.31146529464635e-315, 0)
>   c(0, 0, 0, 0, -2.63230705887168e+228)
>   c(5, 5, 5, 0, 2.47032822920623e-323)
>   c(7.90505033345994e-323, 7.90505033345994e-323, 0, 0, 4.1978645388e-314)
>   c(0, 0, 0, 0, -2.18328530492023e+219)
>   c(3.4483079184909e-313, 5.31146529464635e-315, 0, 0, -2.63230127529109e+228)
>   c(0, 0, 0, 0, 2.47032822920623e-323)
>   c(5, 0, 0, 0, 4.1978645388e-314)
>   c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
>   c(7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 
> 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 
> 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 
> 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 
> 7.90505033345994e-323, 7.90505033345994e-323, 2.47032822920623e-323)
>   c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.05578399548861e-314)
>   c(3.4482690635875e-313, 3.4482690635875e-313, 3.4482690635875e-313, 
> 3.4482690635875e-313, 3.4482690635875e-313, 3.4482690635875e-313, 
> 3.4482690635875e-313, 3.4482690635875e-313, 3.4482690635875e-313, 
> 3.4482690635875e-313, 3.4482690635875e-313, 3.4482690635875e-313, 
> 3.4482690635875e-313, 5.30757980430645e-315, 0)
>   c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -2.73302088532611e+228)
>   c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 2.47032822920623e-323)
>   c(7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 
> 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 
> 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 
> 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 0, 0, 
> 4.1978645388e-314)
>   c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1.04669129845114e+219)
>   c(3.4482690635875e-313, 3.4482690635875e-313, 3.4482690635875e-313, 
> 3.4482690635875e-313, 3.4482690635875e-313, 3.4482690635875e-313, 
> 3.4482690635875e-313, 3.4482690635875e-313, 3.4482690635875e-313, 
> 3.4482690635875e-313, 3.4482690635875e-313, 5.30757980430645e-315, 0, 0, 
> -2.73301510174552e+228)
>   c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.47032822920623e-323)
>   c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 4.1978645388e-314)
> {code}
> which should be:
> {code}
>   c(21, 21, 22.8, 24.4, 22.8, 19.2, 17.8, 32.4, 30.4, 33.9, 27.3, 21.4)
>   c(6, 6, 4, 4, 4, 6, 6, 4, 4, 4, 4, 4)
>   c(160, 160, 108, 146.7, 140.8, 167.6, 167.6, 78.7, 75.7, 71.1, 79, 121)
>   c(110, 110, 93, 62, 95, 123, 123, 66, 52, 65, 66, 109)
>   c(3.9, 3.9, 3.85, 3.69, 3.92, 3.92, 3.92, 4.08, 4.93, 4.22, 4.08, 4.11)
>   c(2.62, 2.875, 2.32, 3.19, 3.15, 3.44, 3.44, 2.2, 1.615, 1.835, 1.935, 2.78)
>   c(16.46, 17.02, 18.61, 20, 22.9, 18.3, 18.9, 19.47, 18.52, 19.9, 18.9, 18.6)
>   c(0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
>   c(1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1)
>   c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4)
>   c(4, 4, 1, 2, 2, 4, 4, 1, 2, 1, 1, 2)
>   c(26, 30.4, 15.8, 19.7, 15)
>   c(4, 4, 8, 6, 8)
>   c(120.3, 95.1, 351, 145, 301)
>   c(91, 113, 264, 175, 335)
> 

[jira] [Resolved] (SPARK-26901) Vectorized gapply should not prune columns

2019-02-20 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26901.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23810
[https://github.com/apache/spark/pull/23810]

> Vectorized gapply should not prune columns
> --
>
> Key: SPARK-26901
> URL: https://issues.apache.org/jira/browse/SPARK-26901
> Project: Spark
>  Issue Type: Bug
>  Components: R, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, if some columns can be pruned, the column pruning is pushed down 
> through {{FlatMapGroupsInRWithArrow}}.
> {code}
> explain(count(gapply(df,
>  "gear",
>  function(key, group) {
>data.frame(gear = key[[1]], disp = mean(group$disp))
>  },
>  structType("gear double, disp double"))), TRUE)
> {code}
> {code}
> *(4) HashAggregate(keys=[], functions=[count(1)], output=[count#64L])
> +- Exchange SinglePartition
>+- *(3) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#67L])
>   +- *(3) Project
>  +- FlatMapGroupsInRWithArrow [...]
> +- *(2) Sort [gear#9 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(gear#9, 200)
>   +- *(1) Project [gear#9]
>  +- *(1) Scan ExistingRDD 
> arrow[mpg#0,cyl#1,disp#2,hp#3,drat#4,wt#5,qsec#6,vs#7,am#8,gear#9,carb#10]
> {code}
> This causes corrupt values to be sent to R workers when the R native functions 
> are executed.
> {code}
>   c(5, 5, 5, 5, 5)
>   c(7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 
> 7.90505033345994e-323, 2.47032822920623e-323)
>   c(0, 0, 0, 0, 2.05578399548861e-314)
>   c(3.4483079184909e-313, 3.4483079184909e-313, 3.4483079184909e-313, 
> 5.31146529464635e-315, 0)
>   c(0, 0, 0, 0, -2.63230705887168e+228)
>   c(5, 5, 5, 0, 2.47032822920623e-323)
>   c(7.90505033345994e-323, 7.90505033345994e-323, 0, 0, 4.1978645388e-314)
>   c(0, 0, 0, 0, -2.18328530492023e+219)
>   c(3.4483079184909e-313, 5.31146529464635e-315, 0, 0, -2.63230127529109e+228)
>   c(0, 0, 0, 0, 2.47032822920623e-323)
>   c(5, 0, 0, 0, 4.1978645388e-314)
>   c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
>   c(7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 
> 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 
> 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 
> 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 
> 7.90505033345994e-323, 7.90505033345994e-323, 2.47032822920623e-323)
>   c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.05578399548861e-314)
>   c(3.4482690635875e-313, 3.4482690635875e-313, 3.4482690635875e-313, 
> 3.4482690635875e-313, 3.4482690635875e-313, 3.4482690635875e-313, 
> 3.4482690635875e-313, 3.4482690635875e-313, 3.4482690635875e-313, 
> 3.4482690635875e-313, 3.4482690635875e-313, 3.4482690635875e-313, 
> 3.4482690635875e-313, 5.30757980430645e-315, 0)
>   c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -2.73302088532611e+228)
>   c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 2.47032822920623e-323)
>   c(7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 
> 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 
> 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 
> 7.90505033345994e-323, 7.90505033345994e-323, 7.90505033345994e-323, 0, 0, 
> 4.1978645388e-314)
>   c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1.04669129845114e+219)
>   c(3.4482690635875e-313, 3.4482690635875e-313, 3.4482690635875e-313, 
> 3.4482690635875e-313, 3.4482690635875e-313, 3.4482690635875e-313, 
> 3.4482690635875e-313, 3.4482690635875e-313, 3.4482690635875e-313, 
> 3.4482690635875e-313, 3.4482690635875e-313, 5.30757980430645e-315, 0, 0, 
> -2.73301510174552e+228)
>   c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2.47032822920623e-323)
>   c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 4.1978645388e-314)
> {code}
> which should be:
> {code}
>   c(21, 21, 22.8, 24.4, 22.8, 19.2, 17.8, 32.4, 30.4, 33.9, 27.3, 21.4)
>   c(6, 6, 4, 4, 4, 6, 6, 4, 4, 4, 4, 4)
>   c(160, 160, 108, 146.7, 140.8, 167.6, 167.6, 78.7, 75.7, 71.1, 79, 121)
>   c(110, 110, 93, 62, 95, 123, 123, 66, 52, 65, 66, 109)
>   c(3.9, 3.9, 3.85, 3.69, 3.92, 3.92, 3.92, 4.08, 4.93, 4.22, 4.08, 4.11)
>   c(2.62, 2.875, 2.32, 3.19, 3.15, 3.44, 3.44, 2.2, 1.615, 1.835, 1.935, 2.78)
>   c(16.46, 17.02, 18.61, 20, 22.9, 18.3, 18.9, 19.47, 18.52, 19.9, 18.9, 18.6)
>   c(0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
>   c(1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1)
>   c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4)
>   c(4, 4, 1, 2, 2, 4, 4, 

[jira] [Updated] (SPARK-26940) Observed greater deviation on big endian platform for SingletonReplSuite test case

2019-02-20 Thread Anuja Jakhade (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuja Jakhade updated SPARK-26940:
--
Summary: Observed greater deviation on big endian platform for 
SingletonReplSuite test case  (was: Observed greater deviation on big endian 
for SingletonReplSuite test case)

> Observed greater deviation on big endian platform for SingletonReplSuite test 
> case
> --
>
> Key: SPARK-26940
> URL: https://issues.apache.org/jira/browse/SPARK-26940
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.2
> Environment: Ubuntu 16.04 LTS
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 Linux (JIT enabled, AOT 
> enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>Reporter: Anuja Jakhade
>Priority: Major
> Attachments: failure_log.txt
>
>
> I have built Apache Spark v2.3.2 on Big Endian with AdoptJDK OpenJ9 1.8.0_202.
> My build is successful. However, while running the Scala tests of the "*Spark 
> Project REPL*" module, I am facing failures at SingletonReplSuite, with the error 
> log attached below.
> The deviation observed on big endian is greater than the acceptable deviation 
> 0.2.
> How efficient is it to increase the deviation defined in 
> SingletonReplSuite.scala
> Can this be fixed? 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26940) Observed greater deviation on big endian platform for SingletonReplSuite test case

2019-02-20 Thread Anuja Jakhade (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuja Jakhade updated SPARK-26940:
--
Description: 
I have built Apache Spark v2.3.2 on Big Endian with AdoptJDK OpenJ9 1.8.0_202.

My build is successful. However while running the scala tests of "*Spark 
Project REPL*" module, I am facing failures at SingletonReplSuite with error 
log as attached.

The deviation observed on big endian is greater than the acceptable deviation 
0.2.

How efficient is it to increase the deviation defined in 
SingletonReplSuite.scala

Can this be fixed? 

 

  was:
I have built Apache Spark v2.3.2 on Big Endian with AdoptJDK OpenJ9 1.8.0_202.

My build is successful. However while running the scala tests of "*Spark 
Project REPL*" module. I am facing failures at SingletonReplSuite with error 
log as attached below 

The deviation observed on big endian is greater than the acceptable deviation 
0.2.

How efficient is it to increase the deviation defined in 
SingletonReplSuite.scala

Can this be fixed? 

 


> Observed greater deviation on big endian platform for SingletonReplSuite test 
> case
> --
>
> Key: SPARK-26940
> URL: https://issues.apache.org/jira/browse/SPARK-26940
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.2
> Environment: Ubuntu 16.04 LTS
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 Linux (JIT enabled, AOT 
> enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>Reporter: Anuja Jakhade
>Priority: Major
> Attachments: failure_log.txt
>
>
> I have built Apache Spark v2.3.2 on Big Endian with AdoptJDK OpenJ9 1.8.0_202.
> My build is successful. However while running the scala tests of "*Spark 
> Project REPL*" module, I am facing failures at SingletonReplSuite with error 
> log as attached.
> The deviation observed on big endian is greater than the acceptable deviation 
> 0.2.
> How efficient is it to increase the deviation defined in 
> SingletonReplSuite.scala
> Can this be fixed? 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26940) Observed greater deviation on big endian platform for SingletonReplSuite test case

2019-02-20 Thread Anuja Jakhade (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuja Jakhade updated SPARK-26940:
--
Description: 
I have built Apache Spark v2.3.2 on Big Endian platform with AdoptJDK OpenJ9 
1.8.0_202.

My build is successful. However while running the scala tests of "*Spark 
Project REPL*" module, I am facing failures at SingletonReplSuite with error 
log as attached.

The deviation observed on big endian is greater than the acceptable deviation 
0.2.

How efficient is it to increase the deviation defined in 
SingletonReplSuite.scala

Can this be fixed? 

 

  was:
I have built Apache Spark v2.3.2 on Big Endian with AdoptJDK OpenJ9 1.8.0_202.

My build is successful. However while running the scala tests of "*Spark 
Project REPL*" module, I am facing failures at SingletonReplSuite with error 
log as attached.

The deviation observed on big endian is greater than the acceptable deviation 
0.2.

How efficient is it to increase the deviation defined in 
SingletonReplSuite.scala

Can this be fixed? 

 


> Observed greater deviation on big endian platform for SingletonReplSuite test 
> case
> --
>
> Key: SPARK-26940
> URL: https://issues.apache.org/jira/browse/SPARK-26940
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.2
> Environment: Ubuntu 16.04 LTS
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 Linux (JIT enabled, AOT 
> enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>Reporter: Anuja Jakhade
>Priority: Major
> Attachments: failure_log.txt
>
>
> I have built Apache Spark v2.3.2 on Big Endian platform with AdoptJDK OpenJ9 
> 1.8.0_202.
> My build is successful. However while running the scala tests of "*Spark 
> Project REPL*" module, I am facing failures at SingletonReplSuite with error 
> log as attached.
> The deviation observed on big endian is greater than the acceptable deviation 
> 0.2.
> How efficient is it to increase the deviation defined in 
> SingletonReplSuite.scala
> Can this be fixed? 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26940) Observed greater deviation on big endian for SingletonReplSuite test case

2019-02-20 Thread Anuja Jakhade (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuja Jakhade updated SPARK-26940:
--
Attachment: (was: failure_log)

> Observed greater deviation on big endian for SingletonReplSuite test case
> -
>
> Key: SPARK-26940
> URL: https://issues.apache.org/jira/browse/SPARK-26940
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.2
> Environment: Ubuntu 16.04 LTS
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 Linux (JIT enabled, AOT 
> enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>Reporter: Anuja Jakhade
>Priority: Major
> Attachments: failure_log.txt
>
>
> I have built Apache Spark v2.3.2 on Big Endian with AdoptJDK OpenJ9 1.8.0_202.
> My build is successful. However while running the scala tests of "*Spark 
> Project REPL*" module. I am facing failures at SingletonReplSuite with error 
> log as attached below 
> The deviation observed on big endian is greater than the acceptable deviation 
> 0.2.
> How efficient is it to increase the deviation defined in 
> SingletonReplSuite.scala
> Can this be fixed? 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26940) Observed greater deviation on big endian for SingletonReplSuite test case

2019-02-20 Thread Anuja Jakhade (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuja Jakhade updated SPARK-26940:
--
Attachment: failure_log.txt

> Observed greater deviation on big endian for SingletonReplSuite test case
> -
>
> Key: SPARK-26940
> URL: https://issues.apache.org/jira/browse/SPARK-26940
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.2
> Environment: Ubuntu 16.04 LTS
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 Linux (JIT enabled, AOT 
> enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>Reporter: Anuja Jakhade
>Priority: Major
> Attachments: failure_log.txt
>
>
> I have built Apache Spark v2.3.2 on Big Endian with AdoptJDK OpenJ9 1.8.0_202.
> My build is successful. However while running the scala tests of "*Spark 
> Project REPL*" module. I am facing failures at SingletonReplSuite with error 
> log as attached below 
> The deviation observed on big endian is greater than the acceptable deviation 
> 0.2.
> How efficient is it to increase the deviation defined in 
> SingletonReplSuite.scala
> Can this be fixed? 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26940) Observed greater deviation on big endian for SingletonReplSuite test case

2019-02-20 Thread Anuja Jakhade (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuja Jakhade updated SPARK-26940:
--
Attachment: failure_log

> Observed greater deviation on big endian for SingletonReplSuite test case
> -
>
> Key: SPARK-26940
> URL: https://issues.apache.org/jira/browse/SPARK-26940
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.2
> Environment: Ubuntu 16.04 LTS
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 Linux (JIT enabled, AOT 
> enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>Reporter: Anuja Jakhade
>Priority: Major
> Attachments: failure_log
>
>
> I have built Apache Spark v2.3.2 on Big Endian with AdoptJDK OpenJ9 1.8.0_202.
> My build is successful. However while running the scala tests of "*Spark 
> Project REPL*" module. I am facing failures at SingletonReplSuite with error 
> log as attached below 
> The deviation observed on big endian is greater than the acceptable deviation 
> 0.2.
> How efficient is it to increase the deviation defined in 
> SingletonReplSuite.scala
> Can this be fixed? 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26940) Observed greater deviation on big endian for SingletonReplSuite test case

2019-02-20 Thread Anuja Jakhade (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuja Jakhade updated SPARK-26940:
--
Description: 
I have built Apache Spark v2.3.2 on Big Endian with AdoptJDK OpenJ9 1.8.0_202.

My build is successful. However while running the scala tests of "*Spark 
Project REPL*" module. I am facing failures at SingletonReplSuite with error 
log as attached below 

The deviation observed on big endian is greater than the acceptable deviation 
0.2.

How efficient is it to increase the deviation defined in 
SingletonReplSuite.scala

Can this be fixed? 

 

  was:
I have built Apache Spark v2.3.2 on Big Endian with AdoptJDK OpenJ9 1.8.0_202.

My build is successful. However while running the scala tests of "*Spark 
Project REPL*" module. I am facing failures at SingletonReplSuite with error 
log as below 

 
 - should clone and clean line object in ClosureCleaner *** FAILED ***
 isContain was true Interpreter output contained 'AssertionError':

scala> import org.apache.spark.rdd.RDD

scala>
 scala> lines: org.apache.spark.rdd.RDD[String] = pom.xml MapPartitionsRDD[46] 
at textFile at <console>:40

scala> defined class Data

scala> dataRDD: org.apache.spark.rdd.RDD[Data] = MapPartitionsRDD[47] at map at 
<console>:43

scala> res28: Long = 180

scala> repartitioned: org.apache.spark.rdd.RDD[Data] = MapPartitionsRDD[51] at 
repartition at <console>:41

scala> res29: Long = 180

scala>
 scala> | | getCacheSize: (rdd: org.apache.spark.rdd.RDD[_])Long

scala> cacheSize1: Long = 24608

scala> cacheSize2: Long = 17768

scala>
 scala>
 scala> deviation: Double = 0.2779583875162549

scala> | java.lang.AssertionError: assertion failed: deviation too large: 
0.2779583875162549, first size: 24608, second size: 17768
 at scala.Predef$.assert(Predef.scala:170)
 ... 46 elided

scala> | _result_1550641172995: Int = 1

scala> (SingletonReplSuite.scala:121)

 

The deviation observed on big endian is greater than the acceptable deviation 
0.2.

How efficient is it to increase the deviation defined in 
SingletonReplSuite.scala

Can this be fixed? 

 

Summary: Observed greater deviation on big endian for 
SingletonReplSuite test case  (was: Observed greater deviation Big Endian for 
SingletonReplSuite test case)

> Observed greater deviation on big endian for SingletonReplSuite test case
> -
>
> Key: SPARK-26940
> URL: https://issues.apache.org/jira/browse/SPARK-26940
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.2
> Environment: Ubuntu 16.04 LTS
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 Linux (JIT enabled, AOT 
> enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>Reporter: Anuja Jakhade
>Priority: Major
>
> I have built Apache Spark v2.3.2 on Big Endian with AdoptJDK OpenJ9 1.8.0_202.
> My build is successful. However while running the scala tests of "*Spark 
> Project REPL*" module. I am facing failures at SingletonReplSuite with error 
> log as attached below 
> The deviation observed on big endian is greater than the acceptable deviation 
> 0.2.
> How efficient is it to increase the deviation defined in 
> SingletonReplSuite.scala
> Can this be fixed? 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26930) Tests in ParquetFilterSuite don't verify filter class

2019-02-20 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772829#comment-16772829
 ] 

Nandor Kollar commented on SPARK-26930:
---

I don't know either, but I feel that the second option might be simpler to 
implement. I can give it a shot.

> Tests in ParquetFilterSuite don't verify filter class
> -
>
> Key: SPARK-26930
> URL: https://issues.apache.org/jira/browse/SPARK-26930
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Nandor Kollar
>Priority: Minor
>
> While investigating Parquet predicate pushdown test cases, I noticed that 
> several tests seems to be broken, they don't test what they were originally 
> intended to. Most of the verification ends up in one of the overloaded 
> checkFilterPredicate functions, which supposed to test if a given filter 
> class is generated or not with this call: {{maybeFilter.exists(_.getClass === 
> filterClass)}}, but on one side an assert is missing from here, on the other 
> side, the filters are more complicated, for example equality is checked with 
> an 'and' wrapping not null check along with an equality check for the given 
> value. 'Exists' function call won't help with these compounds filters, since 
> they are not collection instances.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26940) Observed greater deviation Big Endian for SingletonReplSuite test case

2019-02-20 Thread Anuja Jakhade (JIRA)
Anuja Jakhade created SPARK-26940:
-

 Summary: Observed greater deviation Big Endian for 
SingletonReplSuite test case
 Key: SPARK-26940
 URL: https://issues.apache.org/jira/browse/SPARK-26940
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.3.2
 Environment: Ubuntu 16.04 LTS

openjdk version "1.8.0_202"
OpenJDK Runtime Environment (build 1.8.0_202-b08)
Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 Linux (JIT enabled, AOT 
enabled)
OpenJ9 - 90dd8cb40
OMR - d2f4534b
JCL - d002501a90 based on jdk8u202-b08)
Reporter: Anuja Jakhade


I have built Apache Spark v2.3.2 on Big Endian with AdoptJDK OpenJ9 1.8.0_202.

My build is successful. However, while running the Scala tests of the "*Spark 
Project REPL*" module, I am facing failures at SingletonReplSuite with the 
error log below:

 
 - should clone and clean line object in ClosureCleaner *** FAILED ***
 isContain was true Interpreter output contained 'AssertionError':

scala> import org.apache.spark.rdd.RDD

scala>
 scala> lines: org.apache.spark.rdd.RDD[String] = pom.xml MapPartitionsRDD[46] 
at textFile at <console>:40

scala> defined class Data

scala> dataRDD: org.apache.spark.rdd.RDD[Data] = MapPartitionsRDD[47] at map at 
<console>:43

scala> res28: Long = 180

scala> repartitioned: org.apache.spark.rdd.RDD[Data] = MapPartitionsRDD[51] at 
repartition at <console>:41

scala> res29: Long = 180

scala>
 scala> | | getCacheSize: (rdd: org.apache.spark.rdd.RDD[_])Long

scala> cacheSize1: Long = 24608

scala> cacheSize2: Long = 17768

scala>
 scala>
 scala> deviation: Double = 0.2779583875162549

scala> | java.lang.AssertionError: assertion failed: deviation too large: 
0.2779583875162549, first size: 24608, second size: 17768
 at scala.Predef$.assert(Predef.scala:170)
 ... 46 elided

scala> | _result_1550641172995: Int = 1

scala> (SingletonReplSuite.scala:121)

 

The deviation observed on big endian is greater than the acceptable deviation 
of 0.2.

How reasonable would it be to increase the deviation threshold defined in 
SingletonReplSuite.scala?

Can this be fixed?
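
For reference, here is a minimal, self-contained sketch of the kind of check 
that is failing, assuming the deviation is computed relative to the first cache 
size (the sizes are taken from the log above; the actual code in 
SingletonReplSuite.scala may differ slightly):

{code:scala}
// Minimal sketch, not the actual SingletonReplSuite code: reproduce the
// deviation computation with the cache sizes reported in the failure log.
object DeviationCheck {
  def main(args: Array[String]): Unit = {
    val cacheSize1 = 24608L
    val cacheSize2 = 17768L
    // Relative deviation between the two cached RDD size estimates.
    val deviation = math.abs(cacheSize1 - cacheSize2).toDouble / cacheSize1
    println(deviation) // 0.2779583875162549, which exceeds the 0.2 tolerance
    assert(deviation < 0.2,
      s"deviation too large: $deviation, first size: $cacheSize1, second size: $cacheSize2")
  }
}
{code}

Whether the right fix is to relax the 0.2 tolerance or to make the cache-size 
estimate consistent on big endian is exactly the open question here.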

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26930) Tests in ParquetFilterSuite don't verify filter class

2019-02-20 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772826#comment-16772826
 ] 

Hyukjin Kwon commented on SPARK-26930:
--

I am not sure which way would be more minimal and simple. Since it's test 
code, whichever approach is more minimal would be preferred.

> Tests in ParquetFilterSuite don't verify filter class
> -
>
> Key: SPARK-26930
> URL: https://issues.apache.org/jira/browse/SPARK-26930
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Nandor Kollar
>Priority: Minor
>
> While investigating Parquet predicate pushdown test cases, I noticed that 
> several tests seems to be broken, they don't test what they were originally 
> intended to. Most of the verification ends up in one of the overloaded 
> checkFilterPredicate functions, which supposed to test if a given filter 
> class is generated or not with this call: {{maybeFilter.exists(_.getClass === 
> filterClass)}}, but on one side an assert is missing from here, on the other 
> side, the filters are more complicated, for example equality is checked with 
> an 'and' wrapping not null check along with an equality check for the given 
> value. 'Exists' function call won't help with these compounds filters, since 
> they are not collection instances.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26930) Tests in ParquetFilterSuite don't verify filter class

2019-02-20 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772823#comment-16772823
 ] 

Nandor Kollar commented on SPARK-26930:
---

What do you think is the better approach? Testing for the entire expression 
(verifying the and(null check, filter) compound), or simply searching for the 
class in the expression tree (I think the relevant filter should be somewhere 
in the leaf nodes)?
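
For illustration, a rough sketch of what the second option could look like, 
using parquet-mr's {{FilterApi}}/{{Operators}} classes and a hypothetical 
integer column (this is not the suite's actual code, just a sketch of the 
idea):

{code:scala}
import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate, Operators}

object LeafFilterSearch {
  // Collect the leaf predicates of a (possibly compound) FilterPredicate so a
  // test can assert that the expected filter class appears somewhere in the tree.
  def leaves(p: FilterPredicate): Seq[FilterPredicate] = p match {
    case a: Operators.And => leaves(a.getLeft) ++ leaves(a.getRight)
    case o: Operators.Or  => leaves(o.getLeft) ++ leaves(o.getRight)
    case n: Operators.Not => leaves(n.getPredicate)
    case leaf             => Seq(leaf)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical pushed-down predicate for something like 'a === 1:
    // an equality wrapped in an 'and' together with a not-null check.
    val col = FilterApi.intColumn("a")
    val compound: FilterPredicate = FilterApi.and(
      FilterApi.notEq(col, null.asInstanceOf[Integer]),
      FilterApi.eq(col, Integer.valueOf(1)))

    // Checking only the top-level class misses the Eq buried inside the And...
    val topLevel = Some(compound).exists(_.getClass == classOf[Operators.Eq[_]])
    // ...while searching the leaves finds it.
    val inLeaves =
      Some(compound).exists(p => leaves(p).exists(_.getClass == classOf[Operators.Eq[_]]))
    println(s"top-level match: $topLevel, leaf match: $inLeaves") // false, true
  }
}
{code}

The first option would instead build the full expected predicate (for example 
with {{FilterApi.and}}) and compare it structurally, which is stricter but more 
verbose per test case.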

> Tests in ParquetFilterSuite don't verify filter class
> -
>
> Key: SPARK-26930
> URL: https://issues.apache.org/jira/browse/SPARK-26930
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Nandor Kollar
>Priority: Minor
>
> While investigating Parquet predicate pushdown test cases, I noticed that 
> several tests seems to be broken, they don't test what they were originally 
> intended to. Most of the verification ends up in one of the overloaded 
> checkFilterPredicate functions, which supposed to test if a given filter 
> class is generated or not with this call: {{maybeFilter.exists(_.getClass === 
> filterClass)}}, but on one side an assert is missing from here, on the other 
> side, the filters are more complicated, for example equality is checked with 
> an 'and' wrapping not null check along with an equality check for the given 
> value. 'Exists' function call won't help with these compounds filters, since 
> they are not collection instances.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26930) Tests in ParquetFilterSuite don't verify filter class

2019-02-20 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772812#comment-16772812
 ] 

Hyukjin Kwon commented on SPARK-26930:
--

Yea, {{IsNotNull}} will be inserted (see also 
https://github.com/apache/spark/commit/ef77003178eb5cdcb4fe519fc540917656c5d577).
Looks like we should fix it anyway.

> Tests in ParquetFilterSuite don't verify filter class
> -
>
> Key: SPARK-26930
> URL: https://issues.apache.org/jira/browse/SPARK-26930
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Nandor Kollar
>Priority: Minor
>
> While investigating Parquet predicate pushdown test cases, I noticed that 
> several tests seems to be broken, they don't test what they were originally 
> intended to. Most of the verification ends up in one of the overloaded 
> checkFilterPredicate functions, which supposed to test if a given filter 
> class is generated or not with this call: {{maybeFilter.exists(_.getClass === 
> filterClass)}}, but on one side an assert is missing from here, on the other 
> side, the filters are more complicated, for example equality is checked with 
> an 'and' wrapping not null check along with an equality check for the given 
> value. 'Exists' function call won't help with these compounds filters, since 
> they are not collection instances.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26930) Tests in ParquetFilterSuite don't verify filter class

2019-02-20 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772806#comment-16772806
 ] 

Nandor Kollar commented on SPARK-26930:
---

Thanks [~hyukjin.kwon] for taking a look at this Jira. That's correct, an 
assert is missing from there, but another problem is that if I put it there, 
the tests will fail. They won't fail because of a bug in the production code, 
but because the expressions to test for are not just simple eq, lt, etc. 
predicates, but more complex expressions like and(eq, lt). I guess the intent 
of that exists call was to find the relevant class in the expression, but 
Option's exists doesn't do this; one has to manually iterate through the 
expression tree. And sorry about the ambiguous description! :) Anyway, I don't 
think this is a serious issue, though it would be nice to improve these tests.

> Tests in ParquetFilterSuite don't verify filter class
> -
>
> Key: SPARK-26930
> URL: https://issues.apache.org/jira/browse/SPARK-26930
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Nandor Kollar
>Priority: Minor
>
> While investigating Parquet predicate pushdown test cases, I noticed that 
> several tests seems to be broken, they don't test what they were originally 
> intended to. Most of the verification ends up in one of the overloaded 
> checkFilterPredicate functions, which supposed to test if a given filter 
> class is generated or not with this call: {{maybeFilter.exists(_.getClass === 
> filterClass)}}, but on one side an assert is missing from here, on the other 
> side, the filters are more complicated, for example equality is checked with 
> an 'and' wrapping not null check along with an equality check for the given 
> value. 'Exists' function call won't help with these compounds filters, since 
> they are not collection instances.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26930) Tests in ParquetFilterSuite don't verify filter class

2019-02-20 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772795#comment-16772795
 ] 

Hyukjin Kwon edited comment on SPARK-26930 at 2/20/19 9:15 AM:
---

Ah, gotcha. {{maybeFilter.exists(_.getClass === filterClass)}} doesn't check 
anything, and an {{assert}} should be added at line 117 of 
{{ParquetFilterSuite.scala}}.


was (Author: hyukjin.kwon):
Ah, gotya {{maybeFilter.exists(_.getClass === filterClass)}} doesn't check 
anything and {{assert}} should be added in {{ParquetFilterSuite.scala}}s 117 
line

> Tests in ParquetFilterSuite don't verify filter class
> -
>
> Key: SPARK-26930
> URL: https://issues.apache.org/jira/browse/SPARK-26930
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Nandor Kollar
>Priority: Minor
>
> While investigating Parquet predicate pushdown test cases, I noticed that 
> several tests seems to be broken, they don't test what they were originally 
> intended to. Most of the verification ends up in one of the overloaded 
> checkFilterPredicate functions, which supposed to test if a given filter 
> class is generated or not with this call: {{maybeFilter.exists(_.getClass === 
> filterClass)}}, but on one side an assert is missing from here, on the other 
> side, the filters are more complicated, for example equality is checked with 
> an 'and' wrapping not null check along with an equality check for the given 
> value. 'Exists' function call won't help with these compounds filters, since 
> they are not collection instances.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26930) Tests in ParquetFilterSuite don't verify filter class

2019-02-20 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772795#comment-16772795
 ] 

Hyukjin Kwon commented on SPARK-26930:
--

Ah, gotcha. {{maybeFilter.exists(_.getClass === filterClass)}} doesn't check 
anything, and an {{assert}} should be added at line 117 of 
{{ParquetFilterSuite.scala}}.
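
In other words, the result of the {{exists}} call is computed and then silently 
discarded. A hedged sketch of the fix (REPL-style, with stand-ins for the 
suite's {{maybeFilter}} and {{filterClass}} values, so not the actual 
ParquetFilterSuite code):

{code:scala}
import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate, Operators}

// Stand-in values, only to make the snippet self-contained.
val maybeFilter: Option[FilterPredicate] =
  Some(FilterApi.eq(FilterApi.intColumn("a"), Integer.valueOf(1)))
val filterClass: Class[_] = classOf[Operators.Eq[_]]

// Before: the Boolean result is dropped, so nothing is verified.
maybeFilter.exists(_.getClass == filterClass)

// After: wrapping it in assert makes a missing or unexpected filter class
// actually fail the test.
assert(maybeFilter.exists(_.getClass == filterClass),
  s"expected $filterClass, but got $maybeFilter")
{code}

On top of the missing assert, the compound-filter problem discussed in the 
other comments would still need a search through the expression tree rather 
than a top-level class check.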

> Tests in ParquetFilterSuite don't verify filter class
> -
>
> Key: SPARK-26930
> URL: https://issues.apache.org/jira/browse/SPARK-26930
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Nandor Kollar
>Priority: Minor
>
> While investigating Parquet predicate pushdown test cases, I noticed that 
> several tests seems to be broken, they don't test what they were originally 
> intended to. Most of the verification ends up in one of the overloaded 
> checkFilterPredicate functions, which supposed to test if a given filter 
> class is generated or not with this call: {{maybeFilter.exists(_.getClass === 
> filterClass)}}, but on one side an assert is missing from here, on the other 
> side, the filters are more complicated, for example equality is checked with 
> an 'and' wrapping not null check along with an equality check for the given 
> value. 'Exists' function call won't help with these compounds filters, since 
> they are not collection instances.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26939) Fix some outdated comments about task schedulers

2019-02-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26939:


Assignee: Apache Spark

> Fix some outdated comments about task schedulers
> 
>
> Key: SPARK-26939
> URL: https://issues.apache.org/jira/browse/SPARK-26939
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Assignee: Apache Spark
>Priority: Minor
>
> Some comments about task schedulers are outdated. They should be fixed.
> * TaskScheduler comments: currently implemented exclusively by 
>   org.apache.spark.scheduler.TaskSchedulerImpl. This is not true as of now.
> * YarnClusterScheduler comments: reference to ClusterScheduler which is not 
> used anymore.
> * TaskSetManager comments: method statusUpdate does not exist as of now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26939) Fix some outdated comments about task schedulers

2019-02-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26939:


Assignee: (was: Apache Spark)

> Fix some outdated comments about task schedulers
> 
>
> Key: SPARK-26939
> URL: https://issues.apache.org/jira/browse/SPARK-26939
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Minor
>
> Some comments about task schedulers are outdated. They should be fixed.
> * TaskScheduler comments: currently implemented exclusively by 
>   org.apache.spark.scheduler.TaskSchedulerImpl. This is not true as of now.
> * YarnClusterScheduler comments: reference to ClusterScheduler which is not 
> used anymore.
> * TaskSetManager comments: method statusUpdate does not exist as of now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26939) Fix some outdated comments about task schedulers

2019-02-20 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-26939:


 Summary: Fix some outdated comments about task schedulers
 Key: SPARK-26939
 URL: https://issues.apache.org/jira/browse/SPARK-26939
 Project: Spark
  Issue Type: Documentation
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Chenxiao Mao


Some comments about task schedulers are outdated. They should be fixed.

* TaskScheduler comments: currently implemented exclusively by 
  org.apache.spark.scheduler.TaskSchedulerImpl. This is not true as of now.
* YarnClusterScheduler comments: reference to ClusterScheduler which is not 
used anymore.
* TaskSetManager comments: method statusUpdate does not exist as of now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26930) Tests in ParquetFilterSuite don't verify filter class

2019-02-20 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated SPARK-26930:
--
Summary: Tests in ParquetFilterSuite don't verify filter class  (was: 
Several test cases in ParquetFilterSuite are broken)

> Tests in ParquetFilterSuite don't verify filter class
> -
>
> Key: SPARK-26930
> URL: https://issues.apache.org/jira/browse/SPARK-26930
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Nandor Kollar
>Priority: Minor
>
> While investigating Parquet predicate pushdown test cases, I noticed that 
> several tests seems to be broken, they don't test what they were originally 
> intended to. Most of the verification ends up in one of the overloaded 
> checkFilterPredicate functions, which supposed to test if a given filter 
> class is generated or not with this call: {{maybeFilter.exists(_.getClass === 
> filterClass)}}, but on one side an assert is missing from here, on the other 
> side, the filters are more complicated, for example equality is checked with 
> an 'and' wrapping not null check along with an equality check for the given 
> value. 'Exists' function call won't help with these compounds filters, since 
> they are not collection instances.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org