[jira] [Commented] (SPARK-26962) Windows Function LEAD in Spark SQL is not fetching consistent results.

2019-02-21 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774865#comment-16774865
 ] 

Hyukjin Kwon commented on SPARK-26962:
--

Are you able to show the results in the JIRA description? It would also be 
awesome to narrow down and find the root condition under which it reads the 
same data.

> Windows Function LEAD in Spark SQL is not fetching consistent results.
> --
>
> Key: SPARK-26962
> URL: https://issues.apache.org/jira/browse/SPARK-26962
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Shiva Sankari Perambalam
>Priority: Major
>
> Using a LEAD function on a DATETIME column is giving inconsistent results in 
> Spark SQL.
> {code:java}
> Lead(date) over (partition by id, code order by date){code}
> where date is a DATETIME, and id and code are Strings.
> {code:java}
> val testdf1= sparkSession.sql(s""" select date, lead(date) over (partition by 
> id, code order by date) as lead_date from foo"""){code}
> The result set sometimes has the same value as date instead of the 
> lead_date.
>  
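For illustration, a minimal sketch of the expected LEAD semantics to compare 
against, assuming a DataFrame {{df}} with the reporter's {{date}}, {{id}}, and 
{{code}} columns (hypothetical, not the reporter's actual job):

{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead}

// For each (id, code) partition ordered by date, lead_date should always be
// the next row's date (NULL on the last row), never the current row's date.
val w = Window.partitionBy("id", "code").orderBy("date")
val withLead = df.select(col("date"), lead("date", 1).over(w).as("lead_date"))

// Any row where lead_date equals date would exhibit the reported inconsistency
// (assuming dates within a partition are distinct).
withLead.where(col("lead_date") === col("date")).show()
{code}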






[jira] [Commented] (SPARK-26951) Should not throw KryoException when root cause is IOexception

2019-02-21 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774864#comment-16774864
 ] 

Hyukjin Kwon commented on SPARK-26951:
--

Can you describe why it should retry?

> Should not throw KryoException when root cause is IOexception
> -
>
> Key: SPARK-26951
> URL: https://issues.apache.org/jira/browse/SPARK-26951
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: zhoukang
>Priority: Major
>
> The job will fail with the exception below:
> {code:java}
> Job aborted due to stage failure: Task 1576 in stage 97.0 failed 4 times, 
> most recent failure: Lost task 1576.3 in stage 97.0 (TID 121949, xxx, 
> executor 14): com.esotericsoftware.kryo.KryoException: java.io.IOException: 
> Stream is corrupted. The lz4's magic number should be 
> LZ4Block(4c5a34426c6f636b), but received buffer's head bytes is 
> ().
>   at com.esotericsoftware.kryo.io.Input.fill(Input.java:166)
>   at com.esotericsoftware.kryo.io.Input.require(Input.java:196)
>   at com.esotericsoftware.kryo.io.Input.readVarInt(Input.java:373)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:127)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:804)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:244)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:180)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Stream is corrupted. The lz4's magic number 
> should be LZ4Block(4c5a34426c6f636b), but received buffer's head bytes is 
> ().
>   at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:169)
>   at 
> org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:127)
>   at com.esotericsoftware.kryo.io.Input.fill(Input.java:164)
>   ... 19 more
> Driver stacktrace:
> {code}
> For an IOException, the task should be retried instead of failing the job.
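For illustration, a minimal sketch of the kind of root-cause check the 
description implies (a hypothetical helper, not Spark's actual retry logic):

{code:scala}
import java.io.IOException
import com.esotericsoftware.kryo.KryoException

// Walk the cause chain of a KryoException and report whether the real failure
// is an IOException (e.g. a corrupted LZ4 stream), which could then be treated
// as a retryable read failure rather than a task-fatal serialization error.
def rootCauseIsIOException(t: Throwable): Boolean = t match {
  case _: IOException                         => true
  case k: KryoException if k.getCause != null => rootCauseIsIOException(k.getCause)
  case _                                      => false
}
{code}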






[jira] [Commented] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars

2019-02-21 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774859#comment-16774859
 ] 

Hyukjin Kwon commented on SPARK-26964:
--

I know that, practically, JSON is fine to store as a binary or string column. 
I want to be very sure that primitive support is something absolutely required 
and useful.

{quote}
Looking at the source code, it seems like all of these types have support in 
JacksonGenerator and JacksonParser, and so most of the work will be surfacing 
that, rather than entirely new code. Is there something you expect to be more 
intricate than additions to JsonToStructs and StructsToJson (and tests)? I'm 
considering having a look at this myself, but if your intuition implies that 
this is going to be a dead end, I will not.
{quote}

The core logic itself can be reused, but surfacing that code is the problem. 
When primitive arrays and maps were exposed, the community faced a lot of 
corner-case problems, for instance around how to handle corrupt records (Spark 
provides some options to handle those records). One PR had to be reverted 
recently, for instance; see https://github.com/apache/spark/pull/23665 . I 
guess it still needs a considerable amount of code (see 
https://github.com/apache/spark/pull/18875 , where we added MapType to one of 
these functions). One thing I am pretty sure of is that it would take some 
effort to write the code and get it into the codebase, so I am being cautious 
here.



> to_json/from_json do not match JSON spec due to not supporting scalars
> --
>
> Key: SPARK-26964
> URL: https://issues.apache.org/jira/browse/SPARK-26964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Huon Wilson
>Priority: Major
>
> Spark SQL's {{to_json}} and {{from_json}} currently support arrays and 
> objects, but not the scalar/primitive types. This doesn't match the JSON spec 
> on https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]: a 
> JSON document ({{json: element}}) consists of a value surrounded by 
> whitespace ({{element: ws value ws}}), where a value is an object or array 
> _or_ a number or string etc.:
> {code:none}
> value
> object
> array
> string
> number
> "true"
> "false"
> "null"
> {code}
> Having {{to_json}} and {{from_json}} support scalars would make them flexible 
> enough for a library I'm working on, where an arbitrary (user-supplied) 
> column needs to be turned into JSON.
> NB. these newer specs differ from the original 
> [RFC4627|https://tools.ietf.org/html/rfc4627] (which is now obsolete), which 
> (essentially) had {{value: object | array}}.
> This is related to SPARK-24391 and SPARK-25252, which added support for 
> arrays of scalars.






[jira] [Resolved] (SPARK-26930) Tests in ParquetFilterSuite don't verify filter class

2019-02-21 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26930.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23855
[https://github.com/apache/spark/pull/23855]

> Tests in ParquetFilterSuite don't verify filter class
> -
>
> Key: SPARK-26930
> URL: https://issues.apache.org/jira/browse/SPARK-26930
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
> Fix For: 3.0.0
>
>
> While investigating Parquet predicate pushdown test cases, I noticed that 
> several tests seem to be broken: they don't test what they were originally 
> intended to. Most of the verification ends up in one of the overloaded 
> checkFilterPredicate functions, which is supposed to test whether a given 
> filter class is generated, via this call: {{maybeFilter.exists(_.getClass === 
> filterClass)}}. However, on the one hand an assert is missing here, and on 
> the other hand the filters are more complicated; for example, equality is 
> checked with an 'and' wrapping a not-null check along with an equality check 
> for the given value. The 'exists' call won't help with these compound 
> filters, since they are not collection instances.
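For illustration, a minimal sketch of the kind of verification the description 
says is missing (the helper name is illustrative, not the suite's actual code, 
and ScalaTest's {{===}} is replaced by plain {{==}} to keep the sketch 
self-contained):

{code:scala}
// Hypothetical helper: actually assert (rather than merely compute) that a
// filter was pushed down and that it has the expected class. Compound filters
// such as and(notEq(a, null), eq(a, v)) would still need a recursive search.
def checkPushedFilterClass(maybeFilter: Option[AnyRef], filterClass: Class[_]): Unit = {
  assert(maybeFilter.exists(_.getClass == filterClass),
    s"expected a pushed-down ${filterClass.getSimpleName}, got $maybeFilter")
}
{code}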






[jira] [Commented] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins

2019-02-21 Thread Alessandro Bellina (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774800#comment-16774800
 ] 

Alessandro Bellina commented on SPARK-26944:


[~shivuson...@gmail.com] [~hyukjin.kwon] I added a link to the description. I 
think we should look at the Jenkins configuration to see what is getting copied 
to artifacts, though I believe you need special access for that.

> Python unit-tests.log not available in artifacts for a build in Jenkins
> ---
>
> Key: SPARK-26944
> URL: https://issues.apache.org/jira/browse/SPARK-26944
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Priority: Minor
>
> I had a pr where the python unit tests failed.  The tests point at the 
> `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, 
> but I can't get to that from jenkins UI it seems (are all prs writing to the 
> same file?).
> {code:java}
> 
> Running PySpark tests
> 
> Running PySpark tests. Output is in 
> /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code}
> For reference, please see this build: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console
> This Jira is to make it available under the artifacts for each build.






[jira] [Assigned] (SPARK-26930) Tests in ParquetFilterSuite don't verify filter class

2019-02-21 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26930:


Assignee: Nandor Kollar

> Tests in ParquetFilterSuite don't verify filter class
> -
>
> Key: SPARK-26930
> URL: https://issues.apache.org/jira/browse/SPARK-26930
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
>
> While investigating Parquet predicate pushdown test cases, I noticed that 
> several tests seem to be broken: they don't test what they were originally 
> intended to. Most of the verification ends up in one of the overloaded 
> checkFilterPredicate functions, which is supposed to test whether a given 
> filter class is generated, via this call: {{maybeFilter.exists(_.getClass === 
> filterClass)}}. However, on the one hand an assert is missing here, and on 
> the other hand the filters are more complicated; for example, equality is 
> checked with an 'and' wrapping a not-null check along with an equality check 
> for the given value. The 'exists' call won't help with these compound 
> filters, since they are not collection instances.






[jira] [Updated] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins

2019-02-21 Thread Alessandro Bellina (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Bellina updated SPARK-26944:
---
Description: 
I had a pr where the python unit tests failed.  The tests point at the 
`/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, 
but I can't get to that from jenkins UI it seems (are all prs writing to the 
same file?).
{code:java}

Running PySpark tests

Running PySpark tests. Output is in 
/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code}
For reference, please see this build: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console

This Jira is to make it available under the artifacts for each build.

  was:
I had a pr where the python unit tests failed.  The tests point at the 
`/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, 
but I can't get to that from jenkins UI it seems (are all prs writing to the 
same file?).

This Jira is to make it available under the artifacts for each build.


> Python unit-tests.log not available in artifacts for a build in Jenkins
> ---
>
> Key: SPARK-26944
> URL: https://issues.apache.org/jira/browse/SPARK-26944
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Priority: Minor
>
> I had a pr where the python unit tests failed.  The tests point at the 
> `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, 
> but I can't get to that from jenkins UI it seems (are all prs writing to the 
> same file?).
> {code:java}
> 
> Running PySpark tests
> 
> Running PySpark tests. Output is in 
> /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code}
> For reference, please see this build: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console
> This Jira is to make it available under the artifacts for each build.






[jira] [Commented] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code

2019-02-21 Thread Udbhav Agrawal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774789#comment-16774789
 ] 

Udbhav Agrawal commented on SPARK-26365:


Can you specify some scenarios? I tried the same and I am able to get correct 
exit codes.

> spark-submit for k8s cluster doesn't propagate exit code
> 
>
> Key: SPARK-26365
> URL: https://issues.apache.org/jira/browse/SPARK-26365
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Submit
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Oscar Bonilla
>Priority: Minor
>
> When launching apps using spark-submit in a Kubernetes cluster, if the Spark 
> application fails (returns exit code 1, for example), spark-submit will 
> still exit gracefully and return exit code 0.
> This is problematic, since there's no way to know if there's been a problem 
> with the Spark application.






[jira] [Commented] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars

2019-02-21 Thread Huon Wilson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774783#comment-16774783
 ] 

Huon Wilson commented on SPARK-26964:
-

We wish to store columns as columns within a {{binary}}-based database (HBase), 
meaning encoding individual fields. JSON represents a non-horrible way of 
encoding values into the database, e.g. it allows handling from many 
languages/environments (and is even human readable), and is very convenient to 
handle with DataFrames. I don't know of another way to extract byte 
representations of individual columns that satisfies those constraints.

Looking at the source code, it seems like all of these types have support in 
JacksonGenerator and JacksonParser, and so most of the work will be surfacing 
that, rather than entirely new code. Is there something you expect to be more 
intricate than additions to JsonToStructs and StructsToJson (and tests)? I'm 
considering having a look at this myself, but if your intuition implies that 
this is going to be a dead end, I will not.
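For illustration, a sketch of the surface being requested, assuming a 
DataFrame {{df}} with columns {{a}} and {{b}} (the scalar calls are 
hypothetical; as of 2.4 only struct, array, and map columns are accepted):

{code:scala}
import org.apache.spark.sql.functions.{col, from_json, lit, struct, to_json}
import org.apache.spark.sql.types.IntegerType

// Supported today: struct (and array/map) columns round-trip through JSON text.
val asJson = df.select(to_json(struct(col("a"), col("b"))).as("json"))

// Requested by this ticket (hypothetical, not supported in 2.4): top-level
// scalars, which are valid JSON documents per RFC 8259.
// df.select(to_json(col("a")))                 // e.g. 1
// df.select(from_json(lit("1"), IntegerType))  // => 1
{code}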

> to_json/from_json do not match JSON spec due to not supporting scalars
> --
>
> Key: SPARK-26964
> URL: https://issues.apache.org/jira/browse/SPARK-26964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Huon Wilson
>Priority: Major
>
> Spark SQL's {{to_json}} and {{from_json}} currently support arrays and 
> objects, but not the scalar/primitive types. This doesn't match the JSON spec 
> on https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]: a 
> JSON document ({{json: element}}) consists of a value surrounded by 
> whitespace ({{element: ws value ws}}), where a value is an object or array 
> _or_ a number or string etc.:
> {code:none}
> value
> object
> array
> string
> number
> "true"
> "false"
> "null"
> {code}
> Having {{to_json}} and {{from_json}} support scalars would make them flexible 
> enough for a library I'm working on, where an arbitrary (user-supplied) 
> column needs to be turned into JSON.
> NB. these newer specs differ from the original 
> [RFC4627|https://tools.ietf.org/html/rfc4627] (which is now obsolete), which 
> (essentially) had {{value: object | array}}.
> This is related to SPARK-24391 and SPARK-25252, which added support for 
> arrays of scalars.






[jira] [Resolved] (SPARK-26942) spark v 2.3.2 test failure in hive module

2019-02-21 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-26942.
--
Resolution: Invalid

> spark v 2.3.2 test failure in hive module
> -
>
> Key: SPARK-26942
> URL: https://issues.apache.org/jira/browse/SPARK-26942
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.3.2
> Environment: ub 16.04
> 8GB ram
> 2 core machine .. 
> docker container
>Reporter: ketan kunde
>Priority: Major
>
> Hi,
> I have built Spark 2.3.2 on a big-endian system.
> I am now executing the test cases in the hive module.
> I encounter an issue related to the ORC format on big-endian while running the 
> test "test statistics of LogicalRelation converted from Hive serde tables".
> I want to know what the support for the ORC serde on big-endian systems is, 
> and, if it is supported, what the workaround is to get this test fixed?






[jira] [Commented] (SPARK-26942) spark v 2.3.2 test failure in hive module

2019-02-21 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774780#comment-16774780
 ] 

Takeshi Yamamuro commented on SPARK-26942:
--

Could you ask first on the Spark mailing list? If you can narrow down the 
issue, feel free to reopen it (you will need to describe it in more detail, 
though).

> spark v 2.3.2 test failure in hive module
> -
>
> Key: SPARK-26942
> URL: https://issues.apache.org/jira/browse/SPARK-26942
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.3.2
> Environment: ub 16.04
> 8GB ram
> 2 core machine .. 
> docker container
>Reporter: ketan kunde
>Priority: Major
>
> Hi,
> I have built Spark 2.3.2 on a big-endian system.
> I am now executing the test cases in the hive module.
> I encounter an issue related to the ORC format on big-endian while running the 
> test "test statistics of LogicalRelation converted from Hive serde tables".
> I want to know what the support for the ORC serde on big-endian systems is, 
> and, if it is supported, what the workaround is to get this test fixed?






[jira] [Created] (SPARK-26965) Makes ElementAt nullability more precise

2019-02-21 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-26965:


 Summary: Makes ElementAt nullability more precise
 Key: SPARK-26965
 URL: https://issues.apache.org/jira/browse/SPARK-26965
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Takeshi Yamamuro


In master, ElementAt nullable is always true;
https://github.com/apache/spark/blob/be1cadf16dc70e22eae144b3dfce9e269ef95acc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1977

But if the input key is foldable, we could make its nullability more precise.
This fix is the same as in SPARK-26747 and -SPARK-26637-.
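For illustration, a minimal sketch of the behaviour being refined, assuming a 
SparkSession named {{spark}}:

{code:scala}
import org.apache.spark.sql.functions.{col, element_at, lit, map}

val df = spark.range(1).select(map(lit("a"), lit(1)).as("m"))
// In 2.4 the result field is reported as nullable = true even though the
// foldable key "a" is always present; this ticket makes that more precise.
df.select(element_at(col("m"), "a")).printSchema()
{code}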






[jira] [Commented] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins

2019-02-21 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774768#comment-16774768
 ] 

Hyukjin Kwon commented on SPARK-26944:
--

Yes, can you point out your PR?

> Python unit-tests.log not available in artifacts for a build in Jenkins
> ---
>
> Key: SPARK-26944
> URL: https://issues.apache.org/jira/browse/SPARK-26944
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Priority: Minor
>
> I had a pr where the python unit tests failed.  The tests point at the 
> `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, 
> but I can't get to that from jenkins UI it seems (are all prs writing to the 
> same file?).
> This Jira is to make it available under the artifacts for each build.






[jira] [Commented] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars

2019-02-21 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774767#comment-16774767
 ] 

Hyukjin Kwon commented on SPARK-26964:
--

Can you describe the use case in more detail? Adding primitive types there 
requires a considerable amount of code to maintain. I want to see how much 
it's worth.

> to_json/from_json do not match JSON spec due to not supporting scalars
> --
>
> Key: SPARK-26964
> URL: https://issues.apache.org/jira/browse/SPARK-26964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Huon Wilson
>Priority: Major
>
> Spark SQL's {{to_json}} and {{from_json}} currently support arrays and 
> objects, but not the scalar/primitive types. This doesn't match the JSON spec 
> on https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]: a 
> JSON document ({{json: element}}) consists of a value surrounded by 
> whitespace ({{element: ws value ws}}), where a value is an object or array 
> _or_ a number or string etc.:
> {code:none}
> value
> object
> array
> string
> number
> "true"
> "false"
> "null"
> {code}
> Having {{to_json}} and {{from_json}} support scalars would make them flexible 
> enough for a library I'm working on, where an arbitrary (user-supplied) 
> column needs to be turned into JSON.
> NB. these newer specs differ from the original 
> [RFC4627|https://tools.ietf.org/html/rfc4627] (which is now obsolete), which 
> (essentially) had {{value: object | array}}.
> This is related to SPARK-24391 and SPARK-25252, which added support for 
> arrays of scalars.






[jira] [Assigned] (SPARK-26950) Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values

2019-02-21 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26950:
---

Assignee: Dongjoon Hyun

> Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values
> ---
>
> Key: SPARK-26950
> URL: https://issues.apache.org/jira/browse/SPARK-26950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.4, 2.4.2, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Apache Spark uses the predefined `Float.NaN` and `Double.NaN` for NaN values, 
> but there exist more NaN values with different binary representations.
> {code}
> scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array
> res1: Array[Byte] = Array(127, -64, 0, 0)
> scala> val x = java.lang.Float.intBitsToFloat(-6966608)
> x: Float = NaN
> scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array
> res2: Array[Byte] = Array(-1, -107, -78, -80)
> {code}
> `RandomDataGenerator` generates these NaN values. That is good, but it causes 
> `checkEvaluationWithUnsafeProjection` failures due to differences in the 
> `UnsafeRow` binary representation. The following is a UT failure instance. 
> This issue aims to fix this flakiness.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102528/testReport/
> {code}
> Failed
> org.apache.spark.sql.avro.AvroCatalystDataConversionSuite.flat schema 
> struct
>  with seed -81044812370056695
> {code}
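For illustration, a minimal sketch of the normalization this ticket describes 
(a sketch only, not the actual patch): collapse every NaN bit pattern to the 
canonical one before generating test data.

{code:scala}
// isNaN is true for every NaN bit pattern, so this maps e.g.
// java.lang.Float.intBitsToFloat(-6966608) to the canonical Float.NaN bits.
def canonicalizeFloat(f: Float): Float = if (f.isNaN) Float.NaN else f
def canonicalizeDouble(d: Double): Double = if (d.isNaN) Double.NaN else d

assert(java.lang.Float.floatToIntBits(canonicalizeFloat(
  java.lang.Float.intBitsToFloat(-6966608))) ==
  java.lang.Float.floatToIntBits(Float.NaN))
{code}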






[jira] [Resolved] (SPARK-26950) Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values

2019-02-21 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26950.
-
   Resolution: Fixed
Fix Version/s: 2.4.1
   3.0.0

Issue resolved by pull request 23851
[https://github.com/apache/spark/pull/23851]

> Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values
> ---
>
> Key: SPARK-26950
> URL: https://issues.apache.org/jira/browse/SPARK-26950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.4, 2.4.2, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0, 2.4.1
>
>
> Apache Spark uses the predefined `Float.NaN` and `Double.NaN` for NaN values, 
> but there exist more NaN values with different binary representations.
> {code}
> scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array
> res1: Array[Byte] = Array(127, -64, 0, 0)
> scala> val x = java.lang.Float.intBitsToFloat(-6966608)
> x: Float = NaN
> scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array
> res2: Array[Byte] = Array(-1, -107, -78, -80)
> {code}
> `RandomDataGenerator` generates these NaN values. That is good, but it causes 
> `checkEvaluationWithUnsafeProjection` failures due to differences in the 
> `UnsafeRow` binary representation. The following is a UT failure instance. 
> This issue aims to fix this flakiness.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102528/testReport/
> {code}
> Failed
> org.apache.spark.sql.avro.AvroCatalystDataConversionSuite.flat schema 
> struct
>  with seed -81044812370056695
> {code}






[jira] [Commented] (SPARK-26881) Scaling issue with Gramian computation for RowMatrix: too many results sent to driver

2019-02-21 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774750#comment-16774750
 ] 

Sean Owen commented on SPARK-26881:
---

I wouldn't make it configurable. I don't think it makes sense to set it such 
that the collect size is any smaller than the max driver result size, nor 
bigger. It's already kind of configurable by the max result size, although its 
meaning is only indirectly related.

> Scaling issue with Gramian computation for RowMatrix: too many results sent 
> to driver
> -
>
> Key: SPARK-26881
> URL: https://issues.apache.org/jira/browse/SPARK-26881
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Rafael RENAUDIN-AVINO
>Priority: Minor
>
> This issue hit me when running PCA on a large dataset (~1 billion rows, ~30k 
> columns).
> Computing the Gramian of a big RowMatrix reproduces the issue.
>  
> The problem arises in the treeAggregate phase of the Gramian matrix 
> computation: the results sent to the driver are enormous.
> A potential solution could be to replace the hard-coded depth (2) of the tree 
> aggregation with a heuristic computed from the number of partitions, the 
> driver max result size, and the memory size of the dense vectors being 
> aggregated; see below for more detail:
> (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size
> I have a potential fix ready (currently testing it at scale), but I'd like to 
> hear the community's opinion about such a fix to know whether it's worth 
> investing my time in a clean pull request.
>  
> Note that I only faced this issue with Spark 2.2, but I suspect it affects 
> later versions as well.
>  
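For illustration, a sketch of how such a heuristic could be computed from the 
inequality above (a hypothetical helper, not the reporter's actual patch):

{code:scala}
// Smallest treeAggregate depth such that
// numPartitions^(1/depth) * vectorSizeBytes <= maxResultSizeBytes.
def treeAggregateDepth(numPartitions: Int,
                       vectorSizeBytes: Long,
                       maxResultSizeBytes: Long): Int = {
  val ratio = maxResultSizeBytes.toDouble / vectorSizeBytes
  require(ratio > 1.0, "driver max result size must exceed one dense vector")
  math.max(2, math.ceil(math.log(numPartitions.toDouble) / math.log(ratio)).toInt)
}
{code}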






[jira] [Assigned] (SPARK-25097) Support prediction on single instance in KMeans/BiKMeans/GMM

2019-02-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25097:
-

Assignee: zhengruifeng

> Support prediction on single instance in KMeans/BiKMeans/GMM
> 
>
> Key: SPARK-25097
> URL: https://issues.apache.org/jira/browse/SPARK-25097
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>
> I just encountered a case where I need to apply an existing KMeansModel 
> outside of Spark, so prediction on a single instance should be exposed.
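For illustration, the kind of call this would expose (a sketch with a 
hypothetical model path):

{code:scala}
import org.apache.spark.ml.clustering.KMeansModel
import org.apache.spark.ml.linalg.Vectors

// Hypothetical path; the point is scoring a single vector without a DataFrame.
val model = KMeansModel.load("/models/kmeans")
val cluster: Int = model.predict(Vectors.dense(0.1, 0.2, 0.3))
{code}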






[jira] [Resolved] (SPARK-25097) Support prediction on single instance in KMeans/BiKMeans/GMM

2019-02-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25097.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22087
[https://github.com/apache/spark/pull/22087]

> Support prediction on single instance in KMeans/BiKMeans/GMM
> 
>
> Key: SPARK-25097
> URL: https://issues.apache.org/jira/browse/SPARK-25097
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 3.0.0
>
>
> I just encountered a case where I need to apply an existing KMeansModel 
> outside of Spark, so prediction on a single instance should be exposed.






[jira] [Assigned] (SPARK-26955) Align Spark's TimSort to JDK 11 TimSort

2019-02-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26955:
-

Assignee: Maxim Gekk

> Align Spark's TimSort to JDK 11 TimSort
> ---
>
> Key: SPARK-26955
> URL: https://issues.apache.org/jira/browse/SPARK-26955
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> There are a couple of differences between Spark's TimSort and the JDK 11 
> TimSort:
> - Additional check in mergeCollapse introduced by 
> http://hg.openjdk.java.net/jdk/jdk/rev/3a6d47df8239#l1.34
> - And increased constants for stackLen 
> (http://hg.openjdk.java.net/jdk/jdk11/file/1ddf9a99e4ad/src/java.base/share/classes/java/util/TimSort.java#l184):
> The ticket aims to address the differences.






[jira] [Resolved] (SPARK-26955) Align Spark's TimSort to JDK 11 TimSort

2019-02-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26955.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23858
[https://github.com/apache/spark/pull/23858]

> Align Spark's TimSort to JDK 11 TimSort
> ---
>
> Key: SPARK-26955
> URL: https://issues.apache.org/jira/browse/SPARK-26955
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> There are a couple of differences between Spark's TimSort and the JDK 11 
> TimSort:
> - Additional check in mergeCollapse introduced by 
> http://hg.openjdk.java.net/jdk/jdk/rev/3a6d47df8239#l1.34
> - And increased constants for stackLen 
> (http://hg.openjdk.java.net/jdk/jdk11/file/1ddf9a99e4ad/src/java.base/share/classes/java/util/TimSort.java#l184):
> The ticket aims to address the differences.






[jira] [Assigned] (SPARK-26966) Update JPMML to 1.4.8 for Java 9+

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26966:


Assignee: Apache Spark  (was: Sean Owen)

> Update JPMML to 1.4.8 for Java 9+
> -
>
> Key: SPARK-26966
> URL: https://issues.apache.org/jira/browse/SPARK-26966
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> JPMML apparently only supports Java 9 in 1.4.2+. We are seeing test failures 
> from JPMML relating to JAXB when running on Java 11. It's shaded and not a 
> big change, so it should be safe.






[jira] [Assigned] (SPARK-26966) Update JPMML to 1.4.8 for Java 9+

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26966:


Assignee: Sean Owen  (was: Apache Spark)

> Update JPMML to 1.4.8 for Java 9+
> -
>
> Key: SPARK-26966
> URL: https://issues.apache.org/jira/browse/SPARK-26966
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> JPMML apparently only supports Java 9 in 1.4.2+. We are seeing test failures 
> from JPMML relating to JAXB when running on Java 11. It's shaded and not a 
> big change, so it should be safe.






[jira] [Assigned] (SPARK-26965) Makes ElementAt nullability more precise

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26965:


Assignee: (was: Apache Spark)

> Makes ElementAt nullability more precise
> 
>
> Key: SPARK-26965
> URL: https://issues.apache.org/jira/browse/SPARK-26965
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In master, ElementAt nullable is always true;
> https://github.com/apache/spark/blob/be1cadf16dc70e22eae144b3dfce9e269ef95acc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1977
> But if the input key is foldable, we could make its nullability more precise.
> This fix is the same as in SPARK-26747 and -SPARK-26637-.






[jira] [Assigned] (SPARK-26965) Makes ElementAt nullability more precise

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26965:


Assignee: Apache Spark

> Makes ElementAt nullability more precise
> 
>
> Key: SPARK-26965
> URL: https://issues.apache.org/jira/browse/SPARK-26965
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> In master, ElementAt nullable is always true;
> https://github.com/apache/spark/blob/be1cadf16dc70e22eae144b3dfce9e269ef95acc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1977
> But if the input key is foldable, we could make its nullability more precise.
> This fix is the same as in SPARK-26747 and -SPARK-26637-.






[jira] [Created] (SPARK-26966) Update JPMML to 1.4.8 for Java 9+

2019-02-21 Thread Sean Owen (JIRA)
Sean Owen created SPARK-26966:
-

 Summary: Update JPMML to 1.4.8 for Java 9+
 Key: SPARK-26966
 URL: https://issues.apache.org/jira/browse/SPARK-26966
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 3.0.0
Reporter: Sean Owen
Assignee: Sean Owen


JPMML apparently only supports Java 9 in 1.4.2+. We are seeing test failures 
from JPMML relating to JAXB when running on Java 11. It's shaded and not a big 
change, so it should be safe.
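For illustration, the bump expressed in sbt syntax (Spark's own build is 
Maven, where the corresponding version property would be updated instead; the 
coordinates below are assumed to be the usual JPMML-Model ones):

{code:scala}
// build.sbt sketch
libraryDependencies += "org.jpmml" % "pmml-model" % "1.4.8"
{code}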






[jira] [Updated] (SPARK-23232) Mapping Dataset to a Java bean always set 1L to a long field

2019-02-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23232:
--
Priority: Major  (was: Critical)

So in what sense is it always 1?

> Mapping Dataset to a Java bean always set 1L to a long field
> 
>
> Key: SPARK-23232
> URL: https://issues.apache.org/jira/browse/SPARK-23232
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Hristo Angelov
>Priority: Major
>
> I have the following streaming query: 
> {code:java}
> baseDataSet
> .groupBy(window(col(UTC_DATE_TIME), 
> applicationProperties.getProperty("current_active_users_window_length") + " 
> minutes", "5 seconds"))
> .agg(approx_count_distinct(col(INTERNAL_USER_ID), 
> applicationProperties.getDoubleProperty("approximate_distinct_count_error_percentage")).as("value"))
> .filter(col("window.end").leq(current_timestamp()))
> .select(unix_timestamp(col("window.end")).as("timestamp"), 
> col("value"))
> .writeStream()
> 
> .trigger(Trigger.ProcessingTime(applicationProperties.getIntegerProperty("current_active_users_trigger_interval"),
>  TimeUnit.SECONDS))
> .format(ActiveUsersSinkProvider.class.getCanonicalName())
> .outputMode(OutputMode.Update())
> .option("checkpointLocation", SystemProperties.APP_CHECKPOINT_DIR + 
> "/current_active_users")
> .start();{code}
>  
> In the sink I'm trying to map the dataset to a Java bean with the following 
> code:
> {code:java}
> data.as(Encoders.bean(LongTimeBased.class)).collectAsList()
> {code}
> where LongTimeBased is:
> {code:java}
> public class LongTimeBased { 
>private long timestamp; 
>private long value; 
>  
>public long getTimestamp() {   
>   return timestamp; 
>} 
>public void setTimestamp(long timestamp) { 
>   this.timestamp = timestamp; 
>} 
>public long getValue() { 
>   return value; 
>} 
>public void setValue(long value) { 
>   this.value = value; 
>} 
> }
> {code}
>  
> So whatever data is aggregated, the timestamp is correct but the value field 
> is always 1. When I select the value field from each row, its value is 
> correct:
> {code:java}
> for(Row row : data.collectAsList()) {
> Long value = row.getAs("value"); //correct value;
> }
> {code}
>  






[jira] [Commented] (SPARK-22286) OutOfMemoryError caused by memory leak and large serializer batch size in ExternalAppendOnlyMap

2019-02-21 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774743#comment-16774743
 ] 

Sean Owen commented on SPARK-22286:
---

[~toopt4] what are you pinging? I'm not even clear this is a Spark issue

> OutOfMemoryError caused by memory leak and large serializer batch size in 
> ExternalAppendOnlyMap
> ---
>
> Key: SPARK-22286
> URL: https://issues.apache.org/jira/browse/SPARK-22286
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.1.1, 2.1.2
>Reporter: Lijie Xu
>Priority: Critical
>
> *[Abstract]* 
> I recently encountered an OOM error in a simple _groupByKey_ application. 
> After profiling the application, I found the OOM error is related to the 
> shuffle spill and records (de)serialization. After analyzing the OOM heap 
> dump, I found the root causes are (1) memory leak in ExternalAppendOnlyMap, 
> (2) large static serializer batch size (_spark.shuffle.spill.batchSize_ 
> =1) defined in ExternalAppendOnlyMap, and (3) memory leak in the 
> deserializer. Since almost all the Spark applications rely on 
> ExternalAppendOnlyMap to perform shuffle and reduce, this is a critical 
> bug/defect. In the following sections, I will detail the testing application, 
> data, environment, failure symptoms, diagnosing procedure, identified root 
> causes, and potential solutions.
> *[Application]* 
> This is a simple GroupBy application as follows.
> {code}
> table.map(row => (row.sourceIP[1,7], row)).groupByKey().saveAsTextFile()
> {code}
> The _sourceIP_ (an IP address like 127.100.101.102) is a column of the 
> _UserVisits_ table. This application has the same logic as the aggregation 
> query in Berkeley SQL benchmark (https://amplab.cs.berkeley.edu/benchmark/) 
> as follows. 
> {code}
>   SELECT * FROM UserVisits
>   GROUP BY SUBSTR(sourceIP, 1, 7);
> {code}
> The application code is available at \[1\].
> *[Data]* 
> The UserVisits table size is 16GB (9 columns, 132,000,000 rows) with uniform 
> distribution. The HDFS block size is 128MB. The data generator is available 
> at \[2\].
> *[Environment]* 
> Spark 2.1 (Spark 2.2 may also have this error), Oracle Java Hotspot 1.8.0, 1 
> master and 8 workers as follows.
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/Workers.png|width=100%!
> This application launched 32 executors. Each executor has 1 core and 7GB 
> memory. The detailed application configuration is
> {code}
>total-executor-cores = 32
>executor-cores = 1 
>executor-memory = 7G
>spark.default.parallelism=32 
>spark.serializer = JavaSerializer (KryoSerializer also has OOM error)
> {code}
> *[Failure symptoms]*
> This application has a map stage and a reduce stage. An OOM error occurs in a 
> reduce task (Task-17) as follows.
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/Stage.png|width=100%!
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/Tasks.png|width=100%!
>  
> Task-17 generated an OOM error. It shuffled ~1GB data and spilled 3.6GB data 
> onto the disk.
> Task-17 log below shows that this task is reading the next record by invoking 
> _ExternalAppendOnlyMap.hasNext_(). From the OOM stack traces and the above 
> shuffle metrics, we cannot identify the OOM root causes. 
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/OOMStackTrace.png|width=100%!
>  
> A question is that why Task-17 still suffered OOM errors even after spilling 
> large in-memory data onto the disk.
> *[Diagnosing procedure]*
> Since each executor has 1 core and 7GB, it runs only one task at a time and 
> the task memory usage is going to exceed 7GB.
> *1: Identify the error phase*
> I added some debug logs in Spark, and found that the error phase is not the 
> spill phase but the memory-disk-merge phase. 
> The memory-disk-merge phase: Spark reads back the spilled records (as shown 
> in ① Figure 1), merges the spilled records with the in-memory records  (as 
> shown in ②), generates new records, and output the new records onto HDFS (as 
> shown in ③).
> *2. Dataflow and memory usage analysis*
> I added some profiling code and obtained dataflow and memory usage metrics as 
> follows. Ki represents the _i_-th key, Ri represents the _i_-th row in the 
> table.
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/DataflowAndMemoryUsage.png|width=100%!
>   Figure 1: Dataflow and Memory Usage Analysis (see 
> https://github.com/JerryLead/Misc/blob/master/SparkPRFigures/OOM/SPARK-22286-OOM.pdf
>  for the high-definition version)
> The concrete phases with metrics are as follows.
> *[Shuffle read]* records = 7,540,235, 

[jira] [Updated] (SPARK-22286) OutOfMemoryError caused by memory leak and large serializer batch size in ExternalAppendOnlyMap

2019-02-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22286:
--
Priority: Major  (was: Critical)

> OutOfMemoryError caused by memory leak and large serializer batch size in 
> ExternalAppendOnlyMap
> ---
>
> Key: SPARK-22286
> URL: https://issues.apache.org/jira/browse/SPARK-22286
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.1.1, 2.1.2
>Reporter: Lijie Xu
>Priority: Major
>
> *[Abstract]* 
> I recently encountered an OOM error in a simple _groupByKey_ application. 
> After profiling the application, I found the OOM error is related to the 
> shuffle spill and records (de)serialization. After analyzing the OOM heap 
> dump, I found the root causes are (1) memory leak in ExternalAppendOnlyMap, 
> (2) large static serializer batch size (_spark.shuffle.spill.batchSize_ 
> =1) defined in ExternalAppendOnlyMap, and (3) memory leak in the 
> deserializer. Since almost all the Spark applications rely on 
> ExternalAppendOnlyMap to perform shuffle and reduce, this is a critical 
> bug/defect. In the following sections, I will detail the testing application, 
> data, environment, failure symptoms, diagnosing procedure, identified root 
> causes, and potential solutions.
> *[Application]* 
> This is a simple GroupBy application as follows.
> {code}
> table.map(row => (row.sourceIP[1,7], row)).groupByKey().saveAsTextFile()
> {code}
> The _sourceIP_ (an IP address like 127.100.101.102) is a column of the 
> _UserVisits_ table. This application has the same logic as the aggregation 
> query in Berkeley SQL benchmark (https://amplab.cs.berkeley.edu/benchmark/) 
> as follows. 
> {code}
>   SELECT * FROM UserVisits
>   GROUP BY SUBSTR(sourceIP, 1, 7);
> {code}
> The application code is available at \[1\].
> *[Data]* 
> The UserVisits table size is 16GB (9 columns, 132,000,000 rows) with uniform 
> distribution. The HDFS block size is 128MB. The data generator is available 
> at \[2\].
> *[Environment]* 
> Spark 2.1 (Spark 2.2 may also have this error), Oracle Java Hotspot 1.8.0, 1 
> master and 8 workers as follows.
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/Workers.png|width=100%!
> This application launched 32 executors. Each executor has 1 core and 7GB 
> memory. The detailed application configuration is
> {code}
>total-executor-cores = 32
>executor-cores = 1 
>executor-memory = 7G
>spark.default.parallelism=32 
>spark.serializer = JavaSerializer (KryoSerializer also has OOM error)
> {code}
> *[Failure symptoms]*
> This application has a map stage and a reduce stage. An OOM error occurs in a 
> reduce task (Task-17) as follows.
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/Stage.png|width=100%!
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/Tasks.png|width=100%!
>  
> Task-17 generated an OOM error. It shuffled ~1GB data and spilled 3.6GB data 
> onto the disk.
> Task-17 log below shows that this task is reading the next record by invoking 
> _ExternalAppendOnlyMap.hasNext_(). From the OOM stack traces and the above 
> shuffle metrics, we cannot identify the OOM root causes. 
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/OOMStackTrace.png|width=100%!
>  
> A question is that why Task-17 still suffered OOM errors even after spilling 
> large in-memory data onto the disk.
> *[Diagnosing procedure]*
> Since each executor has 1 core and 7GB, it runs only one task at a time and 
> the task memory usage is going to exceed 7GB.
> *1: Identify the error phase*
> I added some debug logs in Spark, and found that the error phase is not the 
> spill phase but the memory-disk-merge phase. 
> The memory-disk-merge phase: Spark reads back the spilled records (as shown 
> in ① Figure 1), merges the spilled records with the in-memory records  (as 
> shown in ②), generates new records, and output the new records onto HDFS (as 
> shown in ③).
> *2. Dataflow and memory usage analysis*
> I added some profiling code and obtained dataflow and memory usage metrics as 
> follows. Ki represents the _i_-th key, Ri represents the _i_-th row in the 
> table.
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/OOM/DataflowAndMemoryUsage.png|width=100%!
>   Figure 1: Dataflow and Memory Usage Analysis (see 
> https://github.com/JerryLead/Misc/blob/master/SparkPRFigures/OOM/SPARK-22286-OOM.pdf
>  for the high-definition version)
> The concrete phases with metrics are as follows.
> *[Shuffle read]* records = 7,540,235, bytes = 903 MB
> *[In-memory store]* As shown in the following log, about 5,243,424 

[jira] [Updated] (SPARK-26948) vertex and edge rowkey upgrade and support multiple types?

2019-02-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26948:
--
Priority: Minor  (was: Critical)

I think that has been proposed before; search JIRA. It has some perf 
implications but isn't crazy at all. It's not "Critical", though.

> vertex and edge rowkey upgrade and support multiple types?
> --
>
> Key: SPARK-26948
> URL: https://issues.apache.org/jira/browse/SPARK-26948
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.4.0
>Reporter: daile
>Priority: Minor
>
> Currently only Long is supported, but most of the graph databases use string 
> as the primary key.






[jira] [Commented] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins

2019-02-21 Thread Shivu Sondur (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774725#comment-16774725
 ] 

Shivu Sondur commented on SPARK-26944:
--

[~abellina] 

Can you provide more info, like the Jenkins link where the issue is happening, 
and other details?

> Python unit-tests.log not available in artifacts for a build in Jenkins
> ---
>
> Key: SPARK-26944
> URL: https://issues.apache.org/jira/browse/SPARK-26944
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Priority: Minor
>
> I had a pr where the python unit tests failed.  The tests point at the 
> `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, 
> but I can't get to that from jenkins UI it seems (are all prs writing to the 
> same file?).
> This Jira is to make it available under the artifacts for each build.






[jira] [Commented] (SPARK-12009) Avoid re-allocate yarn container while driver want to stop all Executors

2019-02-21 Thread V.Vinoth Kumar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774711#comment-16774711
 ] 

V.Vinoth Kumar commented on SPARK-12009:


Hi [~SuYan],

Can someone reopen this ticket? I am facing the same issue now.
{panel:title=ErrorLog}
19/02/21 11:47:23 INFO scheduler.DAGScheduler: Job 6 failed: saveAsTextFile at 
XmlFile.scala:139, took 268.146916 s
19/02/21 11:47:23 INFO scheduler.DAGScheduler: ShuffleMapStage 9 (rdd at 
XmlFile.scala:89) failed in 268.129 s due to Stage cancelled because 
SparkContext was shut down
19/02/21 11:47:23 ERROR scheduler.LiveListenerBus: SparkListenerBus has already 
stopped! Dropping event 
SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@4887d60b)
19/02/21 11:47:23 INFO scheduler.DAGScheduler: ShuffleMapStage 10 (rdd at 
XmlFile.scala:89) failed in 268.148 s due to Stage cancelled because 
SparkContext was shut down
19/02/21 11:47:23 ERROR scheduler.LiveListenerBus: SparkListenerBus has already 
stopped! Dropping event 
SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@10aecbf7)
19/02/21 11:47:23 ERROR scheduler.LiveListenerBus: SparkListenerBus has already 
stopped! Dropping event 
SparkListenerJobEnd(6,1550767643703,JobFailed(org.apache.spark.SparkException: 
Job 6 cancelled because SparkContext was shut down))
org.apache.spark.SparkException: Job 6 cancelled because SparkContext was shut 
down
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:820)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:818)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
at 
org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:818)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1732)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1651)
at 
org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1921)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1920)
at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:581)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2075)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1151)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1096)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1096)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1070)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1035)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1035)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at 

[jira] [Updated] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars

2019-02-21 Thread Huon Wilson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huon Wilson updated SPARK-26964:

Description: 
Spark SQL's {{to_json}} and {{from_json}} currently support arrays and objects, 
but not the scalar/primitive types. This doesn't match the JSON spec on 
https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]: a JSON 
document ({{json: element}}) consists of a value surrounded by whitespace 
({{element: ws value ws}}), where a value is an object or array _or_ a number 
or string etc.:

{code:none}
value
object
array
string
number
"true"
"false"
"null"
{code}

Having {{to_json}} and {{from_json}} support scalars would make them flexible 
enough for a library I'm working on, where an arbitrary (user-supplied) column 
needs to be turned into JSON.

NB. these newer specs differ from the original [RFC4627| 
https://tools.ietf.org/html/rfc4627] (now obsolete), which (essentially) 
had {{value: object | array}}.

This is related to SPARK-24391 and SPARK-25252, which added support for arrays 
of scalars.

  was:
Spark SQL's {{to_json}} and {{from_json}} currently support arrays and objects, 
but not the scalar/primitive types. This doesn't match the JSON spec on 
https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]: a JSON 
document ({{json: element}}) consists of a value surrounded by whitespace 
({{element: ws value ws}}), where a value is an object or array _or_ a number 
or string etc.:

{code:none}
value
object
array
string
number
"true"
"false"
"null"
{code}

Having {{to_json}} and {{from_json}} support scalars would make them flexible 
enough for a library I'm working on, where an arbitrary (user-supplied) column 
needs to be turned into JSON.

NB. this differs to the original (now obsolete) [RFC4627| 
https://tools.ietf.org/html/rfc4627], which (essentially) had {{value: object | 
array}}.

This is related to SPARK-24391 and SPARK-25252, which added support for arrays 
of scalars.


> to_json/from_json do not match JSON spec due to not supporting scalars
> --
>
> Key: SPARK-26964
> URL: https://issues.apache.org/jira/browse/SPARK-26964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Huon Wilson
>Priority: Major
>
> Spark SQL's {{to_json}} and {{from_json}} currently support arrays and 
> objects, but not the scalar/primitive types. This doesn't match the JSON spec 
> on https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]: a 
> JSON document ({{json: element}}) consists of a value surrounded by 
> whitespace ({{element: ws value ws}}), where a value is an object or array 
> _or_ a number or string etc.:
> {code:none}
> value
> object
> array
> string
> number
> "true"
> "false"
> "null"
> {code}
> Having {{to_json}} and {{from_json}} support scalars would make them flexible 
> enough for a library I'm working on, where an arbitrary (user-supplied) 
> column needs to be turned into JSON.
> NB. these newer specs differ from the original [RFC4627| 
> https://tools.ietf.org/html/rfc4627] (now obsolete), which 
> (essentially) had {{value: object | array}}.
> This is related to SPARK-24391 and SPARK-25252, which added support for 
> arrays of scalars.
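For concreteness, a minimal sketch of the limitation described above, assuming the 
Spark 2.4 behavior that {{from_json}} accepts struct/map/array schemas but rejects 
scalar ones (the session setup is illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, to_json, struct, lit}
import org.apache.spark.sql.types.{IntegerType, StructType}

object ScalarJsonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("scalar-json").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq("""{"a": 1}""", "42").toDF("json")

    // Supported today: parsing into a struct schema.
    df.select(from_json($"json", new StructType().add("a", IntegerType))).show()

    // Not supported today: a scalar schema such as IntegerType, even though
    // "42" is a valid JSON document under RFC 8259. Expected to fail analysis:
    // df.select(from_json($"json", IntegerType)).show()

    // Similarly, to_json works on struct/array/map columns, not plain scalars:
    df.select(to_json(struct(lit(1).as("a")))).show()
    // df.select(to_json(lit(1))).show()  // also expected to fail analysis today

    spark.stop()
  }
}
{code}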






[jira] [Updated] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars

2019-02-21 Thread Huon Wilson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huon Wilson updated SPARK-26964:

Description: 
Spark SQL's {{to_json}} and {{from_json}} currently support arrays and objects, 
but not the scalar/primitive types. This doesn't match the JSON spec on 
https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]: a JSON 
document ({{json: element}}) consists of a value surrounded by whitespace 
({{element: ws value ws}}), where a value is an object or array _or_ a number 
or string etc.:

{code:none}
value
object
array
string
number
"true"
"false"
"null"
{code}

Having {{to_json}} and {{from_json}} support scalars would make them flexible 
enough for a library I'm working on, where an arbitrary (user-supplied) column 
needs to be turned into JSON.

NB. this differs to the original (now obsolete) [RFC4627| 
https://tools.ietf.org/html/rfc4627], which (essentially) had {{value: object | 
array}}.

This is related to SPARK-24391 and SPARK-25252, which added support for arrays 
of scalars.

  was:
Spark SQL's {{to_json}} and {{from_json}} currently support arrays and objects, 
but not the scalar/primitive types. This doesn't match the JSON spec on 
https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]: a JSON 
document ({{json: element}}) consists of a value surrounded by whitespace 
({{element: ws value ws}}), where a value is an object or array _or_ a number 
or string etc.:

{code:none}
value
object
array
string
number
"true"
"false"
"null"
{code}

Having {{to_json}} and {{from_json}} support scalars would make them flexible 
enough for a library I'm working on, where an arbitrary (user-supplied) column 
needs to be turned into JSON.

NB. this differs to the original (now obsolete) [RFC4627| 
https://tools.ietf.org/html/rfc4627].

This is related to SPARK-24391 and SPARK-25252, which added support for arrays 
of scalars.


> to_json/from_json do not match JSON spec due to not supporting scalars
> --
>
> Key: SPARK-26964
> URL: https://issues.apache.org/jira/browse/SPARK-26964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Huon Wilson
>Priority: Major
>
> Spark SQL's {{to_json}} and {{from_json}} currently support arrays and 
> objects, but not the scalar/primitive types. This doesn't match the JSON spec 
> on https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]: a 
> JSON document ({{json: element}}) consists of a value surrounded by 
> whitespace ({{element: ws value ws}}), where a value is an object or array 
> _or_ a number or string etc.:
> {code:none}
> value
> object
> array
> string
> number
> "true"
> "false"
> "null"
> {code}
> Having {{to_json}} and {{from_json}} support scalars would make them flexible 
> enough for a library I'm working on, where an arbitrary (user-supplied) 
> column needs to be turned into JSON.
> NB. this differs to the original (now obsolete) [RFC4627| 
> https://tools.ietf.org/html/rfc4627], which (essentially) had {{value: object 
> | array}}.
> This is related to SPARK-24391 and SPARK-25252, which added support for 
> arrays of scalars.






[jira] [Updated] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars

2019-02-21 Thread Huon Wilson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huon Wilson updated SPARK-26964:

Description: 
Spark SQL's {{to_json}} and {{from_json}} currently support arrays and objects, 
but not the scalar/primitive types. This doesn't match the JSON spec on 
https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]: a JSON 
document ({{json: element}}) consists of a value surrounded by whitespace 
({{element: ws value ws}}), where a value is an object or array _or_ a number 
or string etc.:

{code:none}
value
object
array
string
number
"true"
"false"
"null"
{code}

Having {{to_json}} and {{from_json}} support scalars would make them flexible 
enough for a library I'm working on, where an arbitrary (user-supplied) column 
needs to be turned into JSON.

NB. this differs to the original (now obsolete) [RFC4627| 
https://tools.ietf.org/html/rfc4627].

This is related to SPARK-24391 and SPARK-25252, which added support for arrays 
of scalars.

  was:
Spark SQL's {{to_json}} and {{from_json}} currently support arrays and objects, 
but not the scalar/primitive types. This doesn't match the JSON spec on 
https://www.json.org: a JSON document ({{json: element}}) consists of a value 
surrounded by whitespace ({{element: ws value ws}}), where a value is an object 
or array _or_ a number or string etc.:

{code:none}
value
object
array
string
number
"true"
"false"
"null"
{code}

Having {{to_json}} and {{from_json}} support scalars would make them flexible 
enough for a library I'm working on, where an arbitrary (user-supplied) column 
needs to be turned into JSON.

This is related to SPARK-24391 and SPARK-25252, which added support for arrays 
of scalars.


> to_json/from_json do not match JSON spec due to not supporting scalars
> --
>
> Key: SPARK-26964
> URL: https://issues.apache.org/jira/browse/SPARK-26964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Huon Wilson
>Priority: Major
>
> Spark SQL's {{to_json}} and {{from_json}} currently support arrays and 
> objects, but not the scalar/primitive types. This doesn't match the JSON spec 
> on https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]: a 
> JSON document ({{json: element}}) consists of a value surrounded by 
> whitespace ({{element: ws value ws}}), where a value is an object or array 
> _or_ a number or string etc.:
> {code:none}
> value
> object
> array
> string
> number
> "true"
> "false"
> "null"
> {code}
> Having {{to_json}} and {{from_json}} support scalars would make them flexible 
> enough for a library I'm working on, where an arbitrary (user-supplied) 
> column needs to be turned into JSON.
> NB. this differs to the original (now obsolete) [RFC4627| 
> https://tools.ietf.org/html/rfc4627].
> This is related to SPARK-24391 and SPARK-25252, which added support for 
> arrays of scalars.






[jira] [Created] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars

2019-02-21 Thread Huon Wilson (JIRA)
Huon Wilson created SPARK-26964:
---

 Summary: to_json/from_json do not match JSON spec due to not 
supporting scalars
 Key: SPARK-26964
 URL: https://issues.apache.org/jira/browse/SPARK-26964
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 2.3.2
Reporter: Huon Wilson


Spark SQL's {{to_json}} and {{from_json}} currently support arrays and objects, 
but not the scalar/primitive types. This doesn't match the JSON spec on 
https://www.json.org: a JSON document ({{json: element}}) consists of a value 
surrounded by whitespace ({{element: ws value ws}}), where a value is an object 
or array _or_ a number or string etc.:

{code:none}
value
object
array
string
number
"true"
"false"
"null"
{code}

Having {{to_json}} and {{from_json}} support scalars would make them flexible 
enough for a library I'm working on, where an arbitrary (user-supplied) column 
needs to be turned into JSON.

This is related to SPARK-24391 and SPARK-25252, which added support for arrays 
of scalars.






[jira] [Resolved] (SPARK-26960) Reduce flakiness of Spark ML Listener test suite by waiting for listener bus to clear

2019-02-21 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26960.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23863
[https://github.com/apache/spark/pull/23863]

> Reduce flakiness of Spark ML Listener test suite by waiting for listener bus 
> to clear
> -
>
> Key: SPARK-26960
> URL: https://issues.apache.org/jira/browse/SPARK-26960
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Tests
>Affects Versions: 3.0.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
> Fix For: 3.0.0
>
>
> [SPARK-23674] added SparkListeners for some spark.ml events, as well as a 
> test suite.  I've observed flakiness in the test suite (though I 
> unfortunately only have local logs for failures, not permalinks).  Looking at 
> logs, here's my assessment, which I confirmed with [~zsxwing].
> Test failure I saw:
> {code}
> [info] - pipeline read/write events *** FAILED *** (10 seconds, 253 
> milliseconds)
> [info]   The code passed to eventually never returned normally. Attempted 20 
> times over 10.01409 seconds. Last failure message: Unexpected event 
> thrown: 
> SaveInstanceEnd(org.apache.spark.ml.util.DefaultParamsWriter@6daf89c2,/home/jenkins/workspace/mllib/target/tmp/spark-572952d8-5d2d-4a6c-bd4f-940bb0bbc3d5/pipeline/stages/0_writableStage).
>  (MLEventsSuite.scala:190)
> [info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
> [info]   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
> [info]   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
> [info]   at 
> org.apache.spark.ml.MLEventsSuite.eventually(MLEventsSuite.scala:39)
> [info]   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
> [info]   at 
> org.apache.spark.ml.MLEventsSuite.eventually(MLEventsSuite.scala:39)
> [info]   at 
> org.apache.spark.ml.MLEventsSuite$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(MLEventsSuite.scala:190)
> [info]   at 
> org.apache.spark.ml.MLEventsSuite$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(MLEventsSuite.scala:165)
> [info]   at 
> org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:314)
> [info]   at 
> org.apache.spark.ml.MLEventsSuite$$anonfun$1.apply$mcV$sp(MLEventsSuite.scala:165)
> [info]   at 
> org.apache.spark.ml.MLEventsSuite$$anonfun$1.apply(MLEventsSuite.scala:161)
> [info]   at 
> org.apache.spark.ml.MLEventsSuite$$anonfun$1.apply(MLEventsSuite.scala:161)
> [info]   at 
> org.apache.spark.SparkFunSuite$$anonfun$test$1.apply(SparkFunSuite.scala:284)
> [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:184)
> [info]   at 
> org.apache.spark.SparkFunSuite.runWithCredentials(SparkFunSuite.scala:301)
> [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:165)
> [info]   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
> [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
> [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:221)
> [info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
> [info]   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
> [info]   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
> [info]   at scala.collection.immutable.List.foreach(List.scala:392)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
> [info]   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
> [info]   at 

[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2019-02-21 Thread Xianyin Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774675#comment-16774675
 ] 

Xianyin Xin commented on SPARK-21067:
-

Uploaded a patch based on 2.3.2.

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>Priority: Major
> Attachments: SPARK-21067.patch
>
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the Thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following JIRA 
> (https://issues.apache.org/jira/browse/SPARK-11021), which states that 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}".
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> The Thrift server usually fails at INSERT...
> We tried the same procedure in a Spark context using spark.sql() (see the sketch 
> after this paragraph) and didn't encounter the same issue.
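For reference, a minimal sketch of that spark.sql() reproduction (the statements are 
taken from the SQL above; the session setup is illustrative and assumes spark-hive is 
on the classpath):

{code:scala}
import org.apache.spark.sql.SparkSession

object CtasReproSketch {
  def main(args: Array[String]): Unit = {
    // Hive support so that CTAS/INSERT go through the Hive metastore path.
    val spark = SparkSession.builder()
      .appName("ctas-repro")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()

    Seq(
      "DROP SCHEMA IF EXISTS dricard CASCADE",
      "CREATE SCHEMA dricard",
      "CREATE TABLE dricard.test (col1 int)",
      "INSERT INTO TABLE dricard.test SELECT 1",
      "SELECT * FROM dricard.test",
      "DROP TABLE dricard.test",
      "CREATE TABLE dricard.test AS SELECT 1 AS `col1`",
      "SELECT * FROM dricard.test"
    ).foreach(stmt => spark.sql(stmt).show())

    spark.stop()
  }
}
{code}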
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
> at org.apache.spark.sql.Dataset.(Dataset.scala:185)
> at 

[jira] [Updated] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2019-02-21 Thread Xianyin Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyin Xin updated SPARK-21067:

Attachment: SPARK-21067.patch

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>Priority: Major
> Attachments: SPARK-21067.patch
>
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021) which state that the 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
> at org.apache.spark.sql.Dataset.(Dataset.scala:185)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
> at 

[jira] [Resolved] (SPARK-20314) Inconsistent error handling in JSON parsing SQL functions

2019-02-21 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge resolved SPARK-20314.

Resolution: Duplicate

Resolving it as a duplicate.

> Inconsistent error handling in JSON parsing SQL functions
> -
>
> Key: SPARK-20314
> URL: https://issues.apache.org/jira/browse/SPARK-20314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Wasserman
>Priority: Major
>
> Most parse errors in the JSON parsing SQL functions (e.g. json_tuple, 
> get_json_object) will return null(s) if the JSON is badly formed. However, 
> if Jackson determines that the string includes invalid characters, it will 
> throw an exception (java.io.CharConversionException: Invalid UTF-32 
> character) that Spark does not catch. This creates a robustness problem in 
> that these functions cannot be used at all when there may be dirty data, as 
> these exceptions will kill the jobs.
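For illustration, a minimal sketch of the inconsistency described above 
({{get_json_object}} and {{json_tuple}} are the real Spark SQL functions; the exact 
input that triggers the UTF-32 failure is data-dependent, so it is only indicated in 
a comment):

{code:scala}
import org.apache.spark.sql.SparkSession

object JsonErrorHandlingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-error-handling").master("local[*]").getOrCreate()

    // Merely malformed JSON: both functions simply return NULL.
    spark.sql("SELECT get_json_object('not json at all', '$.a')").show()
    spark.sql("SELECT json_tuple('not json at all', 'a')").show()

    // By contrast, input that Jackson decodes as invalid UTF-32 characters can
    // throw java.io.CharConversionException instead of returning NULL, which is
    // the inconsistency (and job-killing behavior) reported in this issue.

    spark.stop()
  }
}
{code}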






[jira] [Assigned] (SPARK-26963) SizeEstimator can't make some JDK fields accessible in Java 9+

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26963:


Assignee: Sean Owen  (was: Apache Spark)

> SizeEstimator can't make some JDK fields accessible in Java 9+
> --
>
> Key: SPARK-26963
> URL: https://issues.apache.org/jira/browse/SPARK-26963
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> Copied from https://github.com/apache/spark/pull/23804#issuecomment-466198782
> Under Java 11, tests fail with:
> {code}
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.935 
> s <<< FAILURE! - in org.apache.spark.ml.regression.JavaGBTRegressorSuite
> [ERROR] runDT(org.apache.spark.ml.regression.JavaGBTRegressorSuite)  Time 
> elapsed: 0.933 s  <<< ERROR!
> java.lang.reflect.InaccessibleObjectException: Unable to make field 
> jdk.internal.ref.PhantomCleanable jdk.internal.ref.PhantomCleanable.prev 
> accessible: module java.base does not "opens jdk.internal.ref" to unnamed 
> module @4212a0c8
>   at 
> org.apache.spark.ml.regression.JavaGBTRegressorSuite.runDT(JavaGBTRegressorSuite.java:65)
> [INFO] Running org.apache.spark.ml.regression.JavaDecisionTreeRegressorSuite
> [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.43 s 
> - in org.apache.spark.ml.regression.JavaDecisionTreeRegressorSuite
> [INFO] Running org.apache.spark.ml.regression.JavaRandomForestRegressorSuite
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.507 
> s <<< FAILURE! - in 
> org.apache.spark.ml.regression.JavaRandomForestRegressorSuite
> [ERROR] runDT(org.apache.spark.ml.regression.JavaRandomForestRegressorSuite)  
> Time elapsed: 0.506 s  <<< ERROR!
> java.lang.reflect.InaccessibleObjectException: Unable to make field 
> jdk.internal.ref.PhantomCleanable jdk.internal.ref.PhantomCleanable.prev 
> accessible: module java.base does not "opens jdk.internal.ref" to unnamed 
> module @4212a0c8
>   at 
> org.apache.spark.ml.regression.JavaRandomForestRegressorSuite.runDT(JavaRandomForestRegressorSuite.java:88)
> {code}
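(As an aside on the failure mode above: on Java 9+, {{Field.setAccessible}} throws 
{{InaccessibleObjectException}}, a {{RuntimeException}}, for fields in JDK-internal 
modules that are not opened to the caller. A minimal sketch of one way to tolerate 
that, illustrative only and not the actual SizeEstimator patch:)

{code:scala}
import java.lang.reflect.Field

object AccessibleFieldsSketch {
  // Return only the declared fields that can actually be made accessible,
  // skipping JDK-internal fields that the module system refuses to open.
  def accessibleDeclaredFields(cls: Class[_]): Seq[Field] =
    cls.getDeclaredFields.toSeq.flatMap { f =>
      try {
        f.setAccessible(true)
        Some(f)
      } catch {
        // On Java 9+ this is java.lang.reflect.InaccessibleObjectException,
        // which extends RuntimeException and therefore also compiles on Java 8.
        case _: RuntimeException => None
      }
    }

  def main(args: Array[String]): Unit = {
    // Depending on the JDK version and --add-opens flags, some fields may be skipped.
    println(accessibleDeclaredFields(classOf[java.nio.charset.Charset]).map(_.getName))
  }
}
{code}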
> Stack trace:
> {code}
> at 
> java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:337)
> at 
> java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:281)
> at 
> java.base/java.lang.reflect.Field.checkCanSetAccessible(Field.java:176)
> at java.base/java.lang.reflect.Field.setAccessible(Field.java:170)
> at 
> org.apache.spark.util.SizeEstimator$.$anonfun$getClassInfo$2(SizeEstimator.scala:337)
> at 
> org.apache.spark.util.SizeEstimator$.$anonfun$getClassInfo$2$adapted(SizeEstimator.scala:331)
> at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> at 
> org.apache.spark.util.SizeEstimator$.getClassInfo(SizeEstimator.scala:331)
> at 
> org.apache.spark.util.SizeEstimator$.getClassInfo(SizeEstimator.scala:325)
> at 
> org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:223)
> at 
> org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:202)
> at 
> org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:70)
> at 
> org.apache.spark.util.collection.SizeTracker.takeSample(SizeTracker.scala:78)
> at 
> org.apache.spark.util.collection.SizeTracker.afterUpdate(SizeTracker.scala:70)
> at 
> org.apache.spark.util.collection.SizeTracker.afterUpdate$(SizeTracker.scala:67)
> at 
> org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
> at 
> org.apache.spark.storage.memory.DeserializedValuesHolder.storeValue(MemoryStore.scala:665)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:222)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
> at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1166)
> at 
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1092)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1157)
> at 
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:915)
> at 
> org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1482)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:133)

[jira] [Assigned] (SPARK-26963) SizeEstimator can't make some JDK fields accessible in Java 9+

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26963:


Assignee: Apache Spark  (was: Sean Owen)

> SizeEstimator can't make some JDK fields accessible in Java 9+
> --
>
> Key: SPARK-26963
> URL: https://issues.apache.org/jira/browse/SPARK-26963
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> Copied from https://github.com/apache/spark/pull/23804#issuecomment-466198782
> Under Java 11, tests fail with:
> {code}
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.935 
> s <<< FAILURE! - in org.apache.spark.ml.regression.JavaGBTRegressorSuite
> [ERROR] runDT(org.apache.spark.ml.regression.JavaGBTRegressorSuite)  Time 
> elapsed: 0.933 s  <<< ERROR!
> java.lang.reflect.InaccessibleObjectException: Unable to make field 
> jdk.internal.ref.PhantomCleanable jdk.internal.ref.PhantomCleanable.prev 
> accessible: module java.base does not "opens jdk.internal.ref" to unnamed 
> module @4212a0c8
>   at 
> org.apache.spark.ml.regression.JavaGBTRegressorSuite.runDT(JavaGBTRegressorSuite.java:65)
> [INFO] Running org.apache.spark.ml.regression.JavaDecisionTreeRegressorSuite
> [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.43 s 
> - in org.apache.spark.ml.regression.JavaDecisionTreeRegressorSuite
> [INFO] Running org.apache.spark.ml.regression.JavaRandomForestRegressorSuite
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.507 
> s <<< FAILURE! - in 
> org.apache.spark.ml.regression.JavaRandomForestRegressorSuite
> [ERROR] runDT(org.apache.spark.ml.regression.JavaRandomForestRegressorSuite)  
> Time elapsed: 0.506 s  <<< ERROR!
> java.lang.reflect.InaccessibleObjectException: Unable to make field 
> jdk.internal.ref.PhantomCleanable jdk.internal.ref.PhantomCleanable.prev 
> accessible: module java.base does not "opens jdk.internal.ref" to unnamed 
> module @4212a0c8
>   at 
> org.apache.spark.ml.regression.JavaRandomForestRegressorSuite.runDT(JavaRandomForestRegressorSuite.java:88)
> {code}
> Stack trace:
> {code}
> at 
> java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:337)
> at 
> java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:281)
> at 
> java.base/java.lang.reflect.Field.checkCanSetAccessible(Field.java:176)
> at java.base/java.lang.reflect.Field.setAccessible(Field.java:170)
> at 
> org.apache.spark.util.SizeEstimator$.$anonfun$getClassInfo$2(SizeEstimator.scala:337)
> at 
> org.apache.spark.util.SizeEstimator$.$anonfun$getClassInfo$2$adapted(SizeEstimator.scala:331)
> at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
> at 
> org.apache.spark.util.SizeEstimator$.getClassInfo(SizeEstimator.scala:331)
> at 
> org.apache.spark.util.SizeEstimator$.getClassInfo(SizeEstimator.scala:325)
> at 
> org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:223)
> at 
> org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:202)
> at 
> org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:70)
> at 
> org.apache.spark.util.collection.SizeTracker.takeSample(SizeTracker.scala:78)
> at 
> org.apache.spark.util.collection.SizeTracker.afterUpdate(SizeTracker.scala:70)
> at 
> org.apache.spark.util.collection.SizeTracker.afterUpdate$(SizeTracker.scala:67)
> at 
> org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
> at 
> org.apache.spark.storage.memory.DeserializedValuesHolder.storeValue(MemoryStore.scala:665)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:222)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
> at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1166)
> at 
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1092)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1157)
> at 
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:915)
> at 
> org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1482)
> at 
> 

[jira] [Created] (SPARK-26963) SizeEstimator can't make some JDK fields accessible in Java 9+

2019-02-21 Thread Sean Owen (JIRA)
Sean Owen created SPARK-26963:
-

 Summary: SizeEstimator can't make some JDK fields accessible in 
Java 9+
 Key: SPARK-26963
 URL: https://issues.apache.org/jira/browse/SPARK-26963
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Sean Owen
Assignee: Sean Owen


Copied from https://github.com/apache/spark/pull/23804#issuecomment-466198782

Under Java 11, tests fail with:

{code}
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.935 s 
<<< FAILURE! - in org.apache.spark.ml.regression.JavaGBTRegressorSuite
[ERROR] runDT(org.apache.spark.ml.regression.JavaGBTRegressorSuite)  Time 
elapsed: 0.933 s  <<< ERROR!
java.lang.reflect.InaccessibleObjectException: Unable to make field 
jdk.internal.ref.PhantomCleanable jdk.internal.ref.PhantomCleanable.prev 
accessible: module java.base does not "opens jdk.internal.ref" to unnamed 
module @4212a0c8
at 
org.apache.spark.ml.regression.JavaGBTRegressorSuite.runDT(JavaGBTRegressorSuite.java:65)

[INFO] Running org.apache.spark.ml.regression.JavaDecisionTreeRegressorSuite
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.43 s - 
in org.apache.spark.ml.regression.JavaDecisionTreeRegressorSuite
[INFO] Running org.apache.spark.ml.regression.JavaRandomForestRegressorSuite
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.507 s 
<<< FAILURE! - in org.apache.spark.ml.regression.JavaRandomForestRegressorSuite
[ERROR] runDT(org.apache.spark.ml.regression.JavaRandomForestRegressorSuite)  
Time elapsed: 0.506 s  <<< ERROR!
java.lang.reflect.InaccessibleObjectException: Unable to make field 
jdk.internal.ref.PhantomCleanable jdk.internal.ref.PhantomCleanable.prev 
accessible: module java.base does not "opens jdk.internal.ref" to unnamed 
module @4212a0c8
at 
org.apache.spark.ml.regression.JavaRandomForestRegressorSuite.runDT(JavaRandomForestRegressorSuite.java:88)
{code}

Stack trace:

{code}
at 
java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:337)
at 
java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:281)
at 
java.base/java.lang.reflect.Field.checkCanSetAccessible(Field.java:176)
at java.base/java.lang.reflect.Field.setAccessible(Field.java:170)
at 
org.apache.spark.util.SizeEstimator$.$anonfun$getClassInfo$2(SizeEstimator.scala:337)
at 
org.apache.spark.util.SizeEstimator$.$anonfun$getClassInfo$2$adapted(SizeEstimator.scala:331)
at 
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at 
org.apache.spark.util.SizeEstimator$.getClassInfo(SizeEstimator.scala:331)
at 
org.apache.spark.util.SizeEstimator$.getClassInfo(SizeEstimator.scala:325)
at 
org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:223)
at 
org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:202)
at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:70)
at 
org.apache.spark.util.collection.SizeTracker.takeSample(SizeTracker.scala:78)
at 
org.apache.spark.util.collection.SizeTracker.afterUpdate(SizeTracker.scala:70)
at 
org.apache.spark.util.collection.SizeTracker.afterUpdate$(SizeTracker.scala:67)
at 
org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
at 
org.apache.spark.storage.memory.DeserializedValuesHolder.storeValue(MemoryStore.scala:665)
at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:222)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
at 
org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1166)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1092)
at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1157)
at 
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:915)
at 
org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1482)
at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:133)
at 
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:91)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1427)
at 

[jira] [Updated] (SPARK-25889) Dynamic allocation load-aware ramp up

2019-02-21 Thread Adam Kennedy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kennedy updated SPARK-25889:
-
Description: 
The time based exponential ramp up behavior for dynamic allocation is naive and 
destructive, making it very difficult to run very large jobs.

On a large (64,000 core) YARN cluster with a high number of input partitions 
(200,000+) the default dynamic allocation approach of requesting containers in 
waves, doubling exponentially once per second, results in 50% of the entire 
cluster being requested in the final 1 second wave.

This can easily overwhelm RPC processing, or cause expensive Executor startup 
steps to break systems. With the interval so short, many additional containers 
may be requested beyond what is actually needed and then complete very little 
work before sitting around waiting to be deallocated.

Delaying the time between these fixed doublings only has limited impact. 
Setting double intervals to once per minute would result in a very slow ramp up 
speed, at the end of which we still face large potentially crippling waves of 
executor startup.

An alternative approach to spooling up large jobs appears to be needed, one that is 
still relatively simple but more adaptable to different cluster sizes 
and differing cluster and job performance.

I would like to propose a few different approaches based around the general 
idea of controlling outstanding requests for new containers based on the number 
of executors that are currently running, for some definition of "running".

One example might be to limit requests to one new executor for every existing 
executor that currently has an active task, or some ratio of that, to allow for 
more or less aggressive spool-up. A lower ratio would let us approximate 
something like Fibonacci ramp-up; a higher ratio of, say, 2x would spool up 
quickly, while staying aligned with the rate at which broadcast blocks can be 
easily distributed to new members.

An alternative approach might be to limit the escalation rate of new executor 
requests based on the number of outstanding executors requested which have not 
yet fully completed startup and are not available for tasks. To protect against 
a potentially suboptimal very early ramp, a minimum concurrent executor startup 
threshold might allow an initial burst of say 10 executors, after which the 
more gradual ramp math would apply.
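To make the proposal concrete, a minimal sketch of such a load-aware policy (the 
function name and the ratio/burst parameters are illustrative assumptions, not an 
existing Spark API):

{code:scala}
object LoadAwareRampUpSketch {
  /**
   * How many additional executors to request in the next round.
   *
   * @param busyExecutors executors that currently have at least one active task
   * @param startingUp    executors already requested but not yet available for tasks
   * @param ratio         new requests allowed per busy executor
   *                      (1.0 is roughly Fibonacci-like, 2.0 is closer to doubling)
   * @param initialBurst  minimum allowance so the very first wave can get started
   */
  def nextRequestBatch(
      busyExecutors: Int,
      startingUp: Int,
      ratio: Double = 1.0,
      initialBurst: Int = 10): Int = {
    val allowance = math.max(initialBurst, math.ceil(busyExecutors * ratio).toInt)
    // Subtract what is still spinning up so we never flood the cluster manager.
    math.max(0, allowance - startingUp)
  }

  def main(args: Array[String]): Unit = {
    // Early in the job: nothing busy, nothing starting -> just the initial burst of 10.
    println(nextRequestBatch(busyExecutors = 0, startingUp = 0))
    // Later: 100 busy executors, 40 still starting up -> request 60 more at ratio 1.0.
    println(nextRequestBatch(busyExecutors = 100, startingUp = 40))
  }
}
{code}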

  was:
The time based exponential ramp up behavior for dynamic allocation is naive and 
destructive, making it very difficult to run very large jobs.

On a large (64,000 core) YARN cluster with a high number of input partitions 
(200,000+) the default dynamic allocation approach of requesting containers in 
waves, doubling exponentially once per second, results in 50% of the entire 
cluster being requested in the final 1 second wave.

This can easily overwhelm RPC processing, or cause expensive Executor startup 
steps to break systems. With the interval so short, many additional containers 
may be requested beyond what is actually needed and then complete very little 
work before sitting around waiting to be deallocated.

Delaying the time between these fixed doublings only has limited impact. 
Setting double intervals to once per minute would result in a very slow ramp up 
speed, at the end of which we still face large potentially crippling waves of 
executor startup.

An alternative approach to spooling up large job appears to be needed, which is 
still relatively simple but could be more adaptable to different cluster sizes 
and differing cluster and job performance.

I would like to propose a few different approaches based around the general 
idea of controlling outstanding requests for new containers based on the number 
of executors that are currently running, for some definition of "running".

One example might be to limit requests to one new executor for every existing 
executor that currently has an active task. Or some ratio of that, to allow for 
more or less aggressive spool up. A lower number would let us approximate 
something like fibonacci ramp up, a higher number of say 2x would spool up 
quickly, but still aligned with the rate at which broadcast blocks can be 
easily distributed to new members.

 


> Dynamic allocation load-aware ramp up
> -
>
> Key: SPARK-25889
> URL: https://issues.apache.org/jira/browse/SPARK-25889
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler, YARN
>Affects Versions: 2.3.2
>Reporter: Adam Kennedy
>Priority: Major
>
> The time based exponential ramp up behavior for dynamic allocation is naive 
> and destructive, making it very difficult to run very large jobs.
> On a large (64,000 core) YARN cluster with a high number of input partitions 
> (200,000+) the default dynamic 

[jira] [Updated] (SPARK-26962) Windows Function LEAD in Spark SQL is not fetching consistent results.

2019-02-21 Thread Shiva Sankari Perambalam (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiva Sankari Perambalam updated SPARK-26962:
-
Description: 
Using a LEAD function on a DATETIME column is giving inconsistent results in 
Spark SQL.
{code:java}
Lead(date) over (partition by id, code order by date){code}
where date is a DATETIME, and id and code are Strings.
{code:java}
val testdf1 = sparkSession.sql(s""" select date, lead(date) over (partition by 
id, code order by date) as lead_date from foo"""){code}
The result set sometimes has the same data in lead_date as in date.

 

  was:
Using a Lead function on a DATETIME column is giving inconsistent results in 
Spark sql.
{code:java}
Lead(date) over (partition by id, code order by date){code}
where Date is DATETIME, id and code a String.
{code:java}
val testdf1= sparkSession.sql(s""" select date, lead(date) over (partition by 
id, code order by date) as lead_date from """){code}
The result set is sometimes having the same data as the date instead of the 
lead_date

 


> Windows Function LEAD in Spark SQL is not fetching consistent results.
> --
>
> Key: SPARK-26962
> URL: https://issues.apache.org/jira/browse/SPARK-26962
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Shiva Sankari Perambalam
>Priority: Major
>
> Using a Lead function on a DATETIME column is giving inconsistent results in 
> Spark sql.
> {code:java}
> Lead(date) over (partition by id, code order by date){code}
> where Date is DATETIME, id and code a String.
> {code:java}
> val testdf1= sparkSession.sql(s""" select date, lead(date) over (partition by 
> id, code order by date) as lead_date from foo"""){code}
> The result set is sometimes having the same data as the date instead of the 
> lead_date
>  
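One thing worth ruling out when the results vary between runs (an assumption on my 
part, not a confirmed diagnosis): if several rows share the same id, code, and date, 
the ORDER BY is not a total order, so the row LEAD treats as "next" can change 
between executions. A minimal sketch that makes the ordering deterministic by adding 
a tiebreaker column (the {{seq}} column and the toy data are hypothetical; dates are 
kept as strings for brevity):

{code:scala}
import org.apache.spark.sql.SparkSession

object DeterministicLeadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("deterministic-lead").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy data standing in for the reporter's table.
    Seq(
      ("a", "x", "2019-02-01 00:00:00", 1L),
      ("a", "x", "2019-02-01 00:00:00", 2L), // duplicate date within the partition
      ("a", "x", "2019-02-02 00:00:00", 3L)
    ).toDF("id", "code", "date", "seq").createOrReplaceTempView("foo")

    // Ties on `date` leave the window ordering non-deterministic; adding `seq`
    // to the ORDER BY gives LEAD a stable notion of the "next" row.
    spark.sql(
      """select date,
        |       lead(date) over (partition by id, code order by date, seq) as lead_date
        |from foo""".stripMargin).show(truncate = false)

    spark.stop()
  }
}
{code}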






[jira] [Resolved] (SPARK-26958) Add NestedSchemaPruningBenchmark

2019-02-21 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-26958.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23862
[https://github.com/apache/spark/pull/23862]

> Add NestedSchemaPruningBenchmark
> 
>
> Key: SPARK-26958
> URL: https://issues.apache.org/jira/browse/SPARK-26958
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> This adds `NestedSchemaPruningBenchmark` to verify the ongoing PR's performance 
> benefits clearly and to prevent future performance degradation.






[jira] [Updated] (SPARK-26962) Windows Function LEAD in Spark SQL is not fetching consistent results.

2019-02-21 Thread Shiva Sankari Perambalam (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiva Sankari Perambalam updated SPARK-26962:
-
Description: 
Using a Lead function on a DATETIME column is giving inconsistent results in 
Spark sql.
{code:java}
Lead(date) over (partition by id, code order by date){code}
where Date is DATETIME, id and code a String.
{code:java}
val testdf1= sparkSession.sql(s""" select date, lead(date) over (partition by 
id, code order by date) as lead_date from """){code}
The result set is sometimes having the same data as the date instead of the 
lead_date

 

  was:
Using a Lead function on a DATETIME column is giving inconsistent results in 
Spark sql.
{code:java}
Lead(date) over (partition by id, code order by date){code}
where Date is DATETIME, id and code is a String.
{code:java}
val testdf1= sparkSession.sql(s""" select date, lead(date) over (partition by 
id, code order by date) as lead_date from """){code}
The result set is sometimes having the same data as the date instead of the 
lead_date

 


> Windows Function LEAD in Spark SQL is not fetching consistent results.
> --
>
> Key: SPARK-26962
> URL: https://issues.apache.org/jira/browse/SPARK-26962
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Shiva Sankari Perambalam
>Priority: Major
>
> Using a Lead function on a DATETIME column is giving inconsistent results in 
> Spark sql.
> {code:java}
> Lead(date) over (partition by id, code order by date){code}
> where Date is DATETIME, id and code a String.
> {code:java}
> val testdf1= sparkSession.sql(s""" select date, lead(date) over (partition by 
> id, code order by date) as lead_date from """){code}
> The result set is sometimes having the same data as the date instead of the 
> lead_date
>  






[jira] [Updated] (SPARK-26962) Windows Function LEAD in Spark SQL is not fetching consistent results.

2019-02-21 Thread Shiva Sankari Perambalam (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiva Sankari Perambalam updated SPARK-26962:
-
Description: 
Using a Lead function on a DATETIME column is giving inconsistent results in 
Spark sql.
{code:java}
Lead(date) over (partition by id, code order by date){code}
where Date is DATETIME, id and code is a String.
{code:java}
val testdf1= sparkSession.sql(s""" select date, lead(date) over (partition by 
id, code order by date) as lead_date from """){code}
The result set is sometimes having the same data as the date instead of the 
lead_date

 

  was:
Using the LEAD window function on a DATETIME column gives inconsistent results in 
Spark SQL.
{code:java}
Lead(date) over (partition by id, code order by date){code}
where date is a DATETIME column, and id and code are String columns.
{code:java}
val testdf1= sparkSession.sql(s""" select date, Lead(date) over (partition by 
id, code order by date) as lead_date from """){code}
The result set sometimes contains the same value as date instead of the expected 
lead_date.

 


> Windows Function LEAD in Spark SQL is not fetching consistent results.
> --
>
> Key: SPARK-26962
> URL: https://issues.apache.org/jira/browse/SPARK-26962
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Shiva Sankari Perambalam
>Priority: Major
>
> Using the LEAD window function on a DATETIME column gives inconsistent results in 
> Spark SQL.
> {code:java}
> Lead(date) over (partition by id, code order by date){code}
> where date is a DATETIME column, and id and code are String columns.
> {code:java}
> val testdf1= sparkSession.sql(s""" select date, lead(date) over (partition by 
> id, code order by date) as lead_date from """){code}
> The result set sometimes contains the same value as date instead of the expected 
> lead_date.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26962) Windows Function LEAD in Spark SQL is not fetching consistent results.

2019-02-21 Thread Shiva Sankari Perambalam (JIRA)
Shiva Sankari Perambalam created SPARK-26962:


 Summary: Windows Function LEAD in Spark SQL is not fetching 
consistent results.
 Key: SPARK-26962
 URL: https://issues.apache.org/jira/browse/SPARK-26962
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Shiva Sankari Perambalam


Using the LEAD window function on a DATETIME column gives inconsistent results in 
Spark SQL.
{code:java}
Lead(date) over (partition by id, code order by date){code}
where date is a DATETIME column, and id and code are String columns.
{code:java}
val testdf1= sparkSession.sql(s""" select date, Lead(date) over (partition by 
id, code order by date) as lead_date from """){code}
The result set sometimes contains the same value as date instead of the expected 
lead_date.
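
One plausible cause, stated here as an assumption rather than a confirmed diagnosis: when several rows share the same (id, code, date), the sort order among the tied rows is not deterministic, so lead(date) can legitimately return a row whose date equals the current one. A minimal sketch of a deterministic variant follows; the table name events and the unique tie-breaker column event_id are hypothetical.

{code:java}
// Sketch only: "events" and "event_id" are made-up names. Adding a unique
// column to the ORDER BY fixes the ordering of tied dates, which makes
// lead(date) return the same value on every run.
val testdf2 = sparkSession.sql(
  """select date,
    |       lead(date) over (partition by id, code order by date, event_id) as lead_date
    |from events""".stripMargin)
{code}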

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26961) Found Java-level deadlock in Spark Driver

2019-02-21 Thread Rong Jialei (JIRA)
Rong Jialei created SPARK-26961:
---

 Summary: Found Java-level deadlock in Spark Driver
 Key: SPARK-26961
 URL: https://issues.apache.org/jira/browse/SPARK-26961
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.3.0
Reporter: Rong Jialei


Our Spark job usually finishes in minutes; however, we recently found it taking 
days to run, and we could only kill it when this happened.

An investigation shows that the worker containers could not connect to the driver 
after startup and that the driver is hanging. Using jstack, we found a Java-level 
deadlock.
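
Such a hang can also be confirmed programmatically from inside the driver JVM; the following is a generic JDK sketch using ThreadMXBean, not something taken from the reporter's job.

{code:java}
// findDeadlockedThreads returns null when no deadlock exists, otherwise the ids
// of the threads blocked on monitors or ownable synchronizers.
import java.lang.management.ManagementFactory

val threads = ManagementFactory.getThreadMXBean
Option(threads.findDeadlockedThreads()).foreach { ids =>
  // Report each deadlocked thread together with the locks it holds and waits on.
  threads.getThreadInfo(ids, true, true).foreach(info => println(info))
}
{code}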

 

*The jstack output for the deadlock is shown below:*

 

Found one Java-level deadlock:
=
"SparkUI-907":
 waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
org.apache.hadoop.conf.Configuration),
 which is held by "ForkJoinPool-1-worker-57"
"ForkJoinPool-1-worker-57":
 waiting to lock monitor 0x7f3860574298 (object 0x0005b7991168, a 
org.apache.spark.util.MutableURLClassLoader),
 which is held by "ForkJoinPool-1-worker-7"
"ForkJoinPool-1-worker-7":
 waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
org.apache.hadoop.conf.Configuration),
 which is held by "ForkJoinPool-1-worker-57"

Java stack information for the threads listed above:
===
"SparkUI-907":
 at org.apache.hadoop.conf.Configuration.getOverlay(Configuration.java:1328)
 - waiting to lock <0x0005c0c1e5e0> (a org.apache.hadoop.conf.Configuration)
 at 
org.apache.hadoop.conf.Configuration.handleDeprecation(Configuration.java:684)
 at org.apache.hadoop.conf.Configuration.get(Configuration.java:1088)
 at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1145)
 at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2363)
 at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
 at 
org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74)
 at java.net.URL.getURLStreamHandler(URL.java:1142)
 at java.net.URL.<init>(URL.java:599)
 at java.net.URL.<init>(URL.java:490)
 at java.net.URL.<init>(URL.java:439)
 at org.apache.spark.ui.JettyUtils$$anon$4.doRequest(JettyUtils.scala:176)
 at org.apache.spark.ui.JettyUtils$$anon$4.doGet(JettyUtils.scala:161)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
 at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
 at 
org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
 at 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:171)
 at 
org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
 at 
org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
 at 
org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
 at 
org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
 at 
org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
 at 
org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
 at 
org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
 at 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
 at 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
 at org.spark_project.jetty.server.Server.handle(Server.java:534)
 at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:320)
 at 
org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
 at 
org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
 at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
 at 
org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
 at 
org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
 at 
org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
 at java.lang.Thread.run(Thread.java:748)
"ForkJoinPool-1-worker-57":
 at java.lang.ClassLoader.loadClass(ClassLoader.java:404)
 - waiting to lock <0x0005b7991168> (a 
org.apache.spark.util.MutableURLClassLoader)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at org.apache.xerces.parsers.ObjectFactory.findProviderClass(Unknown Source)
 at org.apache.xerces.parsers.ObjectFactory.newInstance(Unknown Source)
 at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
 at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
 at org.apache.xerces.parsers.DOMParser.<init>(Unknown Source)
 at 

[jira] [Commented] (SPARK-26943) Weird behaviour with `.cache()`

2019-02-21 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774566#comment-16774566
 ] 

Bryan Cutler commented on SPARK-26943:
--

Could you please provide a complete script to reproduce? Also, have you tried a 
newer version of Spark other than 2.1.0?
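
For reference, one way such a repro could look is sketched below. This sketch is built entirely on assumptions not present in the report: that the DataFrame comes from a CSV file, that a numeric column contains the literal string "(N/A)", and that the path and schema are made up. The idea is that count() on the raw DataFrame may never cast the bad column, while cache() materializes every column and so surfaces the parse error.

{code:java}
// Hypothetical repro: "/path/to/data.csv" and the schema are made up.
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("amount", DoubleType)))   // the column assumed to contain "(N/A)"

val sdf = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("/path/to/data.csv")

sdf.count()            // may succeed: the bad value is never cast to a number

val cached = sdf.cache()
cached.count()         // full materialization; the unparseable value can fail here
{code}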

> Weird behaviour with `.cache()`
> ---
>
> Key: SPARK-26943
> URL: https://issues.apache.org/jira/browse/SPARK-26943
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Will Uto
>Priority: Major
>
>  
> {code:java}
> sdf.count(){code}
>  
> works fine. However:
>  
> {code:java}
> sdf = sdf.cache()
> sdf.count()
> {code}
>  does not, and produces this error:
> {code:java}
> Py4JJavaError: An error occurred while calling o314.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 75 
> in stage 8.0 failed 4 times, most recent failure: Lost task 75.3 in stage 8.0 
> (TID 438, uat-datanode-02, executor 1): java.text.ParseException: Unparseable 
> number: "(N/A)"
>   at java.text.NumberFormat.parse(NumberFormat.java:350)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26957) Add config properties to configure the default scheduler pool priorities

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26957:


Assignee: (was: Apache Spark)

> Add config properties to configure the default scheduler pool priorities
> 
>
> Key: SPARK-26957
> URL: https://issues.apache.org/jira/browse/SPARK-26957
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Dave DeCaprio
>Priority: Minor
>
> Currently, it is possible to dynamically create new scheduler pools in Spark 
> just by setting {{spark.scheduler.pool.}} to a new value.
> We use this capability to create separate pools for different projects that 
> run jobs in the same long-lived driver application. Each project gets its own 
> pool, and within the pool jobs are executed in a FIFO manner.
> This setup works well, except that we also have a low priority queue for 
> background tasks. We would prefer for all of the dynamic pools to have a 
> higher priority than this background queue. 
>  We can accomplish this by hardcoding the project queue names in a 
> spark_pools.xml config file and setting their priority to 100.
> Unfortunately, there is no way to set the priority for dynamically created 
> pools.  They are all hardcoded to 1.  It would be nice if there were 
> configuration settings to change this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26957) Add config properties to configure the default scheduler pool priorities

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26957:


Assignee: Apache Spark

> Add config properties to configure the default scheduler pool priorities
> 
>
> Key: SPARK-26957
> URL: https://issues.apache.org/jira/browse/SPARK-26957
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Dave DeCaprio
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, it is possible to dynamically create new scheduler pools in Spark 
> just by setting {{spark.scheduler.pool.}} to a new value.
> We use this capability to create separate pools for different projects that 
> run jobs in the same long-lived driver application. Each project gets its own 
> pool, and within the pool jobs are executed in a FIFO manner.
> This setup works well, except that we also have a low priority queue for 
> background tasks. We would prefer for all of the dynamic pools to have a 
> higher priority than this background queue. 
>  We can accomplish this by hardcoding the project queue names in a 
> spark_pools.xml config file and setting their priority to 100.
> Unfortunately, there is no way to set the priority for dynamically created 
> pools.  They are all hardcoded to 1.  It would be nice if there were 
> configuration settings to change this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10856) SQL Server dialect needs to map java.sql.Timestamp to DATETIME instead of TIMESTAMP

2019-02-21 Thread Srinivas Gade (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774473#comment-16774473
 ] 

Srinivas Gade commented on SPARK-10856:
---

I see a similar issue with the BIT data type; can someone confirm whether this 
has been addressed? I am using Spark 2.1.1.

Below is the error message:

java.sql.SQLException: Column, parameter, or variable #5: Cannot specify a 
column width on data type bit.
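
A possible workaround, sketched below as an assumption rather than a verified fix: register a custom JdbcDialect that emits plain BIT for booleans (instead of BIT(1), which SQL Server rejects with exactly this error) and DATETIME for timestamps, before calling df.write.jdbc(...).

{code:java}
// Sketch of a custom dialect; the type mappings are assumptions about what this
// SQL Server version accepts, not a tested patch.
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

object MsSqlFixDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case BooleanType   => Some(JdbcType("BIT", Types.BIT))            // not BIT(1)
    case TimestampType => Some(JdbcType("DATETIME", Types.TIMESTAMP)) // not TIMESTAMP
    case _             => None                                        // defer to the default mapping
  }
}

JdbcDialects.registerDialect(MsSqlFixDialect)
{code}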

> SQL Server dialect needs to map java.sql.Timestamp to DATETIME instead of 
> TIMESTAMP
> ---
>
> Key: SPARK-10856
> URL: https://issues.apache.org/jira/browse/SPARK-10856
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Henrik Behrens
>Assignee: Liang-Chi Hsieh
>Priority: Major
>  Labels: patch
> Fix For: 1.6.0
>
>
> When saving a DataFrame to MS SQL Server, an error is thrown if there is more 
> than one TIMESTAMP column:
> df.printSchema
> root
>  |-- Id: string (nullable = false)
>  |-- TypeInformation_CreatedBy: string (nullable = false)
>  |-- TypeInformation_ModifiedBy: string (nullable = true)
>  |-- TypeInformation_TypeStatus: integer (nullable = false)
>  |-- TypeInformation_CreatedAtDatabase: timestamp (nullable = false)
>  |-- TypeInformation_ModifiedAtDatabase: timestamp (nullable = true)
> df.write.mode("overwrite").jdbc(url, tablename, props)
> com.microsoft.sqlserver.jdbc.SQLServerException: A table can only have one 
> timestamp column. Because table 'DebtorTypeSet1' already has one, the column 
> 'TypeInformation_ModifiedAtDatabase' cannot be added.
> at 
> com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError
> (SQLServerException.java:217)
> at 
> com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServ
> erStatement.java:1635)
> at 
> com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePrep
> aredStatement(SQLServerPreparedStatement.java:426)
> at 
> com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecC
> md.doExecute(SQLServerPreparedStatement.java:372)
> at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:6276)
> at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLSe
> rverConnection.java:1793)
> at 
> com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLSer
> verStatement.java:184)
> at 
> com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLS
> erverStatement.java:159)
> at 
> com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeUpdate
> (SQLServerPreparedStatement.java:315)
> I tested this on Windows and SQL Server 12 using Spark 1.4.1.
> I think this can be fixed in a similar way to SPARK-10419.
> As a refererence, here is the type mapping according to the SQL Server JDBC 
> driver (basicDT.java, extracted from sqljdbc_4.2.6420.100_enu.exe):
>private static void displayRow(String title, ResultSet rs) {
>   try {
>  System.out.println(title);
>  System.out.println(rs.getInt(1) + " , " +// SQL integer 
> type.
>rs.getString(2) + " , " +  // SQL char 
> type.
>rs.getString(3) + " , " +  // SQL varchar 
> type.
>rs.getBoolean(4) + " , " + // SQL bit type.
>rs.getDouble(5) + " , " +  // SQL decimal 
> type.
>rs.getDouble(6) + " , " +  // SQL money 
> type.
>rs.getTimestamp(7) + " , " +   // SQL datetime 
> type.
>rs.getDate(8) + " , " +// SQL date 
> type.
>rs.getTime(9) + " , " +// SQL time 
> type.
>rs.getTimestamp(10) + " , " +  // SQL 
> datetime2 type.
>((SQLServerResultSet)rs).getDateTimeOffset(11)); // SQL 
> datetimeoffset type. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26959) Join of two tables, bucketed the same way, on bucket columns and one or more other columns should not need a shuffle

2019-02-21 Thread Natang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Natang updated SPARK-26959:
---
Affects Version/s: 2.4.0

> Join of two tables, bucketed the same way, on bucket columns and one or more 
> other columns should not need a shuffle
> 
>
> Key: SPARK-26959
> URL: https://issues.apache.org/jira/browse/SPARK-26959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1, 2.4.0
>Reporter: Natang
>Priority: Major
>
> _When two tables that are bucketed the same way are joined on the bucket 
> columns plus one or more other columns, Spark should be able to perform the 
> join without doing a shuffle._
> Consider the example below. There are two tables, 'join_left_table' and 
> 'join_right_table', bucketed by 'col1' into 4 buckets. When these tables are 
> joined on 'col1' and 'col2', Spark should be able to do the join without 
> having to do a shuffle. All entries for a given value of 'col1' would be in 
> the same bucket for both tables, irrespective of the values of 'col2'.
>  
> 
>  
>  
> {noformat}
> def randomInt1to100 = scala.util.Random.nextInt(100)+1
> val left = sc.parallelize(
>   Seq.fill(100){(randomInt1to100,randomInt1to100,randomInt1to100)}
> ).toDF("col1", "col2", "col3")
> val right = sc.parallelize(
>   Seq.fill(100){(randomInt1to100,randomInt1to100,randomInt1to100)}
> ).toDF("col1", "col2", "col3")
> import org.apache.spark.sql.SaveMode
> left.write
> .bucketBy(4,"col1")
> .sortBy("col1", "col2")
> .mode(SaveMode.Overwrite)
> .saveAsTable("join_left_table")
> 
> right.write
> .bucketBy(4,"col1")
> .sortBy("col1", "col2")
> .mode(SaveMode.Overwrite)
> .saveAsTable("join_right_table")
> val left_table = spark.read.table("join_left_table")
> val right_table = spark.read.table("join_right_table")
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
> val join_on_col1=left_table.join(
> right_table,
> Seq("col1"))
> join_on_col1.explain
> ### BEGIN Output
> join_on_col1: org.apache.spark.sql.DataFrame = [col1: int, col2: int ... 3 
> more fields]
> == Physical Plan ==
> *Project [col1#250, col2#251, col3#252, col2#258, col3#259]
> +- *SortMergeJoin [col1#250], [col1#257], Inner
>:- *Sort [col1#250 ASC NULLS FIRST], false, 0
>:  +- *Project [col1#250, col2#251, col3#252]
>: +- *Filter isnotnull(col1#250)
>:+- *FileScan parquet 
> default.join_left_table[col1#250,col2#251,col3#252] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://ip-10-0-108-205.ec2.internal:8020/user/spark/warehouse/join_left_table],
>  PartitionFilters: [], PushedFilters: [IsNotNull(col1)], ReadSchema: 
> struct
>+- *Sort [col1#257 ASC NULLS FIRST], false, 0
>   +- *Project [col1#257, col2#258, col3#259]
>  +- *Filter isnotnull(col1#257)
> +- *FileScan parquet 
> default.join_right_table[col1#257,col2#258,col3#259] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://ip-10-0-108-205.ec2.internal:8020/user/spark/warehouse/join_right_table],
>  PartitionFilters: [], PushedFilters: [IsNotNull(col1)], ReadSchema: 
> struct
> ### END Output
> val join_on_col1_col2=left_table.join(
> right_table,
> Seq("col1","col2"))
> join_on_col1_col2.explain
> ### BEGIN Output
> join_on_col1_col2: org.apache.spark.sql.DataFrame = [col1: int, col2: int ... 
> 2 more fields]
> == Physical Plan ==
> *Project [col1#250, col2#251, col3#252, col3#259]
> +- *SortMergeJoin [col1#250, col2#251], [col1#257, col2#258], Inner
>:- *Sort [col1#250 ASC NULLS FIRST, col2#251 ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(col1#250, col2#251, 200)
>: +- *Project [col1#250, col2#251, col3#252]
>:+- *Filter (isnotnull(col2#251) && isnotnull(col1#250))
>:   +- *FileScan parquet 
> default.join_left_table[col1#250,col2#251,col3#252] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://ip-10-0-108-205.ec2.internal:8020/user/spark/warehouse/join_left_table],
>  PartitionFilters: [], PushedFilters: [IsNotNull(col2), IsNotNull(col1)], 
> ReadSchema: struct
>+- *Sort [col1#257 ASC NULLS FIRST, col2#258 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(col1#257, col2#258, 200)
>  +- *Project [col1#257, col2#258, col3#259]
> +- *Filter (isnotnull(col2#258) && isnotnull(col1#257))
>+- *FileScan parquet 
> default.join_right_table[col1#257,col2#258,col3#259] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://ip-10-0-108-205.ec2.internal:8020/user/spark/warehouse/join_right_table],
>  PartitionFilters: [], PushedFilters: 

[jira] [Updated] (SPARK-26960) Reduce flakiness of Spark ML Listener test suite by waiting for listener bus to clear

2019-02-21 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-26960:
--
Description: 
[SPARK-23674] added SparkListeners for some spark.ml events, as well as a test 
suite.  I've observed flakiness in the test suite (though I unfortunately only 
have local logs for failures, not permalinks).  Looking at logs, here's my 
assessment, which I confirmed with [~zsxwing].

Test failure I saw:
{code}
[info] - pipeline read/write events *** FAILED *** (10 seconds, 253 
milliseconds)
[info]   The code passed to eventually never returned normally. Attempted 20 
times over 10.01409 seconds. Last failure message: Unexpected event thrown: 
SaveInstanceEnd(org.apache.spark.ml.util.DefaultParamsWriter@6daf89c2,/home/jenkins/workspace/mllib/target/tmp/spark-572952d8-5d2d-4a6c-bd4f-940bb0bbc3d5/pipeline/stages/0_writableStage).
 (MLEventsSuite.scala:190)
[info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
[info]   at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
[info]   at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
[info]   at org.apache.spark.ml.MLEventsSuite.eventually(MLEventsSuite.scala:39)
[info]   at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
[info]   at org.apache.spark.ml.MLEventsSuite.eventually(MLEventsSuite.scala:39)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(MLEventsSuite.scala:190)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(MLEventsSuite.scala:165)
[info]   at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:314)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1.apply$mcV$sp(MLEventsSuite.scala:165)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1.apply(MLEventsSuite.scala:161)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1.apply(MLEventsSuite.scala:161)
[info]   at 
org.apache.spark.SparkFunSuite$$anonfun$test$1.apply(SparkFunSuite.scala:284)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:184)
[info]   at 
org.apache.spark.SparkFunSuite.runWithCredentials(SparkFunSuite.scala:301)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:165)
[info]   at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
[info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
[info]   at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64)
[info]   at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:221)
[info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
[info]   at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
[info]   at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
[info]   at scala.collection.immutable.List.foreach(List.scala:392)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
[info]   at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
[info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
[info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
[info]   at org.scalatest.Suite$class.run(Suite.scala:1147)
[info]   at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
[info]   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
[info]   at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:64)
[info]   at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info]   at 

[jira] [Created] (SPARK-26960) Reduce flakiness of Spark ML Listener test suite by waiting for listener bus to clear

2019-02-21 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-26960:
-

 Summary: Reduce flakiness of Spark ML Listener test suite by 
waiting for listener bus to clear
 Key: SPARK-26960
 URL: https://issues.apache.org/jira/browse/SPARK-26960
 Project: Spark
  Issue Type: Improvement
  Components: ML, Tests
Affects Versions: 3.0.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley


[SPARK-23674] added SparkListeners for some spark.ml events, as well as a test 
suite.  I've observed flakiness in the test suite (though I unfortunately only 
have local logs for failures, not permalinks).  Looking at logs, here's my 
assessment, which I confirmed with [~zsxwing].
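
The fix suggested by the title, waiting for the listener bus to clear before asserting, would look roughly like the sketch below. This is an assumption about the shape of the change: sc is the suite's SparkContext, and the private[spark] listenerBus is reachable because the suite lives under the org.apache.spark package.

{code:java}
// Sketch: drain events left over from the previous test before asserting, so a
// stale SaveInstanceEnd cannot land inside this test's eventually block.
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

sc.listenerBus.waitUntilEmpty(10000L)   // wait up to 10s for pending events
eventually(timeout(10.seconds)) {
  // assertions over the events captured by this suite's listener go here
}
{code}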

Test failure I saw:
{code}
[info] - pipeline read/write events *** FAILED *** (10 seconds, 253 
milliseconds)
[info]   The code passed to eventually never returned normally. Attempted 20 
times over 10.01409 seconds. Last failure message: Unexpected event thrown: 
SaveInstanceEnd(org.apache.spark.ml.util.DefaultParamsWriter@6daf89c2,/home/jenkins/workspace/mllib/target/tmp/spark-572952d8-5d2d-4a6c-bd4f-940bb0bbc3d5/pipeline/stages/0_writableStage).
 (MLEventsSuite.scala:190)
[info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
[info]   at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
[info]   at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
[info]   at org.apache.spark.ml.MLEventsSuite.eventually(MLEventsSuite.scala:39)
[info]   at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
[info]   at org.apache.spark.ml.MLEventsSuite.eventually(MLEventsSuite.scala:39)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(MLEventsSuite.scala:190)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(MLEventsSuite.scala:165)
[info]   at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:314)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1.apply$mcV$sp(MLEventsSuite.scala:165)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1.apply(MLEventsSuite.scala:161)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1.apply(MLEventsSuite.scala:161)
[info]   at 
org.apache.spark.SparkFunSuite$$anonfun$test$1.apply(SparkFunSuite.scala:284)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:184)
[info]   at 
org.apache.spark.SparkFunSuite.runWithCredentials(SparkFunSuite.scala:301)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:165)
[info]   at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
[info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
[info]   at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64)
[info]   at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:221)
[info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
[info]   at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
[info]   at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
[info]   at scala.collection.immutable.List.foreach(List.scala:392)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
[info]   at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
[info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
[info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
[info]   at org.scalatest.Suite$class.run(Suite.scala:1147)
[info]   at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
[info]   at 

[jira] [Created] (SPARK-26959) Join of two tables, bucketed the same way, on bucket columns and one or more other columns should not need a shuffle

2019-02-21 Thread Natang (JIRA)
Natang created SPARK-26959:
--

 Summary: Join of two tables, bucketed the same way, on bucket 
columns and one or more other columns should not need a shuffle
 Key: SPARK-26959
 URL: https://issues.apache.org/jira/browse/SPARK-26959
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.1
Reporter: Natang


_When two tables that are bucketed the same way are joined on the bucket 
columns plus one or more other columns, Spark should be able to perform the join 
without doing a shuffle._

Consider the example below. There are two tables, 'join_left_table' and 
'join_right_table', bucketed by 'col1' into 4 buckets. When these tables are 
joined on 'col1' and 'col2', Spark should be able to do the join without having 
to do a shuffle. All entries for a given value of 'col1' would be in the same 
bucket for both tables, irrespective of the values of 'col2'.

 

 

 
{noformat}
def randomInt1to100 = scala.util.Random.nextInt(100)+1

val left = sc.parallelize(
  Seq.fill(100){(randomInt1to100,randomInt1to100,randomInt1to100)}
).toDF("col1", "col2", "col3")

val right = sc.parallelize(
  Seq.fill(100){(randomInt1to100,randomInt1to100,randomInt1to100)}
).toDF("col1", "col2", "col3")


import org.apache.spark.sql.SaveMode

left.write
.bucketBy(4,"col1")
.sortBy("col1", "col2")
.mode(SaveMode.Overwrite)
.saveAsTable("join_left_table")

right.write
.bucketBy(4,"col1")
.sortBy("col1", "col2")
.mode(SaveMode.Overwrite)
.saveAsTable("join_right_table")


val left_table = spark.read.table("join_left_table")

val right_table = spark.read.table("join_right_table")

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val join_on_col1=left_table.join(
right_table,
Seq("col1"))

join_on_col1.explain

### BEGIN Output
join_on_col1: org.apache.spark.sql.DataFrame = [col1: int, col2: int ... 3 more 
fields]
== Physical Plan ==
*Project [col1#250, col2#251, col3#252, col2#258, col3#259]
+- *SortMergeJoin [col1#250], [col1#257], Inner
   :- *Sort [col1#250 ASC NULLS FIRST], false, 0
   :  +- *Project [col1#250, col2#251, col3#252]
   : +- *Filter isnotnull(col1#250)
   :+- *FileScan parquet 
default.join_left_table[col1#250,col2#251,col3#252] Batched: true, Format: 
Parquet, Location: 
InMemoryFileIndex[hdfs://ip-10-0-108-205.ec2.internal:8020/user/spark/warehouse/join_left_table],
 PartitionFilters: [], PushedFilters: [IsNotNull(col1)], ReadSchema: 
struct
   +- *Sort [col1#257 ASC NULLS FIRST], false, 0
  +- *Project [col1#257, col2#258, col3#259]
 +- *Filter isnotnull(col1#257)
+- *FileScan parquet 
default.join_right_table[col1#257,col2#258,col3#259] Batched: true, Format: 
Parquet, Location: 
InMemoryFileIndex[hdfs://ip-10-0-108-205.ec2.internal:8020/user/spark/warehouse/join_right_table],
 PartitionFilters: [], PushedFilters: [IsNotNull(col1)], ReadSchema: 
struct
### END Output

val join_on_col1_col2=left_table.join(
right_table,
Seq("col1","col2"))

join_on_col1_col2.explain

### BEGIN Output
join_on_col1_col2: org.apache.spark.sql.DataFrame = [col1: int, col2: int ... 2 
more fields]
== Physical Plan ==
*Project [col1#250, col2#251, col3#252, col3#259]
+- *SortMergeJoin [col1#250, col2#251], [col1#257, col2#258], Inner
   :- *Sort [col1#250 ASC NULLS FIRST, col2#251 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(col1#250, col2#251, 200)
   : +- *Project [col1#250, col2#251, col3#252]
   :+- *Filter (isnotnull(col2#251) && isnotnull(col1#250))
   :   +- *FileScan parquet 
default.join_left_table[col1#250,col2#251,col3#252] Batched: true, Format: 
Parquet, Location: 
InMemoryFileIndex[hdfs://ip-10-0-108-205.ec2.internal:8020/user/spark/warehouse/join_left_table],
 PartitionFilters: [], PushedFilters: [IsNotNull(col2), IsNotNull(col1)], 
ReadSchema: struct
   +- *Sort [col1#257 ASC NULLS FIRST, col2#258 ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(col1#257, col2#258, 200)
 +- *Project [col1#257, col2#258, col3#259]
+- *Filter (isnotnull(col2#258) && isnotnull(col1#257))
   +- *FileScan parquet 
default.join_right_table[col1#257,col2#258,col3#259] Batched: true, Format: 
Parquet, Location: 
InMemoryFileIndex[hdfs://ip-10-0-108-205.ec2.internal:8020/user/spark/warehouse/join_right_table],
 PartitionFilters: [], PushedFilters: [IsNotNull(col2), IsNotNull(col1)], 
ReadSchema: struct
### END Output{noformat}
 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26958) Add NestedSchemaPruningBenchmark

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26958:


Assignee: Apache Spark

> Add NestedSchemaPruningBenchmark
> 
>
> Key: SPARK-26958
> URL: https://issues.apache.org/jira/browse/SPARK-26958
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> This adds `NestedSchemaPruningBenchmark` to verify the ongoing PR's performance 
> benefits clearly and to prevent future performance degradation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26958) Add NestedSchemaPruningBenchmark

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26958:


Assignee: (was: Apache Spark)

> Add NestedSchemaPruningBenchmark
> 
>
> Key: SPARK-26958
> URL: https://issues.apache.org/jira/browse/SPARK-26958
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This adds `NestedSchemaPruningBenchmark` to verify the ongoing PR's performance 
> benefits clearly and to prevent future performance degradation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26958) Add NestedSchemaPruningBenchmark

2019-02-21 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-26958:
-

 Summary: Add NestedSchemaPruningBenchmark
 Key: SPARK-26958
 URL: https://issues.apache.org/jira/browse/SPARK-26958
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


This adds `NestedSchemaPruningBenchmark` to verify the ongoing PR's performance 
benefits clearly and to prevent future performance degradation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26954) Do not attempt when user code throws exception

2019-02-21 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26954.

Resolution: Not A Problem

See discussion in the PR for why this is not a problem (or something that Spark 
should try to solve).

> Do not attempt when user code throws exception
> -
>
> Key: SPARK-26954
> URL: https://issues.apache.org/jira/browse/SPARK-26954
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.3, 2.4.0
>Reporter: deshanxiao
>Priority: Critical
>
> YARN re-attempts a failed application based on YarnRMClient#unregister. However, 
> some attempts are useless, as in this example:
> {code:java}
> sc.parallelize(Seq(1,2,3)).map(_ => throw new 
> RuntimeException("exception")).collect()
> {code}
> Re-attempting when a "FileNotFoundException" is thrown in user code also looks 
> unreasonable.
> Re-attempting on environment errors, such as a dead node, is reasonable, so it 
> would be better not to re-attempt when the failure comes from a user-code exception.
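
A workaround sketch, which is not the change proposed in the ticket: cap the YARN attempt count so a deterministic user-code failure is not retried. spark.yarn.maxAppAttempts is the relevant setting; it is normally supplied at submit time as --conf spark.yarn.maxAppAttempts=1 and is shown on a SparkConf here only for illustration.

{code:java}
// Sketch: limit the application to a single attempt so YARN does not retry a
// deterministic user-code failure.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("no-reattempt-example")
  .set("spark.yarn.maxAppAttempts", "1")
{code}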



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26957) Add config properties to configure the default scheduler pool priorities

2019-02-21 Thread Dave DeCaprio (JIRA)
Dave DeCaprio created SPARK-26957:
-

 Summary: Add config properties to configure the default scheduler 
pool priorities
 Key: SPARK-26957
 URL: https://issues.apache.org/jira/browse/SPARK-26957
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.4.0
Reporter: Dave DeCaprio


Currently, it is possible to dynamically create new scheduler pools in Spark 
just by setting {{spark.scheduler.pool.}} to a new value.

We use this capability to create separate pools for different projects that run 
jobs in the same long-lived driver application. Each project gets its own pool, 
and within the pool jobs are executed in a FIFO manner.

This setup works well, except that we also have a low priority queue for 
background tasks. We would prefer for all of the dynamic pools to have a higher 
priority than this background queue. 
 We can accomplish this by hardcoding the project queue names in a 
spark_pools.xml config file and setting their priority to 100.

Unfortunately, there is no way to set the priority for dynamically created 
pools.  They are all hardcoded to 1.  It would be nice if there were 
configuration settings to change this.
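
For reference, the workflow described above looks roughly like this sketch; the pool name projectA is hypothetical, and the allocation file is the one pointed to by spark.scheduler.allocation.file.

{code:java}
// Jobs submitted after setLocalProperty run in the named pool. Pools that are
// not declared in the allocation file are created on the fly and currently get
// the hardcoded priority of 1, which is what this ticket proposes making
// configurable.
val sc = spark.sparkContext                       // assumes an active SparkSession `spark`
sc.setLocalProperty("spark.scheduler.pool", "projectA")
// ... run this project's jobs ...
sc.setLocalProperty("spark.scheduler.pool", null) // revert to the default pool
{code}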



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26956) remove streaming output mode from data source v2 APIs

2019-02-21 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-26956:

Summary: remove streaming output mode from data source v2 APIs  (was: 
translate streaming output mode to write operators)

> remove streaming output mode from data source v2 APIs
> -
>
> Key: SPARK-26956
> URL: https://issues.apache.org/jira/browse/SPARK-26956
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26956) remove streaming output mode from data source v2 APIs

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26956:


Assignee: Wenchen Fan  (was: Apache Spark)

> remove streaming output mode from data source v2 APIs
> -
>
> Key: SPARK-26956
> URL: https://issues.apache.org/jira/browse/SPARK-26956
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26956) remove streaming output mode from data source v2 APIs

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26956:


Assignee: Apache Spark  (was: Wenchen Fan)

> remove streaming output mode from data source v2 APIs
> -
>
> Key: SPARK-26956
> URL: https://issues.apache.org/jira/browse/SPARK-26956
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26956) translate streaming output mode to write operators

2019-02-21 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-26956:
---

 Summary: translate streaming output mode to write operators
 Key: SPARK-26956
 URL: https://issues.apache.org/jira/browse/SPARK-26956
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24935) Problem with Executing Hive UDF's from Spark 2.2 Onwards

2019-02-21 Thread Parth Gandhi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774170#comment-16774170
 ] 

Parth Gandhi commented on SPARK-24935:
--

Thank you [~gavin_hu] for reporting the issue in the first place. I have sent an 
email to the Spark dev mailing list requesting that the fix be included in Spark 
2.4.1. I will let you know if there are any updates.

> Problem with Executing Hive UDF's from Spark 2.2 Onwards
> 
>
> Key: SPARK-24935
> URL: https://issues.apache.org/jira/browse/SPARK-24935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: Parth Gandhi
>Priority: Major
>
> A user of sketches library(https://github.com/DataSketches/sketches-hive) 
> reported an issue with HLL Sketch Hive UDAF that seems to be a bug in Spark 
> or Hive. Their code runs fine in 2.1 but has an issue from 2.2 onwards. For 
> more details on the issue, you can refer to the discussion in the 
> sketches-user list:
> [https://groups.google.com/forum/?utm_medium=email_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ]
>  
> On further debugging, we figured out that from 2.2 onwards, Spark hive UDAF 
> provides support for partial aggregation, and has removed the functionality 
> that supported complete mode aggregation(Refer 
> https://issues.apache.org/jira/browse/SPARK-19060 and 
> https://issues.apache.org/jira/browse/SPARK-18186). Thus, instead of 
> expecting update method to be called, merge method is called here 
> ([https://github.com/DataSketches/sketches-hive/blob/master/src/main/java/com/yahoo/sketches/hive/hll/SketchEvaluator.java#L56)]
>  which throws the exception as described in the forums above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26919) change maven default compile java home

2019-02-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26919.
---
   Resolution: Not A Problem
Fix Version/s: (was: 2.4.0)

> change maven default compile java home
> --
>
> Key: SPARK-26919
> URL: https://issues.apache.org/jira/browse/SPARK-26919
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.1
>Reporter: daile
>Priority: Critical
> Attachments: p1.png
>
>
> When I use "build/mvn -DskipTests clean package", the default java home 
> configuration is "${java.home}". I tried macOS and Windows environments and 
> found that the default java.home is */jre, but the JRE does not provide the 
> javac compile command. So I think it can be replaced with the system 
> environment variable; with that change the build compiles successfully.
> !image-2019-02-19-10-25-02-872.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26917) CacheManager blocks while traversing plans

2019-02-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26917.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23833
[https://github.com/apache/spark/pull/23833]

> CacheManager blocks while traversing plans
> --
>
> Key: SPARK-26917
> URL: https://issues.apache.org/jira/browse/SPARK-26917
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: We are running on AWS EMR 5.20, so Spark 2.4.0.
>Reporter: Dave DeCaprio
>Assignee: Dave DeCaprio
>Priority: Minor
> Fix For: 3.0.0
>
>
> This is related to SPARK-26548 and SPARK-26617. The CacheManager also locks 
> during the recacheByCondition operation. For large plans, evaluation of the 
> condition can take a long time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26917) CacheManager blocks while traversing plans

2019-02-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26917:
-

Assignee: Dave DeCaprio

> CacheManager blocks while traversing plans
> --
>
> Key: SPARK-26917
> URL: https://issues.apache.org/jira/browse/SPARK-26917
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: We are running on AWS EMR 5.20, so Spark 2.4.0.
>Reporter: Dave DeCaprio
>Assignee: Dave DeCaprio
>Priority: Minor
>
> This is related to SPARK-26548 and SPARK-26617. The CacheManager also locks 
> during the recacheByCondition operation. For large plans, evaluation of the 
> condition can take a long time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-26919) change maven default compile java home

2019-02-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-26919:
---

> change maven default compile java home
> --
>
> Key: SPARK-26919
> URL: https://issues.apache.org/jira/browse/SPARK-26919
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.1
>Reporter: daile
>Priority: Critical
> Fix For: 2.4.0
>
> Attachments: p1.png
>
>
> When I use "build/mvn -DskipTests clean package", the default java home 
> configuration is "${java.home}". I tried macOS and Windows environments and 
> found that the default java.home is */jre, but the JRE does not provide the 
> javac compile command. So I think it can be replaced with the system 
> environment variable; with that change the build compiles successfully.
> !image-2019-02-19-10-25-02-872.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26955) Align Spark's TimSort to JDK 11 TimSort

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26955:


Assignee: Apache Spark

> Align Spark's TimSort to JDK 11 TimSort
> ---
>
> Key: SPARK-26955
> URL: https://issues.apache.org/jira/browse/SPARK-26955
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> There are a couple of differences between Spark's TimSort and JDK 11's TimSort:
> - an additional check in mergeCollapse, introduced by 
> http://hg.openjdk.java.net/jdk/jdk/rev/3a6d47df8239#l1.34
> - increased constants for stackLen 
> (http://hg.openjdk.java.net/jdk/jdk11/file/1ddf9a99e4ad/src/java.base/share/classes/java/util/TimSort.java#l184)
> This ticket aims to address those differences.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26955) Align Spark's TimSort to JDK 11 TimSort

2019-02-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774115#comment-16774115
 ] 

Apache Spark commented on SPARK-26955:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23858

> Align Spark's TimSort to JDK 11 TimSort
> ---
>
> Key: SPARK-26955
> URL: https://issues.apache.org/jira/browse/SPARK-26955
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> There are a couple of differences between Spark's TimSort and JDK 11's TimSort:
> - an additional check in mergeCollapse, introduced by 
> http://hg.openjdk.java.net/jdk/jdk/rev/3a6d47df8239#l1.34
> - increased constants for stackLen 
> (http://hg.openjdk.java.net/jdk/jdk11/file/1ddf9a99e4ad/src/java.base/share/classes/java/util/TimSort.java#l184)
> This ticket aims to address those differences.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26955) Align Spark's TimSort to JDK 11 TimSort

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26955:


Assignee: (was: Apache Spark)

> Align Spark's TimSort to JDK 11 TimSort
> ---
>
> Key: SPARK-26955
> URL: https://issues.apache.org/jira/browse/SPARK-26955
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> There are a couple of differences between Spark's TimSort and JDK 11's TimSort:
> - an additional check in mergeCollapse, introduced by 
> http://hg.openjdk.java.net/jdk/jdk/rev/3a6d47df8239#l1.34
> - increased constants for stackLen 
> (http://hg.openjdk.java.net/jdk/jdk11/file/1ddf9a99e4ad/src/java.base/share/classes/java/util/TimSort.java#l184)
> This ticket aims to address those differences.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26955) Align Spark's TimSort to JDK 11 TimSort

2019-02-21 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26955:
--

 Summary: Align Spark's TimSort to JDK 11 TimSort
 Key: SPARK-26955
 URL: https://issues.apache.org/jira/browse/SPARK-26955
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Maxim Gekk


There are a couple of differences between Spark's TimSort and JDK 11's TimSort:
- an additional check in mergeCollapse, introduced by 
http://hg.openjdk.java.net/jdk/jdk/rev/3a6d47df8239#l1.34
- increased constants for stackLen 
(http://hg.openjdk.java.net/jdk/jdk11/file/1ddf9a99e4ad/src/java.base/share/classes/java/util/TimSort.java#l184)

This ticket aims to address those differences.
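
For illustration, and as a recollection of the JDK sources rather than a verified quote, the stackLen difference amounts to larger bounds for big inputs; in Scala notation:

{code:java}
// Sketch of the stackLen selection (values recalled from the JDK sources; treat
// them as approximate). JDK 11 uses the larger 24/49 bounds, while Spark's copy
// still uses the older 19/40 bounds.
def stackLenJdk11(len: Int): Int =
  if (len < 120) 5 else if (len < 1542) 10 else if (len < 119151) 24 else 49

def stackLenSpark(len: Int): Int =
  if (len < 120) 5 else if (len < 1542) 10 else if (len < 119151) 19 else 40
{code}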



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18364) Expose metrics for YarnShuffleService

2019-02-21 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774084#comment-16774084
 ] 

Sean Owen commented on SPARK-18364:
---

There isn't going to be a 2.5 

> Expose metrics for YarnShuffleService
> -
>
> Key: SPARK-18364
> URL: https://issues.apache.org/jira/browse/SPARK-18364
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.1
>Reporter: Steven Rand
>Assignee: Marek Simunek
>Priority: Major
> Fix For: 3.0.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> ExternalShuffleService exposes metrics as of SPARK-16405. However, 
> YarnShuffleService does not.
> The work of instrumenting ExternalShuffleBlockHandler was already done in 
> SPARK-16405, so this JIRA is for creating a MetricsSystem in 
> YarnShuffleService similarly to how ExternalShuffleService already does it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18364) Expose metrics for YarnShuffleService

2019-02-21 Thread Marek Simunek (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774078#comment-16774078
 ] 

Marek Simunek commented on SPARK-18364:
---

[~srowen] Why was it moved from 2.5 to release 3.0?

> Expose metrics for YarnShuffleService
> -
>
> Key: SPARK-18364
> URL: https://issues.apache.org/jira/browse/SPARK-18364
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.1
>Reporter: Steven Rand
>Assignee: Marek Simunek
>Priority: Major
> Fix For: 3.0.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> ExternalShuffleService exposes metrics as of SPARK-16405. However, 
> YarnShuffleService does not.
> The work of instrumenting ExternalShuffleBlockHandler was already done in 
> SPARK-16405, so this JIRA is for creating a MetricsSystem in 
> YarnShuffleService similarly to how ExternalShuffleService already does it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26954) Do not attemp when user code throws exception

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26954:


Assignee: (was: Apache Spark)

> Do not attemp when user code throws exception
> -
>
> Key: SPARK-26954
> URL: https://issues.apache.org/jira/browse/SPARK-26954
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.3, 2.4.0
>Reporter: deshanxiao
>Priority: Critical
>
> Yarn re-attempts a failed app depending on YarnRMClient#unregister. However, 
> some attempts are useless, as in this example:
> {code:java}
> sc.parallelize(Seq(1,2,3)).map(_ => throw new 
> RuntimeException("exception")).collect()
> {code}
> Re-attempting when a "FileNotFoundException" is thrown in user code also looks 
> unreasonable.
> Some environment errors, such as a dead node, are reasonable to re-attempt. So, it would 
> be better not to re-attempt when the failure comes from a user exception.
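For reference, a minimal sketch of how to reproduce and contain the behavior today (it assumes a YARN deployment; spark.yarn.maxAppAttempts merely caps the retries and is not the change this ticket proposes):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: the job always fails in user code, yet YARN may still re-attempt
// the whole application. Capping attempts is a workaround, not the proposed fix.
val spark = SparkSession.builder()
  .appName("fails-in-user-code")
  .config("spark.yarn.maxAppAttempts", "1") // workaround: never re-attempt
  .getOrCreate()

spark.sparkContext
  .parallelize(Seq(1, 2, 3))
  .map(_ => throw new RuntimeException("exception"))
  .collect()
{code}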



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26954) Do not attemp when user code throws exception

2019-02-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773998#comment-16773998
 ] 

Apache Spark commented on SPARK-26954:
--

User 'deshanxiao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23857

> Do not attemp when user code throws exception
> -
>
> Key: SPARK-26954
> URL: https://issues.apache.org/jira/browse/SPARK-26954
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.3, 2.4.0
>Reporter: deshanxiao
>Priority: Critical
>
> Yarn re-attempts a failed app depending on YarnRMClient#unregister. However, 
> some attempts are useless, as in this example:
> {code:java}
> sc.parallelize(Seq(1,2,3)).map(_ => throw new 
> RuntimeException("exception")).collect()
> {code}
> Re-attempting when a "FileNotFoundException" is thrown in user code also looks 
> unreasonable.
> Some environment errors, such as a dead node, are reasonable to re-attempt. So, it would 
> be better not to re-attempt when the failure comes from a user exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26954) Do not attemp when user code throws exception

2019-02-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773997#comment-16773997
 ] 

Apache Spark commented on SPARK-26954:
--

User 'deshanxiao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23857

> Do not attemp when user code throws exception
> -
>
> Key: SPARK-26954
> URL: https://issues.apache.org/jira/browse/SPARK-26954
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.3, 2.4.0
>Reporter: deshanxiao
>Priority: Critical
>
> Yarn re-attempts a failed app depending on YarnRMClient#unregister. However, 
> some attempts are useless, as in this example:
> {code:java}
> sc.parallelize(Seq(1,2,3)).map(_ => throw new 
> RuntimeException("exception")).collect()
> {code}
> Re-attempting when a "FileNotFoundException" is thrown in user code also looks 
> unreasonable.
> Some environment errors, such as a dead node, are reasonable to re-attempt. So, it would 
> be better not to re-attempt when the failure comes from a user exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26954) Do not attemp when user code throws exception

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26954:


Assignee: Apache Spark

> Do not attemp when user code throws exception
> -
>
> Key: SPARK-26954
> URL: https://issues.apache.org/jira/browse/SPARK-26954
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.3, 2.4.0
>Reporter: deshanxiao
>Assignee: Apache Spark
>Priority: Critical
>
> Yarn re-attempts a failed app depending on YarnRMClient#unregister. However, 
> some attempts are useless, as in this example:
> {code:java}
> sc.parallelize(Seq(1,2,3)).map(_ => throw new 
> RuntimeException("exception")).collect()
> {code}
> Re-attempting when a "FileNotFoundException" is thrown in user code also looks 
> unreasonable.
> Some environment errors, such as a dead node, are reasonable to re-attempt. So, it would 
> be better not to re-attempt when the failure comes from a user exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26953) Test TimSort for ArrayIndexOutOfBoundsException

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26953:


Assignee: Apache Spark

> Test TimSort for ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-26953
> URL: https://issues.apache.org/jira/browse/SPARK-26953
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> The paper (https://arxiv.org/pdf/1805.08612.pdf, at the end) shows a case where 
> TimSort can cause an ArrayIndexOutOfBoundsException. In particular, the test in 
> Java is http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java. The test 
> allocates huge arrays of ints, but that seems unnecessary; probably a 
> smaller array of bytes can be used in the test.
> The ticket aims to add a test which checks that Spark's TimSort doesn't cause an 
> ArrayIndexOutOfBoundsException.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26953) Test TimSort for ArrayIndexOutOfBoundsException

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26953:


Assignee: (was: Apache Spark)

> Test TimSort for ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-26953
> URL: https://issues.apache.org/jira/browse/SPARK-26953
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The paper (https://arxiv.org/pdf/1805.08612.pdf, at the end) shows a case where 
> TimSort can cause an ArrayIndexOutOfBoundsException. In particular, the test in 
> Java is http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java. The test 
> allocates huge arrays of ints, but that seems unnecessary; probably a 
> smaller array of bytes can be used in the test.
> The ticket aims to add a test which checks that Spark's TimSort doesn't cause an 
> ArrayIndexOutOfBoundsException.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26954) Do not attemp when user code throws exception

2019-02-21 Thread deshanxiao (JIRA)
deshanxiao created SPARK-26954:
--

 Summary: Do not attemp when user code throws exception
 Key: SPARK-26954
 URL: https://issues.apache.org/jira/browse/SPARK-26954
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.4.0, 2.3.3
Reporter: deshanxiao


Yarn re-attempts a failed app depending on YarnRMClient#unregister. However, some 
attempts are useless:

{code:java}
sc.parallelize(Seq(1,2,3)).map(_ => throw new 
RuntimeException("exception")).collect()
{code}

Some environment errors, such as a dead node, are reasonable to re-attempt.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26954) Do not attemp when user code throws exception

2019-02-21 Thread deshanxiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

deshanxiao updated SPARK-26954:
---
Description: 
Yarn re-attempts a failed app depending on YarnRMClient#unregister. However, some 
attempts are useless, as in this example:

{code:java}
sc.parallelize(Seq(1,2,3)).map(_ => throw new 
RuntimeException("exception")).collect()
{code}

Re-attempting when a "FileNotFoundException" is thrown in user code also looks 
unreasonable.

Some environment errors, such as a dead node, are reasonable to re-attempt. So, it would 
be better not to re-attempt when the failure comes from a user exception.


  was:
Yarn attemps the failed App depending on YarnRMClient#unregister. However, some 
attemps are useless:

{code:java}
sc.parallelize(Seq(1,2,3)).map(_ => throw new 
RuntimeException("exception")).collect()
{code}

Some environment errors, such as node dead, attemps reasonablely. So, it will 
be bettler to at



> Do not attemp when user code throws exception
> -
>
> Key: SPARK-26954
> URL: https://issues.apache.org/jira/browse/SPARK-26954
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.3, 2.4.0
>Reporter: deshanxiao
>Priority: Critical
>
> Yarn re-attempts a failed app depending on YarnRMClient#unregister. However, 
> some attempts are useless, as in this example:
> {code:java}
> sc.parallelize(Seq(1,2,3)).map(_ => throw new 
> RuntimeException("exception")).collect()
> {code}
> Re-attempting when a "FileNotFoundException" is thrown in user code also looks 
> unreasonable.
> Some environment errors, such as a dead node, are reasonable to re-attempt. So, it would 
> be better not to re-attempt when the failure comes from a user exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26918) All .md should have ASF license header

2019-02-21 Thread Mani M (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773978#comment-16773978
 ] 

Mani M commented on SPARK-26918:


How do I update this file in the Spark Git project?

> All .md should have ASF license header
> --
>
> Key: SPARK-26918
> URL: https://issues.apache.org/jira/browse/SPARK-26918
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Priority: Major
>
> Per policy, all .md files should have the header, e.g. 
> [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]
>  or
> [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md]
>  
> Currently this one does not:
> [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26954) Do not attemp when user code throws exception

2019-02-21 Thread deshanxiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

deshanxiao updated SPARK-26954:
---
Description: 
Yarn attemps the failed App depending on YarnRMClient#unregister. However, some 
attemps are useless:

{code:java}
sc.parallelize(Seq(1,2,3)).map(_ => throw new 
RuntimeException("exception")).collect()
{code}

Some environment errors, such as node dead, attemps reasonablely. So, it will 
be bettler to at


  was:
Yarn attemps the failed App depending on YarnRMClient#unregister. However, some 
attemps are useless:

{code:java}
sc.parallelize(Seq(1,2,3)).map(_ => throw new 
RuntimeException("exception")).collect()
{code}

Some environment errors, such as node dead, attemps reasonablely.



> Do not attemp when user code throws exception
> -
>
> Key: SPARK-26954
> URL: https://issues.apache.org/jira/browse/SPARK-26954
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.3, 2.4.0
>Reporter: deshanxiao
>Priority: Critical
>
> Yarn attemps the failed App depending on YarnRMClient#unregister. However, 
> some attemps are useless:
> {code:java}
> sc.parallelize(Seq(1,2,3)).map(_ => throw new 
> RuntimeException("exception")).collect()
> {code}
> Some environment errors, such as node dead, attemps reasonablely. So, it will 
> be bettler to at



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26953) Test TimSort for ArrayIndexOutOfBoundsException

2019-02-21 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26953:
--

 Summary: Test TimSort for ArrayIndexOutOfBoundsException
 Key: SPARK-26953
 URL: https://issues.apache.org/jira/browse/SPARK-26953
 Project: Spark
  Issue Type: Test
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Maxim Gekk


The paper (https://arxiv.org/pdf/1805.08612.pdf, at the end) shows a case where 
TimSort can cause an ArrayIndexOutOfBoundsException. In particular, the test in 
Java is http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java. The test allocates 
huge arrays of ints, but that seems unnecessary; probably a smaller array 
of bytes can be used in the test.

The ticket aims to add a test which checks that Spark's TimSort doesn't cause an 
ArrayIndexOutOfBoundsException.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26869) UDF with struct requires to have _1 and _2 as struct field names

2019-02-21 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-26869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773946#comment-16773946
 ] 

Andrés Doncel Ramírez commented on SPARK-26869:
---

Thanks for your answer [~nimfadora]. For now I'll stick to the workaround I 
found.

> UDF with struct requires to have _1 and _2 as struct field names
> 
>
> Key: SPARK-26869
> URL: https://issues.apache.org/jira/browse/SPARK-26869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
> Environment: Ubuntu 18.04.1 LTS
>Reporter: Andrés Doncel Ramírez
>Priority: Minor
>
> When using a UDF which has a Seq of tuples as input, the struct field names 
> need to match "_1" and "_2". The following code illustrates this:
>  
> {code:java}
> val df = sc.parallelize(Array(
>   ("1",3.0),
>   ("2",4.5),
>   ("5",2.0)
> )
> ).toDF("c1","c2")
> val df1=df.agg(collect_list(struct("c1","c2")).as("c3"))
> // Changing column names to _1 and _2 when creating the struct
> val 
> df2=df.agg(collect_list(struct(col("c1").as("_1"),col("c2").as("_2"))).as("c3"))
> def takeUDF = udf({ (xs: Seq[(String, Double)]) =>
>   xs.take(2)
> })
> df1.printSchema
> df2.printSchema
> df1.withColumn("c4",takeUDF(col("c3"))).show() // this fails
> df2.withColumn("c4",takeUDF(col("c3"))).show() // this works
> {code}
> The first one returns the following exception:
> org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(c3)' due to data 
> type mismatch: argument 1 requires array<struct<_1:string,_2:double>> type, 
> however, '`c3`' is of array<struct<c1:string,c2:double>> type.;;
> While the second works as expected and prints the result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2019-02-21 Thread angerszhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773890#comment-16773890
 ] 

angerszhu commented on SPARK-21918:
---

Come back boy, :([~huLiu]

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>Priority: Major
>
> I'm testing the Spark Thrift Server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should share the Hive object only within the 
> thread so that the metastore client in Hive can be associated with the right 
> user.
> To fix it, we can pass the parent thread's Hive object to the child thread when 
> running the SQL.
> I already have an initial patch for review and I'm glad to work on it if 
> anyone can assign it to me.
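As an illustration of the per-thread sharing argued for above, a small self-contained sketch in plain Scala (HiveHandle is a placeholder type standing in for org.apache.hadoop.hive.ql.metadata.Hive; this is not HiveClientImpl's code):

{code:scala}
// Sketch: cache one handle per thread instead of one global cachedHive.
final class PerThreadCache[T](create: () => T) {
  private val local = new ThreadLocal[T] {
    override def initialValue(): T = create()
  }
  def get(): T = local.get()
}

case class HiveHandle(owner: String)

// Each thread lazily builds and reuses its own handle, so impersonation state
// is never shared across threads.
val hivePerThread = new PerThreadCache[HiveHandle](
  () => HiveHandle(Thread.currentThread().getName))
{code}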



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26930) Tests in ParquetFilterSuite don't verify filter class

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26930:


Assignee: Apache Spark

> Tests in ParquetFilterSuite don't verify filter class
> -
>
> Key: SPARK-26930
> URL: https://issues.apache.org/jira/browse/SPARK-26930
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Nandor Kollar
>Assignee: Apache Spark
>Priority: Minor
>
> While investigating Parquet predicate pushdown test cases, I noticed that 
> several tests seem to be broken: they don't test what they were originally 
> intended to. Most of the verification ends up in one of the overloaded 
> checkFilterPredicate functions, which is supposed to test whether a given filter 
> class is generated with this call: {{maybeFilter.exists(_.getClass === 
> filterClass)}}. But on the one hand an assert is missing here, and on the other 
> hand the filters are more complicated; for example, equality is checked with 
> an 'and' wrapping a not-null check along with an equality check for the given 
> value. The 'exists' call won't help with these compound filters, since 
> they are not collection instances.
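For illustration, a sketch of a recursive check that could replace the plain exists call for such compound filters; it assumes parquet-mr's Operators.And/Or accessors and is not the suite's actual helper:

{code:scala}
import org.apache.parquet.filter2.predicate.{FilterPredicate, Operators}

// Sketch: walk and/or compounds and report whether any node has the expected class.
def containsFilterClass(p: FilterPredicate, expected: Class[_]): Boolean = p match {
  case and: Operators.And => containsFilterClass(and.getLeft, expected) ||
                             containsFilterClass(and.getRight, expected)
  case or: Operators.Or   => containsFilterClass(or.getLeft, expected) ||
                             containsFilterClass(or.getRight, expected)
  case other              => expected.isInstance(other)
}
{code}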



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26930) Tests in ParquetFilterSuite don't verify filter class

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26930:


Assignee: (was: Apache Spark)

> Tests in ParquetFilterSuite don't verify filter class
> -
>
> Key: SPARK-26930
> URL: https://issues.apache.org/jira/browse/SPARK-26930
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Nandor Kollar
>Priority: Minor
>
> While investigating Parquet predicate pushdown test cases, I noticed that 
> several tests seem to be broken: they don't test what they were originally 
> intended to. Most of the verification ends up in one of the overloaded 
> checkFilterPredicate functions, which is supposed to test whether a given filter 
> class is generated with this call: {{maybeFilter.exists(_.getClass === 
> filterClass)}}. But on the one hand an assert is missing here, and on the other 
> hand the filters are more complicated; for example, equality is checked with 
> an 'and' wrapping a not-null check along with an equality check for the given 
> value. The 'exists' call won't help with these compound filters, since 
> they are not collection instances.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21492) Memory leak in SortMergeJoin

2019-02-21 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773861#comment-16773861
 ] 

Wenchen Fan commented on SPARK-21492:
-

[~taoluo] Thanks for the detailed explanation! I kind of agree that this is a 
memory leak, although the memory can be released when the task completes.

The problematic pattern is: an operator consumes only a part of records from 
its child, so that the child can't release the last page which stores the last 
record it outputs. The child has no idea if the last record has been consumed 
by the parent or not, so it's not safe to release the last page, as doing so 
would make the last record corrupted. SMJ and limit are 2 places that I can 
think of that have this pattern.

So we need a mechanism to allow the parent to tell the child that it can 
release all the resources. Closable iterator is an option here, but we should 
not hack it in SMJ only.
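A minimal sketch of the closable-iterator idea mentioned above (illustration only, not Spark's API):

{code:scala}
// Sketch: the parent signals the child that no more rows will be consumed, so the
// child can free its last page early instead of holding it until task completion.
trait ClosableIterator[T] extends Iterator[T] {
  def close(): Unit // expected to be idempotent
}

def limited[T](child: ClosableIterator[T], n: Int): Iterator[T] = new Iterator[T] {
  private var consumed = 0
  override def hasNext: Boolean = {
    val more = consumed < n && child.hasNext
    if (!more) child.close() // parent is done; child may release its resources now
    more
  }
  override def next(): T = { consumed += 1; child.next() }
}
{code}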

> Memory leak in SortMergeJoin
> 
>
> Key: SPARK-21492
> URL: https://issues.apache.org/jira/browse/SPARK-21492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0, 2.3.1, 3.0.0
>Reporter: Zhan Zhang
>Priority: Major
>
> In SortMergeJoin, if the iterator is not exhausted, there will be memory leak 
> caused by the Sort. The memory is not released until the task end, and cannot 
> be used by other operators causing performance drop or OOM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22000) org.codehaus.commons.compiler.CompileException: toString method is not declared

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22000:


Assignee: Apache Spark

> org.codehaus.commons.compiler.CompileException: toString method is not 
> declared
> ---
>
> Key: SPARK-22000
> URL: https://issues.apache.org/jira/browse/SPARK-22000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>Assignee: Apache Spark
>Priority: Major
> Attachments: testcase.zip
>
>
> The error message says that toString is not declared on "value13", which is of 
> primitive "long" type in the generated code.
> I think value13 should be of boxed Long type.
> == error message
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 70, Column 32: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 70, Column 32: A method named "toString" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> /* 033 */   private void apply1_2(InternalRow i) {
> /* 034 */
> /* 035 */
> /* 036 */ boolean isNull11 = i.isNullAt(1);
> /* 037 */ UTF8String value11 = isNull11 ? null : (i.getUTF8String(1));
> /* 038 */ boolean isNull10 = true;
> /* 039 */ java.lang.String value10 = null;
> /* 040 */ if (!isNull11) {
> /* 041 */
> /* 042 */   isNull10 = false;
> /* 043 */   if (!isNull10) {
> /* 044 */
> /* 045 */ Object funcResult4 = null;
> /* 046 */ funcResult4 = value11.toString();
> /* 047 */
> /* 048 */ if (funcResult4 != null) {
> /* 049 */   value10 = (java.lang.String) funcResult4;
> /* 050 */ } else {
> /* 051 */   isNull10 = true;
> /* 052 */ }
> /* 053 */
> /* 054 */
> /* 055 */   }
> /* 056 */ }
> /* 057 */ javaBean.setApp(value10);
> /* 058 */
> /* 059 */
> /* 060 */ boolean isNull13 = i.isNullAt(12);
> /* 061 */ long value13 = isNull13 ? -1L : (i.getLong(12));
> /* 062 */ boolean isNull12 = true;
> /* 063 */ java.lang.String value12 = null;
> /* 064 */ if (!isNull13) {
> /* 065 */
> /* 066 */   isNull12 = false;
> /* 067 */   if (!isNull12) {
> /* 068 */
> /* 069 */ Object funcResult5 = null;
> /* 070 */ funcResult5 = value13.toString();
> /* 071 */
> /* 072 */ if (funcResult5 != null) {
> /* 073 */   value12 = (java.lang.String) funcResult5;
> /* 074 */ } else {
> /* 075 */   isNull12 = true;
> /* 076 */ }
> /* 077 */
> /* 078 */
> /* 079 */   }
> /* 080 */ }
> /* 081 */ javaBean.setReasonCode(value12);
> /* 082 */
> /* 083 */   }



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22000) org.codehaus.commons.compiler.CompileException: toString method is not declared

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22000:


Assignee: (was: Apache Spark)

> org.codehaus.commons.compiler.CompileException: toString method is not 
> declared
> ---
>
> Key: SPARK-22000
> URL: https://issues.apache.org/jira/browse/SPARK-22000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>Priority: Major
> Attachments: testcase.zip
>
>
> The error message says that toString is not declared on "value13", which is of 
> primitive "long" type in the generated code.
> I think value13 should be of boxed Long type.
> == error message
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 70, Column 32: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 70, Column 32: A method named "toString" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> /* 033 */   private void apply1_2(InternalRow i) {
> /* 034 */
> /* 035 */
> /* 036 */ boolean isNull11 = i.isNullAt(1);
> /* 037 */ UTF8String value11 = isNull11 ? null : (i.getUTF8String(1));
> /* 038 */ boolean isNull10 = true;
> /* 039 */ java.lang.String value10 = null;
> /* 040 */ if (!isNull11) {
> /* 041 */
> /* 042 */   isNull10 = false;
> /* 043 */   if (!isNull10) {
> /* 044 */
> /* 045 */ Object funcResult4 = null;
> /* 046 */ funcResult4 = value11.toString();
> /* 047 */
> /* 048 */ if (funcResult4 != null) {
> /* 049 */   value10 = (java.lang.String) funcResult4;
> /* 050 */ } else {
> /* 051 */   isNull10 = true;
> /* 052 */ }
> /* 053 */
> /* 054 */
> /* 055 */   }
> /* 056 */ }
> /* 057 */ javaBean.setApp(value10);
> /* 058 */
> /* 059 */
> /* 060 */ boolean isNull13 = i.isNullAt(12);
> /* 061 */ long value13 = isNull13 ? -1L : (i.getLong(12));
> /* 062 */ boolean isNull12 = true;
> /* 063 */ java.lang.String value12 = null;
> /* 064 */ if (!isNull13) {
> /* 065 */
> /* 066 */   isNull12 = false;
> /* 067 */   if (!isNull12) {
> /* 068 */
> /* 069 */ Object funcResult5 = null;
> /* 070 */ funcResult5 = value13.toString();
> /* 071 */
> /* 072 */ if (funcResult5 != null) {
> /* 073 */   value12 = (java.lang.String) funcResult5;
> /* 074 */ } else {
> /* 075 */   isNull12 = true;
> /* 076 */ }
> /* 077 */
> /* 078 */
> /* 079 */   }
> /* 080 */ }
> /* 081 */ javaBean.setReasonCode(value12);
> /* 082 */
> /* 083 */   }



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22000) org.codehaus.commons.compiler.CompileException: toString method is not declared

2019-02-21 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773855#comment-16773855
 ] 

Jungtaek Lim commented on SPARK-22000:
--

Thanks [~xsergey] I can easily reproduce it in master branch and have a fix. 
Will raise a PR shortly.
Credit to [~srowen] given I'm leveraging String.valueOf(). Thanks!

> org.codehaus.commons.compiler.CompileException: toString method is not 
> declared
> ---
>
> Key: SPARK-22000
> URL: https://issues.apache.org/jira/browse/SPARK-22000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>Priority: Major
> Attachments: testcase.zip
>
>
> The error message says that toString is not declared on "value13", which is of 
> primitive "long" type in the generated code.
> I think value13 should be of boxed Long type.
> == error message
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 70, Column 32: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 70, Column 32: A method named "toString" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> /* 033 */   private void apply1_2(InternalRow i) {
> /* 034 */
> /* 035 */
> /* 036 */ boolean isNull11 = i.isNullAt(1);
> /* 037 */ UTF8String value11 = isNull11 ? null : (i.getUTF8String(1));
> /* 038 */ boolean isNull10 = true;
> /* 039 */ java.lang.String value10 = null;
> /* 040 */ if (!isNull11) {
> /* 041 */
> /* 042 */   isNull10 = false;
> /* 043 */   if (!isNull10) {
> /* 044 */
> /* 045 */ Object funcResult4 = null;
> /* 046 */ funcResult4 = value11.toString();
> /* 047 */
> /* 048 */ if (funcResult4 != null) {
> /* 049 */   value10 = (java.lang.String) funcResult4;
> /* 050 */ } else {
> /* 051 */   isNull10 = true;
> /* 052 */ }
> /* 053 */
> /* 054 */
> /* 055 */   }
> /* 056 */ }
> /* 057 */ javaBean.setApp(value10);
> /* 058 */
> /* 059 */
> /* 060 */ boolean isNull13 = i.isNullAt(12);
> /* 061 */ long value13 = isNull13 ? -1L : (i.getLong(12));
> /* 062 */ boolean isNull12 = true;
> /* 063 */ java.lang.String value12 = null;
> /* 064 */ if (!isNull13) {
> /* 065 */
> /* 066 */   isNull12 = false;
> /* 067 */   if (!isNull12) {
> /* 068 */
> /* 069 */ Object funcResult5 = null;
> /* 070 */ funcResult5 = value13.toString();
> /* 071 */
> /* 072 */ if (funcResult5 != null) {
> /* 073 */   value12 = (java.lang.String) funcResult5;
> /* 074 */ } else {
> /* 075 */   isNull12 = true;
> /* 076 */ }
> /* 077 */
> /* 078 */
> /* 079 */   }
> /* 080 */ }
> /* 081 */ javaBean.setReasonCode(value12);
> /* 082 */
> /* 083 */   }



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26952) Row count statics should respect the data reported by data source

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26952:


Assignee: Apache Spark

> Row count statics should respect the data reported by data source
> -
>
> Key: SPARK-26952
> URL: https://issues.apache.org/jira/browse/SPARK-26952
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xianyang Liu
>Assignee: Apache Spark
>Priority: Major
>
> In data source v2, if the data source scan implements 
> `SupportsReportStatistics`, `DataSourceV2Relation` should respect the row 
> count reported by the data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26952) Row count statics should respect the data reported by data source

2019-02-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26952:


Assignee: (was: Apache Spark)

> Row count statics should respect the data reported by data source
> -
>
> Key: SPARK-26952
> URL: https://issues.apache.org/jira/browse/SPARK-26952
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xianyang Liu
>Priority: Major
>
> In data source v2, if the data source scan implements 
> `SupportsReportStatistics`, `DataSourceV2Relation` should respect the row 
> count reported by the data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26952) Row count statics should respect the data reported by data source

2019-02-21 Thread Xianyang Liu (JIRA)
Xianyang Liu created SPARK-26952:


 Summary: Row count statics should respect the data reported by 
data source
 Key: SPARK-26952
 URL: https://issues.apache.org/jira/browse/SPARK-26952
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Xianyang Liu


In data source v2, if the data source scan implements 
`SupportsReportStatistics`, `DataSourceV2Relation` should respect the row count 
reported by the data source.
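For illustration, one way to observe the estimated statistics from a spark-shell session (a sketch; com.example.MyV2Source is a hypothetical v2 source, and the accessors are the catalyst Statistics fields):

{code:scala}
// Sketch: if the underlying v2 reader reports numRows via SupportsReportStatistics,
// the relation's estimated statistics should carry that row count through.
val df = spark.read.format("com.example.MyV2Source").load() // hypothetical v2 source
val stats = df.queryExecution.optimizedPlan.stats
println(s"sizeInBytes=${stats.sizeInBytes}, rowCount=${stats.rowCount}")
{code}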



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


