[jira] [Resolved] (SPARK-24997) Support MINUS ALL

2018-08-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24997.
-
   Resolution: Fixed
 Assignee: Dilip Biswal
Fix Version/s: 2.4.0

> Support MINUS ALL
> -
>
> Key: SPARK-24997
> URL: https://issues.apache.org/jira/browse/SPARK-24997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 2.4.0
>
>
> MINUS is a synonym for EXCEPT. We have added support for EXCEPT ALL, so we need 
> to enable support for MINUS ALL as well.
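For illustration, a small spark-shell sketch (assuming the built-in {{spark}} session and {{spark.implicits._}}; table names and data are made up). {{EXCEPT ALL}} already works in 2.4.0, and {{MINUS ALL}} is expected to behave identically once enabled:

{code:scala}
import spark.implicits._

Seq(1, 1, 2, 3).toDF("c1").createOrReplaceTempView("t1")
Seq(1, 2).toDF("c1").createOrReplaceTempView("t2")

// Already supported: keeps duplicates, removing one match per row found in t2.
spark.sql("SELECT c1 FROM t1 EXCEPT ALL SELECT c1 FROM t2").show()   // rows: 1, 3

// Requested here: the same semantics under the MINUS spelling (fails before this change).
spark.sql("SELECT c1 FROM t1 MINUS ALL SELECT c1 FROM t2").show()
{code}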



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24788) RelationalGroupedDataset.toString throws errors when grouping by UnresolvedAttribute

2018-08-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24788.
-
   Resolution: Fixed
 Assignee: Chris Horn
Fix Version/s: 2.4.0

> RelationalGroupedDataset.toString throws errors when grouping by 
> UnresolvedAttribute
> 
>
> Key: SPARK-24788
> URL: https://issues.apache.org/jira/browse/SPARK-24788
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chris Horn
>Assignee: Chris Horn
>Priority: Minor
> Fix For: 2.4.0
>
>
> This causes references to a RelationalGroupedDataset to break in the shell 
> because of the toString call:
> {code:java}
> scala> spark.range(0, 10).groupBy("id")
> res4: org.apache.spark.sql.RelationalGroupedDataset = 
> RelationalGroupedDataset: [grouping expressions: [id: bigint], value: [id: 
> bigint], type: GroupBy]
> scala> spark.range(0, 10).groupBy('id)
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: 'id
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
>   at 
> org.apache.spark.sql.RelationalGroupedDataset$$anonfun$12.apply(RelationalGroupedDataset.scala:474)
>   at 
> org.apache.spark.sql.RelationalGroupedDataset$$anonfun$12.apply(RelationalGroupedDataset.scala:473)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.RelationalGroupedDataset.toString(RelationalGroupedDataset.scala:473)
>   at 
> scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:332)
>   at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:337)
>   at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:345)
> {code}
>  
> I will create a PR.
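As a hedged, self-contained illustration of the failure mode (not the eventual fix): calling {{dataType}} on an {{UnresolvedAttribute}} throws, so any toString path that calls it unconditionally breaks in the shell:

{code:scala}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

val attr = UnresolvedAttribute("id")
println(attr.resolved)          // false: 'id has not been resolved against a schema
// attr.dataType                // would throw UnresolvedException, as in the trace above

// A guard of this shape avoids the call for unresolved expressions:
println(if (attr.resolved) attr.dataType.simpleString else attr.toString)   // prints 'id
{code}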



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-08-02 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24948:

Priority: Blocker  (was: Major)

> SHS filters wrongly some applications due to permission check
> -
>
> Key: SPARK-24948
> URL: https://issues.apache.org/jira/browse/SPARK-24948
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Priority: Blocker
>
> SHS filters out the event logs that it doesn't have permission to read. 
> Unfortunately, this check is quite naive, as it takes into account only the 
> base permissions (i.e. user, group, and other permissions). For instance, if 
> ACLs are enabled, they are ignored in this check; moreover, each filesystem may 
> have different policies (e.g. it may consider spark a superuser who can 
> access everything).
> This results in some applications not being displayed in the SHS, even though 
> the Spark user (or whatever user the SHS is started with) can actually read 
> their event logs.
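For illustration only, a hedged sketch of the kind of naive owner/group/other check described above (the helper below is hypothetical, not the actual SHS code), showing what such a check cannot see:

{code:scala}
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.fs.permission.FsAction

def naiveCanRead(status: FileStatus, user: String, groups: Set[String]): Boolean = {
  // Only the POSIX-style owner/group/other bits are consulted; ACL entries and
  // filesystem-specific rules (e.g. a superuser) are not, so a log the SHS user
  // can in fact read may still be filtered out.
  val perm = status.getPermission
  if (status.getOwner == user) perm.getUserAction.implies(FsAction.READ)
  else if (groups.contains(status.getGroup)) perm.getGroupAction.implies(FsAction.READ)
  else perm.getOtherAction.implies(FsAction.READ)
}
{code}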



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25002) Avro: revise the output record namespace

2018-08-02 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25002.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21974
[https://github.com/apache/spark/pull/21974]

> Avro: revise the output record namespace
> 
>
> Key: SPARK-25002
> URL: https://issues.apache.org/jira/browse/SPARK-25002
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently the output namespace starts with ".".
> Although this is valid according to the Avro spec, we should remove the leading 
> dot to avoid failures when the output file is read by other libraries:
> https://github.com/linkedin/goavro/pull/96
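A minimal sketch of the intended behaviour (the helper name is illustrative, not the actual Spark code): only prepend the parent namespace when there is one, so the top-level record does not get a leading dot:

{code:scala}
def childNamespace(parentNamespace: String, recordName: String): String =
  if (parentNamespace.isEmpty) recordName else s"$parentNamespace.$recordName"

childNamespace("", "topLevelRecord")          // "topLevelRecord", not ".topLevelRecord"
childNamespace("topLevelRecord", "address")   // "topLevelRecord.address"
{code}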



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25002) Avro: revise the output record namespace

2018-08-02 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25002:
---

Assignee: Gengliang Wang

> Avro: revise the output record namespace
> 
>
> Key: SPARK-25002
> URL: https://issues.apache.org/jira/browse/SPARK-25002
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently the output namespace starts with ".".
> Although this is valid according to the Avro spec, we should remove the leading 
> dot to avoid failures when the output file is read by other libraries:
> https://github.com/linkedin/goavro/pull/96



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-08-02 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24948:

Target Version/s: 2.2.3, 2.3.2, 2.4.0

> SHS filters wrongly some applications due to permission check
> -
>
> Key: SPARK-24948
> URL: https://issues.apache.org/jira/browse/SPARK-24948
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Priority: Major
>
> SHS filters out the event logs that it doesn't have permission to read. 
> Unfortunately, this check is quite naive, as it takes into account only the 
> base permissions (i.e. user, group, and other permissions). For instance, if 
> ACLs are enabled, they are ignored in this check; moreover, each filesystem may 
> have different policies (e.g. it may consider spark a superuser who can 
> access everything).
> This results in some applications not being displayed in the SHS, even though 
> the Spark user (or whatever user the SHS is started with) can actually read 
> their event logs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23911) High-order function: reduce(array, initialState S, inputFunction, outputFunction) → R

2018-08-02 Thread Takuya Ueshin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567791#comment-16567791
 ] 

Takuya Ueshin commented on SPARK-23911:
---

[~smilegator] [~hvanhovell] I'd use {{aggregate}} instead of {{reduce}} for 
this function name because {{aggregate}} better fits the functionality of the 
function. WDYT?

> High-order function: reduce(array, initialState S, inputFunction, 
> outputFunction) → R
> ---
>
> Key: SPARK-23911
> URL: https://issues.apache.org/jira/browse/SPARK-23911
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Herman van Hovell
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns a single value reduced from array. inputFunction will be invoked for 
> each element in array in order. In addition to taking the element, 
> inputFunction takes the current state, initially initialState, and returns 
> the new state. outputFunction will be invoked to turn the final state into 
> the result value. It may be the identity function (i -> i).
> {noformat}
> SELECT reduce(ARRAY [], 0, (s, x) -> s + x, s -> s); -- 0
> SELECT reduce(ARRAY [5, 20, 50], 0, (s, x) -> s + x, s -> s); -- 75
> SELECT reduce(ARRAY [5, 20, NULL, 50], 0, (s, x) -> s + x, s -> s); -- NULL
> SELECT reduce(ARRAY [5, 20, NULL, 50], 0, (s, x) -> s + COALESCE(x, 0), s -> 
> s); -- 75
> SELECT reduce(ARRAY [5, 20, NULL, 50], 0, (s, x) -> IF(x IS NULL, s, s + x), 
> s -> s); -- 75
> SELECT reduce(ARRAY [2147483647, 1], CAST (0 AS BIGINT), (s, x) -> s + x, s 
> -> s); -- 2147483648
> SELECT reduce(ARRAY [5, 6, 10, 20], -- calculates arithmetic average: 10.25
>   CAST(ROW(0.0, 0) AS ROW(sum DOUBLE, count INTEGER)),
>   (s, x) -> CAST(ROW(x + s.sum, s.count + 1) AS ROW(sum DOUBLE, 
> count INTEGER)),
>   s -> IF(s.count = 0, NULL, s.sum / s.count));
> {noformat}
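For reference, a hedged spark-shell sketch of how the summation example above maps onto Spark's SQL lambda syntax, assuming the function is exposed under the name {{aggregate}} as suggested in the comment (availability depends on what ships in 2.4):

{code:scala}
spark.sql("SELECT aggregate(array(5, 20, 50), 0, (s, x) -> s + x) AS total").show()
// +-----+
// |total|
// +-----+
// |   75|
// +-----+
{code}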



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25011) Add PrefixSpan to __all__

2018-08-02 Thread yuhao yang (JIRA)
yuhao yang created SPARK-25011:
--

 Summary: Add PrefixSpan to __all__
 Key: SPARK-25011
 URL: https://issues.apache.org/jira/browse/SPARK-25011
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.4.0
Reporter: yuhao yang


Add PrefixSpan to __all__ in fpm.py



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24966) Fix the precedence rule for set operations.

2018-08-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24966.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Fix the precedence rule for set operations.
> ---
>
> Key: SPARK-24966
> URL: https://issues.apache.org/jira/browse/SPARK-24966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently the set operations INTERSECT, UNION and EXCEPT are assigned the 
> same precedence. We need to change this so that INTERSECT is given higher 
> precedence than UNION and EXCEPT, while UNION and EXCEPT are evaluated in the 
> order they appear in the query, from left to right.
> Since this will result in a change in behavior, we need to keep it under a 
> config.
> Here is a reference:
> https://docs.microsoft.com/en-us/sql/t-sql/language-elements/set-operators-except-and-intersect-transact-sql?view=sql-server-2017
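For illustration, a hedged spark-shell sketch of the intended behaviour change (table names and data are made up; the exact config name is not specified here):

{code:scala}
import spark.implicits._

Seq(1, 2).toDF("c").createOrReplaceTempView("t1")
Seq(2, 3).toDF("c").createOrReplaceTempView("t2")
Seq(3, 4).toDF("c").createOrReplaceTempView("t3")

spark.sql("""
  SELECT c FROM t1
  UNION
  SELECT c FROM t2
  INTERSECT
  SELECT c FROM t3
""").show()

// SQL-standard precedence: t1 UNION (t2 INTERSECT t3) = {1, 2, 3}
// Equal, left-to-right precedence: (t1 UNION t2) INTERSECT t3 = {3}
{code}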



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24966) Fix the precedence rule for set operations.

2018-08-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-24966:
---

Assignee: Dilip Biswal

> Fix the precedence rule for set operations.
> ---
>
> Key: SPARK-24966
> URL: https://issues.apache.org/jira/browse/SPARK-24966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently the set operations INTERSECT, UNION and EXCEPT are assigned the 
> same precedence. We need to change this so that INTERSECT is given higher 
> precedence than UNION and EXCEPT, while UNION and EXCEPT are evaluated in the 
> order they appear in the query, from left to right.
> Since this will result in a change in behavior, we need to keep it under a 
> config.
> Here is a reference:
> https://docs.microsoft.com/en-us/sql/t-sql/language-elements/set-operators-except-and-intersect-transact-sql?view=sql-server-2017



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-02 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567708#comment-16567708
 ] 

Hyukjin Kwon commented on SPARK-24924:
--

A similar discussion happened in SPARK-20590 when we ported CSV. In my experience, 
users really don't know whether {{com.databricks.spark.avro}} or {{avro}} means the 
external Avro jar or the internal one (the same thing happened with CSV - 
I was active in the Databricks Spark CSV package, FWIW).

If users were using the external Avro package, they will likely hit the error when 
they upgrade Spark directly. Otherwise, they will see in the release notes that the 
Avro package is included in 2.4.0, and they will not provide the external jar.
If they miss the release notes and then explicitly provide the third-party jar, they 
will now get a message like:

{code}
17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv 
(org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
com.databricks.spark.csv.DefaultSource15), defaulting to the internal 
datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
{code}

Encouraging users to rely on the built-in one seems preferable, since the behaviour 
will stay as consistent as possible for now.
Otherwise, if the external Avro package must be used, I think it can in theory still 
be used by specifying the source with its fully qualified name.
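For illustration, a hedged spark-shell sketch of that last point (the path is made up; the fully qualified names below are only examples and resolve only if the corresponding jar is on the classpath):

{code:scala}
// Short name: resolves to whichever source wins the lookup
// (the built-in one once the mapping is in place).
val events = spark.read.format("avro").load("/tmp/events.avro")

// Fully qualified names pin a specific implementation explicitly:
val builtin  = spark.read.format("org.apache.spark.sql.avro.AvroFileFormat").load("/tmp/events.avro")
val external = spark.read.format("com.databricks.spark.avro.DefaultSource").load("/tmp/events.avro")
{code}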

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims at the following:
>  # Like the `com.databricks.spark.csv` mapping, we had better map 
> `com.databricks.spark.avro` to the built-in Avro data source.
>  # Remove the incorrect error message, `Please find an Avro package at ...`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24977) input_file_name() result can't save and use for partitionBy()

2018-08-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24977.
---
Resolution: Not A Problem

This isn't nearly sufficient detail for a JIRA, and there is no evidence that it is 
actually a problem.

> input_file_name() result can't save and use for partitionBy()
> -
>
> Key: SPARK-24977
> URL: https://issues.apache.org/jira/browse/SPARK-24977
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.3.1
>Reporter: Srinivasarao Padala
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24977) input_file_name() result can't save and use for partitionBy()

2018-08-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-24977:
--
Target Version/s:   (was: 2.2.1)

> input_file_name() result can't save and use for partitionBy()
> -
>
> Key: SPARK-24977
> URL: https://issues.apache.org/jira/browse/SPARK-24977
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.3.1
>Reporter: Srinivasarao Padala
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24910) Spark Bloom Filter Closure Serialization improvement for very high volume of Data

2018-08-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-24910:
--
Shepherd:   (was: Sean Owen)
   Flags:   (was: Patch)
  Labels:   (was: bloom-filter)
Priority: Minor  (was: Major)

> Spark Bloom Filter Closure Serialization improvement for very high volume of 
> Data
> -
>
> Key: SPARK-24910
> URL: https://issues.apache.org/jira/browse/SPARK-24910
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.3.1
>Reporter: Himangshu Ranjan Borah
>Priority: Minor
>
> I am proposing an improvement to the Bloom filter generation logic used in 
> DataFrameStatFunctions' Bloom filter API: use mapPartitions() instead of 
> aggregate() to avoid closure serialization, which fails for huge BitArrays.
> Spark's stat functions' Bloom filter implementation uses aggregate/treeAggregate 
> operations, which use a closure with a dependency on the Bloom filter that is 
> created in the driver. Since Spark hard-codes the closure serializer to the Java 
> serializer, it fails in closure cleanup for very large Bloom filters (typically 
> with num items ~ billions and fpp ~ 0.001). The Kryo serializer works fine at 
> such a scale, but it seems there were some issues using Kryo for closure 
> serialization, due to which Spark 2.0 hard-coded it to Java. The call stack we 
> get typically looks like:
> java.lang.OutOfMemoryError
>   at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>   at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>   at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
>   at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>   at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>   at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
>   at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>   at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
>   at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2124)
>   at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1092)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>   at org.apache.spark.rdd.RDD.fold(RDD.scala:1086)
>   at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1155)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>   at
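A minimal spark-shell sketch of the proposed direction (illustrative only, not the actual patch; the sizes here are tiny toy values): build one filter per partition with mapPartitions() and merge the partial filters, so no large driver-side filter is captured in a task closure:

{code:scala}
import org.apache.spark.util.sketch.BloomFilter

val expectedItems = 1000000L   // illustrative sizing, not billions
val fpp = 0.001

val ids = spark.range(0L, expectedItems).rdd.map(_.toString)

val merged = ids.mapPartitions { iter =>
  val bf = BloomFilter.create(expectedItems, fpp)   // built on the executor, not shipped from the driver
  iter.foreach(bf.putString)
  Iterator(bf)
}.reduce(_ mergeInPlace _)                          // merge the per-partition filters

println(merged.mightContainString("42"))            // true (plus the usual false-positive rate)
{code}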

[jira] [Commented] (SPARK-10413) ML models should support prediction on single instances

2018-08-02 Thread zhengruifeng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567683#comment-16567683
 ] 

zhengruifeng commented on SPARK-10413:
--

Is there a plan to expose predict in the clustering algorithms? [~mengxr]

I just encountered a case in which this feature is needed.

> ML models should support prediction on single instances
> ---
>
> Key: SPARK-10413
> URL: https://issues.apache.org/jira/browse/SPARK-10413
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> Currently, models in the pipeline API only implement transform(DataFrame). It 
> would be quite useful to support prediction on single instances.
> UPDATE: This issue is for making predictions with single models.  We can make 
> methods like {{def predict(features: Vector): Double}} public.
> * This issue is *not* for single-instance prediction with full Pipelines, 
> which would require making predictions on {{Row}}s.
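As a hedged sketch of what the requested API could look like for a single model (the data is made up, and the final call assumes {{predict(features: Vector)}} has been made public, which is exactly what this umbrella proposes):

{code:scala}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (1.0, Vectors.dense(0.0, 1.3, 1.0))
)).toDF("label", "features")

val model = new LogisticRegression().fit(training)

// Works today: batch prediction through transform(DataFrame).
model.transform(training).select("features", "prediction").show()

// Proposed here: direct prediction on a single feature vector.
val single: Double = model.predict(Vectors.dense(0.0, 1.2, 0.5))
{code}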



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25009) Standalone Cluster mode application submit is not working

2018-08-02 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567677#comment-16567677
 ] 

Imran Rashid commented on SPARK-25009:
--

[~devaraj.k] I don't think SPARK-22941 is in 2.3.1, so I changed the Affects 
Versions; please let me know if I'm mistaken.

> Standalone Cluster mode application submit is not working
> -
>
> Key: SPARK-25009
> URL: https://issues.apache.org/jira/browse/SPARK-25009
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Devaraj K
>Priority: Critical
>
> It does not show any error while submitting, but the app is not running and 
> does not show up in the web UI.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25009) Standalone Cluster mode application submit is not working

2018-08-02 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-25009:
-
Affects Version/s: (was: 2.3.1)
   2.4.0

> Standalone Cluster mode application submit is not working
> -
>
> Key: SPARK-25009
> URL: https://issues.apache.org/jira/browse/SPARK-25009
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Devaraj K
>Priority: Critical
>
> It does not show any error while submitting, but the app is not running and 
> does not show up in the web UI.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25009) Standalone Cluster mode application submit is not working

2018-08-02 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-25009:
-
Priority: Critical  (was: Blocker)

> Standalone Cluster mode application submit is not working
> -
>
> Key: SPARK-25009
> URL: https://issues.apache.org/jira/browse/SPARK-25009
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Devaraj K
>Priority: Critical
>
> It does not show any error while submitting, but the app is not running and 
> does not show up in the web UI.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25009) Standalone Cluster mode application submit is not working

2018-08-02 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-25009:
-
Priority: Blocker  (was: Major)

> Standalone Cluster mode application submit is not working
> -
>
> Key: SPARK-25009
> URL: https://issues.apache.org/jira/browse/SPARK-25009
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Devaraj K
>Priority: Blocker
>
> It does not show any error while submitting, but the app is not running and 
> does not show up in the web UI.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24945) Switch to uniVocity >= 2.7.2

2018-08-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24945:


Assignee: Maxim Gekk

> Switch to uniVocity >= 2.7.2
> 
>
> Key: SPARK-24945
> URL: https://issues.apache.org/jira/browse/SPARK-24945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> The recent version 2.7.2 of the uniVocity parser includes the fix 
> https://github.com/uniVocity/univocity-parsers/issues/250 . The recent 
> version also has better performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24945) Switch to uniVocity >= 2.7.2

2018-08-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24945.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21969
[https://github.com/apache/spark/pull/21969]

> Switch to uniVocity >= 2.7.2
> 
>
> Key: SPARK-24945
> URL: https://issues.apache.org/jira/browse/SPARK-24945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> The recent version 2.7.2 of the uniVocity parser includes the fix 
> https://github.com/uniVocity/univocity-parsers/issues/250 . The recent 
> version also has better performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24773) support reading AVRO logical types - Timestamp with different precisions

2018-08-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24773:


Assignee: Gengliang Wang

> support reading AVRO logical types - Timestamp with different precisions
> 
>
> Key: SPARK-24773
> URL: https://issues.apache.org/jira/browse/SPARK-24773
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24773) support reading AVRO logical types - Timestamp with different precisions

2018-08-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24773.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21935
[https://github.com/apache/spark/pull/21935]

> support reading AVRO logical types - Timestamp with different precisions
> 
>
> Key: SPARK-24773
> URL: https://issues.apache.org/jira/browse/SPARK-24773
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25010) Rand/Randn should produce different values for each execution in streaming query

2018-08-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567592#comment-16567592
 ] 

Apache Spark commented on SPARK-25010:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/21980

> Rand/Randn should produce different values for each execution in streaming 
> query
> 
>
> Key: SPARK-25010
> URL: https://issues.apache.org/jira/browse/SPARK-25010
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> Like Uuid in SPARK-24896, the Rand and Randn expressions currently produce the 
> same results in each execution of a streaming query. That doesn't make much 
> sense for streaming queries; we should make them produce different results, as 
> was done for Uuid.
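A hedged spark-shell illustration of the behaviour being described (the rate source and console sink are just convenient stand-ins):

{code:scala}
import org.apache.spark.sql.functions.rand

val stream = spark.readStream.format("rate").load()
  .withColumn("r", rand())   // seed fixed at analysis time, so each execution
                             // of the streaming query reuses it and repeats values

stream.writeStream.format("console").start()
{code}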



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25010) Rand/Randn should produce different values for each execution in streaming query

2018-08-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25010:


Assignee: (was: Apache Spark)

> Rand/Randn should produce different values for each execution in streaming 
> query
> 
>
> Key: SPARK-25010
> URL: https://issues.apache.org/jira/browse/SPARK-25010
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> Like Uuid in SPARK-24896, the Rand and Randn expressions currently produce the 
> same results in each execution of a streaming query. That doesn't make much 
> sense for streaming queries; we should make them produce different results, as 
> was done for Uuid.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25010) Rand/Randn should produce different values for each execution in streaming query

2018-08-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25010:


Assignee: Apache Spark

> Rand/Randn should produce different values for each execution in streaming 
> query
> 
>
> Key: SPARK-25010
> URL: https://issues.apache.org/jira/browse/SPARK-25010
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> Like Uuid in SPARK-24896, the Rand and Randn expressions currently produce the 
> same results in each execution of a streaming query. That doesn't make much 
> sense for streaming queries; we should make them produce different results, as 
> was done for Uuid.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server

2018-08-02 Thread Ye Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Zhou resolved SPARK-21961.
-
Resolution: Won't Fix

> Filter out BlockStatuses Accumulators during replaying history logs in Spark 
> History Server
> ---
>
> Key: SPARK-21961
> URL: https://issues.apache.org/jira/browse/SPARK-21961
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Ye Zhou
>Priority: Major
> Attachments: Objects_Count_in_Heap.png, One_Thread_Took_24GB.png
>
>
> As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
> memory in the driver. Recently we also noticed the same issue in the Spark 
> History Server. SPARK-20084 removes those events from the history logs, but 
> multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in our 
> production cluster, and none of them have these two patches included.
> In this case, those events will still show up in the logs and the Spark 
> History Server will replay them. The Spark History Server continuously gets 
> severe full GCs even though we tried to limit the cache size as well as enlarge 
> the heap size to 40GB. We also tried different GC tuning parameters, like 
> using CMS or G1GC. None of them works.
> We made a heap dump and found that the top memory-consuming object type is 
> BlockStatus. There was even one thread that took 23GB of heap while 
> replaying one log file.
> Since the former two tickets resolved the related issues both in the driver and 
> when writing history logs, we should also consider adding this filter to the 
> Spark History Server in order to decrease the memory consumption of replaying 
> one history log. For use cases like ours, where multiple older versions of 
> Spark are deployed, this filter should be pretty useful.
> We have deployed our Spark History Server with this filter, and it works fine 
> in our production cluster: it has processed thousands of logs and only had 
> a few full GCs in total.
> !https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png!
> !https://issues.apache.org/jira/secure/attachment/12886190/One_Thread_Took_24GB.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24928) spark sql cross join running time too long

2018-08-02 Thread Matthew Normyle (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567578#comment-16567578
 ] 

Matthew Normyle commented on SPARK-24928:
-

val largeRDD = sc.parallelize(Seq.fill(1000)(Random.nextInt))
val smallRDD = sc.parallelize(Seq.fill(1)(Random.nextInt))

*(1)* largeRDD.cartesian(smallRDD).count()

*(2)* smallRDD.cartesian(largeRDD).count()

 

Building from master, I can see that (1) consistently takes about twice as long 
as (2) on my machine.

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> Spark SQL running time is too long while the input left table and right table 
> are small HDFS text-format data.
> The SQL is:  select * from t1 cross join t2
> t1 has 49 lines, with three columns.
> t2 has 1 line, with only one column.
> It ran for more than 30 minutes and then failed.
>  
>  
> Spark's CartesianRDD also has the same problem; example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  //1 line, 
> 1 column
> val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  //49 lines, 
> 3 columns
> val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> This runs for more than 5 minutes, while using CartesianRDD(sc, ones, twos) 
> takes less than 10 seconds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25010) Rand/Randn should produce different values for each execution in streaming query

2018-08-02 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-25010:
---

 Summary: Rand/Randn should produce different values for each 
execution in streaming query
 Key: SPARK-25010
 URL: https://issues.apache.org/jira/browse/SPARK-25010
 Project: Spark
  Issue Type: Bug
  Components: SQL, Structured Streaming
Affects Versions: 2.4.0
Reporter: Liang-Chi Hsieh


Like Uuid in SPARK-24896, the Rand and Randn expressions currently produce the same 
results in each execution of a streaming query. That doesn't make much sense for 
streaming queries; we should make them produce different results, as was done for 
Uuid.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22219) Refactor "spark.sql.codegen.comments"

2018-08-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22219.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 19449
[https://github.com/apache/spark/pull/19449]

> Refactor "spark.sql.codegen.comments"
> -
>
> Key: SPARK-22219
> URL: https://issues.apache.org/jira/browse/SPARK-22219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 2.4.0
>
>
> The current way of getting the value of {{"spark.sql.codegen.comments"}} is not 
> the latest approach. This refactoring uses a better approach to get the value of 
> {{"spark.sql.codegen.comments"}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22219) Refactor "spark.sql.codegen.comments"

2018-08-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-22219:
-

Assignee: Kazuaki Ishizaki

> Refactor "spark.sql.codegen.comments"
> -
>
> Key: SPARK-22219
> URL: https://issues.apache.org/jira/browse/SPARK-22219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 2.4.0
>
>
> The current way of getting the value of {{"spark.sql.codegen.comments"}} is not 
> the latest approach. This refactoring uses a better approach to get the value of 
> {{"spark.sql.codegen.comments"}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24909) Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts

2018-08-02 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-24909:
--
Target Version/s: 2.4.0

> Spark scheduler can hang when fetch failures, executor lost, task running on 
> lost executor, and multiple stage attempts
> ---
>
> Key: SPARK-24909
> URL: https://issues.apache.org/jira/browse/SPARK-24909
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.1
>Reporter: Thomas Graves
>Priority: Critical
>
> The DAGScheduler can hang if the executor was lost (due to a fetch failure) and 
> all the tasks in the task sets are marked as completed. 
> ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)]
> It never creates new task attempts in the task scheduler, but the DAG 
> scheduler still has pendingPartitions.
> {code:java}
> 8/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in 
> stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, 
> PROCESS_LOCAL, 7874 bytes)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 
> (repartition at Lift.scala:191) as failed due to a fetch failure from 
> ShuffleMapStage 42 (map at foo.scala:27)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 
> 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at 
> bar.scala:191) due to fetch failure
> 
> 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
> 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for 
> executor: 33 (epoch 18)
> 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 
> (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing 
> parents
> 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 
> with 59955 tasks
> 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in 
> stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320)
> 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus 
> ShuffleMapTask(44, 55769) completion from executor 33{code}
>  
>  
> In the logs above you will see that task 55769.0 finished after the executor 
> was lost and a new task set was started.  The DAG scheduler says "Ignoring 
> possibly bogus"... but on the TaskSetManager side those tasks have been marked 
> as completed for all stage attempts. The DAGScheduler gets hung here.  I did a 
> heap dump of the process and can see that 55769 is still in the DAGScheduler 
> pendingPartitions list, but the TaskSetManagers are all complete.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24896) Uuid expression should produce different values in each execution under streaming query

2018-08-02 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-24896.
--
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.4.0

> Uuid expression should produce different values in each execution under 
> streaming query
> ---
>
> Key: SPARK-24896
> URL: https://issues.apache.org/jira/browse/SPARK-24896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> Uuid's results depend on the random seed given during analysis. Thus, under a 
> streaming query, we will have the same UUIDs in each execution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24896) Uuid expression should produce different values in each execution under streaming query

2018-08-02 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-24896:
-
Affects Version/s: (was: 2.4.0)
   2.3.0
   2.3.1

> Uuid expression should produce different values in each execution under 
> streaming query
> ---
>
> Key: SPARK-24896
> URL: https://issues.apache.org/jira/browse/SPARK-24896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> Uuid's results depend on the random seed given during analysis. Thus, under a 
> streaming query, we will have the same UUIDs in each execution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24909) Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts

2018-08-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24909:


Assignee: Apache Spark

> Spark scheduler can hang when fetch failures, executor lost, task running on 
> lost executor, and multiple stage attempts
> ---
>
> Key: SPARK-24909
> URL: https://issues.apache.org/jira/browse/SPARK-24909
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.1
>Reporter: Thomas Graves
>Assignee: Apache Spark
>Priority: Critical
>
> The DAGScheduler can hang if the executor was lost (due to a fetch failure) and 
> all the tasks in the task sets are marked as completed. 
> ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)]
> It never creates new task attempts in the task scheduler, but the DAG 
> scheduler still has pendingPartitions.
> {code:java}
> 8/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in 
> stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, 
> PROCESS_LOCAL, 7874 bytes)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 
> (repartition at Lift.scala:191) as failed due to a fetch failure from 
> ShuffleMapStage 42 (map at foo.scala:27)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 
> 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at 
> bar.scala:191) due to fetch failure
> 
> 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
> 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for 
> executor: 33 (epoch 18)
> 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 
> (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing 
> parents
> 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 
> with 59955 tasks
> 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in 
> stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320)
> 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus 
> ShuffleMapTask(44, 55769) completion from executor 33{code}
>  
>  
> In the logs above you will see that task 55769.0 finished after the executor 
> was lost and a new task set was started.  The DAG scheduler says "Ignoring 
> possibly bogus"... but on the TaskSetManager side those tasks have been marked 
> as completed for all stage attempts. The DAGScheduler gets hung here.  I did a 
> heap dump of the process and can see that 55769 is still in the DAGScheduler 
> pendingPartitions list, but the TaskSetManagers are all complete.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24909) Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts

2018-08-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24909:


Assignee: (was: Apache Spark)

> Spark scheduler can hang when fetch failures, executor lost, task running on 
> lost executor, and multiple stage attempts
> ---
>
> Key: SPARK-24909
> URL: https://issues.apache.org/jira/browse/SPARK-24909
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.1
>Reporter: Thomas Graves
>Priority: Critical
>
> The DAGScheduler can hang if the executor was lost (due to a fetch failure) and 
> all the tasks in the task sets are marked as completed. 
> ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)]
> It never creates new task attempts in the task scheduler, but the DAG 
> scheduler still has pendingPartitions.
> {code:java}
> 8/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in 
> stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, 
> PROCESS_LOCAL, 7874 bytes)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 
> (repartition at Lift.scala:191) as failed due to a fetch failure from 
> ShuffleMapStage 42 (map at foo.scala:27)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 
> 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at 
> bar.scala:191) due to fetch failure
> 
> 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
> 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for 
> executor: 33 (epoch 18)
> 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 
> (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing 
> parents
> 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 
> with 59955 tasks
> 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in 
> stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320)
> 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus 
> ShuffleMapTask(44, 55769) completion from executor 33{code}
>  
>  
> In the logs above you will see that task 55769.0 finished after the executor 
> was lost and a new task set was started.  The DAG scheduler says "Ignoring 
> possibly bogus"... but on the TaskSetManager side those tasks have been marked 
> as completed for all stage attempts. The DAGScheduler gets hung here.  I did a 
> heap dump of the process and can see that 55769 is still in the DAGScheduler 
> pendingPartitions list, but the TaskSetManagers are all complete.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25009) Standalone Cluster mode application submit is not working

2018-08-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25009:


Assignee: (was: Apache Spark)

> Standalone Cluster mode application submit is not working
> -
>
> Key: SPARK-25009
> URL: https://issues.apache.org/jira/browse/SPARK-25009
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Devaraj K
>Priority: Major
>
> It does not show any error while submitting, but the app is not running and 
> does not show up in the web UI.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25009) Standalone Cluster mode application submit is not working

2018-08-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25009:


Assignee: Apache Spark

> Standalone Cluster mode application submit is not working
> -
>
> Key: SPARK-25009
> URL: https://issues.apache.org/jira/browse/SPARK-25009
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Devaraj K
>Assignee: Apache Spark
>Priority: Major
>
> It does not show any error while submitting, but the app is not running and 
> does not show up in the web UI.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25009) Standalone Cluster mode application submit is not working

2018-08-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567536#comment-16567536
 ] 

Apache Spark commented on SPARK-25009:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/21979

> Standalone Cluster mode application submit is not working
> -
>
> Key: SPARK-25009
> URL: https://issues.apache.org/jira/browse/SPARK-25009
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Devaraj K
>Priority: Major
>
> It does not show any error while submitting, but the app is not running and 
> does not show up in the web UI.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25009) Standalone Cluster mode application submit is not working

2018-08-02 Thread Devaraj K (JIRA)
Devaraj K created SPARK-25009:
-

 Summary: Standalone Cluster mode application submit is not working
 Key: SPARK-25009
 URL: https://issues.apache.org/jira/browse/SPARK-25009
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Devaraj K


Submitting the application does not show any error, but the app is not running 
and does not show up in the web UI.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25001) Fix build miscellaneous warnings

2018-08-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25001:


Assignee: (was: Apache Spark)

> Fix build miscellaneous warnings
> 
>
> Key: SPARK-25001
> URL: https://issues.apache.org/jira/browse/SPARK-25001
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> There are many warnings in the current build (for instance see 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/4734/console).
> {code}
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDB.java:237:
>  warning: [rawtypes] found raw type: LevelDBIterator
> [warn]   void closeIterator(LevelDBIterator it) throws IOException {
> [warn]  ^
> [warn]   missing type arguments for generic class LevelDBIterator
> [warn]   where T is a type-variable:
> [warn] T extends Object declared in class LevelDBIterator
> [warn] 1 warning
> [warn] Pruning sources from previous analysis, due to incompatible 
> CompileSetup.
> [warn] Pruning sources from previous analysis, due to incompatible 
> CompileSetup.
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:151:
>  warning: [deprecation] group() in AbstractBootstrap has been deprecated
> [warn] if (bootstrap != null && bootstrap.group() != null) {
> [warn]   ^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:152:
>  warning: [deprecation] group() in AbstractBootstrap has been deprecated
> [warn]   bootstrap.group().shutdownGracefully();
> [warn]^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:154:
>  warning: [deprecation] childGroup() in ServerBootstrap has been deprecated
> [warn] if (bootstrap != null && bootstrap.childGroup() != null) {
> [warn]   ^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:155:
>  warning: [deprecation] childGroup() in ServerBootstrap has been deprecated
> [warn]   bootstrap.childGroup().shutdownGracefully();
> [warn]^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/util/NettyUtils.java:112:
>  warning: [deprecation] 
> PooledByteBufAllocator(boolean,int,int,int,int,int,int,int) in 
> PooledByteBufAllocator has been deprecated
> [warn] return new PooledByteBufAllocator(
> [warn]^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java:321:
>  warning: [rawtypes] found raw type: Future
> [warn] public void operationComplete(Future future) throws Exception {
> [warn]   ^
> [warn]   missing type arguments for generic class Future
> [warn]   where V is a type-variable:
> [warn] V extends Object declared in interface Future
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java:215:
>  warning: [rawtypes] found raw type: StreamInterceptor
> [warn]   StreamInterceptor interceptor = new StreamInterceptor(this, 
> resp.streamId, resp.byteCount,
> [warn]   ^
> [warn]   missing type arguments for generic class StreamInterceptor
> [warn]   where T is a type-variable:
> [warn] T extends Message declared in class StreamInterceptor
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java:215:
>  warning: [rawtypes] found raw type: StreamInterceptor
> [warn]   StreamInterceptor interceptor = new StreamInterceptor(this, 
> resp.streamId, resp.byteCount,
> [warn]   ^
> [warn]   missing type arguments for generic class StreamInterceptor
> [warn]   where T is a type-variable:
> [warn] T extends Message declared in class StreamInterceptor
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java:215:
>  warning: [unchecked] 

[jira] [Assigned] (SPARK-25001) Fix build miscellaneous warnings

2018-08-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25001:


Assignee: Apache Spark

> Fix build miscellaneous warnings
> 
>
> Key: SPARK-25001
> URL: https://issues.apache.org/jira/browse/SPARK-25001
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> There are many warnings in the current build (for instance see 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/4734/console).
> {code}
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDB.java:237:
>  warning: [rawtypes] found raw type: LevelDBIterator
> [warn]   void closeIterator(LevelDBIterator it) throws IOException {
> [warn]  ^
> [warn]   missing type arguments for generic class LevelDBIterator
> [warn]   where T is a type-variable:
> [warn] T extends Object declared in class LevelDBIterator
> [warn] 1 warning
> [warn] Pruning sources from previous analysis, due to incompatible 
> CompileSetup.
> [warn] Pruning sources from previous analysis, due to incompatible 
> CompileSetup.
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:151:
>  warning: [deprecation] group() in AbstractBootstrap has been deprecated
> [warn] if (bootstrap != null && bootstrap.group() != null) {
> [warn]   ^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:152:
>  warning: [deprecation] group() in AbstractBootstrap has been deprecated
> [warn]   bootstrap.group().shutdownGracefully();
> [warn]^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:154:
>  warning: [deprecation] childGroup() in ServerBootstrap has been deprecated
> [warn] if (bootstrap != null && bootstrap.childGroup() != null) {
> [warn]   ^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:155:
>  warning: [deprecation] childGroup() in ServerBootstrap has been deprecated
> [warn]   bootstrap.childGroup().shutdownGracefully();
> [warn]^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/util/NettyUtils.java:112:
>  warning: [deprecation] 
> PooledByteBufAllocator(boolean,int,int,int,int,int,int,int) in 
> PooledByteBufAllocator has been deprecated
> [warn] return new PooledByteBufAllocator(
> [warn]^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java:321:
>  warning: [rawtypes] found raw type: Future
> [warn] public void operationComplete(Future future) throws Exception {
> [warn]   ^
> [warn]   missing type arguments for generic class Future
> [warn]   where V is a type-variable:
> [warn] V extends Object declared in interface Future
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java:215:
>  warning: [rawtypes] found raw type: StreamInterceptor
> [warn]   StreamInterceptor interceptor = new StreamInterceptor(this, 
> resp.streamId, resp.byteCount,
> [warn]   ^
> [warn]   missing type arguments for generic class StreamInterceptor
> [warn]   where T is a type-variable:
> [warn] T extends Message declared in class StreamInterceptor
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java:215:
>  warning: [rawtypes] found raw type: StreamInterceptor
> [warn]   StreamInterceptor interceptor = new StreamInterceptor(this, 
> resp.streamId, resp.byteCount,
> [warn]   ^
> [warn]   missing type arguments for generic class StreamInterceptor
> [warn]   where T is a type-variable:
> [warn] T extends Message declared in class StreamInterceptor
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java:215:
>  

[jira] [Created] (SPARK-25008) Add memory mode info to showMemoryUsage in TaskMemoryManager

2018-08-02 Thread Ankur Gupta (JIRA)
Ankur Gupta created SPARK-25008:
---

 Summary: Add memory mode info to showMemoryUsage in 
TaskMemoryManager
 Key: SPARK-25008
 URL: https://issues.apache.org/jira/browse/SPARK-25008
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Ankur Gupta


TaskMemoryManager prints the current memory usage information before throwing 
an OOM exception, which is helpful when debugging issues. This log does not 
include the memory mode (on-heap vs. off-heap), which would also be useful for 
quickly determining which memory users need to increase.

This JIRA is to add that information to the showMemoryUsage method of 
TaskMemoryManager.

Current logs:
{code}
18/07/03 17:57:16 INFO memory.TaskMemoryManager: Memory used in task 318
18/07/03 17:57:16 INFO memory.TaskMemoryManager: Acquired by 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@7f084d1b: 
1024.0 KB
18/07/03 17:57:16 INFO memory.TaskMemoryManager: Acquired by 
org.apache.spark.shuffle.sort.ShuffleExternalSorter@713d50f2: 32.0 KB
18/07/03 17:57:16 INFO memory.TaskMemoryManager: 0 bytes of memory were used by 
task 318 but are not associated with specific consumers
18/07/03 17:57:16 INFO memory.TaskMemoryManager: 1081344 bytes of memory are 
used for execution and 306201016 bytes of memory are used for storage
18/07/03 17:57:16 ERROR executor.Executor: Exception in task 86.0 in stage 49.0 
(TID 318)
java.lang.OutOfMemoryError: Unable to acquire 326284160 bytes of memory, got 
3112960
 at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:127)
 at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:359)
 at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:382)
 at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:246)
 at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
 at org.apache.spark.scheduler.Task.run(Task.scala:108)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25007) Add transform / array_except /array_union / array_shuffle to SparkR

2018-08-02 Thread Huaxin Gao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567521#comment-16567521
 ] 

Huaxin Gao commented on SPARK-25007:


I will work on this. Thanks!

> Add transform / array_except /array_union / array_shuffle to SparkR
> ---
>
> Key: SPARK-25007
> URL: https://issues.apache.org/jira/browse/SPARK-25007
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add R versions of:
>  * transform -SPARK-23908-
>  * array_except -SPARK-23915- 
>  * array_union -SPARK-23914- 
>  * array_shuffle -SPARK-23928-



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25007) Add transform / array_except /array_union / array_shuffle to SparkR

2018-08-02 Thread Huaxin Gao (JIRA)
Huaxin Gao created SPARK-25007:
--

 Summary: Add transform / array_except /array_union / array_shuffle 
to SparkR
 Key: SPARK-25007
 URL: https://issues.apache.org/jira/browse/SPARK-25007
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 2.4.0
Reporter: Huaxin Gao


Add R versions of:
 * transform -SPARK-23908-
 * array_except -SPARK-23915- 
 * array_union -SPARK-23914- 
 * array_shuffle -SPARK-23928-



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25006) Add optional catalog to TableIdentifier

2018-08-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25006:


Assignee: Apache Spark

> Add optional catalog to TableIdentifier
> ---
>
> Key: SPARK-25006
> URL: https://issues.apache.org/jira/browse/SPARK-25006
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
>
> For multi-catalog support, Spark table identifiers need to identify the 
> catalog for a table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25006) Add optional catalog to TableIdentifier

2018-08-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567513#comment-16567513
 ] 

Apache Spark commented on SPARK-25006:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/21978

> Add optional catalog to TableIdentifier
> ---
>
> Key: SPARK-25006
> URL: https://issues.apache.org/jira/browse/SPARK-25006
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> For multi-catalog support, Spark table identifiers need to identify the 
> catalog for a table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25006) Add optional catalog to TableIdentifier

2018-08-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25006:


Assignee: (was: Apache Spark)

> Add optional catalog to TableIdentifier
> ---
>
> Key: SPARK-25006
> URL: https://issues.apache.org/jira/browse/SPARK-25006
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> For multi-catalog support, Spark table identifiers need to identify the 
> catalog for a table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23908) High-order function: transform(array, function) → array

2018-08-02 Thread Herman van Hovell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reassigned SPARK-23908:
-

Assignee: Takuya Ueshin  (was: Herman van Hovell)

> High-order function: transform(array, function) → array
> ---
>
> Key: SPARK-23908
> URL: https://issues.apache.org/jira/browse/SPARK-23908
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array that is the result of applying function to each element of 
> array:
> {noformat}
> SELECT transform(ARRAY [], x -> x + 1); -- []
> SELECT transform(ARRAY [5, 6], x -> x + 1); -- [6, 7]
> SELECT transform(ARRAY [5, NULL, 6], x -> COALESCE(x, 0) + 1); -- [6, 1, 7]
> SELECT transform(ARRAY ['x', 'abc', 'z'], x -> x || '0'); -- ['x0', 'abc0', 
> 'z0']
> SELECT transform(ARRAY [ARRAY [1, NULL, 2], ARRAY[3, NULL]], a -> filter(a, x 
> -> x IS NOT NULL)); -- [[1, 2], [3]]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25006) Add optional catalog to TableIdentifier

2018-08-02 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-25006:
-

 Summary: Add optional catalog to TableIdentifier
 Key: SPARK-25006
 URL: https://issues.apache.org/jira/browse/SPARK-25006
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Ryan Blue


For multi-catalog support, Spark table identifiers need to identify the catalog 
for a table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25005) Structured streaming doesn't support kafka transaction (creating empty offset with abort & markers)

2018-08-02 Thread Quentin Ambard (JIRA)
Quentin Ambard created SPARK-25005:
--

 Summary: Structured streaming doesn't support kafka transaction 
(creating empty offset with abort & markers)
 Key: SPARK-25005
 URL: https://issues.apache.org/jira/browse/SPARK-25005
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.1
Reporter: Quentin Ambard


Structured Streaming can't consume Kafka transactions.
We could try to apply the SPARK-24720 (DStreams) logic to the Structured 
Streaming source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24720) kafka transaction creates Non-consecutive Offsets (due to transaction offset) making streaming fail when failOnDataLoss=true

2018-08-02 Thread Quentin Ambard (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quentin Ambard updated SPARK-24720:
---
Component/s: (was: Structured Streaming)
 DStreams

> kafka transaction creates Non-consecutive Offsets (due to transaction offset) 
> making streaming fail when failOnDataLoss=true
> 
>
> Key: SPARK-24720
> URL: https://issues.apache.org/jira/browse/SPARK-24720
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.3.1
>Reporter: Quentin Ambard
>Priority: Major
>
> When Kafka transactions are used, sending 1 message to Kafka will result in 1 
> offset for the data + 1 offset to mark the transaction.
> When the Kafka connector for Spark Streaming reads a topic with non-consecutive 
> offsets, it leads to a failure. SPARK-17147 fixed this issue for compacted 
> topics.
> However, SPARK-17147 doesn't fix this issue for Kafka transactions: if 1 
> message + 1 transaction commit are in a partition, Spark will try to read 
> offsets [0, 2). Offset 0 (containing the message) will be read, but offset 1 
> won't return a value, and buffer.hasNext() will be false even after a poll, 
> since no data is present for offset 1 (it's the transaction commit).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24705) Spark.sql.adaptive.enabled=true is enabled and self-join query

2018-08-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24705.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.4.0

> Spark.sql.adaptive.enabled=true is enabled and self-join query
> --
>
> Key: SPARK-24705
> URL: https://issues.apache.org/jira/browse/SPARK-24705
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: cheng
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: Error stack.txt
>
>
> [~smilegator]
> When loading data using JDBC with spark.sql.adaptive.enabled=true (for 
> example, loading a device_loc table from the JDBC data source), unexpected 
> results can occur when you use the following query:
> select tv_a.imei
> from ( select a.imei, a.speed from device_loc a ) tv_a
> inner join ( select a.imei, a.speed from device_loc a ) tv_b on tv_a.imei = 
> tv_b.imei
> group by tv_a.imei
> When "cache table device_loc" is executed before this query, everything is 
> fine. However, without the cache table, unexpected results occur and the 
> query fails to execute.
> Remarks: the attachment records the stack trace from when the error occurred.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20964) Make some keywords reserved along with the ANSI/SQL standard

2018-08-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20964:

Target Version/s: 3.0.0

> Make some keywords reserved along with the ANSI/SQL standard
> 
>
> Key: SPARK-20964
> URL: https://issues.apache.org/jira/browse/SPARK-20964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> Spark currently has many non-reserved words that are essentially reserved in 
> the ANSI/SQL standard 
> (http://developer.mimer.se/validator/sql-reserved-words.tml). 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L709
> This is because there are many data sources (for instance twitter4j) that 
> unfortunately use reserved keywords for column names (see [~hvanhovell]'s 
> comments: https://github.com/apache/spark/pull/18079#discussion_r118842186). 
> We might fix this issue in future major releases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23908) High-order function: transform(array, function) → array

2018-08-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23908.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> High-order function: transform(array, function) → array
> ---
>
> Key: SPARK-23908
> URL: https://issues.apache.org/jira/browse/SPARK-23908
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Herman van Hovell
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array that is the result of applying function to each element of 
> array:
> {noformat}
> SELECT transform(ARRAY [], x -> x + 1); -- []
> SELECT transform(ARRAY [5, 6], x -> x + 1); -- [6, 7]
> SELECT transform(ARRAY [5, NULL, 6], x -> COALESCE(x, 0) + 1); -- [6, 1, 7]
> SELECT transform(ARRAY ['x', 'abc', 'z'], x -> x || '0'); -- ['x0', 'abc0', 
> 'z0']
> SELECT transform(ARRAY [ARRAY [1, NULL, 2], ARRAY[3, NULL]], a -> filter(a, x 
> -> x IS NOT NULL)); -- [[1, 2], [3]]
> {noformat}
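The examples above use Presto's ARRAY [...] literal syntax. As a rough, 
illustrative sketch of how the same higher-order function could be exercised 
from PySpark with this resolved for 2.4.0 (assuming an existing SparkSession 
named spark and Spark's array(...) constructor; not taken from the ticket 
itself):

{code:python}
# Illustrative only: calling the transform() higher-order function through
# Spark SQL from PySpark. `spark` is assumed to be an existing SparkSession.
spark.sql("SELECT transform(array(5, 6), x -> x + 1) AS plus_one").show()
spark.sql(
    "SELECT transform(array(5, NULL, 6), x -> coalesce(x, 0) + 1) AS filled"
).show()
{code}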



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25004) Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS

2018-08-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25004:


Assignee: (was: Apache Spark)

> Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS
> --
>
> Key: SPARK-25004
> URL: https://issues.apache.org/jira/browse/SPARK-25004
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> Some platforms support limiting Python's addressable memory space by limiting 
> [{{resource.RLIMIT_AS}}|https://docs.python.org/3/library/resource.html#resource.RLIMIT_AS].
> We've found that adding a limit is very useful when running in YARN because 
> when Python doesn't know about memory constraints, it doesn't know when to 
> garbage collect and will continue using memory when it doesn't need to. 
> Adding a limit reduces PySpark memory consumption and avoids YARN killing 
> containers because Python hasn't cleaned up memory.
> This also improves error messages for users, allowing them to see when Python 
> is allocating too much memory instead of YARN killing the container:
> {code:lang=python}
>   File "build/bdist.linux-x86_64/egg/package/library.py", line 265, in 
> fe_engineer
> fe_eval_rec.update(f(src_rec_prep, mat_rec_prep))
>   File "build/bdist.linux-x86_64/egg/package/library.py", line 163, in fe_comp
> comparisons = EvaluationUtils.leven_list_compare(src_rec_prep.get(item, 
> []), mat_rec_prep.get(item, []))
>   File "build/bdist.linux-x86_64/egg/package/evaluationutils.py", line 25, in 
> leven_list_compare
> permutations = sorted(permutations, reverse=True)
>   MemoryError
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25004) Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS

2018-08-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567396#comment-16567396
 ] 

Apache Spark commented on SPARK-25004:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/21977

> Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS
> --
>
> Key: SPARK-25004
> URL: https://issues.apache.org/jira/browse/SPARK-25004
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> Some platforms support limiting Python's addressable memory space by limiting 
> [{{resource.RLIMIT_AS}}|https://docs.python.org/3/library/resource.html#resource.RLIMIT_AS].
> We've found that adding a limit is very useful when running in YARN because 
> when Python doesn't know about memory constraints, it doesn't know when to 
> garbage collect and will continue using memory when it doesn't need to. 
> Adding a limit reduces PySpark memory consumption and avoids YARN killing 
> containers because Python hasn't cleaned up memory.
> This also improves error messages for users, allowing them to see when Python 
> is allocating too much memory instead of YARN killing the container:
> {code:lang=python}
>   File "build/bdist.linux-x86_64/egg/package/library.py", line 265, in 
> fe_engineer
> fe_eval_rec.update(f(src_rec_prep, mat_rec_prep))
>   File "build/bdist.linux-x86_64/egg/package/library.py", line 163, in fe_comp
> comparisons = EvaluationUtils.leven_list_compare(src_rec_prep.get(item, 
> []), mat_rec_prep.get(item, []))
>   File "build/bdist.linux-x86_64/egg/package/evaluationutils.py", line 25, in 
> leven_list_compare
> permutations = sorted(permutations, reverse=True)
>   MemoryError
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25004) Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS

2018-08-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25004:


Assignee: Apache Spark

> Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS
> --
>
> Key: SPARK-25004
> URL: https://issues.apache.org/jira/browse/SPARK-25004
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
>
> Some platforms support limiting Python's addressable memory space by limiting 
> [{{resource.RLIMIT_AS}}|https://docs.python.org/3/library/resource.html#resource.RLIMIT_AS].
> We've found that adding a limit is very useful when running in YARN because 
> when Python doesn't know about memory constraints, it doesn't know when to 
> garbage collect and will continue using memory when it doesn't need to. 
> Adding a limit reduces PySpark memory consumption and avoids YARN killing 
> containers because Python hasn't cleaned up memory.
> This also improves error messages for users, allowing them to see when Python 
> is allocating too much memory instead of YARN killing the container:
> {code:lang=python}
>   File "build/bdist.linux-x86_64/egg/package/library.py", line 265, in 
> fe_engineer
> fe_eval_rec.update(f(src_rec_prep, mat_rec_prep))
>   File "build/bdist.linux-x86_64/egg/package/library.py", line 163, in fe_comp
> comparisons = EvaluationUtils.leven_list_compare(src_rec_prep.get(item, 
> []), mat_rec_prep.get(item, []))
>   File "build/bdist.linux-x86_64/egg/package/evaluationutils.py", line 25, in 
> leven_list_compare
> permutations = sorted(permutations, reverse=True)
>   MemoryError
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-08-02 Thread Mingjie Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567393#comment-16567393
 ] 

Mingjie Tang commented on SPARK-24615:
--

From the user's perspective, users only care about the GPU resources for an 
RDD and do not understand the stages or partitions of the RDD. Therefore, the 
underlying resource allocation mechanism would assign the resources to 
executors automatically.

Similar to caching or persisting at different levels, maybe we can provide 
different configurations to users. Resource allocation would then follow the 
predefined policy to allocate resources.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This will introduce some problems when 
> scheduling tasks that require accelerators.
>  # CPU cores are usually more numerous than accelerators on one node, so 
> using CPU cores to schedule accelerator-required tasks will introduce a 
> mismatch.
>  # In one cluster, we always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25004) Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS

2018-08-02 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-25004:
--
Description: 
Some platforms support limiting Python's addressable memory space by limiting 
[{{resource.RLIMIT_AS}}|https://docs.python.org/3/library/resource.html#resource.RLIMIT_AS].

We've found that adding a limit is very useful when running in YARN because 
when Python doesn't know about memory constraints, it doesn't know when to 
garbage collect and will continue using memory when it doesn't need to. Adding 
a limit reduces PySpark memory consumption and avoids YARN killing containers 
because Python hasn't cleaned up memory.

This also improves error messages for users, allowing them to see when Python 
is allocating too much memory instead of YARN killing the container:

{code:lang=python}
  File "build/bdist.linux-x86_64/egg/package/library.py", line 265, in 
fe_engineer
fe_eval_rec.update(f(src_rec_prep, mat_rec_prep))
  File "build/bdist.linux-x86_64/egg/package/library.py", line 163, in fe_comp
comparisons = EvaluationUtils.leven_list_compare(src_rec_prep.get(item, 
[]), mat_rec_prep.get(item, []))
  File "build/bdist.linux-x86_64/egg/package/evaluationutils.py", line 25, in 
leven_list_compare
permutations = sorted(permutations, reverse=True)
  MemoryError
{code}

  was:
Some platforms support limiting Python's addressable memory space by limiting 
[{{resource.RLIMIT_AS}}|https://docs.python.org/3/library/resource.html#resource.RLIMIT_AS].

We've found that adding a limit is very useful when running in YARN because 
when Python doesn't know about memory constraints, it doesn't know when to 
garbage collect and will continue using memory when it doesn't need to. Adding 
a limit reduces PySpark memory consumption and avoids YARN killing containers 
because Python hasn't cleaned up memory.

This also improves error messages for users, allowing them to see when Python 
is allocating too much memory instead of YARN killing the container:

{code:lang=python}
  File "build/bdist.linux-x86_64/egg/package/library.py", line 265, in 
fe_engineer
fe_eval_rec.update(f(src_rec_prep, mat_rec_prep))
  File "build/bdist.linux-x86_64/egg/package/library.py", line 163, in fe_comp
comparisons = EvaluationUtils.leven_list_compare(src_rec_prep.get(item, 
[]), mat_rec_prep.get(item, []))
  File "build/bdist.linux-x86_64/egg/package/evaluationutils.py", line 25, in 
leven_list_compare
permutations = sorted(permutations, reverse=True)
MemoryError
{code}


> Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS
> --
>
> Key: SPARK-25004
> URL: https://issues.apache.org/jira/browse/SPARK-25004
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> Some platforms support limiting Python's addressable memory space by limiting 
> [{{resource.RLIMIT_AS}}|https://docs.python.org/3/library/resource.html#resource.RLIMIT_AS].
> We've found that adding a limit is very useful when running in YARN because 
> when Python doesn't know about memory constraints, it doesn't know when to 
> garbage collect and will continue using memory when it doesn't need to. 
> Adding a limit reduces PySpark memory consumption and avoids YARN killing 
> containers because Python hasn't cleaned up memory.
> This also improves error messages for users, allowing them to see when Python 
> is allocating too much memory instead of YARN killing the container:
> {code:lang=python}
>   File "build/bdist.linux-x86_64/egg/package/library.py", line 265, in 
> fe_engineer
> fe_eval_rec.update(f(src_rec_prep, mat_rec_prep))
>   File "build/bdist.linux-x86_64/egg/package/library.py", line 163, in fe_comp
> comparisons = EvaluationUtils.leven_list_compare(src_rec_prep.get(item, 
> []), mat_rec_prep.get(item, []))
>   File "build/bdist.linux-x86_64/egg/package/evaluationutils.py", line 25, in 
> leven_list_compare
> permutations = sorted(permutations, reverse=True)
>   MemoryError
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25004) Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS

2018-08-02 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-25004:
-

 Summary: Add spark.executor.pyspark.memory config to set 
resource.RLIMIT_AS
 Key: SPARK-25004
 URL: https://issues.apache.org/jira/browse/SPARK-25004
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Ryan Blue


Some platforms support limiting Python's addressable memory space by limiting 
[{{resource.RLIMIT_AS}}|https://docs.python.org/3/library/resource.html#resource.RLIMIT_AS].

We've found that adding a limit is very useful when running in YARN because 
when Python doesn't know about memory constraints, it doesn't know when to 
garbage collect and will continue using memory when it doesn't need to. Adding 
a limit reduces PySpark memory consumption and avoids YARN killing containers 
because Python hasn't cleaned up memory.

This also improves error messages for users, allowing them to see when Python 
is allocating too much memory instead of YARN killing the container:

{code:lang=python}
  File "build/bdist.linux-x86_64/egg/package/library.py", line 265, in 
fe_engineer
fe_eval_rec.update(f(src_rec_prep, mat_rec_prep))
  File "build/bdist.linux-x86_64/egg/package/library.py", line 163, in fe_comp
comparisons = EvaluationUtils.leven_list_compare(src_rec_prep.get(item, 
[]), mat_rec_prep.get(item, []))
  File "build/bdist.linux-x86_64/egg/package/evaluationutils.py", line 25, in 
leven_list_compare
permutations = sorted(permutations, reverse=True)
MemoryError
{code}
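As a rough illustration of the mechanism (not the actual Spark patch; the 
proposed spark.executor.pyspark.memory config and its wiring into the worker 
are not shown), such a limit can be applied in a Python worker with the 
standard resource module; the 2 GiB figure below is an arbitrary example value:

{code:python}
# Minimal sketch, illustrative only: cap the Python process's addressable
# memory so large allocations fail with MemoryError inside Python instead of
# the container being killed by YARN. Unix-only (resource module).
import resource

def set_memory_limit(limit_bytes):
    # Keep the existing hard limit; only lower the soft limit.
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))

set_memory_limit(2 * 1024 * 1024 * 1024)  # example: 2 GiB
{code}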



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2018-08-02 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567383#comment-16567383
 ] 

Reynold Xin commented on SPARK-14220:
-

This is awesome! Congrats!

 

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-02 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567367#comment-16567367
 ] 

Stavros Kontopoulos commented on SPARK-24434:
-

Btw, I have started working on a PR for this, so I expect many more things to 
come up.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming in, 
> the current approach of adding new Spark configuration options has some 
> serious drawbacks: 1) it means more Kubernetes-specific configuration options 
> to maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24817) Implement BarrierTaskContext.barrier()

2018-08-02 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567358#comment-16567358
 ] 

Erik Erlandson commented on SPARK-24817:


I have been looking at the use cases for barrier mode in the design doc. The 
primary story seems to be along the lines of using {{mapPartitions}} to:
 # write out any partitioned data (and sync)
 # execute some kind of ML logic (TF, etc.) (possibly syncing on stages here?)
 # optionally move back into "normal" Spark execution

My mental model has been that the value proposition for Hydrogen is primarily a 
convergence argument: it is easier to not have to leave a Spark workflow to 
execute something like TF with some other toolchain. But OTOH, given that the 
Spark programmer has to write out the partitioned data and then invoke ML 
tooling like TF regardless, does the added convenience pay for the cost in 
complexity of absorbing new clustering and scheduling models into Spark, along 
with other consequences such as SPARK-24615? The "null hypothesis" would be 
writing out the partition data, then using ML-specific clustering toolchains 
(kubeflow, for example), and consuming the resulting products in Spark 
afterward.

> Implement BarrierTaskContext.barrier()
> --
>
> Key: SPARK-24817
> URL: https://issues.apache.org/jira/browse/SPARK-24817
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Implement BarrierTaskContext.barrier(), to support global sync between all 
> the tasks in a barrier stage. The global sync shall finish immediately once 
> all tasks in the same barrier stage reach the same barrier.
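For context, a rough sketch of how this is expected to be used from PySpark 
(illustrative only; the BarrierTaskContext / RDD.barrier() names follow the 
barrier execution mode design and may differ in the final API; sc is an 
assumed SparkContext):

{code:python}
# Illustrative sketch of barrier-mode usage, not code from this ticket.
from pyspark import BarrierTaskContext

def train_partition(iterator):
    context = BarrierTaskContext.get()
    context.barrier()  # global sync: returns once every task in the stage is here
    # ... run the ML logic (e.g. a TF worker) for this partition ...
    yield sum(1 for _ in iterator)

result = sc.parallelize(range(8), 4).barrier() \
           .mapPartitions(train_partition).collect()
{code}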



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-02 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567335#comment-16567335
 ] 

Stavros Kontopoulos commented on SPARK-24434:
-

Thanks [~rvesse] I will have a look.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming in, 
> the current approach of adding new Spark configuration options has some 
> serious drawbacks: 1) it means more Kubernetes-specific configuration options 
> to maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24817) Implement BarrierTaskContext.barrier()

2018-08-02 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567301#comment-16567301
 ] 

Erik Erlandson commented on SPARK-24817:


Thanks [~jiangxb] - I'd expect that design to work out of the box on the k8s 
backend.

ML-specific code seems like it will have needs that are harder to predict, by 
definition. If it can use IP addresses in the cluster space, it should work 
regardless. If it wants FQDNs, then perhaps additional pod configurations will 
be required.

> Implement BarrierTaskContext.barrier()
> --
>
> Key: SPARK-24817
> URL: https://issues.apache.org/jira/browse/SPARK-24817
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Implement BarrierTaskContext.barrier(), to support global sync between all 
> the tasks in a barrier stage. The global sync shall finish immediately once 
> all tasks in the same barrier stage reach the same barrier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24988) Add a castBySchema method which casts all the values of a DataFrame based on the DataTypes of a StructType

2018-08-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24988.
--
Resolution: Won't Fix

> Add a castBySchema method which casts all the values of a DataFrame based on 
> the DataTypes of a StructType
> --
>
> Key: SPARK-24988
> URL: https://issues.apache.org/jira/browse/SPARK-24988
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: mahmoud mehdi
>Priority: Minor
>
> The main goal of this User Story is to extend the DataFrame methods in order 
> to add a method which casts all the values of a DataFrame based on the 
> DataTypes of a StructType.
> This feature can be useful when we have a large DataFrame and need to make 
> multiple casts. In that case, we won't have to cast each value independently; 
> all we have to do is pass a StructType with the types we need to the method 
> castBySchema (in real-world examples, this schema is generally provided by 
> the client, which was my case).
> I'll explain the new feature via an example. Let's create a DataFrame of 
> strings:
> {code:scala}
> val df = Seq(("test1", "0"), ("test2", "1")).toDF("name", "id")
> {code}
> Let's suppose that we want to cast the second column's values to integers; 
> all we have to do is the following:
> {code:scala}
> val schema = StructType( Seq( StructField("name", StringType, true), 
> StructField("id", IntegerType, true)))
> {code}
> {code:scala}
> df.castBySchema(schema)
> {code}
> I made sure that castBySchema also works with nested StructTypes by adding 
> several tests.
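For reference, since this was resolved as Won't Fix, roughly the same effect 
can already be obtained with the existing DataFrame API. A minimal sketch 
(illustrative only, assuming flat top-level fields and an existing SparkSession 
named spark; cast_by_schema is not a Spark method):

{code:python}
# Minimal sketch: cast every column to the DataType declared in a StructType.
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

def cast_by_schema(df, schema):
    return df.select([col(f.name).cast(f.dataType) for f in schema.fields])

schema = StructType([StructField("name", StringType(), True),
                     StructField("id", IntegerType(), True)])
df = spark.createDataFrame([("test1", "0"), ("test2", "1")], ["name", "id"])
casted = cast_by_schema(df, schema)  # "id" is now an integer column
{code}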



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25003) Pyspark Does not use Spark Sql Extensions

2018-08-02 Thread Russell Spitzer (JIRA)
Russell Spitzer created SPARK-25003:
---

 Summary: Pyspark Does not use Spark Sql Extensions
 Key: SPARK-25003
 URL: https://issues.apache.org/jira/browse/SPARK-25003
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.1, 2.2.2
Reporter: Russell Spitzer


When creating a SparkSession here

[https://github.com/apache/spark/blob/v2.2.2/python/pyspark/sql/session.py#L216]
{code:python}
if jsparkSession is None:
  jsparkSession = self._jvm.SparkSession(self._jsc.sc())
self._jsparkSession = jsparkSession
{code}

I believe it ends up calling the constructor here
https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L85-L87
{code:scala}
  private[sql] def this(sc: SparkContext) {
this(sc, None, None, new SparkSessionExtensions)
  }
{code}

This constructor creates a new SparkSessionExtensions object and does not pick 
up extensions that could have been set in the config, as the companion 
object's getOrCreate does.
https://github.com/apache/spark/blob/v2.2.2/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L928-L944
{code:scala}
//in getOrCreate
// Initialize extensions if the user has defined a configurator class.
val extensionConfOption = 
sparkContext.conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS)
if (extensionConfOption.isDefined) {
  val extensionConfClassName = extensionConfOption.get
  try {
val extensionConfClass = Utils.classForName(extensionConfClassName)
val extensionConf = extensionConfClass.newInstance()
  .asInstanceOf[SparkSessionExtensions => Unit]
extensionConf(extensions)
  } catch {
// Ignore the error if we cannot find the class or when the class 
has the wrong type.
case e @ (_: ClassCastException |
  _: ClassNotFoundException |
  _: NoClassDefFoundError) =>
  logWarning(s"Cannot use $extensionConfClassName to configure 
session extensions.", e)
  }
}
{code}

I think a quick fix would be to use the getOrCreate method from the companion 
object instead of calling the constructor with the SparkContext (see the sketch 
below). Alternatively, we could fix this by ensuring that all constructors 
attempt to pick up custom extensions if they are set.
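A rough sketch of what that quick fix could look like in 
python/pyspark/sql/session.py (illustrative only, not a tested patch; whether 
the builder is reachable this way through py4j, and how an already-active 
session should be handled, would need to be verified):

{code:python}
# Illustrative sketch only: route session creation through the companion
# object's getOrCreate (which reads spark.sql.extensions) instead of calling
# the bare constructor. Assumes self._jvm.SparkSession.builder() resolves via
# py4j to the Scala companion object's builder.
if jsparkSession is None:
    jsparkSession = self._jvm.SparkSession.builder().getOrCreate()
self._jsparkSession = jsparkSession
{code}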



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25001) Fix build miscellaneous warnings

2018-08-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567136#comment-16567136
 ] 

Apache Spark commented on SPARK-25001:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/21975

> Fix build miscellaneous warnings
> 
>
> Key: SPARK-25001
> URL: https://issues.apache.org/jira/browse/SPARK-25001
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> There are many warnings in the current build (for instance see 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/4734/console).
> {code}
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDB.java:237:
>  warning: [rawtypes] found raw type: LevelDBIterator
> [warn]   void closeIterator(LevelDBIterator it) throws IOException {
> [warn]  ^
> [warn]   missing type arguments for generic class LevelDBIterator
> [warn]   where T is a type-variable:
> [warn] T extends Object declared in class LevelDBIterator
> [warn] 1 warning
> [warn] Pruning sources from previous analysis, due to incompatible 
> CompileSetup.
> [warn] Pruning sources from previous analysis, due to incompatible 
> CompileSetup.
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:151:
>  warning: [deprecation] group() in AbstractBootstrap has been deprecated
> [warn] if (bootstrap != null && bootstrap.group() != null) {
> [warn]   ^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:152:
>  warning: [deprecation] group() in AbstractBootstrap has been deprecated
> [warn]   bootstrap.group().shutdownGracefully();
> [warn]^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:154:
>  warning: [deprecation] childGroup() in ServerBootstrap has been deprecated
> [warn] if (bootstrap != null && bootstrap.childGroup() != null) {
> [warn]   ^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:155:
>  warning: [deprecation] childGroup() in ServerBootstrap has been deprecated
> [warn]   bootstrap.childGroup().shutdownGracefully();
> [warn]^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/util/NettyUtils.java:112:
>  warning: [deprecation] 
> PooledByteBufAllocator(boolean,int,int,int,int,int,int,int) in 
> PooledByteBufAllocator has been deprecated
> [warn] return new PooledByteBufAllocator(
> [warn]^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java:321:
>  warning: [rawtypes] found raw type: Future
> [warn] public void operationComplete(Future future) throws Exception {
> [warn]   ^
> [warn]   missing type arguments for generic class Future
> [warn]   where V is a type-variable:
> [warn] V extends Object declared in interface Future
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java:215:
>  warning: [rawtypes] found raw type: StreamInterceptor
> [warn]   StreamInterceptor interceptor = new StreamInterceptor(this, 
> resp.streamId, resp.byteCount,
> [warn]   ^
> [warn]   missing type arguments for generic class StreamInterceptor
> [warn]   where T is a type-variable:
> [warn] T extends Message declared in class StreamInterceptor
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java:215:
>  warning: [rawtypes] found raw type: StreamInterceptor
> [warn]   StreamInterceptor interceptor = new StreamInterceptor(this, 
> resp.streamId, resp.byteCount,
> [warn]   ^
> [warn]   missing type arguments for generic class StreamInterceptor
> [warn]   where T is a type-variable:
> [warn] T extends Message declared in class StreamInterceptor
> [warn] 
> 

[jira] [Assigned] (SPARK-25002) Avro: revise the output record namespace

2018-08-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25002:


Assignee: (was: Apache Spark)

> Avro: revise the output record namespace
> 
>
> Key: SPARK-25002
> URL: https://issues.apache.org/jira/browse/SPARK-25002
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Currently the output record namespace starts with ".".
> Although this is valid according to the Avro spec, we should remove the leading 
> dot to avoid failures when the output file is read by other libraries:
> https://github.com/linkedin/goavro/pull/96



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25002) Avro: revise the output record namespace

2018-08-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25002:


Assignee: Apache Spark

> Avro: revise the output record namespace
> 
>
> Key: SPARK-25002
> URL: https://issues.apache.org/jira/browse/SPARK-25002
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Currently the output record namespace starts with ".".
> Although this is valid according to the Avro spec, we should remove the leading 
> dot to avoid failures when the output file is read by other libraries:
> https://github.com/linkedin/goavro/pull/96



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25002) Avro: revise the output record namespace

2018-08-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567134#comment-16567134
 ] 

Apache Spark commented on SPARK-25002:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/21974

> Avro: revise the output record namespace
> 
>
> Key: SPARK-25002
> URL: https://issues.apache.org/jira/browse/SPARK-25002
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Currently the output record namespace starts with ".".
> Although this is valid according to the Avro spec, we should remove the leading 
> dot to avoid failures when the output file is read by other libraries:
> https://github.com/linkedin/goavro/pull/96



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25001) Fix build miscellaneous warnings

2018-08-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25001:
-
Summary: Fix build miscellaneous warnings  (was: Handle build warnings in 
common, core, launcher, mllib, sql)

> Fix build miscellaneous warnings
> 
>
> Key: SPARK-25001
> URL: https://issues.apache.org/jira/browse/SPARK-25001
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> There are many warnings in the current build (for instance see 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/4734/console).
> {code}
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDB.java:237:
>  warning: [rawtypes] found raw type: LevelDBIterator
> [warn]   void closeIterator(LevelDBIterator it) throws IOException {
> [warn]  ^
> [warn]   missing type arguments for generic class LevelDBIterator
> [warn]   where T is a type-variable:
> [warn] T extends Object declared in class LevelDBIterator
> [warn] 1 warning
> [warn] Pruning sources from previous analysis, due to incompatible 
> CompileSetup.
> [warn] Pruning sources from previous analysis, due to incompatible 
> CompileSetup.
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:151:
>  warning: [deprecation] group() in AbstractBootstrap has been deprecated
> [warn] if (bootstrap != null && bootstrap.group() != null) {
> [warn]   ^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:152:
>  warning: [deprecation] group() in AbstractBootstrap has been deprecated
> [warn]   bootstrap.group().shutdownGracefully();
> [warn]^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:154:
>  warning: [deprecation] childGroup() in ServerBootstrap has been deprecated
> [warn] if (bootstrap != null && bootstrap.childGroup() != null) {
> [warn]   ^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:155:
>  warning: [deprecation] childGroup() in ServerBootstrap has been deprecated
> [warn]   bootstrap.childGroup().shutdownGracefully();
> [warn]^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/util/NettyUtils.java:112:
>  warning: [deprecation] 
> PooledByteBufAllocator(boolean,int,int,int,int,int,int,int) in 
> PooledByteBufAllocator has been deprecated
> [warn] return new PooledByteBufAllocator(
> [warn]^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java:321:
>  warning: [rawtypes] found raw type: Future
> [warn] public void operationComplete(Future future) throws Exception {
> [warn]   ^
> [warn]   missing type arguments for generic class Future
> [warn]   where V is a type-variable:
> [warn] V extends Object declared in interface Future
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java:215:
>  warning: [rawtypes] found raw type: StreamInterceptor
> [warn]   StreamInterceptor interceptor = new StreamInterceptor(this, 
> resp.streamId, resp.byteCount,
> [warn]   ^
> [warn]   missing type arguments for generic class StreamInterceptor
> [warn]   where T is a type-variable:
> [warn] T extends Message declared in class StreamInterceptor
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java:215:
>  warning: [rawtypes] found raw type: StreamInterceptor
> [warn]   StreamInterceptor interceptor = new StreamInterceptor(this, 
> resp.streamId, resp.byteCount,
> [warn]   ^
> [warn]   missing type arguments for generic class StreamInterceptor
> [warn]   where T is a type-variable:
> [warn] T extends Message declared in class StreamInterceptor
> [warn] 
> 

[jira] [Updated] (SPARK-25002) Avro: revise the output record namespace

2018-08-02 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-25002:
---
Summary: Avro: revise the output record namespace  (was: Avro: revise the 
output namespace)

> Avro: revise the output record namespace
> 
>
> Key: SPARK-25002
> URL: https://issues.apache.org/jira/browse/SPARK-25002
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Currently the output record namespace starts with ".".
> Although this is valid according to the Avro spec, we should remove the leading 
> dot to avoid failures when the output file is read by other libraries:
> https://github.com/linkedin/goavro/pull/96



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25002) Avro: revise the output namespace

2018-08-02 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-25002:
--

 Summary: Avro: revise the output namespace
 Key: SPARK-25002
 URL: https://issues.apache.org/jira/browse/SPARK-25002
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Gengliang Wang


Currently the output namespace starts with ".".

Although this is valid according to the Avro spec, we should remove the leading dot 
to avoid failures when the output file is read by other libraries:

https://github.com/linkedin/goavro/pull/96
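
As a small illustration (a hedged sketch, not the actual fix), the cleanup amounts to 
stripping a leading dot from the generated namespace string; the example namespaces 
below are made up:

{code}
// Illustrative helper only: drop a leading "." from a generated Avro record namespace
// so that stricter readers (e.g. goavro) do not reject it.
def cleanNamespace(ns: String): String = ns.stripPrefix(".")

assert(cleanNamespace(".topLevelRecord") == "topLevelRecord")
assert(cleanNamespace("org.apache.spark") == "org.apache.spark")
{code}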



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25001) Handle build warnings in common, core, launcher, mllib, sql

2018-08-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25001:
-
Summary: Handle build warnings in common, core, launcher, mllib, sql  (was: 
Remove build warnings in common, core, launcher, mllib, sql)

> Handle build warnings in common, core, launcher, mllib, sql
> ---
>
> Key: SPARK-25001
> URL: https://issues.apache.org/jira/browse/SPARK-25001
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> There are many warnings in the current build (for instance see 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/4734/console).
> {code}
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDB.java:237:
>  warning: [rawtypes] found raw type: LevelDBIterator
> [warn]   void closeIterator(LevelDBIterator it) throws IOException {
> [warn]  ^
> [warn]   missing type arguments for generic class LevelDBIterator
> [warn]   where T is a type-variable:
> [warn] T extends Object declared in class LevelDBIterator
> [warn] 1 warning
> [warn] Pruning sources from previous analysis, due to incompatible 
> CompileSetup.
> [warn] Pruning sources from previous analysis, due to incompatible 
> CompileSetup.
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:151:
>  warning: [deprecation] group() in AbstractBootstrap has been deprecated
> [warn] if (bootstrap != null && bootstrap.group() != null) {
> [warn]   ^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:152:
>  warning: [deprecation] group() in AbstractBootstrap has been deprecated
> [warn]   bootstrap.group().shutdownGracefully();
> [warn]^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:154:
>  warning: [deprecation] childGroup() in ServerBootstrap has been deprecated
> [warn] if (bootstrap != null && bootstrap.childGroup() != null) {
> [warn]   ^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:155:
>  warning: [deprecation] childGroup() in ServerBootstrap has been deprecated
> [warn]   bootstrap.childGroup().shutdownGracefully();
> [warn]^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/util/NettyUtils.java:112:
>  warning: [deprecation] 
> PooledByteBufAllocator(boolean,int,int,int,int,int,int,int) in 
> PooledByteBufAllocator has been deprecated
> [warn] return new PooledByteBufAllocator(
> [warn]^
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java:321:
>  warning: [rawtypes] found raw type: Future
> [warn] public void operationComplete(Future future) throws Exception {
> [warn]   ^
> [warn]   missing type arguments for generic class Future
> [warn]   where V is a type-variable:
> [warn] V extends Object declared in interface Future
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java:215:
>  warning: [rawtypes] found raw type: StreamInterceptor
> [warn]   StreamInterceptor interceptor = new StreamInterceptor(this, 
> resp.streamId, resp.byteCount,
> [warn]   ^
> [warn]   missing type arguments for generic class StreamInterceptor
> [warn]   where T is a type-variable:
> [warn] T extends Message declared in class StreamInterceptor
> [warn] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java:215:
>  warning: [rawtypes] found raw type: StreamInterceptor
> [warn]   StreamInterceptor interceptor = new StreamInterceptor(this, 
> resp.streamId, resp.byteCount,
> [warn]   ^
> [warn]   missing type arguments for generic class StreamInterceptor
> [warn]   where T is a type-variable:
> [warn] T extends Message declared in class StreamInterceptor
> [warn] 
> 

[jira] [Created] (SPARK-25001) Remove build warnings in common, core, launcher, mllib, sql

2018-08-02 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-25001:


 Summary: Remove build warnings in common, core, launcher, mllib, 
sql
 Key: SPARK-25001
 URL: https://issues.apache.org/jira/browse/SPARK-25001
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.0
Reporter: Hyukjin Kwon


There are many warnings in the current build (for instance see 
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/4734/console).

{code}
[warn] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDB.java:237:
 warning: [rawtypes] found raw type: LevelDBIterator
[warn]   void closeIterator(LevelDBIterator it) throws IOException {
[warn]  ^
[warn]   missing type arguments for generic class LevelDBIterator
[warn]   where T is a type-variable:
[warn] T extends Object declared in class LevelDBIterator
[warn] 1 warning
[warn] Pruning sources from previous analysis, due to incompatible CompileSetup.
[warn] Pruning sources from previous analysis, due to incompatible CompileSetup.

[warn] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:151:
 warning: [deprecation] group() in AbstractBootstrap has been deprecated
[warn] if (bootstrap != null && bootstrap.group() != null) {
[warn]   ^
[warn] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:152:
 warning: [deprecation] group() in AbstractBootstrap has been deprecated
[warn]   bootstrap.group().shutdownGracefully();
[warn]^
[warn] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:154:
 warning: [deprecation] childGroup() in ServerBootstrap has been deprecated
[warn] if (bootstrap != null && bootstrap.childGroup() != null) {
[warn]   ^
[warn] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/server/TransportServer.java:155:
 warning: [deprecation] childGroup() in ServerBootstrap has been deprecated
[warn]   bootstrap.childGroup().shutdownGracefully();
[warn]^

[warn] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/util/NettyUtils.java:112:
 warning: [deprecation] 
PooledByteBufAllocator(boolean,int,int,int,int,int,int,int) in 
PooledByteBufAllocator has been deprecated
[warn] return new PooledByteBufAllocator(
[warn]^

[warn] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportClient.java:321:
 warning: [rawtypes] found raw type: Future
[warn] public void operationComplete(Future future) throws Exception {
[warn]   ^
[warn]   missing type arguments for generic class Future
[warn]   where V is a type-variable:
[warn] V extends Object declared in interface Future
[warn] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java:215:
 warning: [rawtypes] found raw type: StreamInterceptor
[warn]   StreamInterceptor interceptor = new StreamInterceptor(this, 
resp.streamId, resp.byteCount,
[warn]   ^
[warn]   missing type arguments for generic class StreamInterceptor
[warn]   where T is a type-variable:
[warn] T extends Message declared in class StreamInterceptor
[warn] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java:215:
 warning: [rawtypes] found raw type: StreamInterceptor
[warn]   StreamInterceptor interceptor = new StreamInterceptor(this, 
resp.streamId, resp.byteCount,
[warn]   ^
[warn]   missing type arguments for generic class StreamInterceptor
[warn]   where T is a type-variable:
[warn] T extends Message declared in class StreamInterceptor
[warn] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java:215:
 warning: [unchecked] unchecked call to 
StreamInterceptor(MessageHandler,String,long,StreamCallback) as a member of 
the raw type StreamInterceptor
[warn]   StreamInterceptor interceptor = new StreamInterceptor(this, 
resp.streamId, resp.byteCount,
[warn]   ^
[warn]   where T is a type-variable:
[warn] T extends 

[jira] [Updated] (SPARK-24940) Coalesce Hint for SQL Queries

2018-08-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24940:

Target Version/s: 2.4.0

> Coalesce Hint for SQL Queries
> -
>
> Key: SPARK-24940
> URL: https://issues.apache.org/jira/browse/SPARK-24940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: John Zhuge
>Priority: Major
>
> Many Spark SQL users in my company have asked for a way to control the number 
> of output files in Spark SQL. The users prefer not to use the functions 
> repartition\(n\) or coalesce(n, shuffle), which require them to write and 
> deploy Scala/Java/Python code.
>   
>  There are use cases for both reducing and increasing the number.
>   
>  The DataFrame API has had repartition/coalesce for a long time. However, we do 
> not have equivalent functionality in SQL queries. We propose adding the 
> following Hive-style Coalesce hint to Spark SQL.
> {noformat}
> /*+ COALESCE(n, shuffle) */
> /*+ REPARTITION(n) */
> {noformat}
> REPARTITION\(n\) is equal to COALESCE(n, shuffle=true).
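
As a rough usage sketch, assuming the proposed hint syntax above; the table name 
"events" and the numbers are illustrative placeholders:

{code}
// Sketch only: hints embedded in the query text, following the proposal above.
spark.sql("SELECT /*+ COALESCE(5, false) */ * FROM events")
spark.sql("SELECT /*+ REPARTITION(200) */ * FROM events")
{code}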



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2018-08-02 Thread Kildiev Rustam (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567062#comment-16567062
 ] 

Kildiev Rustam commented on SPARK-14220:


Hurra

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24924) Add mapping for built-in Avro data source

2018-08-02 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567045#comment-16567045
 ] 

Thomas Graves commented on SPARK-24924:
---

Why are we doing this? If a user ships the Databricks spark-avro jar and 
references the com.databricks.spark.avro class, why do we want to map that to 
our built-in version, which might be different?

> Add mapping for built-in Avro data source
> -
>
> Key: SPARK-24924
> URL: https://issues.apache.org/jira/browse/SPARK-24924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.0
>
>
> This issue aims to do the following.
>  # Like the `com.databricks.spark.csv` mapping, we should map 
> `com.databricks.spark.avro` to the built-in Avro data source.
>  # Remove the incorrect error message, `Please find an Avro package at ...`.
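
As a hedged sketch of what the mapping means in practice (assuming the built-in Avro 
module is on the classpath and the mapping is in place; the path is a placeholder):

{code}
// Both format names are expected to resolve to the same built-in Avro data source.
val byShortName  = spark.read.format("avro").load("/tmp/events.avro")
val byLegacyName = spark.read.format("com.databricks.spark.avro").load("/tmp/events.avro")
{code}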



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24821) Fail fast when submitted job compute on a subset of all the partitions for a barrier stage

2018-08-02 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-24821.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21927
[https://github.com/apache/spark/pull/21927]

> Fail fast when submitted job compute on a subset of all the partitions for a 
> barrier stage
> --
>
> Key: SPARK-24821
> URL: https://issues.apache.org/jira/browse/SPARK-24821
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> Detect when SparkContext.runJob() launches a barrier stage with a subset of all the 
> partitions; one example is the `first()` operation.
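
For intuition, a minimal sketch of the scenario (illustrative data, partition count, 
and local master; not the patch itself):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// first() submits a job over only a subset of partitions, which a barrier stage
// cannot support, so the job submission is expected to fail fast instead of hanging.
val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("barrier-first"))
val barrierRdd = sc.parallelize(1 to 100, 4).barrier().mapPartitions(iter => iter)
try barrierRdd.first() finally sc.stop()  // expected to fail fast
{code}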



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24821) Fail fast when submitted job compute on a subset of all the partitions for a barrier stage

2018-08-02 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24821:
-

Assignee: Jiang Xingbo

> Fail fast when submitted job compute on a subset of all the partitions for a 
> barrier stage
> --
>
> Key: SPARK-24821
> URL: https://issues.apache.org/jira/browse/SPARK-24821
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> Detect when SparkContext.runJob() launches a barrier stage with a subset of all the 
> partitions; one example is the `first()` operation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24820) Fail fast when submitted job contains PartitionPruningRDD in a barrier stage

2018-08-02 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24820:
-

Assignee: Jiang Xingbo

> Fail fast when submitted job contains PartitionPruningRDD in a barrier stage
> 
>
> Key: SPARK-24820
> URL: https://issues.apache.org/jira/browse/SPARK-24820
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> Detect when SparkContext.runJob() launches a barrier stage that includes a 
> PartitionPruningRDD.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24820) Fail fast when submitted job contains PartitionPruningRDD in a barrier stage

2018-08-02 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-24820.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21927
[https://github.com/apache/spark/pull/21927]

> Fail fast when submitted job contains PartitionPruningRDD in a barrier stage
> 
>
> Key: SPARK-24820
> URL: https://issues.apache.org/jira/browse/SPARK-24820
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> Detect when SparkContext.runJob() launches a barrier stage that includes a 
> PartitionPruningRDD.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24598) SPARK SQL:Datatype overflow conditions gives incorrect result

2018-08-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24598.
-
   Resolution: Done
 Assignee: Marco Gaido
Fix Version/s: 2.4.0

> SPARK SQL:Datatype overflow conditions gives incorrect result
> -
>
> Key: SPARK-24598
> URL: https://issues.apache.org/jira/browse/SPARK-24598
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: navya
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> Execute an SQL query that results in an overflow condition. 
> EX - SELECT 9223372036854775807 + 1 result = -9223372036854776000
>  
> Expected result - An error should be thrown, as in MySQL. 
> mysql> SELECT 9223372036854775807 + 1;
> ERROR 1690 (22003): BIGINT value is out of range in '(9223372036854775807 + 
> 1)'
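
For reference, the wraparound itself is ordinary 64-bit integer behavior, which a 
plain Scala snippet shows (this only illustrates the arithmetic, not the Spark SQL 
code path):

{code}
// Long arithmetic silently wraps around on overflow instead of raising an error.
val wrapped = Long.MaxValue + 1L
assert(wrapped == Long.MinValue)  // -9223372036854775808
{code}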



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-02 Thread Rob Vesse (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567017#comment-16567017
 ] 

Rob Vesse commented on SPARK-24434:
---

[~skonto] Added a couple more comments based on some issues I've run into 
during ongoing development

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0

2018-08-02 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567015#comment-16567015
 ] 

Sean Owen commented on SPARK-18057:
---

[~zsxwing] hm, it seems weird that Spark is then using two incompatible 
versions of Kafka at once. An app that used Spark Streaming and SQL with Kafka 
wouldn't work, it seems? I also think it's probably pretty easy to update the 
tests – I got it maybe 80% of the way there already, just not sure how to use 
the internal Log API.

> Update structured streaming kafka from 0.10.0.1 to 2.0.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Assignee: Ted Yu
>Priority: Blocker
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0

2018-08-02 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567010#comment-16567010
 ] 

Shixiong Zhu commented on SPARK-18057:
--

[~srowen] Could you create a new ticket for DStreams Kafka? IMO, they are two 
different modules and don't have to be upgraded to the same version.

> Update structured streaming kafka from 0.10.0.1 to 2.0.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Assignee: Ted Yu
>Priority: Blocker
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2

2018-08-02 Thread Joseph Fourny (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566981#comment-16566981
 ] 

Joseph Fourny edited comment on SPARK-24826 at 8/2/18 3:57 PM:
---

I was able to reproduce this defect with an inner-join of two temp views that 
refer to equivalent local relations. I started by creating 2 datasets (in Java) 
from a List of GenericRow and registered them as separate views. As far as the 
optimizer is concerned, the contents of the local relations are the same. If 
you update one of the datasets to make them distinct, then the assertion is no 
longer triggered. Note: I have to force a SortMergeJoin to trigger the issue in 
ExchangeCoordinator. 


was (Author: josephfourny):
I was able to reproduce this defect with an inner-join of two temp views that 
refer to equivalent local relations. I started by creating 2 datasets (in Java) 
from a List of GenericRow and registered them as separate views. As far as the 
optimizer is concerned, the contents of the local relations are the same. Note: 
I have to force a SortMergeJoin to trigger the issue in ExchangeCoordinator. 

> Self-Join not working in Apache Spark 2.2.2
> ---
>
> Key: SPARK-24826
> URL: https://issues.apache.org/jira/browse/SPARK-24826
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.2
>Reporter: Michail Giannakopoulos
>Priority: Major
> Attachments: 
> part-0-48210471-3088-4cee-8670-a332444bae66-c000.gz.parquet
>
>
> Running a self-join against a table derived from a parquet file with many 
> columns fails during the planning phase with the following stack-trace:
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
>  Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), 
> coordinator[target post-shuffle partition size: 67108864]
>  +- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, 
> funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, 
> emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, 
> verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, 
> desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields|#0L, id#1L, 
> member_id#2L, loan_amnt#3L, funded_amnt#4L, funded_amnt_inv#5L, term#6, 
> int_rate#7, installment#8, grade#9, sub_grade#10, emp_title#11, 
> emp_length#12, home_ownership#13, annual_inc#14, verification_status#15, 
> issue_d#16, loan_status#17, pymnt_plan#18, url#19, desc_#20, purpose#21, 
> title#22, zip_code#23, ... 92 more fields]
>  +- Filter isnotnull(_row_id#0L)
>  +- FileScan parquet 
> [_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,...
>  92 more 
> fields|#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,...
>  92 more fields] Batched: false, Format: Parquet, Location: 
> InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: 
> struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_...
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> 

[jira] [Commented] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2

2018-08-02 Thread Joseph Fourny (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566981#comment-16566981
 ] 

Joseph Fourny commented on SPARK-24826:
---

I was able to reproduce this defect with an inner-join of two temp views that 
refer to equivalent local relations. I started by creating 2 datasets (in Java) 
from a List of GenericRow and registered them as separate views. As far as the 
optimizer is concerned, the contents of the local relations are the same. Note: 
I have to force a SortMergeJoin to trigger the issue in ExchangeCoordinator. 
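
A minimal Scala sketch of that reproduction, with illustrative names and values (the 
original was written in Java; adaptive execution is enabled so the ExchangeCoordinator 
is used, and broadcast joins are disabled to force a SortMergeJoin):

{code}
import java.util.Arrays
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val spark = SparkSession.builder()
  .master("local[2]")
  .config("spark.sql.adaptive.enabled", "true")          // use the ExchangeCoordinator
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")  // force a SortMergeJoin
  .getOrCreate()

val schema = StructType(Seq(StructField("id", LongType)))
val rows = Arrays.asList(Row(1L), Row(2L), Row(3L))

// Two temp views over equivalent local relations.
spark.createDataFrame(rows, schema).createOrReplaceTempView("left_view")
spark.createDataFrame(rows, schema).createOrReplaceTempView("right_view")

// With identical relations on both sides, this inner join reportedly hits the assertion.
spark.sql("SELECT * FROM left_view l JOIN right_view r ON l.id = r.id").collect()
{code}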

> Self-Join not working in Apache Spark 2.2.2
> ---
>
> Key: SPARK-24826
> URL: https://issues.apache.org/jira/browse/SPARK-24826
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.2
>Reporter: Michail Giannakopoulos
>Priority: Major
> Attachments: 
> part-0-48210471-3088-4cee-8670-a332444bae66-c000.gz.parquet
>
>
> Running a self-join against a table derived from a parquet file with many 
> columns fails during the planning phase with the following stack-trace:
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
>  Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), 
> coordinator[target post-shuffle partition size: 67108864]
>  +- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, 
> funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, 
> emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, 
> verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, 
> desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields|#0L, id#1L, 
> member_id#2L, loan_amnt#3L, funded_amnt#4L, funded_amnt_inv#5L, term#6, 
> int_rate#7, installment#8, grade#9, sub_grade#10, emp_title#11, 
> emp_length#12, home_ownership#13, annual_inc#14, verification_status#15, 
> issue_d#16, loan_status#17, pymnt_plan#18, url#19, desc_#20, purpose#21, 
> title#22, zip_code#23, ... 92 more fields]
>  +- Filter isnotnull(_row_id#0L)
>  +- FileScan parquet 
> [_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,...
>  92 more 
> fields|#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,...
>  92 more fields] Batched: false, Format: Parquet, Location: 
> InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: 
> struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_...
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:141)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> 

[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-08-02 Thread Jackey Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566954#comment-16566954
 ] 

Jackey Lee commented on SPARK-24630:


[~uncleGen] Are you willing to assist in the code review? I can submit some of 
the implementation.

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface, with which users 
> with little knowledge about streaming can easily develop a stream 
> processing model. In Spark, we can also support a SQL API based on 
> Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1. Analysis should be able to parse streaming-type SQL. 
> 2. The Analyzer should be able to map metadata information to the corresponding 
> Relation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14540) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner

2018-08-02 Thread Simeon H.K. Fitch (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566948#comment-16566948
 ] 

Simeon H.K. Fitch commented on SPARK-14540:
---

Congratulations! A long, difficult haul... Cheers all around!

> Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
> 
>
> Key: SPARK-14540
> URL: https://issues.apache.org/jira/browse/SPARK-14540
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Stavros Kontopoulos
>Priority: Major
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> Using https://github.com/JoshRosen/spark/tree/build-for-2.12, I tried running 
> ClosureCleanerSuite with Scala 2.12 and ran into two bad test failures:
> {code}
> [info] - toplevel return statements in closures are identified at cleaning 
> time *** FAILED *** (32 milliseconds)
> [info]   Expected exception 
> org.apache.spark.util.ReturnStatementInClosureException to be thrown, but no 
> exception was thrown. (ClosureCleanerSuite.scala:57)
> {code}
> and
> {code}
> [info] - user provided closures are actually cleaned *** FAILED *** (56 
> milliseconds)
> [info]   Expected ReturnStatementInClosureException, but got 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task not 
> serializable: java.io.NotSerializableException: java.lang.Object
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class 
> org.apache.spark.util.TestUserClosuresActuallyCleaned$, 
> functionalInterfaceMethod=scala/runtime/java8/JFunction1$mcII$sp.apply$mcII$sp:(I)I,
>  implementation=invokeStatic 
> org/apache/spark/util/TestUserClosuresActuallyCleaned$.org$apache$spark$util$TestUserClosuresActuallyCleaned$$$anonfun$69:(Ljava/lang/Object;I)I,
>  instantiatedMethodType=(I)I, numCaptured=1])
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, 
> functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> org/apache/spark/rdd/RDD.org$apache$spark$rdd$RDD$$$anonfun$20$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  
> instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  numCaptured=1])
> [info]- field (class "org.apache.spark.rdd.MapPartitionsRDD", name: 
> "f", type: "interface scala.Function3")
> [info]- object (class "org.apache.spark.rdd.MapPartitionsRDD", 
> MapPartitionsRDD[2] at apply at Transformer.scala:22)
> [info]- field (class "scala.Tuple2", name: "_1", type: "class 
> java.lang.Object")
> [info]- root object (class "scala.Tuple2", (MapPartitionsRDD[2] at 
> apply at 
> Transformer.scala:22,org.apache.spark.SparkContext$$Lambda$957/431842435@6e803685)).
> [info]   This means the closure provided by user is not actually cleaned. 
> (ClosureCleanerSuite.scala:78)
> {code}
> We'll need to figure out a closure cleaning strategy which works for 2.12 
> lambdas.
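
For the first failure above, a minimal sketch of the kind of closure involved 
(illustrative data and app name; ClosureCleaner is expected to reject the top-level 
return at cleaning time):

{code}
import org.apache.spark.{SparkConf, SparkContext, SparkException}

object ReturnInClosure {
  // A closure containing a top-level return; cleaning should fail with
  // ReturnStatementInClosureException (a subclass of SparkException).
  def badCount(sc: SparkContext): Long =
    sc.parallelize(1 to 10).map { x =>
      if (x > 5) return 0L  // top-level return captured by the closure
      x
    }.count()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("return-in-closure"))
    try badCount(sc)
    catch { case e: SparkException => println(s"rejected: ${e.getMessage}") }
    finally sc.stop()
  }
}
{code}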



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24980) add support for pandas/arrow etc for python2.7 and pypy builds

2018-08-02 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566949#comment-16566949
 ] 

shane knapp commented on SPARK-24980:
-

looking pretty good (you have to go to the very bottom of the 16M build log to 
get to the output) : 

[https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/896/consoleFull]

summary:
 * py27 and py34 pandas and pyarrow tests run
 * pypy doesn't have any additional libs installed, plus the version we're 
running is pretty old – 2.5.1. the following tests were skipped:  pandas, 
pyarrow, numpy, scipy
 * i will need to bump the unittest py2.7 lib to >= 3.3 to support mocks
 * we should really think about bumping python3 from 3.4 to 3.5 at some point 
in the not-too-distant-future

 

> add support for pandas/arrow etc for python2.7 and pypy builds
> --
>
> Key: SPARK-24980
> URL: https://issues.apache.org/jira/browse/SPARK-24980
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> since we have full support for python3.4 via anaconda, it's time to create 
> similar environments for 2.7 and pypy 2.5.1.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2018-08-02 Thread Simeon H.K. Fitch (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566945#comment-16566945
 ] 

Simeon H.K. Fitch commented on SPARK-14220:
---

(flag)(*)(*r)(*g)(*b)(*y):D(*y)(*b)(*g)(*r)(*)(flag)

Way to go! This is amazing.

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14220) Build and test Spark against Scala 2.12

2018-08-02 Thread Simeon H.K. Fitch (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon H.K. Fitch updated SPARK-14220:
--
Comment: was deleted

(was: (flag)(*)(*r)(*g)(*b)(*y):D(*y)(*b)(*g)(*r)(*)(flag)

Way to go! This is amazing.)

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2018-08-02 Thread Simeon H.K. Fitch (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566942#comment-16566942
 ] 

Simeon H.K. Fitch commented on SPARK-14220:
---

(flag)(*)(*r)(*g)(*b)(*y):D(*y)(*b)(*g)(*r)(*)(flag)

Way to go! This is amazing.

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24795) Implement barrier execution mode

2018-08-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566935#comment-16566935
 ] 

Apache Spark commented on SPARK-24795:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/21972

> Implement barrier execution mode
> 
>
> Key: SPARK-24795
> URL: https://issues.apache.org/jira/browse/SPARK-24795
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> Implement barrier execution mode, as described in SPARK-24582
> Include all the API changes and basic implementation (except for 
> BarrierTaskContext.barrier())



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24947) aggregateAsync and foldAsync for RDD

2018-08-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24947:


Assignee: (was: Apache Spark)

> aggregateAsync and foldAsync for RDD
> 
>
> Key: SPARK-24947
> URL: https://issues.apache.org/jira/browse/SPARK-24947
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Cody Allen
>Priority: Minor
>
> {{AsyncRDDActions}} contains {{collectAsync}}, {{countAsync}}, 
> {{foreachAsync}}, etc; but it doesn't provide general mechanisms for reducing 
> datasets asynchronously. If I want to aggregate some statistics on a large 
> dataset and it's going to take an hour, I shouldn't need to completely block 
> a thread for the hour to wait for the result.
>  
> I propose the following methods be added to {{AsyncRDDActions}}:
>  
> {{def aggregateAsync[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => 
> U): FutureAction[U]}}
> {{def foldAsync(zeroValue: T)(op: (T, T) => T): FutureAction[T]}}
>  
> Locally I have a version of {{aggregateAsync}} implemented based on 
> {{submitJob}} (similar to how {{countAsync}} is implemented), and a 
> {{foldAsync}} implementation that simply delegates through to 
> {{aggregateAsync}}. I haven't yet written unit tests for these, but I can do 
> so if this is a contribution that would be accepted. Please let me know.
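
For intuition on why foldAsync can simply delegate to aggregateAsync, the existing 
synchronous API already has this relationship: fold is aggregate with the same 
function used as both seqOp and combOp (a minimal sketch with illustrative data and 
a local master, not the proposed patch):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("fold-vs-aggregate"))
val rdd = sc.parallelize(1 to 100, 4)

// fold is aggregate with seqOp == combOp, which is the delegation the proposal describes.
val folded = rdd.fold(0)(_ + _)
val aggregated = rdd.aggregate(0)(_ + _, _ + _)
assert(folded == aggregated)  // both 5050

sc.stop()
{code}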



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24947) aggregateAsync and foldAsync for RDD

2018-08-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24947:


Assignee: Apache Spark

> aggregateAsync and foldAsync for RDD
> 
>
> Key: SPARK-24947
> URL: https://issues.apache.org/jira/browse/SPARK-24947
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Cody Allen
>Assignee: Apache Spark
>Priority: Minor
>
> {{AsyncRDDActions}} contains {{collectAsync}}, {{countAsync}}, 
> {{foreachAsync}}, etc; but it doesn't provide general mechanisms for reducing 
> datasets asynchronously. If I want to aggregate some statistics on a large 
> dataset and it's going to take an hour, I shouldn't need to completely block 
> a thread for the hour to wait for the result.
>  
> I propose the following methods be added to {{AsyncRDDActions}}:
>  
> {{def aggregateAsync[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => 
> U): FutureAction[U]}}
> {{def foldAsync(zeroValue: T)(op: (T, T) => T): FutureAction[T]}}
>  
> Locally I have a version of {{aggregateAsync}} implemented based on 
> {{submitJob}} (similar to how {{countAsync}} is implemented), and a 
> {{foldAsync}} implementation that simply delegates through to 
> {{aggregateAsync}}. I haven't yet written unit tests for these, but I can do 
> so if this is a contribution that would be accepted. Please let me know.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24947) aggregateAsync and foldAsync for RDD

2018-08-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566867#comment-16566867
 ] 

Apache Spark commented on SPARK-24947:
--

User 'ceedubs' has created a pull request for this issue:
https://github.com/apache/spark/pull/21971

> aggregateAsync and foldAsync for RDD
> 
>
> Key: SPARK-24947
> URL: https://issues.apache.org/jira/browse/SPARK-24947
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Cody Allen
>Priority: Minor
>
> {{AsyncRDDActions}} contains {{collectAsync}}, {{countAsync}}, 
> {{foreachAsync}}, etc; but it doesn't provide general mechanisms for reducing 
> datasets asynchronously. If I want to aggregate some statistics on a large 
> dataset and it's going to take an hour, I shouldn't need to completely block 
> a thread for the hour to wait for the result.
>  
> I propose the following methods be added to {{AsyncRDDActions}}:
>  
> {{def aggregateAsync[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => 
> U): FutureAction[U]}}
> {{def foldAsync(zeroValue: T)(op: (T, T) => T): FutureAction[T]}}
>  
> Locally I have a version of {{aggregateAsync}} implemented based on 
> {{submitJob}} (similar to how {{countAsync}} is implemented), and a 
> {{foldAsync}} implementation that simply delegates through to 
> {{aggregateAsync}}. I haven't yet written unit tests for these, but I can do 
> so if this is a contribution that would be accepted. Please let me know.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


