[jira] [Commented] (SPARK-18053) ARRAY equality is broken in Spark 2.0

2016-10-22 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15598986#comment-15598986
 ] 

Wenchen Fan commented on SPARK-18053:
-

[~lian cheng] Are you sure this issue exists in 2.0? The new array format was 
only merged into the master branch (2.1).

> ARRAY equality is broken in Spark 2.0
> -
>
> Key: SPARK-18053
> URL: https://issues.apache.org/jira/browse/SPARK-18053
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>  Labels: correctness
>
> The following Spark shell reproduces this issue:
> {code}
> case class Test(a: Seq[Int])
> Seq(Test(Seq(1))).toDF().createOrReplaceTempView("t")
> sql("SELECT a FROM t WHERE a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // +---+
> sql("SELECT a FROM (SELECT array(1) AS a) x WHERE x.a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // |[1]|
> // +---+
> {code}






[jira] [Updated] (SPARK-18035) Introduce performant and memory efficient APIs to create ArrayBasedMapData

2016-10-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18035:

Assignee: Tejas Patil

> Introduce performant and memory efficient APIs to create ArrayBasedMapData
> --
>
> Key: SPARK-18035
> URL: https://issues.apache.org/jira/browse/SPARK-18035
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.1.0
>
>
> In HiveInspectors, I saw that converting a Java map to Spark's 
> `ArrayBasedMapData` spent quite some time in buffer copying: 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658
> The reason is that `map.toSeq` allocates a new buffer and copies the map 
> entries into it: 
> https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323
> This copy is not needed, as we discard it once we extract the key and value 
> arrays.
> Here is the call trace:
> {noformat}
> org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664)
> scala.collection.AbstractMap.toSeq(Map.scala:59)
> scala.collection.MapLike$class.toSeq(MapLike.scala:323)
> scala.collection.AbstractMap.toBuffer(Map.scala:59)
> scala.collection.MapLike$class.toBuffer(MapLike.scala:326)
> scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104)
> scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275)
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> scala.collection.Iterator$class.foreach(Iterator.scala:893)
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
> {noformat}
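
As a rough illustration of the copy being avoided (a minimal sketch, not the 
actual patch; ArrayBasedMapData and GenericArrayData are Spark-internal classes, 
and the keys and values are assumed to already be Catalyst values), the key and 
value arrays can be filled in a single pass over the Java map's entries instead 
of going through `map.toSeq`:
{code}
import java.util.{Map => JMap}

import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, GenericArrayData}

// Single pass over the entry set: no intermediate Seq of (key, value) pairs.
def toMapData(javaMap: JMap[AnyRef, AnyRef]): ArrayBasedMapData = {
  val size = javaMap.size()
  val keys = new Array[Any](size)
  val values = new Array[Any](size)
  val it = javaMap.entrySet().iterator()
  var i = 0
  while (it.hasNext) {
    val entry = it.next()
    keys(i) = entry.getKey
    values(i) = entry.getValue
    i += 1
  }
  new ArrayBasedMapData(new GenericArrayData(keys), new GenericArrayData(values))
}
{code}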






[jira] [Resolved] (SPARK-18035) Introduce performant and memory efficient APIs to create ArrayBasedMapData

2016-10-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-18035.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15573
[https://github.com/apache/spark/pull/15573]

> Introduce performant and memory efficient APIs to create ArrayBasedMapData
> --
>
> Key: SPARK-18035
> URL: https://issues.apache.org/jira/browse/SPARK-18035
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Tejas Patil
>Priority: Minor
> Fix For: 2.1.0
>
>
> In HiveInspectors, I saw that converting a Java map to Spark's 
> `ArrayBasedMapData` spent quite some time in buffer copying: 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658
> The reason is that `map.toSeq` allocates a new buffer and copies the map 
> entries into it: 
> https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323
> This copy is not needed, as we discard it once we extract the key and value 
> arrays.
> Here is the call trace:
> {noformat}
> org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664)
> scala.collection.AbstractMap.toSeq(Map.scala:59)
> scala.collection.MapLike$class.toSeq(MapLike.scala:323)
> scala.collection.AbstractMap.toBuffer(Map.scala:59)
> scala.collection.MapLike$class.toBuffer(MapLike.scala:326)
> scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104)
> scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275)
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> scala.collection.Iterator$class.foreach(Iterator.scala:893)
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
> {noformat}






[jira] [Updated] (SPARK-18035) Introduce performant and memory efficient APIs to create ArrayBasedMapData

2016-10-22 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-18035:

Summary: Introduce performant and memory efficient APIs to create 
ArrayBasedMapData  (was: Unwrapping java maps in HiveInspectors allocates 
unnecessary buffer)

> Introduce performant and memory efficient APIs to create ArrayBasedMapData
> --
>
> Key: SPARK-18035
> URL: https://issues.apache.org/jira/browse/SPARK-18035
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Tejas Patil
>Priority: Minor
>
> In HiveInspectors, I saw that converting a Java map to Spark's 
> `ArrayBasedMapData` spent quite some time in buffer copying: 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658
> The reason is that `map.toSeq` allocates a new buffer and copies the map 
> entries into it: 
> https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323
> This copy is not needed, as we discard it once we extract the key and value 
> arrays.
> Here is the call trace:
> {noformat}
> org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664)
> scala.collection.AbstractMap.toSeq(Map.scala:59)
> scala.collection.MapLike$class.toSeq(MapLike.scala:323)
> scala.collection.AbstractMap.toBuffer(Map.scala:59)
> scala.collection.MapLike$class.toBuffer(MapLike.scala:326)
> scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104)
> scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275)
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> scala.collection.Iterator$class.foreach(Iterator.scala:893)
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
> {noformat}






[jira] [Updated] (SPARK-17698) Join predicates should not contain filter clauses

2016-10-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17698:

Fix Version/s: 2.0.2

> Join predicates should not contain filter clauses
> -
>
> Key: SPARK-17698
> URL: https://issues.apache.org/jira/browse/SPARK-17698
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> `ExtractEquiJoinKeys` is incorrectly using filter predicates as the join 
> condition. While this does not lead to incorrect results, in the case of 
> bucketed and sorted tables we might miss out on avoiding an unnecessary 
> shuffle + sort, e.g.:
> {code}
> val df = (1 until 10).toDF("id").coalesce(1)
> hc.sql("DROP TABLE IF EXISTS table1").collect
> df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table1")
> hc.sql("DROP TABLE IF EXISTS table2").collect
> df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table2")
> sqlContext.sql("""
>   SELECT a.id, b.id
>   FROM table1 a
>   FULL OUTER JOIN table2 b
>   ON a.id = b.id AND a.id='1' AND b.id='1'
> """).explain(true)
> {code}
> This is doing a shuffle + sort over the table scan outputs, which is not needed 
> as both tables are bucketed and sorted on the same columns and have the same 
> number of buckets. This should be a single-stage job.
> {code}
> SortMergeJoin [id#38, cast(id#38 as double), 1.0], [id#39, 1.0, cast(id#39 as 
> double)], FullOuter
> :- *Sort [id#38 ASC NULLS FIRST, cast(id#38 as double) ASC NULLS FIRST, 1.0 
> ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(id#38, cast(id#38 as double), 1.0, 200)
> : +- *FileScan parquet default.table1[id#38] Batched: true, Format: 
> ParquetFormat, InputPaths: file:spark-warehouse/table1, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> +- *Sort [id#39 ASC NULLS FIRST, 1.0 ASC NULLS FIRST, cast(id#39 as double) 
> ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(id#39, 1.0, cast(id#39 as double), 200)
>   +- *FileScan parquet default.table2[id#39] Batched: true, Format: 
> ParquetFormat, InputPaths: file:spark-warehouse/table2, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}






[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-22 Thread Aral Can Kaymaz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15598448#comment-15598448
 ] 

Aral Can Kaymaz commented on SPARK-16845:
-

I am currently out of office, and will be back on Tuesday, 1st of November, 
2016 (01.11.2016).

Kind regards,
Aral Can Kaymaz



> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the following fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)






[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-22 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15598442#comment-15598442
 ] 

Don Drake commented on SPARK-16845:
---

Update: 

It turns out that I am still getting this exception. I'll try to create a test 
case to duplicate it. Basically, I'm exploding a nested data structure, then 
doing a union, and then saving to Parquet. The resulting table has over 400 
columns.

I verified in spark-shell the exceptions do not occur with the test cases 
provided.

Can you point me to your other solution? I can see if that works.
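
Until that test case exists, here is a minimal sketch of the kind of wide-schema 
union + sort that can push the generated SpecificOrdering method past the 64 KB 
limit (column names and the output path are hypothetical, and whether it actually 
reproduces depends on the exact Spark build):
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("wide-codegen-sketch").getOrCreate()
import spark.implicits._

// Build two DataFrames with ~400 columns, union them, and sort on every column
// so ordering code is generated for the full width of the schema.
val base = Seq(1, 2, 3).toDF("c0")
val wide1 = (1 until 400).foldLeft(base)((df, i) => df.withColumn(s"c$i", lit(i)))
val wide2 = (1 until 400).foldLeft(base)((df, i) => df.withColumn(s"c$i", lit(i * 2)))

val unioned = wide1.union(wide2)
val sorted = unioned.sort(unioned.columns.map(unioned(_)): _*)
sorted.write.mode("overwrite").parquet("/tmp/wide_codegen_sketch")
{code}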

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the following fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)






[jira] [Commented] (SPARK-17698) Join predicates should not contain filter clauses

2016-10-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15598437#comment-15598437
 ] 

Apache Spark commented on SPARK-17698:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/15600

> Join predicates should not contain filter clauses
> -
>
> Key: SPARK-17698
> URL: https://issues.apache.org/jira/browse/SPARK-17698
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.1.0
>
>
> `ExtractEquiJoinKeys` is incorrectly using filter predicates as the join 
> condition. While this does not lead to incorrect results, in the case of 
> bucketed and sorted tables we might miss out on avoiding an unnecessary 
> shuffle + sort, e.g.:
> {code}
> val df = (1 until 10).toDF("id").coalesce(1)
> hc.sql("DROP TABLE IF EXISTS table1").collect
> df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table1")
> hc.sql("DROP TABLE IF EXISTS table2").collect
> df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table2")
> sqlContext.sql("""
>   SELECT a.id, b.id
>   FROM table1 a
>   FULL OUTER JOIN table2 b
>   ON a.id = b.id AND a.id='1' AND b.id='1'
> """).explain(true)
> {code}
> This is doing a shuffle + sort over the table scan outputs, which is not needed 
> as both tables are bucketed and sorted on the same columns and have the same 
> number of buckets. This should be a single-stage job.
> {code}
> SortMergeJoin [id#38, cast(id#38 as double), 1.0], [id#39, 1.0, cast(id#39 as 
> double)], FullOuter
> :- *Sort [id#38 ASC NULLS FIRST, cast(id#38 as double) ASC NULLS FIRST, 1.0 
> ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(id#38, cast(id#38 as double), 1.0, 200)
> : +- *FileScan parquet default.table1[id#38] Batched: true, Format: 
> ParquetFormat, InputPaths: file:spark-warehouse/table1, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> +- *Sort [id#39 ASC NULLS FIRST, 1.0 ASC NULLS FIRST, cast(id#39 as double) 
> ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(id#39, 1.0, cast(id#39 as double), 200)
>   +- *FileScan parquet default.table2[id#39] Batched: true, Format: 
> ParquetFormat, InputPaths: file:spark-warehouse/table2, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}






[jira] [Updated] (SPARK-18038) Move output partitioning definition from UnaryNodeExec to its children

2016-10-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18038:

Target Version/s: 2.1.0

> Move output partitioning definition from UnaryNodeExec to its children
> --
>
> Key: SPARK-18038
> URL: https://issues.apache.org/jira/browse/SPARK-18038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Tejas Patil
>Priority: Trivial
>
> This was a suggestion by [~rxin] over one of the dev list discussion : 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Project-not-preserving-child-partitioning-td19417.html
> {noformat}
> I think this is very risky because preserving output partitioning should not 
> be a property of UnaryNodeExec (e.g. exchange).
> It would be better (safer) to move the output partitioning definition into 
> each of the operator and remove it from UnaryExecNode.
> {noformat}
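
For readers unfamiliar with the suggestion, a hedged sketch of the direction 
(a hypothetical operator, not actual Spark source): remove the blanket 
`outputPartitioning = child.outputPartitioning` default from UnaryExecNode and 
have each operator that genuinely preserves its child's partitioning declare it 
explicitly.
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.physical.Partitioning
import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}

// Hypothetical projection-like operator: it opts in to preserving the child's
// partitioning explicitly, while exchange-like operators simply would not.
case class PreservesPartitioningExec(child: SparkPlan) extends UnaryExecNode {
  override def output: Seq[Attribute] = child.output
  override def outputPartitioning: Partitioning = child.outputPartitioning
  override protected def doExecute(): RDD[InternalRow] = child.execute()
}
{code}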






[jira] [Updated] (SPARK-18038) Move output partitioning definition from UnaryNodeExec to its children

2016-10-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18038:

Assignee: Tejas Patil

> Move output partitioning definition from UnaryNodeExec to its children
> --
>
> Key: SPARK-18038
> URL: https://issues.apache.org/jira/browse/SPARK-18038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Trivial
>
> This was a suggestion by [~rxin] over one of the dev list discussion : 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Project-not-preserving-child-partitioning-td19417.html
> {noformat}
> I think this is very risky because preserving output partitioning should not 
> be a property of UnaryNodeExec (e.g. exchange).
> It would be better (safer) to move the output partitioning definition into 
> each of the operator and remove it from UnaryExecNode.
> {noformat}






[jira] [Commented] (SPARK-18017) Changing Hadoop parameter through sparkSession.sparkContext.hadoopConfiguration doesn't work

2016-10-22 Thread Yuehua Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15598339#comment-15598339
 ] 

Yuehua Zhang commented on SPARK-18017:
--

Thanks for your input! I only want to change the parameter for one job, so I 
can't edit the Hadoop config file. For the other option, if I add it through the 
spark-submit command I get "Warning: Ignoring non-spark config property: 
fs.s3n.block.size=524288000". 
The reason I think this is related to the Spark upgrade is that this setting 
worked well on Spark 1.6 but stopped working after we upgraded to Spark 
2.0.
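
For reference, a minimal sketch of per-job alternatives (the S3 path is 
hypothetical, and the last setting is an assumption about why the block size 
stopped mattering rather than a confirmed diagnosis). spark-submit accepts 
Hadoop keys when they are prefixed with `spark.hadoop.`, e.g. 
`--conf spark.hadoop.fs.s3n.block.size=524288000`, and the configuration can 
also be set programmatically before the first read:
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("per-job-hadoop-conf").getOrCreate()

// Set the Hadoop property for this job only, before any data is read.
spark.sparkContext.hadoopConfiguration.set("fs.s3n.block.size", "524288000")

// Assumption: in 2.0 the file-based sources size their read partitions from
// spark.sql.files.maxPartitionBytes rather than the filesystem block size,
// which may explain why the old setting no longer has an effect.
spark.conf.set("spark.sql.files.maxPartitionBytes", 524288000L)

val df = spark.read.option("header", "true").csv("s3n://some-bucket/some-path/")
{code}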

> Changing Hadoop parameter through 
> sparkSession.sparkContext.hadoopConfiguration doesn't work
> 
>
> Key: SPARK-18017
> URL: https://issues.apache.org/jira/browse/SPARK-18017
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: Scala version 2.11.8; Java 1.8.0_91; 
> com.databricks:spark-csv_2.11:1.2.0
>Reporter: Yuehua Zhang
>
> My Spark job tries to read CSV files on S3. I need to control the number of 
> partitions created, so I set the Hadoop parameter fs.s3n.block.size. However, it 
> stopped working after we upgraded Spark from 1.6.1 to 2.0.0. Not sure if it is 
> related to https://issues.apache.org/jira/browse/SPARK-15991. 






[jira] [Resolved] (SPARK-928) Add support for Unsafe-based serializer in Kryo 2.22

2016-10-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-928.
---
   Resolution: Fixed
 Assignee: Sandeep Singh
Fix Version/s: 2.1.0

> Add support for Unsafe-based serializer in Kryo 2.22
> 
>
> Key: SPARK-928
> URL: https://issues.apache.org/jira/browse/SPARK-928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matei Zaharia
>Assignee: Sandeep Singh
>Priority: Minor
>  Labels: starter
> Fix For: 2.1.0
>
>
> This can reportedly be quite a bit faster, but it also requires Chill to 
> update its Kryo dependency. Once that happens we should add a 
> spark.kryo.useUnsafe flag.
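
For context, enabling it would look roughly like the sketch below; the property 
name used here is the one proposed in this ticket (`spark.kryo.useUnsafe`), and 
the name that actually shipped in 2.1.0 should be checked against the 
configuration docs.
{code}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hedged sketch: verify the exact property name against the 2.1.0 docs.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.useUnsafe", "true")

val spark = SparkSession.builder().config(conf).appName("kryo-unsafe-sketch").getOrCreate()
{code}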






[jira] [Resolved] (SPARK-18051) Custom PartitionCoalescer cause serialization exception

2016-10-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18051.
-
   Resolution: Fixed
 Assignee: Weichen Xu
Fix Version/s: 2.1.0

> Custom PartitionCoalescer cause serialization exception
> ---
>
> Key: SPARK-18051
> URL: https://issues.apache.org/jira/browse/SPARK-18051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> For example, the following code causes an exception:
> {code: title=code}
> class MyCoalescer extends PartitionCoalescer{
>   override def coalesce(maxPartitions: Int, parent: RDD[_]): 
> Array[PartitionGroup] = {
> val pglist = Array.fill(2)(new PartitionGroup())
> pglist(0).partitions.append(parent.partitions(0), parent.partitions(1), 
> parent.partitions(2))
> pglist(1).partitions.append(parent.partitions(3), parent.partitions(4), 
> parent.partitions(5))
> pglist
>   }
> }
> object Test1 {
>   def main(args: Array[String]) = {
> val spark = SparkSession.builder().appName("test").getOrCreate()
> val sc = spark.sparkContext
> val rdd = sc.parallelize(1 to 6, 6)
> rdd.coalesce(2, false, Some(new MyCoalescer)).count()
> spark.stop()
>   }
> }
> {code}
> It throws this exception:
> Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError
> at java.lang.Exception.<init>(Exception.java:102)
> 
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:230)
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:108)
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializableWithWriteObjectMethod(SerializationDebugger.scala:243)
> at 
> org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:189)
> 






[jira] [Comment Edited] (SPARK-18027) .sparkStaging not clean on RM ApplicationNotFoundException

2016-10-22 Thread David Shar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15598281#comment-15598281
 ] 

David Shar edited comment on SPARK-18027 at 10/22/16 6:36 PM:
--

I believe there is a major difference between the two exceptions above:
1. ApplicationNotFoundException means there is no such running app according to 
YARN, so it is safe to clean up.
2. NonFatal means we failed to contact YARN; we can't be sure whether the app is 
running or not, so it is not safe to clean up.

Therefore, just add cleanup for the first exception.


was (Author: davidshar):
I believe there is a major difference between the two exceptions above:
1. ApplicationNotFoundException means there is no such running app according to 
YARN, so it is safe to clean up.
2. NonFatal means we failed to contact YARN; we can't be sure whether the app is 
running or not, so it is not safe to clean up.

> .sparkStaging not clean on RM ApplicationNotFoundException
> --
>
> Key: SPARK-18027
> URL: https://issues.apache.org/jira/browse/SPARK-18027
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: David Shar
>Priority: Minor
>
> Hi,
> It seems that SPARK-7705 didn't fix all issues with .sparkStaging folder 
> cleanup.
> in Client.scala:monitorApplication 
> {code}
>  val report: ApplicationReport =
> try {
>   getApplicationReport(appId)
> } catch {
>   case e: ApplicationNotFoundException =>
> logError(s"Application $appId not found.")
> return (YarnApplicationState.KILLED, 
> FinalApplicationStatus.KILLED)
>   case NonFatal(e) =>
> logError(s"Failed to contact YARN for application $appId.", e)
> return (YarnApplicationState.FAILED, 
> FinalApplicationStatus.FAILED)
> }
> 
> if (state == YarnApplicationState.FINISHED ||
> state == YarnApplicationState.FAILED ||
> state == YarnApplicationState.KILLED) {
> cleanupStagingDir(appId)
> return (state, report.getFinalApplicationStatus)
>  }
> {code}
> In the case of ApplicationNotFoundException, we don't clean up the .sparkStaging 
> folder.
> I believe we should call cleanupStagingDir(appId) in the catch clause above.
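
A hedged sketch of the proposed change (a simplified rearrangement of the 
snippet quoted above, not a tested patch): clean up only in the 
ApplicationNotFoundException case, where YARN has positively reported that the 
application no longer exists.
{code}
val report: ApplicationReport =
  try {
    getApplicationReport(appId)
  } catch {
    case e: ApplicationNotFoundException =>
      // YARN says the app does not exist, so .sparkStaging can no longer be
      // in use: clean it up before returning.
      logError(s"Application $appId not found.")
      cleanupStagingDir(appId)
      return (YarnApplicationState.KILLED, FinalApplicationStatus.KILLED)
    case NonFatal(e) =>
      // We merely failed to reach YARN; the app may still be running, so the
      // staging directory is left alone here.
      logError(s"Failed to contact YARN for application $appId.", e)
      return (YarnApplicationState.FAILED, FinalApplicationStatus.FAILED)
  }
{code}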






[jira] [Commented] (SPARK-18027) .sparkStaging not clean on RM ApplicationNotFoundException

2016-10-22 Thread David Shar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15598281#comment-15598281
 ] 

David Shar commented on SPARK-18027:


I believe there is a major difference between the two exceptions above:
1. ApplicationNotFoundException means there is no such running app according to 
YARN, so it is safe to clean up.
2. NonFatal means we failed to contact YARN; we can't be sure whether the app is 
running or not, so it is not safe to clean up.

> .sparkStaging not clean on RM ApplicationNotFoundException
> --
>
> Key: SPARK-18027
> URL: https://issues.apache.org/jira/browse/SPARK-18027
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: David Shar
>Priority: Minor
>
> Hi,
> It seems that SPARK-7705 didn't fix all issues with .sparkStaging folder 
> cleanup.
> in Client.scala:monitorApplication 
> {code}
>  val report: ApplicationReport =
> try {
>   getApplicationReport(appId)
> } catch {
>   case e: ApplicationNotFoundException =>
> logError(s"Application $appId not found.")
> return (YarnApplicationState.KILLED, 
> FinalApplicationStatus.KILLED)
>   case NonFatal(e) =>
> logError(s"Failed to contact YARN for application $appId.", e)
> return (YarnApplicationState.FAILED, 
> FinalApplicationStatus.FAILED)
> }
> 
> if (state == YarnApplicationState.FINISHED ||
> state == YarnApplicationState.FAILED ||
> state == YarnApplicationState.KILLED) {
> cleanupStagingDir(appId)
> return (state, report.getFinalApplicationStatus)
>  }
> {code}
> In the case of ApplicationNotFoundException, we don't clean up the .sparkStaging 
> folder.
> I believe we should call cleanupStagingDir(appId) in the catch clause above.






[jira] [Resolved] (SPARK-17123) Performing set operations that combine string and date / timestamp columns may result in generated projection code which doesn't compile

2016-10-22 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17123.
---
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.1.0

> Performing set operations that combine string and date / timestamp columns 
> may result in generated projection code which doesn't compile
> 
>
> Key: SPARK-17123
> URL: https://issues.apache.org/jira/browse/SPARK-17123
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.1.0
>
>
> The following example program causes SpecificSafeProjection code generation 
> to produce Java code which doesn't compile:
> {code}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> spark.sql("set spark.sql.codegen.fallback=false")
> val dateDF = spark.createDataFrame(sc.parallelize(Seq(Row(new java.sql.Date(0)))),
>   StructType(StructField("value", DateType) :: Nil))
> val longDF = sc.parallelize(Seq(new java.sql.Date(0).toString)).toDF
> dateDF.union(longDF).collect()
> {code}
> This fails at runtime with the following error:
> {code}
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 28, Column 107: No applicable constructor/method found 
> for actual parameters "org.apache.spark.unsafe.types.UTF8String"; candidates 
> are: "public static java.sql.Date 
> org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificSafeProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificSafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private org.apache.spark.sql.types.StructType schema;
> /* 011 */
> /* 012 */
> /* 013 */   public SpecificSafeProjection(Object[] references) {
> /* 014 */ this.references = references;
> /* 015 */ mutableRow = (MutableRow) references[references.length - 1];
> /* 016 */
> /* 017 */ this.schema = (org.apache.spark.sql.types.StructType) 
> references[0];
> /* 018 */   }
> /* 019 */
> /* 020 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 021 */ InternalRow i = (InternalRow) _i;
> /* 022 */
> /* 023 */ values = new Object[1];
> /* 024 */
> /* 025 */ boolean isNull2 = i.isNullAt(0);
> /* 026 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0));
> /* 027 */ boolean isNull1 = isNull2;
> /* 028 */ final java.sql.Date value1 = isNull1 ? null : 
> org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(value2);
> /* 029 */ isNull1 = value1 == null;
> /* 030 */ if (isNull1) {
> /* 031 */   values[0] = null;
> /* 032 */ } else {
> /* 033 */   values[0] = value1;
> /* 034 */ }
> /* 035 */
> /* 036 */ final org.apache.spark.sql.Row value = new 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema(values, 
> schema);
> /* 037 */ if (false) {
> /* 038 */   mutableRow.setNullAt(0);
> /* 039 */ } else {
> /* 040 */
> /* 041 */   mutableRow.update(0, value);
> /* 042 */ }
> /* 043 */
> /* 044 */ return mutableRow;
> /* 045 */   }
> /* 046 */ }
> {code}
> Here, the invocation of {{DateTimeUtils.toJavaDate}} is incorrect because the 
> generated code tries to call it with a UTF8String while the method expects an 
> int instead.






[jira] [Updated] (SPARK-17994) Add back a file status cache for catalog tables

2016-10-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17994:

Assignee: Eric Liang

> Add back a file status cache for catalog tables
> ---
>
> Key: SPARK-17994
> URL: https://issues.apache.org/jira/browse/SPARK-17994
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.1.0
>
>
> In SPARK-16980, we removed the full in-memory cache of table partitions in 
> favor of loading only needed partitions from the metastore. This greatly 
> improves the initial latency of queries that only read a small fraction of 
> table partitions.
> However, since the metastore does not store file statistics, we need to 
> discover those from remote storage. With the loss of the in-memory file 
> status cache this has to happen on each query, increasing the latency of 
> repeated queries over the same partitions.
> The proposal is to add back a per-table cache of partition contents, i.e. 
> Map[Path, Array[FileStatus]]. This cache would be retained per-table, and can 
> be invalidated through refreshTable() and refreshByPath(). Unlike the prior 
> cache, it can be incrementally updated as new partitions are read.
> cc [~michael]
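
A rough sketch of the cache shape the proposal describes (illustration only, not 
the implementation): a per-table map from partition path to its listed file 
statuses, filled lazily and dropped wholesale on refresh.
{code}
import java.util.concurrent.ConcurrentHashMap

import org.apache.hadoop.fs.{FileStatus, Path}

// Minimal per-table cache: Map[Path, Array[FileStatus]], incrementally filled
// as partitions are listed and cleared by refreshTable()/refreshByPath().
class PartitionFileStatusCache {
  private val cache = new ConcurrentHashMap[Path, Array[FileStatus]]()

  def listOrCached(path: Path)(list: Path => Array[FileStatus]): Array[FileStatus] = {
    val cached = cache.get(path)
    if (cached != null) {
      cached
    } else {
      val listed = list(path)
      cache.put(path, listed) // racy but benign: a relist is only wasted work
      listed
    }
  }

  def invalidate(): Unit = cache.clear()
}
{code}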






[jira] [Resolved] (SPARK-17994) Add back a file status cache for catalog tables

2016-10-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17994.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15539
[https://github.com/apache/spark/pull/15539]

> Add back a file status cache for catalog tables
> ---
>
> Key: SPARK-17994
> URL: https://issues.apache.org/jira/browse/SPARK-17994
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Eric Liang
> Fix For: 2.1.0
>
>
> In SPARK-16980, we removed the full in-memory cache of table partitions in 
> favor of loading only needed partitions from the metastore. This greatly 
> improves the initial latency of queries that only read a small fraction of 
> table partitions.
> However, since the metastore does not store file statistics, we need to 
> discover those from remote storage. With the loss of the in-memory file 
> status cache this has to happen on each query, increasing the latency of 
> repeated queries over the same partitions.
> The proposal is to add back a per-table cache of partition contents, i.e. 
> Map[Path, Array[FileStatus]]. This cache would be retained per-table, and can 
> be invalidated through refreshTable() and refreshByPath(). Unlike the prior 
> cache, it can be incrementally updated as new partitions are read.
> cc [~michael]






[jira] [Commented] (SPARK-17074) generate histogram information for column

2016-10-22 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15597858#comment-15597858
 ] 

Herman van Hovell commented on SPARK-17074:
---

[~ZenWzh] I think your current approach is valid. It will take two passes, but 
that is fine for now.

I have discussed this with Tim, and we are going to see if we can come up with 
something for a single-pass algorithm. But that is going to happen sometime in 
the next week.

Please also note that we are currently doing some work on the aggregation code 
paths. This might make your effort a little easier.

> generate histogram information for column
> -
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> We support two kinds of histograms: 
> - Equi-width histogram: Each column interval in the histogram has a fixed 
> width. The height of an interval represents the frequency of the column values 
> falling into it, so the height varies across intervals. We use the equi-width 
> histogram when the number of distinct values is less than 254.
> - Equi-height histogram: Here the width of each column interval varies, while 
> the heights of all column intervals are the same. The equi-height histogram is 
> effective in handling skewed data distributions. We use the equi-height 
> histogram when the number of distinct values is equal to or greater than 254.
> We first use [SPARK-18000] and [SPARK-17881] to compute equi-width histograms 
> (for both numeric and string types) or the endpoints of equi-height histograms 
> (for numeric types only). Then, if we get the endpoints of an equi-height 
> histogram, we need to compute the ndv's between those endpoints via [SPARK-17997] 
> to form the equi-height histogram.
> This JIRA incorporates the three JIRAs mentioned above, which provide the needed 
> aggregation functions. We need to resolve them before this one.
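
To make the equi-height part concrete, a hedged sketch of how the endpoints can 
be obtained with the existing approximate-quantile support, with a naive 
per-bucket distinct count standing in for the dedicated aggregate that 
[SPARK-17997] adds (table and column names are hypothetical):
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, countDistinct}

val spark = SparkSession.builder().appName("equi-height-sketch").getOrCreate()
val df = spark.table("some_table")   // hypothetical table
val numBuckets = 4

// Equi-height endpoints are (approximate) quantiles: every bucket between two
// consecutive endpoints holds roughly the same number of rows.
val probabilities = (0 to numBuckets).map(_.toDouble / numBuckets).toArray
val endpoints = df.stat.approxQuantile("some_numeric_col", probabilities, 0.01)

// Naive per-bucket ndv for illustration only (values equal to an interior
// endpoint land in two buckets); the real work is done in one aggregate pass.
val ndvPerBucket = endpoints.sliding(2).map { case Array(lo, hi) =>
  df.filter(col("some_numeric_col").between(lo, hi))
    .agg(countDistinct(col("some_numeric_col"))).first().getLong(0)
}.toSeq
{code}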






[jira] [Commented] (SPARK-18054) Unexpected error from UDF that gets an element of a vector: argument 1 requires vector type, however, '`_column_`' is of vector type

2016-10-22 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15597816#comment-15597816
 ] 

Barry Becker commented on SPARK-18054:
--

Ah. That is quite likely the problem. I will verify next week. A simple fix for 
this, then, would be to include the package names of the two classes in the 
error message.
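
For anyone else hitting the same message, a hedged sketch of the likely mismatch 
(assuming it is the mllib-versus-ml package issue discussed above): spark.ml 
pipelines in 2.x produce org.apache.spark.ml.linalg vectors, so a UDF typed 
against the old org.apache.spark.mllib.linalg.DenseVector fails with exactly 
this "vector type vs. vector type" wording, and typing the UDF against the ml 
package resolves it.
{code}
// Fails in 2.x when the column holds an org.apache.spark.ml.linalg vector:
//   import org.apache.spark.mllib.linalg.DenseVector
//   val extractProbability = udf((vector: DenseVector) => vector(1))

// Works: type the UDF against the ml.linalg package used by spark.ml models.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// newDf is the transformed DataFrame from the report above.
val extractProbability = udf((vector: Vector) => vector(1))
val dfWithProbability =
  newDf.withColumn("foo", extractProbability(col("_probability_column_")))
{code}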

> Unexpected error from UDF that gets an element of a vector: argument 1 
> requires vector type, however, '`_column_`' is of vector type
> 
>
> Key: SPARK-18054
> URL: https://issues.apache.org/jira/browse/SPARK-18054
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: Barry Becker
>
> Not sure if this is a bug in ML or a more core part of Spark.
> It used to work in Spark 1.6.2, but now gives me an error.
> I have a pipeline that contains a NaiveBayesModel which I created like this
> {code}
> val nbModel = new NaiveBayes()
>   .setLabelCol(target)
>   .setFeaturesCol(FEATURES_COL)
>   .setPredictionCol(PREDICTION_COLUMN)
>   .setProbabilityCol("_probability_column_")
>   .setModelType("multinomial")
> {code}
> When I apply that pipeline to some data there will be a 
> "_probability_column_" of type vector. I want to extract a probability for a 
> specific class label using the following, but it no longer works.
> {code}
> var newDf = pipeline.transform(df)
> val extractProbability = udf((vector: DenseVector) => vector(1))
> val dfWithProbability = newDf.withColumn("foo", 
> extractProbability(col("_probability_column_")))
> {code}
> The error I get now that I have upgraded to 2.0.1 from 1.6.2 is shown below. 
> I consider this a strange error because it's basically saying "argument 1 
> requires a vector, but we got a vector instead". That does not make any sense 
> to me. It wants a vector, and a vector was given. Why does it fail?
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 
> 'UDF(_class_probability_column__)' due to data type mismatch: argument 1 
> requires vector type, however, '`_class_probability_column__`' is of vector 
> type.;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:205)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:210)
>   at 
> 

[jira] [Comment Edited] (SPARK-17074) generate histogram information for column

2016-10-22 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570366#comment-15570366
 ] 

Zhenhua Wang edited comment on SPARK-17074 at 10/22/16 12:52 PM:
-

Well, I've been stuck here for a few days. I went through the QuantileSummaries 
paper and our code in Spark, and I still don't have any clue how to implement 
the second method and get its bounds.
So I've decided to adopt the first method for now, so that it won't block our 
progress on the CBO work. We can implement the other one in the future.
A PR for a new agg function for counting the ndv's of multiple intervals has 
already been sent.


was (Author: zenwzh):
Well, I've been stuck here for a few days. I went through the QuantileSummaries 
paper and our code in Spark, and I still don't have any clue how to implement 
the second method and get its bounds.
So I've decided to adopt the first method for now, so that it won't block our 
progress on the CBO work. We can implement the other one in the future.
A PR for a new agg function for a string histogram (equi-width) has already been 
sent. I'll start to work on this one today and send a PR in the following days. 
Thanks!

> generate histogram information for column
> -
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> We support two kinds of histograms: 
> - Equi-width histogram: Each column interval in the histogram has a fixed 
> width. The height of an interval represents the frequency of the column values 
> falling into it, so the height varies across intervals. We use the equi-width 
> histogram when the number of distinct values is less than 254.
> - Equi-height histogram: Here the width of each column interval varies, while 
> the heights of all column intervals are the same. The equi-height histogram is 
> effective in handling skewed data distributions. We use the equi-height 
> histogram when the number of distinct values is equal to or greater than 254.
> We first use [SPARK-18000] and [SPARK-17881] to compute equi-width histograms 
> (for both numeric and string types) or the endpoints of equi-height histograms 
> (for numeric types only). Then, if we get the endpoints of an equi-height 
> histogram, we need to compute the ndv's between those endpoints via [SPARK-17997] 
> to form the equi-height histogram.
> This JIRA incorporates the three JIRAs mentioned above, which provide the needed 
> aggregation functions. We need to resolve them before this one.






[jira] [Updated] (SPARK-17986) SQLTransformer leaks temporary tables

2016-10-22 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17986:

Affects Version/s: (was: 2.0.1)

> SQLTransformer leaks temporary tables
> -
>
> Key: SPARK-17986
> URL: https://issues.apache.org/jira/browse/SPARK-17986
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Drew Robb
>Assignee: Drew Robb
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> The SQLTransformer creates a temporary table when called, and does not delete 
> this temporary table. When using a SQLTransformer in a long-running Spark 
> Streaming task, these temporary tables accumulate.
> I believe that the fix would be as simple as calling 
> `dataset.sparkSession.catalog.dropTempView(tableName)` in the last part of 
> `transform`:
> https://github.com/apache/spark/blob/v2.0.1/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala#L65.
>  
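
A hedged sketch of what that change amounts to (simplified, not the actual 
SQLTransformer source): register the view, run the statement, then drop the view 
so long-running jobs do not accumulate them. `__THIS__` is the placeholder the 
transformer's statement uses for the input dataset.
{code}
import org.apache.spark.sql.{DataFrame, Dataset}

// Simplified sketch of the suggested fix.
def transformSketch(dataset: Dataset[_], statement: String): DataFrame = {
  val tableName = "sqltrans_" + java.util.UUID.randomUUID().toString.replace("-", "")
  dataset.createOrReplaceTempView(tableName)
  val result = dataset.sparkSession.sql(statement.replace("__THIS__", tableName))
  // Dropping the view after sql() should be safe: the returned plan is already analyzed.
  dataset.sparkSession.catalog.dropTempView(tableName)
  result
}
{code}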






[jira] [Resolved] (SPARK-17986) SQLTransformer leaks temporary tables

2016-10-22 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-17986.
-
   Resolution: Fixed
 Assignee: Drew Robb
Fix Version/s: 2.1.0
   2.0.2

> SQLTransformer leaks temporary tables
> -
>
> Key: SPARK-17986
> URL: https://issues.apache.org/jira/browse/SPARK-17986
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: Drew Robb
>Assignee: Drew Robb
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> The SQLTransformer creates a temporary table when called, and does not delete 
> this temporary table. When using a SQLTransformer in a long-running Spark 
> Streaming task, these temporary tables accumulate.
> I believe that the fix would be as simple as calling 
> `dataset.sparkSession.catalog.dropTempView(tableName)` in the last part of 
> `transform`:
> https://github.com/apache/spark/blob/v2.0.1/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala#L65.
>  






[jira] [Assigned] (SPARK-18022) java.lang.NullPointerException instead of real exception when saving DF to MySQL

2016-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18022:


Assignee: Apache Spark

> java.lang.NullPointerException instead of real exception when saving DF to 
> MySQL
> 
>
> Key: SPARK-18022
> URL: https://issues.apache.org/jira/browse/SPARK-18022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Maciej Bryński
>Assignee: Apache Spark
>Priority: Minor
>
> Hi,
> I have found the following issue.
> When there is an exception while saving a DataFrame to MySQL, I'm unable to get 
> it.
> Instead, I'm getting the following stack trace.
> {code}
> 16/10/20 06:00:35 WARN TaskSetManager: Lost task 56.0 in stage 10.0 (TID 
> 3753, dwh-hn28.adpilot.co): java.lang.NullPointerException: Cannot suppress a 
> null exception.
> at java.lang.Throwable.addSuppressed(Throwable.java:1046)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:256)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:314)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:313)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The real exception could be, for example, a duplicate primary key, etc.
> With this it's very difficult to debug apps.
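
What makes the real error disappear is a Java detail (stated here as an 
assumption about the savePartition cleanup path, not a quote of it): 
`Throwable.addSuppressed(null)` itself throws 
`NullPointerException("Cannot suppress a null exception.")`, so if the secondary 
exception handed to it is null, that NPE replaces the original SQL error. A 
standalone sketch of the mechanism (the sample messages are hypothetical):
{code}
// Standalone illustration only, not the JdbcUtils code.
object SuppressNullSketch {
  def main(args: Array[String]): Unit = {
    val original = new java.sql.SQLException("Duplicate entry '42' for key 'PRIMARY'")
    val secondary: Throwable = null // e.g. a chained/next exception that was never set
    try {
      original.addSuppressed(secondary) // throws NPE: "Cannot suppress a null exception."
      throw original
    } catch {
      // Prints the NPE instead of the duplicate-key error, as in the task log above.
      case e: Throwable => e.printStackTrace()
    }
  }
}
{code}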






[jira] [Commented] (SPARK-18022) java.lang.NullPointerException instead of real exception when saving DF to MySQL

2016-10-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15597477#comment-15597477
 ] 

Apache Spark commented on SPARK-18022:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15599

> java.lang.NullPointerException instead of real exception when saving DF to 
> MySQL
> 
>
> Key: SPARK-18022
> URL: https://issues.apache.org/jira/browse/SPARK-18022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Maciej Bryński
>Priority: Minor
>
> Hi,
> I have found the following issue.
> When there is an exception while saving a DataFrame to MySQL, I'm unable to get 
> it.
> Instead, I'm getting the following stack trace.
> {code}
> 16/10/20 06:00:35 WARN TaskSetManager: Lost task 56.0 in stage 10.0 (TID 
> 3753, dwh-hn28.adpilot.co): java.lang.NullPointerException: Cannot suppress a 
> null exception.
> at java.lang.Throwable.addSuppressed(Throwable.java:1046)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:256)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:314)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:313)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The real exception could be, for example, a duplicate primary key, etc.
> With this it's very difficult to debug apps.






[jira] [Assigned] (SPARK-18022) java.lang.NullPointerException instead of real exception when saving DF to MySQL

2016-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18022:


Assignee: (was: Apache Spark)

> java.lang.NullPointerException instead of real exception when saving DF to 
> MySQL
> 
>
> Key: SPARK-18022
> URL: https://issues.apache.org/jira/browse/SPARK-18022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Maciej Bryński
>Priority: Minor
>
> Hi,
> I have found the following issue.
> When there is an exception while saving a DataFrame to MySQL, I'm unable to get 
> it.
> Instead, I'm getting the following stack trace.
> {code}
> 16/10/20 06:00:35 WARN TaskSetManager: Lost task 56.0 in stage 10.0 (TID 
> 3753, dwh-hn28.adpilot.co): java.lang.NullPointerException: Cannot suppress a 
> null exception.
> at java.lang.Throwable.addSuppressed(Throwable.java:1046)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:256)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:314)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:313)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The real exception could be, for example, a duplicate primary key error.
> This masking makes it very difficult to debug apps (see the sketch below).
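For context, the NullPointerException in the trace comes from the JVM itself: Throwable.addSuppressed rejects a null argument, and when that happens inside an exception handler the original failure is replaced. A minimal, standalone Scala sketch of that masking behavior (not Spark's JdbcUtils code; names are illustrative):
{code}
// Standalone sketch: if the throwable we try to record as "suppressed" is
// still null, addSuppressed itself throws
// "java.lang.NullPointerException: Cannot suppress a null exception.",
// and that NPE replaces the in-flight exception, hiding the real cause.
object SuppressedNpeSketch {
  def main(args: Array[String]): Unit = {
    val toSuppress: Throwable = null // no earlier failure was ever recorded
    try {
      throw new IllegalStateException("real error, e.g. duplicate primary key")
    } catch {
      case e: Throwable =>
        e.addSuppressed(toSuppress) // throws NPE here; the real error is lost
        throw e                     // never reached
    }
  }
}
{code}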



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18027) .sparkStaging not clean on RM ApplicationNotFoundException

2016-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18027:


Assignee: Apache Spark

> .sparkStaging not clean on RM ApplicationNotFoundException
> --
>
> Key: SPARK-18027
> URL: https://issues.apache.org/jira/browse/SPARK-18027
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: David Shar
>Assignee: Apache Spark
>Priority: Minor
>
> Hi,
> It seems that SPARK-7705 didn't fix all issues with .sparkStaging folder 
> cleanup.
> In Client.scala, monitorApplication:
> {code}
> val report: ApplicationReport =
>   try {
>     getApplicationReport(appId)
>   } catch {
>     case e: ApplicationNotFoundException =>
>       logError(s"Application $appId not found.")
>       return (YarnApplicationState.KILLED, FinalApplicationStatus.KILLED)
>     case NonFatal(e) =>
>       logError(s"Failed to contact YARN for application $appId.", e)
>       return (YarnApplicationState.FAILED, FinalApplicationStatus.FAILED)
>   }
> 
> if (state == YarnApplicationState.FINISHED ||
>     state == YarnApplicationState.FAILED ||
>     state == YarnApplicationState.KILLED) {
>   cleanupStagingDir(appId)
>   return (state, report.getFinalApplicationStatus)
> }
> {code}
> In the case of ApplicationNotFoundException, we don't clean up the 
> .sparkStaging folder.
> I believe we should call cleanupStagingDir(appId) in the catch clauses above 
> (a sketch of that change follows below).
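A hedged sketch of the change proposed above, with cleanup added to both catch clauses; it reuses getApplicationReport, cleanupStagingDir, logError and the YARN enums from the quoted Client.scala snippet and is not the actual patch:
{code}
val report: ApplicationReport =
  try {
    getApplicationReport(appId)
  } catch {
    case e: ApplicationNotFoundException =>
      logError(s"Application $appId not found.")
      cleanupStagingDir(appId)  // proposed: don't leave .sparkStaging behind
      return (YarnApplicationState.KILLED, FinalApplicationStatus.KILLED)
    case NonFatal(e) =>
      logError(s"Failed to contact YARN for application $appId.", e)
      cleanupStagingDir(appId)  // proposed: clean up on other RM failures too
      return (YarnApplicationState.FAILED, FinalApplicationStatus.FAILED)
  }
{code}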



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15042) ConnectedComponents fails to compute graph with 200 vertices (but long paths)

2016-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15042.
---
Resolution: Cannot Reproduce

Provisionally closing as it may have been fixed in between these versions.

> ConnectedComponents fails to compute graph with 200 vertices (but long paths)
> -
>
> Key: SPARK-15042
> URL: https://issues.apache.org/jira/browse/SPARK-15042
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.1
> Environment: Local cluster (1 instance) running on Arch Linux
> Scala 2.11.7, Java 1.8.0_92
>Reporter: Philipp Claßen
>
> ConnectedComponents takes forever and eventually fails with OutOfMemory when 
> computing this graph: {code}{ (i, i+1) | i <- { 1..200 } }{code}
> If you generate the example graph, e.g., with this bash command
> {code}
> for i in {1..200} ; do echo "$i $(($i+1))" ; done > input.graph
> {code}
> ... then you should be able to reproduce it in the spark-shell by running:
> {code}
> import org.apache.spark.graphx._
> import org.apache.spark.graphx.lib._
> val graph = GraphLoader.edgeListFile(sc, "input.graph").cache()
> ConnectedComponents.run(graph)
> {code}
> It seems to take forever, and spawns these warnings from time to time:
> {code}
> 16/04/30 20:06:24 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(driver,[Lscala.Tuple2;@7af98fbd,BlockManagerId(driver, localhost, 
> 43440))] in 1 attempts
> {code}
> For additional information, here is a link to my related question on 
> Stackoverflow:
> http://stackoverflow.com/q/36892272/783510
> One comment so far was that the number of skipped tasks grows exponentially.
> ---
> Here is the complete output of a spark-shell session:
> {noformat}
> phil@terra-arch:~/tmp/spark-graph$ spark-shell 
> log4j:WARN No appenders could be found for logger 
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Using Spark's repl log4j profile: 
> org/apache/spark/log4j-defaults-repl.properties
> To adjust logging level use sc.setLogLevel("INFO")
> Spark context available as sc.
> SQL context available as sqlContext.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
>       /_/
>  
> Using Scala version 2.11.7 (OpenJDK 64-Bit Server VM, Java 1.8.0_92)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> import org.apache.spark.graphx._
> import org.apache.spark.graphx._
> scala> import org.apache.spark.graphx.lib._
> import org.apache.spark.graphx.lib._
> scala> 
> scala> val graph = GraphLoader.edgeListFile(sc, "input.graph").cache()
> graph: org.apache.spark.graphx.Graph[Int,Int] = 
> org.apache.spark.graphx.impl.GraphImpl@1fa9692b
> scala> ConnectedComponents.run(graph)
> 16/04/30 20:05:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(driver,[Lscala.Tuple2;@50432fd2,BlockManagerId(driver, localhost, 
> 43440))] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 
> seconds]. This timeout is controlled by spark.executor.heartbeatInterval
>   at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>   at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:449)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:470)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
>   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:470)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> 

[jira] [Assigned] (SPARK-18027) .sparkStaging not clean on RM ApplicationNotFoundException

2016-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18027:


Assignee: (was: Apache Spark)

> .sparkStaging not clean on RM ApplicationNotFoundException
> --
>
> Key: SPARK-18027
> URL: https://issues.apache.org/jira/browse/SPARK-18027
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: David Shar
>Priority: Minor
>
> Hi,
> It seems that SPARK-7705 didn't fix all issues with .sparkStaging folder 
> cleanup.
> In Client.scala, monitorApplication:
> {code}
> val report: ApplicationReport =
>   try {
>     getApplicationReport(appId)
>   } catch {
>     case e: ApplicationNotFoundException =>
>       logError(s"Application $appId not found.")
>       return (YarnApplicationState.KILLED, FinalApplicationStatus.KILLED)
>     case NonFatal(e) =>
>       logError(s"Failed to contact YARN for application $appId.", e)
>       return (YarnApplicationState.FAILED, FinalApplicationStatus.FAILED)
>   }
> 
> if (state == YarnApplicationState.FINISHED ||
>     state == YarnApplicationState.FAILED ||
>     state == YarnApplicationState.KILLED) {
>   cleanupStagingDir(appId)
>   return (state, report.getFinalApplicationStatus)
> }
> {code}
> In the case of ApplicationNotFoundException, we don't clean up the 
> .sparkStaging folder.
> I believe we should call cleanupStagingDir(appId) in the catch clauses above.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18027) .sparkStaging not clean on RM ApplicationNotFoundException

2016-10-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15597470#comment-15597470
 ] 

Apache Spark commented on SPARK-18027:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15598

> .sparkStaging not clean on RM ApplicationNotFoundException
> --
>
> Key: SPARK-18027
> URL: https://issues.apache.org/jira/browse/SPARK-18027
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: David Shar
>Priority: Minor
>
> Hi,
> It seems that SPARK-7705 didn't fix all issues with .sparkStaging folder 
> cleanup.
> In Client.scala, monitorApplication:
> {code}
> val report: ApplicationReport =
>   try {
>     getApplicationReport(appId)
>   } catch {
>     case e: ApplicationNotFoundException =>
>       logError(s"Application $appId not found.")
>       return (YarnApplicationState.KILLED, FinalApplicationStatus.KILLED)
>     case NonFatal(e) =>
>       logError(s"Failed to contact YARN for application $appId.", e)
>       return (YarnApplicationState.FAILED, FinalApplicationStatus.FAILED)
>   }
> 
> if (state == YarnApplicationState.FINISHED ||
>     state == YarnApplicationState.FAILED ||
>     state == YarnApplicationState.KILLED) {
>   cleanupStagingDir(appId)
>   return (state, report.getFinalApplicationStatus)
> }
> {code}
> In the case of ApplicationNotFoundException, we don't clean up the 
> .sparkStaging folder.
> I believe we should call cleanupStagingDir(appId) in the catch clauses above.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18062) ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace should return probabilities when given all-0 vector

2016-10-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15597464#comment-15597464
 ] 

Sean Owen commented on SPARK-18062:
---

In the spirit of throwing errors on invalid input, shouldn't this be an error 
(perhaps earlier)? The input says all classes are impossible, and normalization 
can't change that.

> ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace should 
> return probabilities when given all-0 vector
> 
>
> Key: SPARK-18062
> URL: https://issues.apache.org/jira/browse/SPARK-18062
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> {{ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace}} returns 
> a vector of all-0 when given a rawPrediction vector of all-0.  It should 
> return a valid probability vector with the uniform distribution.
> Note: This will be a *behavior change* but it should be very minor and affect 
> few if any users.  But we should note it in release notes.
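A standalone sketch of the proposed behavior (not the actual ProbabilisticClassificationModel code), assuming the raw scores arrive as a plain Array[Double]; whether an all-zero input should instead be treated as an error, as the comment above suggests, is the open question:
{code}
// Normalize raw scores to probabilities in place; when every score is 0,
// the proposal is to fall back to the uniform distribution rather than
// returning an all-zero "probability" vector.
def normalizeToProbabilitiesSketch(values: Array[Double]): Unit = {
  val sum = values.sum
  if (sum == 0.0) {
    java.util.Arrays.fill(values, 1.0 / values.length)
  } else {
    var i = 0
    while (i < values.length) {
      values(i) /= sum
      i += 1
    }
  }
}
{code}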



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18052) Spark Job failing with org.apache.spark.rpc.RpcTimeoutException

2016-10-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15597456#comment-15597456
 ] 

Sean Owen commented on SPARK-18052:
---

This sounds like you have some env or network problem. I'm not aware of any 
problems of this form and there's no real info on a reproduction here, so I'd 
generally close this.

> Spark Job failing with org.apache.spark.rpc.RpcTimeoutException
> ---
>
> Key: SPARK-18052
> URL: https://issues.apache.org/jira/browse/SPARK-18052
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
> Environment: 3 node spark cluster, all AWS r3.xlarge instances 
> running on ubuntu.
>Reporter: Srikanth
> Attachments: sparkErrorLog.txt
>
>
> Spark submit jobs are failing with org.apache.spark.rpc.RpcTimeoutException. 
> We increased the spark.executor.heartbeatInterval value from 10s to 60s, but 
> still hit the same issue.
> This is happening while saving a dataframe to a mounted network drive; we are 
> not using HDFS here. We are able to write smaller files (under 10G) 
> successfully, but the data we are reading here is nearly 20G.
> driver memory = 10G
> executor memory = 25G
> Please see the attached log file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18054) Unexpected error from UDF that gets an element of a vector: argument 1 requires vector type, however, '`_column_`' is of vector type

2016-10-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15597454#comment-15597454
 ] 

Sean Owen commented on SPARK-18054:
---

Are you creating .ml vectors, or .mllib vectors? There are two vector classes, 
and I suspect you have one where the other is expected (see the sketch below).
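If the UDF was written against the old org.apache.spark.mllib.linalg.DenseVector, the ML pipeline's probability column (a spark.ml vector in 2.0) carries a different, identically named "vector" UDT, which would explain the confusing message. A hedged sketch of the spark.ml variant, reusing newDf and the column name from the report below:
{code}
import org.apache.spark.ml.linalg.Vector        // spark.ml vectors, used by ML pipelines in 2.x
import org.apache.spark.sql.functions.{col, udf}

// Typing the UDF on spark.ml's Vector should match the pipeline output column;
// typing it on org.apache.spark.mllib.linalg.DenseVector would not.
val extractProbability = udf((v: Vector) => v(1))
val dfWithProbability =
  newDf.withColumn("foo", extractProbability(col("_probability_column_")))
{code}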

> Unexpected error from UDF that gets an element of a vector: argument 1 
> requires vector type, however, '`_column_`' is of vector type
> 
>
> Key: SPARK-18054
> URL: https://issues.apache.org/jira/browse/SPARK-18054
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: Barry Becker
>
> Not sure if this is a bug in ML or a more core part of Spark.
> It used to work in Spark 1.6.2, but now gives me an error.
> I have a pipeline that contains a NaiveBayesModel which I created like this:
> {code}
> val nbModel = new NaiveBayes()
>   .setLabelCol(target)
>   .setFeaturesCol(FEATURES_COL)
>   .setPredictionCol(PREDICTION_COLUMN)
>   .setProbabilityCol("_probability_column_")
>   .setModelType("multinomial")
> {code}
> When I apply that pipeline to some data there will be a 
> "_probability_column_" of type vector. I want to extract a probability for a 
> specific class label using the following, but it no longer works.
> {code}
> var newDf = pipeline.transform(df)
> val extractProbability = udf((vector: DenseVector) => vector(1))
> val dfWithProbability = newDf.withColumn("foo", 
> extractProbability(col("_probability_column_")))
> {code}
> The error I get now that I have upgraded to 2.0.1 from 1.6.2 is shown below. 
> I consider this a strange error because it's basically saying "argument 1 
> requires a vector, but we got a vector instead". That does not make any sense 
> to me. It wants a vector, and a vector was given. Why does it fail?
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 
> 'UDF(_class_probability_column__)' due to data type mismatch: argument 1 
> requires vector type, however, '`_class_probability_column__`' is of vector 
> type.;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:205)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:210)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 

[jira] [Updated] (SPARK-17898) --repositories needs username and password

2016-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17898:
--
Fix Version/s: 2.1.0

> --repositories  needs username and password
> ---
>
> Key: SPARK-17898
> URL: https://issues.apache.org/jira/browse/SPARK-17898
> Project: Spark
>  Issue Type: Documentation
>Affects Versions: 2.0.1
>Reporter: lichenglin
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 2.1.0
>
>
> My private repositories need a username and password to visit.
> I can't find a way to declare the username and password when submitting a 
> Spark application.
> {code}
> bin/spark-submit --repositories   
> http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages 
> com.databricks:spark-csv_2.10:1.2.0   --class 
> org.apache.spark.examples.SparkPi   --master local[8]   
> examples/jars/spark-examples_2.11-2.0.1.jar   100
> {code}
> The repo http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ needs a 
> username and password.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-17898) --repositories needs username and password

2016-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-17898:
---

> --repositories  needs username and password
> ---
>
> Key: SPARK-17898
> URL: https://issues.apache.org/jira/browse/SPARK-17898
> Project: Spark
>  Issue Type: Documentation
>Affects Versions: 2.0.1
>Reporter: lichenglin
>Assignee: Sean Owen
>Priority: Trivial
>
> My private repositories need a username and password to visit.
> I can't find a way to declare the username and password when submitting a 
> Spark application.
> {code}
> bin/spark-submit --repositories   
> http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages 
> com.databricks:spark-csv_2.10:1.2.0   --class 
> org.apache.spark.examples.SparkPi   --master local[8]   
> examples/jars/spark-examples_2.11-2.0.1.jar   100
> {code}
> The repo http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ needs a 
> username and password.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17898) --repositories needs username and password

2016-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17898.
---
Resolution: Fixed

Documented.

> --repositories  needs username and password
> ---
>
> Key: SPARK-17898
> URL: https://issues.apache.org/jira/browse/SPARK-17898
> Project: Spark
>  Issue Type: Documentation
>Affects Versions: 2.0.1
>Reporter: lichenglin
>Assignee: Sean Owen
>Priority: Trivial
>
> My private repositories need a username and password to visit.
> I can't find a way to declare the username and password when submitting a 
> Spark application.
> {code}
> bin/spark-submit --repositories   
> http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages 
> com.databricks:spark-csv_2.10:1.2.0   --class 
> org.apache.spark.examples.SparkPi   --master local[8]   
> examples/jars/spark-examples_2.11-2.0.1.jar   100
> {code}
> The repo http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ needs a 
> username and password (see the example below).
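For reference, the resolution documents a way to do this: as far as I can tell, the credentials go directly in the repository URI using standard user:password@host syntax. A hedged example based on the command above (username and password are placeholders):
{code}
bin/spark-submit --repositories \
  http://username:password@wx.bjdv.com:8081/nexus/content/groups/bigdata/ \
  --packages com.databricks:spark-csv_2.10:1.2.0 \
  --class org.apache.spark.examples.SparkPi --master local[8] \
  examples/jars/spark-examples_2.11-2.0.1.jar 100
{code}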



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17944) sbin/start-* scripts use of `hostname -f` fail with Solaris

2016-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17944.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15557
[https://github.com/apache/spark/pull/15557]

> sbin/start-* scripts use of `hostname -f` fail with Solaris 
> 
>
> Key: SPARK-17944
> URL: https://issues.apache.org/jira/browse/SPARK-17944
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.1
> Environment: Solaris 10, Solaris 11
>Reporter: Erik O'Shaughnessy
>Priority: Trivial
> Fix For: 2.1.0
>
>
> {{$SPARK_HOME/sbin/start-master.sh}} fails:
> {noformat}
> $ ./start-master.sh 
> usage: hostname [[-t] system_name]
>hostname [-D]
> starting org.apache.spark.deploy.master.Master, logging to 
> /home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out
> failed to launch org.apache.spark.deploy.master.Master:
> --properties-file FILE Path to a custom Spark properties file.
>Default is conf/spark-defaults.conf.
> full log in 
> /home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out
> {noformat}
> I found SPARK-17546 which changed the invocation of hostname in 
> sbin/start-master.sh, sbin/start-slaves.sh and sbin/start-mesos-dispatcher.sh 
> to include the flag {{-f}}, which is not a valid command line option for the 
> Solaris hostname implementation. 
> As a workaround, Solaris users can substitute:
> {noformat}
> `/usr/sbin/check-hostname | awk '{print $NF}'`
> {noformat}
> Admittedly not an obvious fix, but it provides equivalent functionality. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17944) sbin/start-* scripts use of `hostname -f` fail with Solaris

2016-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17944:
--
Assignee: Erik O'Shaughnessy

> sbin/start-* scripts use of `hostname -f` fail with Solaris 
> 
>
> Key: SPARK-17944
> URL: https://issues.apache.org/jira/browse/SPARK-17944
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.1
> Environment: Solaris 10, Solaris 11
>Reporter: Erik O'Shaughnessy
>Assignee: Erik O'Shaughnessy
>Priority: Trivial
> Fix For: 2.1.0
>
>
> {{$SPARK_HOME/sbin/start-master.sh}} fails:
> {noformat}
> $ ./start-master.sh 
> usage: hostname [[-t] system_name]
>hostname [-D]
> starting org.apache.spark.deploy.master.Master, logging to 
> /home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out
> failed to launch org.apache.spark.deploy.master.Master:
> --properties-file FILE Path to a custom Spark properties file.
>Default is conf/spark-defaults.conf.
> full log in 
> /home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out
> {noformat}
> I found SPARK-17546 which changed the invocation of hostname in 
> sbin/start-master.sh, sbin/start-slaves.sh and sbin/start-mesos-dispatcher.sh 
> to include the flag {{-f}}, which is not a valid command line option for the 
> Solaris hostname implementation. 
> As a workaround, Solaris users can substitute:
> {noformat}
> `/usr/sbin/check-hostname | awk '{print $NF}'`
> {noformat}
> Admittedly not an obvious fix, but it provides equivalent functionality. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18047) Spark worker port should be greater than 1023

2016-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18047.
---
Resolution: Not A Problem

> Spark worker port should be greater than 1023
> -
>
> Key: SPARK-18047
> URL: https://issues.apache.org/jira/browse/SPARK-18047
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.0.1
>Reporter: darion yaphet
>
> The port numbers in the range from 0 to 1023 are the well-known ports (system 
> ports).
> They are widely used by system network services, such as Telnet (23), Simple 
> Mail Transfer Protocol (25), and Domain Name System (53).
> The worker port should avoid using these ports.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18063) Failed to infer constraints over multiple aliases

2016-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18063:


Assignee: (was: Apache Spark)

> Failed to infer constraints over multiple aliases
> -
>
> Key: SPARK-18063
> URL: https://issues.apache.org/jira/browse/SPARK-18063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jiang Xingbo
>Priority: Minor
>
> The `UnaryNode.getAliasedConstraints` function fails to replace all 
> expressions by their aliases when the constraints contain more than one 
> expression to be replaced. For example:
> {code}
> val tr = LocalRelation('a.int, 'b.string, 'c.int)
> val multiAlias = tr.where('a === 'c + 10).select('a.as('x), 'c.as('y))
> multiAlias.analyze.constraints
> {code}
> currently outputs:
> {code}
> ExpressionSet(Seq(
>   IsNotNull(resolveColumn(multiAlias.analyze, "x")),
>   IsNotNull(resolveColumn(multiAlias.analyze, "y"))
> ))
> {code}
> The constraint {code}resolveColumn(multiAlias.analyze, "x") === 
> resolveColumn(multiAlias.analyze, "y") + 10{code} is missing (see the toy 
> sketch below).
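To make the failure mode concrete, here is a toy, self-contained Scala sketch (deliberately not Catalyst code) of the substitution the report asks for: every alias has to be applied to each constraint, otherwise any constraint that mentions more than one aliased expression gets dropped:
{code}
object AliasSubstitutionSketch {
  // Toy model: constraints as strings, aliases as a rename map ('a -> 'x, 'c -> 'y).
  val aliases = Map("'a" -> "'x", "'c" -> "'y")
  val constraints = Set("isnotnull('a)", "isnotnull('c)", "'a = 'c + 10")

  // Apply *all* renames to each constraint, not just the first one that matches.
  def substituteAll(constraint: String): String =
    aliases.foldLeft(constraint) { case (acc, (from, to)) => acc.replace(from, to) }

  def main(args: Array[String]): Unit = {
    // Prints (order may vary): Set(isnotnull('x), isnotnull('y), 'x = 'y + 10).
    // A single-substitution pass would drop "'a = 'c + 10", because the partially
    // rewritten constraint still references an expression that is no longer output.
    println(constraints.map(substituteAll))
  }
}
{code}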



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18063) Failed to infer constraints over multiple aliases

2016-10-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15597304#comment-15597304
 ] 

Apache Spark commented on SPARK-18063:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/15597

> Failed to infer constraints over multiple aliases
> -
>
> Key: SPARK-18063
> URL: https://issues.apache.org/jira/browse/SPARK-18063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jiang Xingbo
>Priority: Minor
>
> The `UnaryNode.getAliasedConstraints` function fails to replace all 
> expressions by their aliases when the constraints contain more than one 
> expression to be replaced. For example:
> {code}
> val tr = LocalRelation('a.int, 'b.string, 'c.int)
> val multiAlias = tr.where('a === 'c + 10).select('a.as('x), 'c.as('y))
> multiAlias.analyze.constraints
> {code}
> currently outputs:
> {code}
> ExpressionSet(Seq(
>   IsNotNull(resolveColumn(multiAlias.analyze, "x")),
>   IsNotNull(resolveColumn(multiAlias.analyze, "y"))
> ))
> {code}
> The constraint {code}resolveColumn(multiAlias.analyze, "x") === 
> resolveColumn(multiAlias.analyze, "y") + 10{code} is missing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18063) Failed to infer constraints over multiple aliases

2016-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18063:


Assignee: Apache Spark

> Failed to infer constraints over multiple aliases
> -
>
> Key: SPARK-18063
> URL: https://issues.apache.org/jira/browse/SPARK-18063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>Priority: Minor
>
> The `UnaryNode.getAliasedConstraints` function fails to replace all 
> expressions by their aliases when the constraints contain more than one 
> expression to be replaced. For example:
> {code}
> val tr = LocalRelation('a.int, 'b.string, 'c.int)
> val multiAlias = tr.where('a === 'c + 10).select('a.as('x), 'c.as('y))
> multiAlias.analyze.constraints
> {code}
> currently outputs:
> {code}
> ExpressionSet(Seq(
>   IsNotNull(resolveColumn(multiAlias.analyze, "x")),
>   IsNotNull(resolveColumn(multiAlias.analyze, "y"))
> ))
> {code}
> The constraint {code}resolveColumn(multiAlias.analyze, "x") === 
> resolveColumn(multiAlias.analyze, "y") + 10{code} is missing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18063) Failed to infer constraints over multiple aliases

2016-10-22 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-18063:


 Summary: Failed to infer constraints over multiple aliases
 Key: SPARK-18063
 URL: https://issues.apache.org/jira/browse/SPARK-18063
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Jiang Xingbo
Priority: Minor


The `UnaryNode.getAliasedConstraints` function fails to replace all expressions 
by their aliases when the constraints contain more than one expression to be 
replaced. For example:
{code}
val tr = LocalRelation('a.int, 'b.string, 'c.int)
val multiAlias = tr.where('a === 'c + 10).select('a.as('x), 'c.as('y))
multiAlias.analyze.constraints
{code}
currently outputs:
{code}
ExpressionSet(Seq(
  IsNotNull(resolveColumn(multiAlias.analyze, "x")),
  IsNotNull(resolveColumn(multiAlias.analyze, "y"))
))
{code}
The constraint {code}resolveColumn(multiAlias.analyze, "x") === 
resolveColumn(multiAlias.analyze, "y") + 10{code} is missing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org