[jira] [Commented] (SPARK-17094) provide simplified API for ML pipeline

2016-09-06 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469609#comment-15469609
 ] 

yuhao yang commented on SPARK-17094:


Something like the Stanford CoreNLP pipeline: 

props.setProperty("annotators", 
"tokenize,ssplit,pos,lemma,ner,regexner,parse,mention,coref");

> provide simplified API for ML pipeline
> --
>
> Key: SPARK-17094
> URL: https://issues.apache.org/jira/browse/SPARK-17094
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>
> Many machine learning pipelines have an API for easily assembling transformers.
> One example would be:
> val model = new Pipeline("tokenizer", "countvectorizer", "lda").fit(data).
> Appreciate feedback and suggestions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17427) function SIZE should return -1 when parameter is null

2016-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17427:


Assignee: (was: Apache Spark)

> function SIZE should return -1 when parameter is null
> -
>
> Key: SPARK-17427
> URL: https://issues.apache.org/jira/browse/SPARK-17427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>Priority: Minor
>
> `select size(null)` returns -1 in Hive. In order to be compatible, we need to 
> return -1 also.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17427) function SIZE should return -1 when parameter is null

2016-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469569#comment-15469569
 ] 

Apache Spark commented on SPARK-17427:
--

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/14991

> function SIZE should return -1 when parameter is null
> -
>
> Key: SPARK-17427
> URL: https://issues.apache.org/jira/browse/SPARK-17427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>Priority: Minor
>
> `select size(null)` returns -1 in Hive. In order to be compatible, we need to 
> return -1 also.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17427) function SIZE should return -1 when parameter is null

2016-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17427:


Assignee: Apache Spark

> function SIZE should return -1 when parameter is null
> -
>
> Key: SPARK-17427
> URL: https://issues.apache.org/jira/browse/SPARK-17427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>Assignee: Apache Spark
>Priority: Minor
>
> `select size(null)` returns -1 in Hive. In order to be compatible, we need to 
> return -1 also.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17427) function SIZE should return -1 when parameter is null

2016-09-06 Thread Adrian Wang (JIRA)
Adrian Wang created SPARK-17427:
---

 Summary: function SIZE should return -1 when parameter is null
 Key: SPARK-17427
 URL: https://issues.apache.org/jira/browse/SPARK-17427
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang
Priority: Minor


`select size(null)` returns -1 in Hive. In order to be compatible, we need to 
return -1 also.
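
A quick way to compare against Hive's behaviour, assuming a running SparkSession named spark; the cast keeps the argument typed as an array so only the null handling is exercised. The expected output in the comment reflects the Hive-compatible result this issue asks for.

{code}
// size() on a NULL array: Hive evaluates this to -1, and this issue asks Spark to match.
spark.sql(
  """SELECT size(CAST(NULL AS ARRAY<INT>)) AS size_of_null,
    |       size(array(1, 2, 3))           AS size_of_array""".stripMargin
).show()

// Expected once Spark matches Hive:
// +------------+-------------+
// |size_of_null|size_of_array|
// +------------+-------------+
// |          -1|            3|
// +------------+-------------+
{code}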



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17368) Scala value classes create encoder problems and break at runtime

2016-09-06 Thread Aris Vlasakakis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469564#comment-15469564
 ] 

Aris Vlasakakis commented on SPARK-17368:
-

It goes from inconvenient to actually prohibitive in a practical sense. I have 
a Dataset[Something], and inside case class Something I have various other case 
classes, and somewhere in there is a particular value class. It is so crazy to 
do manual unwrapping and rewrapping that at this point I just decided to eat 
the performance cost and use a regular case class, not a value class (I removed 
the 'extends AnyVal').

More generally, making special accommodations for value classes is *really hard* 
in a practical setting: if I am working with a whole bunch of ADTs and other 
case classes, how do I know whether I used a *value class* anywhere in my domain 
and suddenly have to jump through a bunch of hoops just so Spark doesn't blow 
up? If I just had a Dataset[ThisIsAValueClass] with the top-level class being a 
value class, what you're saying is easy, but in practice the value class is one 
of many things somewhere deeper.
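
To make that concrete, here is a small hypothetical domain model (all names invented) where the value class is buried inside ordinary case classes rather than being the top-level Dataset type; per this ticket it compiles, but materializing the Dataset is where it is expected to break.

{code}
import org.apache.spark.sql.SparkSession

object NestedValueClassExample {
  case class FeatureId(v: Int) extends AnyVal          // the buried value class
  case class Scores(primary: FeatureId, weight: Double)
  case class Something(name: String, scores: Seq[Scores])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    val data = Seq(Something("a", Seq(Scores(FeatureId(1), 0.5))))
    // Compiles fine; with the encoder issue described in this ticket,
    // creating and counting the Dataset is expected to fail at runtime.
    val ds = spark.createDataset(data)
    println(ds.count())
  }
}
{code}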

> Scala value classes create encoder problems and break at runtime
> 
>
> Key: SPARK-17368
> URL: https://issues.apache.org/jira/browse/SPARK-17368
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: JDK 8 on MacOS
> Scala 2.11.8
> Spark 2.0.0
>Reporter: Aris Vlasakakis
>
> Using Scala value classes as the inner type for Datasets breaks in Spark 2.0 
> and 1.6.X.
> This simple Spark 2 application demonstrates that the code compiles, but 
> breaks at runtime with the error below. The value class is of course 
> *FeatureId*, as it extends AnyVal.
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: Error while encoding: 
> java.lang.RuntimeException: Couldn't find v on int
> assertnotnull(input[0, int, true], top level non-flat input object).v AS v#0
> +- assertnotnull(input[0, int, true], top level non-flat input object).v
>+- assertnotnull(input[0, int, true], top level non-flat input object)
>   +- input[0, int, true]".
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:279)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> {noformat}
> Test code for Spark 2.0.0:
> {noformat}
> import org.apache.spark.sql.{Dataset, SparkSession}
> object BreakSpark {
>   case class FeatureId(v: Int) extends AnyVal
>   def main(args: Array[String]): Unit = {
> val seq = Seq(FeatureId(1), FeatureId(2), FeatureId(3))
> val spark = SparkSession.builder.getOrCreate()
> import spark.implicits._
> spark.sparkContext.setLogLevel("warn")
> val ds: Dataset[FeatureId] = spark.createDataset(seq)
> println(s"BREAK HERE: ${ds.count}")
>   }
> }
> {noformat}
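
For completeness, the workaround mentioned in the comment above is simply to drop the AnyVal marker, trading the value class's zero-allocation benefit for an encoder that works today:

{code}
// Workaround sketch: as a plain case class (no `extends AnyVal`), FeatureId
// is handled by the Dataset encoder and the program above runs as expected.
case class FeatureId(v: Int)
{code}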



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17426) Current TreeNode.toJSON may trigger OOM under some corner cases

2016-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17426:


Assignee: (was: Apache Spark)

> Current TreeNode.toJSON may trigger OOM under some corner cases
> ---
>
> Key: SPARK-17426
> URL: https://issues.apache.org/jira/browse/SPARK-17426
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>
> In SPARK-17356, we fix the OOM issue when Metadata is super big. There are 
> other cases that may also trigger OOM. Current implementation of 
> TreeNode.toJSON will recursively search and print all fields of current 
> TreeNode, even if the field's type is of type Seq or type Map. 
> This is not safe because:
> 1. the Seq or Map can be very big. Converting them to JSON may take huge 
> memory, which may trigger an out-of-memory error.
> 2. Some user space input may also be propagated to the Plan. The user space 
> input can be of arbitrary type, and may also be self-referencing. Trying to 
> print user space input to JSON is very risky.
> The following example triggers a StackOverflowError when calling toJSON on a 
> plan with user defined UDF.
> {code}
> case class SelfReferenceUDF(
> var config: Map[String, Any] = Map.empty[String, Any]) extends 
> Function1[String, Boolean] {
>   config += "self" -> this
>   def apply(key: String): Boolean = config.contains(key)
> }
> test("toJSON should not throws java.lang.StackOverflowError") {
>   val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
>   // triggers java.lang.StackOverflowError
>   udf.toJSON
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17426) Current TreeNode.toJSON may trigger OOM under some corner cases

2016-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17426:


Assignee: Apache Spark

> Current TreeNode.toJSON may trigger OOM under some corner cases
> ---
>
> Key: SPARK-17426
> URL: https://issues.apache.org/jira/browse/SPARK-17426
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Apache Spark
>
> In SPARK-17356, we fix the OOM issue when Metadata is super big. There are 
> other cases that may also trigger OOM. Current implementation of 
> TreeNode.toJSON will recursively search and print all fields of current 
> TreeNode, even if the field's type is of type Seq or type Map. 
> This is not safe because:
> 1. the Seq or Map can be very big. Converting them to JSON may take huge 
> memory, which may trigger an out-of-memory error.
> 2. Some user space input may also be propagated to the Plan. The user space 
> input can be of arbitrary type, and may also be self-referencing. Trying to 
> print user space input to JSON is very risky.
> The following example triggers a StackOverflowError when calling toJSON on a 
> plan with user defined UDF.
> {code}
> case class SelfReferenceUDF(
> var config: Map[String, Any] = Map.empty[String, Any]) extends 
> Function1[String, Boolean] {
>   config += "self" -> this
>   def apply(key: String): Boolean = config.contains(key)
> }
> test("toJSON should not throws java.lang.StackOverflowError") {
>   val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
>   // triggers java.lang.StackOverflowError
>   udf.toJSON
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17426) Current TreeNode.toJSON may trigger OOM under some corner cases

2016-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469557#comment-15469557
 ] 

Apache Spark commented on SPARK-17426:
--

User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/14990

> Current TreeNode.toJSON may trigger OOM under some corner cases
> ---
>
> Key: SPARK-17426
> URL: https://issues.apache.org/jira/browse/SPARK-17426
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>
> In SPARK-17356, we fix the OOM issue when Metadata is super big. There are 
> other cases that may also trigger OOM. Current implementation of 
> TreeNode.toJSON will recursively search and print all fields of current 
> TreeNode, even if the field's type is of type Seq or type Map. 
> This is not safe because:
> 1. the Seq or Map can be very big. Converting them to JSON may take huge 
> memory, which may trigger an out-of-memory error.
> 2. Some user space input may also be propagated to the Plan. The user space 
> input can be of arbitrary type, and may also be self-referencing. Trying to 
> print user space input to JSON is very risky.
> The following example triggers a StackOverflowError when calling toJSON on a 
> plan with user defined UDF.
> {code}
> case class SelfReferenceUDF(
> var config: Map[String, Any] = Map.empty[String, Any]) extends 
> Function1[String, Boolean] {
>   config += "self" -> this
>   def apply(key: String): Boolean = config.contains(key)
> }
> test("toJSON should not throws java.lang.StackOverflowError") {
>   val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
>   // triggers java.lang.StackOverflowError
>   udf.toJSON
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17426) Current TreeNode.toJSON may trigger OOM under some corner cases

2016-09-06 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17426:
---
Target Version/s: 2.1.0
 Description: 
In SPARK-17356, we fix the OOM issue when Metadata is super big. There are 
other cases that may also trigger OOM. Current implementation of 
TreeNode.toJSON will recursively search and print all fields of current 
TreeNode, even if the field's type is of type Seq or type Map. 

This is not safe because:
1. the Seq or Map can be very big. Converting them to JSON may take huge 
memory, which may trigger an out-of-memory error.
2. Some user space input may also be propagated to the Plan. The input can be 
of arbitrary type, and may also be self-referencing. Trying to print user space 
to JSON input is very risky.

The following example triggers a StackOverflowError when calling toJSON on a 
plan with user defined UDF.
{code}

case class SelfReferenceUDF(
var config: Map[String, Any] = Map.empty[String, Any]) extends 
Function1[String, Boolean] {
  config += "self" -> this
  def apply(key: String): Boolean = config.contains(key)
}

test("toJSON should not throws java.lang.StackOverflowError") {
  val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
  // triggers java.lang.StackOverflowError
  udf.toJSON
}

{code}

  was:
In SPARK-17356, we fix the OOM issue when {monospace}Metadata{monospace} is 
super big. There are other cases that may also trigger OOM. Current 
implementation of TreeNode.toJSON will recursively search and print all fields 
of current TreeNode, even if the field's type is of type Seq or type Map. 

This is not safe because:
1. the Seq or Map can be very big. Converting them to JSON may take huge 
memory, which may trigger an out-of-memory error.
2. Some user space input may also be propagated to the Plan. The input can be 
of arbitrary type, and may also be self-referencing. Trying to print user space 
to JSON input is very risky.

The following example triggers a StackOverflowError when calling toJSON on a 
plan with user defined UDF.
{code}

case class SelfReferenceUDF(
var config: Map[String, Any] = Map.empty[String, Any]) extends 
Function1[String, Boolean] {
  config += "self" -> this
  def apply(key: String): Boolean = config.contains(key)
}

test("toJSON should not throws java.lang.StackOverflowError") {
  val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
  // triggers java.lang.StackOverflowError
  udf.toJSON
}

{code}


> Current TreeNode.toJSON may trigger OOM under some corner cases
> ---
>
> Key: SPARK-17426
> URL: https://issues.apache.org/jira/browse/SPARK-17426
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>
> In SPARK-17356, we fix the OOM issue when Metadata is super big. There are 
> other cases that may also trigger OOM. Current implementation of 
> TreeNode.toJSON will recursively search and print all fields of current 
> TreeNode, even if the field's type is of type Seq or type Map. 
> This is not safe because:
> 1. the Seq or Map can be very big. Converting them to JSON may take huge 
> memory, which may trigger an out-of-memory error.
> 2. Some user space input may also be propagated to the Plan. The input can be 
> of arbitrary type, and may also be self-referencing. Trying to print user 
> space to JSON input is very risky.
> The following example triggers a StackOverflowError when calling toJSON on a 
> plan with user defined UDF.
> {code}
> case class SelfReferenceUDF(
> var config: Map[String, Any] = Map.empty[String, Any]) extends 
> Function1[String, Boolean] {
>   config += "self" -> this
>   def apply(key: String): Boolean = config.contains(key)
> }
> test("toJSON should not throws java.lang.StackOverflowError") {
>   val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
>   // triggers java.lang.StackOverflowError
>   udf.toJSON
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17426) Current TreeNode.toJSON may trigger OOM under some corner cases

2016-09-06 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17426:
---
Description: 
In SPARK-17356, we fix the OOM issue when Metadata is super big. There are 
other cases that may also trigger OOM. Current implementation of 
TreeNode.toJSON will recursively search and print all fields of current 
TreeNode, even if the field's type is of type Seq or type Map. 

This is not safe because:
1. the Seq or Map can be very big. Converting them to JSON may take huge 
memory, which may trigger an out-of-memory error.
2. Some user space input may also be propagated to the Plan. The user space 
input can be of arbitrary type, and may also be self-referencing. Trying to 
print user space input to JSON is very risky.

The following example triggers a StackOverflowError when calling toJSON on a 
plan with user defined UDF.
{code}

case class SelfReferenceUDF(
var config: Map[String, Any] = Map.empty[String, Any]) extends 
Function1[String, Boolean] {
  config += "self" -> this
  def apply(key: String): Boolean = config.contains(key)
}

test("toJSON should not throws java.lang.StackOverflowError") {
  val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
  // triggers java.lang.StackOverflowError
  udf.toJSON
}

{code}

  was:
In SPARK-17356, we fix the OOM issue when Metadata is super big. There are 
other cases that may also trigger OOM. Current implementation of 
TreeNode.toJSON will recursively search and print all fields of current 
TreeNode, even if the field's type is of type Seq or type Map. 

This is not safe because:
1. the Seq or Map can be very big. Converting them to JSON may take huge 
memory, which may trigger an out-of-memory error.
2. Some user space input may also be propagated to the Plan. The input can be 
of arbitrary type, and may also be self-referencing. Trying to print user space 
to JSON input is very risky.

The following example triggers a StackOverflowError when calling toJSON on a 
plan with user defined UDF.
{code}

case class SelfReferenceUDF(
var config: Map[String, Any] = Map.empty[String, Any]) extends 
Function1[String, Boolean] {
  config += "self" -> this
  def apply(key: String): Boolean = config.contains(key)
}

test("toJSON should not throws java.lang.StackOverflowError") {
  val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
  // triggers java.lang.StackOverflowError
  udf.toJSON
}

{code}


> Current TreeNode.toJSON may trigger OOM under some corner cases
> ---
>
> Key: SPARK-17426
> URL: https://issues.apache.org/jira/browse/SPARK-17426
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>
> In SPARK-17356, we fix the OOM issue when Metadata is super big. There are 
> other cases that may also trigger OOM. Current implementation of 
> TreeNode.toJSON will recursively search and print all fields of current 
> TreeNode, even if the field's type is of type Seq or type Map. 
> This is not safe because:
> 1. the Seq or Map can be very big. Converting them to JSON may take huge 
> memory, which may trigger an out-of-memory error.
> 2. Some user space input may also be propagated to the Plan. The user space 
> input can be of arbitrary type, and may also be self-referencing. Trying to 
> print user space input to JSON is very risky.
> The following example triggers a StackOverflowError when calling toJSON on a 
> plan with user defined UDF.
> {code}
> case class SelfReferenceUDF(
> var config: Map[String, Any] = Map.empty[String, Any]) extends 
> Function1[String, Boolean] {
>   config += "self" -> this
>   def apply(key: String): Boolean = config.contains(key)
> }
> test("toJSON should not throws java.lang.StackOverflowError") {
>   val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
>   // triggers java.lang.StackOverflowError
>   udf.toJSON
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17426) Current TreeNode.toJSON may trigger OOM under some corner cases

2016-09-06 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17426:
---
Component/s: SQL

> Current TreeNode.toJSON may trigger OOM under some corner cases
> ---
>
> Key: SPARK-17426
> URL: https://issues.apache.org/jira/browse/SPARK-17426
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>
> In SPARK-17356, we fix the OOM issue when {monospace}Metadata{monospace} is 
> super big. There are other cases that may also trigger OOM. Current 
> implementation of TreeNode.toJSON will recursively search and print all 
> fields of current TreeNode, even if the field's type is of type Seq or type 
> Map. 
> This is not safe because:
> 1. the Seq or Map can be very big. Converting them to JSON may take huge 
> memory, which may trigger an out-of-memory error.
> 2. Some user space input may also be propagated to the Plan. The input can be 
> of arbitrary type, and may also be self-referencing. Trying to print user 
> space to JSON input is very risky.
> The following example triggers a StackOverflowError when calling toJSON on a 
> plan with user defined UDF.
> {code}
> case class SelfReferenceUDF(
> var config: Map[String, Any] = Map.empty[String, Any]) extends 
> Function1[String, Boolean] {
>   config += "self" -> this
>   def apply(key: String): Boolean = config.contains(key)
> }
> test("toJSON should not throws java.lang.StackOverflowError") {
>   val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
>   // triggers java.lang.StackOverflowError
>   udf.toJSON
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17426) Current TreeNode.toJSON may trigger OOM under some corner cases

2016-09-06 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-17426:
--

 Summary: Current TreeNode.toJSON may trigger OOM under some corner 
cases
 Key: SPARK-17426
 URL: https://issues.apache.org/jira/browse/SPARK-17426
 Project: Spark
  Issue Type: Bug
Reporter: Sean Zhong


In SPARK-17356, we fix the OOM issue when {monospace}Metadata{monospace} is 
super big. There are other cases that may also trigger OOM. Current 
implementation of TreeNode.toJSON will recursively search and print all fields 
of current TreeNode, even if the field's type is of type Seq or type Map. 

This is not safe because:
1. the Seq or Map can be very big. Converting them to JSON may take huge 
memory, which may trigger an out-of-memory error.
2. Some user space input may also be propagated to the Plan. The input can be 
of arbitrary type, and may also be self-referencing. Trying to print user space 
to JSON input is very risky.

The following example triggers a StackOverflowError when calling toJSON on a 
plan with user defined UDF.
{code}

case class SelfReferenceUDF(
var config: Map[String, Any] = Map.empty[String, Any]) extends 
Function1[String, Boolean] {
  config += "self" -> this
  def apply(key: String): Boolean = config.contains(key)
}

test("toJSON should not throws java.lang.StackOverflowError") {
  val udf = ScalaUDF(SelfReferenceUDF(), BooleanType, Seq("col1".attr))
  // triggers java.lang.StackOverflowError
  udf.toJSON
}

{code}
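
As a standalone illustration (no Spark dependencies, all names hypothetical) of why the traversal cannot terminate: any renderer that recurses into every Map it meets will loop forever on a structure that contains itself, just like the config map held by SelfReferenceUDF above.

{code}
import scala.collection.mutable

object SelfReferenceJsonDemo {
  // A deliberately naive recursive renderer, standing in for a toJSON-style
  // traversal that descends into every Map and Seq it encounters.
  def render(value: Any): String = value match {
    case m: scala.collection.Map[_, _] =>
      m.map { case (k, v) => "\"" + k + "\": " + render(v) }.mkString("{", ", ", "}")
    case s: scala.collection.Seq[_] =>
      s.map(render).mkString("[", ", ", "]")
    case other =>
      "\"" + other + "\""
  }

  def main(args: Array[String]): Unit = {
    val config = mutable.Map[String, Any]("key" -> "col1")
    config += "self" -> config  // the map now contains itself

    try println(render(config))
    catch {
      case _: StackOverflowError => println("recursion on the self-reference never terminates")
    }
  }
}
{code}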



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525

2016-09-06 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469487#comment-15469487
 ] 

Jacek Laskowski commented on SPARK-17405:
-

It definitely got better with today's build (Sept 7th). Yesterday, even a query 
as simple as {{Seq(1).toDF.groupBy('value).count.show}} died.

> Simple aggregation query OOMing after SPARK-16525
> -
>
> Key: SPARK-17405
> URL: https://issues.apache.org/jira/browse/SPARK-17405
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Prior to SPARK-16525 / https://github.com/apache/spark/pull/14176, the 
> following query ran fine via Beeline / Thrift Server and the Spark shell, but 
> after that patch it is consistently OOMING:
> {code}
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
> int_col_9, string_col_10) AS (
>   SELECT * FROM (VALUES
> (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), 
> TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, 
> TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'),
> (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 
> 00:00:00.0'), CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS 
> INT), TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'),
> (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 
> 00:00:00.0'), CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 
> 00:00:00.0'), '211', -959, CAST(NULL AS STRING)),
> (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), 
> CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', 
> CAST(NULL AS INT), '936'),
> (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 
> 00:00:00.0'), CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, 
> TIMESTAMP('2028-06-27 00:00:00.0'), '-657', 948, '18'),
> (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 
> 00:00:00.0'), CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 
> 00:00:00.0'), '-345', 566, '-574'),
> (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 
> 00:00:00.0'), CAST(972 AS SMALLINT), true, CAST(NULL AS INT), 
> TIMESTAMP('2026-06-10 00:00:00.0'), '518', 683, '-320'),
> (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), 
> CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142')
>   ) as t);
> CREATE TEMPORARY VIEW table_3(string_col_1, float_col_2, timestamp_col_3, 
> boolean_col_4, timestamp_col_5, decimal3317_col_6) AS (
>   SELECT * FROM (VALUES
> ('88', CAST(191.92508 AS FLOAT), TIMESTAMP('1990-10-25 00:00:00.0'), 
> false, TIMESTAMP('1992-11-02 00:00:00.0'), CAST(NULL AS DECIMAL(33,17))),
> ('-419', CAST(-13.477915 AS FLOAT), TIMESTAMP('1996-03-02 00:00:00.0'), 
> true, CAST(NULL AS TIMESTAMP), -653.51000BD),
> ('970', CAST(-360.432 AS FLOAT), TIMESTAMP('2010-07-29 00:00:00.0'), 
> false, TIMESTAMP('1995-09-01 00:00:00.0'), -936.48000BD),
> ('807', CAST(814.30756 AS FLOAT), TIMESTAMP('2019-11-06 00:00:00.0'), 
> false, TIMESTAMP('1996-04-25 00:00:00.0'), 335.56000BD),
> ('-872', CAST(616.50525 AS FLOAT), TIMESTAMP('2011-08-28 00:00:00.0'), 
> false, TIMESTAMP('2003-07-19 00:00:00.0'), -951.18000BD),
> ('-167', CAST(-875.35675 AS FLOAT), TIMESTAMP('1995-07-14 00:00:00.0'), 
> false, TIMESTAMP('2005-11-29 00:00:00.0'), 224.89000BD)
>   ) as t);
> SELECT
> CAST(MIN(t2.smallint_col_4) AS STRING) AS char_col,
> LEAD(MAX((-387) + (727.64)), 90) OVER (PARTITION BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) ORDER BY COALESCE(t2.int_col_9, 
> t2.smallint_col_4, t2.int_col_9) DESC, CAST(MIN(t2.smallint_col_4) AS 
> STRING)) AS decimal_col,
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9) AS int_col
> FROM table_3 t1
> INNER JOIN table_1 t2 ON (((t2.timestamp_col_3) = (t1.timestamp_col_5)) AND 
> ((t2.string_col_10) = (t1.string_col_1))) AND ((t2.string_col_10) = 
> (t1.string_col_1))
> WHERE
> (t2.smallint_col_4) IN (t2.int_col_9, t2.int_col_9)
> GROUP BY
> COALESCE(t2.int_col_9, t2.smallint_col_4, t2.int_col_9);
> {code}
> Here's the OOM:
> {code}
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 
> (TID 9, localhost): java.lang.OutOfMemoryError: Unable to acquire 262144 
> bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:100)
> at 
> 

[jira] [Updated] (SPARK-6235) Address various 2G limits

2016-09-06 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-6235:
---
Attachment: (was: SPARK-6235_Design_V0.01.pdf)

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
> Attachments: SPARK-6235_Design_V0.02.pdf
>
>
> An umbrella ticket to track the various 2G limits we have in Spark, due to the 
> use of byte arrays and ByteBuffers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6235) Address various 2G limits

2016-09-06 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-6235:
---
Attachment: SPARK-6235_Design_V0.02.pdf

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
> Attachments: SPARK-6235_Design_V0.01.pdf, SPARK-6235_Design_V0.02.pdf
>
>
> An umbrella ticket to track the various 2G limits we have in Spark, due to the 
> use of byte arrays and ByteBuffers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17110) Pyspark with locality ANY throw java.io.StreamCorruptedException

2016-09-06 Thread Tomer Kaftan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469295#comment-15469295
 ] 

Tomer Kaftan commented on SPARK-17110:
--

Thanks all who helped out with this!

> Pyspark with locality ANY throw java.io.StreamCorruptedException
> 
>
> Key: SPARK-17110
> URL: https://issues.apache.org/jira/browse/SPARK-17110
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: Cluster of 2 AWS r3.xlarge slaves launched via ec2 
> scripts, Spark 2.0.0, hadoop: yarn, pyspark shell
>Reporter: Tomer Kaftan
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> In Pyspark 2.0.0, any task that accesses cached data non-locally throws a 
> StreamCorruptedException like the stacktrace below:
> {noformat}
> WARN TaskSetManager: Lost task 7.0 in stage 2.0 (TID 26, 172.31.26.184): 
> java.io.StreamCorruptedException: invalid stream header: 12010A80
> at 
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:807)
> at java.io.ObjectInputStream.(ObjectInputStream.java:302)
> at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.(JavaSerializer.scala:63)
> at 
> org.apache.spark.serializer.JavaDeserializationStream.(JavaSerializer.scala:63)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122)
> at 
> org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:146)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:524)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:522)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:522)
> at org.apache.spark.storage.BlockManager.get(BlockManager.scala:609)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The simplest way I have found to reproduce this is by running the following 
> code in the pyspark shell, on a cluster of 2 slaves set to use only one 
> worker core each:
> {code}
> x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache()
> x.count()
> import time
> def waitMap(x):
>     time.sleep(x)
>     return x
> x.map(waitMap).count()
> {code}
> Or by running the following via spark-submit:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache()
> x.count()
> import time
> def waitMap(x):
>     time.sleep(x)
>     return x
> x.map(waitMap).count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17372) Running a file stream on a directory with partitioned subdirs throw NotSerializableException/StackOverflowError

2016-09-06 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-17372.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14987
[https://github.com/apache/spark/pull/14987]

> Running a file stream on a directory with partitioned subdirs throw 
> NotSerializableException/StackOverflowError
> ---
>
> Key: SPARK-17372
> URL: https://issues.apache.org/jira/browse/SPARK-17372
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.0.1, 2.1.0
>
>
> When we create a filestream on a directory that has partitioned subdirs (i.e. 
> dir/x=y/), then ListingFileCatalog.allFiles returns the files in the dir as 
> Seq[String] which internally is a Stream[String]. This is because of this 
> [line|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L93],
>  where a LinkedHashSet.values.toSeq returns Stream. Then when the 
> [FileStreamSource|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L79]
>  filters this Stream[String] to remove the seen files, it creates a new 
> Stream[String], which has a filter function that has a $outer reference to 
> the FileStreamSource (in Scala 2.10). Trying to serialize this Stream[String] 
> causes NotSerializableException. This will happen even if there is just one 
> file in the dir.
> It's important to note that this behavior is different in Scala 2.11. There is 
> no $outer reference to FileStreamSource, so it does not throw 
> NotSerializableException. However, with a large sequence of files (tested 
> with 1 files), it throws StackOverflowError. This is because of how the Stream 
> class is implemented. It's basically a linked list, and attempting to 
> serialize a long Stream requires *recursively* walking the linked list, 
> thus resulting in StackOverflowError.
> In short, across both Scala 2.10 and 2.11, serialization fails when both of the 
> following conditions are true:
> - file stream defined on a partitioned directory  
> - directory has 10k+ files
> The right solution is to convert the seq to an array before writing to the 
> log.
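
The failure mode is easy to reproduce outside Spark. A minimal standalone sketch (names invented) of why serializing a long Stream is fragile while an Array is not:

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object StreamSerializationDemo {
  // Plain Java serialization; print either the payload size or the error type.
  def trySerialize(label: String, obj: AnyRef): Unit = {
    try {
      val bytes = new ByteArrayOutputStream()
      val out = new ObjectOutputStream(bytes)
      out.writeObject(obj)
      out.close()
      println(s"$label: ${bytes.size()} bytes")
    } catch {
      case t: Throwable => println(s"$label: ${t.getClass.getSimpleName}")
    }
  }

  def main(args: Array[String]): Unit = {
    // A long, fully materialized Stream of paths, similar in shape to what a
    // partitioned directory listing can produce.
    val files: Stream[String] = Stream.tabulate(100000)(i => s"dir/x=y/part-$i").force

    // Converting to an Array first flattens the structure: serializes fine.
    trySerialize("array", files.toArray)

    // Serializing the Stream walks the cons cells through nested writeObject
    // calls, which can blow the stack for a long enough sequence.
    trySerialize("stream", files)
  }
}
{code}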



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17279) better error message for exceptions during ScalaUDF execution

2016-09-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17279:

Fix Version/s: 2.0.1

> better error message for exceptions during ScalaUDF execution
> -
>
> Key: SPARK-17279
> URL: https://issues.apache.org/jira/browse/SPARK-17279
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.1, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17425) Override sameResult in HiveTableScanExec to make ReuseExchange work in text format table

2016-09-06 Thread Yadong Qi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yadong Qi updated SPARK-17425:
--
Summary: Override sameResult in HiveTableScanExec to make ReuseExchange 
work in text format table  (was: Override sameResult in HiveTableScanExec to 
make ReusedExchange work in text format table)

> Override sameResult in HiveTableScanExec to make ReuseExchange work in text 
> format table
> 
>
> Key: SPARK-17425
> URL: https://issues.apache.org/jira/browse/SPARK-17425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yadong Qi
>
> When I run the below SQL(table src is text format):
> {code:sql}
> SELECT * FROM src t1
> JOIN src t2 ON t1.key = t2.key
> JOIN src t3 ON t1.key = t3.key;
> {code}
> The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
> format) instead of src(text format), PhysicalPlan contain *ReuseExchange* in 
> PhysicalPlan.
> I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
> *HiveTableScanExec* didn't.
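
A quick way to check the reported behaviour, assuming a Hive-enabled session and that the text-format src and parquet src_pqt tables already exist; look for a ReusedExchange node in each printed physical plan.

{code}
import org.apache.spark.sql.SparkSession

object ReuseExchangeCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    def selfJoin(table: String) =
      s"""SELECT * FROM $table t1
         |JOIN $table t2 ON t1.key = t2.key
         |JOIN $table t3 ON t1.key = t3.key""".stripMargin

    // Per the report: the parquet table's plan shows ReusedExchange,
    // while the text-format table's plan repeats the exchange.
    Seq("src", "src_pqt").foreach { table =>
      println(s"=== $table ===")
      spark.sql(selfJoin(table)).explain()
    }
  }
}
{code}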



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17425) Override sameResult in HiveTableScanExec to make ReusedExchange work in text format table

2016-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17425:


Assignee: (was: Apache Spark)

> Override sameResult in HiveTableScanExec to make ReusedExchange work in text 
> format table
> -
>
> Key: SPARK-17425
> URL: https://issues.apache.org/jira/browse/SPARK-17425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yadong Qi
>
> When I run the below SQL(table src is text format):
> {code:sql}
> SELECT * FROM src t1
> JOIN src t2 ON t1.key = t2.key
> JOIN src t3 ON t1.key = t3.key;
> {code}
> The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
> format) instead of src(text format), PhysicalPlan contain *ReuseExchange* in 
> PhysicalPlan.
> I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
> *HiveTableScanExec* didn't.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17425) Override sameResult in HiveTableScanExec to make ReusedExchange work in text format table

2016-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469249#comment-15469249
 ] 

Apache Spark commented on SPARK-17425:
--

User 'watermen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14988

> Override sameResult in HiveTableScanExec to make ReusedExchange work in text 
> format table
> -
>
> Key: SPARK-17425
> URL: https://issues.apache.org/jira/browse/SPARK-17425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yadong Qi
>
> When I run the below SQL(table src is text format):
> {code:sql}
> SELECT * FROM src t1
> JOIN src t2 ON t1.key = t2.key
> JOIN src t3 ON t1.key = t3.key;
> {code}
> The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
> format) instead of src(text format), PhysicalPlan contain *ReuseExchange* in 
> PhysicalPlan.
> I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
> *HiveTableScanExec* didn't.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17425) Override sameResult in HiveTableScanExec to make ReusedExchange work in text format table

2016-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17425:


Assignee: Apache Spark

> Override sameResult in HiveTableScanExec to make ReusedExchange work in text 
> format table
> -
>
> Key: SPARK-17425
> URL: https://issues.apache.org/jira/browse/SPARK-17425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yadong Qi
>Assignee: Apache Spark
>
> When I run the below SQL(table src is text format):
> {code:sql}
> SELECT * FROM src t1
> JOIN src t2 ON t1.key = t2.key
> JOIN src t3 ON t1.key = t3.key;
> {code}
> The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
> format) instead of src(text format), PhysicalPlan contain *ReuseExchange* in 
> PhysicalPlan.
> I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
> *HiveTableScanExec* didn't.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17425) Override sameResult in HiveTableScanExec to make ReusedExchange work in text format table

2016-09-06 Thread Yadong Qi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yadong Qi updated SPARK-17425:
--
Description: 
When I run the below SQL(table src is text format):
{code:sql}
SELECT * FROM src t1
JOIN src t2 ON t1.key = t2.key
JOIN src t3 ON t1.key = t3.key;
{code}
The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
format) instead of src(text format), PhysicalPlan contain *ReuseExchange* in 
PhysicalPlan.
I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
*HiveTableScanExec* didn't.

  was:
When I run the below SQL(table src is text format):
{code:sql}
SELECT 1 FROM src t1
JOIN src t2 ON t1.key = t2.key
JOIN src t3 ON t1.key = t3.key;
{code}
The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
format) instead of src(text format), PhysicalPlan contain *ReuseExchange* in 
PhysicalPlan.
I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
*HiveTableScanExec* didn't.


> Override sameResult in HiveTableScanExec to make ReusedExchange work in text 
> format table
> -
>
> Key: SPARK-17425
> URL: https://issues.apache.org/jira/browse/SPARK-17425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yadong Qi
>
> When I run the below SQL(table src is text format):
> {code:sql}
> SELECT * FROM src t1
> JOIN src t2 ON t1.key = t2.key
> JOIN src t3 ON t1.key = t3.key;
> {code}
> The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
> format) instead of src(text format), PhysicalPlan contain *ReuseExchange* in 
> PhysicalPlan.
> I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
> *HiveTableScanExec* didn't.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17425) Override sameResult in HiveTableScanExec to make ReusedExchange work in text format table

2016-09-06 Thread Yadong Qi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yadong Qi updated SPARK-17425:
--
Description: 
When I run the below SQL(table src is text format):
{code:sql}
SELECT 1 FROM src t1
JOIN src t2 ON t1.key = t2.key
JOIN src t3 ON t1.key = t3.key;
{code}
The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
format) instead of src(text format), PhysicalPlan contain *ReuseExchange* in 
PhysicalPlan.
I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
*HiveTableScanExec* didn't,

  was:
When I run the below SQL(table src is text format):
{code:sql}
SELECT 1 FROM src t1
JOIN src t2 ON t1.key = t2.key
JOIN src t3 ON t1.key = t3.key;
{code}
The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
format) instead of src, PhysicalPlan contain *ReuseExchange* in PhysicalPlan.
I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
*HiveTableScanExec* didn't,


> Override sameResult in HiveTableScanExec to make ReusedExchange work in text 
> format table
> -
>
> Key: SPARK-17425
> URL: https://issues.apache.org/jira/browse/SPARK-17425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yadong Qi
>
> When I run the below SQL(table src is text format):
> {code:sql}
> SELECT 1 FROM src t1
> JOIN src t2 ON t1.key = t2.key
> JOIN src t3 ON t1.key = t3.key;
> {code}
> The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
> format) instead of src(text format), PhysicalPlan contain *ReuseExchange* in 
> PhysicalPlan.
> I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
> *HiveTableScanExec* didn't,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17425) Override sameResult in HiveTableScanExec to make ReusedExchange work in text format table

2016-09-06 Thread Yadong Qi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yadong Qi updated SPARK-17425:
--
Description: 
When I run the below SQL(table src is text format):
{code:sql}
SELECT 1 FROM src t1
JOIN src t2 ON t1.key = t2.key
JOIN src t3 ON t1.key = t3.key;
{code}
The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
format) instead of src, PhysicalPlan contain *ReuseExchange* in PhysicalPlan.
I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
*HiveTableScanExec* didn't,

  was:
When I run the below SQL(table src is text format):
{code:sql}
SELECT 1 FROM src t1
JOIN src t2 ON t1.key = t2.key
JOIN src t3 ON t1.key = t3.key;
{code}
The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
format), PhysicalPlan contain *ReuseExchange* in PhysicalPlan.
I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
*HiveTableScanExec* didn't,


> Override sameResult in HiveTableScanExec to make ReusedExchange work in text 
> format table
> -
>
> Key: SPARK-17425
> URL: https://issues.apache.org/jira/browse/SPARK-17425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yadong Qi
>
> When I run the below SQL(table src is text format):
> {code:sql}
> SELECT 1 FROM src t1
> JOIN src t2 ON t1.key = t2.key
> JOIN src t3 ON t1.key = t3.key;
> {code}
> The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
> format) instead of src, PhysicalPlan contain *ReuseExchange* in PhysicalPlan.
> I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
> *HiveTableScanExec* didn't,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17425) Override sameResult in HiveTableScanExec to make ReusedExchange work in text format table

2016-09-06 Thread Yadong Qi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yadong Qi updated SPARK-17425:
--
Description: 
When I run the below SQL(table src is text format):
{code:sql}
SELECT 1 FROM src t1
JOIN src t2 ON t1.key = t2.key
JOIN src t3 ON t1.key = t3.key;
{code}
The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
format) instead of src(text format), PhysicalPlan contain *ReuseExchange* in 
PhysicalPlan.
I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
*HiveTableScanExec* didn't.

  was:
When I run the below SQL(table src is text format):
{code:sql}
SELECT 1 FROM src t1
JOIN src t2 ON t1.key = t2.key
JOIN src t3 ON t1.key = t3.key;
{code}
The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
format) instead of src(text format), PhysicalPlan contain *ReuseExchange* in 
PhysicalPlan.
I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
*HiveTableScanExec* didn't,


> Override sameResult in HiveTableScanExec to make ReusedExchange work in text 
> format table
> -
>
> Key: SPARK-17425
> URL: https://issues.apache.org/jira/browse/SPARK-17425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yadong Qi
>
> When I run the below SQL(table src is text format):
> {code:sql}
> SELECT 1 FROM src t1
> JOIN src t2 ON t1.key = t2.key
> JOIN src t3 ON t1.key = t3.key;
> {code}
> The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
> format) instead of src(text format), PhysicalPlan contain *ReuseExchange* in 
> PhysicalPlan.
> I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
> *HiveTableScanExec* didn't.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17425) Override sameResult in HiveTableScanExec to make ReusedExchange work in text format table

2016-09-06 Thread Yadong Qi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yadong Qi updated SPARK-17425:
--
Description: 
When I run the below SQL(table src is text format):
{code:sql}
SELECT 1 FROM src t1
JOIN src t2 ON t1.key = t2.key
JOIN src t3 ON t1.key = t3.key;
{code}
The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
format), PhysicalPlan contain *ReuseExchange* in PhysicalPlan.
I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
*HiveTableScanExec* didn't,

  was:
When I run the below SQL(table src is text format):
{code:sql}
select 1 from src t1
join src t2 on t1.key = t2.key
join src t3 on t1.key = t3.key;
{code}
The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
format), PhysicalPlan contain *ReuseExchange* in PhysicalPlan.
I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
*HiveTableScanExec* didn't,


> Override sameResult in HiveTableScanExec to make ReusedExchange work in text 
> format table
> -
>
> Key: SPARK-17425
> URL: https://issues.apache.org/jira/browse/SPARK-17425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yadong Qi
>
> When I run the below SQL(table src is text format):
> {code:sql}
> SELECT 1 FROM src t1
> JOIN src t2 ON t1.key = t2.key
> JOIN src t3 ON t1.key = t3.key;
> {code}
> The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
> format), PhysicalPlan contain *ReuseExchange* in PhysicalPlan.
> I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
> *HiveTableScanExec* didn't,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17425) Override sameResult in HiveTableScanExec to make ReusedExchange work in text format table

2016-09-06 Thread Yadong Qi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yadong Qi updated SPARK-17425:
--
Affects Version/s: 2.0.0
  Component/s: SQL

> Override sameResult in HiveTableScanExec to make ReusedExchange work in text 
> format table
> -
>
> Key: SPARK-17425
> URL: https://issues.apache.org/jira/browse/SPARK-17425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yadong Qi
>
> When I run the below SQL(table src is text format):
> {code:sql}
> select 1 from src t1
> join src t2 on t1.key = t2.key
> join src t3 on t1.key = t3.key;
> {code}
> The PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
> format), PhysicalPlan contain *ReuseExchange* in PhysicalPlan.
> I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
> *HiveTableScanExec* didn't,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17425) Override sameResult in HiveTableScanExec to make ReusedExchange work in text format table

2016-09-06 Thread Yadong Qi (JIRA)
Yadong Qi created SPARK-17425:
-

 Summary: Override sameResult in HiveTableScanExec to make 
ReusedExchange work in text format table
 Key: SPARK-17425
 URL: https://issues.apache.org/jira/browse/SPARK-17425
 Project: Spark
  Issue Type: Bug
Reporter: Yadong Qi


When I run the SQL below (table src is in text format):
{code:sql}
select 1 from src t1
join src t2 on t1.key = t2.key
join src t3 on t1.key = t3.key;
{code}
the PhysicalPlan doesn't contain *ReuseExchange*. When I use src_pqt (Parquet 
format) instead, the PhysicalPlan does contain *ReuseExchange*.
I found that *sameResult* is already overridden in *FileSourceScanExec*, but 
not in *HiveTableScanExec*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17425) Override sameResult in HiveTableScanExec to make ReusedExchange work in text format table

2016-09-06 Thread Yadong Qi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yadong Qi updated SPARK-17425:
--
Description: 
When I run the SQL below (table src is in text format):
{code:sql}
select 1 from src t1
join src t2 on t1.key = t2.key
join src t3 on t1.key = t3.key;
{code}
The PhysicalPlan doesn't contain *ReuseExchange*. When I use src_pqt (Parquet 
format) instead, the PhysicalPlan does contain *ReuseExchange*.
I found that *sameResult* is already overridden in *FileSourceScanExec*, but 
not in *HiveTableScanExec*.

  was:
When I run the below SQL(table src is text format):
{code:sql}
select 1 from src t1
join src t2 on t1.key = t2.key
join src t3 on t1.key = t3.key;
{code}
the PhysicalPlan doesn't contain *ReuseExchange*. And I use src_pqt(parquet 
format), PhysicalPlan contain *ReuseExchange* in PhysicalPlan.
I found the *sameResult* in *FileSourceScanExec* has already overrided, but 
*HiveTableScanExec* didn't,


> Override sameResult in HiveTableScanExec to make ReusedExchange work in text 
> format table
> -
>
> Key: SPARK-17425
> URL: https://issues.apache.org/jira/browse/SPARK-17425
> Project: Spark
>  Issue Type: Bug
>Reporter: Yadong Qi
>
> When I run the SQL below (table src is in text format):
> {code:sql}
> select 1 from src t1
> join src t2 on t1.key = t2.key
> join src t3 on t1.key = t3.key;
> {code}
> The PhysicalPlan doesn't contain *ReuseExchange*. When I use src_pqt (Parquet 
> format) instead, the PhysicalPlan does contain *ReuseExchange*.
> I found that *sameResult* is already overridden in *FileSourceScanExec*, but 
> not in *HiveTableScanExec*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17238) simplify the logic for converting data source table into hive compatible format

2016-09-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17238.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14809
[https://github.com/apache/spark/pull/14809]

> simplify the logic for converting data source table into hive compatible 
> format
> ---
>
> Key: SPARK-17238
> URL: https://issues.apache.org/jira/browse/SPARK-17238
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17372) Running a file stream on a directory with partitioned subdirs throw NotSerializableException/StackOverflowError

2016-09-06 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-17372:
--
Description: 
When we create a filestream on a directory that has partitioned subdirs (i.e. 
dir/x=y/), then ListingFileCatalog.allFiles returns the files in the dir as 
Seq[String] which internally is a Stream[String]. This is because of this 
[line|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L93],
 where a LinkedHashSet.values.toSeq returns Stream. Then when the 
[FileStreamSource|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L79]
 filters this Stream[String] to remove the seen files, it creates a new 
Stream[String], which has a filter function that has a $outer reference to the 
FileStreamSource (in Scala 2.10). Trying to serialize this Stream[String] 
causes NotSerializableException. This happens even if there is just one 
file in the dir.

It's important to note that this behavior is different in Scala 2.11. There is 
no $outer reference to FileStreamSource, so it does not throw 
NotSerializableException. However, with a large sequence of files (tested with 
10k+ files), it throws StackOverflowError. This is because of how the Stream 
class is implemented: it is basically a linked list, and attempting to serialize 
a long Stream requires *recursively* walking the linked list, thus resulting in 
StackOverflowError.

In short, across both Scala 2.10 and 2.11, serialization fails when both of the 
following conditions are true: 
- the file stream is defined on a partitioned directory  
- the directory has 10k+ files

The right solution is to convert the seq to an array before writing to the log.
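As a rough illustration of why converting to an array helps, here is a standalone Scala sketch (plain JVM serialization, not the actual Spark code path; the element count is arbitrary):
{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object StreamSerializationDemo {
  // Java-serialize an object and return its size in bytes.
  private def serializedSize(o: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(o)
    out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    // A long Stream, standing in for the file list returned by allFiles.
    val files: Seq[String] = Stream.tabulate(100000)(i => s"part-$i")

    // serializedSize(files)               // risks StackOverflowError: Stream
    //                                     // serialization recurses cell by cell
    println(serializedSize(files.toArray)) // an Array serializes iteratively
  }
}
{code}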



  was:
When we create a filestream on a directory that has partitioned subdirs (i.e. 
dir/x=y/), then ListingFileCatalog.allFiles returns the files in the dir as 
Seq[String] which internally is a Stream[String]. This is because of this 
[line|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L93],
 where a LinkedHashSet.values.toSeq returns Stream. Then when the 
[FileStreamSource|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L79]
 filters this Stream[String] to remove the seen files, it creates a new 
Stream[String], which has a filter function that has a $outer reference to the 
FileStreamSource (in Scala 2.10). Trying to serialize this Stream[String] 
causes NotSerializableException. This will happened even if there is just one 
file in the dir.

Its important to note that this behavior is different in Scala 2.11. There is 
no $outer reference to FileStreamSource, so it does not throw 
NotSerializableException. However, with a large sequence of files (tested with 
5000 files), it throws StackOverflowError. This is because how Stream class is 
implemented. Its basically like a linked list, and attempting to serialize a 
long Stream requires *recursively* going through linked list, thus resulting in 
StackOverflowError.

The right solution is to convert the seq to an array before writing to the log.




> Running a file stream on a directory with partitioned subdirs throw 
> NotSerializableException/StackOverflowError
> ---
>
> Key: SPARK-17372
> URL: https://issues.apache.org/jira/browse/SPARK-17372
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> When we create a filestream on a directory that has partitioned subdirs (i.e. 
> dir/x=y/), then ListingFileCatalog.allFiles returns the files in the dir as 
> Seq[String] which internally is a Stream[String]. This is because of this 
> [line|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L93],
>  where a LinkedHashSet.values.toSeq returns Stream. Then when the 
> [FileStreamSource|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L79]
>  filters this Stream[String] to remove the seen files, it creates a new 
> Stream[String], which has a filter function that has a $outer reference to 
> the FileStreamSource (in Scala 2.10). Trying to serialize this Stream[String] 
> causes NotSerializableException. This happens even if there is just one 
> file in the dir.
> It's important to note that this behavior is different in Scala 2.11. There is 

[jira] [Updated] (SPARK-17372) Running a file stream on a directory with partitioned subdirs throw NotSerializableException/StackOverflowError

2016-09-06 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-17372:
--
Description: 
When we create a filestream on a directory that has partitioned subdirs (i.e. 
dir/x=y/), then ListingFileCatalog.allFiles returns the files in the dir as 
Seq[String] which internally is a Stream[String]. This is because of this 
[line|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L93],
 where a LinkedHashSet.values.toSeq returns Stream. Then when the 
[FileStreamSource|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L79]
 filters this Stream[String] to remove the seen files, it creates a new 
Stream[String], which has a filter function that has a $outer reference to the 
FileStreamSource (in Scala 2.10). Trying to serialize this Stream[String] 
causes NotSerializableException. This happens even if there is just one 
file in the dir.

It's important to note that this behavior is different in Scala 2.11. There is 
no $outer reference to FileStreamSource, so it does not throw 
NotSerializableException. However, with a large sequence of files (tested with 
5000 files), it throws StackOverflowError. This is because of how the Stream 
class is implemented: it is basically a linked list, and attempting to serialize 
a long Stream requires *recursively* walking the linked list, thus resulting in 
StackOverflowError.

The right solution is to convert the seq to an array before writing to the log.



  was:
Here is the result of my investigation. When we create a filestream on a 
directory that has partitioned subdirs (i.e. dir/x=y/), then 
ListingFileCatalog.allFiles returns the files in the dir as Seq[String] which 
internally is a Stream[String]. This is because of this 
[line|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L93],
 where a LinkedHashSet.values.toSeq returns Stream. Then when the 
[FileStreamSource|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L79]
 filters this Stream[String] to remove the seen files, it creates a new 
Stream[String], which has a filter function that has a $outer reference to the 
FileStreamSource (in Scala 2.10). Trying to serialize this Stream[String] 
causes NotSerializableException. This will happened even if there is just one 
file in the dir.

Its important to note that this behavior is different in Scala 2.11. There is 
no $outer reference to FileStreamSource, so it does not throw 
NotSerializableException. However, with a large sequence of files (tested with 
5000 files), it throws StackOverflowError. This is because how Stream class is 
implemented. Its basically like a linked list, and attempting to serialize a 
long Stream requires *recursively* going through linked list, thus resulting in 
StackOverflowError.

The right solution is to convert the seq to an array before writing to the log.




> Running a file stream on a directory with partitioned subdirs throw 
> NotSerializableException/StackOverflowError
> ---
>
> Key: SPARK-17372
> URL: https://issues.apache.org/jira/browse/SPARK-17372
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> When we create a filestream on a directory that has partitioned subdirs (i.e. 
> dir/x=y/), then ListingFileCatalog.allFiles returns the files in the dir as 
> Seq[String] which internally is a Stream[String]. This is because of this 
> [line|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L93],
>  where a LinkedHashSet.values.toSeq returns Stream. Then when the 
> [FileStreamSource|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L79]
>  filters this Stream[String] to remove the seen files, it creates a new 
> Stream[String], which has a filter function that has a $outer reference to 
> the FileStreamSource (in Scala 2.10). Trying to serialize this Stream[String] 
> causes NotSerializableException. This happens even if there is just one 
> file in the dir.
> It's important to note that this behavior is different in Scala 2.11. There is 
> no $outer reference to FileStreamSource, so it does not throw 
> NotSerializableException. However, with a large sequence of files (tested 
> 

[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0

2016-09-06 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469092#comment-15469092
 ] 

Sital Kedia commented on SPARK-16922:
-

I observed no noticeable performance gain compared to BytesToBytesMap. 
Part of the reason might be that BroadcastHashJoin was consuming only a small 
percentage of the total job time. 

> Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
> -
>
> Key: SPARK-16922
> URL: https://issues.apache.org/jira/browse/SPARK-16922
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Davies Liu
> Fix For: 2.0.1, 2.1.0
>
>
> A query which used to work in Spark 1.6 fails with executor OOM in 2.0.
> Stack trace - 
> {code}
>   at 
> org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Query plan in Spark 1.6
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3])
> +- TungstenExchange hashpartitioning(field1#101,200), None
>+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111])
>   +- Project [field1#101,field2#74]
>  +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as 
> decimal(20,0)) as bigint)], BuildRight
> :- ConvertToUnsafe
> :  +- HiveTableScan [field2#74,field5#63L], MetastoreRelation 
> foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)]
> +- ConvertToUnsafe
>+- HiveTableScan [field1#101,field4#97], MetastoreRelation 
> foo, table2, Some(b)
> {code}
> Query plan in 2.0
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))])
> +- Exchange hashpartitioning(field1#160, 200)
>+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / 
> 100.0))])
>   +- *Project [field2#133, field1#160]
>  +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as 
> decimal(20,0)) as bigint)], Inner, BuildRight
> :- *Filter isnotnull(field5#122L)
> :  +- HiveTableScan [field5#122L, field2#133], MetastoreRelation 
> foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= 
> 2013-12-31)]
> +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as 
> decimal(20,0)) as bigint)))
>+- *Filter isnotnull(field4#156)
>   +- HiveTableScan [field4#156, field1#160], 
> MetastoreRelation foo, table2, b
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17372) Running a file stream on a directory with partitioned subdirs throw NotSerializableException/StackOverflowError

2016-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15469072#comment-15469072
 ] 

Apache Spark commented on SPARK-17372:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/14987

> Running a file stream on a directory with partitioned subdirs throw 
> NotSerializableException/StackOverflowError
> ---
>
> Key: SPARK-17372
> URL: https://issues.apache.org/jira/browse/SPARK-17372
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Here is the result of my investigation. When we create a filestream on a 
> directory that has partitioned subdirs (i.e. dir/x=y/), then 
> ListingFileCatalog.allFiles returns the files in the dir as Seq[String] which 
> internally is a Stream[String]. This is because of this 
> [line|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L93],
>  where a LinkedHashSet.values.toSeq returns Stream. Then when the 
> [FileStreamSource|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L79]
>  filters this Stream[String] to remove the seen files, it creates a new 
> Stream[String], which has a filter function that has a $outer reference to 
> the FileStreamSource (in Scala 2.10). Trying to serialize this Stream[String] 
> causes NotSerializableException. This happens even if there is just one 
> file in the dir.
> It's important to note that this behavior is different in Scala 2.11. There is 
> no $outer reference to FileStreamSource, so it does not throw 
> NotSerializableException. However, with a large sequence of files (tested 
> with 5000 files), it throws StackOverflowError. This is because of how the 
> Stream class is implemented: it is basically a linked list, and attempting to 
> serialize a long Stream requires *recursively* walking the linked list, 
> thus resulting in StackOverflowError.
> The right solution is to convert the seq to an array before writing to the 
> log.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17372) Running a file stream on a directory with partitioned subdirs throw NotSerializableException/StackOverflowError

2016-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17372:


Assignee: Tathagata Das  (was: Apache Spark)

> Running a file stream on a directory with partitioned subdirs throw 
> NotSerializableException/StackOverflowError
> ---
>
> Key: SPARK-17372
> URL: https://issues.apache.org/jira/browse/SPARK-17372
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Here is the result of my investigation. When we create a filestream on a 
> directory that has partitioned subdirs (i.e. dir/x=y/), then 
> ListingFileCatalog.allFiles returns the files in the dir as Seq[String] which 
> internally is a Stream[String]. This is because of this 
> [line|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L93],
>  where a LinkedHashSet.values.toSeq returns Stream. Then when the 
> [FileStreamSource|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L79]
>  filters this Stream[String] to remove the seen files, it creates a new 
> Stream[String], which has a filter function that has a $outer reference to 
> the FileStreamSource (in Scala 2.10). Trying to serialize this Stream[String] 
> causes NotSerializableException. This happens even if there is just one 
> file in the dir.
> It's important to note that this behavior is different in Scala 2.11. There is 
> no $outer reference to FileStreamSource, so it does not throw 
> NotSerializableException. However, with a large sequence of files (tested 
> with 5000 files), it throws StackOverflowError. This is because of how the 
> Stream class is implemented: it is basically a linked list, and attempting to 
> serialize a long Stream requires *recursively* walking the linked list, 
> thus resulting in StackOverflowError.
> The right solution is to convert the seq to an array before writing to the 
> log.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17372) Running a file stream on a directory with partitioned subdirs throw NotSerializableException/StackOverflowError

2016-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17372:


Assignee: Apache Spark  (was: Tathagata Das)

> Running a file stream on a directory with partitioned subdirs throw 
> NotSerializableException/StackOverflowError
> ---
>
> Key: SPARK-17372
> URL: https://issues.apache.org/jira/browse/SPARK-17372
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> Here is the result of my investigation. When we create a filestream on a 
> directory that has partitioned subdirs (i.e. dir/x=y/), then 
> ListingFileCatalog.allFiles returns the files in the dir as Seq[String] which 
> internally is a Stream[String]. This is because of this 
> [line|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L93],
>  where a LinkedHashSet.values.toSeq returns Stream. Then when the 
> [FileStreamSource|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L79]
>  filters this Stream[String] to remove the seen files, it creates a new 
> Stream[String], which has a filter function that has a $outer reference to 
> the FileStreamSource (in Scala 2.10). Trying to serialize this Stream[String] 
> causes NotSerializableException. This happens even if there is just one 
> file in the dir.
> It's important to note that this behavior is different in Scala 2.11. There is 
> no $outer reference to FileStreamSource, so it does not throw 
> NotSerializableException. However, with a large sequence of files (tested 
> with 5000 files), it throws StackOverflowError. This is because of how the 
> Stream class is implemented: it is basically a linked list, and attempting to 
> serialize a long Stream requires *recursively* walking the linked list, 
> thus resulting in StackOverflowError.
> The right solution is to convert the seq to an array before writing to the 
> log.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17408) Flaky test: org.apache.spark.sql.hive.StatisticsSuite

2016-09-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17408:

Assignee: Xiao Li

> Flaky test: org.apache.spark.sql.hive.StatisticsSuite
> -
>
> Key: SPARK-17408
> URL: https://issues.apache.org/jira/browse/SPARK-17408
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64956/testReport/junit/org.apache.spark.sql.hive/StatisticsSuite/test_statistics_of_LogicalRelation_converted_from_MetastoreRelation/
> {code}
> org.apache.spark.sql.hive.StatisticsSuite.test statistics of LogicalRelation 
> converted from MetastoreRelation
> Failing for the past 1 build (Since Failed#64956 )
> Took 1.4 sec.
> Error Message
> org.scalatest.exceptions.TestFailedException: 6871 did not equal 4236
> Stacktrace
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 6871 
> did not equal 4236
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite$$anonfun$14.applyOrElse(StatisticsSuite.scala:247)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite$$anonfun$14.applyOrElse(StatisticsSuite.scala:241)
>   at scala.PartialFunction$Lifted.apply(PartialFunction.scala:223)
>   at scala.PartialFunction$Lifted.apply(PartialFunction.scala:219)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$collect$1.apply(TreeNode.scala:158)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$collect$1.apply(TreeNode.scala:158)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:117)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreach$1.apply(TreeNode.scala:118)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreach$1.apply(TreeNode.scala:118)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:118)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreach$1.apply(TreeNode.scala:118)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreach$1.apply(TreeNode.scala:118)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:118)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.collect(TreeNode.scala:158)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite.org$apache$spark$sql$hive$StatisticsSuite$$checkLogicalRelationStats(StatisticsSuite.scala:241)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite$$anonfun$6$$anonfun$apply$mcV$sp$3$$anonfun$apply$mcV$sp$4.apply$mcV$sp(StatisticsSuite.scala:271)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withSQLConf(SQLTestUtils.scala:99)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite.withSQLConf(StatisticsSuite.scala:35)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite$$anonfun$6$$anonfun$apply$mcV$sp$3.apply$mcV$sp(StatisticsSuite.scala:268)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTable(SQLTestUtils.scala:168)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite.withTable(StatisticsSuite.scala:35)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite$$anonfun$6.apply$mcV$sp(StatisticsSuite.scala:260)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite$$anonfun$6.apply(StatisticsSuite.scala:257)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite$$anonfun$6.apply(StatisticsSuite.scala:257)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:57)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)

[jira] [Resolved] (SPARK-17408) Flaky test: org.apache.spark.sql.hive.StatisticsSuite

2016-09-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17408.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14978
[https://github.com/apache/spark/pull/14978]

> Flaky test: org.apache.spark.sql.hive.StatisticsSuite
> -
>
> Key: SPARK-17408
> URL: https://issues.apache.org/jira/browse/SPARK-17408
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
> Fix For: 2.1.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64956/testReport/junit/org.apache.spark.sql.hive/StatisticsSuite/test_statistics_of_LogicalRelation_converted_from_MetastoreRelation/
> {code}
> org.apache.spark.sql.hive.StatisticsSuite.test statistics of LogicalRelation 
> converted from MetastoreRelation
> Failing for the past 1 build (Since Failed#64956 )
> Took 1.4 sec.
> Error Message
> org.scalatest.exceptions.TestFailedException: 6871 did not equal 4236
> Stacktrace
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 6871 
> did not equal 4236
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite$$anonfun$14.applyOrElse(StatisticsSuite.scala:247)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite$$anonfun$14.applyOrElse(StatisticsSuite.scala:241)
>   at scala.PartialFunction$Lifted.apply(PartialFunction.scala:223)
>   at scala.PartialFunction$Lifted.apply(PartialFunction.scala:219)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$collect$1.apply(TreeNode.scala:158)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$collect$1.apply(TreeNode.scala:158)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:117)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreach$1.apply(TreeNode.scala:118)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreach$1.apply(TreeNode.scala:118)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:118)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreach$1.apply(TreeNode.scala:118)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreach$1.apply(TreeNode.scala:118)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:118)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.collect(TreeNode.scala:158)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite.org$apache$spark$sql$hive$StatisticsSuite$$checkLogicalRelationStats(StatisticsSuite.scala:241)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite$$anonfun$6$$anonfun$apply$mcV$sp$3$$anonfun$apply$mcV$sp$4.apply$mcV$sp(StatisticsSuite.scala:271)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withSQLConf(SQLTestUtils.scala:99)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite.withSQLConf(StatisticsSuite.scala:35)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite$$anonfun$6$$anonfun$apply$mcV$sp$3.apply$mcV$sp(StatisticsSuite.scala:268)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTable(SQLTestUtils.scala:168)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite.withTable(StatisticsSuite.scala:35)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite$$anonfun$6.apply$mcV$sp(StatisticsSuite.scala:260)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite$$anonfun$6.apply(StatisticsSuite.scala:257)
>   at 
> org.apache.spark.sql.hive.StatisticsSuite$$anonfun$6.apply(StatisticsSuite.scala:257)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:57)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 

[jira] [Updated] (SPARK-17371) Resubmitted stage outputs deleted by zombie map tasks on stop()

2016-09-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17371:
---
Assignee: Eric Liang

> Resubmitted stage outputs deleted by zombie map tasks on stop()
> ---
>
> Key: SPARK-17371
> URL: https://issues.apache.org/jira/browse/SPARK-17371
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.1.0
>
>
> It seems that old shuffle map tasks hanging around after a stage resubmit 
> will delete intended shuffle output files on stop(), causing downstream 
> stages to fail even after successful resubmit completion. This can happen 
> easily if the prior map task is waiting for a network timeout when its stage 
> is resubmitted.
> This can cause unnecessary stage resubmits, sometimes multiple times, and 
> very confusing FetchFailure messages that report shuffle index files missing 
> from the local disk.
> Given that IndexShuffleBlockResolver commits data atomically, it seems 
> unnecessary to ever delete committed task output: even in the rare case that 
> a task is failed after it finishes committing shuffle output, it should be 
> safe to retain that output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17371) Resubmitted stage outputs deleted by zombie map tasks on stop()

2016-09-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-17371.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14932
[https://github.com/apache/spark/pull/14932]

> Resubmitted stage outputs deleted by zombie map tasks on stop()
> ---
>
> Key: SPARK-17371
> URL: https://issues.apache.org/jira/browse/SPARK-17371
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Eric Liang
> Fix For: 2.1.0
>
>
> It seems that old shuffle map tasks hanging around after a stage resubmit 
> will delete intended shuffle output files on stop(), causing downstream 
> stages to fail even after successful resubmit completion. This can happen 
> easily if the prior map task is waiting for a network timeout when its stage 
> is resubmitted.
> This can cause unnecessary stage resubmits, sometimes multiple times, and 
> very confusing FetchFailure messages that report shuffle index files missing 
> from the local disk.
> Given that IndexShuffleBlockResolver commits data atomically, it seems 
> unnecessary to ever delete committed task output: even in the rare case that 
> a task is failed after it finishes committing shuffle output, it should be 
> safe to retain that output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17423) Support IGNORE NULLS option in Window functions

2016-09-06 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468973#comment-15468973
 ] 

Herman van Hovell commented on SPARK-17423:
---

LEAD and LAG with ignore nulls seem a bit dodgy to me.

We support `FIRST` and `LAST` with ignore nulls. You could use these in 
combination with a window frame to get the same functionality. For instance, 
the following query finds the first non-null value within the current row and 
the 5 preceding rows:
{noformat}
SELECT *,
   FIRST(value, true) OVER (PARTITION BY grp ORDER BY date ROWS BETWEEN 5 
PRECEDING AND CURRENT ROW)
FROM   tbl
{noformat}
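For comparison, a rough DataFrame-API version of the same idea (a sketch only; it assumes a DataFrame `df` with columns `grp`, `date` and `value`):
{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first}

// first(..., ignoreNulls = true) over a "5 preceding to current row" frame.
val w = Window.partitionBy("grp").orderBy("date").rowsBetween(-5, 0)
val result = df.withColumn(
  "first_non_null",
  first(col("value"), ignoreNulls = true).over(w))
{code}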

> Support IGNORE NULLS option in Window functions
> ---
>
> Key: SPARK-17423
> URL: https://issues.apache.org/jira/browse/SPARK-17423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tim Chan
>Priority: Minor
>
> http://stackoverflow.com/questions/24338119/is-it-possible-to-ignore-null-values-when-using-lag-and-lead-functions-in-sq



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17421) Warnings about "MaxPermSize" parameter when building with Maven and Java 8

2016-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17421:


Assignee: Apache Spark

> Warnings about "MaxPermSize" parameter when building with Maven and Java 8
> --
>
> Key: SPARK-17421
> URL: https://issues.apache.org/jira/browse/SPARK-17421
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Frederick Reiss
>Assignee: Apache Spark
>Priority: Minor
>
> When building Spark with {{build/mvn}} or {{dev/run-tests}}, a Java warning 
> appears repeatedly on STDERR:
> {{OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support 
> was removed in 8.0}}
> This warning is due to {{build/mvn}} adding the {{-XX:MaxPermSize=512M}} 
> option to {{MAVEN_OPTS}}. When compiling with Java 7, this parameter is 
> essential. With Java 8, the parameter leads to the warning above.
> Because {{build/mvn}} adds {{MaxPermSize}} to {{MAVEN_OPTS}} even when that 
> environment variable doesn't already contain the option, setting 
> {{MAVEN_OPTS}} to a string that does not contain {{MaxPermSize}} has no effect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17421) Warnings about "MaxPermSize" parameter when building with Maven and Java 8

2016-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468943#comment-15468943
 ] 

Apache Spark commented on SPARK-17421:
--

User 'frreiss' has created a pull request for this issue:
https://github.com/apache/spark/pull/14986

> Warnings about "MaxPermSize" parameter when building with Maven and Java 8
> --
>
> Key: SPARK-17421
> URL: https://issues.apache.org/jira/browse/SPARK-17421
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Frederick Reiss
>Priority: Minor
>
> When building Spark with {{build/mvn}} or {{dev/run-tests}}, a Java warning 
> appears repeatedly on STDERR:
> {{OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support 
> was removed in 8.0}}
> This warning is due to {{build/mvn}} adding the {{-XX:MaxPermSize=512M}} 
> option to {{MAVEN_OPTS}}. When compiling with Java 7, this parameter is 
> essential. With Java 8, the parameter leads to the warning above.
> Because {{build/mvn}} adds {{MaxPermSize}} to {{MAVEN_OPTS}} even when that 
> environment variable doesn't already contain the option, setting 
> {{MAVEN_OPTS}} to a string that does not contain {{MaxPermSize}} has no effect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17421) Warnings about "MaxPermSize" parameter when building with Maven and Java 8

2016-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17421:


Assignee: (was: Apache Spark)

> Warnings about "MaxPermSize" parameter when building with Maven and Java 8
> --
>
> Key: SPARK-17421
> URL: https://issues.apache.org/jira/browse/SPARK-17421
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Frederick Reiss
>Priority: Minor
>
> When building Spark with {{build/mvn}} or {{dev/run-tests}}, a Java warning 
> appears repeatedly on STDERR:
> {{OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support 
> was removed in 8.0}}
> This warning is due to {{build/mvn}} adding the {{-XX:MaxPermSize=512M}} 
> option to {{MAVEN_OPTS}}. When compiling with Java 7, this parameter is 
> essential. With Java 8, the parameter leads to the warning above.
> Because {{build/mvn}} adds {{MaxPermSize}} to {{MAVEN_OPTS}} even when that 
> environment variable doesn't already contain the option, setting 
> {{MAVEN_OPTS}} to a string that does not contain {{MaxPermSize}} has no effect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17396) Threads number keep increasing when query on external CSV partitioned table

2016-09-06 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468932#comment-15468932
 ] 

Ryan Blue commented on SPARK-17396:
---

I opened a PR with a fix. It still uses a ForkJoinPool because the thread pool 
task support is marked as deprecated in Scala. So I guess we should use 
fork/join.
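
For anyone following along, the general pattern is roughly the following standalone sketch (not the PR itself; the pool size and the mapped work are placeholders):
{code}
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

// Share one bounded ForkJoinPool across the parallel collection instead of
// letting every query create fresh threads.
val pool = new ForkJoinPool(8)
val partitions = (1 to 16).map(i => s"index=$i").par
partitions.tasksupport = new ForkJoinTaskSupport(pool)
val listed = partitions.map(p => s"listing $p")   // placeholder for real listing work
pool.shutdown()
{code}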

> Threads number keep increasing when query on external CSV partitioned table
> ---
>
> Key: SPARK-17396
> URL: https://issues.apache.org/jira/browse/SPARK-17396
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: pin_zhang
>
> 1. Create an external partitioned table with CSV row format
> 2. Add 16 partitions to the table
> 3. Run SQL "select count(*) from test_csv"
> 4. The number of ForkJoin threads keeps increasing 
> This happens when the number of table partitions is greater than 10.
> 5. Test code
> {code:lang=scala}
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.hive.HiveContext
> object Bugs {
>   def main(args: Array[String]): Unit = {
> val location = "file:///g:/home/test/csv"
> val create = s"""CREATE   EXTERNAL  TABLE  test_csv
>  (ID string,  SEQ string )
>   PARTITIONED BY(index int)
>   ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
>   LOCATION "${location}" 
>   """
> val add_part = s"""
>   ALTER TABLE test_csv ADD 
>   PARTITION (index=1)LOCATION '${location}/index=1'
>   PARTITION (index=2)LOCATION '${location}/index=2'
>   PARTITION (index=3)LOCATION '${location}/index=3'
>   PARTITION (index=4)LOCATION '${location}/index=4'
>   PARTITION (index=5)LOCATION '${location}/index=5'
>   PARTITION (index=6)LOCATION '${location}/index=6'
>   PARTITION (index=7)LOCATION '${location}/index=7'
>   PARTITION (index=8)LOCATION '${location}/index=8'
>   PARTITION (index=9)LOCATION '${location}/index=9'
>   PARTITION (index=10)LOCATION '${location}/index=10'
>   PARTITION (index=11)LOCATION '${location}/index=11'
>   PARTITION (index=12)LOCATION '${location}/index=12'
>   PARTITION (index=13)LOCATION '${location}/index=13'
>   PARTITION (index=14)LOCATION '${location}/index=14'
>   PARTITION (index=15)LOCATION '${location}/index=15'
>   PARTITION (index=16)LOCATION '${location}/index=16'
> """
> val conf = new SparkConf().setAppName("scala").setMaster("local[2]")
> conf.set("spark.sql.warehouse.dir", "file:///g:/home/warehouse")
> val ctx = new SparkContext(conf)
> val hctx = new HiveContext(ctx)
> hctx.sql(create)
> hctx.sql(add_part)
>  for (i <- 1 to 6) {
>   new Query(hctx).start()
> }
>   }
>   class Query(htcx: HiveContext) extends Thread {
> setName("Query-Thread")
> override def run = {
>   while (true) {
> htcx.sql("select count(*) from test_csv").show()
> Thread.sleep(100)
>   }
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17396) Threads number keep increasing when query on external CSV partitioned table

2016-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468929#comment-15468929
 ] 

Apache Spark commented on SPARK-17396:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/14985

> Threads number keep increasing when query on external CSV partitioned table
> ---
>
> Key: SPARK-17396
> URL: https://issues.apache.org/jira/browse/SPARK-17396
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: pin_zhang
>
> 1. Create an external partitioned table with CSV row format
> 2. Add 16 partitions to the table
> 3. Run SQL "select count(*) from test_csv"
> 4. The number of ForkJoin threads keeps increasing 
> This happens when the number of table partitions is greater than 10.
> 5. Test code
> {code:lang=scala}
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.hive.HiveContext
> object Bugs {
>   def main(args: Array[String]): Unit = {
> val location = "file:///g:/home/test/csv"
> val create = s"""CREATE   EXTERNAL  TABLE  test_csv
>  (ID string,  SEQ string )
>   PARTITIONED BY(index int)
>   ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
>   LOCATION "${location}" 
>   """
> val add_part = s"""
>   ALTER TABLE test_csv ADD 
>   PARTITION (index=1)LOCATION '${location}/index=1'
>   PARTITION (index=2)LOCATION '${location}/index=2'
>   PARTITION (index=3)LOCATION '${location}/index=3'
>   PARTITION (index=4)LOCATION '${location}/index=4'
>   PARTITION (index=5)LOCATION '${location}/index=5'
>   PARTITION (index=6)LOCATION '${location}/index=6'
>   PARTITION (index=7)LOCATION '${location}/index=7'
>   PARTITION (index=8)LOCATION '${location}/index=8'
>   PARTITION (index=9)LOCATION '${location}/index=9'
>   PARTITION (index=10)LOCATION '${location}/index=10'
>   PARTITION (index=11)LOCATION '${location}/index=11'
>   PARTITION (index=12)LOCATION '${location}/index=12'
>   PARTITION (index=13)LOCATION '${location}/index=13'
>   PARTITION (index=14)LOCATION '${location}/index=14'
>   PARTITION (index=15)LOCATION '${location}/index=15'
>   PARTITION (index=16)LOCATION '${location}/index=16'
> """
> val conf = new SparkConf().setAppName("scala").setMaster("local[2]")
> conf.set("spark.sql.warehouse.dir", "file:///g:/home/warehouse")
> val ctx = new SparkContext(conf)
> val hctx = new HiveContext(ctx)
> hctx.sql(create)
> hctx.sql(add_part)
>  for (i <- 1 to 6) {
>   new Query(hctx).start()
> }
>   }
>   class Query(htcx: HiveContext) extends Thread {
> setName("Query-Thread")
> override def run = {
>   while (true) {
> htcx.sql("select count(*) from test_csv").show()
> Thread.sleep(100)
>   }
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17296) Spark SQL: cross join + two joins = BUG

2016-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468928#comment-15468928
 ] 

Apache Spark commented on SPARK-17296:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/14984

> Spark SQL: cross join + two joins = BUG
> ---
>
> Key: SPARK-17296
> URL: https://issues.apache.org/jira/browse/SPARK-17296
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>Assignee: Herman van Hovell
> Fix For: 2.1.0
>
>
> In spark shell :
> {code}
> CREATE TABLE test (col INT) ;
> INSERT OVERWRITE TABLE test VALUES (1), (2) ;
> SELECT 
> COUNT(1)
> FROM test T1 
> CROSS JOIN test T2
> JOIN test T3
> ON T3.col = T1.col
> JOIN test T4
> ON T4.col = T1.col
> ;
> {code}
> returns :
> {code}
> Error in query: cannot resolve '`T1.col`' given input columns: [col, col]; 
> line 6 pos 12
> {code}
> Apparently, this example is minimal (removing the CROSS JOIN or one of the 
> other JOINs makes the issue go away).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17396) Threads number keep increasing when query on external CSV partitioned table

2016-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17396:


Assignee: Apache Spark

> Threads number keep increasing when query on external CSV partitioned table
> ---
>
> Key: SPARK-17396
> URL: https://issues.apache.org/jira/browse/SPARK-17396
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: pin_zhang
>Assignee: Apache Spark
>
> 1. Create an external partitioned table with CSV row format
> 2. Add 16 partitions to the table
> 3. Run SQL "select count(*) from test_csv"
> 4. The number of ForkJoin threads keeps increasing 
> This happens when the number of table partitions is greater than 10.
> 5. Test code
> {code:lang=scala}
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.hive.HiveContext
> object Bugs {
>   def main(args: Array[String]): Unit = {
> val location = "file:///g:/home/test/csv"
> val create = s"""CREATE   EXTERNAL  TABLE  test_csv
>  (ID string,  SEQ string )
>   PARTITIONED BY(index int)
>   ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
>   LOCATION "${location}" 
>   """
> val add_part = s"""
>   ALTER TABLE test_csv ADD 
>   PARTITION (index=1)LOCATION '${location}/index=1'
>   PARTITION (index=2)LOCATION '${location}/index=2'
>   PARTITION (index=3)LOCATION '${location}/index=3'
>   PARTITION (index=4)LOCATION '${location}/index=4'
>   PARTITION (index=5)LOCATION '${location}/index=5'
>   PARTITION (index=6)LOCATION '${location}/index=6'
>   PARTITION (index=7)LOCATION '${location}/index=7'
>   PARTITION (index=8)LOCATION '${location}/index=8'
>   PARTITION (index=9)LOCATION '${location}/index=9'
>   PARTITION (index=10)LOCATION '${location}/index=10'
>   PARTITION (index=11)LOCATION '${location}/index=11'
>   PARTITION (index=12)LOCATION '${location}/index=12'
>   PARTITION (index=13)LOCATION '${location}/index=13'
>   PARTITION (index=14)LOCATION '${location}/index=14'
>   PARTITION (index=15)LOCATION '${location}/index=15'
>   PARTITION (index=16)LOCATION '${location}/index=16'
> """
> val conf = new SparkConf().setAppName("scala").setMaster("local[2]")
> conf.set("spark.sql.warehouse.dir", "file:///g:/home/warehouse")
> val ctx = new SparkContext(conf)
> val hctx = new HiveContext(ctx)
> hctx.sql(create)
> hctx.sql(add_part)
>  for (i <- 1 to 6) {
>   new Query(hctx).start()
> }
>   }
>   class Query(htcx: HiveContext) extends Thread {
> setName("Query-Thread")
> override def run = {
>   while (true) {
> htcx.sql("select count(*) from test_csv").show()
> Thread.sleep(100)
>   }
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17396) Threads number keep increasing when query on external CSV partitioned table

2016-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17396:


Assignee: (was: Apache Spark)

> Threads number keep increasing when query on external CSV partitioned table
> ---
>
> Key: SPARK-17396
> URL: https://issues.apache.org/jira/browse/SPARK-17396
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: pin_zhang
>
> 1. Create an external partitioned table with CSV row format
> 2. Add 16 partitions to the table
> 3. Run SQL "select count(*) from test_csv"
> 4. The number of ForkJoin threads keeps increasing 
> This happens when the number of table partitions is greater than 10.
> 5. Test code
> {code:lang=scala}
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.hive.HiveContext
> object Bugs {
>   def main(args: Array[String]): Unit = {
> val location = "file:///g:/home/test/csv"
> val create = s"""CREATE   EXTERNAL  TABLE  test_csv
>  (ID string,  SEQ string )
>   PARTITIONED BY(index int)
>   ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
>   LOCATION "${location}" 
>   """
> val add_part = s"""
>   ALTER TABLE test_csv ADD 
>   PARTITION (index=1)LOCATION '${location}/index=1'
>   PARTITION (index=2)LOCATION '${location}/index=2'
>   PARTITION (index=3)LOCATION '${location}/index=3'
>   PARTITION (index=4)LOCATION '${location}/index=4'
>   PARTITION (index=5)LOCATION '${location}/index=5'
>   PARTITION (index=6)LOCATION '${location}/index=6'
>   PARTITION (index=7)LOCATION '${location}/index=7'
>   PARTITION (index=8)LOCATION '${location}/index=8'
>   PARTITION (index=9)LOCATION '${location}/index=9'
>   PARTITION (index=10)LOCATION '${location}/index=10'
>   PARTITION (index=11)LOCATION '${location}/index=11'
>   PARTITION (index=12)LOCATION '${location}/index=12'
>   PARTITION (index=13)LOCATION '${location}/index=13'
>   PARTITION (index=14)LOCATION '${location}/index=14'
>   PARTITION (index=15)LOCATION '${location}/index=15'
>   PARTITION (index=16)LOCATION '${location}/index=16'
> """
> val conf = new SparkConf().setAppName("scala").setMaster("local[2]")
> conf.set("spark.sql.warehouse.dir", "file:///g:/home/warehouse")
> val ctx = new SparkContext(conf)
> val hctx = new HiveContext(ctx)
> hctx.sql(create)
> hctx.sql(add_part)
>  for (i <- 1 to 6) {
>   new Query(hctx).start()
> }
>   }
>   class Query(htcx: HiveContext) extends Thread {
> setName("Query-Thread")
> override def run = {
>   while (true) {
> htcx.sql("select count(*) from test_csv").show()
> Thread.sleep(100)
>   }
> }
>   }
> }
> {code}
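
One way to watch the reported symptom while the repro above runs. This is only an 
illustrative sketch (the "thread-counter" helper below is not part of the report); 
it simply counts live threads whose names contain "ForkJoin":

{code}
import scala.collection.JavaConverters._

// Periodically count live ForkJoin worker threads in the driver JVM while the
// repro above is running; with the reported bug, this count keeps growing.
new Thread("thread-counter") {
  override def run(): Unit = {
    while (true) {
      val forkJoinThreads = Thread.getAllStackTraces.keySet.asScala
        .count(_.getName.toLowerCase.contains("forkjoin"))
      println(s"ForkJoin threads: $forkJoinThreads")
      Thread.sleep(5000)
    }
  }
}.start()
{code}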



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17424) Dataset job fails from unsound substitution in ScalaReflect

2016-09-06 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-17424:
--
Description: 
I have a job that uses datasets in 1.6.1 and is failing with this error:

{code}
16/09/02 17:02:56 ERROR Driver ApplicationMaster: User class threw exception: 
java.lang.AssertionError: assertion failed: Unsound substitution from List(type 
T, type U) to List()
java.lang.AssertionError: assertion failed: Unsound substitution from List(type 
T, type U) to List()
at scala.reflect.internal.Types$SubstMap.<init>(Types.scala:4644)
at scala.reflect.internal.Types$SubstTypeMap.<init>(Types.scala:4761)
at scala.reflect.internal.Types$Type.subst(Types.scala:796)
at scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:321)
at scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:298)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:769)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:768)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.getConstructorParameters(ScalaReflection.scala:768)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:610)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames$lzycompute(TreeNode.scala:418)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames(TreeNode.scala:418)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:415)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:414)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toMap(TraversableOnce.scala:279)
at scala.collection.AbstractIterator.toMap(Iterator.scala:1157)
at org.apache.spark.sql.catalyst.trees.TreeNode.argsMap(TreeNode.scala:416)
at 
org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:46)
at 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
at 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44)
at 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
at 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44)
at 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
at 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:51)
at 

[jira] [Created] (SPARK-17424) Dataset job fails from unsound substitution in ScalaReflect

2016-09-06 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-17424:
-

 Summary: Dataset job fails from unsound substitution in 
ScalaReflect
 Key: SPARK-17424
 URL: https://issues.apache.org/jira/browse/SPARK-17424
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0, 1.6.1
Reporter: Ryan Blue


I have a job that uses datasets in 1.6.1 and is failing with this error:

{code}
16/09/02 17:02:56 ERROR Driver ApplicationMaster: User class threw exception: 
java.lang.AssertionError: assertion failed: Unsound substitution from List(type 
T, type U) to List()
java.lang.AssertionError: assertion failed: Unsound substitution from List(type 
T, type U) to List()
at scala.reflect.internal.Types$SubstMap.<init>(Types.scala:4644)
at scala.reflect.internal.Types$SubstTypeMap.<init>(Types.scala:4761)
at scala.reflect.internal.Types$Type.subst(Types.scala:796)
at scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:321)
at scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:298)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:769)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:768)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.getConstructorParameters(ScalaReflection.scala:768)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:610)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames$lzycompute(TreeNode.scala:418)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames(TreeNode.scala:418)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:415)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:414)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toMap(TraversableOnce.scala:279)
at scala.collection.AbstractIterator.toMap(Iterator.scala:1157)
at org.apache.spark.sql.catalyst.trees.TreeNode.argsMap(TreeNode.scala:416)
at 
org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:46)
at 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
at 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44)
at 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
at 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44)
at 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
at 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)

[jira] [Commented] (SPARK-16026) Cost-based Optimizer framework

2016-09-06 Thread Srinath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468891#comment-15468891
 ] 

Srinath commented on SPARK-16026:
-

I have a couple of comments/questions on the proposal.
Regarding the join reordering algorithm: 
One of the big wins we should be able to get is avoiding shuffles/broadcasts. 
If the costing and dynamic programming algorithm doesn't take into account exchange 
costs and output partitioning, we may produce some bad plans.
Here's an example: Suppose we start with completely unpartitioned tables A(a), 
B(b1, b2), C(c) and D(d), in increasing order of size and let's assume none of 
them are small enough to broadcast. Suppose we want to optimize the following 
join 
(A join B on A.a = B.b1) join C on (B.b2 = C.c) join D on (B.b1 = D.d).
Since A, B, C and D are in increasing order of size and we try to minimize 
intermediate result size, we end up with the following “cheapest” plan (join 
order A-B-C-D):
{noformat}
Plan I
Join(B.b1 = D.d)
|-Exchange(b1)
|   Join(B.b2 = c)
|   |-Exchange(b2)
|   |   Join(A.a = B.b1)
|   |   |-Exchange(a)
|   |   |   A
|   |   | Exchange(b1)
|   |       B
|   | Exchange(c)
|       C
|-Exchange(d)
    D
{noformat}
Ignoring leaf node sizes, the cost according to the proposed model, i.e. the total 
intermediate data size, is size(AB) + size(ABC). This is also the size of 
intermediate data exchanged.
But a better plan may be to join to D before C (i.e. join order A-B-D-C) 
because that would avoid a re-shuffle 
{noformat}
Plan II
Join(B.b2 = C.c)
|-Exchange(B.b2)
|   Join (B.b1 = d)
|   |-Join(A.a = B.b1)
|   | |-Exchange(a)
|   | |   A
|   | | Exchange(b1)
|   |     B  
|   |-Exchange(d)
|       D
|-Exchange(c)
    C
{noformat}
The cost of this plan, i.e. the total intermediate data size, is size(AB) + 
size(ABD), which is higher than that of Plan I. But the size of intermediate data 
exchanged is size(ABD), which may be lower than the size(AB) + size(ABC) of Plan I. 
This plan could be significantly faster as a result.

It should be relatively painless to incorporate partition-awareness into the 
dynamic programming proposal for cost-based join ordering, with a couple of 
tweaks:
i) Take into account intermediate data exchanged, not just total intermediate 
data. For example, a good and simple start would be to use (exchanged-data, 
total-data) as the cost function, with a preference for the former (i.e. prefer 
lower exchanged data, and lower total-data if the exchanged data is the same); 
see the sketch after this list. You could certainly have a more complex model, though. 
ii) Preserve (i.e. don't prune) partial plans based on output partitioning. 
E.g. consider a partial plan involving A, B and C: A join B join C may have a 
different output partitioning than A join C join B. If ACB is more expensive 
but has an output partitioning scheme that is useful for further joins, it's 
worth preserving.
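
As a concrete illustration of tweak (i), the pruning step could compare partial 
plans with a lexicographic cost along these lines. This is only a sketch of the 
idea; the PlanCost case class and the byte figures are illustrative, not part of 
any existing Spark API:

{code}
// Illustrative cost record for a partial join plan: bytes exchanged
// (shuffled/broadcast) vs. total intermediate bytes produced.
case class PlanCost(exchangedBytes: Long, totalBytes: Long)

object PlanCost {
  // Tweak (i): prefer the plan that exchanges less data, and break ties
  // on total intermediate data.
  implicit val ordering: Ordering[PlanCost] =
    Ordering.by((c: PlanCost) => (c.exchangedBytes, c.totalBytes))
}

// Using made-up sizes size(AB) = 300, size(ABC) = 500, size(ABD) = 600:
val planI  = PlanCost(exchangedBytes = 300 + 500, totalBytes = 300 + 500)
val planII = PlanCost(exchangedBytes = 600, totalBytes = 300 + 600)

// Plan II wins even though its total intermediate size is larger,
// because it exchanges less data.
val cheaper = Seq(planI, planII).min   // == planII
{code}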

Another question I have is regarding statistics: with separate analyze 
column/analyze table statements it's possible for your statistics to hold two 
different views of the data, leading to weird results and inconsistent cardinality 
estimates.

For the filter factor, what default selectivities are assumed? We may also 
want to cap the minimum selectivity, so that C1 && C2 && C3 etc. doesn't lead 
to ridiculously low cardinality estimates.

> Cost-based Optimizer framework
> --
>
> Key: SPARK-16026
> URL: https://issues.apache.org/jira/browse/SPARK-16026
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: Spark_CBO_Design_Spec.pdf
>
>
> This is an umbrella ticket to implement a cost-based optimizer framework 
> beyond broadcast join selection. This framework can be used to implement some 
> useful optimizations such as join reordering.
> The design should discuss how to break the work down into multiple, smaller 
> logical units. For example, changes to statistics class, system catalog, cost 
> estimation/propagation in expressions, cost estimation/propagation in 
> operators can be done in decoupled pull requests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17253) Left join where ON clause does not reference the right table produces analysis error

2016-09-06 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17253.
---
   Resolution: Duplicate
 Assignee: Herman van Hovell
Fix Version/s: 2.1.0

> Left join where ON clause does not reference the right table produces 
> analysis error
> 
>
> Key: SPARK-17253
> URL: https://issues.apache.org/jira/browse/SPARK-17253
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Herman van Hovell
>Priority: Minor
> Fix For: 2.1.0
>
>
> The following query produces an AnalysisException:
> {code}
> CREATE TABLE currency (
>  cur CHAR(3)
> );
> CREATE TABLE exchange (
>  cur1 CHAR(3),
>  cur2 CHAR(3),
>  rate double
> );
> INSERT INTO currency VALUES ('EUR');
> INSERT INTO currency VALUES ('GBP');
> INSERT INTO currency VALUES ('USD');
> INSERT INTO exchange VALUES ('EUR', 'GBP', 0.85);
> INSERT INTO exchange VALUES ('GBP', 'EUR', 1.0/0.85);
> SELECT c1.cur cur1, c2.cur cur2, COALESCE(self.rate, x.rate) rate
> FROM currency c1
> CROSS JOIN currency c2
> LEFT JOIN exchange x
>ON x.cur1=c1.cur
>AND x.cur2=c2.cur
> LEFT JOIN (SELECT 1 rate) self
>ON c1.cur=c2.cur;
> {code}
> {code}
> AnalysisException: cannot resolve '`c1.cur`' given input columns: [cur, cur1, 
> cur2, rate]; line 5 pos 13
> {code}
> However, this query is runnable in sqlite3 and postgres. This example query 
> was adapted from https://www.sqlite.org/src/tktview?name=ebdbadade5, a sqlite 
> bug report in which this query gave a wrong answer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17296) Spark SQL: cross join + two joins = BUG

2016-09-06 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17296.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

> Spark SQL: cross join + two joins = BUG
> ---
>
> Key: SPARK-17296
> URL: https://issues.apache.org/jira/browse/SPARK-17296
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>Assignee: Herman van Hovell
> Fix For: 2.1.0
>
>
> In spark shell :
> {code}
> CREATE TABLE test (col INT) ;
> INSERT OVERWRITE TABLE test VALUES (1), (2) ;
> SELECT 
> COUNT(1)
> FROM test T1 
> CROSS JOIN test T2
> JOIN test T3
> ON T3.col = T1.col
> JOIN test T4
> ON T4.col = T1.col
> ;
> {code}
> returns :
> {code}
> Error in query: cannot resolve '`T1.col`' given input columns: [col, col]; 
> line 6 pos 12
> {code}
> Apparently, this example is minimal (removing the CROSS JOIN or one of the 
> JOINs makes the issue go away).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17296) Spark SQL: cross join + two joins = BUG

2016-09-06 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reassigned SPARK-17296:
-

Assignee: Herman van Hovell

> Spark SQL: cross join + two joins = BUG
> ---
>
> Key: SPARK-17296
> URL: https://issues.apache.org/jira/browse/SPARK-17296
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>Assignee: Herman van Hovell
> Fix For: 2.1.0
>
>
> In spark shell :
> {code}
> CREATE TABLE test (col INT) ;
> INSERT OVERWRITE TABLE test VALUES (1), (2) ;
> SELECT 
> COUNT(1)
> FROM test T1 
> CROSS JOIN test T2
> JOIN test T3
> ON T3.col = T1.col
> JOIN test T4
> ON T4.col = T1.col
> ;
> {code}
> returns :
> {code}
> Error in query: cannot resolve '`T1.col`' given input columns: [col, col]; 
> line 6 pos 12
> {code}
> Apparently, this example is minimal (removing the CROSS JOIN or one of the 
> JOINs makes the issue go away).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17423) Support IGNORE NULLS option in Window functions

2016-09-06 Thread Tim Chan (JIRA)
Tim Chan created SPARK-17423:


 Summary: Support IGNORE NULLS option in Window functions
 Key: SPARK-17423
 URL: https://issues.apache.org/jira/browse/SPARK-17423
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Tim Chan
Priority: Minor


http://stackoverflow.com/questions/24338119/is-it-possible-to-ignore-null-values-when-using-lag-and-lead-functions-in-sq
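
For context, a minimal sketch of one workaround that is possible today (Spark 2.0 
APIs; the column names are illustrative): emulate "LAG ... IGNORE NULLS" with the 
ignoreNulls flag that first/last already accept, applied over a window that ends 
just before the current row.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// id defines the ordering; v contains the nulls we want to skip over.
val df = Seq((1, Some(10)), (2, None), (3, None), (4, Some(40))).toDF("id", "v")

// "lag(v) ignore nulls": the last non-null v strictly before the current row.
val w = Window.orderBy("id").rowsBetween(Long.MinValue, -1)

df.withColumn("prev_non_null_v", last($"v", ignoreNulls = true).over(w)).show()
{code}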





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17368) Scala value classes create encoder problems and break at runtime

2016-09-06 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468833#comment-15468833
 ] 

Jakob Odersky edited comment on SPARK-17368 at 9/6/16 10:57 PM:


So I thought about this a bit more, and although it is possible to support value 
classes, I currently see two main issues that make it cumbersome:

1. Catalyst (the engine behind Datasets) generates and compiles code at runtime 
that represents the actual computation. Since that generated code is Java, and 
value classes don't have runtime representations, supporting them will require 
changes in the implementation of Encoders (see my experimental branch 
[here|https://github.com/apache/spark/compare/master...jodersky:value-classes]).

2. The bigger problem is how encoders for value classes would be made accessible. 
Currently, encoders are exposed as type classes, and there is unfortunately no way 
to create type classes for classes extending AnyVal (you could create an encoder 
for AnyVal, however that would also apply to any primitive type and you would get 
implicit resolution conflicts). Requiring explicit encoders for value classes may 
work, however you would still have no compile-time safety, as accessing a value 
class's inner val occurs at runtime and may hence fail if it is not encodable.

The cleanest solution would be to use metaprogramming: it would guarantee 
"encodability" at compile time and could easily complement the current API. 
Unfortunately, I don't think it could be included in Spark in the near future, as 
the current metaprogramming solutions in Scala are either too new (scala.meta) or 
on their way to being deprecated (the current experimental Scala macros). (I have 
been wanting to experiment with meta encoders for a while, so maybe I'll try 
putting together an external library for that.)

How inconvenient is it to extract the wrapped value before creating a dataset 
and re-wrapping your final results?


was (Author: jodersky):
So I thought about this a bit more and although it is possible to support value 
classes, I currently see two main issues that make it cumbersome:

1. Catalyst (the engine behind Datasets) generates and compiles code during 
runtime, that will represent the actual computation. This code being Java, 
together with the fact that value classes don't have runtime representations, 
will require changes in the implementation of Encoders (see my experimental 
branch here).

2. The largest problem of both is how will encoders for value classes be 
accessible? Currently, encoders are exposed as type classes and there is 
unfortunately no way to create type classes for classes extending AnyVal (you 
could create an encoder for AnyVals, however that would also apply to any 
primitive type and you would get implicit resolution conflicts). Requiring 
explicit encoders for value classes may work, however you would still have no 
compile-time safety, as accessing of a value class' inner val will occur during 
runtime and may hence fail if it is not encodable.

The cleanest solution would be to use meta programming: it would guarantee 
"encodability" during compile-time and could easily complement the current API. 
Unfortunately however, I don't think it could be included in Spark in the near 
future as the current meta programming solutions in Scala are either too new 
(scala.meta) or on their way to being deprecated (the current experimental 
scala macros). (I have been wanting to experiment with meta encoders for a 
while though, so maybe I'll try putting together an external library for that)

How inconvenient is it to extract the wrapped value before creating a dataset 
and re-wrapping your final results?

> Scala value classes create encoder problems and break at runtime
> 
>
> Key: SPARK-17368
> URL: https://issues.apache.org/jira/browse/SPARK-17368
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: JDK 8 on MacOS
> Scala 2.11.8
> Spark 2.0.0
>Reporter: Aris Vlasakakis
>
> Using Scala value classes as the inner type for Datasets breaks in Spark 2.0 
> and 1.6.X.
> This simple Spark 2 application demonstrates that the code will compile, but 
> will break at runtime with the error. The value class is of course 
> *FeatureId*, as it extends AnyVal.
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: Error while encoding: 
> java.lang.RuntimeException: Couldn't find v on int
> assertnotnull(input[0, int, true], top level non-flat input object).v AS v#0
> +- assertnotnull(input[0, int, true], top level non-flat input object).v
>+- assertnotnull(input[0, int, true], top level non-flat input object)
>   +- input[0, 

[jira] [Resolved] (SPARK-15891) Make YARN logs less noisy

2016-09-06 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-15891.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.1.0

> Make YARN logs less noisy
> -
>
> Key: SPARK-15891
> URL: https://issues.apache.org/jira/browse/SPARK-15891
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.1.0
>
>
> Spark can generate a lot of logs when running in YARN mode. The problem is 
> already a little bad with normal configuration, but is even worse with 
> dynamic allocation on.
> The first problem is that for every executor Spark launches, it will print 
> the whole command and all the env variables it's setting, even though those 
> are exactly the same for every executor. That's not too bad with a handful of 
> executors, but gets annoying pretty soon after that. Dynamic allocation makes 
> that problem worse since executors are constantly being started and then 
> going away.
> Also, there's a lot of logging generated by the dynamic allocation backend 
> code in the YARN module. We should audit those messages, make sure they all 
> make sense, and decide whether / how to reduce the amount of logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17422) Update Ganglia project with new license

2016-09-06 Thread Luciano Resende (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luciano Resende updated SPARK-17422:

Description: 
It seems that Ganglia is now BSD licensed
http://ganglia.info/ and https://sourceforge.net/p/ganglia/code/1397/

And the library we depend on for the ganglia connector has been moved to BSD 
license as well.

We should update the ganglia spark project to leverage these license changes, 
and enable it on Spark.
 

  was:
It seems that Ganglia is now BSD licensed
http://ganglia.info/ and https://sourceforge.net/p/ganglia/code/1397/

And the library we depend on for the ganglia connector has been moved to BSD 
license as well.

We should update the ganglia spark connector/dependencies to leverage these 
license changes, and enable it on Spark.
 


> Update Ganglia project with new license
> ---
>
> Key: SPARK-17422
> URL: https://issues.apache.org/jira/browse/SPARK-17422
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Streaming
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Luciano Resende
>
> It seems that Ganglia is now BSD licensed
> http://ganglia.info/ and https://sourceforge.net/p/ganglia/code/1397/
> And the library we depend on for the ganglia connector has been moved to BSD 
> license as well.
> We should update the ganglia spark project to leverage these license changes, 
> and enable it on Spark.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17368) Scala value classes create encoder problems and break at runtime

2016-09-06 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468833#comment-15468833
 ] 

Jakob Odersky commented on SPARK-17368:
---

So I thought about this a bit more, and although it is possible to support value 
classes, I currently see two main issues that make it cumbersome:

1. Catalyst (the engine behind Datasets) generates and compiles code at runtime 
that represents the actual computation. Since that generated code is Java, and 
value classes don't have runtime representations, supporting them will require 
changes in the implementation of Encoders (see my experimental branch here).

2. The bigger problem is how encoders for value classes would be made accessible. 
Currently, encoders are exposed as type classes, and there is unfortunately no way 
to create type classes for classes extending AnyVal (you could create an encoder 
for AnyVal, however that would also apply to any primitive type and you would get 
implicit resolution conflicts). Requiring explicit encoders for value classes may 
work, however you would still have no compile-time safety, as accessing a value 
class's inner val occurs at runtime and may hence fail if it is not encodable.

The cleanest solution would be to use metaprogramming: it would guarantee 
"encodability" at compile time and could easily complement the current API. 
Unfortunately, I don't think it could be included in Spark in the near future, as 
the current metaprogramming solutions in Scala are either too new (scala.meta) or 
on their way to being deprecated (the current experimental Scala macros). (I have 
been wanting to experiment with meta encoders for a while, so maybe I'll try 
putting together an external library for that.)

How inconvenient is it to extract the wrapped value before creating a dataset 
and re-wrapping your final results?
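
For reference, a minimal sketch of that unwrap/re-wrap approach, reusing the 
FeatureId value class from the report; the object and value names here are 
illustrative, not a proposed API:

{code}
import org.apache.spark.sql.{Dataset, SparkSession}

object ValueClassWorkaround {
  case class FeatureId(v: Int) extends AnyVal

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    import spark.implicits._

    val seq = Seq(FeatureId(1), FeatureId(2), FeatureId(3))

    // Unwrap to the underlying Int before building the Dataset...
    val ds: Dataset[Int] = spark.createDataset(seq.map(_.v))

    // ...do the distributed work on the primitive representation...
    val doubled: Dataset[Int] = ds.map(_ * 2)

    // ...and re-wrap only when bringing results back to the driver.
    val result: Seq[FeatureId] = doubled.collect().toSeq.map(FeatureId(_))
    println(result)
  }
}
{code}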

> Scala value classes create encoder problems and break at runtime
> 
>
> Key: SPARK-17368
> URL: https://issues.apache.org/jira/browse/SPARK-17368
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: JDK 8 on MacOS
> Scala 2.11.8
> Spark 2.0.0
>Reporter: Aris Vlasakakis
>
> Using Scala value classes as the inner type for Datasets breaks in Spark 2.0 
> and 1.6.X.
> This simple Spark 2 application demonstrates that the code will compile, but 
> will break at runtime with the error. The value class is of course 
> *FeatureId*, as it extends AnyVal.
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: Error while encoding: 
> java.lang.RuntimeException: Couldn't find v on int
> assertnotnull(input[0, int, true], top level non-flat input object).v AS v#0
> +- assertnotnull(input[0, int, true], top level non-flat input object).v
>+- assertnotnull(input[0, int, true], top level non-flat input object)
>   +- input[0, int, true]".
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:279)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> at 
> org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:421)
> {noformat}
> Test code for Spark 2.0.0:
> {noformat}
> import org.apache.spark.sql.{Dataset, SparkSession}
> object BreakSpark {
>   case class FeatureId(v: Int) extends AnyVal
>   def main(args: Array[String]): Unit = {
> val seq = Seq(FeatureId(1), FeatureId(2), FeatureId(3))
> val spark = SparkSession.builder.getOrCreate()
> import spark.implicits._
> spark.sparkContext.setLogLevel("warn")
> val ds: Dataset[FeatureId] = spark.createDataset(seq)
> println(s"BREAK HERE: ${ds.count}")
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17422) Update Ganglia project with new license

2016-09-06 Thread Luciano Resende (JIRA)
Luciano Resende created SPARK-17422:
---

 Summary: Update Ganglia project with new license
 Key: SPARK-17422
 URL: https://issues.apache.org/jira/browse/SPARK-17422
 Project: Spark
  Issue Type: Bug
  Components: Build, Streaming
Affects Versions: 2.0.0, 1.6.2
Reporter: Luciano Resende


It seems that Ganglia is now BSD licensed
http://ganglia.info/ and https://sourceforge.net/p/ganglia/code/1397/

And the library we depend on for the ganglia connector has been moved to BSD 
license as well.

We should update the ganglia spark connector/dependencies to leverage these 
license changes, and enable it on Spark.
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17421) Warnings about "MaxPermSize" parameter when building with Maven and Java 8

2016-09-06 Thread Frederick Reiss (JIRA)
Frederick Reiss created SPARK-17421:
---

 Summary: Warnings about "MaxPermSize" parameter when building with 
Maven and Java 8
 Key: SPARK-17421
 URL: https://issues.apache.org/jira/browse/SPARK-17421
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Frederick Reiss
Priority: Minor


When building Spark with {{build/mvn}} or {{dev/run-tests}}, a Java warning 
appears repeatedly on STDERR:
{{OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support 
was removed in 8.0}}

This warning is due to {{build/mvn}} adding the {{-XX:MaxPermSize=512M}} option 
to {{MAVEN_OPTS}}. When compiling with Java 7, this parameter is essential. 
With Java 8, the parameter leads to the warning above.

Because {{build/mvn}} adds {{MaxPermSize}} to {{MAVEN_OPTS}} even when that 
environment variable doesn't already contain the option, setting {{MAVEN_OPTS}} to 
a string that does not contain {{MaxPermSize}} has no effect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17110) Pyspark with locality ANY throw java.io.StreamCorruptedException

2016-09-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-17110.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14952
[https://github.com/apache/spark/pull/14952]

> Pyspark with locality ANY throw java.io.StreamCorruptedException
> 
>
> Key: SPARK-17110
> URL: https://issues.apache.org/jira/browse/SPARK-17110
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: Cluster of 2 AWS r3.xlarge slaves launched via ec2 
> scripts, Spark 2.0.0, hadoop: yarn, pyspark shell
>Reporter: Tomer Kaftan
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> In Pyspark 2.0.0, any task that accesses cached data non-locally throws a 
> StreamCorruptedException like the stacktrace below:
> {noformat}
> WARN TaskSetManager: Lost task 7.0 in stage 2.0 (TID 26, 172.31.26.184): 
> java.io.StreamCorruptedException: invalid stream header: 12010A80
> at 
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:807)
> at java.io.ObjectInputStream.<init>(ObjectInputStream.java:302)
> at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:63)
> at 
> org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:63)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122)
> at 
> org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:146)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:524)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:522)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:522)
> at org.apache.spark.storage.BlockManager.get(BlockManager.scala:609)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The simplest way I have found to reproduce this is by running the following 
> code in the pyspark shell, on a cluster of 2 slaves set to use only one 
> worker core each:
> {code}
> x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache()
> x.count()
> import time
> def waitMap(x):
> time.sleep(x)
> return x
> x.map(waitMap).count()
> {code}
> Or by running the following via spark-submit:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache()
> x.count()
> import time
> def waitMap(x):
> time.sleep(x)
> return x
> x.map(waitMap).count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17405) Simple aggregation query OOMing after SPARK-16525

2016-09-06 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468737#comment-15468737
 ] 

Josh Rosen commented on SPARK-17405:


On the Spark Dev list, [~jlaskowski] found a simpler example which triggers 
this issue:

{quote}
{code}
scala> val intsMM = 1 to math.pow(10, 3).toInt
intsMM: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57,
58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91,
92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106,
107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120,
121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134,
135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148,
149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162,
163, 164, 165, 166, 167, 168, 169, 1...
scala> val df = intsMM.toDF("n").withColumn("m", 'n % 2)
df: org.apache.spark.sql.DataFrame = [n: int, m: int]

scala> df.groupBy('m).agg(sum('n)).show
...
16/09/06 22:28:02 ERROR Executor: Exception in task 6.0 in stage 0.0 (TID 6)
java.lang.OutOfMemoryError: Unable to acquire 262144 bytes of memory, got 0
...
{code}

Please see 
https://gist.github.com/jaceklaskowski/906d62b830f6c967a7eee5f8eb6e9237
{quote}

> Simple aggregation query OOMing after SPARK-16525
> -
>
> Key: SPARK-17405
> URL: https://issues.apache.org/jira/browse/SPARK-17405
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Prior to SPARK-16525 / https://github.com/apache/spark/pull/14176, the 
> following query ran fine via Beeline / Thrift Server and the Spark shell, but 
> after that patch it is consistently OOMING:
> {code}
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
> int_col_9, string_col_10) AS (
>   SELECT * FROM (VALUES
> (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), 
> TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, 
> TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'),
> (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 
> 00:00:00.0'), CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS 
> INT), TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'),
> (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 
> 00:00:00.0'), CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 
> 00:00:00.0'), '211', -959, CAST(NULL AS STRING)),
> (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), 
> CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', 
> CAST(NULL AS INT), '936'),
> (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 
> 00:00:00.0'), CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, 
> TIMESTAMP('2028-06-27 00:00:00.0'), '-657', 948, '18'),
> (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 
> 00:00:00.0'), CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 
> 00:00:00.0'), '-345', 566, '-574'),
> (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 
> 00:00:00.0'), CAST(972 AS SMALLINT), true, CAST(NULL AS INT), 
> TIMESTAMP('2026-06-10 00:00:00.0'), '518', 683, '-320'),
> (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), 
> CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142')
>   ) as t);
> CREATE TEMPORARY VIEW table_3(string_col_1, float_col_2, timestamp_col_3, 
> boolean_col_4, timestamp_col_5, decimal3317_col_6) AS (
>   SELECT * FROM (VALUES
> ('88', CAST(191.92508 AS FLOAT), TIMESTAMP('1990-10-25 00:00:00.0'), 
> false, TIMESTAMP('1992-11-02 00:00:00.0'), CAST(NULL AS DECIMAL(33,17))),
> ('-419', CAST(-13.477915 AS FLOAT), TIMESTAMP('1996-03-02 00:00:00.0'), 
> true, CAST(NULL AS TIMESTAMP), -653.51000BD),
> ('970', CAST(-360.432 AS FLOAT), TIMESTAMP('2010-07-29 00:00:00.0'), 
> false, TIMESTAMP('1995-09-01 00:00:00.0'), -936.48000BD),
> ('807', CAST(814.30756 AS FLOAT), TIMESTAMP('2019-11-06 00:00:00.0'), 
> false, TIMESTAMP('1996-04-25 00:00:00.0'), 335.56000BD),
> ('-872', CAST(616.50525 AS FLOAT), TIMESTAMP('2011-08-28 00:00:00.0'), 
> false, TIMESTAMP('2003-07-19 00:00:00.0'), -951.18000BD),
> ('-167', CAST(-875.35675 AS FLOAT), TIMESTAMP('1995-07-14 00:00:00.0'), 
> false, 

[jira] [Commented] (SPARK-17381) Memory leak org.apache.spark.sql.execution.ui.SQLTaskMetrics

2016-09-06 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468738#comment-15468738
 ] 

Davies Liu commented on SPARK-17381:


cc [~cloud_fan]

> Memory leak  org.apache.spark.sql.execution.ui.SQLTaskMetrics
> -
>
> Key: SPARK-17381
> URL: https://issues.apache.org/jira/browse/SPARK-17381
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: EMR 5.0.0 (submitted as yarn-client)
> Java Version  1.8.0_101 (Oracle Corporation)
> Scala Version version 2.11.8
> Problem also happens when I run locally with similar versions of java/scala. 
> OS: Ubuntu 16.04
>Reporter: Joao Duarte
>
> I am running a Spark Streaming application from a Kinesis stream. After some 
> hours of running it runs out of memory. After a driver heap dump I found two 
> problems:
> 1) huge amount of org.apache.spark.sql.execution.ui.SQLTaskMetrics (It seems 
> this was a problem before: 
> https://issues.apache.org/jira/browse/SPARK-11192);
> To replicate the org.apache.spark.sql.execution.ui.SQLTaskMetrics leak just 
> needed to run the code below:
> {code}
> val dstream = ssc.union(kinesisStreams)
> dstream.foreachRDD((streamInfo: RDD[Array[Byte]]) => {
>   val toyDF = streamInfo.map(_ =>
> (1, "data","more data "
> ))
> .toDF("Num", "Data", "MoreData" )
>   toyDF.agg(sum("Num")).first().get(0)
> }
> )
> {code}
> 2) huge amount of Array[Byte] (9Gb+)
> After some analysis, I noticed that most of the Array[Byte] were being 
> referenced by objects that were being referenced by SQLTaskMetrics. The 
> strangest thing is that those Array[Byte] were basically text that was 
> loaded in the executors, so they should never be in the driver at all!
> I still could not replicate the 2nd problem with simple code (the original 
> was complex with data coming from S3, DynamoDB and other databases). However, 
> when I debug the application I can see that in Executor.scala, during 
> reportHeartBeat(),  the data that should not be sent to the driver is being 
> added to "accumUpdates" which, as I understand, will be sent to the driver 
> for reporting.
> To be more precise, one of the taskRunner in the loop "for (taskRunner <- 
> runningTasks.values().asScala)"  contains a GenericInternalRow with a lot of 
> data that should not go to the driver. The path would be in my case: 
> taskRunner.task.metrics.externalAccums[2]._list[0]. This data is similar (if 
> not the same) to the data I see when I do a driver heap dump. 
> I guess that if the org.apache.spark.sql.execution.ui.SQLTaskMetrics leak is 
> fixed I would have less of this undesirable data in the driver and I could 
> run my streaming app for a long period of time, but I think there will always 
> be some performance lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17384) SQL - Running query with outer join from 1.6 fails

2016-09-06 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-17384.
--
Resolution: Duplicate
  Assignee: Herman van Hovell

> SQL - Running query with outer join from 1.6 fails
> --
>
> Key: SPARK-17384
> URL: https://issues.apache.org/jira/browse/SPARK-17384
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Don Drake
>Assignee: Herman van Hovell
>
> I have some complex (10-table joins) SQL queries that utilize outer joins 
> that work fine in Spark 1.6.2, but fail under Spark 2.0.  I was able to 
> duplicate the problem using a simple test case.
> Here's the code for Spark 2.0 that doesn't run (this runs fine in Spark 
> 1.6.2):
> {code}
> case class C1(f1: String, f2: String, f3: String, f4: String)
> case class C2(g1: String, g2: String, g3: String, g4: String)
> case class C3(h1: String, h2: String, h3: String, h4: String)
> val sqlContext = spark.sqlContext 
> val c1 = sc.parallelize(Seq(
>   C1("h1", "c1a1", "c1b1", "c1c1"),
>   C1("h2", "c1a2", "c1b2", "c1c2"),
>   C1(null, "c1a3", "c1b3", "c1c3")
>   )).toDF
> c1.createOrReplaceTempView("c1")
> val c2 = sc.parallelize(Seq(
>   C2("h1", "c2a1", "c2b1", "c2c1"),
>   C2("h2", "c2a2", "c2b2", "c2c2"),
>   C2(null, "c2a3", "c2b3", "c2c3"),
>   C2(null, "c2a4", "c2b4", "c2c4"),
>   C2("h333", "c2a333", "c2b333", "c2c333")
>   )).toDF
> c2.createOrReplaceTempView("c2")
> val c3 = sc.parallelize(Seq(
>   C3("h1", "c3a1", "c3b1", "c3c1"),
>   C3("h2", "c3a2", "c3b2", "c3c2"),
>   C3(null, "c3a3", "c3b3", "c3c3")
>   )).toDF
> c3.createOrReplaceTempView("c3")
> // doesn't work in Spark 2.0, works in Spark 1.6
> val bad_df = sqlContext.sql("""
>   select * 
>   from c1, c3
>   left outer join c2 on (c1.f1 = c2.g1)
>   where c1.f1 = c3.h1
> """).show()
> // works in both
> val works_df = sqlContext.sql("""
>   select * 
>   from c1
>   left outer join c2 on (c1.f1 = c2.g1), 
>   c3
>   where c1.f1 = c3.h1
> """).show()
> {code}
> Here's the output after running bad_df in Spark 2.0:
> {code}
> scala> val bad_df = sqlContext.sql("""
>  |   select *
>  |   from c1, c3
>  |   left outer join c2 on (c1.f1 = c2.g1)
>  |   where c1.f1 = c3.h1
>  | """).show()
> org.apache.spark.sql.AnalysisException: cannot resolve '`c1.f1`' given input 
> columns: [h3, g3, h4, g2, g4, h2, h1, g1]; line 4 pos 25
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:190)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
>   at 

[jira] [Commented] (SPARK-17384) SQL - Running query with outer join from 1.6 fails

2016-09-06 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468636#comment-15468636
 ] 

Davies Liu commented on SPARK-17384:


This is caused by the SQL parser change; the parsed plan in 1.6 was:
{code}
'Project [unresolvedalias(*)]
+- 'Filter ('c1.f1 = 'c3.h1)
   +- 'Join LeftOuter, Some(('c3.h1 = 'c2.g1))
  :- 'Join Inner, None
  :  :- 'UnresolvedRelation `c1`, None
  :  +- 'UnresolvedRelation `c3`, None
  +- 'UnresolvedRelation `c2`, None
{code}


And the parsed plan in 2.0 is:
{code}
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('c1.f1 = 'c3.h1)
   +- 'Join Inner
  :- 'UnresolvedRelation `c1`
  +- 'Join LeftOuter, ('c3.h1 = 'c2.g1)
 :- 'UnresolvedRelation `c3`
 +- 'UnresolvedRelation `c2`
{code}

cc [~hvanhovell]

> SQL - Running query with outer join from 1.6 fails
> --
>
> Key: SPARK-17384
> URL: https://issues.apache.org/jira/browse/SPARK-17384
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Don Drake
>
> I have some complex (10-table joins) SQL queries that utilize outer joins 
> that work fine in Spark 1.6.2, but fail under Spark 2.0.  I was able to 
> duplicate the problem using a simple test case.
> Here's the code for Spark 2.0 that doesn't run (this runs fine in Spark 
> 1.6.2):
> {code}
> case class C1(f1: String, f2: String, f3: String, f4: String)
> case class C2(g1: String, g2: String, g3: String, g4: String)
> case class C3(h1: String, h2: String, h3: String, h4: String)
> val sqlContext = spark.sqlContext 
> val c1 = sc.parallelize(Seq(
>   C1("h1", "c1a1", "c1b1", "c1c1"),
>   C1("h2", "c1a2", "c1b2", "c1c2"),
>   C1(null, "c1a3", "c1b3", "c1c3")
>   )).toDF
> c1.createOrReplaceTempView("c1")
> val c2 = sc.parallelize(Seq(
>   C2("h1", "c2a1", "c2b1", "c2c1"),
>   C2("h2", "c2a2", "c2b2", "c2c2"),
>   C2(null, "c2a3", "c2b3", "c2c3"),
>   C2(null, "c2a4", "c2b4", "c2c4"),
>   C2("h333", "c2a333", "c2b333", "c2c333")
>   )).toDF
> c2.createOrReplaceTempView("c2")
> val c3 = sc.parallelize(Seq(
>   C3("h1", "c3a1", "c3b1", "c3c1"),
>   C3("h2", "c3a2", "c3b2", "c3c2"),
>   C3(null, "c3a3", "c3b3", "c3c3")
>   )).toDF
> c3.createOrReplaceTempView("c3")
> // doesn't work in Spark 2.0, works in Spark 1.6
> val bad_df = sqlContext.sql("""
>   select * 
>   from c1, c3
>   left outer join c2 on (c1.f1 = c2.g1)
>   where c1.f1 = c3.h1
> """).show()
> // works in both
> val works_df = sqlContext.sql("""
>   select * 
>   from c1
>   left outer join c2 on (c1.f1 = c2.g1), 
>   c3
>   where c1.f1 = c3.h1
> """).show()
> {code}
> Here's the output after running bad_df in Spark 2.0:
> {code}
> scala> val bad_df = sqlContext.sql("""
>  |   select *
>  |   from c1, c3
>  |   left outer join c2 on (c1.f1 = c2.g1)
>  |   where c1.f1 = c3.h1
>  | """).show()
> org.apache.spark.sql.AnalysisException: cannot resolve '`c1.f1`' given input 
> columns: [h3, g3, h4, g2, g4, h2, h1, g1]; line 4 pos 25
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:190)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> 

[jira] [Commented] (SPARK-17417) Fix # of partitions for RDD while checkpointing - Currently limited by 10000(%05d)

2016-09-06 Thread Dhruve Ashar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468622#comment-15468622
 ] 

Dhruve Ashar commented on SPARK-17417:
--

Thanks for the suggestion. I'll work on the changes and submit a PR.

> Fix # of partitions for RDD while checkpointing - Currently limited by 
> 1(%05d)
> --
>
> Key: SPARK-17417
> URL: https://issues.apache.org/jira/browse/SPARK-17417
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Dhruve Ashar
>
> Spark currently assumes # of partitions to be less than 10 and uses %05d 
> padding. 
> If we exceed this no., the sort logic in ReliableCheckpointRDD gets messed up 
> and fails. This is because of part-files are sorted and compared as strings. 
> This leads filename order to be part-1, part-10, ... instead of 
> part-1, part-10001, ..., part-10 and while reconstructing the 
> checkpointed RDD the job fails. 
> Possible solutions: 
> - Bump the padding to allow more partitions, or
> - Sort the part files by their numeric suffix rather than the raw file name (see 
> the sketch below) and then verify the RDD



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17299) TRIM/LTRIM/RTRIM strips characters other than spaces

2016-09-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17299:
--
Assignee: Sandeep Singh

> TRIM/LTRIM/RTRIM strips characters other than spaces
> 
>
> Key: SPARK-17299
> URL: https://issues.apache.org/jira/browse/SPARK-17299
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Jeremy Beard
>Assignee: Sandeep Singh
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> TRIM/LTRIM/RTRIM docs state that they only strip spaces:
> http://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#trim(org.apache.spark.sql.Column)
> But the implementation strips all characters of ASCII value 20 or less:
> https://github.com/apache/spark/blob/v2.0.0/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L468-L470
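
A minimal sketch of the behaviour described above (editor's illustration, not from the report; the input string is arbitrary and assumes the Spark 2.0.0 unsafe module on the classpath):

{code}
import org.apache.spark.unsafe.types.UTF8String

// Tab (ASCII 9) and newline (ASCII 10) fall under the implementation's
// threshold, so trim() removes them along with the spaces.
val s = UTF8String.fromString("\t  abc \n")
println(s.trim().toString)  // prints "abc"; per the documented behaviour the tab and newline would remain
{code}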



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17299) TRIM/LTRIM/RTRIM strips characters other than spaces

2016-09-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17299.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14924
[https://github.com/apache/spark/pull/14924]

> TRIM/LTRIM/RTRIM strips characters other than spaces
> 
>
> Key: SPARK-17299
> URL: https://issues.apache.org/jira/browse/SPARK-17299
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Jeremy Beard
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> TRIM/LTRIM/RTRIM docs state that they only strip spaces:
> http://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#trim(org.apache.spark.sql.Column)
> But the implementation strips all characters of ASCII value 20 or less:
> https://github.com/apache/spark/blob/v2.0.0/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L468-L470



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17378) Upgrade snappy-java to 1.1.2.6

2016-09-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17378.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1
   1.6.3

Issue resolved by pull request 14958
[https://github.com/apache/spark/pull/14958]

> Upgrade snappy-java to 1.1.2.6
> --
>
> Key: SPARK-17378
> URL: https://issues.apache.org/jira/browse/SPARK-17378
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Adam Roberts
>Assignee: Adam Roberts
>Priority: Trivial
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
>
> We should upgrade the snappy-java version to pick up the following fix we see 
> at https://github.com/xerial/snappy-java/blob/master/Milestone.md
> More info on the fix, with one user commenting that it impacts Kafka 0.10 
> (http://www.confluent.io/blog/announcing-apache-kafka-0.10-and-confluent-platform-3.0):
>  https://github.com/xerial/snappy-java/issues/142
> The fix will involve the pom and the dev/deps files, which mention the versions of 
> each library we use.
> I'll create a PR; I'm not expecting new failures to pop up...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17377) Joining Datasets read and aggregated from a partitioned Parquet file gives wrong results

2016-09-06 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468583#comment-15468583
 ] 

Davies Liu commented on SPARK-17377:


Tested this with the latest master and 2.0 on Databricks [1]; they both worked well.

[1] 
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4295659042175502/2244397401000511/8311518366344995/latest.html

> Joining Datasets read and aggregated from a partitioned Parquet file gives 
> wrong results
> 
>
> Key: SPARK-17377
> URL: https://issues.apache.org/jira/browse/SPARK-17377
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Hanna Mäki
>
> Reproduction: 
> 1) Read two Datasets from a partitioned Parquet file with different filter 
> conditions on the partitioning column
> 2) Group by a column and aggregate the two data sets
> 3) Join the aggregated Datasets on the group by column
> 4) In the joined dataset, the aggregated values from the right Dataset have 
> been replaced with the aggregated values from the left Dataset 
> The issue is only reproduced when the input parquet file is partitioned.
> Example: 
> {code}
> val dataPath= "/your/data/path/" 
> case class InputData(id: Int, value: Int, filterColumn: Int)
> val inputDS = Seq(InputData(1, 1, 1), InputData(2, 2, 1), InputData(3, 3, 1), 
> InputData(4, 4, 1), InputData(1, 10, 2), InputData(2, 20, 2), InputData(3, 
> 30, 2), InputData(4, 40, 2)).toDS()
> inputDS.show
> | id|value|filterColumn|
> |  1|1|   1|
> |  2|2|   1|
> |  3|3|   1|
> |  4|4|   1|
> |  1|   10|   2|
> |  2|   20|   2|
> |  3|   30|   2|
> |  4|   40|   2|
> inputDS.write.partitionBy("filterColumn").parquet(dataPath)
> val dataDF = spark.read.parquet(dataPath)
> case class LeftClass(id: Int, aggLeft: Long)
> case class RightClass(id: Int, aggRight: Long)
> val leftDS = dataDF.filter("filterColumn = 1").groupBy("id").agg(sum("value") 
> as "aggLeft").as[LeftClass]
> val rightDS = dataDF.filter("filterColumn = 
> 2").groupBy("id").agg(sum("value") as "aggRight").as[RightClass]
> leftDS.show
> | id|aggLeft|
> |  1|  1|
> |  3|  3|
> |  4|  4|
> |  2|  2|
> rightDS.show
> | id|aggRight|
> |  1|  10|
> |  3|  30|
> |  4|  40|
> |  2|  20|
> val joinedDS = leftDS.join(rightDS,"id")
> joinedDS.show
> | id|aggLeft|aggRight|
> |  1|  1|   1|
> |  3|  3|   3|
> |  4|  4|   4|
> |  2|  2|   2|
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17377) Joining Datasets read and aggregated from a partitioned Parquet file gives wrong results

2016-09-06 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-17377:
--

Assignee: Davies Liu

> Joining Datasets read and aggregated from a partitioned Parquet file gives 
> wrong results
> 
>
> Key: SPARK-17377
> URL: https://issues.apache.org/jira/browse/SPARK-17377
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Hanna Mäki
>Assignee: Davies Liu
>
> Reproduction: 
> 1) Read two Datasets from a partitioned Parquet file with different filter 
> conditions on the partitioning column
> 2) Group by a column and aggregate the two data sets
> 3) Join the aggregated Datasets on the group by column
> 4) In the joined dataset, the aggregated values from the right Dataset have 
> been replaced with the aggregated values from the left Dataset 
> The issue is only reproduced when the input parquet file is partitioned.
> Example: 
> {code}
> val dataPath= "/your/data/path/" 
> case class InputData(id: Int, value: Int, filterColumn: Int)
> val inputDS = Seq(InputData(1, 1, 1), InputData(2, 2, 1), InputData(3, 3, 1), 
> InputData(4, 4, 1), InputData(1, 10, 2), InputData(2, 20, 2), InputData(3, 
> 30, 2), InputData(4, 40, 2)).toDS()
> inputDS.show
> | id|value|filterColumn|
> |  1|1|   1|
> |  2|2|   1|
> |  3|3|   1|
> |  4|4|   1|
> |  1|   10|   2|
> |  2|   20|   2|
> |  3|   30|   2|
> |  4|   40|   2|
> inputDS.write.partitionBy("filterColumn").parquet(dataPath)
> val dataDF = spark.read.parquet(dataPath)
> case class LeftClass(id: Int, aggLeft: Long)
> case class RightClass(id: Int, aggRight: Long)
> val leftDS = dataDF.filter("filterColumn = 1").groupBy("id").agg(sum("value") 
> as "aggLeft").as[LeftClass]
> val rightDS = dataDF.filter("filterColumn = 
> 2").groupBy("id").agg(sum("value") as "aggRight").as[RightClass]
> leftDS.show
> | id|aggLeft|
> |  1|  1|
> |  3|  3|
> |  4|  4|
> |  2|  2|
> rightDS.show
> | id|aggRight|
> |  1|  10|
> |  3|  30|
> |  4|  40|
> |  2|  20|
> val joinedDS = leftDS.join(rightDS,"id")
> joinedDS.show
> | id|aggLeft|aggRight|
> |  1|  1|   1|
> |  3|  3|   3|
> |  4|  4|   4|
> |  2|  2|   2|
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17316) Don't block StandaloneSchedulerBackend.executorRemoved

2016-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468566#comment-15468566
 ] 

Apache Spark commented on SPARK-17316:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/14983

> Don't block StandaloneSchedulerBackend.executorRemoved
> --
>
> Key: SPARK-17316
> URL: https://issues.apache.org/jira/browse/SPARK-17316
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
>
> StandaloneSchedulerBackend.executorRemoved is a blocking call right now. It 
> may cause some deadlock since it's called inside 
> StandaloneAppClient.ClientEndpoint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17377) Joining Datasets read and aggregated from a partitioned Parquet file gives wrong results

2016-09-06 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-17377:
---
Description: 
Reproduction: 

1) Read two Datasets from a partitioned Parquet file with different filter 
conditions on the partitioning column
2) Group by a column and aggregate the two data sets
3) Join the aggregated Datasets on the group by column
4) In the joined dataset, the aggregated values from the right Dataset have 
been replaced with the aggregated values from the left Dataset 

The issue is only reproduced when the input parquet file is partitioned.

Example: 
{code}

val dataPath= "/your/data/path/" 

case class InputData(id: Int, value: Int, filterColumn: Int)

val inputDS = Seq(InputData(1, 1, 1), InputData(2, 2, 1), InputData(3, 3, 1), 
InputData(4, 4, 1), InputData(1, 10, 2), InputData(2, 20, 2), InputData(3, 30, 
2), InputData(4, 40, 2)).toDS()

inputDS.show
| id|value|filterColumn|
|  1|1|   1|
|  2|2|   1|
|  3|3|   1|
|  4|4|   1|
|  1|   10|   2|
|  2|   20|   2|
|  3|   30|   2|
|  4|   40|   2|

inputDS.write.partitionBy("filterColumn").parquet(dataPath)

val dataDF = spark.read.parquet(dataPath)

case class LeftClass(id: Int, aggLeft: Long)

case class RightClass(id: Int, aggRight: Long)

val leftDS = dataDF.filter("filterColumn = 1").groupBy("id").agg(sum("value") 
as "aggLeft").as[LeftClass]

val rightDS = dataDF.filter("filterColumn = 2").groupBy("id").agg(sum("value") 
as "aggRight").as[RightClass]

leftDS.show
| id|aggLeft|
|  1|  1|
|  3|  3|
|  4|  4|
|  2|  2|

rightDS.show
| id|aggRight|
|  1|  10|
|  3|  30|
|  4|  40|
|  2|  20|

val joinedDS = leftDS.join(rightDS,"id")
joinedDS.show
| id|aggLeft|aggRight|
|  1|  1|   1|
|  3|  3|   3|
|  4|  4|   4|
|  2|  2|   2|
{code}

  was:
Reproduction: 

1) Read two Datasets from a partitioned Parquet file with different filter 
conditions on the partitioning column
2) Group by a column and aggregate the two data sets
3) Join the aggregated Datasets on the group by column
4) In the joined dataset, the aggregated values from the right Dataset have 
been replaced with the aggregated values from the left Dataset 

The issue is only reproduced when the input parquet file is partitioned.

Example: 

val dataPath= "/your/data/path/" 

case class InputData(id: Int, value: Int, filterColumn: Int)

val inputDS = Seq(InputData(1, 1, 1), InputData(2, 2, 1), InputData(3, 3, 1), 
InputData(4, 4, 1), InputData(1, 10, 2), InputData(2, 20, 2), InputData(3, 30, 
2), InputData(4, 40, 2)).toDS()

inputDS.show
| id|value|filterColumn|
|  1|1|   1|
|  2|2|   1|
|  3|3|   1|
|  4|4|   1|
|  1|   10|   2|
|  2|   20|   2|
|  3|   30|   2|
|  4|   40|   2|

inputDS.write.partitionBy("filterColumn").parquet(dataPath)

val dataDF = spark.read.parquet(dataPath)

case class LeftClass(id: Int, aggLeft: Long)

case class RightClass(id: Int, aggRight: Long)

val leftDS = dataDF.filter("filterColumn = 1").groupBy("id").agg(sum("value") 
as "aggLeft").as[LeftClass]

val rightDS = dataDF.filter("filterColumn = 2").groupBy("id").agg(sum("value") 
as "aggRight").as[RightClass]

leftDS.show
| id|aggLeft|
|  1|  1|
|  3|  3|
|  4|  4|
|  2|  2|

rightDS.show
| id|aggRight|
|  1|  10|
|  3|  30|
|  4|  40|
|  2|  20|

val joinedDS = leftDS.join(rightDS,"id")
joinedDS.show
| id|aggLeft|aggRight|
|  1|  1|   1|
|  3|  3|   3|
|  4|  4|   4|
|  2|  2|   2|


> Joining Datasets read and aggregated from a partitioned Parquet file gives 
> wrong results
> 
>
> Key: SPARK-17377
> URL: https://issues.apache.org/jira/browse/SPARK-17377
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Hanna Mäki
>
> Reproduction: 
> 1) Read two Datasets from a partitioned Parquet file with different filter 
> conditions on the partitioning column
> 2) Group by a column and aggregate the two data sets
> 3) Join the aggregated Datasets on the group by column
> 4) In the joined dataset, the aggregated values from the right Dataset have 
> been replaced with the aggregated values from the left Dataset 
> The issue is only reproduced when the input parquet file is partitioned.
> Example: 
> {code}
> val dataPath= "/your/data/path/" 
> case class InputData(id: Int, value: Int, filterColumn: Int)
> val inputDS = Seq(InputData(1, 1, 1), InputData(2, 2, 1), InputData(3, 3, 1), 
> InputData(4, 4, 1), InputData(1, 10, 2), InputData(2, 20, 2), InputData(3, 
> 30, 2), InputData(4, 40, 2)).toDS()
> inputDS.show
> | id|value|filterColumn|
> |  1|1| 

[jira] [Commented] (SPARK-17420) Install rmarkdown R package on Jenkins machines

2016-09-06 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468539#comment-15468539
 ] 

Shivaram Venkataraman commented on SPARK-17420:
---

This came up in https://github.com/apache/spark/pull/14980

cc [~junyangq] [~shaneknapp] 

> Install rmarkdown R package on Jenkins machines
> ---
>
> Key: SPARK-17420
> URL: https://issues.apache.org/jira/browse/SPARK-17420
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SparkR
>Reporter: Shivaram Venkataraman
>
> For building SparkR vignettes on Jenkins machines, we need the rmarkdown R 
> package. The package is available at 
> https://cran.r-project.org/web/packages/rmarkdown/index.html - I think 
> running something like
> Rscript -e 'install.packages("rmarkdown", repos="http://cran.stat.ucla.edu/")'
> should work



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17420) Install rmarkdown R package on Jenkins machines

2016-09-06 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-17420:
-

 Summary: Install rmarkdown R package on Jenkins machines
 Key: SPARK-17420
 URL: https://issues.apache.org/jira/browse/SPARK-17420
 Project: Spark
  Issue Type: Improvement
  Components: Build, SparkR
Reporter: Shivaram Venkataraman


For building SparkR vignettes on Jenkins machines, we need the rmarkdown R 
package. The package is available at 
https://cran.r-project.org/web/packages/rmarkdown/index.html - I think running 
something like
Rscript -e 'install.packages("rmarkdown", repos="http://cran.stat.ucla.edu/")'

should work



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17403) Fatal Error: Scan cached strings

2016-09-06 Thread Ruben Hernando (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468528#comment-15468528
 ] 

Ruben Hernando commented on SPARK-17403:


I'm sorry, I can't share the data.

This is a two-table join, where each table has nearly 37 million records, and one 
of them has about 40 columns. As far as I can see, the string columns are identifiers 
(with many null values).

By the way, since the problematic frame is "org.apache.spark.unsafe.Platform.getLong", 
wouldn't that mean the field is a Long?

> Fatal Error: Scan cached strings
> 
>
> Key: SPARK-17403
> URL: https://issues.apache.org/jira/browse/SPARK-17403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Spark standalone cluster (3 Workers, 47 cores)
> Ubuntu 14
> Java 8
>Reporter: Ruben Hernando
>
> The process creates views from JDBC (SQL server) source and combines them to 
> create other views.
> Finally it dumps results via JDBC
> Error:
> {quote}
> # JRE version: Java(TM) SE Runtime Environment (8.0_101-b13) (build 
> 1.8.0_101-b13)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode 
> linux-amd64 )
> # Problematic frame:
> # J 4895 C1 org.apache.spark.unsafe.Platform.getLong(Ljava/lang/Object;J)J (9 
> bytes) @ 0x7fbb355dfd6c [0x7fbb355dfd60+0xc]
> #
> {quote}
> SQL Query plan (fields truncated):
> {noformat}
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'UnresolvedRelation `COEQ_63`
> == Analyzed Logical Plan ==
> InstanceId: bigint, price: double, ZoneId: int, priceItemId: int, priceId: int
> Project [InstanceId#20236L, price#20237, ZoneId#20239, priceItemId#20242, 
> priceId#20244]
> +- SubqueryAlias coeq_63
>+- Project [_TableSL_SID#143L AS InstanceId#20236L, SL_RD_ColR_N#189 AS 
> price#20237, 24 AS ZoneId#20239, 6 AS priceItemId#20242, 63 AS priceId#20244]
>   +- SubqueryAlias 6__input
>  +- 
> Relation[_TableSL_SID#143L,_TableP_DC_SID#144L,_TableSH_SID#145L,ID#146,Name#147,TableP_DCID#148,TableSHID#149,SL_ACT_GI_DTE#150,SL_Xcl_C#151,SL_Xcl_C#152,SL_Css_Cojs#153L,SL_Config#154,SL_CREATEDON#
>  .. 36 more fields] JDBCRelation((select [SLTables].[_TableSL_SID], 
> [SLTables]. ... [...]  FROM [sch].[SLTables] [SLTables] JOIN sch.TPSLTables 
> TPSLTables ON [TPSLTables].[_TableSL_SID] = [SLTables].[_TableSL_SID] where 
> _TP = 24) input)
> == Optimized Logical Plan ==
> Project [_TableSL_SID#143L AS InstanceId#20236L, SL_RD_ColR_N#189 AS 
> price#20237, 24 AS ZoneId#20239, 6 AS priceItemId#20242, 63 AS priceId#20244]
> +- InMemoryRelation [_TableSL_SID#143L, _TableP_DC_SID#144L, 
> _TableSH_SID#145L, ID#146, Name#147, ... 36 more fields], true, 1, 
> StorageLevel(disk, memory, deserialized, 1 replicas)
>:  +- *Scan JDBCRelation((select [SLTables].[_TableSL_SID], 
> [SLTables].[_TableP_DC_SID], [SLTables].[_TableSH_SID], [SLTables].[ID], 
> [SLTables].[Name], [SLTables].[TableP_DCID], [SLTables].[TableSHID], 
> [TPSLTables].[SL_ACT_GI_DTE],  ... [...] FROM [sch].[SLTables] [SLTables] 
> JOIN sch.TPSLTables TPSLTables ON [TPSLTables].[_TableSL_SID] = 
> [SLTables].[_TableSL_SID] where _TP = 24) input) 
> [_TableSL_SID#143L,_TableP_DC_SID#144L,_TableSH_SID#145L,ID#146,Name#147,TableP_DCID#148,TableSHID#149,SL_ACT_GI_DTE#150,SL_Xcl_C#151,...
>  36 more fields] 
> == Physical Plan ==
> *Project [_TableSL_SID#143L AS InstanceId#20236L, SL_RD_ColR_N#189 AS 
> price#20237, 24 AS ZoneId#20239, 6 AS priceItemId#20242, 63 AS priceId#20244]
> +- InMemoryTableScan [_TableSL_SID#143L, SL_RD_ColR_N#189]
>:  +- InMemoryRelation [_TableSL_SID#143L, _TableP_DC_SID#144L, 
> _TableSH_SID#145L, ID#146, Name#147, ... 36 more fields], true, 1, 
> StorageLevel(disk, memory, deserialized, 1 replicas)
>: :  +- *Scan JDBCRelation((select [SLTables].[_TableSL_SID], 
> [SLTables].[_TableP_DC_SID], [SLTables].[_TableSH_SID], [SLTables].[ID], 
> [SLTables].[Name], [SLTables].[TableP_DCID],  ... [...] FROM [sch].[SLTables] 
> [SLTables] JOIN sch.TPSLTables TPSLTables ON [TPSLTables].[_TableSL_SID] = 
> [SLTables].[_TableSL_SID] where _TP = 24) input) 
> [_TableSL_SID#143L,_TableP_DC_SID#144L,_TableSH_SID#145L,ID#146,Name#147,,... 
> 36 more fields]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11478) ML StringIndexer return inconsistent schema

2016-09-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468525#comment-15468525
 ] 

Joseph K. Bradley commented on SPARK-11478:
---

I'm not sure if it was on purpose.  I could see two arguments:
* Should not be nullable: MLlib algorithms all assume data are complete, with 
no missing fields.
* Should be nullable: Algorithms could (should?) be modified to support 
nullable values.

Is this issue a blocker for any workloads, or is it just an oddity?  I'll 
downgrade it to minor unless someone protests.
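
For reference, the mismatch can be seen directly from the reporter's example (a sketch reusing the "df" and "indexer" defined there):

{code}
// nullable flag as seen on the transformed DataFrame vs. transformSchema
println(indexer.transform(df).schema("labelIndex").nullable)        // true
println(indexer.transformSchema(df.schema)("labelIndex").nullable)  // false
{code}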

> ML StringIndexer return inconsistent schema
> ---
>
> Key: SPARK-11478
> URL: https://issues.apache.org/jira/browse/SPARK-11478
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>
> ML StringIndexer transform and transformSchema return inconsistent schema.
> {code}
> val data = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, 
> "a"), (5, "c")), 2)
> val df = sqlContext.createDataFrame(data).toDF("id", "label")
> val indexer = new StringIndexer()
>   .setInputCol("label")
>   .setOutputCol("labelIndex")
>   .fit(df)
> val transformed = indexer.transform(df)
> println(transformed.schema.toString())
> println(indexer.transformSchema(df.schema))
> The nullable flag of "labelIndex" returns inconsistent values:
> StructType(StructField(id,IntegerType,false), 
> StructField(label,StringType,true), StructField(labelIndex,DoubleType,true))
> StructType(StructField(id,IntegerType,false), 
> StructField(label,StringType,true), StructField(labelIndex,DoubleType,false))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11478) ML StringIndexer return inconsistent schema

2016-09-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11478:
--
Priority: Minor  (was: Major)

> ML StringIndexer return inconsistent schema
> ---
>
> Key: SPARK-11478
> URL: https://issues.apache.org/jira/browse/SPARK-11478
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> ML StringIndexer transform and transformSchema return inconsistent schema.
> {code}
> val data = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, 
> "a"), (5, "c")), 2)
> val df = sqlContext.createDataFrame(data).toDF("id", "label")
> val indexer = new StringIndexer()
>   .setInputCol("label")
>   .setOutputCol("labelIndex")
>   .fit(df)
> val transformed = indexer.transform(df)
> println(transformed.schema.toString())
> println(indexer.transformSchema(df.schema))
> The nullable flag of "labelIndex" returns inconsistent values:
> StructType(StructField(id,IntegerType,false), 
> StructField(label,StringType,true), StructField(labelIndex,DoubleType,true))
> StructType(StructField(id,IntegerType,false), 
> StructField(label,StringType,true), StructField(labelIndex,DoubleType,false))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17417) Fix # of partitions for RDD while checkpointing - Currently limited by 10000(%05d)

2016-09-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468519#comment-15468519
 ] 

Sean Owen commented on SPARK-17417:
---

I'd bump the padding to allow 10 digits, because that would accommodate a 
32-bit int, and having that many partitions would cause other things to fail. 
As long as the parsing code can read the 'old format' too, should work fine.
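
To illustrate the sorting problem and the proposed padding bump (an editor's sketch, not the actual ReliableCheckpointRDD code; the partition indices are examples):

{code}
// With %05d padding, part-file names stop sorting numerically once an index
// needs more than five digits: part-100000 lands between part-10000 and
// part-10001 when the names are compared as strings.
val fiveDigit = Seq(10000, 10001, 100000).map(i => f"part-$i%05d").sorted
println(fiveDigit)  // List(part-10000, part-100000, part-10001)

// Padding to 10 digits covers any non-negative 32-bit index and keeps
// lexicographic order identical to numeric order.
val tenDigit = Seq(10000, 10001, 100000).map(i => f"part-$i%010d").sorted
println(tenDigit)   // List(part-0000010000, part-0000010001, part-0000100000)
{code}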

> Fix # of partitions for RDD while checkpointing - Currently limited by 
> 10000(%05d)
> --
>
> Key: SPARK-17417
> URL: https://issues.apache.org/jira/browse/SPARK-17417
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Dhruve Ashar
>
> Spark currently assumes # of partitions to be less than 100000 and uses %05d 
> padding. 
> If we exceed this number, the sort logic in ReliableCheckpointRDD gets messed up 
> and fails. This is because the part-files are sorted and compared as strings. 
> This makes the filename order part-10000, part-100000, ... instead of 
> part-10000, part-10001, ..., part-100000, and while reconstructing the 
> checkpointed RDD the job fails. 
> Possible solutions: 
> - Bump the padding to allow more partitions or
> - Sort the part files extracting a sub-portion as string and then verify the 
> RDD



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17419) Mesos virtual network support

2016-09-06 Thread Michael Gummelt (JIRA)
Michael Gummelt created SPARK-17419:
---

 Summary: Mesos virtual network support
 Key: SPARK-17419
 URL: https://issues.apache.org/jira/browse/SPARK-17419
 Project: Spark
  Issue Type: Task
  Components: Mesos
Reporter: Michael Gummelt


http://mesos.apache.org/documentation/latest/cni/

This will enable launching executors into virtual networks for isolation and 
security. It will also enable container per IP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17201) Investigate numerical instability for MLOR without regularization

2016-09-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468475#comment-15468475
 ] 

Joseph K. Bradley commented on SPARK-17201:
---

Does this actually resolve the issue?

The quotes above from [~sethah]'s post say:
* A positive semidefinite Hessian can cause numerical instability.
* L-BFGS preserves the property of being positive definite across iterations.

However, this is not a proof that L-BFGS will not have numerical stability 
problems.  E.g., it does not preclude issues such as the approximation of the 
Hessian becoming increasingly poorly conditioned with each iteration.

+1 for rigorous testing!
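
For context, the instability the UFLDL page points at comes from the overparameterization of the unregularized softmax model (a restatement of that reference, not new analysis): subtracting a common vector from every class's weights leaves the probabilities unchanged,

{noformat}
P(y = k \mid x; \theta) = \frac{\exp(\theta_k^\top x)}{\sum_{j=1}^K \exp(\theta_j^\top x)}
                        = \frac{\exp((\theta_k - \psi)^\top x)}{\sum_{j=1}^K \exp((\theta_j - \psi)^\top x)}
{noformat}

so the objective has a flat direction and the Hessian is only positive semidefinite. An L2 term, or pivoting (fixing one class's coefficients to zero), removes that direction.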

> Investigate numerical instability for MLOR without regularization
> -
>
> Key: SPARK-17201
> URL: https://issues.apache.org/jira/browse/SPARK-17201
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> As mentioned 
> [here|http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression], when no 
> regularization is applied in Softmax regression, second order Newton solvers 
> may run into numerical instability problems. We should investigate this in 
> practice and find a solution, possibly by implementing pivoting when no 
> regularization is applied.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17403) Fatal Error: Scan cached strings

2016-09-06 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468449#comment-15468449
 ] 

Davies Liu commented on SPARK-17403:


[~rhernando] Could you pull out the string column (SL_RD_ColR_N) and dump it as 
a parquet file to reproduce the issue here? There could be a bug in scanning cached 
strings (it depends on the data). 
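
For example, something along these lines (an editor's sketch; "jdbcDF" stands for the DataFrame read from the JDBC source in the report, and the output path is arbitrary):

{code}
// Dump only the suspect string column so the data can be shared and replayed.
jdbcDF.select("SL_RD_ColR_N")
  .write
  .parquet("/tmp/SL_RD_ColR_N.parquet")
{code}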

> Fatal Error: Scan cached strings
> 
>
> Key: SPARK-17403
> URL: https://issues.apache.org/jira/browse/SPARK-17403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Spark standalone cluster (3 Workers, 47 cores)
> Ubuntu 14
> Java 8
>Reporter: Ruben Hernando
>
> The process creates views from JDBC (SQL server) source and combines them to 
> create other views.
> Finally it dumps results via JDBC
> Error:
> {quote}
> # JRE version: Java(TM) SE Runtime Environment (8.0_101-b13) (build 
> 1.8.0_101-b13)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode 
> linux-amd64 )
> # Problematic frame:
> # J 4895 C1 org.apache.spark.unsafe.Platform.getLong(Ljava/lang/Object;J)J (9 
> bytes) @ 0x7fbb355dfd6c [0x7fbb355dfd60+0xc]
> #
> {quote}
> SQL Query plan (fields truncated):
> {noformat}
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'UnresolvedRelation `COEQ_63`
> == Analyzed Logical Plan ==
> InstanceId: bigint, price: double, ZoneId: int, priceItemId: int, priceId: int
> Project [InstanceId#20236L, price#20237, ZoneId#20239, priceItemId#20242, 
> priceId#20244]
> +- SubqueryAlias coeq_63
>+- Project [_TableSL_SID#143L AS InstanceId#20236L, SL_RD_ColR_N#189 AS 
> price#20237, 24 AS ZoneId#20239, 6 AS priceItemId#20242, 63 AS priceId#20244]
>   +- SubqueryAlias 6__input
>  +- 
> Relation[_TableSL_SID#143L,_TableP_DC_SID#144L,_TableSH_SID#145L,ID#146,Name#147,TableP_DCID#148,TableSHID#149,SL_ACT_GI_DTE#150,SL_Xcl_C#151,SL_Xcl_C#152,SL_Css_Cojs#153L,SL_Config#154,SL_CREATEDON#
>  .. 36 more fields] JDBCRelation((select [SLTables].[_TableSL_SID], 
> [SLTables]. ... [...]  FROM [sch].[SLTables] [SLTables] JOIN sch.TPSLTables 
> TPSLTables ON [TPSLTables].[_TableSL_SID] = [SLTables].[_TableSL_SID] where 
> _TP = 24) input)
> == Optimized Logical Plan ==
> Project [_TableSL_SID#143L AS InstanceId#20236L, SL_RD_ColR_N#189 AS 
> price#20237, 24 AS ZoneId#20239, 6 AS priceItemId#20242, 63 AS priceId#20244]
> +- InMemoryRelation [_TableSL_SID#143L, _TableP_DC_SID#144L, 
> _TableSH_SID#145L, ID#146, Name#147, ... 36 more fields], true, 1, 
> StorageLevel(disk, memory, deserialized, 1 replicas)
>:  +- *Scan JDBCRelation((select [SLTables].[_TableSL_SID], 
> [SLTables].[_TableP_DC_SID], [SLTables].[_TableSH_SID], [SLTables].[ID], 
> [SLTables].[Name], [SLTables].[TableP_DCID], [SLTables].[TableSHID], 
> [TPSLTables].[SL_ACT_GI_DTE],  ... [...] FROM [sch].[SLTables] [SLTables] 
> JOIN sch.TPSLTables TPSLTables ON [TPSLTables].[_TableSL_SID] = 
> [SLTables].[_TableSL_SID] where _TP = 24) input) 
> [_TableSL_SID#143L,_TableP_DC_SID#144L,_TableSH_SID#145L,ID#146,Name#147,TableP_DCID#148,TableSHID#149,SL_ACT_GI_DTE#150,SL_Xcl_C#151,...
>  36 more fields] 
> == Physical Plan ==
> *Project [_TableSL_SID#143L AS InstanceId#20236L, SL_RD_ColR_N#189 AS 
> price#20237, 24 AS ZoneId#20239, 6 AS priceItemId#20242, 63 AS priceId#20244]
> +- InMemoryTableScan [_TableSL_SID#143L, SL_RD_ColR_N#189]
>:  +- InMemoryRelation [_TableSL_SID#143L, _TableP_DC_SID#144L, 
> _TableSH_SID#145L, ID#146, Name#147, ... 36 more fields], true, 1, 
> StorageLevel(disk, memory, deserialized, 1 replicas)
>: :  +- *Scan JDBCRelation((select [SLTables].[_TableSL_SID], 
> [SLTables].[_TableP_DC_SID], [SLTables].[_TableSH_SID], [SLTables].[ID], 
> [SLTables].[Name], [SLTables].[TableP_DCID],  ... [...] FROM [sch].[SLTables] 
> [SLTables] JOIN sch.TPSLTables TPSLTables ON [TPSLTables].[_TableSL_SID] = 
> [SLTables].[_TableSL_SID] where _TP = 24) input) 
> [_TableSL_SID#143L,_TableP_DC_SID#144L,_TableSH_SID#145L,ID#146,Name#147,,... 
> 36 more fields]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17418) Spark release must NOT distribute Kinesis related artifacts

2016-09-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468442#comment-15468442
 ] 

Sean Owen commented on SPARK-17418:
---

Aha. The problem is this: 
https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kinesis-asl-assembly_2.10
 It's not part of the source/binary distribution but an optional module 
published to Maven, and the assembly here actually does include the Kinesis 
binary code.

> Spark release must NOT distribute Kinesis related artifacts
> ---
>
> Key: SPARK-17418
> URL: https://issues.apache.org/jira/browse/SPARK-17418
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Streaming
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Luciano Resende
>Priority: Critical
>
> The Kinesis streaming connector is based on the Amazon Software License, and 
> based on the Apache Legal resolved issues 
> (http://www.apache.org/legal/resolved.html#category-x) it's not allowed to be 
> distributed by Apache projects.
> More details are available in LEGAL-198



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17403) Fatal Error: Scan cached strings

2016-09-06 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-17403:
---
Summary: Fatal Error: Scan cached strings  (was: Fatal Error: SIGSEGV on 
Jdbc joins)

> Fatal Error: Scan cached strings
> 
>
> Key: SPARK-17403
> URL: https://issues.apache.org/jira/browse/SPARK-17403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Spark standalone cluster (3 Workers, 47 cores)
> Ubuntu 14
> Java 8
>Reporter: Ruben Hernando
>
> The process creates views from JDBC (SQL server) source and combines them to 
> create other views.
> Finally it dumps results via JDBC
> Error:
> {quote}
> # JRE version: Java(TM) SE Runtime Environment (8.0_101-b13) (build 
> 1.8.0_101-b13)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode 
> linux-amd64 )
> # Problematic frame:
> # J 4895 C1 org.apache.spark.unsafe.Platform.getLong(Ljava/lang/Object;J)J (9 
> bytes) @ 0x7fbb355dfd6c [0x7fbb355dfd60+0xc]
> #
> {quote}
> SQL Query plan (fields truncated):
> {noformat}
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'UnresolvedRelation `COEQ_63`
> == Analyzed Logical Plan ==
> InstanceId: bigint, price: double, ZoneId: int, priceItemId: int, priceId: int
> Project [InstanceId#20236L, price#20237, ZoneId#20239, priceItemId#20242, 
> priceId#20244]
> +- SubqueryAlias coeq_63
>+- Project [_TableSL_SID#143L AS InstanceId#20236L, SL_RD_ColR_N#189 AS 
> price#20237, 24 AS ZoneId#20239, 6 AS priceItemId#20242, 63 AS priceId#20244]
>   +- SubqueryAlias 6__input
>  +- 
> Relation[_TableSL_SID#143L,_TableP_DC_SID#144L,_TableSH_SID#145L,ID#146,Name#147,TableP_DCID#148,TableSHID#149,SL_ACT_GI_DTE#150,SL_Xcl_C#151,SL_Xcl_C#152,SL_Css_Cojs#153L,SL_Config#154,SL_CREATEDON#
>  .. 36 more fields] JDBCRelation((select [SLTables].[_TableSL_SID], 
> [SLTables]. ... [...]  FROM [sch].[SLTables] [SLTables] JOIN sch.TPSLTables 
> TPSLTables ON [TPSLTables].[_TableSL_SID] = [SLTables].[_TableSL_SID] where 
> _TP = 24) input)
> == Optimized Logical Plan ==
> Project [_TableSL_SID#143L AS InstanceId#20236L, SL_RD_ColR_N#189 AS 
> price#20237, 24 AS ZoneId#20239, 6 AS priceItemId#20242, 63 AS priceId#20244]
> +- InMemoryRelation [_TableSL_SID#143L, _TableP_DC_SID#144L, 
> _TableSH_SID#145L, ID#146, Name#147, ... 36 more fields], true, 1, 
> StorageLevel(disk, memory, deserialized, 1 replicas)
>:  +- *Scan JDBCRelation((select [SLTables].[_TableSL_SID], 
> [SLTables].[_TableP_DC_SID], [SLTables].[_TableSH_SID], [SLTables].[ID], 
> [SLTables].[Name], [SLTables].[TableP_DCID], [SLTables].[TableSHID], 
> [TPSLTables].[SL_ACT_GI_DTE],  ... [...] FROM [sch].[SLTables] [SLTables] 
> JOIN sch.TPSLTables TPSLTables ON [TPSLTables].[_TableSL_SID] = 
> [SLTables].[_TableSL_SID] where _TP = 24) input) 
> [_TableSL_SID#143L,_TableP_DC_SID#144L,_TableSH_SID#145L,ID#146,Name#147,TableP_DCID#148,TableSHID#149,SL_ACT_GI_DTE#150,SL_Xcl_C#151,...
>  36 more fields] 
> == Physical Plan ==
> *Project [_TableSL_SID#143L AS InstanceId#20236L, SL_RD_ColR_N#189 AS 
> price#20237, 24 AS ZoneId#20239, 6 AS priceItemId#20242, 63 AS priceId#20244]
> +- InMemoryTableScan [_TableSL_SID#143L, SL_RD_ColR_N#189]
>:  +- InMemoryRelation [_TableSL_SID#143L, _TableP_DC_SID#144L, 
> _TableSH_SID#145L, ID#146, Name#147, ... 36 more fields], true, 1, 
> StorageLevel(disk, memory, deserialized, 1 replicas)
>: :  +- *Scan JDBCRelation((select [SLTables].[_TableSL_SID], 
> [SLTables].[_TableP_DC_SID], [SLTables].[_TableSH_SID], [SLTables].[ID], 
> [SLTables].[Name], [SLTables].[TableP_DCID],  ... [...] FROM [sch].[SLTables] 
> [SLTables] JOIN sch.TPSLTables TPSLTables ON [TPSLTables].[_TableSL_SID] = 
> [SLTables].[_TableSL_SID] where _TP = 24) input) 
> [_TableSL_SID#143L,_TableP_DC_SID#144L,_TableSH_SID#145L,ID#146,Name#147,,... 
> 36 more fields]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17418) Spark release must NOT distribute Kinesis related artifacts

2016-09-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468410#comment-15468410
 ] 

Sean Owen commented on SPARK-17418:
---

I don't see any Kinesis artifacts in the Spark distribution. Where do you see 
them? I grepped the whole project's source and the contents of all the .jar files. 

> Spark release must NOT distribute Kinesis related artifacts
> ---
>
> Key: SPARK-17418
> URL: https://issues.apache.org/jira/browse/SPARK-17418
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Streaming
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Luciano Resende
>Priority: Critical
>
> The Kinesis streaming connector is based on the Amazon Software License, and 
> based on the Apache Legal resolved issues 
> (http://www.apache.org/legal/resolved.html#category-x) it's not allowed to be 
> distributed by Apache projects.
> More details are available in LEGAL-198



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17407) Unable to update structured stream from CSV

2016-09-06 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468372#comment-15468372
 ] 

Seth Hendrickson commented on SPARK-17407:
--

[~chriddyp] The file stream source checks for new files by using the file name. 
Appending new rows to a file that already exists has no effect, by design. We 
can discuss whether the design ought to change, but as far as I can see nothing 
is "wrong" here. If you want to update a streaming csv dataframe, then just add 
new files.
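
A minimal sketch of that workflow (editor's illustration in Scala; the schema, paths and query name are made up, and the same idea applies to the Python API):

{code}
import org.apache.spark.sql.types._

// File-based sources need an explicit schema when used as a stream.
val schema = new StructType().add("id", IntegerType).add("value", StringType)

val streamDF = spark.readStream.schema(schema).csv("/tmp/csv-input")
val query = streamDF.writeStream
  .format("memory")
  .queryName("csv_stream")
  .outputMode("append")
  .start()

// Appending rows to a file that already exists under /tmp/csv-input is ignored;
// dropping a brand-new file into the directory is what triggers a new batch,
// after which spark.sql("select * from csv_stream") reflects the added rows.
{code}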

> Unable to update structured stream from CSV
> ---
>
> Key: SPARK-17407
> URL: https://issues.apache.org/jira/browse/SPARK-17407
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: Mac OSX
> Spark 2.0.0
>Reporter: Chris Parmer
>Priority: Trivial
>  Labels: beginner, newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I am creating a simple example of a Structured Stream from a CSV file with an 
> in-memory output stream.
> When I add rows to the CSV file, my output stream does not update with the new 
> data. From this example: 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4012078893478893/3202384642551446/5985939988045659/latest.html,
>  I expected that subsequent queries on the same output stream would contain 
> updated results.
> Here is a reproducible code example: https://plot.ly/~chris/17703
> Thanks for the help here!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17418) Spark release must NOT distribute Kinesis related artifacts

2016-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468362#comment-15468362
 ] 

Apache Spark commented on SPARK-17418:
--

User 'lresende' has created a pull request for this issue:
https://github.com/apache/spark/pull/14981

> Spark release must NOT distribute Kinesis related artifacts
> ---
>
> Key: SPARK-17418
> URL: https://issues.apache.org/jira/browse/SPARK-17418
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Streaming
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Luciano Resende
>Priority: Critical
>
> The Kinesis streaming connector is based on the Amazon Software License, and 
> based on the Apache Legal resolved issues 
> (http://www.apache.org/legal/resolved.html#category-x) it's not allowed to be 
> distributed by Apache projects.
> More details are available in LEGAL-198



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0

2016-09-06 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468364#comment-15468364
 ] 

Davies Liu commented on SPARK-16922:


Is there any performance difference compared to BytesToBytesMap?

> Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
> -
>
> Key: SPARK-16922
> URL: https://issues.apache.org/jira/browse/SPARK-16922
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Davies Liu
> Fix For: 2.0.1, 2.1.0
>
>
> A query which used to work in Spark 1.6 fails with executor OOM in 2.0.
> Stack trace - 
> {code}
>   at 
> org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Query plan in Spark 1.6
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3])
> +- TungstenExchange hashpartitioning(field1#101,200), None
>+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111])
>   +- Project [field1#101,field2#74]
>  +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as 
> decimal(20,0)) as bigint)], BuildRight
> :- ConvertToUnsafe
> :  +- HiveTableScan [field2#74,field5#63L], MetastoreRelation 
> foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)]
> +- ConvertToUnsafe
>+- HiveTableScan [field1#101,field4#97], MetastoreRelation 
> foo, table2, Some(b)
> {code}
> Query plan in 2.0
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))])
> +- Exchange hashpartitioning(field1#160, 200)
>+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / 
> 100.0))])
>   +- *Project [field2#133, field1#160]
>  +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as 
> decimal(20,0)) as bigint)], Inner, BuildRight
> :- *Filter isnotnull(field5#122L)
> :  +- HiveTableScan [field5#122L, field2#133], MetastoreRelation 
> foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= 
> 2013-12-31)]
> +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as 
> decimal(20,0)) as bigint)))
>+- *Filter isnotnull(field4#156)
>   +- HiveTableScan [field4#156, field1#160], 
> MetastoreRelation foo, table2, b
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17418) Spark release must NOT distribute Kinesis related artifacts

2016-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17418:


Assignee: Apache Spark

> Spark release must NOT distribute Kinesis related artifacts
> ---
>
> Key: SPARK-17418
> URL: https://issues.apache.org/jira/browse/SPARK-17418
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Streaming
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Luciano Resende
>Assignee: Apache Spark
>Priority: Critical
>
> The Kinesis streaming connector is based on the Amazon Software License, and 
> based on the Apache Legal resolved issues 
> (http://www.apache.org/legal/resolved.html#category-x) it's not allowed to be 
> distributed by Apache projects.
> More details are available in LEGAL-198



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17418) Spark release must NOT distribute Kinesis related artifacts

2016-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17418:


Assignee: (was: Apache Spark)

> Spark release must NOT distribute Kinesis related artifacts
> ---
>
> Key: SPARK-17418
> URL: https://issues.apache.org/jira/browse/SPARK-17418
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Streaming
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Luciano Resende
>Priority: Critical
>
> The Kinesis streaming connector is based on the Amazon Software License, and 
> based on the Apache Legal resolved issues 
> (http://www.apache.org/legal/resolved.html#category-x) it's not allowed to be 
> distributed by Apache projects.
> More details are available in LEGAL-198



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17418) Spark release must NOT distribute Kinesis related artifacts

2016-09-06 Thread Luciano Resende (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468348#comment-15468348
 ] 

Luciano Resende commented on SPARK-17418:
-

I am going to create a PR for this, basically removing it from the release publish 
process. I will have to see what the side effects are on the overall build, as 
there is embedded support for it in the Python code and samples (which seems to suggest 
that this is not really an optional package). 

> Spark release must NOT distribute Kinesis related artifacts
> ---
>
> Key: SPARK-17418
> URL: https://issues.apache.org/jira/browse/SPARK-17418
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Streaming
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Luciano Resende
>Priority: Critical
>
> The Kinesis streaming connector is based on the Amazon Software License, and 
> based on the Apache Legal resolved issues 
> (http://www.apache.org/legal/resolved.html#category-x) it's not allowed to be 
> distributed by Apache projects.
> More details are available in LEGAL-198



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17418) Spark release must NOT distribute Kinesis related artifacts

2016-09-06 Thread Luciano Resende (JIRA)
Luciano Resende created SPARK-17418:
---

 Summary: Spark release must NOT distribute Kinesis related 
artifacts
 Key: SPARK-17418
 URL: https://issues.apache.org/jira/browse/SPARK-17418
 Project: Spark
  Issue Type: Bug
  Components: Build, Streaming
Affects Versions: 2.0.0, 1.6.2
Reporter: Luciano Resende
Priority: Critical


The Kinesis streaming connector is based on the Amazon Software License, and 
based on the Apache Legal resolved issues 
(http://www.apache.org/legal/resolved.html#category-x) it's not allowed to be 
distributed by Apache projects.

More details are available in LEGAL-198



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0

2016-09-06 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468339#comment-15468339
 ] 

Sital Kedia commented on SPARK-16922:
-

[~davies] -  Thanks for looking into this. I tested the failing job with the 
fix and it works fine now. 

> Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
> -
>
> Key: SPARK-16922
> URL: https://issues.apache.org/jira/browse/SPARK-16922
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Davies Liu
> Fix For: 2.0.1, 2.1.0
>
>
> A query which used to work in Spark 1.6 fails with executor OOM in 2.0.
> Stack trace - 
> {code}
>   at 
> org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Query plan in Spark 1.6
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3])
> +- TungstenExchange hashpartitioning(field1#101,200), None
>+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / 
> 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111])
>   +- Project [field1#101,field2#74]
>  +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as 
> decimal(20,0)) as bigint)], BuildRight
> :- ConvertToUnsafe
> :  +- HiveTableScan [field2#74,field5#63L], MetastoreRelation 
> foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)]
> +- ConvertToUnsafe
>+- HiveTableScan [field1#101,field4#97], MetastoreRelation 
> foo, table2, Some(b)
> {code}
> Query plan in 2.0
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))])
> +- Exchange hashpartitioning(field1#160, 200)
>+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / 
> 100.0))])
>   +- *Project [field2#133, field1#160]
>  +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as 
> decimal(20,0)) as bigint)], Inner, BuildRight
> :- *Filter isnotnull(field5#122L)
> :  +- HiveTableScan [field5#122L, field2#133], MetastoreRelation 
> foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= 
> 2013-12-31)]
> +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as 
> decimal(20,0)) as bigint)))
>+- *Filter isnotnull(field4#156)
>   +- HiveTableScan [field4#156, field1#160], 
> MetastoreRelation foo, table2, b
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17417) Fix # of partitions for RDD while checkpointing - Currently limited by 10000(%05d)

2016-09-06 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-17417:


 Summary: Fix # of partitions for RDD while checkpointing - 
Currently limited by 10000(%05d)
 Key: SPARK-17417
 URL: https://issues.apache.org/jira/browse/SPARK-17417
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Dhruve Ashar


Spark currently assumes # of partitions to be less than 100000 and uses %05d 
padding. 

If we exceed this number, the sort logic in ReliableCheckpointRDD gets messed up 
and fails. This is because the part-files are sorted and compared as strings. 

This makes the filename order part-10000, part-100000, ... instead of 
part-10000, part-10001, ..., part-100000, and while reconstructing the 
checkpointed RDD the job fails. 

Possible solutions: 
- Bump the padding to allow more partitions or
- Sort the part files extracting a sub-portion as string and then verify the RDD



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17317) Add package vignette to SparkR

2016-09-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17317:


Assignee: Apache Spark

> Add package vignette to SparkR
> --
>
> Key: SPARK-17317
> URL: https://issues.apache.org/jira/browse/SPARK-17317
> Project: Spark
>  Issue Type: Improvement
>Reporter: Junyang Qian
>Assignee: Apache Spark
>
> In publishing SparkR to CRAN, it would be nice to have a vignette as a user 
> guide that
> * describes the big picture
> * introduces the use of various methods
> This is important for new users because they may not even know which method 
> to look up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17317) Add package vignette to SparkR

2016-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468250#comment-15468250
 ] 

Apache Spark commented on SPARK-17317:
--

User 'junyangq' has created a pull request for this issue:
https://github.com/apache/spark/pull/14980

> Add package vignette to SparkR
> --
>
> Key: SPARK-17317
> URL: https://issues.apache.org/jira/browse/SPARK-17317
> Project: Spark
>  Issue Type: Improvement
>Reporter: Junyang Qian
>
> In publishing SparkR to CRAN, it would be nice to have a vignette as a user 
> guide that
> * describes the big picture
> * introduces the use of various methods
> This is important for new users because they may not even know which method 
> to look up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



  1   2   >