[jira] [Created] (SPARK-18869) Add lp and pp to plan nodes for getting logical plans and physical plans

2016-12-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18869:
---

 Summary: Add lp and pp to plan nodes for getting logical plans and 
physical plans
 Key: SPARK-18869
 URL: https://issues.apache.org/jira/browse/SPARK-18869
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] rather 
than a more specific type. For interactive debugging it would be convenient to 
introduce lp, which returns a LogicalPlan, and pp, which returns a SparkPlan.
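
A minimal sketch of how such helpers might look and be used (the lp/pp names come 
from the summary above; the bodies are assumptions, not an actual implementation):

{code}
// Hypothetical shape of the helpers (assumed, not the real patch): thin, typed
// wrappers around the existing positional apply(i) lookup.
//   On LogicalPlan:  def lp(i: Int): LogicalPlan = apply(i).asInstanceOf[LogicalPlan]
//   On SparkPlan:    def pp(i: Int): SparkPlan   = apply(i).asInstanceOf[SparkPlan]

// Interactive-debugging usage in spark-shell:
val df = spark.range(10).filter("id > 5")
df.queryExecution.analyzed(0)           // today: TreeNode[_], needs a cast to go further
// df.queryExecution.analyzed.lp(0)     // proposed: LogicalPlan directly
// df.queryExecution.executedPlan.pp(0) // proposed: SparkPlan directly
{code}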







[jira] [Assigned] (SPARK-18854) getNodeNumbered and generateTreeString are not consistent

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-18854:
---

Assignee: Reynold Xin

> getNodeNumbered and generateTreeString are not consistent
> -
>
> Key: SPARK-18854
> URL: https://issues.apache.org/jira/browse/SPARK-18854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.3, 2.1.0
>
>
> This is a bug introduced by subquery handling. generateTreeString numbers 
> trees including innerChildren (used to print subqueries), but getNodeNumbered 
> ignores that. As a result, getNodeNumbered is not always correct.
> Repro:
> {code}
> val df = sql("select * from range(10) where id not in " +
>   "(select id from range(2) union all select id from range(2))")
> println("---")
> println(df.queryExecution.analyzed.numberedTreeString)
> println("---")
> println("---")
> println(df.queryExecution.analyzed(3))
> println("---")
> {code}
> Output looks like
> {noformat}
> ---
> 00 Project [id#1L]
> 01 +- Filter NOT predicate-subquery#0 [(id#1L = id#2L)]
> 02:  +- Union
> 03: :- Project [id#2L]
> 04: :  +- Range (0, 2, step=1, splits=None)
> 05: +- Project [id#3L]
> 06:+- Range (0, 2, step=1, splits=None)
> 07+- Range (0, 10, step=1, splits=None)
> ---
> ---
> null
> ---
> {noformat}
> Note that 3 should be the Project node, but getNodeNumbered ignores 
> innerChild and as a result returns the wrong one.
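
A self-contained sketch of the kind of traversal that keeps the lookup consistent 
with the printed numbering (a hypothetical mini tree model, not Spark's actual 
TreeNode or the merged fix):

{code}
// Toy model: a node with ordinary children plus innerChildren (e.g. subquery plans).
case class Node(name: String, innerChildren: Seq[Node] = Nil, children: Seq[Node] = Nil)

// Preorder lookup that visits innerChildren before children, mirroring how the
// numbered tree string in the repro above prints the subquery (nodes 02-06) before
// the Filter's real child (node 07). Skipping innerChildren is what makes the
// lookup drift from the printed numbers.
def nodeNumbered(root: Node, target: Int): Option[Node] = {
  var remaining = target
  def visit(n: Node): Option[Node] = {
    if (remaining == 0) Some(n)
    else {
      remaining -= 1
      (n.innerChildren ++ n.children).foldLeft(Option.empty[Node]) {
        case (found @ Some(_), _) => found
        case (None, child)        => visit(child)
      }
    }
  }
  visit(root)
}
{code}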






[jira] [Resolved] (SPARK-18854) getNodeNumbered and generateTreeString are not consistent

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18854.
-
  Resolution: Fixed
   Fix Version/s: 2.1.0
  2.0.3
Target Version/s: 2.0.3, 2.1.0  (was: 2.0.3, 2.1.1, 2.2.0)

> getNodeNumbered and generateTreeString are not consistent
> -
>
> Key: SPARK-18854
> URL: https://issues.apache.org/jira/browse/SPARK-18854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
> Fix For: 2.0.3, 2.1.0
>
>
> This is a bug introduced by subquery handling. generateTreeString numbers 
> trees including innerChildren (used to print subqueries), but getNodeNumbered 
> ignores that. As a result, getNodeNumbered is not always correct.
> Repro:
> {code}
> val df = sql("select * from range(10) where id not in " +
>   "(select id from range(2) union all select id from range(2))")
> println("---")
> println(df.queryExecution.analyzed.numberedTreeString)
> println("---")
> println("---")
> println(df.queryExecution.analyzed(3))
> println("---")
> {code}
> Output looks like
> {noformat}
> ---
> 00 Project [id#1L]
> 01 +- Filter NOT predicate-subquery#0 [(id#1L = id#2L)]
> 02:  +- Union
> 03: :- Project [id#2L]
> 04: :  +- Range (0, 2, step=1, splits=None)
> 05: +- Project [id#3L]
> 06:+- Range (0, 2, step=1, splits=None)
> 07+- Range (0, 10, step=1, splits=None)
> ---
> ---
> null
> ---
> {noformat}
> Note that 3 should be the Project node, but getNodeNumbered ignores 
> innerChild and as a result returns the wrong one.






[jira] [Commented] (SPARK-18853) Project (UnaryNode) is way too aggressive in estimating statistics

2016-12-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749405#comment-15749405
 ] 

Reynold Xin commented on SPARK-18853:
-

Let's do that separately (I thought about doing it but it might be better to be 
done together with the CBO work anyway).


> Project (UnaryNode) is way too aggressive in estimating statistics 
> ---
>
> Key: SPARK-18853
> URL: https://issues.apache.org/jira/browse/SPARK-18853
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.3, 2.1.0
>
>
> We currently define statistics in UnaryNode: 
> {code}
>   override def statistics: Statistics = {
> // There should be some overhead in Row object, the size should not be 
> zero when there is
> // no columns, this help to prevent divide-by-zero error.
> val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
> val outputRowSize = output.map(_.dataType.defaultSize).sum + 8
> // Assume there will be the same number of rows as child has.
> var sizeInBytes = (child.statistics.sizeInBytes * outputRowSize) / 
> childRowSize
> if (sizeInBytes == 0) {
>   // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be 
> zero
>   // (product of children).
>   sizeInBytes = 1
> }
> child.statistics.copy(sizeInBytes = sizeInBytes)
>   }
> {code}
> This has a few issues:
> 1. This can aggressively underestimate the size for Project. We assume each 
> array/map has 100 elements, which is an overestimate. If the user projects a 
> single field out of a deeply nested field, this would lead to huge 
> underestimation. A safer sane default is probably 1.
> 2. It is not a property of UnaryNode to propagate statistics this way. It 
> should be a property of Project.
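
A rough sketch of the direction in point 2, with the scaling moved onto Project 
(names follow the snippet quoted above; this is a sketch of the suggestion, not 
the actual patch):

{code}
// Hypothetical Project.statistics override (sketch only). UnaryNode would then
// simply propagate child.statistics instead of rescaling it.
override def statistics: Statistics = {
  // Add a small per-row overhead so neither size is zero for zero-column schemas.
  val childRowSize  = child.output.map(_.dataType.defaultSize).sum + 8
  val outputRowSize = output.map(_.dataType.defaultSize).sum + 8
  // Scale by output/child row width, clamped at 1 so that sizeInBytes of a
  // BinaryNode (product of children) never collapses to zero.
  val sizeInBytes =
    ((child.statistics.sizeInBytes * outputRowSize) / childRowSize).max(1)
  child.statistics.copy(sizeInBytes = sizeInBytes)
}
{code}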






[jira] [Resolved] (SPARK-18730) Ask the build script to link to Jenkins test report page instead of full console output page when posting to GitHub

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18730.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.0

> Ask the build script to link to Jenkins test report page instead of full 
> console output page when posting to GitHub
> ---
>
> Key: SPARK-18730
> URL: https://issues.apache.org/jira/browse/SPARK-18730
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
> Fix For: 2.1.0, 2.2.0
>
>
> Currently, the full console output page of a Spark Jenkins PR build can be as 
> large as several megabytes. It takes a relatively long time to load and may 
> even freeze the browser for quite a while.
> I'd suggest posting the test report page link to GitHub instead, which is way 
> more concise and is usually the first page I'd like to check when 
> investigating a Jenkins build failure.







[jira] [Commented] (SPARK-18853) Project (UnaryNode) is way too aggressive in estimating statistics

2016-12-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15748964#comment-15748964
 ] 

Reynold Xin commented on SPARK-18853:
-

Can you say more? Are you talking about deeply nested arrays?


> Project (UnaryNode) is way too aggressive in estimating statistics 
> ---
>
> Key: SPARK-18853
> URL: https://issues.apache.org/jira/browse/SPARK-18853
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> We currently define statistics in UnaryNode: 
> {code}
>   override def statistics: Statistics = {
> // There should be some overhead in Row object, the size should not be 
> zero when there is
> // no columns, this help to prevent divide-by-zero error.
> val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
> val outputRowSize = output.map(_.dataType.defaultSize).sum + 8
> // Assume there will be the same number of rows as child has.
> var sizeInBytes = (child.statistics.sizeInBytes * outputRowSize) / 
> childRowSize
> if (sizeInBytes == 0) {
>   // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be 
> zero
>   // (product of children).
>   sizeInBytes = 1
> }
> child.statistics.copy(sizeInBytes = sizeInBytes)
>   }
> {code}
> This has a few issues:
> 1. This can aggressively underestimate the size for Project. We assume each 
> array/map has 100 elements, which is an overestimate. If the user projects a 
> single field out of a deeply nested field, this would lead to huge 
> underestimation. A safer sane default is probably 1.
> 2. It is not a property of UnaryNode to propagate statistics this way. It 
> should be a property of Project.






[jira] [Updated] (SPARK-18853) Project (UnaryNode) is way too aggressive in estimating statistics

2016-12-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18853:

Description: 
We currently define statistics in UnaryNode: 

{code}
  override def statistics: Statistics = {
// There should be some overhead in Row object, the size should not be zero 
when there is
// no columns, this help to prevent divide-by-zero error.
val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
val outputRowSize = output.map(_.dataType.defaultSize).sum + 8
// Assume there will be the same number of rows as child has.
var sizeInBytes = (child.statistics.sizeInBytes * outputRowSize) / 
childRowSize
if (sizeInBytes == 0) {
  // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be 
zero
  // (product of children).
  sizeInBytes = 1
}

child.statistics.copy(sizeInBytes = sizeInBytes)
  }
{code}

This has a few issues:

1. This can aggressively underestimate the size for Project. We assume each 
array/map has 100 elements, which is an overestimate. If the user projects a 
single field out of a deeply nested field, this would lead to huge 
underestimation. A safer sane default is probably 1.

2. It is not a property of UnaryNode to propagate statistics this way. It 
should be a property of Project.





  was:
We currently define statistics in UnaryNode: 

{code}
  override def statistics: Statistics = {
// There should be some overhead in Row object, the size should not be zero 
when there is
// no columns, this help to prevent divide-by-zero error.
val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
val outputRowSize = output.map(_.dataType.defaultSize).sum + 8
// Assume there will be the same number of rows as child has.
var sizeInBytes = (child.statistics.sizeInBytes * outputRowSize) / 
childRowSize
if (sizeInBytes == 0) {
  // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be 
zero
  // (product of children).
  sizeInBytes = 1
}

child.statistics.copy(sizeInBytes = sizeInBytes)
  }
{code}

This has a few issues:

1. This can aggressively underestimate the size for Project. We assume each 
array/map has 100 elements, which is an overestimate. If the user projects a 
single field out of a deeply nested field, this would lead to huge 
underestimation. A safer sane default is probably 2.

2. It is not a property of UnaryNode to propagate statistics this way. It 
should be a property of Project.






> Project (UnaryNode) is way too aggressive in estimating statistics 
> ---
>
> Key: SPARK-18853
> URL: https://issues.apache.org/jira/browse/SPARK-18853
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> We currently define statistics in UnaryNode: 
> {code}
>   override def statistics: Statistics = {
> // There should be some overhead in Row object, the size should not be 
> zero when there is
> // no columns, this help to prevent divide-by-zero error.
> val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
> val outputRowSize = output.map(_.dataType.defaultSize).sum + 8
> // Assume there will be the same number of rows as child has.
> var sizeInBytes = (child.statistics.sizeInBytes * outputRowSize) / 
> childRowSize
> if (sizeInBytes == 0) {
>   // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be 
> zero
>   // (product of children).
>   sizeInBytes = 1
> }
> child.statistics.copy(sizeInBytes = sizeInBytes)
>   }
> {code}
> This has a few issues:
> 1. This can aggressively underestimate the size for Project. We assume each 
> array/map has 100 elements, which is an overestimate. If the user projects a 
> single field out of a deeply nested field, this would lead to huge 
> underestimation. A safer sane default is probably 1.
> 2. It is not a property of UnaryNode to propagate statistics this way. It 
> should be a property of Project.






[jira] [Created] (SPARK-18856) Newly created catalog table assumed to have 0 rows and 0 bytes

2016-12-13 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18856:
---

 Summary: Newly created catalog table assumed to have 0 rows and 0 
bytes
 Key: SPARK-18856
 URL: https://issues.apache.org/jira/browse/SPARK-18856
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker


{code}
scala> spark.range(100).selectExpr("id % 10 p", 
"id").write.partitionBy("p").format("json").saveAsTable("testjson")

scala> spark.table("testjson").queryExecution.optimizedPlan.statistics
res6: org.apache.spark.sql.catalyst.plans.logical.Statistics = 
Statistics(sizeInBytes=0, isBroadcastable=false)
{code}

It shouldn't be 0. The issue is that in DataSource.scala, we do:

{code}
val fileCatalog = if 
(sparkSession.sqlContext.conf.manageFilesourcePartitions &&
catalogTable.isDefined && 
catalogTable.get.tracksPartitionsInCatalog) {
  new CatalogFileIndex(
sparkSession,
catalogTable.get,
catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(0L))
} else {
  new InMemoryFileIndex(sparkSession, globbedPaths, options, 
Some(partitionSchema))
}
{code}

We shouldn't use 0L as the fallback.
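
A minimal sketch of a safer fallback (assuming the session's 
spark.sql.defaultSizeInBytes value, exposed as conf.defaultSizeInBytes, is an 
acceptable substitute when the catalog has no stats; not necessarily the final fix):

{code}
// Sketch: fall back to the configured default size instead of 0L when stats are missing.
new CatalogFileIndex(
  sparkSession,
  catalogTable.get,
  catalogTable.get.stats.map(_.sizeInBytes.toLong)
    .getOrElse(sparkSession.sessionState.conf.defaultSizeInBytes))
{code}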







[jira] [Updated] (SPARK-18854) getNodeNumbered and generateTreeString are not consistent

2016-12-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18854:

Target Version/s: 2.0.3, 2.1.1, 2.2.0  (was: 2.1.1, 2.2.0)

> getNodeNumbered and generateTreeString are not consistent
> -
>
> Key: SPARK-18854
> URL: https://issues.apache.org/jira/browse/SPARK-18854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> This is a bug introduced by subquery handling. generateTreeString numbers 
> trees including innerChildren (used to print subqueries), but getNodeNumbered 
> ignores that. As a result, getNodeNumbered is not always correct.
> Repro:
> {code}
> val df = sql("select * from range(10) where id not in " +
>   "(select id from range(2) union all select id from range(2))")
> println("---")
> println(df.queryExecution.analyzed.numberedTreeString)
> println("---")
> println("---")
> println(df.queryExecution.analyzed(3))
> println("---")
> {code}
> Output looks like
> {noformat}
> ---
> 00 Project [id#1L]
> 01 +- Filter NOT predicate-subquery#0 [(id#1L = id#2L)]
> 02:  +- Union
> 03: :- Project [id#2L]
> 04: :  +- Range (0, 2, step=1, splits=None)
> 05: +- Project [id#3L]
> 06:+- Range (0, 2, step=1, splits=None)
> 07+- Range (0, 10, step=1, splits=None)
> ---
> ---
> null
> ---
> {noformat}
> Note that 3 should be the Project node, but getNodeNumbered ignores 
> innerChild and as a result returns the wrong one.






[jira] [Updated] (SPARK-18854) getNodeNumbered and generateTreeString are not consistent

2016-12-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18854:

Description: 
This is a bug introduced by subquery handling. generateTreeString numbers trees 
including innerChildren (used to print subqueries), but getNodeNumbered ignores 
that. As a result, getNodeNumbered is not always correct.

Repro:

{code}
val df = sql("select * from range(10) where id not in " +
  "(select id from range(2) union all select id from range(2))")

println("---")
println(df.queryExecution.analyzed.numberedTreeString)
println("---")

println("---")
println(df.queryExecution.analyzed(3))
println("---")
{code}

Output looks like

{noformat}
---
00 Project [id#1L]
01 +- Filter NOT predicate-subquery#0 [(id#1L = id#2L)]
02:  +- Union
03: :- Project [id#2L]
04: :  +- Range (0, 2, step=1, splits=None)
05: +- Project [id#3L]
06:+- Range (0, 2, step=1, splits=None)
07+- Range (0, 10, step=1, splits=None)
---
---
null
---
{noformat}

Note that 3 should be the Project node, but getNodeNumbered ignores innerChild 
and as a result returns the wrong one.

  was:
This is a bug introduced by subquery handling. generateTreeString numbers trees 
including innerChildren (used to print subqueries), but getNodeNumbered ignores 
that. As a result, getNodeNumbered is not always correct.

Repro:

{code}
val df = sql("select * from range(10) where id not in " +
  "(select id from range(2) union all select id from range(2))")

println("---")
println(df.queryExecution.analyzed.numberedTreeString)
println("---")

println("---")
println(df.queryExecution.analyzed(3))
println("---")
{code}

Output looks like

{noformat}
---
00 Project [id#1L]
01 +- Filter NOT predicate-subquery#0 [(id#1L = id#2L)]
02:  +- Union
03: :- Project [id#2L]
04: :  +- Range (0, 2, step=1, splits=None)
05: +- Project [id#3L]
06:+- Range (0, 2, step=1, splits=None)
07+- Range (0, 10, step=1, splits=None)
---
---
null
---
{noformat}

Note that 3 should be the Project node, but 


> getNodeNumbered and generateTreeString are not consistent
> -
>
> Key: SPARK-18854
> URL: https://issues.apache.org/jira/browse/SPARK-18854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> This is a bug introduced by subquery handling. generateTreeString numbers 
> trees including innerChildren (used to print subqueries), but getNodeNumbered 
> ignores that. As a result, getNodeNumbered is not always correct.
> Repro:
> {code}
> val df = sql("select * from range(10) where id not in " +
>   "(select id from range(2) union all select id from range(2))")
> println("---")
> println(df.queryExecution.analyzed.numberedTreeString)
> println("---")
> println("---")
> println(df.queryExecution.analyzed(3))
> println("---")
> {code}
> Output looks like
> {noformat}
> ---
> 00 Project [id#1L]
> 01 +- Filter NOT predicate-subquery#0 [(id#1L = id#2L)]
> 02:  +- Union
> 03: :- Project [id#2L]
> 04: :  +- Range (0, 2, step=1, splits=None)
> 05: +- Project [id#3L]
> 06:+- Range (0, 2, step=1, splits=None)
> 07+- Range (0, 10, step=1, splits=None)
> ---
> ---
> null
> ---
> {noformat}
> Note that 3 should be the Project node, but getNodeNumbered ignores 
> innerChild and as a result returns the wrong one.





[jira] [Updated] (SPARK-18854) getNodeNumbered and generateTreeString are not consistent

2016-12-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18854:

Description: 
This is a bug introduced by subquery handling. generateTreeString numbers trees 
including innerChildren (used to print subqueries), but getNodeNumbered ignores 
that. As a result, getNodeNumbered is not always correct.

Repro:

{code}
val df = sql("select * from range(10) where id not in " +
  "(select id from range(2) union all select id from range(2))")

println("---")
println(df.queryExecution.analyzed.numberedTreeString)
println("---")

println("---")
println(df.queryExecution.analyzed(3))
println("---")
{code}

Output looks like

{noformat}
---
00 Project [id#1L]
01 +- Filter NOT predicate-subquery#0 [(id#1L = id#2L)]
02:  +- Union
03: :- Project [id#2L]
04: :  +- Range (0, 2, step=1, splits=None)
05: +- Project [id#3L]
06:+- Range (0, 2, step=1, splits=None)
07+- Range (0, 10, step=1, splits=None)
---
---
null
---
{noformat}

Note that 3 should be the Project node, but 

  was:
This is a bug introduced by subquery handling. generateTreeString numbers trees 
including innerChildren (used to print subqueries), but getNodeNumbered ignores 
that. As a result, getNodeNumbered is not always correct.




> getNodeNumbered and generateTreeString are not consistent
> -
>
> Key: SPARK-18854
> URL: https://issues.apache.org/jira/browse/SPARK-18854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> This is a bug introduced by subquery handling. generateTreeString numbers 
> trees including innerChildren (used to print subqueries), but getNodeNumbered 
> ignores that. As a result, getNodeNumbered is not always correct.
> Repro:
> {code}
> val df = sql("select * from range(10) where id not in " +
>   "(select id from range(2) union all select id from range(2))")
> println("---")
> println(df.queryExecution.analyzed.numberedTreeString)
> println("---")
> println("---")
> println(df.queryExecution.analyzed(3))
> println("---")
> {code}
> Output looks like
> {noformat}
> ---
> 00 Project [id#1L]
> 01 +- Filter NOT predicate-subquery#0 [(id#1L = id#2L)]
> 02:  +- Union
> 03: :- Project [id#2L]
> 04: :  +- Range (0, 2, step=1, splits=None)
> 05: +- Project [id#3L]
> 06:+- Range (0, 2, step=1, splits=None)
> 07+- Range (0, 10, step=1, splits=None)
> ---
> ---
> null
> ---
> {noformat}
> Note that 3 should be the Project node, but 






[jira] [Commented] (SPARK-18854) getNodeNumbered and generateTreeString are not consistent

2016-12-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15747080#comment-15747080
 ] 

Reynold Xin commented on SPARK-18854:
-

To test this, introduce a subquery and call df.numberedTreeString() and then 
df(i).


> getNodeNumbered and generateTreeString are not consistent
> -
>
> Key: SPARK-18854
> URL: https://issues.apache.org/jira/browse/SPARK-18854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> This is a bug introduced by subquery handling. generateTreeString numbers 
> trees including innerChildren (used to print subqueries), but getNodeNumbered 
> ignores that. As a result, getNodeNumbered is not always correct.






[jira] [Updated] (SPARK-18854) getNodeNumbered and generateTreeString are not consistent

2016-12-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18854:

Description: 
This is a bug introduced by subquery handling. generateTreeString numbers trees 
including innerChildren (used to print subqueries), but getNodeNumbered ignores 
that. As a result, getNodeNumbered is not always correct.



  was:
This is a bug introduced by subquery handling. generateTreeString numbers trees 
including innerChildren (used to print subqueries), but getNodeNumbered ignores 
that. As a result, getNodeNumbered(x) is not always correct.




> getNodeNumbered and generateTreeString are not consistent
> -
>
> Key: SPARK-18854
> URL: https://issues.apache.org/jira/browse/SPARK-18854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> This is a bug introduced by subquery handling. generateTreeString numbers 
> trees including innerChildren (used to print subqueries), but getNodeNumbered 
> ignores that. As a result, getNodeNumbered is not always correct.






[jira] [Created] (SPARK-18854) getNodeNumbered and generateTreeString are not consistent

2016-12-13 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18854:
---

 Summary: getNodeNumbered and generateTreeString are not consistent
 Key: SPARK-18854
 URL: https://issues.apache.org/jira/browse/SPARK-18854
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin


This is a bug introduced by subquery handling. generateTreeString numbers trees 
including innerChildren (used to print subqueries), but getNodeNumbered ignores 
that. As a result, getNodeNumbered(x) is not always correct.








[jira] [Commented] (SPARK-18854) getNodeNumbered and generateTreeString are not consistent

2016-12-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15747078#comment-15747078
 ] 

Reynold Xin commented on SPARK-18854:
-

cc [~smilegator]

> getNodeNumbered and generateTreeString are not consistent
> -
>
> Key: SPARK-18854
> URL: https://issues.apache.org/jira/browse/SPARK-18854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> This is a bug introduced by subquery handling. generateTreeString numbers 
> trees including innerChildren (used to print subqueries), but getNodeNumbered 
> ignores that. As a result, getNodeNumbered(x) is not always correct.






[jira] [Updated] (SPARK-18853) Project (UnaryNode) is way too aggressive in estimating statistics

2016-12-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18853:

Summary: Project (UnaryNode) is way too aggressive in estimating statistics 
  (was: Project is way too aggressive in estimating statistics )

> Project (UnaryNode) is way too aggressive in estimating statistics 
> ---
>
> Key: SPARK-18853
> URL: https://issues.apache.org/jira/browse/SPARK-18853
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> We currently define statistics in UnaryNode: 
> {code}
>   override def statistics: Statistics = {
> // There should be some overhead in Row object, the size should not be 
> zero when there is
> // no columns, this help to prevent divide-by-zero error.
> val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
> val outputRowSize = output.map(_.dataType.defaultSize).sum + 8
> // Assume there will be the same number of rows as child has.
> var sizeInBytes = (child.statistics.sizeInBytes * outputRowSize) / 
> childRowSize
> if (sizeInBytes == 0) {
>   // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be 
> zero
>   // (product of children).
>   sizeInBytes = 1
> }
> child.statistics.copy(sizeInBytes = sizeInBytes)
>   }
> {code}
> This has a few issues:
> 1. This can aggressively underestimate the size for Project. We assume each 
> array/map has 100 elements, which is an overestimate. If the user projects a 
> single field out of a deeply nested field, this would lead to huge 
> underestimation. A safer sane default is probably 2.
> 2. It is not a property of UnaryNode to propagate statistics this way. It 
> should be a property of Project.






[jira] [Created] (SPARK-18853) Project is way too aggressive in estimating statistics

2016-12-13 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18853:
---

 Summary: Project is way too aggressive in estimating statistics 
 Key: SPARK-18853
 URL: https://issues.apache.org/jira/browse/SPARK-18853
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin


We currently define statistics in UnaryNode: 

{code}
  override def statistics: Statistics = {
// There should be some overhead in Row object, the size should not be zero 
when there is
// no columns, this help to prevent divide-by-zero error.
val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
val outputRowSize = output.map(_.dataType.defaultSize).sum + 8
// Assume there will be the same number of rows as child has.
var sizeInBytes = (child.statistics.sizeInBytes * outputRowSize) / 
childRowSize
if (sizeInBytes == 0) {
  // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be 
zero
  // (product of children).
  sizeInBytes = 1
}

child.statistics.copy(sizeInBytes = sizeInBytes)
  }
{code}

This has a few issues:

1. This can aggressively underestimate the size for Project. We assume each 
array/map has 100 elements, which is an overestimate. If the user projects a 
single field out of a deeply nested field, this would lead to huge 
underestimation. A safer sane default is probably 2.

2. It is not a property of UnaryNode to propagate statistics this way. It 
should be a property of Project.










[jira] [Commented] (SPARK-18676) Spark 2.x query plan data size estimation can crash join queries versus 1.x

2016-12-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746414#comment-15746414
 ] 

Reynold Xin commented on SPARK-18676:
-

That's the other option I was considering. It'd be good to decouple the 
mechanism for data distribution (broadcast vs shuffle) from the join method (hash vs 
sort-merge). And both hash and sort-merge should work with data larger than 
memory.


> Spark 2.x query plan data size estimation can crash join queries versus 1.x
> ---
>
> Key: SPARK-18676
> URL: https://issues.apache.org/jira/browse/SPARK-18676
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Michael Allman
>
> Commit [c481bdf|https://github.com/apache/spark/commit/c481bdf] significantly 
> modified the way Spark SQL estimates the output data size of query plans. 
> I've found that, with the new table query partition pruning support in 2.1, 
> this has led in some cases to underestimation of join plan child size 
> statistics to a degree that makes executing such queries impossible without 
> disabling automatic broadcast conversion.
> In one case we debugged, the query planner had estimated the size of a join 
> child to be 3,854 bytes. In the execution of this child query, Spark reads 20 
> million rows in 1 GB of data from parquet files and shuffles 722.9 MB of 
> data, outputting 17 million rows. In planning the original join query, Spark 
> converts the child to a {{BroadcastExchange}}. This query execution fails 
> unless automatic broadcast conversion is disabled.
> This particular query is complex and very specific to our data and schema. I 
> have not yet developed a reproducible test case that can be shared. I realize 
> this ticket does not give the Spark team a lot to work with to reproduce and 
> test this issue, but I'm available to help. At the moment I can suggest 
> running a join where one side is an aggregation selecting a few fields over a 
> large table with a wide schema including many string columns.
> This issue exists in Spark 2.0, but we never encountered it because in that 
> version it only manifests itself for partitioned relations read from the 
> filesystem, and we rarely use this feature. We've encountered this issue in 
> 2.1 because 2.1 does partition pruning for metastore tables now.
> As a back stop, we've patched our branch of Spark 2.1 to revert the 
> reductions in default data type size for string, binary and user-defined 
> types. We also removed the override of the statistics method in {{UnaryNode}} 
> which reduces the output size of a plan based on the ratio of that plan's 
> output schema size versus its children's. We have not had this problem since.






[jira] [Commented] (SPARK-18676) Spark 2.x query plan data size estimation can crash join queries versus 1.x

2016-12-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15743362#comment-15743362
 ] 

Reynold Xin commented on SPARK-18676:
-

Can we just increase the size by 5X if it is a Parquet or ORC file?
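
A tiny sketch of what that could look like (the 5x constant and the format check 
are taken from the comment above as assumptions, not an agreed design):

{code}
// Hypothetical compression fudge factor applied only to columnar, compressed formats.
val columnarCompressionFactor = 5L

def estimatedSizeInBytes(onDiskBytes: Long, formatShortName: String): Long =
  formatShortName.toLowerCase match {
    case "parquet" | "orc" => onDiskBytes * columnarCompressionFactor
    case _                 => onDiskBytes
  }
{code}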


> Spark 2.x query plan data size estimation can crash join queries versus 1.x
> ---
>
> Key: SPARK-18676
> URL: https://issues.apache.org/jira/browse/SPARK-18676
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Michael Allman
>
> Commit [c481bdf|https://github.com/apache/spark/commit/c481bdf] significantly 
> modified the way Spark SQL estimates the output data size of query plans. 
> I've found that, with the new table query partition pruning support in 2.1, 
> this has led in some cases to underestimation of join plan child size 
> statistics to a degree that makes executing such queries impossible without 
> disabling automatic broadcast conversion.
> In one case we debugged, the query planner had estimated the size of a join 
> child to be 3,854 bytes. In the execution of this child query, Spark reads 20 
> million rows in 1 GB of data from parquet files and shuffles 722.9 MB of 
> data, outputting 17 million rows. In planning the original join query, Spark 
> converts the child to a {{BroadcastExchange}}. This query execution fails 
> unless automatic broadcast conversion is disabled.
> This particular query is complex and very specific to our data and schema. I 
> have not yet developed a reproducible test case that can be shared. I realize 
> this ticket does not give the Spark team a lot to work with to reproduce and 
> test this issue, but I'm available to help. At the moment I can suggest 
> running a join where one side is an aggregation selecting a few fields over a 
> large table with a wide schema including many string columns.
> This issue exists in Spark 2.0, but we never encountered it because in that 
> version it only manifests itself for partitioned relations read from the 
> filesystem, and we rarely use this feature. We've encountered this issue in 
> 2.1 because 2.1 does partition pruning for metastore tables now.
> As a back stop, we've patched our branch of Spark 2.1 to revert the 
> reductions in default data type size for string, binary and user-defined 
> types. We also removed the override of the statistics method in {{UnaryNode}} 
> which reduces the output size of a plan based on the ratio of that plan's 
> output schema size versus its children's. We have not had this problem since.






[jira] [Updated] (SPARK-18815) NPE when collecting column stats for string/binary column having only null values

2016-12-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18815:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-16026

> NPE when collecting column stats for string/binary column having only null 
> values
> -
>
> Key: SPARK-18815
> URL: https://issues.apache.org/jira/browse/SPARK-18815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.1.1, 2.2.0
>
>







[jira] [Resolved] (SPARK-18815) NPE when collecting column stats for string/binary column having only null values

2016-12-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18815.
-
   Resolution: Fixed
 Assignee: Zhenhua Wang
Fix Version/s: 2.2.0
   2.1.1

> NPE when collecting column stats for string/binary column having only null 
> values
> -
>
> Key: SPARK-18815
> URL: https://issues.apache.org/jira/browse/SPARK-18815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.1.1, 2.2.0
>
>







[jira] [Commented] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-09 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737074#comment-15737074
 ] 

Reynold Xin commented on SPARK-18814:
-

cc [~hvanhovell] and [~nsyca]

> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
> :  +- SubqueryAlias item
> : +- 
> Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
>  parquet
> +- SubqueryAlias date_dim
>+- 
> 

[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-12-08 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734033#comment-15734033
 ] 

Reynold Xin commented on SPARK-18278:
-

In the past few days I've given this a lot of thought.

I'm personally very interested in this work, and would actually use it myself. 
That said, based on my experience, the real work starts after the initial thing 
works, i.e. the maintenance and enhancement work in the future will be much 
larger than the initial commit. Adding another officially supported scheduler 
definitely has some serious (and maybe disruptive) impacts to Spark. Some 
examples are ...

1. Testing becomes more complicated.
2. Related to 1, releases become more likely to be delayed. In the past, many 
Spark releases were delayed due to bugs in the Mesos or YARN integrations, 
because those are harder to test reliably in an automated fashion.
3. The release process has to change.

Given that Kubernetes is still very young, and it is unclear how successful it will 
be in the future (I personally think it will be, but you never know), I would make 
the following concrete recommendations on moving this forward:

1. See if we can implement this as an add-on (library) outside Spark. If that is 
not possible, what about a fork?
2. Publish some non-official Docker images so it is easy to use Spark on 
Kubernetes this way.
3. Encourage users to use it and get feedback. Have the contributors who are 
really interested in this work maintain it for a couple of Spark releases (this 
includes testing the implementation, publishing new Docker images, and writing 
documentation).
4. Evaluate later (say, after 2 releases) how well this has been received and 
whether we should make a coordinated effort to merge this into Spark, since it 
might become the most popular cluster manager.



> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a Kubernetes cluster. The submitted application runs 
> in a driver executing on a Kubernetes pod, and the executors' lifecycles are 
> also managed as pods.






[jira] [Updated] (SPARK-18774) Ignore non-existing files when ignoreCorruptFiles is enabled

2016-12-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18774:

Fix Version/s: 2.1.1

> Ignore non-existing files when ignoreCorruptFiles is enabled
> 
>
> Key: SPARK-18774
> URL: https://issues.apache.org/jira/browse/SPARK-18774
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
>







[jira] [Resolved] (SPARK-18760) Provide consistent format output for all file formats

2016-12-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18760.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> Provide consistent format output for all file formats
> -
>
> Key: SPARK-18760
> URL: https://issues.apache.org/jira/browse/SPARK-18760
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.1, 2.2.0
>
>
> We currently rely on FileFormat implementations to override toString in order 
> to get a proper explain output. It'd be better to just depend on shortName 
> for those.
> Before:
> {noformat}
> scala> spark.read.text("test.text").explain()
> == Physical Plan ==
> *FileScan text [value#15] Batched: false, Format: 
> org.apache.spark.sql.execution.datasources.text.TextFileFormat@xyz, Location: 
> InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {noformat}
> After:
> {noformat}
> scala> spark.read.text("test.text").explain()
> == Physical Plan ==
> *FileScan text [value#15] Batched: false, Format: text, Location: 
> InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {noformat}
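
A minimal sketch of the idea (assuming the built-in file formats register a 
shortName via DataSourceRegister, as text and parquet do; an illustration, not 
the merged change):

{code}
// Prefer the registered short name over the class's default toString for explain output.
def formatDisplayName(format: FileFormat): String = format match {
  case named: DataSourceRegister => named.shortName()          // e.g. "text", "parquet"
  case other                     => other.getClass.getSimpleName
}
{code}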






[jira] [Updated] (SPARK-3359) `sbt/sbt unidoc` doesn't work with Java 8

2016-12-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3359:
---
Fix Version/s: (was: 2.1.1)
   2.1.0

> `sbt/sbt unidoc` doesn't work with Java 8
> -
>
> Key: SPARK-3359
> URL: https://issues.apache.org/jira/browse/SPARK-3359
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: errors.txt
>
>
> It seems that Java 8 is stricter on JavaDoc. I got many error messages like
> {code}
> [error] 
> /Users/meng/src/spark-mengxr/core/target/java/org/apache/hadoop/mapred/SparkHadoopMapRedUtil.java:2:
>  error: modifier private not allowed here
> [error] private abstract interface SparkHadoopMapRedUtil {
> [error]  ^
> {code}
> This is minor because we can always use Java 6/7 to generate the doc.






[jira] [Updated] (SPARK-18615) Switch to multi-line doc to avoid a genjavadoc bug for backticks

2016-12-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18615:

Fix Version/s: (was: 2.1.1)
   2.1.0

> Switch to multi-line doc to avoid a genjavadoc bug for backticks
> 
>
> Key: SPARK-18615
> URL: https://issues.apache.org/jira/browse/SPARK-18615
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.1.0
>
>
> I suspect this is related with SPARK-16153 and genjavadoc issue in 
> https://github.com/typesafehub/genjavadoc/issues/85 but I am not too sure.
> Currently, a single-line comment does not mark down backticks to 
> {{..}} but prints them as they are. For example, the line below:
> {code}
> /** Return an RDD with the pairs from `this` whose keys are not in `other`. */
> {code}
> So, we could work around this as below:
> {code}
> /**
>  * Return an RDD with the pairs from `this` whose keys are not in `other`.
>  */
> {code}
> Please refer the image in the pull request.






[jira] [Updated] (SPARK-18685) Fix all tests in ExecutorClassLoaderSuite to pass on Windows

2016-12-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18685:

Fix Version/s: (was: 2.1.1)
   2.1.0

> Fix all tests in ExecutorClassLoaderSuite to pass on Windows
> 
>
> Key: SPARK-18685
> URL: https://issues.apache.org/jira/browse/SPARK-18685
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Shell, Tests
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.0.3, 2.1.0
>
>
> There are two problems, described below: we should make the URI correct, and 
> the {{BufferedSource}} returned by {{Source.fromInputStream}} should be closed 
> after being opened in the tests in {{ExecutorClassLoaderSuite}}. Currently, 
> these lead to test failures on Windows.
> {code}
> ExecutorClassLoaderSuite:
> [info] - child first *** FAILED *** (78 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - parent first *** FAILED *** (15 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - child first can fall back *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - child first can fail *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - resource from parent *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> ...
> [info] - resources from parent *** FAILED *** (0 milliseconds)
> [info]   java.net.URISyntaxException: Illegal character in authority at index 
> 7: 
> file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
> [info]   at java.net.URI$Parser.fail(URI.java:2848)
> [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
> {code}
> {code}
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.repl.ExecutorClassLoaderSuite *** ABORTED *** (7 seconds, 
> 333 milliseconds)
> [info]   java.io.IOException: Failed to delete: 
> C:\projects\spark\target\tmp\spark-77b2f37b-6405-47c4-af1c-4a6a206511f2
> [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
> [info]   at 
> org.apache.spark.repl.ExecutorClassLoaderSuite.afterAll(ExecutorClassLoaderSuite.scala:76)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213)
> ...
> {code}
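
A minimal sketch of the URI half of the fix (illustrative only; the other half of 
the ticket, closing the BufferedSource, is not shown):

{code}
import java.io.File

// Concatenating "file://" with a Windows path yields an invalid authority ("C:")
// and the URISyntaxException above; building the URL from java.io.File does not.
val tempDir = new File("""C:\projects\spark\target\tmp\spark-00b66070""")
val badUrl  = "file://" + tempDir.getAbsolutePath   // file://C:\... -> URISyntaxException
val goodUrl = tempDir.toURI.toURL                   // file:/C:/projects/spark/... parses fine
{code}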






[jira] [Updated] (SPARK-18645) spark-daemon.sh arguments error lead to throws Unrecognized option

2016-12-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18645:

Fix Version/s: (was: 2.1.1)
   2.1.0

> spark-daemon.sh arguments error lead to throws Unrecognized option
> --
>
> Key: SPARK-18645
> URL: https://issues.apache.org/jira/browse/SPARK-18645
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
> Fix For: 2.1.0
>
>
> {{start-thriftserver.sh}} can reproduce this:
> {noformat}
> [root@dev spark]# ./sbin/start-thriftserver.sh --conf 
> 'spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:-HeapDumpOnOutOfMemoryError 
> -XX:HeapDumpPath=/tmp' 
> starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to 
> /tmp/spark-root-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-dev.out
> failed to launch nice -n 0 bash 
> /opt/cloudera/parcels/SPARK-2.1.0-cdh5.4.3.d20161129-21.04.38/lib/spark/bin/spark-submit
>  --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name 
> Thrift JDBC/ODBC Server --conf spark.driver.extraJavaOptions=-XX:+UseG1GC 
> -XX:-HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp:
>   Error starting HiveServer2 with given arguments: 
>   Unrecognized option: -XX:-HeapDumpOnOutOfMemoryError
> full log in 
> /tmp/spark-root-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-dev.out
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18762) Web UI should be http:4040 instead of https:4040

2016-12-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18762:

Fix Version/s: (was: 2.1.1)
   2.1.0

> Web UI should be http:4040 instead of https:4040
> 
>
> Key: SPARK-18762
> URL: https://issues.apache.org/jira/browse/SPARK-18762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Web UI
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Assignee: Kousuke Saruta
>Priority: Blocker
> Fix For: 2.0.3, 2.1.0
>
>
> When SSL is enabled, the Spark shell shows:
> {code}
> Spark context Web UI available at https://192.168.99.1:4040
> {code}
> This is wrong because 4040 is http, not https. It redirects to the https port.
> More importantly, this introduces several broken links in the UI. For 
> example, in the master UI, the worker link is https:8081 instead of http:8081 
> or https:8481.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18546) UnsafeShuffleWriter corrupts encrypted shuffle files when merging

2016-12-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18546:

Fix Version/s: (was: 2.1.1)
   2.1.0

> UnsafeShuffleWriter corrupts encrypted shuffle files when merging
> -
>
> Key: SPARK-18546
> URL: https://issues.apache.org/jira/browse/SPARK-18546
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Critical
> Fix For: 2.1.0
>
>
> The merging algorithm in {{UnsafeShuffleWriter}} does not consider 
> encryption, and when it tries to merge encrypted files the resulting data cannot 
> be read, since data encrypted with different initialization vectors is interleaved 
> in the same partition data. This leads to exceptions when trying to read the 
> files during shuffle:
> {noformat}
> com.esotericsoftware.kryo.KryoException: com.ning.compress.lzf.LZFException: 
> Corrupt input data, block did not start with 2 byte signature ('ZV') followed 
> by type byte, 2-byte length)
>   at com.esotericsoftware.kryo.io.Input.fill(Input.java:142)
>   at com.esotericsoftware.kryo.io.Input.require(Input.java:155)
>   at com.esotericsoftware.kryo.io.Input.readInt(Input.java:337)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:109)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
>   at 
> org.apache.spark.serializer.DeserializationStream.readKey(Serializer.scala:169)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.readNextItem(ExternalAppendOnlyMap.scala:512)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.hasNext(ExternalAppendOnlyMap.scala:533)
> ...
> {noformat}
> (This is from our internal branch, so don't worry if line numbers don't match exactly.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18774) Ignore non-existing files when ignoreCorruptFiles is enabled

2016-12-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18774:

Fix Version/s: (was: 2.1.1)
   2.2.0

> Ignore non-existing files when ignoreCorruptFiles is enabled
> 
>
> Key: SPARK-18774
> URL: https://issues.apache.org/jira/browse/SPARK-18774
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18774) Ignore non-existing files when ignoreCorruptFiles is enabled

2016-12-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18774.
-
   Resolution: Fixed
Fix Version/s: 2.1.1

> Ignore non-existing files when ignoreCorruptFiles is enabled
> 
>
> Key: SPARK-18774
> URL: https://issues.apache.org/jira/browse/SPARK-18774
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18745) java.lang.IndexOutOfBoundsException running query 68 Spark SQL on (100TB)

2016-12-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18745:

Target Version/s:   (was: 2.1.0)

> java.lang.IndexOutOfBoundsException running query 68 Spark SQL on (100TB)
> -
>
> Key: SPARK-18745
> URL: https://issues.apache.org/jira/browse/SPARK-18745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: JESSE CHEN
>Assignee: Kazuaki Ishizaki
>Priority: Critical
> Fix For: 2.1.0
>
>
> Running query 68 with decreased executor memory (using 12GB executors instead 
> of 24GB) on a 100TB parquet database using the Spark master dated 11/04 gave an 
> IndexOutOfBoundsException.
> The query is as follows:
> {noformat}
> [select  c_last_name
>,c_first_name
>,ca_city
>,bought_city
>,ss_ticket_number
>,extended_price
>,extended_tax
>,list_price
>  from (select ss_ticket_number
>  ,ss_customer_sk
>  ,ca_city bought_city
>  ,sum(ss_ext_sales_price) extended_price 
>  ,sum(ss_ext_list_price) list_price
>  ,sum(ss_ext_tax) extended_tax 
>from store_sales
>,date_dim
>,store
>,household_demographics
>,customer_address 
>where store_sales.ss_sold_date_sk = date_dim.d_date_sk
>  and store_sales.ss_store_sk = store.s_store_sk  
> and store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk
> and store_sales.ss_addr_sk = customer_address.ca_address_sk
> and date_dim.d_dom between 1 and 2 
> and (household_demographics.hd_dep_count = 8 or
>  household_demographics.hd_vehicle_count= -1)
> and date_dim.d_year in (2000,2000+1,2000+2)
> and store.s_city in ('Plainview','Rogers')
>group by ss_ticket_number
>,ss_customer_sk
>,ss_addr_sk,ca_city) dn
>   ,customer
>   ,customer_address current_addr
>  where ss_customer_sk = c_customer_sk
>and customer.c_current_addr_sk = current_addr.ca_address_sk
>and current_addr.ca_city <> bought_city
>  order by c_last_name
>  ,ss_ticket_number
>   limit 100]
> {noformat}
> Spark output that showed the exception:
> {noformat}
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at 
> org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:215)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123)
>   at 
> org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecuteBroadcast(Exchange.scala:61)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:36)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:68)
>   at 
> 

[jira] [Resolved] (SPARK-18654) JacksonParser.makeRootConverter has effectively unreachable code

2016-12-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18654.
-
   Resolution: Fixed
 Assignee: Nathan Howell
Fix Version/s: 2.2.0

> JacksonParser.makeRootConverter has effectively unreachable code
> 
>
> Key: SPARK-18654
> URL: https://issues.apache.org/jira/browse/SPARK-18654
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Nathan Howell
>Assignee: Nathan Howell
>Priority: Minor
> Fix For: 2.2.0
>
>
> {{JacksonParser.makeRootConverter}} currently takes a {{DataType}} but is 
> only called with a {{StructType}}. Revising the method to only accept a 
> {{StructType}} allows us to remove some pattern matches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18775) Limit the max number of records written per file

2016-12-07 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18775:
---

 Summary: Limit the max number of records written per file
 Key: SPARK-18775
 URL: https://issues.apache.org/jira/browse/SPARK-18775
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


Currently, Spark writes a single file out per task, sometimes leading to very 
large files. It would be great to have an option to limit the max number of 
records written per file in a task, to avoid humongous files.

This was initially suggested by [~simeons].
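
A minimal sketch of how such a cap could look on the write path; the option name 
"maxRecordsPerFile" below is an assumed name for the proposed knob, not an 
existing setting at the time of this ticket:
{code}
// Hypothetical usage of a per-file record cap on DataFrameWriter.
spark.range(1000L * 1000)
  .write
  .option("maxRecordsPerFile", 100000L)   // cap each output file at 100k records
  .mode("overwrite")
  .parquet("/tmp/limited-files")
{code}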



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18760) Provide consistent format output for all file formats

2016-12-06 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18760:
---

 Summary: Provide consistent format output for all file formats
 Key: SPARK-18760
 URL: https://issues.apache.org/jira/browse/SPARK-18760
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


We currently rely on FileFormat implementations to override toString in order 
to get a proper explain output. It'd be better to just depend on shortName for 
those.

Before:
{noformat}
scala> spark.read.text("test.text").explain()
== Physical Plan ==
*FileScan text [value#15] Batched: false, Format: 
org.apache.spark.sql.execution.datasources.text.TextFileFormat@xyz, Location: 
InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct
{noformat}

After:
{noformat}
scala> spark.read.text("test.text").explain()
== Physical Plan ==
*FileScan text [value#15] Batched: false, Format: text, Location: 
InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct
{noformat}





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11482) Maven repo in IsolatedClientLoader should be configurable.

2016-12-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-11482.
---
Resolution: Later

> Maven repo in IsolatedClientLoader should be configurable. 
> ---
>
> Key: SPARK-11482
> URL: https://issues.apache.org/jira/browse/SPARK-11482
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1
>Reporter: Doug Balog
>Priority: Minor
>
> The maven repo used to fetch the hive jars and dependencies is hard coded.
> A user should be able to override it via configuration. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7263) Add new shuffle manager which stores shuffle blocks in Parquet

2016-12-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-7263.
--
Resolution: Later

> Add new shuffle manager which stores shuffle blocks in Parquet
> --
>
> Key: SPARK-7263
> URL: https://issues.apache.org/jira/browse/SPARK-7263
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: Matt Massie
>
> I have a working prototype of this feature that can be viewed at
> https://github.com/apache/spark/compare/master...massie:parquet-shuffle?expand=1
> Setting the "spark.shuffle.manager" to "parquet" enables this shuffle manager.
> The dictionary support that Parquet provides appreciably reduces the amount of
> memory that objects use; however, once Parquet data is shuffled, all the
> dictionary information is lost and the column-oriented data is written to 
> shuffle
> blocks in a record-oriented fashion. This shuffle manager addresses this issue
> by reading and writing all shuffle blocks in the Parquet format.
> If shuffle objects are Avro records, then the Avro $SCHEMA is converted to a 
> Parquet schema and used directly; otherwise, the Parquet schema is generated 
> via reflection.
> Currently, the only non-Avro keys supported are primitive types. The reflection
> code can be improved (or replaced) to support complex records.
> The ParquetShufflePair class allows the shuffle key and value to be stored in
> Parquet blocks as a single record with a single schema.
> This commit adds the following new Spark configuration options:
> "spark.shuffle.parquet.compression" - sets the Parquet compression codec
> "spark.shuffle.parquet.blocksize" - sets the Parquet block size
> "spark.shuffle.parquet.pagesize" - set the Parquet page size
> "spark.shuffle.parquet.enabledictionary" - turns dictionary encoding on/off
> Parquet does not (and has no plans to) support a streaming API. Metadata 
> sections
> are scattered through a Parquet file making a streaming API difficult. As 
> such,
> the ShuffleBlockFetcherIterator has been modified to fetch the entire contents
> of map outputs into temporary blocks before loading the data into the reducer.
> Interesting future asides:
> o There is no need to define a data serializer (although Spark requires it)
> o Parquet support predicate pushdown and projection which could be used at
>   between shuffle stages to improve performance in the future
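
For illustration, enabling the prototype with the options listed above might look 
like the following; the codec and size values are assumptions, not defaults taken 
from the linked branch:
{code}
import org.apache.spark.SparkConf

// Enable the prototype shuffle manager and tune the Parquet knobs it exposes.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "parquet")
  .set("spark.shuffle.parquet.compression", "snappy")              // assumed codec value
  .set("spark.shuffle.parquet.blocksize", (128 * 1024 * 1024).toString)
  .set("spark.shuffle.parquet.pagesize", (1024 * 1024).toString)
  .set("spark.shuffle.parquet.enabledictionary", "true")
{code}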



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8398) Consistently expose Hadoop Configuration/JobConf parameters for Hadoop input/output formats

2016-12-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-8398.
--
Resolution: Later

> Consistently expose Hadoop Configuration/JobConf parameters for Hadoop 
> input/output formats
> ---
>
> Key: SPARK-8398
> URL: https://issues.apache.org/jira/browse/SPARK-8398
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: koert kuipers
>Priority: Trivial
>
> Currently a custom Hadoop Configuration or JobConf can be passed into quite a 
> few functions that use Hadoop input formats to read or Hadoop output formats 
> to write data. The goal of this JIRA is to make this consistent and expose 
> Configuration/JobConf for all these methods, which facilitates re-use and 
> discourages many additional parameters (that end up changing the 
> Configuration/JobConf internally). 
> See also:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-hadoop-input-output-format-advanced-control-td11168.html
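
As one example of the pattern the ticket wants applied consistently, 
SparkContext.newAPIHadoopFile already accepts a caller-supplied Configuration; 
the path and the setting below are illustrative only:
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Build a custom Hadoop Configuration once and pass it explicitly, rather than
// relying on extra method parameters that mutate the configuration internally.
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("mapreduce.input.fileinputformat.split.maxsize", (64 * 1024 * 1024).toString)

val rdd = sc.newAPIHadoopFile(
  "hdfs:///data/input",          // hypothetical path
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  hadoopConf)
{code}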



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16948) Use metastore schema instead of inferring schema for ORC in HiveMetastoreCatalog

2016-12-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16948.
-
   Resolution: Fixed
 Assignee: Eric Liang
Fix Version/s: 2.1.0

> Use metastore schema instead of inferring schema for ORC in 
> HiveMetastoreCatalog
> 
>
> Key: SPARK-16948
> URL: https://issues.apache.org/jira/browse/SPARK-16948
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> Querying empty partitioned ORC tables from spark-sql throws exception with 
> "spark.sql.hive.convertMetastoreOrc=true".
> {noformat}
> java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:347)
> at scala.None$.get(Option.scala:345)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:297)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:284)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToLogicalRelation(HiveMetastoreCatalog.scala:284)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$.org$apache$spark$sql$hive$HiveMetastoreCatalog$OrcConversions$$convertToOrcRelation(HiveMetastoreCatalo)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:423)
> at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:414)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18681) Throw Filtering is supported only on partition keys of type string exception

2016-12-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18681:

Target Version/s: 2.1.0

> Throw Filtering is supported only on partition keys of type string exception
> 
>
> Key: SPARK-18681
> URL: https://issues.apache.org/jira/browse/SPARK-18681
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>
> Cloudera put 
> {{/var/run/cloudera-scm-agent/process/15000-hive-HIVEMETASTORE/hive-site.xml}}
>  as the configuration file for the Hive Metastore Server, where 
> {{hive.metastore.try.direct.sql=false}}. But Spark isn't reading this 
> configuration file and gets the default value 
> {{hive.metastore.try.direct.sql=true}}. We should use the {{getMetaConf}} or 
> {{getMSC.getConfigValue}} method to obtain the original configuration from 
> the Hive Metastore Server.
> {noformat}
> spark-sql> CREATE TABLE test (value INT) PARTITIONED BY (part INT);
> Time taken: 0.221 seconds
> spark-sql> select * from test where part=1 limit 10;
> 16/12/02 08:33:45 ERROR thriftserver.SparkSQLDriver: Failed in [select * from 
> test where part=1 limit 10]
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:610)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:549)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:547)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:547)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:954)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:938)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:938)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:156)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2435)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:295)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:134)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:133)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at 
> 

[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-12-06 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15726694#comment-15726694
 ] 

Reynold Xin commented on SPARK-18209:
-

I took a look at the change quickly and here are my high level thoughts: we 
should break down the change into multiple, smaller pull requests. Here's one 
way to break it down:

PR1: Introduce the concept of an AnalysisContext to the analyzer. The 
AnalysisContext should contain information needed for analysis. For now, it 
will have only one piece of information: the current database for a subtree. At 
the beginning of analysis, the current database is set to the session-local 
current database. When the analyzer descends down the tree and finds a view 
node, the current database can be set to the view's database.

This way, we decouple the concern of the analysis-time database environment from 
the catalog.

PR2: Implement the read side of the new view resolution

PR3: Implement the write side of the new view resolution.
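
A rough Scala sketch of what the PR1 AnalysisContext could look like; the names 
and the thread-local mechanism are assumptions for illustration, not a design 
mandated by this comment:
{code}
// Hypothetical analysis-scoped context: it carries the current database for the
// subtree being resolved, and a thread-local holder lets the analyzer swap the
// database in while it resolves a view and restore it afterwards.
case class AnalysisContext(defaultDatabase: Option[String] = None)

object AnalysisContext {
  private val contexts = new ThreadLocal[AnalysisContext] {
    override def initialValue(): AnalysisContext = AnalysisContext()
  }

  def get: AnalysisContext = contexts.get()

  // Run `body` with the view's database as the current database, then restore.
  def withDatabase[A](database: Option[String])(body: => A): A = {
    val original = contexts.get()
    contexts.set(original.copy(defaultDatabase = database))
    try body finally contexts.set(original)
  }
}
{code}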



> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combination of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason 
> broadcast join hint has taken forever to be merged because it is very 
> difficult to guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the context for the database as well as star expansion, I think we can do this 
> through a simpler approach: take the user-given SQL, analyze it, and just wrap 
> the original SQL with a SELECT clause at the outer level, storing the database 
> as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> During parsing time, we expand the view using the provided database 
> context.
> (We don't need to follow exactly the same hint, as I'm merely illustrating 
> the high level approach here.)
> Note that there is a chance that the underlying base table(s)' schema changes 
> and the stored schema of the view might differ from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. 
> This exception can be controlled by a flag.
> Update 1: based on the discussion below, we don't even need to put the view 
> definition in a sub query. We can just add it via a logical plan at the end.
> Update 2: we should make sure permanent views do not depend on temporary 
> objects (views, tables, or functions).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-12-06 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15726694#comment-15726694
 ] 

Reynold Xin edited comment on SPARK-18209 at 12/6/16 9:01 PM:
--

I took a look at the change quickly and here are my high level thoughts: we 
should break down the change into multiple, smaller pull requests. Here's one 
way to break it down:

PR1: Introduce the concept of an AnalysisContext to the analyzer. The 
AnalysisContext should contain information needed for analysis. For now, it 
will have only one piece of information: the current database for a subtree. At 
the beginning of analysis, the current database is set to the session-local 
current database. When the analyzer descends down the tree and finds a view 
node, the current database can be set to the view's database.

This way, we decouple the concern of the analysis-time database environment from 
the catalog.

PR2: Implement the read side of the new view resolution

PR3: Implement the write side of the new view resolution.

PR4 - PRn: Some small incremental improvements, e.g. limit view reference depth 
to 32.




was (Author: rxin):
I took a look at the change quickly and here are my high level thoughts: we 
should break down the change into multiple, smaller pull requests. Here's one 
way to break it down:

PR1: Introduce the concept of an AnalysisContext to the analyzer. The 
AnalysisContext should contain information needed for analysis. For now, it 
will have only one piece of information: the current database for a subtree. At 
the beginning of analysis, the current database is set to the session-local 
current database. When the analyzer descends down the tree and finds a view 
node, the current database can be set to the view's database.

This way, we decouple the concern of the analysis-time database environment from 
the catalog.

PR2: Implement the read side of the new view resolution

PR3: Implement the write side of the new view resolution.



> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combination of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason 
> broadcast join hint has taken forever to be merged because it is very 
> difficult to guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the context for the database as well as star expansion, I think we can do this 
> through a simpler approach: take the user-given SQL, analyze it, and just wrap 
> the original SQL with a SELECT clause at the outer level, storing the database 
> as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> During parsing time, we expand the view using the provided database 
> context.
> (We don't need to follow exactly the same hint, as I'm merely illustrating 
> the high level approach here.)
> Note that there is a chance that the underlying base table(s)' schema changes 
> and the stored schema of the view might differ from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. 
> This exception can be controlled by a flag.
> Update 1: based on the discussion below, we don't even need to put the view 
> definition in a sub query. We can just add it via a logical plan at the end.
> Update 2: we should make sure permanent views do not depend on temporary 
> objects (views, tables, or functions).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18555) na.fill miss up original values in long integers

2016-12-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18555.
-
   Resolution: Fixed
 Assignee: Song Jun
Fix Version/s: 2.2.0

> na.fill miss up original values in long integers
> 
>
> Key: SPARK-18555
> URL: https://issues.apache.org/jira/browse/SPARK-18555
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Mahmoud Rawas
>Assignee: Song Jun
>Priority: Critical
> Fix For: 2.2.0
>
>
> Mainly the issue is clarified in the following example:
> Given a Dataset: 
> scala> data.show
> |  a|  b|
> |  1|  2|
> | -1| -2|
> |9123146099426677101|9123146560113991650|
> Theoretically, when we call na.fill(0) nothing should change, but the 
> current result is:
> scala> data.na.fill(0).show
> |  a|  b|
> |  1|  2|
> | -1| -2|
> |9123146099426676736|9123146560113991680|
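
The corrupted values are consistent with the fill path round-tripping long 
columns through Double, which cannot represent integers above 2^53 exactly; a 
small Scala illustration of that assumption:
{code}
// Round-tripping the long through Double reproduces the corrupted value shown
// above: a Double has only 53 bits of mantissa, so nearby longs collapse onto
// the same representable value.
val original = 9123146099426677101L
val roundTripped = original.toDouble.toLong
// original     == 9123146099426677101
// roundTripped == 9123146099426676736  (matches the na.fill(0) output above)
{code}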



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18284) Scheme of DataFrame generated from RDD is diffrent between master and 2.0

2016-12-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18284:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Scheme of DataFrame generated from RDD is diffrent between master and 2.0
> -
>
> Key: SPARK-18284
> URL: https://issues.apache.org/jira/browse/SPARK-18284
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.0
>
>
> When the following program is executed, the schema of the DataFrame is 
> different among master, branch 2.0, and branch 2.1. The nullable property 
> should be false.
> {code:java}
> val df = sparkContext.parallelize(1 to 8, 1).toDF()
> df.printSchema
> df.filter("value > 4").count
> === master ===
> root
>  |-- value: integer (nullable = true)
> === branch 2.1 ===
> root
>  |-- value: integer (nullable = true)
> === branch 2.0 ===
> root
>  |-- value: integer (nullable = false)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18284) Scheme of DataFrame generated from RDD is different between master and 2.0

2016-12-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18284:

Summary: Scheme of DataFrame generated from RDD is different between master 
and 2.0  (was: Scheme of DataFrame generated from RDD is diffrent between 
master and 2.0)

> Scheme of DataFrame generated from RDD is different between master and 2.0
> --
>
> Key: SPARK-18284
> URL: https://issues.apache.org/jira/browse/SPARK-18284
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.0
>
>
> When the following program is executed, the schema of the DataFrame is 
> different among master, branch 2.0, and branch 2.1. The nullable property 
> should be false.
> {code:java}
> val df = sparkContext.parallelize(1 to 8, 1).toDF()
> df.printSchema
> df.filter("value > 4").count
> === master ===
> root
>  |-- value: integer (nullable = true)
> === branch 2.1 ===
> root
>  |-- value: integer (nullable = true)
> === branch 2.0 ===
> root
>  |-- value: integer (nullable = false)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-04 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15721488#comment-15721488
 ] 

Reynold Xin commented on SPARK-18539:
-

Why don't we fix the parquet reader so it can tolerate non-existent columns?


> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   

[jira] [Created] (SPARK-18714) Add a simple time function to SparkSession

2016-12-04 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18714:
---

 Summary: Add a simple time function to SparkSession
 Key: SPARK-18714
 URL: https://issues.apache.org/jira/browse/SPARK-18714
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


Many Spark developers want to measure the runtime of some function during 
interactive debugging and testing. It'd be really useful to have a simple 
spark.time method that can report the runtime.
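
A minimal sketch of what such a timer could look like; the body below is an 
illustration of the idea, not the merged SparkSession.time implementation:
{code}
// Hypothetical timer helper: run a block, print the elapsed wall-clock time,
// and return the block's result so it can still be used in the shell.
def time[T](f: => T): T = {
  val start = System.nanoTime()
  val result = f
  val elapsedMs = (System.nanoTime() - start) / 1000000
  println(s"Time taken: $elapsedMs ms")
  result
}

// e.g. time { spark.range(1000L * 1000 * 1000).count() }
{code}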




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18714) SparkSession.time - a simple timer function

2016-12-04 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18714:

Summary: SparkSession.time - a simple timer function  (was: Add a simple 
time function to SparkSession)

> SparkSession.time - a simple timer function
> ---
>
> Key: SPARK-18714
> URL: https://issues.apache.org/jira/browse/SPARK-18714
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Many Spark developers want to measure the runtime of some function during 
> interactive debugging and testing. It'd be really useful to have a simple 
> spark.time method that can report the runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18702) input_file_block_start and input_file_block_length function

2016-12-04 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18702.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> input_file_block_start and input_file_block_length function
> ---
>
> Key: SPARK-18702
> URL: https://issues.apache.org/jira/browse/SPARK-18702
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.2.0
>
>
> We currently have function input_file_name to get the path of the input file, 
> but don't have functions to get the block start offset and length. This patch 
> introduces two functions:
> 1. input_file_block_start: returns the file block start offset, or -1 if not 
> available.
> 2. input_file_block_length: returns the file block length, or -1 if not 
> available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18702) input_file_block_start and input_file_block_length function

2016-12-03 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18702:
---

 Summary: input_file_block_start and input_file_block_length 
function
 Key: SPARK-18702
 URL: https://issues.apache.org/jira/browse/SPARK-18702
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


We currently have function input_file_name to get the path of the input file, 
but don't have functions to get the block start offset and length. This patch 
introduces two functions:

1. input_file_block_start: returns the file block start offset, or -1 if not 
available.

2. input_file_block_length: returns the file block length, or -1 if not 
available.
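
Assuming the two functions are registered as SQL functions alongside 
input_file_name, usage would look roughly like this (the input path is 
hypothetical):
{code}
// Query the file name plus the block offset/length for each input row via
// SQL expressions, mirroring how input_file_name is typically used.
val df = spark.read.parquet("/tmp/some-table")
df.selectExpr(
    "input_file_name()",
    "input_file_block_start()",
    "input_file_block_length()")
  .show(truncate = false)
{code}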



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames

2016-12-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15718790#comment-15718790
 ] 

Reynold Xin commented on SPARK-8007:


spark_partition_id() is available in PySpark starting in 1.6. It's in 
pyspark.sql.functions.spark_partition_id.
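
For reference, the same check works from Scala, where the function is exposed in 
org.apache.spark.sql.functions; the range DataFrame below just stands in for a 
real dataset:
{code}
import org.apache.spark.sql.functions.spark_partition_id

// Count rows per physical partition to spot data skew; here an 8-partition
// range is used as a stand-in DataFrame.
val df = spark.range(0L, 1000L, 1L, 8).toDF()
df.groupBy(spark_partition_id()).count().show()
{code}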


> Support resolving virtual columns in DataFrames
> ---
>
> Key: SPARK-8007
> URL: https://issues.apache.org/jira/browse/SPARK-8007
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Joseph Batchik
>
> Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to 
> SparkPartitionID expression.
> A cool use case is to understand physical data skew:
> {code}
> df.groupBy("SPARK__PARTITION__ID").count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18362) Use TextFileFormat in implementation of CSVFileFormat

2016-12-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18362.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Use TextFileFormat in implementation of CSVFileFormat
> -
>
> Key: SPARK-18362
> URL: https://issues.apache.org/jira/browse/SPARK-18362
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.2.0
>
>
> Spark's CSVFileFormat data source uses inefficient methods for reading files 
> during schema inference and does not benefit from file listing / IO 
> performance improvements made in Spark 2.0. In order to fix this performance 
> problem, we should re-implement those read paths in terms of TextFileFormat.
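
A rough sketch of the direction (not the actual patch): route the raw line reads 
for schema inference through the text datasource, which already benefits from the 
parallel file listing and IO improvements; the path and sample size below are 
illustrative:
{code}
import org.apache.spark.sql.Dataset

// Read raw lines through the text datasource, then parse a sample as CSV for
// schema inference instead of using the older per-file read path.
val lines: Dataset[String] = spark.read.textFile("/tmp/data/*.csv")  // hypothetical path
val header = lines.first().split(",")        // assumes the files carry a header line
val sample = lines.take(1000)                // rows to feed the CSV type inference
{code}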



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18695) Bump master branch version to 2.2.0-SNAPSHOT

2016-12-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18695.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Bump master branch version to 2.2.0-SNAPSHOT
> 
>
> Key: SPARK-18695
> URL: https://issues.apache.org/jira/browse/SPARK-18695
> Project: Spark
>  Issue Type: Task
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-12-02 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15717149#comment-15717149
 ] 

Reynold Xin commented on SPARK-18278:
-

Is there a way to get this working without the project having to publish docker 
images?

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a Kubernetes cluster. The submitted application runs 
> in a driver executing on a Kubernetes pod, and executor lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18695) Bump master branch version to 2.2.0-SNAPSHOT

2016-12-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18695:

Summary: Bump master branch version to 2.2.0-SNAPSHOT  (was: Bump master 
branch version to 2.2.0)

> Bump master branch version to 2.2.0-SNAPSHOT
> 
>
> Key: SPARK-18695
> URL: https://issues.apache.org/jira/browse/SPARK-18695
> Project: Spark
>  Issue Type: Task
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18695) Bump master branch version to 2.2.0

2016-12-02 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18695:
---

 Summary: Bump master branch version to 2.2.0
 Key: SPARK-18695
 URL: https://issues.apache.org/jira/browse/SPARK-18695
 Project: Spark
  Issue Type: Task
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18690) Backward compatibility of unbounded frames

2016-12-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18690.
-
   Resolution: Fixed
 Assignee: Maciej Szymkiewicz
Fix Version/s: 2.1.0

> Backward compatibility of unbounded frames
> --
>
> Key: SPARK-18690
> URL: https://issues.apache.org/jira/browse/SPARK-18690
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 2.1.0
>
>
> SPARK-17845 introduced constant values to mark unbounded frames. This can 
> break backward compatibility on some systems:
> In Spark <= 2.0:
> -  {{UNBOUNDED PRECEDING}} is {{-sys.maxsize}}
> -  {{UNBOUNDED FOLLOWING}} is {{sys.maxsize}}
> On 64-bit systems {{sys.maxsize}} is typically equal to (1 << 63) - 1, on 
> 32-bit systems (1 << 31) - 1 
> (https://docs.python.org/3/library/sys.html#sys.maxsize).
> After SPARK-17845 these values are
> -  {{UNBOUNDED PRECEDING}} is -(1 << 63)
> -  {{UNBOUNDED FOLLOWING}} is (1 << 63) - 1
> As a result, on many systems existing code will no longer use the UNBOUNDED 
> PRECEDING frame.
> We can use the following values to ensure backward compatibility:
> - {{UNBOUNDED PRECEDING}} =  {{max(-sys.maxsize, _JAVA_MIN_LONG)}}
> - {{UNBOUNDED FOLLOWING}} =  {{min(sys.maxsize, _JAVA_MAX_LONG)}}
> Pros:
> - Prevents hard-to-spot errors in the user code.
> Cons:
> - Unnecessarily complicated rules in the Spark code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11705) Eliminate unnecessary Cartesian Join

2016-12-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-11705.
---
Resolution: Cannot Reproduce

> Eliminate unnecessary Cartesian Join
> 
>
> Key: SPARK-11705
> URL: https://issues.apache.org/jira/browse/SPARK-11705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Zhan Zhang
>
> When we have some queries similar to the following (I don't remember the exact 
> form):
> select * from a, b, c, d where a.key1 = c.key1 and b.key2 = c.key2 and c.key3 
> = d.key3
> There will be a cartesian join between a and b. But if we just simply change 
> the table order, for example to a, c, b, d, the cartesian join is 
> eliminated.
> Without such manual tuning, the query will never finish if a, b are big. But 
> we should not rely on such manual optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16845:

Component/s: (was: Java API)

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: hejie
> Attachments: error.txt.zip
>
>
> I have a wide table (400 columns); when I try fitting the training data on all 
> columns, the following fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18661) Creating a partitioned datasource table should not scan all files for table

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18661:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-17861

> Creating a partitioned datasource table should not scan all files for table
> ---
>
> Key: SPARK-18661
> URL: https://issues.apache.org/jira/browse/SPARK-18661
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> Even though in 2.1 creating a partitioned datasource table will not populate 
> the partition data by default (until the user issues MSCK REPAIR TABLE), it 
> seems we still scan the filesystem for no good reason.
> We should avoid doing this when the user specifies a schema.
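
For context, a minimal sketch of the scenario described above, assuming a SparkSession named {{spark}}; the table name, columns, and path are placeholders, not from the ticket. Since the schema and partition columns are given explicitly, creating the table should not require listing the files under the path:

{code}
// Hypothetical example; table definition and path are made up for illustration.
spark.sql("""
  CREATE TABLE events (id BIGINT, day STRING)
  USING parquet
  OPTIONS (path '/data/events')
  PARTITIONED BY (day)
""")
// Partition discovery can then be triggered explicitly later, as noted above:
spark.sql("MSCK REPAIR TABLE events")
{code}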



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18679) Regression in file listing performance

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18679:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-17861

> Regression in file listing performance
> --
>
> Key: SPARK-18679
> URL: https://issues.apache.org/jira/browse/SPARK-18679
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to 
> InMemoryFileIndex).
> It seems there is a performance regression here where we no longer perform 
> listing in parallel for non-root directories. This forces file listing to be 
> completely serial when resolving datasource tables that are not backed by an 
> external catalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18659:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-17861

> Incorrect behaviors in overwrite table for datasource tables
> 
>
> Key: SPARK-18659
> URL: https://issues.apache.org/jira/browse/SPARK-18659
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> The first three test cases fail due to a crash in the Hive client when 
> dropping partitions that don't contain files. The last one deletes too many 
> files due to a partition case resolution failure.
> {code}
>   test("foo") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test select id, id, 'x' from 
> range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("bar") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (a, b) select id, id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("baz") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (A, B) select id, id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("qux") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (a=1, b) select id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 10)
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18640) Fix minor synchronization issue in TaskSchedulerImpl.runningTasksByExecutors

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18640.
-
  Resolution: Fixed
   Fix Version/s: 2.1.0
  2.0.3
Target Version/s:   (was: 1.6.4, 2.0.3, 2.1.0, 2.2.0)

> Fix minor synchronization issue in TaskSchedulerImpl.runningTasksByExecutors
> 
>
> Key: SPARK-18640
> URL: https://issues.apache.org/jira/browse/SPARK-18640
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
> Fix For: 2.0.3, 2.1.0
>
>
> The method TaskSchedulerImpl.runningTasksByExecutors() accesses the mutable 
> executorIdToRunningTaskIds map without proper synchronization. We should fix 
> this.
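
For illustration only, a minimal sketch of the kind of guard the ticket asks for, not the actual TaskSchedulerImpl code: reads of the mutable map take the same lock that writers hold. The class and method bodies below are assumptions made for the sketch.

{code}
import scala.collection.mutable

// Simplified stand-in for TaskSchedulerImpl; names mirror the ticket, the class is hypothetical.
class SchedulerSketch {
  private val executorIdToRunningTaskIds = mutable.HashMap[String, mutable.HashSet[Long]]()

  // Writers (task launch/finish) synchronize on `this`.
  def taskLaunched(executorId: String, taskId: Long): Unit = synchronized {
    executorIdToRunningTaskIds.getOrElseUpdate(executorId, mutable.HashSet[Long]()) += taskId
  }

  // The read this ticket is about: snapshot the map while holding the same lock.
  def runningTasksByExecutors(): Map[String, Int] = synchronized {
    executorIdToRunningTaskIds.map { case (id, tasks) => id -> tasks.size }.toMap
  }
}
{code}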



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18640) Fix minor synchronization issue in TaskSchedulerImpl.runningTasksByExecutors

2016-12-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15714189#comment-15714189
 ] 

Reynold Xin commented on SPARK-18640:
-

[~andrewor14] how come you didn't close the ticket?


> Fix minor synchronization issue in TaskSchedulerImpl.runningTasksByExecutors
> 
>
> Key: SPARK-18640
> URL: https://issues.apache.org/jira/browse/SPARK-18640
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
> Fix For: 2.0.3, 2.1.0
>
>
> The method TaskSchedulerImpl.runningTasksByExecutors() accesses the mutable 
> executorIdToRunningTaskIds map without proper synchronization. We should fix 
> this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17213.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Parquet String Pushdown for Non-Eq Comparisons Broken
> -
>
> Key: SPARK-17213
> URL: https://issues.apache.org/jira/browse/SPARK-17213
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Andrew Duffy
>Assignee: Cheng Lian
> Fix For: 2.1.0
>
>
> Spark defines ordering over strings based on comparison of UTF8 byte arrays, 
> which compare bytes as unsigned integers. Currently, however, Parquet does not 
> respect this ordering. This is in the process of being fixed in Parquet (JIRA 
> and PR links below), but for now all filters over strings are broken, with an 
> actual correctness issue for {{>}} and {{<}}.
> *Repro:*
> Querying directly from in-memory DataFrame:
> {code}
> > Seq("a", "é").toDF("name").where("name > 'a'").count
> 1
> {code}
> Querying from a parquet dataset:
> {code}
> > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> > spark.read.parquet("/tmp/bad").where("name > 'a'").count
> 0
> {code}
> This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's 
> implementation of comparison of strings is based on signed byte array 
> comparison, so it will actually create 1 row group with statistics 
> {{min=é,max=a}}, and so the row group will be dropped by the query.
> Based on the way Parquet pushes down Eq, it will not affect correctness, but 
> it will force you to read row groups you should be able to skip.
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18658) Writing to a text DataSource buffers one or more lines in memory

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18658.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Writing to a text DataSource buffers one or more lines in memory
> 
>
> Key: SPARK-18658
> URL: https://issues.apache.org/jira/browse/SPARK-18658
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Nathan Howell
>Assignee: Nathan Howell
>Priority: Minor
> Fix For: 2.2.0
>
>
> The JSON and CSV writing paths buffer entire lines (or multiple lines) in 
> memory prior to writing to disk. For large rows this is inefficient. It may 
> make sense to skip the {{TextOutputFormat}} record writer and go directly to 
> the underlying {{FSDataOutputStream}}, allowing the writers to append 
> arbitrary byte arrays (fractions of a row) instead of a full row.
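
A rough sketch of the alternative described above: append row fragments straight to the Hadoop output stream instead of assembling whole lines in memory first. This is not the actual writer code; the path and fragment values are placeholders.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder path; in the real writer this would be the task's output file.
val fs = FileSystem.get(new Configuration())
val out = fs.create(new Path("/tmp/example-part-00000"))

// Append a fragment of a row without materializing the full line in memory.
def appendFragment(fragment: String): Unit = out.write(fragment.getBytes("UTF-8"))

appendFragment("{\"id\": 1, \"payload\": \"")   // start of a (potentially huge) row
appendFragment("... very large value ...")
appendFragment("\"}\n")                          // end of the row
out.close()
{code}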



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18663) Simplify CountMinSketch aggregate implementation

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18663.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Simplify CountMinSketch aggregate implementation
> 
>
> Key: SPARK-18663
> URL: https://issues.apache.org/jira/browse/SPARK-18663
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.2.0
>
>
> SPARK-18429 introduced a count-min sketch aggregate function for SQL, but the 
> implementation and testing are more complicated than needed. This simplifies 
> the test cases and removes support for data types that don't have clear 
> equality semantics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18639) Build only a single pip package

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18639.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Build only a single pip package
> ---
>
> Key: SPARK-18639
> URL: https://issues.apache.org/jira/browse/SPARK-18639
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> We currently build 5 separate pip binary tarballs, doubling the release script 
> runtime. It'd be better to build one, especially for use cases that are just 
> using Spark locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18617) Close "kryo auto pick" feature for Spark Streaming

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18617:

Fix Version/s: 2.0.3

> Close "kryo auto pick" feature for Spark Streaming
> --
>
> Key: SPARK-18617
> URL: https://issues.apache.org/jira/browse/SPARK-18617
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Genmao Yu
>Assignee: Genmao Yu
> Fix For: 2.0.3, 2.1.0
>
>
> [PR-15992| https://github.com/apache/spark/pull/15992] provided a solution to 
> fix the bug, i.e. {{receiver data can not be deserialized properly}}. As 
> [~zsxwing] said, it is a critical bug, but we should not break APIs between 
> maintenance releases. It may be a rational choice to close {{auto pick kryo 
> serializer}} for Spark Streaming in the first step.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18666) Remove the codes checking deprecated config spark.sql.unsafe.enabled

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18666.
-
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.1.0

> Remove the codes checking deprecated config spark.sql.unsafe.enabled
> 
>
> Key: SPARK-18666
> URL: https://issues.apache.org/jira/browse/SPARK-18666
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Trivial
> Fix For: 2.1.0
>
>
> spark.sql.unsafe.enabled is deprecated since 1.6. There still are codes in 
> Web UI to check it. We should remove it and clean the codes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18658) Writing to a text DataSource buffers one or more lines in memory

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18658:

Affects Version/s: (was: 2.0.2)

> Writing to a text DataSource buffers one or more lines in memory
> 
>
> Key: SPARK-18658
> URL: https://issues.apache.org/jira/browse/SPARK-18658
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Nathan Howell
>Assignee: Nathan Howell
>Priority: Minor
>
> The JSON and CSV writing paths buffer entire lines (or multiple lines) in 
> memory prior to writing to disk. For large rows this is inefficient. It may 
> make sense to skip the {{TextOutputFormat}} record writer and go directly to 
> the underlying {{FSDataOutputStream}}, allowing the writers to append 
> arbitrary byte arrays (fractions of a row) instead of a full row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18658) Writing to a text DataSource buffers one or more lines in memory

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18658:

Target Version/s: 2.2.0

> Writing to a text DataSource buffers one or more lines in memory
> 
>
> Key: SPARK-18658
> URL: https://issues.apache.org/jira/browse/SPARK-18658
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Nathan Howell
>Assignee: Nathan Howell
>Priority: Minor
>
> The JSON and CSV writing paths buffer entire lines (or multiple lines) in 
> memory prior to writing to disk. For large rows this is inefficient. It may 
> make sense to skip the {{TextOutputFormat}} record writer and go directly to 
> the underlying {{FSDataOutputStream}}, allowing the writers to append 
> arbitrary byte arrays (fractions of a row) instead of a full row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18658) Writing to a text DataSource buffers one or more lines in memory

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18658:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-18352

> Writing to a text DataSource buffers one or more lines in memory
> 
>
> Key: SPARK-18658
> URL: https://issues.apache.org/jira/browse/SPARK-18658
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Nathan Howell
>Priority: Minor
>
> The JSON and CSV writing paths buffer entire lines (or multiple lines) in 
> memory prior to writing to disk. For large rows this is inefficient. It may 
> make sense to skip the {{TextOutputFormat}} record writer and go directly to 
> the underlying {{FSDataOutputStream}}, allowing the writers to append 
> arbitrary byte arrays (fractions of a row) instead of a full row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18658) Writing to a text DataSource buffers one or more lines in memory

2016-12-01 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18658:

Assignee: Nathan Howell

> Writing to a text DataSource buffers one or more lines in memory
> 
>
> Key: SPARK-18658
> URL: https://issues.apache.org/jira/browse/SPARK-18658
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Nathan Howell
>Assignee: Nathan Howell
>Priority: Minor
>
> The JSON and CSV writing paths buffer entire lines (or multiple lines) in 
> memory prior to writing to disk. For large rows this is inefficient. It may 
> make sense to skip the {{TextOutputFormat}} record writer and go directly to 
> the underlying {{FSDataOutputStream}}, allowing the writers to append 
> arbitrary byte arrays (fractions of a row) instead of a full row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16026) Cost-based Optimizer framework

2016-11-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710618#comment-15710618
 ] 

Reynold Xin commented on SPARK-16026:
-

[~ZenWzh] can we start working on operator cardinality estimation propagation 
based on what's in the catalog right now?


> Cost-based Optimizer framework
> --
>
> Key: SPARK-16026
> URL: https://issues.apache.org/jira/browse/SPARK-16026
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
> Attachments: Spark_CBO_Design_Spec.pdf
>
>
> This is an umbrella ticket to implement a cost-based optimizer framework 
> beyond broadcast join selection. This framework can be used to implement some 
> useful optimizations such as join reordering.
> The design should discuss how to break the work down into multiple, smaller 
> logical units. For example, changes to statistics class, system catalog, cost 
> estimation/propagation in expressions, cost estimation/propagation in 
> operators can be done in decoupled pull requests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18663) Simplify CountMinSketch aggregate implementation

2016-11-30 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18663:
---

 Summary: Simplify CountMinSketch aggregate implementation
 Key: SPARK-18663
 URL: https://issues.apache.org/jira/browse/SPARK-18663
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Assignee: Reynold Xin


SPARK-18429 introduced a count-min sketch aggregate function for SQL, but the 
implementation and testing are more complicated than needed. This simplifies the 
test cases and removes support for data types that don't have clear equality 
semantics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18536) Failed to save to hive table when case class with empty field

2016-11-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709551#comment-15709551
 ] 

Reynold Xin commented on SPARK-18536:
-

We need to add a PreWriteCheck for Parquet.


> Failed to save to hive table when case class with empty field
> -
>
> Key: SPARK-18536
> URL: https://issues.apache.org/jira/browse/SPARK-18536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: pin_zhang
>
> {code}import scala.collection.mutable.Queue
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SaveMode
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.streaming.Seconds
> import org.apache.spark.streaming.StreamingContext
> {code}
> 1. Test code
> {code}
> case class EmptyC()
> case class EmptyCTable(dimensions: EmptyC, timebin: java.lang.Long)
> object EmptyTest {
>   def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName("scala").setMaster("local[2]")
> val ctx = new SparkContext(conf)
> val spark = 
> SparkSession.builder().enableHiveSupport().config(conf).getOrCreate()
> val seq = Seq(EmptyCTable(EmptyC(), 100L))
> val rdd = ctx.makeRDD[EmptyCTable](seq)
> val ssc = new StreamingContext(ctx, Seconds(1))
> val queue = Queue(rdd)
> val s = ssc.queueStream(queue, false);
> s.foreachRDD((rdd, time) => {
>   if (!rdd.isEmpty) {
> import spark.sqlContext.implicits._
> rdd.toDF.write.mode(SaveMode.Overwrite).saveAsTable("empty_table")
>   }
> })
> ssc.start()
> ssc.awaitTermination()
>   }
> }
> {code}
> 2. Exception
> {noformat}
> Caused by: java.lang.IllegalStateException: Cannot build an empty group
>   at org.apache.parquet.Preconditions.checkState(Preconditions.java:91)
>   at org.apache.parquet.schema.Types$GroupBuilder.build(Types.java:554)
>   at org.apache.parquet.schema.Types$GroupBuilder.build(Types.java:426)
>   at org.apache.parquet.schema.Types$Builder.named(Types.java:228)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:527)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:321)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convert$1.apply(ParquetSchemaConverter.scala:313)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convert$1.apply(ParquetSchemaConverter.scala:313)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convert(ParquetSchemaConverter.scala:313)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.init(ParquetWriteSupport.scala:85)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetFileFormat.scala:562)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:139)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at 

[jira] [Updated] (SPARK-18536) Failed to save to hive table when case class with empty field

2016-11-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18536:

Description: 
{code}import scala.collection.mutable.Queue

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
{code}

1. Test code

{code}
case class EmptyC()
case class EmptyCTable(dimensions: EmptyC, timebin: java.lang.Long)

object EmptyTest {

  def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("scala").setMaster("local[2]")
val ctx = new SparkContext(conf)
val spark = 
SparkSession.builder().enableHiveSupport().config(conf).getOrCreate()
val seq = Seq(EmptyCTable(EmptyC(), 100L))
val rdd = ctx.makeRDD[EmptyCTable](seq)
val ssc = new StreamingContext(ctx, Seconds(1))

val queue = Queue(rdd)
val s = ssc.queueStream(queue, false);
s.foreachRDD((rdd, time) => {
  if (!rdd.isEmpty) {
import spark.sqlContext.implicits._
rdd.toDF.write.mode(SaveMode.Overwrite).saveAsTable("empty_table")
  }
})

ssc.start()
ssc.awaitTermination()

  }

}
{code}

2. Exception
{noformat}
Caused by: java.lang.IllegalStateException: Cannot build an empty group
at org.apache.parquet.Preconditions.checkState(Preconditions.java:91)
at org.apache.parquet.schema.Types$GroupBuilder.build(Types.java:554)
at org.apache.parquet.schema.Types$GroupBuilder.build(Types.java:426)
at org.apache.parquet.schema.Types$Builder.named(Types.java:228)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:527)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:321)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convert$1.apply(ParquetSchemaConverter.scala:313)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convert$1.apply(ParquetSchemaConverter.scala:313)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convert(ParquetSchemaConverter.scala:313)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.init(ParquetWriteSupport.scala:85)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetFileFormat.scala:562)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:139)
at 
org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
... 3 more
 {noformat}

  was:

import scala.collection.mutable.Queue

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
1. Test code
case class EmptyC()
case class EmptyCTable(dimensions: EmptyC, timebin: java.lang.Long)

object EmptyTest {

  def main(args: Array[String]): Unit = {
val conf = new 

[jira] [Resolved] (SPARK-18220) ClassCastException occurs when using select query on ORC file

2016-11-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18220.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> ClassCastException occurs when using select query on ORC file
> -
>
> Key: SPARK-18220
> URL: https://issues.apache.org/jira/browse/SPARK-18220
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jerryjung
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: orcfile, sql
> Fix For: 2.1.0
>
>
> Error message is below.
> {noformat}
> ==
> 16/11/02 16:38:09 INFO ReaderImpl: Reading ORC rows from 
> hdfs://xxx/part-00022 with {include: [true], offset: 0, length: 
> 9223372036854775807}
> 16/11/02 16:38:09 INFO Executor: Finished task 17.0 in stage 22.0 (TID 42). 
> 1220 bytes result sent to driver
> 16/11/02 16:38:09 INFO TaskSetManager: Finished task 17.0 in stage 22.0 (TID 
> 42) in 116 ms on localhost (executor driver) (19/20)
> 16/11/02 16:38:09 ERROR Executor: Exception in task 10.0 in stage 22.0 (TID 
> 35)
> java.lang.ClassCastException: 
> org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to 
> org.apache.hadoop.io.Text
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
>   at 
> org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:526)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> ORC dump info.
> ==
> File Version: 0.12 with HIVE_8732
> 16/11/02 16:39:21 INFO orc.ReaderImpl: Reading ORC rows from 
> hdfs://XXX/part-0 with {include: null, offset: 0, length: 
> 9223372036854775807}
> 16/11/02 16:39:21 INFO orc.RecordReaderFactory: Schema is not specified on 
> read. Using file schema.
> Rows: 7
> Compression: ZLIB
> Compression size: 262144
> Type: 
> struct
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18617) Close "kryo auto pick" feature for Spark Streaming

2016-11-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18617.
-
   Resolution: Fixed
 Assignee: Genmao Yu
Fix Version/s: 2.1.0

> Close "kryo auto pick" feature for Spark Streaming
> --
>
> Key: SPARK-18617
> URL: https://issues.apache.org/jira/browse/SPARK-18617
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Genmao Yu
>Assignee: Genmao Yu
> Fix For: 2.1.0
>
>
> [PR-15992| https://github.com/apache/spark/pull/15992] provided a solution to 
> fix the bug, i.e. {{receiver data can not be deserialized properly}}. As 
> [~zsxwing] said, it is a critical bug, but we should not break APIs between 
> maintenance releases. It may be a rational choice to close {{auto pick kryo 
> serializer}} for Spark Streaming in the first step.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18145) Update documentation for hive partition management in 2.1

2016-11-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18145.
-
   Resolution: Fixed
 Assignee: Eric Liang
Fix Version/s: 2.1.0

> Update documentation for hive partition management in 2.1
> -
>
> Key: SPARK-18145
> URL: https://issues.apache.org/jira/browse/SPARK-18145
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17861) Store data source partitions in metastore and push partition pruning into metastore

2016-11-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17861.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Store data source partitions in metastore and push partition pruning into 
> metastore
> ---
>
> Key: SPARK-17861
> URL: https://issues.apache.org/jira/browse/SPARK-17861
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Eric Liang
>Priority: Critical
> Fix For: 2.1.0
>
>
> Spark SQL does not store any partition information in the catalog for data 
> source tables, because it was initially designed to work with arbitrary 
> files. This, however, has a few issues for catalog tables:
> 1. Listing partitions for a large table (with millions of partitions) can be 
> very slow during cold start.
> 2. Does not support heterogeneous partition naming schemes.
> 3. Cannot leverage pushing partition pruning into the metastore.
> This ticket tracks the work required to push the tracking of partitions into 
> the metastore. This change should be feature flagged.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18632) AggregateFunction should not ImplicitCastInputTypes

2016-11-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18632.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> AggregateFunction should not ImplicitCastInputTypes
> ---
>
> Key: SPARK-18632
> URL: https://issues.apache.org/jira/browse/SPARK-18632
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.2.0
>
>
> {{AggregateFunction}} currently implements {{ImplicitCastInputTypes}} (which 
> enables implicit input type casting). This can lead to unexpected results, 
> and should only be enabled when it is suitable for the function at hand. 
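
A hypothetical illustration of the kind of surprise meant here, not taken from the ticket; it assumes a SparkSession named {{spark}} (as in spark-shell), the column values are made up, and exact behavior depends on the Spark version:

{code}
import spark.implicits._

val df = Seq("1", "2", "oops").toDF("s")
// With implicit input casting enabled, the string column is silently cast to a
// numeric type, so malformed values become null instead of triggering an
// analysis error.
df.selectExpr("sum(s)").show()
{code}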



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18639) Build only a single pip package

2016-11-29 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18639:
---

 Summary: Build only a single pip package
 Key: SPARK-18639
 URL: https://issues.apache.org/jira/browse/SPARK-18639
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Reporter: Reynold Xin
Assignee: Reynold Xin


We currently build 5 separate pip binary tarballs, doubling the release script 
runtime. It'd be better to build one, especially for use cases that are just 
using Spark locally.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18429) SQL aggregate function for CountMinSketch

2016-11-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18429:

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-16026

> SQL aggregate function for CountMinSketch
> -
>
> Key: SPARK-18429
> URL: https://issues.apache.org/jira/browse/SPARK-18429
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Zhenhua Wang
>
> Implement a new Aggregate to generate count min sketch, which is a wrapper of 
> CountMinSketch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18429) SQL aggregate function for CountMinSketch

2016-11-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18429.
-
   Resolution: Fixed
 Assignee: Zhenhua Wang
Fix Version/s: 2.2.0

> SQL aggregate function for CountMinSketch
> -
>
> Key: SPARK-18429
> URL: https://issues.apache.org/jira/browse/SPARK-18429
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.2.0
>
>
> Implement a new Aggregate to generate count min sketch, which is a wrapper of 
> CountMinSketch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18429) SQL aggregate function for CountMinSketch

2016-11-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18429:

Summary: SQL aggregate function for CountMinSketch  (was: implement a new 
Aggregate for CountMinSketch)

> SQL aggregate function for CountMinSketch
> -
>
> Key: SPARK-18429
> URL: https://issues.apache.org/jira/browse/SPARK-18429
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Zhenhua Wang
>
> Implement a new Aggregate to generate count min sketch, which is a wrapper of 
> CountMinSketch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18632) AggregateFunction should not ImplicitCastInputTypes

2016-11-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18632:

Target Version/s: 2.2.0

> AggregateFunction should not ImplicitCastInputTypes
> ---
>
> Key: SPARK-18632
> URL: https://issues.apache.org/jira/browse/SPARK-18632
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> {{AggregateFunction}} currently implements {{ImplicitCastInputTypes}} (which 
> enables implicit input type casting). This can lead to unexpected results, 
> and should only be enabled when it is suitable for the function at hand. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption

2016-11-29 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706069#comment-15706069
 ] 

Reynold Xin commented on SPARK-17204:
-

local-cluster is different from the local mode. It is a local "cluster" with 
multiple processes.

Try 

{noformat}
> MASTER=local-cluster[2,1,1024] bin/spark-shell
{noformat}


> Spark 2.0 off heap RDD persistence with replication factor 2 leads to 
> in-memory data corruption
> ---
>
> Key: SPARK-17204
> URL: https://issues.apache.org/jira/browse/SPARK-17204
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>
> We use the {{OFF_HEAP}} storage level extensively with great success. We've 
> tried off-heap storage with replication factor 2 and have always received 
> exceptions on the executor side very shortly after starting the job. For 
> example:
> {code}
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> or
> {code}
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> 

[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption

2016-11-29 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706030#comment-15706030
 ] 

Reynold Xin commented on SPARK-17204:
-

Can you try to reproduce this using the local-cluster mode?

> Spark 2.0 off heap RDD persistence with replication factor 2 leads to 
> in-memory data corruption
> ---
>
> Key: SPARK-17204
> URL: https://issues.apache.org/jira/browse/SPARK-17204
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>
> We use the {{OFF_HEAP}} storage level extensively with great success. We've 
> tried off-heap storage with replication factor 2 and have always received 
> exceptions on the executor side very shortly after starting the job. For 
> example:
> {code}
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> or
> {code}
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> 

[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)

2016-11-29 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15705944#comment-15705944
 ] 

Reynold Xin commented on SPARK-18352:
-

I've asked [~joshrosen] to do that only for the text format, and not json.


> Parse normal, multi-line JSON files (not just JSON Lines)
> -
>
> Key: SPARK-18352
> URL: https://issues.apache.org/jira/browse/SPARK-18352
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
>
> Spark currently can only parse JSON files in the JSON Lines format, i.e. each 
> record occupies a single line and records are separated by newlines. In reality, 
> a lot of users want to use Spark to parse actual JSON files, and are 
> surprised to learn that it doesn't do that.
> We can introduce a new mode (wholeJsonFile?) in which we don't split the 
> files, and rather stream through them to parse the JSON files.
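
A sketch of how such a mode might be used, assuming a SparkSession named {{spark}}; the option name below is the tentative one floated in this ticket (question mark dropped) and the path is a placeholder, so treat both as assumptions rather than a settled API:

{code}
// Hypothetical usage; "wholeJsonFile" is the tentative option name from this ticket.
val df = spark.read
  .option("wholeJsonFile", "true")
  .json("/path/to/pretty-printed.json")   // a single multi-line JSON document per file
{code}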



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15689) Data source API v2

2016-11-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15689:

Labels: releasenotes  (was: )

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface with a dependency 
> on DataFrame/SQLContext, making data source API compatibility depend on the 
> upper-level API. The current data source API is also row-oriented only and has 
> to go through an expensive conversion from external data types to internal 
> data types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18350) Support session local timezone

2016-11-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18350:

Labels: releasenotes  (was: )

> Support session local timezone
> --
>
> Key: SPARK-18350
> URL: https://issues.apache.org/jira/browse/SPARK-18350
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime 
> manipulation, which is bad if users are not in the same timezones as the 
> machines, or if different users have different timezones.
> We should introduce a session local timezone setting that is used for 
> execution.
> An explicit non-goal is locale handling.
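
For illustration, one possible shape of such a setting, assuming a SparkSession named {{spark}}; the configuration key below is an assumption made for this sketch, not something fixed by the ticket:

{code}
// Hypothetical illustration; the key name is assumed for this sketch.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
// Datetime expressions would then be evaluated in the session timezone rather
// than the machine timezone of the driver/executors.
spark.sql("SELECT current_timestamp()").show()
{code}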



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)

2016-11-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18352:

Labels: releasenotes  (was: )

> Parse normal, multi-line JSON files (not just JSON Lines)
> -
>
> Key: SPARK-18352
> URL: https://issues.apache.org/jira/browse/SPARK-18352
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
>
> Spark currently can only parse JSON files in the JSON Lines format, i.e. each 
> record occupies a single line and records are separated by newlines. In 
> reality, a lot of users want to use Spark to parse actual JSON files, and are 
> surprised to learn that it doesn't do that.
> We can introduce a new mode (wholeJsonFile?) in which we don't split the 
> files, but instead stream through them to parse the JSON.
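
A sketch of what reading a normal, multi-line JSON file might look like under 
the proposed mode. The option name is an assumption taken from the 
wholeJsonFile suggestion in the ticket; no such option exists yet.

{code}
// Sketch only: the option name below is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("multi-line-json-sketch")
  .getOrCreate()

// Today json() expects one record per line; the proposed mode would parse a
// whole pretty-printed JSON document (or an array of records) per file.
val df = spark.read
  .option("wholeJsonFile", "true")   // hypothetical option
  .json("/path/to/people.json")

df.printSchema()
{code}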



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16475) Broadcast Hint for SQL Queries

2016-11-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16475:

Labels: releasenotes  (was: )

> Broadcast Hint for SQL Queries
> --
>
> Key: SPARK-16475
> URL: https://issues.apache.org/jira/browse/SPARK-16475
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>  Labels: releasenotes
> Attachments: BroadcastHintinSparkSQL.pdf
>
>
> A broadcast hint is a way for users to manually annotate a query and suggest 
> a join method to the query optimizer. It is very useful when the query 
> optimizer cannot make an optimal decision with respect to join methods due to 
> conservativeness or the lack of proper statistics.
> The DataFrame API has had a broadcast hint since Spark 1.5. However, we do 
> not have equivalent functionality in SQL queries. We propose adding a 
> Hive-style broadcast hint to Spark SQL.
> For more information, please see the attached document. One note about the 
> doc: in addition to supporting "MAPJOIN", we should also support 
> "BROADCASTJOIN" and "BROADCAST" in the hint comment, e.g. the following 
> should be accepted:
> {code}
> SELECT /*+ MAPJOIN(b) */ ...
> SELECT /*+ BROADCASTJOIN(b) */ ...
> SELECT /*+ BROADCAST(b) */ ...
> {code}
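
For comparison, the DataFrame-side hint mentioned above already exists via the 
broadcast function in org.apache.spark.sql.functions; the SQL-comment form is 
the proposal in this ticket and is shown only as a commented-out sketch.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("broadcast-hint-sketch")
  .getOrCreate()
import spark.implicits._

val large = Seq((1, "a"), (2, "b")).toDF("id", "v")
val small = Seq((1, "x")).toDF("id", "w")

// Existing DataFrame API: mark `small` so the optimizer prefers a broadcast join.
large.join(broadcast(small), "id").explain()

// Proposed SQL equivalent (per this ticket), once the hint syntax is supported:
// spark.sql("SELECT /*+ BROADCAST(s) */ * FROM large l JOIN small s ON l.id = s.id")
{code}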



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution

2016-11-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18590:

Issue Type: New Feature  (was: Bug)

> R - Include package vignettes and help pages, build source package in Spark 
> distribution
> 
>
> Key: SPARK-18590
> URL: https://issues.apache.org/jira/browse/SPARK-18590
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> We should include the built source package for SparkR in the Spark 
> distribution. This will enable help pages and vignettes when the package is 
> used. This source package is also what we would release to CRAN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution

2016-11-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18590:

Priority: Major  (was: Blocker)

> R - Include package vignettes and help pages, build source package in Spark 
> distribution
> 
>
> Key: SPARK-18590
> URL: https://issues.apache.org/jira/browse/SPARK-18590
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> We should include the built source package for SparkR in the Spark 
> distribution. This will enable help pages and vignettes when the package is 
> used. This source package is also what we would release to CRAN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution

2016-11-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18590:

Target Version/s:   (was: 2.1.0)

> R - Include package vignettes and help pages, build source package in Spark 
> distribution
> 
>
> Key: SPARK-18590
> URL: https://issues.apache.org/jira/browse/SPARK-18590
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Blocker
>
> We should include the built source package for SparkR in the Spark 
> distribution. This will enable help pages and vignettes when the package is 
> used. This source package is also what we would release to CRAN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


