[jira] [Commented] (SPARK-16095) Yarn cluster mode should return consistent result for command line and SparkLauncher
[ https://issues.apache.org/jira/browse/SPARK-16095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343760#comment-15343760 ] Peng Zhang commented on SPARK-16095: I wanted to write a unit test to demonstrate this issue, but found that YarnClusterSuite never enters the block *if (conf.get("spark.master") == "yarn-cluster")*. I filed SPARK-16125 to fix that first. > Yarn cluster mode should return consistent result for command line and > SparkLauncher > > > Key: SPARK-16095 > URL: https://issues.apache.org/jira/browse/SPARK-16095 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Peng Zhang > > For an application with YarnApplicationState.FINISHED and > FinalApplicationStatus.FAILED, invoking spark-submit from the command line > throws an exception, while submitting with SparkLauncher reports the state > FINISHED, which implies the app succeeded. > Because of this, an assertion with a false condition in the test > YarnClusterSuite will not fail the test. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
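The inconsistency can be illustrated with a short SparkLauncher sketch. This is a hedged illustration only: the jar path and main class are placeholders, not from this issue.

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LauncherStateCheck {
  // The issue: the launcher's terminal state can read FINISHED even when
  // YARN's FinalApplicationStatus is FAILED, so this check looks successful.
  def looksSuccessful(state: SparkAppHandle.State): Boolean =
    state == SparkAppHandle.State.FINISHED

  def main(args: Array[String]): Unit = {
    // Placeholder app resource and main class, not from this issue.
    val handle = new SparkLauncher()
      .setAppResource("/path/to/app.jar")
      .setMainClass("com.example.MyApp")
      .setMaster("yarn")
      .setDeployMode("cluster")
      .startApplication()

    // Block until the application reaches a terminal state.
    while (!handle.getState.isFinal) Thread.sleep(1000)

    // In contrast, spark-submit on the command line throws for the same app.
    println(s"final state: ${handle.getState}, looksSuccessful = ${looksSuccessful(handle.getState)}")
  }
}
```

The point of contention is that `looksSuccessful` above returns true for an app whose YARN final status was FAILED, so launcher-based callers silently treat a failed app as succeeded.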
[jira] [Commented] (SPARK-16125) YarnClusterSuite tests cluster mode incorrectly
[ https://issues.apache.org/jira/browse/SPARK-16125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343747#comment-15343747 ] Sean Owen commented on SPARK-16125: --- Looks like we might have related issues in UtilsSuite, context.py and SparkILoop; can you adjust those too? > YarnClusterSuite tests cluster mode incorrectly > -- > > Key: SPARK-16125 > URL: https://issues.apache.org/jira/browse/SPARK-16125 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Peng Zhang >Priority: Minor > > In YarnClusterSuite, cluster mode is tested with: > {code} > if (conf.get("spark.master") == "yarn-cluster") { > {code} > But since the SPARK-13220 change, conf.get("spark.master") returns "yarn", so > this *if* condition is always *false* and the code in this block is never > executed. I think it should be changed to: > {code} > if (conf.get("spark.submit.deployMode") == "cluster") { > {code}
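The proposed fix can be sketched outside of Spark with a plain Map standing in for SparkConf (an illustrative sketch, not the actual suite code):

```scala
// A plain Map stands in for SparkConf in this sketch.
// Since SPARK-13220, "spark.master" is normalized to "yarn", so the old
// comparison against "yarn-cluster" can never match.
val conf = Map(
  "spark.master" -> "yarn",
  "spark.submit.deployMode" -> "cluster"
)

val oldCheck = conf.get("spark.master").contains("yarn-cluster")
val newCheck = conf.get("spark.submit.deployMode").contains("cluster")

assert(!oldCheck) // the block guarded by the old check is never entered
assert(newCheck)  // the deployMode check correctly detects cluster mode
```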
[jira] [Commented] (SPARK-16064) Fix the GLM error caused by NA produced by reweight function
[ https://issues.apache.org/jira/browse/SPARK-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343745#comment-15343745 ] Sean Owen commented on SPARK-16064: --- Based on what, though? You should compile the information publicly. > Fix the GLM error caused by NA produced by reweight function > > > Key: SPARK-16064 > URL: https://issues.apache.org/jira/browse/SPARK-16064 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.0 >Reporter: Zhang Mengqi >Priority: Minor > > This case happens when users run GLM with SparkR; the same dataset runs > GLM fine in native R. > When users fit a GLM model using glm with the poisson family, it raises an > assertion error caused by NA values produced by the reweight function: > 16/06/20 16:40:22 ERROR RBackendHandler: fit on > org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper failed > Error in invokeJava(isStatic = TRUE, className, methodName, ...) : > java.lang.AssertionError: assertion failed: Sum of weights cannot be zero.
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.ml.optim.WeightedLeastSquares$Aggregator.validate(WeightedLeastSquares.scala:248) > at > org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:82) > at > org.apache.spark.ml.optim.IterativelyReweightedLeastSquares.fit(IterativelyReweightedLeastSquares.scala:85) > at > org.apache.spark.ml.regression.GeneralizedLinearRegression.train(GeneralizedLinearRegression.scala:276) > at > org.apache.spark.ml.regression.GeneralizedLinearRegression.train(GeneralizedLinearRegression.scala:134) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:148) > at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:144) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.Abstra > P.S. The dataset is about city ride flows between several planning areas in > Singapore. > ride_flow_exp <- glm(flow~Origin+Destination+distance,ride_flow,family = > poisson(link = "log")) > SparkDataFrame[Origin:string, Destination:string, flow:double, Oi:int, > Dj:int, distance:double]
[jira] [Created] (SPARK-16125) YarnClusterSuite tests cluster mode incorrectly
Peng Zhang created SPARK-16125: -- Summary: YarnClusterSuite tests cluster mode incorrectly Key: SPARK-16125 URL: https://issues.apache.org/jira/browse/SPARK-16125 Project: Spark Issue Type: Bug Components: YARN Reporter: Peng Zhang Priority: Minor In YarnClusterSuite, cluster mode is tested with: {code} if (conf.get("spark.master") == "yarn-cluster") { {code} But since the SPARK-13220 change, conf.get("spark.master") returns "yarn", so this *if* condition is always *false* and the code in this block is never executed. I think it should be changed to: {code} if (conf.get("spark.submit.deployMode") == "cluster") { {code}
[jira] [Updated] (SPARK-16024) add tests for table creation with column comment
[ https://issues.apache.org/jira/browse/SPARK-16024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-16024: Description: (was: CREATE TABLE src(a INT COMMENT 'bla') USING parquet. When we describe table, the column comment is not there.) > add tests for table creation with column comment > > > Key: SPARK-16024 > URL: https://issues.apache.org/jira/browse/SPARK-16024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >
[jira] [Updated] (SPARK-16024) add tests for table creation with column comment
[ https://issues.apache.org/jira/browse/SPARK-16024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-16024: Description: should test both hive serde tables and datasource tables > add tests for table creation with column comment > > > Key: SPARK-16024 > URL: https://issues.apache.org/jira/browse/SPARK-16024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan > > should test both hive serde tables and datasource tables
[jira] [Updated] (SPARK-16104) Do not create CSV writer object for every flush when writing
[ https://issues.apache.org/jira/browse/SPARK-16104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-16104: --- Assignee: Hyukjin Kwon > Do not create CSV writer object for every flush when writing > - > > Key: SPARK-16104 > URL: https://issues.apache.org/jira/browse/SPARK-16104 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon > Fix For: 2.1.0 > > > Initially, the CSV data source created a {{CsvWriter}} for each record; that was > fixed in SPARK-14031. > However, it still creates a writer for each flush in {{LineCsvWriter}}. This > is unnecessary; it would be better to use a single {{CsvWriter}}.
[jira] [Resolved] (SPARK-16104) Do not create CSV writer object for every flush when writing
[ https://issues.apache.org/jira/browse/SPARK-16104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-16104. Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 13809 [https://github.com/apache/spark/pull/13809] > Do not create CSV writer object for every flush when writing > - > > Key: SPARK-16104 > URL: https://issues.apache.org/jira/browse/SPARK-16104 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Hyukjin Kwon > Fix For: 2.1.0 > > > Initially, the CSV data source created a {{CsvWriter}} for each record; that was > fixed in SPARK-14031. > However, it still creates a writer for each flush in {{LineCsvWriter}}. This > is unnecessary; it would be better to use a single {{CsvWriter}}.
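The pattern behind the fix can be sketched with univocity-parsers, the CSV library Spark's data source builds on. This is an illustrative sketch of reusing one writer across flushes, not the actual {{LineCsvWriter}} code:

```scala
import java.io.StringWriter
import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings}

val out = new StringWriter()
// Construct the CsvWriter once, up front...
val writer = new CsvWriter(out, new CsvWriterSettings())

// ...and reuse it for every batch: flushing does not require a fresh writer.
def writeBatch(rows: Seq[Seq[String]]): Unit = {
  rows.foreach(row => writer.writeRow(row.toArray[AnyRef]: _*))
  writer.flush()
}

writeBatch(Seq(Seq("a", "1"), Seq("b", "2")))
writeBatch(Seq(Seq("c", "3")))
writer.close()
```

Avoiding a fresh `CsvWriter` per flush saves the repeated object construction and settings parsing that motivated this sub-task.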
[jira] [Commented] (SPARK-12173) Consider supporting DataSet API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343699#comment-15343699 ] Reynold Xin commented on SPARK-12173: - I don't think you are looking for the dataset API. You are just looking for some way to do UDF / aggregation? > Consider supporting DataSet API in SparkR > - > > Key: SPARK-12173 > URL: https://issues.apache.org/jira/browse/SPARK-12173 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Felix Cheung >
[jira] [Updated] (SPARK-16041) Disallow Duplicate Columns in `partitionBy`, `bucketBy` and `sortBy`
[ https://issues.apache.org/jira/browse/SPARK-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-16041: Description: Duplicate columns are not allowed in `partitionBy`, `bucketBy`, `sortBy` in DataFrameWriter. Duplicate columns can cause unpredictable results, for example resolution failures. We should detect the duplicates and throw exceptions with appropriate messages. was: Duplicate columns are not allowed in `partitionBy`, `blockBy`, `sortBy` in DataFrameWriter. The duplicate columns could cause unpredictable results. For example, the resolution failure. We should detect the duplicates and issue exceptions with appropriate messages. > Disallow Duplicate Columns in `partitionBy`, `bucketBy` and `sortBy` > > > Key: SPARK-16041 > URL: https://issues.apache.org/jira/browse/SPARK-16041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Duplicate columns are not allowed in `partitionBy`, `bucketBy`, `sortBy` in > DataFrameWriter. Duplicate columns can cause unpredictable results, for > example resolution failures. > We should detect the duplicates and throw exceptions with appropriate > messages.
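The proposed validation can be sketched with a small hypothetical helper (`assertNoDuplicates` is a made-up name for illustration, not the actual DataFrameWriter code):

```scala
// Hypothetical helper illustrating the proposed check: reject duplicate
// column names passed to partitionBy/bucketBy/sortBy with a clear message.
def assertNoDuplicates(api: String, cols: Seq[String]): Unit = {
  val dups = cols.groupBy(identity).collect { case (c, vs) if vs.size > 1 => c }
  if (dups.nonEmpty) {
    throw new IllegalArgumentException(
      s"Found duplicate column(s) in $api: ${dups.mkString(", ")}")
  }
}

assertNoDuplicates("partitionBy", Seq("year", "month")) // passes
// assertNoDuplicates("sortBy", Seq("id", "id"))        // would throw
```

Failing fast here surfaces the problem at the call site instead of letting it appear later as an opaque resolution failure.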
[jira] [Updated] (SPARK-16041) Disallow Duplicate Columns in `partitionBy`, `bucketBy` and `sortBy`
[ https://issues.apache.org/jira/browse/SPARK-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-16041: Summary: Disallow Duplicate Columns in `partitionBy`, `bucketBy` and `sortBy` (was: Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy`) > Disallow Duplicate Columns in `partitionBy`, `bucketBy` and `sortBy` > > > Key: SPARK-16041 > URL: https://issues.apache.org/jira/browse/SPARK-16041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Duplicate columns are not allowed in `partitionBy`, `blockBy`, `sortBy` in > DataFrameWriter. The duplicate columns could cause unpredictable results. For > example, the resolution failure. > We should detect the duplicates and issue exceptions with appropriate > messages.
[jira] [Updated] (SPARK-16041) Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy`
[ https://issues.apache.org/jira/browse/SPARK-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-16041: Summary: Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` (was: Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` in DataFrameWriter) > Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` > --- > > Key: SPARK-16041 > URL: https://issues.apache.org/jira/browse/SPARK-16041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Duplicate columns are not allowed in `partitionBy`, `blockBy`, `sortBy` in > DataFrameWriter. The duplicate columns could cause unpredictable results. For > example, the resolution failure. > We should detect the duplicates and issue exceptions with appropriate > messages.
[jira] [Commented] (SPARK-16100) Aggregator fails with Tungsten error when complex types are used for results and partial sum
[ https://issues.apache.org/jira/browse/SPARK-16100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343656#comment-15343656 ] Apache Spark commented on SPARK-16100: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/13835 > Aggregator fails with Tungsten error when complex types are used for results > and partial sum > > > Key: SPARK-16100 > URL: https://issues.apache.org/jira/browse/SPARK-16100 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Deenar Toraskar > > I get a similar error when using complex types in Aggregator. Not sure if > this is the same issue or something else. > {code:Agg.scala} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.TypedColumn > import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder > import org.apache.spark.sql.expressions.Aggregator > import org.apache.spark.sql.{Encoder,Row} > import sqlContext.implicits._ > object CustomSummer extends Aggregator[Valuation, Map[Int, Seq[Double]], > Seq[Seq[Double]]] with Serializable { > def zero: Map[Int, Seq[Double]] = Map() > def reduce(b: Map[Int, Seq[Double]], a:Valuation): Map[Int, Seq[Double]] > = { >val timeInterval: Int = a.timeInterval >val currentSum: Seq[Double] = b.get(timeInterval).getOrElse(Nil) >val currentRow: Seq[Double] = a.pvs >b.updated(timeInterval, sumArray(currentSum, currentRow)) > } > def sumArray(a: Seq[Double], b: Seq[Double]): Seq[Double] = Nil > def merge(b1: Map[Int, Seq[Double]], b2: Map[Int, Seq[Double]]): > Map[Int, Seq[Double]] = { > /* merges two maps together ++ replaces any (k,v) from the map on the > left > side of ++ (here map1) by (k,v) from the right side map, if (k,_) > already > exists in the left side map (here map1), e.g. 
Map(1->1) ++ Map(1->2) > results in Map(1->2) */ > b1 ++ b2.map { case (timeInterval, exposures) => > timeInterval -> sumArray(exposures, b1.getOrElse(timeInterval, Nil)) > } > } > def finish(exposures: Map[Int, Seq[Double]]): Seq[Seq[Double]] = > { > exposures.size match { > case 0 => null > case _ => { > val range = exposures.keySet.max > // convert map to 2 dimensional array, (timeInterval x > Seq[expScn1, expScn2, ...] > (0 to range).map(x => exposures.getOrElse(x, Nil)) > } > } > } > override def bufferEncoder: Encoder[Map[Int,Seq[Double]]] = > ExpressionEncoder() > override def outputEncoder: Encoder[Seq[Seq[Double]]] = ExpressionEncoder() >} > case class Valuation(timeInterval : Int, pvs : Seq[Double]) > val valns = sc.parallelize(Seq(Valuation(0, Seq(1.0,2.0,3.0)), > Valuation(2, Seq(1.0,2.0,3.0)), > Valuation(1, Seq(1.0,2.0,3.0)),Valuation(2, Seq(1.0,2.0,3.0)),Valuation(0, > Seq(1.0,2.0,3.0)) > )).toDS > val g_c1 = > valns.groupByKey(_.timeInterval).agg(CustomSummer.toColumn).show(false) > {code} > I get the following error > {quote} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 > (TID 19, localhost): java.lang.IndexOutOfBoundsException: 0 > at > scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43) > at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47) > at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:167) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214) > at > org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:156) > at > org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:154) > at > 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:155) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:155) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:155) > at >
[jira] [Commented] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343243#comment-15343243 ] Ryan Blue commented on SPARK-16032: --- [~cloud_fan], while I think by-name insertion is important in the long run, my concern on this issue isn't that it is missing. It would be nice and it would help mitigate what I think is a problem for people that currently use the Hive integration. But, the point I'm trying to make is that I think this set of changes makes it very difficult to use Spark with Hive. The first thing I will do to move on to Spark 2.0 is restore support for partitionBy with insertInto. I need to make a good case for that and write it up more clearly than I have so far, which I'll do tomorrow. But I want to let you guys know that I see these changes as a major blocker for us and probably others. My -1 isn't binding and I realize there are deadlines that you guys are aiming for. I'm not trying to mess that up, I just want to let you know that I think the Hive changes need more discussion. Consequently, I would rather they wait until 2.1, even if we have to deal with the incompatibility. An aside on the by-name resolution: I noted above that, rather than changing Hive/saveAsTable, we could use the InsertIntoTable with by-name resolution from my PR. My point isn't to try to push that feature into the release, it's that I think it supports the point that this is a big change set that was done in very little time. It doesn't look like there was a significant change around saveAsTable for Hive in the actual code, though. 
> Audit semantics of various insertion operations related to partitioned tables > - > > Key: SPARK-16032 > URL: https://issues.apache.org/jira/browse/SPARK-16032 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan >Priority: Critical > Attachments: [SPARK-16032] Spark SQL table insertion auditing - > Google Docs.pdf > > > We found that semantics of various insertion operations related to partition > tables can be inconsistent. This is an umbrella ticket for all related > tickets.
[jira] [Commented] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns
[ https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343235#comment-15343235 ] Gayathri Murali commented on SPARK-16000: - [~yuhaoyan] I can help with this. > Make model loading backward compatible with saved models using old vector > columns > - > > Key: SPARK-16000 > URL: https://issues.apache.org/jira/browse/SPARK-16000 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: yuhao yang > > To help users migrate from Spark 1.6 to 2.0, we should make model loading > backward compatible with models saved in 1.6. The main incompatibility is the > vector column type change.
[jira] [Commented] (SPARK-16121) ListingFileCatalog does not list in parallel anymore
[ https://issues.apache.org/jira/browse/SPARK-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343232#comment-15343232 ] Xiao Li commented on SPARK-16121: - I also saw this, but I thought it was by design. :) > ListingFileCatalog does not list in parallel anymore > > > Key: SPARK-16121 > URL: https://issues.apache.org/jira/browse/SPARK-16121 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Blocker > > In ListingFileCatalog, the implementation of {{listLeafFiles}} is shown > below. When the number of user-provided paths is less than the value of > {{sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold}}, we > do not use parallel listing, which differs from what 1.6 does (in 1.6, if > the number of children of any inner dir is larger than the threshold, we > use parallel listing). > {code} > protected def listLeafFiles(paths: Seq[Path]): > mutable.LinkedHashSet[FileStatus] = { > if (paths.length >= > sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) { > HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, > sparkSession) > } else { > // Dummy jobconf to get to the pathFilter defined in configuration > val jobConf = new JobConf(hadoopConf, this.getClass) > val pathFilter = FileInputFormat.getInputPathFilter(jobConf) > val statuses: Seq[FileStatus] = paths.flatMap { path => > val fs = path.getFileSystem(hadoopConf) > logInfo(s"Listing $path on driver") > Try { > HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), > pathFilter) > }.getOrElse(Array.empty[FileStatus]) > } > mutable.LinkedHashSet(statuses: _*) > } > } > {code}
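The behavioral difference can be sketched with plain numbers (a hedged illustration of the two policies with a made-up threshold value; no filesystem calls):

```scala
val threshold = 32 // stand-in for parallelPartitionDiscoveryThreshold

// 1.6 behavior: go parallel if ANY inner directory has more children
// than the threshold.
def parallel16(childCounts: Seq[Int]): Boolean = childCounts.exists(_ > threshold)

// 2.0 behavior described above: go parallel only if the NUMBER OF
// user-provided input paths reaches the threshold.
def parallel20(numInputPaths: Int): Boolean = numInputPaths >= threshold

// One input path containing 10,000 files: 1.6 lists in parallel, 2.0 does not.
assert(parallel16(Seq(10000)))
assert(!parallel20(1))
```

This is why a single root directory with a huge number of children, common in practice, regresses to serial driver-side listing under the 2.0 code.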
[jira] [Commented] (SPARK-12172) Consider removing SparkR internal RDD APIs
[ https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343215#comment-15343215 ] Sun Rui commented on SPARK-12172: - Currently spark.lapply() internally depends on RDDs; we have to change this before removing the RDD API. > Consider removing SparkR internal RDD APIs > -- > > Key: SPARK-12172 > URL: https://issues.apache.org/jira/browse/SPARK-12172 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Felix Cheung >
[jira] [Commented] (SPARK-12173) Consider supporting DataSet API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343212#comment-15343212 ] Sun Rui commented on SPARK-12173: - [~rxin] Yes, R doesn't need compile-time type safety, but map/reduce functions are popular in R; for example, lapply() applies a function to each item of a list or vector. For now, SparkR supports spark.lapply(), similar to lapply(). The implementation internally depends on RDDs. We could change the implementation to use Dataset without exposing the Dataset API, something like: 1) convert the R vector/list to a Dataset; 2) call Dataset functions on it; 3) collect the result back as an R vector/list. Not exposing the Dataset API means SparkR does not provide a distributed vector/list abstraction; SparkR users would have to use DataFrame for distributed vectors/lists, which seems inconvenient for R users. [~shivaram] what do you think? > Consider supporting DataSet API in SparkR > - > > Key: SPARK-12173 > URL: https://issues.apache.org/jira/browse/SPARK-12173 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Felix Cheung >
[jira] [Commented] (SPARK-16124) Throws exception when executing query on `build/sbt hive/console`
[ https://issues.apache.org/jira/browse/SPARK-16124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343176#comment-15343176 ] Apache Spark commented on SPARK-16124: -- User 'tilumi' has created a pull request for this issue: https://github.com/apache/spark/pull/13833 > Throws exception when executing query on `build/sbt hive/console` > - > > Key: SPARK-16124 > URL: https://issues.apache.org/jira/browse/SPARK-16124 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: MIN-FU YANG >Priority: Minor > > When I execute `val query = sql("SELECT * FROM src WHERE key = 92 ")` on the Hive > console started from `build/sbt hive/console`, it throws an exception: > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > file:/Users/xxx/git/spark/sql/hive/target/scala-2.11/spark-hive_2.11-2.0.0-SNAPSHOT.jar!/data/files/kv1.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:242) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at > org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:128) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession$SqlCmd$$anonfun$cmd$1.apply$mcV$sp(TestHive.scala:192) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession.loadTestTable(TestHive.scala:376) > at > 
org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462) > at > org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed$lzycompute(TestHive.scala:462) > at > org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed(TestHive.scala:450) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682) > ... 42 elided
[jira] [Closed] (SPARK-15326) Doing multiple unions on a Dataframe will result in a very inefficient query plan
[ https://issues.apache.org/jira/browse/SPARK-15326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell closed SPARK-15326. - Resolution: Not A Problem > Doing multiple unions on a Dataframe will result in a very inefficient query > plan > - > > Key: SPARK-15326 > URL: https://issues.apache.org/jira/browse/SPARK-15326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Jurriaan Pruis > Attachments: Query Plan.pdf, skewed_join.py, skewed_join_plan.txt > > > While working with a very skewed dataset I noticed that repeated unions on a > dataframe will result in a query plan with 2^(union) - 1 unions. With large > datasets this will be very inefficient. > I tried to replicate this behaviour using a PySpark example (I've attached > the output of the explain() to this JIRA): > {code} > df = sqlCtx.range(1000) > def r(name, max_val=100): > return F.round(F.lit(max_val) * F.pow(F.rand(), > 4)).cast('integer').alias(name) > # Create a skewed dataset > skewed = df.select('id', r('a'), r('b'), r('c'), r('d'), r('e'), r('f')) > # Find the skewed values in the dataset > top_10_percent = skewed.freqItems(['a', 'b', 'c', 'd', 'e', 'f'], > 0.10).collect()[0] > def skewjoin(skewed, right, column, freqItems): > freqItems = freqItems[column + '_freqItems'] > skewed = skewed.alias('skewed') > cond = F.col(column).isin(freqItems) > # First broadcast join the frequent (skewed) values > filtered = skewed.filter(cond).join(F.broadcast(right.filter(cond)), > column, 'left_outer') > # Use a regular join for the non skewed values (with big tables this will > use a SortMergeJoin) > non_skewed = skewed.filter(cond == False).join(right.filter(cond == > False), column, 'left_outer') > # join them together and replace the column with the column found in the > right DataFrame > return filtered.unionAll(non_skewed).select('skewed.*', > right['id'].alias(column + '_key')).drop(column) > # Create the dataframes that will be 
joined to the skewed dataframe > right_size = 100 > df_a = sqlCtx.range(right_size).select('id', F.col('id').alias('a')) > df_b = sqlCtx.range(right_size).select('id', F.col('id').alias('b')) > df_c = sqlCtx.range(right_size).select('id', F.col('id').alias('c')) > df_d = sqlCtx.range(right_size).select('id', F.col('id').alias('d')) > df_e = sqlCtx.range(right_size).select('id', F.col('id').alias('e')) > df_f = sqlCtx.range(right_size).select('id', F.col('id').alias('f')) > # Join everything together > df = skewed > df = skewjoin(df, df_a, 'a', top_10_percent) > df = skewjoin(df, df_b, 'b', top_10_percent) > df = skewjoin(df, df_c, 'c', top_10_percent) > df = skewjoin(df, df_d, 'd', top_10_percent) > df = skewjoin(df, df_e, 'e', top_10_percent) > df = skewjoin(df, df_f, 'f', top_10_percent) > # df.explain() shows the plan where it does 63 unions > (2^(number_of_skewjoins) - 1) > # which will be very inefficient and slow > df.explain(True) > # Evaluate the plan > # You'd expect this to return 1000, but it does not, it returned 1140 > on my system > # (probably because it will recalculate the random columns? Not sure though) > print(df.count()) > {code}
[jira] [Assigned] (SPARK-16124) Throws exception when executing query on `build/sbt hive/console`
[ https://issues.apache.org/jira/browse/SPARK-16124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16124: Assignee: Apache Spark > Throws exception when executing query on `build/sbt hive/console` > - > > Key: SPARK-16124 > URL: https://issues.apache.org/jira/browse/SPARK-16124 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: MIN-FU YANG >Assignee: Apache Spark >Priority: Minor > > When I execute `val query = sql("SELECT * FROM src WHERE key = 92 ")` on hive > console which is from `build/sbt hive/console`, It throws exception: > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > file:/Users/xxx/git/spark/sql/hive/target/scala-2.11/spark-hive_2.11-2.0.0-SNAPSHOT.jar!/data/files/kv1.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:242) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at > org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:128) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession$SqlCmd$$anonfun$cmd$1.apply$mcV$sp(TestHive.scala:192) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession.loadTestTable(TestHive.scala:376) > at > 
org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462) > at > org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed$lzycompute(TestHive.scala:462) > at > org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed(TestHive.scala:450) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682) > ... 42 elided -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16124) Throws exception when executing query on `build/sbt hive/console`
[ https://issues.apache.org/jira/browse/SPARK-16124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16124: Assignee: (was: Apache Spark) > Throws exception when executing query on `build/sbt hive/console` > - > > Key: SPARK-16124 > URL: https://issues.apache.org/jira/browse/SPARK-16124 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: MIN-FU YANG >Priority: Minor > > When I execute `val query = sql("SELECT * FROM src WHERE key = 92 ")` on hive > console which is from `build/sbt hive/console`, It throws exception: > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > file:/Users/xxx/git/spark/sql/hive/target/scala-2.11/spark-hive_2.11-2.0.0-SNAPSHOT.jar!/data/files/kv1.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:242) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at > org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:128) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession$SqlCmd$$anonfun$cmd$1.apply$mcV$sp(TestHive.scala:192) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession.loadTestTable(TestHive.scala:376) > at > org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462) > at > 
org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed$lzycompute(TestHive.scala:462) > at > org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed(TestHive.scala:450) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682) > ... 42 elided -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15326) Doing multiple unions on a Dataframe will result in a very inefficient query plan
[ https://issues.apache.org/jira/browse/SPARK-15326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343175#comment-15343175 ] Herman van Hovell commented on SPARK-15326: --- So we have code that will flatten nested Unions. For example, {{UNION(UNION(a, b), UNION(c, d))}} will be converted into {{UNION(a, b, c, d)}}. This case, however, is a bit different. You are joining the {{right}} table twice here: once to the skewed values and once to the regular values, and then you are unioning the results back together. So you are incorporating the right-hand side twice, which will cause a nice blow-up if you do this a few times. You may be able to rewrite this without the union by doing something along the lines of {{select * from right left join broadcast(skewed) left join not-skewed}}. > Doing multiple unions on a Dataframe will result in a very inefficient query > plan > - > > Key: SPARK-15326 > URL: https://issues.apache.org/jira/browse/SPARK-15326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Jurriaan Pruis > Attachments: Query Plan.pdf, skewed_join.py, skewed_join_plan.txt > > > While working with a very skewed dataset I noticed that repeated unions on a > dataframe will result in a query plan with 2^(union) - 1 unions. With large > datasets this will be very inefficient. 
> I tried to replicate this behaviour using a PySpark example (I've attached > the output of the explain() to this JIRA): > {code} > df = sqlCtx.range(1000) > def r(name, max_val=100): > return F.round(F.lit(max_val) * F.pow(F.rand(), > 4)).cast('integer').alias(name) > # Create a skewed dataset > skewed = df.select('id', r('a'), r('b'), r('c'), r('d'), r('e'), r('f')) > # Find the skewed values in the dataset > top_10_percent = skewed.freqItems(['a', 'b', 'c', 'd', 'e', 'f'], > 0.10).collect()[0] > def skewjoin(skewed, right, column, freqItems): > freqItems = freqItems[column + '_freqItems'] > skewed = skewed.alias('skewed') > cond = F.col(column).isin(freqItems) > # First broadcast join the frequent (skewed) values > filtered = skewed.filter(cond).join(F.broadcast(right.filter(cond)), > column, 'left_outer') > # Use a regular join for the non skewed values (with big tables this will > use a SortMergeJoin) > non_skewed = skewed.filter(cond == False).join(right.filter(cond == > False), column, 'left_outer') > # join them together and replace the column with the column found in the > right DataFrame > return filtered.unionAll(non_skewed).select('skewed.*', > right['id'].alias(column + '_key')).drop(column) > # Create the dataframes that will be joined to the skewed dataframe > right_size = 100 > df_a = sqlCtx.range(right_size).select('id', F.col('id').alias('a')) > df_b = sqlCtx.range(right_size).select('id', F.col('id').alias('b')) > df_c = sqlCtx.range(right_size).select('id', F.col('id').alias('c')) > df_d = sqlCtx.range(right_size).select('id', F.col('id').alias('d')) > df_e = sqlCtx.range(right_size).select('id', F.col('id').alias('e')) > df_f = sqlCtx.range(right_size).select('id', F.col('id').alias('f')) > # Join everything together > df = skewed > df = skewjoin(df, df_a, 'a', top_10_percent) > df = skewjoin(df, df_b, 'b', top_10_percent) > df = skewjoin(df, df_c, 'c', top_10_percent) > df = skewjoin(df, df_d, 'd', top_10_percent) > df = skewjoin(df, df_e, 
'e', top_10_percent) > df = skewjoin(df, df_f, 'f', top_10_percent) > # df.explain() shows the plan where it does 63 unions > (2^(number_of_skewjoins) - 1) > # which will be very inefficient and slow > df.explain(True) > # Evaluate the plan > # You'd expect this to return 1000, but it does not, it returned 1140 > on my system > # (probably because it will recalculate the random columns? Not sure though) > print(df.count()) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
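The unexpected {{count()}} (1140 instead of 1000) is consistent with the reporter's suspicion that the {{rand()}}-derived columns are re-evaluated independently in the two filter branches of each skewjoin: if a row's value is re-drawn per branch, the row can satisfy both {{cond}} and its negation, or neither. A tiny illustration in plain Python (not Spark code):

```python
# Each tuple holds the value one row's random column takes when it is
# re-evaluated in branch 1 (cond) and in branch 2 (~cond) respectively.
rows = [
    (0.3, 0.7),  # < 0.5 in branch 1 AND >= 0.5 in branch 2: counted twice
    (0.6, 0.2),  # fails the predicate in both branches: never counted
]

branch1 = sum(1 for d1, _ in rows if d1 < 0.5)
branch2 = sum(1 for _, d2 in rows if d2 >= 0.5)
per_row_counts = [(d1 < 0.5) + (d2 >= 0.5) for d1, d2 in rows]

print(branch1 + branch2)  # 2: the total happens to match here, but...
print(per_row_counts)     # [2, 0]: one row was duplicated, one dropped
```

Materializing the random columns once (for example, persisting the skewed dataframe before branching) is the usual way to pin the drawn values, though that is a general suggestion rather than something established in this thread.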
[jira] [Created] (SPARK-16124) Throws exception when executing query on `build/sbt hive/console`
MIN-FU YANG created SPARK-16124: --- Summary: Throws exception when executing query on `build/sbt hive/console` Key: SPARK-16124 URL: https://issues.apache.org/jira/browse/SPARK-16124 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: MIN-FU YANG Priority: Minor When I execute `val query = sql("SELECT * FROM src WHERE key = 92 ")` on hive console which is from `build/sbt hive/console`, It throws exception: org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: file:/Users/xxx/git/spark/sql/hive/target/scala-2.11/spark-hive_2.11-2.0.0-SNAPSHOT.jar!/data/files/kv1.txt; at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:242) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:128) at org.apache.spark.sql.hive.test.TestHiveSparkSession$SqlCmd$$anonfun$cmd$1.apply$mcV$sp(TestHive.scala:192) at org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376) at org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) at org.apache.spark.sql.hive.test.TestHiveSparkSession.loadTestTable(TestHive.scala:376) at org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462) at org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462) at scala.collection.immutable.List.foreach(List.scala:381) at 
org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed$lzycompute(TestHive.scala:462) at org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed(TestHive.scala:450) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682) ... 42 elided -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343029#comment-15343029 ] Wenchen Fan edited comment on SPARK-16032 at 6/22/16 1:15 AM: -- I think it doesn't make sense to use `partitionBy` with `insertInto`, as we cannot map `DataFrameWriter.insertInto` to SQL INSERT for two reasons: 1. `DataFrameWriter` doesn't support static partitions. 2. `DataFrameWriter` specifies the partition columns of the data being inserted, not of the table being inserted into. And it's already broken (mostly) in 1.6: according to the test cases at https://gist.github.com/cloud-fan/14ada3f2b3225b5db52ccaa12aacfbd4 , the only case that seems reasonable in 1.6 is when the data to insert has the same schema as the target table and `partitionBy` specifies the correct partition columns. But I think it's worth breaking this to make the overall semantics clearer. Maybe we are wrong; it would be good if we could come up with clean semantics that explain the behavior of `DataFrame.insertInto`, but after spending a lot of time on it we failed, and that's why we want to make these changes and rush them into 2.0. was (Author: cloud_fan): I think it's nonsense to use `partitionBy` with `insertInto`, as we can not map `DataFrameWriter.insertInto` to SQL INSERT for 2 reasons: 1. `DataFrameWriter` doesn't support static partition 2. `DataFrameWriter` specifies the partition columns of the data to insert, not the table to be inserted. And it's already broken(mostly) in 1.6, according to the test cases at https://gist.github.com/cloud-fan/14ada3f2b3225b5db52ccaa12aacfbd4 , the only case that seems reasonable in 1.6 is when the data to insert has same schema with the table to be inserted and the `partitionBy` specifies the correct partition columns. But I think it's worth to break it and make the overall semantics more clear. 
Maybe we are wrong, it will be good if we come up with a clean semantics to explain the behavior of `DataFrame.insertInto`, but after spent a lot of time on it, we failed, and that's why we wanna make these changes and rush in into 2.0. > Audit semantics of various insertion operations related to partitioned tables > - > > Key: SPARK-16032 > URL: https://issues.apache.org/jira/browse/SPARK-16032 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan >Priority: Critical > Attachments: [SPARK-16032] Spark SQL table insertion auditing - > Google Docs.pdf > > > We found that the semantics of various insertion operations related to partitioned > tables can be inconsistent. This is an umbrella ticket for all related > tickets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
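The second point above can be made concrete: `insertInto` resolves the incoming columns positionally against the target table's schema rather than by name, so the column names `partitionBy` refers to say nothing about where the data actually lands. A hypothetical, Spark-free sketch of positional resolution (the variable names are illustrative):

```python
# Hypothetical illustration: insertInto-style resolution is positional,
# so reordering the incoming columns silently changes which table column
# receives which data; naming partition columns cannot fix that.
table_schema = ["id", "data", "part"]  # target table's column order
incoming     = ["part", "id", "data"]  # DataFrame's column order

# positional mapping: table column -> incoming column that feeds it
positional = dict(zip(table_schema, incoming))
print(positional)  # {'id': 'part', 'data': 'id', 'part': 'data'}
```

Every table column here silently receives the wrong data, which is why mapping this API onto SQL INSERT (where static partitions are named) has no clean semantics.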
[jira] [Assigned] (SPARK-16123) Avoid NegativeArraySizeException while reserving additional capacity in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-16123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16123: Assignee: (was: Apache Spark) > Avoid NegativeArraySizeException while reserving additional capacity in > VectorizedColumnReader > -- > > Key: SPARK-16123 > URL: https://issues.apache.org/jira/browse/SPARK-16123 > Project: Spark > Issue Type: Bug >Reporter: Sameer Agarwal > > Both off-heap and on-heap variants of ColumnVector.reserve() can > unfortunately overflow while reserving additional capacity during reads. > {code} > Caused by: java.lang.NegativeArraySizeException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserveInternal(OnHeapColumnVector.java:461) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserve(OnHeapColumnVector.java:397) > at > org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:675) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:389) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:372) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:194) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:169) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16123) Avoid NegativeArraySizeException while reserving additional capacity in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-16123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16123: Assignee: Apache Spark > Avoid NegativeArraySizeException while reserving additional capacity in > VectorizedColumnReader > -- > > Key: SPARK-16123 > URL: https://issues.apache.org/jira/browse/SPARK-16123 > Project: Spark > Issue Type: Bug >Reporter: Sameer Agarwal >Assignee: Apache Spark > > Both off-heap and on-heap variants of ColumnVector.reserve() can > unfortunately overflow while reserving additional capacity during reads. > {code} > Caused by: java.lang.NegativeArraySizeException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserveInternal(OnHeapColumnVector.java:461) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserve(OnHeapColumnVector.java:397) > at > org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:675) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:389) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:372) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:194) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:169) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16123) Avoid NegativeArraySizeException while reserving additional capacity in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-16123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343139#comment-15343139 ] Apache Spark commented on SPARK-16123: -- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/13832 > Avoid NegativeArraySizeException while reserving additional capacity in > VectorizedColumnReader > -- > > Key: SPARK-16123 > URL: https://issues.apache.org/jira/browse/SPARK-16123 > Project: Spark > Issue Type: Bug >Reporter: Sameer Agarwal > > Both off-heap and on-heap variants of ColumnVector.reserve() can > unfortunately overflow while reserving additional capacity during reads. > {code} > Caused by: java.lang.NegativeArraySizeException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserveInternal(OnHeapColumnVector.java:461) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserve(OnHeapColumnVector.java:397) > at > org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:675) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:389) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:372) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:194) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:169) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16123) Avoid NegativeArraySizeException while reserving additional capacity in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-16123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal updated SPARK-16123: --- Description: Both off-heap and on-heap variants of ColumnVector.reserve() can unfortunately overflow while reserving additional capacity during reads. {code} Caused by: java.lang.NegativeArraySizeException at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserveInternal(OnHeapColumnVector.java:461) at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserve(OnHeapColumnVector.java:397) at org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:675) at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:389) at org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167) at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:372) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:194) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:173) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:169) {code} > Avoid NegativeArraySizeException while reserving additional capacity in > 
VectorizedColumnReader > -- > > Key: SPARK-16123 > URL: https://issues.apache.org/jira/browse/SPARK-16123 > Project: Spark > Issue Type: Bug >Reporter: Sameer Agarwal > > Both off-heap and on-heap variants of ColumnVector.reserve() can > unfortunately overflow while reserving additional capacity during reads. > {code} > Caused by: java.lang.NegativeArraySizeException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserveInternal(OnHeapColumnVector.java:461) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserve(OnHeapColumnVector.java:397) > at > org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:675) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:389) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:372) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:194) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:169) > {code} -- This message was sent by 
Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
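The NegativeArraySizeException above is the classic symptom of a geometric capacity-doubling strategy overflowing Java's 32-bit int: the doubled size wraps negative and array allocation rejects it. A Spark-free Python sketch that emulates the wraparound and shows the usual style of fix (a sketch only, not necessarily the actual patch):

```python
# Emulate Java's 32-bit int arithmetic to show how a capacity-doubling
# strategy (as used when reserving more room in a column vector) can wrap
# negative, which array allocation then rejects with
# NegativeArraySizeException.
INT_MAX = 2**31 - 1

def to_int32(n: int) -> int:
    """Wrap an arbitrary integer into Java's signed 32-bit int range."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n > INT_MAX else n

def naive_new_capacity(required: int) -> int:
    # doubling without an overflow check, as in the reported bug
    return to_int32(required * 2)

def clamped_new_capacity(required: int) -> int:
    # sketch of the usual fix: grow geometrically but never past INT_MAX
    return min(required * 2, INT_MAX)

big = 1_200_000_000  # a large but legal array size
print(naive_new_capacity(big))    # -1894967296: would trigger the exception
print(clamped_new_capacity(big))  # 2147483647
```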
[jira] [Updated] (SPARK-16102) Use Record API from Univocity rather than current data cast API.
[ https://issues.apache.org/jira/browse/SPARK-16102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-16102: - Affects Version/s: 2.0.0 > Use Record API from Univocity rather than current data cast API. > > > Key: SPARK-16102 > URL: https://issues.apache.org/jira/browse/SPARK-16102 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > There is a Record API for the Univocity parser. > This API provides typed data, whereas Spark currently tries to compare and cast each > value. > Using this API should reduce the code in Spark and may improve > performance. > It seems a benchmark should be conducted first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16123) Avoid NegativeArraySizeException while reserving additional capacity in VectorizedColumnReader
Sameer Agarwal created SPARK-16123: -- Summary: Avoid NegativeArraySizeException while reserving additional capacity in VectorizedColumnReader Key: SPARK-16123 URL: https://issues.apache.org/jira/browse/SPARK-16123 Project: Spark Issue Type: Bug Reporter: Sameer Agarwal -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly
[ https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343121#comment-15343121 ] MIN-FU YANG commented on SPARK-14172: - I cannot reproduce it on 1.6.1 either. Could you give a more detailed description? > Hive table partition predicate not passed down correctly > > > Key: SPARK-14172 > URL: https://issues.apache.org/jira/browse/SPARK-14172 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Yingji Zhang >Priority: Critical > > When the Hive SQL contains nondeterministic expressions, the Spark plan will not push > down the partition predicate to the HiveTableScan. For example: > {code} > -- consider the following query, which uses a random function to sample rows > SELECT * > FROM table_a > WHERE partition_col = 'some_value' > AND rand() < 0.01; > {code} > The Spark plan will not push down the partition predicate to the HiveTableScan, > which ends up scanning all partition data from the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14172) Hive table partition predicate not passed down correctly
[ https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343121#comment-15343121 ] MIN-FU YANG edited comment on SPARK-14172 at 6/22/16 12:48 AM: --- I cannot reproduce it on 1.6.1 either. Could you give a more detailed description, or verify it? was (Author: lucasmf): I cannot reproduce it on 1.6.1 either. Could you give more detailed description? > Hive table partition predicate not passed down correctly > > > Key: SPARK-14172 > URL: https://issues.apache.org/jira/browse/SPARK-14172 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Yingji Zhang >Priority: Critical > > When the Hive SQL contains nondeterministic expressions, the Spark plan will not push > down the partition predicate to the HiveTableScan. For example: > {code} > -- consider the following query, which uses a random function to sample rows > SELECT * > FROM table_a > WHERE partition_col = 'some_value' > AND rand() < 0.01; > {code} > The Spark plan will not push down the partition predicate to the HiveTableScan, > which ends up scanning all partition data from the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
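The behavior one would expect from the optimizer is a split of the WHERE conjunction: deterministic conjuncts (such as the partition predicate) get pushed into the scan, while nondeterministic ones ({{rand()}}) stay above it. The report is that the nondeterministic conjunct currently blocks pushing the deterministic one as well. An illustrative, Spark-free sketch of that split:

```python
# Illustrative sketch: split a WHERE conjunction so deterministic conjuncts
# (like partition predicates) can be pushed into the scan while
# nondeterministic ones (rand()) must stay above it.
predicates = [
    ("partition_col = 'some_value'", True),   # deterministic: pushable
    ("rand() < 0.01", False),                 # nondeterministic: not pushable
]

pushed   = [p for p, deterministic in predicates if deterministic]
residual = [p for p, deterministic in predicates if not deterministic]

print(pushed)    # ["partition_col = 'some_value'"]
print(residual)  # ['rand() < 0.01']
```

Until this is fixed, a workaround along these lines is to apply the partition filter in an inner query and the {{rand()}} sampling in an outer query; that is an assumption about a viable rewrite, not something verified in this thread.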
[jira] [Assigned] (SPARK-16119) Support "DROP TABLE ... PURGE" if Hive client supports it
[ https://issues.apache.org/jira/browse/SPARK-16119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16119: Assignee: (was: Apache Spark) > Support "DROP TABLE ... PURGE" if Hive client supports it > - > > Key: SPARK-16119 > URL: https://issues.apache.org/jira/browse/SPARK-16119 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > > There's currently code that explicitly disables the "PURGE" flag when > dropping a table: > {code} > if (ctx.PURGE != null) { > throw operationNotAllowed("DROP TABLE ... PURGE", ctx) > } > {code} > That flag is necessary in certain situations where the table data cannot be > moved to the trash (which will be tried unless "PURGE" is requested). If the > client supports it (Hive >= 0.14.0 according to the Hive docs), we should > allow that option to be defined. > For non-Hive tables, as far as I can understand, "PURGE" is the current > behavior of Spark. > The same limitation exists currently for "ALTER TABLE ... DROP PARTITION" so > should probably also be covered by this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16119) Support "DROP TABLE ... PURGE" if Hive client supports it
[ https://issues.apache.org/jira/browse/SPARK-16119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343106#comment-15343106 ] Apache Spark commented on SPARK-16119: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/13831 > Support "DROP TABLE ... PURGE" if Hive client supports it > - > > Key: SPARK-16119 > URL: https://issues.apache.org/jira/browse/SPARK-16119 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > > There's currently code that explicitly disables the "PURGE" flag when > dropping a table: > {code} > if (ctx.PURGE != null) { > throw operationNotAllowed("DROP TABLE ... PURGE", ctx) > } > {code} > That flag is necessary in certain situations where the table data cannot be > moved to the trash (which will be tried unless "PURGE" is requested). If the > client supports it (Hive >= 0.14.0 according to the Hive docs), we should > allow that option to be defined. > For non-Hive tables, as far as I can understand, "PURGE" is the current > behavior of Spark. > The same limitation exists currently for "ALTER TABLE ... DROP PARTITION" so > should probably also be covered by this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16119) Support "DROP TABLE ... PURGE" if Hive client supports it
[ https://issues.apache.org/jira/browse/SPARK-16119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16119: Assignee: Apache Spark > Support "DROP TABLE ... PURGE" if Hive client supports it > - > > Key: SPARK-16119 > URL: https://issues.apache.org/jira/browse/SPARK-16119 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > There's currently code that explicitly disables the "PURGE" flag when > dropping a table: > {code} > if (ctx.PURGE != null) { > throw operationNotAllowed("DROP TABLE ... PURGE", ctx) > } > {code} > That flag is necessary in certain situations where the table data cannot be > moved to the trash (which will be tried unless "PURGE" is requested). If the > client supports it (Hive >= 0.14.0 according to the Hive docs), we should > allow that option to be defined. > For non-Hive tables, as far as I can understand, "PURGE" is the current > behavior of Spark. > The same limitation exists currently for "ALTER TABLE ... DROP PARTITION" so > should probably also be covered by this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16121) ListingFileCatalog does not list in parallel anymore
[ https://issues.apache.org/jira/browse/SPARK-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16121: Assignee: Apache Spark > ListingFileCatalog does not list in parallel anymore > > > Key: SPARK-16121 > URL: https://issues.apache.org/jira/browse/SPARK-16121 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Blocker > > In ListingFileCatalog, the implementation of {{listLeafFiles}} is shown > below. When the number of user-provided paths is less than the value of > {{sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold}}, we > will not use parallel listing, which is different from what 1.6 does (for > 1.6, if the number of children of any inner dir is larger than the threshold, > we will use the parallel listing). > {code} > protected def listLeafFiles(paths: Seq[Path]): > mutable.LinkedHashSet[FileStatus] = { > if (paths.length >= > sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) { > HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, > sparkSession) > } else { > // Dummy jobconf to get to the pathFilter defined in configuration > val jobConf = new JobConf(hadoopConf, this.getClass) > val pathFilter = FileInputFormat.getInputPathFilter(jobConf) > val statuses: Seq[FileStatus] = paths.flatMap { path => > val fs = path.getFileSystem(hadoopConf) > logInfo(s"Listing $path on driver") > Try { > HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), > pathFilter) > }.getOrElse(Array.empty[FileStatus]) > } > mutable.LinkedHashSet(statuses: _*) > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
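The behavioral difference described above — 2.0 decides on parallel listing from the number of top-level paths only, while 1.6 also went parallel when any inner directory had more children than the threshold — can be sketched in plain Python (directory dictionaries stand in for HDFS paths; the function names are invented for illustration):

```python
# Sketch: contrast the two listing strategies. In 2.0's ListingFileCatalog
# only the count of user-provided paths is compared to the threshold; in
# 1.6 the width of any inner directory could also trigger parallel listing.
# Names and the tree representation are illustrative.

def should_parallelize_v20(paths, threshold):
    return len(paths) >= threshold

def should_parallelize_v16(paths, children_of, threshold):
    # children_of maps a directory to its child directories.
    def too_wide(p):
        kids = children_of.get(p, [])
        return len(kids) > threshold or any(too_wide(k) for k in kids)
    return len(paths) > threshold or any(too_wide(p) for p in paths)

# One table root with 100 partition directories underneath it.
tree = {"/table": [f"/table/part={i}" for i in range(100)]}
```

With a single table root and 100 partition directories, the 2.0 logic lists sequentially while the 1.6 logic would have gone parallel — which is exactly the regression this issue reports.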
[jira] [Assigned] (SPARK-16121) ListingFileCatalog does not list in parallel anymore
[ https://issues.apache.org/jira/browse/SPARK-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16121: Assignee: (was: Apache Spark) > ListingFileCatalog does not list in parallel anymore > > > Key: SPARK-16121 > URL: https://issues.apache.org/jira/browse/SPARK-16121 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Blocker > > In ListingFileCatalog, the implementation of {{listLeafFiles}} is shown > below. When the number of user-provided paths is less than the value of > {{sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold}}, we > will not use parallel listing, which is different from what 1.6 does (for > 1.6, if the number of children of any inner dir is larger than the threshold, > we will use the parallel listing). > {code} > protected def listLeafFiles(paths: Seq[Path]): > mutable.LinkedHashSet[FileStatus] = { > if (paths.length >= > sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) { > HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, > sparkSession) > } else { > // Dummy jobconf to get to the pathFilter defined in configuration > val jobConf = new JobConf(hadoopConf, this.getClass) > val pathFilter = FileInputFormat.getInputPathFilter(jobConf) > val statuses: Seq[FileStatus] = paths.flatMap { path => > val fs = path.getFileSystem(hadoopConf) > logInfo(s"Listing $path on driver") > Try { > HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), > pathFilter) > }.getOrElse(Array.empty[FileStatus]) > } > mutable.LinkedHashSet(statuses: _*) > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16121) ListingFileCatalog does not list in parallel anymore
[ https://issues.apache.org/jira/browse/SPARK-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343069#comment-15343069 ] Apache Spark commented on SPARK-16121: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/13830 > ListingFileCatalog does not list in parallel anymore > > > Key: SPARK-16121 > URL: https://issues.apache.org/jira/browse/SPARK-16121 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Blocker > > In ListingFileCatalog, the implementation of {{listLeafFiles}} is shown > below. When the number of user-provided paths is less than the value of > {{sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold}}, we > will not use parallel listing, which is different from what 1.6 does (for > 1.6, if the number of children of any inner dir is larger than the threshold, > we will use the parallel listing). > {code} > protected def listLeafFiles(paths: Seq[Path]): > mutable.LinkedHashSet[FileStatus] = { > if (paths.length >= > sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) { > HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, > sparkSession) > } else { > // Dummy jobconf to get to the pathFilter defined in configuration > val jobConf = new JobConf(hadoopConf, this.getClass) > val pathFilter = FileInputFormat.getInputPathFilter(jobConf) > val statuses: Seq[FileStatus] = paths.flatMap { path => > val fs = path.getFileSystem(hadoopConf) > logInfo(s"Listing $path on driver") > Try { > HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), > pathFilter) > }.getOrElse(Array.empty[FileStatus]) > } > mutable.LinkedHashSet(statuses: _*) > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten
[ https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16071: Assignee: Apache Spark > Not sufficient array size checks to avoid integer overflows in Tungsten > --- > > Key: SPARK-16071 > URL: https://issues.apache.org/jira/browse/SPARK-16071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark >Priority: Critical > > Several bugs have been found caused by integer overflows in Tungsten. This > JIRA is for taking a final pass before 2.0 release to reduce potential bugs > and issues. We should do at least the following: > * Raise exception early instead of later throwing NegativeArraySize (which is > slow and might cause silent errors) > * Document clearly the largest array size we support in DataFrames. > To reproduce one of the issues: > {code} > val n = 1e8.toInt // try 2e8, 3e8 > sc.parallelize(0 until 1, 1).map(i => new > Array[Int](n)).toDS.map(_.size).show() > {code} > Result: > * n=1e8: correct but slow (see SPARK-16043) > * n=2e8: NegativeArraySize exception > {code:none} > java.lang.NegativeArraySizeException > at > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123) > at > org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > 
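The NegativeArraySizeException above comes from computing a buffer size in 32-bit signed arithmetic: once the required byte count passes 2^31 - 1, it wraps negative and the JVM's array allocation fails. A Python sketch of the wraparound and of the early check the issue proposes (the element sizes are illustrative; the real arithmetic happens inside code like BufferHolder.grow):

```python
# Sketch: emulate Java's 32-bit signed int to show how a large buffer
# size wraps negative, and raise a descriptive error early instead of
# letting the JVM throw NegativeArraySizeException later.

JVM_INT_MAX = 2**31 - 1  # largest Java int, and an upper bound on array length

def to_int32(x):
    # Wrap x the way Java int arithmetic would.
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

def checked_buffer_size(num_elements, bytes_per_element):
    needed = num_elements * bytes_per_element  # exact in Python
    if needed > JVM_INT_MAX:
        # Early, descriptive failure instead of a negative array size.
        raise ValueError(f"buffer of {needed} bytes exceeds JVM int limit")
    return needed

# 3e8 elements at 8 bytes each overflows a Java int and goes negative:
wrapped = to_int32(300_000_000 * 8)
```

This is the "raise exception early" item from the description: validate the exact size before allocation rather than after 32-bit arithmetic has already wrapped.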
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > * n=3e8: NegativeArraySize exception but raised at a different location > {code:none} > java.lang.RuntimeException: Error while encoding: > java.lang.NegativeArraySizeException > newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS > value#108 > +- newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) >+- input[0, [I, true] > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:257) > at > org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) > at > org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > 
at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at >
[jira] [Assigned] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten
[ https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16071: Assignee: (was: Apache Spark) > Not sufficient array size checks to avoid integer overflows in Tungsten > --- > > Key: SPARK-16071 > URL: https://issues.apache.org/jira/browse/SPARK-16071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Critical > > Several bugs have been found caused by integer overflows in Tungsten. This > JIRA is for taking a final pass before 2.0 release to reduce potential bugs > and issues. We should do at least the following: > * Raise exception early instead of later throwing NegativeArraySize (which is > slow and might cause silent errors) > * Document clearly the largest array size we support in DataFrames. > To reproduce one of the issues: > {code} > val n = 1e8.toInt // try 2e8, 3e8 > sc.parallelize(0 until 1, 1).map(i => new > Array[Int](n)).toDS.map(_.size).show() > {code} > Result: > * n=1e8: correct but slow (see SPARK-16043) > * n=2e8: NegativeArraySize exception > {code:none} > java.lang.NegativeArraySizeException > at > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123) > at > org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > * n=3e8: NegativeArraySize exception but raised at a different location > {code:none} > java.lang.RuntimeException: Error while encoding: > java.lang.NegativeArraySizeException > newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS > value#108 > +- newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) >+- input[0, [I, true] > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:257) > at > org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) > at > org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > 
at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at >
[jira] [Commented] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten
[ https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343067#comment-15343067 ] Apache Spark commented on SPARK-16071: -- User 'clockfly' has created a pull request for this issue: https://github.com/apache/spark/pull/13829 > Not sufficient array size checks to avoid integer overflows in Tungsten > --- > > Key: SPARK-16071 > URL: https://issues.apache.org/jira/browse/SPARK-16071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Critical > > Several bugs have been found caused by integer overflows in Tungsten. This > JIRA is for taking a final pass before 2.0 release to reduce potential bugs > and issues. We should do at least the following: > * Raise exception early instead of later throwing NegativeArraySize (which is > slow and might cause silent errors) > * Document clearly the largest array size we support in DataFrames. > To reproduce one of the issues: > {code} > val n = 1e8.toInt // try 2e8, 3e8 > sc.parallelize(0 until 1, 1).map(i => new > Array[Int](n)).toDS.map(_.size).show() > {code} > Result: > * n=1e8: correct but slow (see SPARK-16043) > * n=2e8: NegativeArraySize exception > {code:none} > java.lang.NegativeArraySizeException > at > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123) > at > org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > * n=3e8: NegativeArraySize exception but raised at a different location > {code:none} > java.lang.RuntimeException: Error while encoding: > java.lang.NegativeArraySizeException > newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS > value#108 > +- newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) >+- input[0, [I, true] > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:257) > at > org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) > at > org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at >
[jira] [Created] (SPARK-16122) Spark History Server REST API missing an environment endpoint per application
Neelesh Srinivas Salian created SPARK-16122: --- Summary: Spark History Server REST API missing an environment endpoint per application Key: SPARK-16122 URL: https://issues.apache.org/jira/browse/SPARK-16122 Project: Spark Issue Type: New Feature Components: Documentation, Web UI Affects Versions: 1.6.1 Reporter: Neelesh Srinivas Salian Priority: Minor The Web UI for the Spark History Server has an Environment tab that allows you to view the environment for that application, with runtime information, Spark properties, etc. How about adding an endpoint to the REST API that points to this Environment tab for that application? /applications/[app-id]/environment I added the Documentation component too, so that a subsequent documentation change can get this included in the API docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
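For context, a hedged sketch of how a client would address the proposed endpoint. The endpoint does not exist yet; the URL shape merely follows the existing History Server REST API layout under /api/v1:

```python
# Sketch: build the URL for the *proposed* per-application environment
# endpoint. Purely illustrative -- the endpoint is the subject of this
# feature request, not an existing API.

def environment_url(history_server, app_id):
    return f"{history_server}/api/v1/applications/{app_id}/environment"

# Hypothetical History Server address and application id:
url = environment_url("http://localhost:18080", "app-20160621-0001")
```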
[jira] [Commented] (SPARK-15643) ML 2.0 QA: migration guide update
[ https://issues.apache.org/jira/browse/SPARK-15643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343059#comment-15343059 ] Joseph K. Bradley commented on SPARK-15643: --- A few more deprecations to add from current PRs: * [https://github.com/apache/spark/pull/13823] * [https://github.com/apache/spark/pull/13380] > ML 2.0 QA: migration guide update > - > > Key: SPARK-15643 > URL: https://issues.apache.org/jira/browse/SPARK-15643 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, MLlib >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Blocker > > Update spark.ml and spark.mllib migration guide from 1.6 to 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343036#comment-15343036 ] Wenchen Fan commented on SPARK-16032: - [~rdblue] I think the biggest problem is that we don't have by-name insertion for Hive tables now. Although this feature is not in 1.6, we did have it before our recent commits. Does it make more sense to support Hive tables in `DataFrameWriter.saveAsTable` before 2.0? > Audit semantics of various insertion operations related to partitioned tables > - > > Key: SPARK-16032 > URL: https://issues.apache.org/jira/browse/SPARK-16032 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan >Priority: Critical > Attachments: [SPARK-16032] Spark SQL table insertion auditing - > Google Docs.pdf > > > We found that semantics of various insertion operations related to partition > tables can be inconsistent. This is an umbrella ticket for all related > tickets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343029#comment-15343029 ] Wenchen Fan commented on SPARK-16032: - I think it makes no sense to use `partitionBy` with `insertInto`, as we cannot map `DataFrameWriter.insertInto` to SQL INSERT, for 2 reasons: 1. `DataFrameWriter` doesn't support static partitions. 2. `DataFrameWriter` specifies the partition columns of the data to insert, not those of the table being inserted into. And it's already (mostly) broken in 1.6: according to the test cases at https://gist.github.com/cloud-fan/14ada3f2b3225b5db52ccaa12aacfbd4 , the only case that seems reasonable in 1.6 is when the data to insert has the same schema as the target table and `partitionBy` specifies the correct partition columns. But I think it's worth breaking it to make the overall semantics clearer. Maybe we are wrong; it would be good if we could come up with clean semantics that explain the behavior of `DataFrame.insertInto`, but after spending a lot of time on it we failed, and that's why we want to make these changes and rush them into 2.0. > Audit semantics of various insertion operations related to partitioned tables > - > > Key: SPARK-16032 > URL: https://issues.apache.org/jira/browse/SPARK-16032 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan >Priority: Critical > Attachments: [SPARK-16032] Spark SQL table insertion auditing - > Google Docs.pdf > > > We found that semantics of various insertion operations related to partition > tables can be inconsistent. This is an umbrella ticket for all related > tickets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
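The distinction the thread is debating — positional column resolution for `insertInto` versus by-name resolution — can be sketched in plain Python (dictionaries stand in for DataFrames; none of this is Spark's actual implementation):

```python
# Sketch: positional vs by-name column resolution when writing rows into
# a table whose partition column comes last. Illustrative only; the real
# resolution happens inside Spark SQL's analyzer.

table_cols = ["a", "b", "part"]  # hypothetical target table schema

def insert_by_position(df_cols, row):
    # insertInto-style: trust the order of the incoming columns,
    # ignoring their names entirely.
    return dict(zip(table_cols, row))

def insert_by_name(df_cols, row):
    # by-name style: match incoming columns to table columns by name.
    by_name = dict(zip(df_cols, row))
    return {c: by_name[c] for c in table_cols}

# Incoming data ordered (part, a, b) -- a different order than the table.
row = (2016, 1, "x")
positional = insert_by_position(["part", "a", "b"], row)
named = insert_by_name(["part", "a", "b"], row)
```

Positional resolution silently misplaces every value when the incoming order differs from the table's, while by-name resolution lands each value in the right column — which is why losing by-name insertion for Hive tables is the sticking point here.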
[jira] [Resolved] (SPARK-16117) Hide LibSVMFileFormat in public API docs
[ https://issues.apache.org/jira/browse/SPARK-16117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16117. - Resolution: Fixed Fix Version/s: 2.0.0 > Hide LibSVMFileFormat in public API docs > > > Key: SPARK-16117 > URL: https://issues.apache.org/jira/browse/SPARK-16117 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 2.0.0 > > > LibSVMFileFormat implements data source for LIBSVM format. However, users do > not need to call its APIs to use it. So we should hide it in the public API > docs. The main issue is that we still need to put the documentation and > example code somewhere. The proposal it to have a dummy object to hold the > documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16118) getDropLast is missing in OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-16118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-16118. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13821 [https://github.com/apache/spark/pull/13821] > getDropLast is missing in OneHotEncoder > --- > > Key: SPARK-16118 > URL: https://issues.apache.org/jira/browse/SPARK-16118 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 2.0.0 > > > We forgot the getter of dropLast in OneHotEncoder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16119) Support "DROP TABLE ... PURGE" if Hive client supports it
[ https://issues.apache.org/jira/browse/SPARK-16119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-16119: --- Description: There's currently code that explicitly disables the "PURGE" flag when dropping a table: {code} if (ctx.PURGE != null) { throw operationNotAllowed("DROP TABLE ... PURGE", ctx) } {code} That flag is necessary in certain situations where the table data cannot be moved to the trash (which will be tried unless "PURGE" is requested). If the client supports it (Hive >= 0.14.0 according to the Hive docs), we should allow that option to be defined. For non-Hive tables, as far as I can understand, "PURGE" is the current behavior of Spark. The same limitation exists currently for "ALTER TABLE ... DROP PARTITION" so should probably also be covered by this change. was: There's currently code that explicitly disables the "PURGE" flag when dropping a table: {code} if (ctx.PURGE != null) { throw operationNotAllowed("DROP TABLE ... PURGE", ctx) } {code} That flag is necessary in certain situations where the table data cannot be moved to the trash (which will be tried unless "PURGE" is requested). If the client supports it (Hive >= 0.14.0 according to the Hive docs), we should allow that option to be defined. For non-Hive tables, as far as I can understand, "PURGE" is the current behavior of Spark. The same limitation exists currently for "ALTER TABLE ... DROP PARTITION" so should probably be covered. > Support "DROP TABLE ... PURGE" if Hive client supports it > - > > Key: SPARK-16119 > URL: https://issues.apache.org/jira/browse/SPARK-16119 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > > There's currently code that explicitly disables the "PURGE" flag when > dropping a table: > {code} > if (ctx.PURGE != null) { > throw operationNotAllowed("DROP TABLE ... 
PURGE", ctx) > } > {code} > That flag is necessary in certain situations where the table data cannot be > moved to the trash (which will be tried unless "PURGE" is requested). If the > client supports it (Hive >= 0.14.0 according to the Hive docs), we should > allow that option to be defined. > For non-Hive tables, as far as I can understand, "PURGE" is the current > behavior of Spark. > The same limitation exists currently for "ALTER TABLE ... DROP PARTITION" so > should probably also be covered by this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342963#comment-15342963 ] Ryan Blue commented on SPARK-16032: --- I'm referring to disabling the use of {{partitionBy}} with {{insertInto}} and removing support for {{saveAsTable}} (from the doc: "I think it's fine we drop the support in 2.0"). In 1.6, {{partitionBy}} can be used to set up partition columns as they are expected by {{insertInto}}. What change caused the confusing and inconsistent behavior? Before this set of changes, {{partitionBy}} was validated against the table's partitioning (at least for Hive) like this change set suggests doing when it is used with {{saveAsTable}}. The insert SQL case is inconsistent, but there are other ways to solve that problem. bq. I do not think we should ship 2.0 without fixing these behaviors and try to fix them in future releases (the fix will possibly change the behaviors again). I know that we want to get this out, but I don't think it is a good idea to put it in 2.0 before it's ready. This codifies that the "correct" way to write to a Hive table is to put the partition columns at the end rather than explicitly marking them, and it disallows marking those columns as you would using "PARTITION" in SQL. That's going to break jobs and I'm not confident that it's the right choice. > Audit semantics of various insertion operations related to partitioned tables > - > > Key: SPARK-16032 > URL: https://issues.apache.org/jira/browse/SPARK-16032 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan >Priority: Critical > Attachments: [SPARK-16032] Spark SQL table insertion auditing - > Google Docs.pdf > > > We found that semantics of various insertion operations related to partition > tables can be inconsistent. This is an umbrella ticket for all related > tickets.
[jira] [Commented] (SPARK-16075) Make VectorUDT/MatrixUDT singleton under spark.ml package
[ https://issues.apache.org/jira/browse/SPARK-16075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342930#comment-15342930 ] Miao Wang commented on SPARK-16075: --- I will follow up on this one. Thanks! > Make VectorUDT/MatrixUDT singleton under spark.ml package > - > > Key: SPARK-16075 > URL: https://issues.apache.org/jira/browse/SPARK-16075 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > Both VectorUDT and MatrixUDT are implemented as normal classes and there > could be multiple instances of them, which makes equality checking and > pattern matching harder to implement. Even though the APIs are private, switching to > a singleton pattern could simplify the development. > Required changes: > * singleton VectorUDT/MatrixUDT (created by VectorUDT.getOrCreate) > * update UDTRegistration > * update code generation to support singleton UDTs > * update existing code to use getOrCreate
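The proposed {{getOrCreate}} singleton pattern can be sketched in Python (a hypothetical simplification; the actual change is to the Scala UDT classes under spark.ml):

```python
class VectorUDT:
    """Sketch of a singleton UDT: one shared instance obtained via get_or_create."""
    _instance = None

    @classmethod
    def get_or_create(cls):
        # Lazily create the single shared instance on first use.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

# With a singleton, equality checking and pattern matching reduce to an
# identity check on the one shared instance:
assert VectorUDT.get_or_create() is VectorUDT.get_or_create()
```

With multiple freely constructed instances, code has to compare UDTs structurally; the singleton makes identity comparison sufficient, which is what simplifies code generation and {{UDTRegistration}}.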
[jira] [Assigned] (SPARK-16106) TaskSchedulerImpl does not correctly handle new executors on existing hosts
[ https://issues.apache.org/jira/browse/SPARK-16106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16106: Assignee: Apache Spark > TaskSchedulerImpl does not correctly handle new executors on existing hosts > --- > > Key: SPARK-16106 > URL: https://issues.apache.org/jira/browse/SPARK-16106 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.0 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Trivial > > The TaskSchedulerImpl updates the set of executors and hosts in each call to > {{resourceOffers}}. During this call, it also tracks whether there are any > new executors observed in {{newExecAvail}}: > {code} > executorIdToHost(o.executorId) = o.host > executorIdToTaskCount.getOrElseUpdate(o.executorId, 0) > if (!executorsByHost.contains(o.host)) { > executorsByHost(o.host) = new HashSet[String]() > executorAdded(o.executorId, o.host) > newExecAvail = true > } > {code} > However, this only detects when a new *host* is added, not when an additional > executor is added to an existing host (a relatively common event in dynamic > allocation). > The end result is that task locality and {{failedEpochs}} are not updated > correctly for new executors.
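The missed case can be illustrated with a small Python analogue of the tracking logic (names are illustrative, not Spark's actual scheduler code): the fix is to check per-executor membership rather than only per-host membership.

```python
def resource_offers(executors_by_host, offers):
    """Track executors per host and flag when any *executor* is new,
    not just when a new host appears (sketch of the reported fix).

    executors_by_host: dict mapping host -> set of executor ids
    offers: list of (executor_id, host) pairs
    """
    new_exec_avail = False
    for executor_id, host in offers:
        host_execs = executors_by_host.setdefault(host, set())
        if executor_id not in host_execs:
            # New executor detected, even when the host was already known --
            # the original code only fired on a previously unseen host.
            host_execs.add(executor_id)
            new_exec_avail = True
    return new_exec_avail
```

A second executor arriving on an already-known host (common under dynamic allocation) now flips the flag, whereas the per-host check in the quoted snippet would silently ignore it.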
[jira] [Assigned] (SPARK-16106) TaskSchedulerImpl does not correctly handle new executors on existing hosts
[ https://issues.apache.org/jira/browse/SPARK-16106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16106: Assignee: (was: Apache Spark) > TaskSchedulerImpl does not correctly handle new executors on existing hosts > --- > > Key: SPARK-16106 > URL: https://issues.apache.org/jira/browse/SPARK-16106 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.0 >Reporter: Imran Rashid >Priority: Trivial > > The TaskSchedulerImpl updates the set of executors and hosts in each call to > {{resourceOffers}}. During this call, it also tracks whether there are any > new executors observed in {{newExecAvail}}: > {code} > executorIdToHost(o.executorId) = o.host > executorIdToTaskCount.getOrElseUpdate(o.executorId, 0) > if (!executorsByHost.contains(o.host)) { > executorsByHost(o.host) = new HashSet[String]() > executorAdded(o.executorId, o.host) > newExecAvail = true > } > {code} > However, this only detects when a new *host* is added, not when an additional > executor is added to an existing host (a relatively common event in dynamic > allocation). > The end result is that task locality and {{failedEpochs}} are not updated > correctly for new executors.
[jira] [Commented] (SPARK-16106) TaskSchedulerImpl does not correctly handle new executors on existing hosts
[ https://issues.apache.org/jira/browse/SPARK-16106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342926#comment-15342926 ] Apache Spark commented on SPARK-16106: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/13826 > TaskSchedulerImpl does not correctly handle new executors on existing hosts > --- > > Key: SPARK-16106 > URL: https://issues.apache.org/jira/browse/SPARK-16106 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.0 >Reporter: Imran Rashid >Priority: Trivial > > The TaskSchedulerImpl updates the set of executors and hosts in each call to > {{resourceOffers}}. During this call, it also tracks whether there are any > new executors observed in {{newExecAvail}}: > {code} > executorIdToHost(o.executorId) = o.host > executorIdToTaskCount.getOrElseUpdate(o.executorId, 0) > if (!executorsByHost.contains(o.host)) { > executorsByHost(o.host) = new HashSet[String]() > executorAdded(o.executorId, o.host) > newExecAvail = true > } > {code} > However, this only detects when a new *host* is added, not when an additional > executor is added to an existing host (a relatively common event in dynamic > allocation). > The end result is that task locality and {{failedEpochs}} are not updated > correctly for new executors.
[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly
[ https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342917#comment-15342917 ] MIN-FU YANG commented on SPARK-14172: - Hi, I cannot reproduce the problem in the master branch. Maybe it's resolved in a newer version. I'll try to reproduce it in 1.6.1 > Hive table partition predicate not passed down correctly > > > Key: SPARK-14172 > URL: https://issues.apache.org/jira/browse/SPARK-14172 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Yingji Zhang >Priority: Critical > > When the Hive SQL contains nondeterministic fields, the Spark plan will not push > down the partition predicate to the HiveTableScan. For example: > {code} > -- consider following query which uses a random function to sample rows > SELECT * > FROM table_a > WHERE partition_col = 'some_value' > AND rand() < 0.01; > {code} > The Spark plan will not push down the partition predicate to HiveTableScan > which ends up scanning all partitions data from the table.
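The expected behavior — pruning partitions with the deterministic conjuncts while keeping nondeterministic ones (like {{rand() < 0.01}}) above the scan — can be modeled with a toy Python sketch (an illustrative model with hypothetical names, not Spark's optimizer):

```python
from dataclasses import dataclass

@dataclass
class Predicate:
    expr: str
    deterministic: bool        # e.g. rand() < 0.01 is nondeterministic
    is_partition_pred: bool    # references only partition columns

def pushable(predicates):
    # Only deterministic predicates on partition columns may prune partitions.
    # Nondeterministic conjuncts must stay above the scan, but their presence
    # should not block pushing the deterministic ones down.
    return [p for p in predicates if p.deterministic and p.is_partition_pred]

preds = [
    Predicate("partition_col = 'some_value'", True, True),
    Predicate("rand() < 0.01", False, False),
]
```

The reported bug is the opposite: one nondeterministic conjunct in the WHERE clause prevents the whole predicate set from reaching the HiveTableScan, so every partition gets scanned.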
[jira] [Created] (SPARK-16121) ListingFileCatalog does not list in parallel anymore
Yin Huai created SPARK-16121: Summary: ListingFileCatalog does not list in parallel anymore Key: SPARK-16121 URL: https://issues.apache.org/jira/browse/SPARK-16121 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker In ListingFileCatalog, the implementation of {{listLeafFiles}} is shown below. When the number of user-provided paths is less than the value of {{sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold}}, we will not use parallel listing, which is different from what 1.6 does (for 1.6, if the number of children of any inner dir is larger than the threshold, we will use the parallel listing). {code} protected def listLeafFiles(paths: Seq[Path]): mutable.LinkedHashSet[FileStatus] = { if (paths.length >= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) { HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sparkSession) } else { // Dummy jobconf to get to the pathFilter defined in configuration val jobConf = new JobConf(hadoopConf, this.getClass) val pathFilter = FileInputFormat.getInputPathFilter(jobConf) val statuses: Seq[FileStatus] = paths.flatMap { path => val fs = path.getFileSystem(hadoopConf) logInfo(s"Listing $path on driver") Try { HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), pathFilter) }.getOrElse(Array.empty[FileStatus]) } mutable.LinkedHashSet(statuses: _*) } } {code}
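The behavioral difference described above can be modeled in Python: the 2.0 code decides once from the count of user-provided top-level paths, while the 1.6 behavior triggered parallel listing whenever any directory's fan-out reached the threshold during the recursive walk (a hypothetical sketch, not the real listing code):

```python
def should_parallelize_2_0(paths, threshold):
    # 2.0 behavior per the snippet: decided once, from the number of
    # user-provided paths only.
    return len(paths) >= threshold

def should_parallelize_1_6(tree, threshold):
    # 1.6 behavior as described in the report: parallel listing kicks in if
    # ANY directory's child count reaches the threshold while walking.
    # `tree` maps a directory path to its list of child entries.
    return any(len(children) >= threshold for children in tree.values())

# One top-level path, but 100 partition directories under it: 2.0 lists
# serially on the driver, while 1.6 would have switched to parallel listing.
tree = {"/table": ["/table/p=%d" % i for i in range(100)]}
```

This is why a single table path with many partitions regresses: `paths.length` is 1, so the parallel branch is never taken no matter how wide the directory tree is.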
[jira] [Commented] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342856#comment-15342856 ] Yin Huai commented on SPARK-16032: -- Regarding {{disabling Hive features}}, can you be more specific? Looking at the summary, we actually changed the behaviors from 1.6, which caused confusing and inconsistent behaviors. I do not think we should ship 2.0 without fixing these behaviors and try to fix them in future releases (the fix will possibly change the behaviors again). > Audit semantics of various insertion operations related to partitioned tables > - > > Key: SPARK-16032 > URL: https://issues.apache.org/jira/browse/SPARK-16032 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan >Priority: Blocker > Attachments: [SPARK-16032] Spark SQL table insertion auditing - > Google Docs.pdf > > > We found that semantics of various insertion operations related to partition > tables can be inconsistent. This is an umbrella ticket for all related > tickets.
[jira] [Updated] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16032: - Priority: Critical (was: Blocker) > Audit semantics of various insertion operations related to partitioned tables > - > > Key: SPARK-16032 > URL: https://issues.apache.org/jira/browse/SPARK-16032 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan >Priority: Critical > Attachments: [SPARK-16032] Spark SQL table insertion auditing - > Google Docs.pdf > > > We found that semantics of various insertion operations related to partition > tables can be inconsistent. This is an umbrella ticket for all related > tickets.
[jira] [Assigned] (SPARK-16120) getCurrentLogFiles method in ReceiverSuite "WAL - generating and cleaning" case uses external variable instead of the passed parameter
[ https://issues.apache.org/jira/browse/SPARK-16120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16120: Assignee: (was: Apache Spark) > getCurrentLogFiles method in ReceiverSuite "WAL - generating and cleaning" > case uses external variable instead of the passed parameter > -- > > Key: SPARK-16120 > URL: https://issues.apache.org/jira/browse/SPARK-16120 > Project: Spark > Issue Type: Test > Components: Streaming, Tests >Reporter: Ahmed Mahran >Priority: Trivial > Labels: easyfix, newbie, test > > In ReceiverSuite.scala, in the test case "write ahead log - generating and > cleaning", the inner method getCurrentLogFiles uses external variable > logDirectory1 instead of the passed parameter logDirectory > {code} > def getCurrentLogFiles(logDirectory: File): Seq[String] = { > try { > if (logDirectory.exists()) { > logDirectory1.listFiles().filter { _.getName.startsWith("log") }.map { > _.toString } > } else { > Seq.empty > } > } catch { > case e: Exception => > Seq.empty > } > } > {code}
[jira] [Assigned] (SPARK-16120) getCurrentLogFiles method in ReceiverSuite "WAL - generating and cleaning" case uses external variable instead of the passed parameter
[ https://issues.apache.org/jira/browse/SPARK-16120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16120: Assignee: Apache Spark > getCurrentLogFiles method in ReceiverSuite "WAL - generating and cleaning" > case uses external variable instead of the passed parameter > -- > > Key: SPARK-16120 > URL: https://issues.apache.org/jira/browse/SPARK-16120 > Project: Spark > Issue Type: Test > Components: Streaming, Tests >Reporter: Ahmed Mahran >Assignee: Apache Spark >Priority: Trivial > Labels: easyfix, newbie, test > > In ReceiverSuite.scala, in the test case "write ahead log - generating and > cleaning", the inner method getCurrentLogFiles uses external variable > logDirectory1 instead of the passed parameter logDirectory > {code} > def getCurrentLogFiles(logDirectory: File): Seq[String] = { > try { > if (logDirectory.exists()) { > logDirectory1.listFiles().filter { _.getName.startsWith("log") }.map { > _.toString } > } else { > Seq.empty > } > } catch { > case e: Exception => > Seq.empty > } > } > {code}
[jira] [Commented] (SPARK-16120) getCurrentLogFiles method in ReceiverSuite "WAL - generating and cleaning" case uses external variable instead of the passed parameter
[ https://issues.apache.org/jira/browse/SPARK-16120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342852#comment-15342852 ] Apache Spark commented on SPARK-16120: -- User 'ahmed-mahran' has created a pull request for this issue: https://github.com/apache/spark/pull/13825 > getCurrentLogFiles method in ReceiverSuite "WAL - generating and cleaning" > case uses external variable instead of the passed parameter > -- > > Key: SPARK-16120 > URL: https://issues.apache.org/jira/browse/SPARK-16120 > Project: Spark > Issue Type: Test > Components: Streaming, Tests >Reporter: Ahmed Mahran >Priority: Trivial > Labels: easyfix, newbie, test > > In ReceiverSuite.scala, in the test case "write ahead log - generating and > cleaning", the inner method getCurrentLogFiles uses external variable > logDirectory1 instead of the passed parameter logDirectory > {code} > def getCurrentLogFiles(logDirectory: File): Seq[String] = { > try { > if (logDirectory.exists()) { > logDirectory1.listFiles().filter { _.getName.startsWith("log") }.map { > _.toString } > } else { > Seq.empty > } > } catch { > case e: Exception => > Seq.empty > } > } > {code}
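The bug pattern — a helper closing over an outer variable instead of using its own parameter — is easy to reproduce in any language; a minimal Python analogue (with hypothetical names mirroring the Scala ones):

```python
log_directory1 = "/tmp/wal-1"  # outer variable, analogous to logDirectory1

def get_current_log_files_buggy(log_directory):
    # Bug: ignores the parameter and reads the outer variable instead,
    # so every call inspects the same directory regardless of the argument.
    return log_directory1

def get_current_log_files_fixed(log_directory):
    # Fix: use the parameter that was actually passed in.
    return log_directory
```

The test still passed in Spark because both call sites happened to pass the same directory the outer variable referred to; the shadowing only matters once a second log directory is involved.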
[jira] [Commented] (SPARK-15326) Doing multiple unions on a Dataframe will result in a very inefficient query plan
[ https://issues.apache.org/jira/browse/SPARK-15326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342850#comment-15342850 ] MIN-FU YANG commented on SPARK-15326: - I'll look into it. > Doing multiple unions on a Dataframe will result in a very inefficient query > plan > - > > Key: SPARK-15326 > URL: https://issues.apache.org/jira/browse/SPARK-15326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Jurriaan Pruis > Attachments: Query Plan.pdf, skewed_join.py, skewed_join_plan.txt > > > While working with a very skewed dataset I noticed that repeated unions on a > dataframe will result in a query plan with 2^(union) - 1 unions. With large > datasets this will be very inefficient. > I tried to replicate this behaviour using a PySpark example (I've attached > the output of the explain() to this JIRA): > {code} > df = sqlCtx.range(1000) > def r(name, max_val=100): > return F.round(F.lit(max_val) * F.pow(F.rand(), > 4)).cast('integer').alias(name) > # Create a skewed dataset > skewed = df.select('id', r('a'), r('b'), r('c'), r('d'), r('e'), r('f')) > # Find the skewed values in the dataset > top_10_percent = skewed.freqItems(['a', 'b', 'c', 'd', 'e', 'f'], > 0.10).collect()[0] > def skewjoin(skewed, right, column, freqItems): > freqItems = freqItems[column + '_freqItems'] > skewed = skewed.alias('skewed') > cond = F.col(column).isin(freqItems) > # First broadcast join the frequent (skewed) values > filtered = skewed.filter(cond).join(F.broadcast(right.filter(cond)), > column, 'left_outer') > # Use a regular join for the non skewed values (with big tables this will > use a SortMergeJoin) > non_skewed = skewed.filter(cond == False).join(right.filter(cond == > False), column, 'left_outer') > # join them together and replace the column with the column found in the > right DataFrame > return filtered.unionAll(non_skewed).select('skewed.*', > right['id'].alias(column + '_key')).drop(column) > # Create the 
dataframes that will be joined to the skewed dataframe > right_size = 100 > df_a = sqlCtx.range(right_size).select('id', F.col('id').alias('a')) > df_b = sqlCtx.range(right_size).select('id', F.col('id').alias('b')) > df_c = sqlCtx.range(right_size).select('id', F.col('id').alias('c')) > df_d = sqlCtx.range(right_size).select('id', F.col('id').alias('d')) > df_e = sqlCtx.range(right_size).select('id', F.col('id').alias('e')) > df_f = sqlCtx.range(right_size).select('id', F.col('id').alias('f')) > # Join everything together > df = skewed > df = skewjoin(df, df_a, 'a', top_10_percent) > df = skewjoin(df, df_b, 'b', top_10_percent) > df = skewjoin(df, df_c, 'c', top_10_percent) > df = skewjoin(df, df_d, 'd', top_10_percent) > df = skewjoin(df, df_e, 'e', top_10_percent) > df = skewjoin(df, df_f, 'f', top_10_percent) > # df.explain() shows the plan where it does 63 unions > (2^(number_of_skewjoins) - 1) > # which will be very inefficient and slow > df.explain(True) > # Evaluate the plan > # You'd expect this to return 1000, but it does not, it returned 1140 > on my system > # (probably because it will recalculate the random columns? Not sure though) > print(df.count())
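The 2^n − 1 growth reported above can be reproduced with a toy plan model: each skew-join step replaces the current plan P with Union(filter(P), filter(P)), so the union count follows the recurrence unions(n) = 2 · unions(n−1) + 1 (an illustrative sketch of the plan arithmetic, not Spark's planner):

```python
def union_count_after(steps):
    # Each skewjoin step unions two filtered copies of the current plan,
    # doubling the existing union nodes and adding one more:
    # unions(n) = 2 * unions(n-1) + 1, starting from 0, i.e. 2^n - 1.
    unions = 0
    for _ in range(steps):
        unions = 2 * unions + 1
    return unions
```

Six skew joins therefore yield 63 union nodes, matching the attached `df.explain()` output; the underlying problem is that the plan tree duplicates the whole upstream plan at every step instead of referencing it once.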
[jira] [Created] (SPARK-16120) getCurrentLogFiles method in ReceiverSuite "WAL - generating and cleaning" case uses external variable instead of the passed parameter
Ahmed Mahran created SPARK-16120: Summary: getCurrentLogFiles method in ReceiverSuite "WAL - generating and cleaning" case uses external variable instead of the passed parameter Key: SPARK-16120 URL: https://issues.apache.org/jira/browse/SPARK-16120 Project: Spark Issue Type: Test Components: Streaming, Tests Reporter: Ahmed Mahran Priority: Trivial In ReceiverSuite.scala, in the test case "write ahead log - generating and cleaning", the inner method getCurrentLogFiles uses external variable logDirectory1 instead of the passed parameter logDirectory {code} def getCurrentLogFiles(logDirectory: File): Seq[String] = { try { if (logDirectory.exists()) { logDirectory1.listFiles().filter { _.getName.startsWith("log") }.map { _.toString } } else { Seq.empty } } catch { case e: Exception => Seq.empty } } {code}
[jira] [Assigned] (SPARK-16110) Can't set Python via spark-submit for YARN cluster mode when PYSPARK_PYTHON & PYSPARK_DRIVER_PYTHON are set
[ https://issues.apache.org/jira/browse/SPARK-16110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16110: Assignee: (was: Apache Spark) > Can't set Python via spark-submit for YARN cluster mode when PYSPARK_PYTHON & > PYSPARK_DRIVER_PYTHON are set > --- > > Key: SPARK-16110 > URL: https://issues.apache.org/jira/browse/SPARK-16110 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 1.6.1 > Environment: Ubuntu 14.04.4 LTS (GNU/Linux 4.2.0-38-generic x86_64), > Spark 1.6.1, Azure HDInsight 3.4 >Reporter: Kevin Grealish > Labels: easyfix > Original Estimate: 2h > Remaining Estimate: 2h > > When a cluster has PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment > variables set (needed for using non-system Python e.g. > /usr/bin/anaconda/bin/python), then you are unable to override this per > submission in YARN cluster mode. > When using spark-submit (in this case via LIVY) to submit with an override: > spark-submit --master yarn --deploy-mode cluster --conf > 'spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=python3' --conf > 'spark.yarn.appMasterEnv.PYSPARK_PYTHON=python3' probe.py > the environment variable values will override the conf settings. A workaround > for some can be to unset the env vars but that is not always possible (e.g. > submitting batch via LIVY where you can only pass through the parameters to > spark-submit). > Expectation is that the conf values above override the environment variables. > Fix is to change the order of application of conf and env vars in the yarn > client. > Related discussion: https://issues.cloudera.org/browse/LIVY-159 > Backporting this to 1.6 would be great and unblocking for me.
[jira] [Assigned] (SPARK-16110) Can't set Python via spark-submit for YARN cluster mode when PYSPARK_PYTHON & PYSPARK_DRIVER_PYTHON are set
[ https://issues.apache.org/jira/browse/SPARK-16110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16110: Assignee: Apache Spark > Can't set Python via spark-submit for YARN cluster mode when PYSPARK_PYTHON & > PYSPARK_DRIVER_PYTHON are set > --- > > Key: SPARK-16110 > URL: https://issues.apache.org/jira/browse/SPARK-16110 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 1.6.1 > Environment: Ubuntu 14.04.4 LTS (GNU/Linux 4.2.0-38-generic x86_64), > Spark 1.6.1, Azure HDInsight 3.4 >Reporter: Kevin Grealish >Assignee: Apache Spark > Labels: easyfix > Original Estimate: 2h > Remaining Estimate: 2h > > When a cluster has PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment > variables set (needed for using non-system Python e.g. > /usr/bin/anaconda/bin/python), then you are unable to override this per > submission in YARN cluster mode. > When using spark-submit (in this case via LIVY) to submit with an override: > spark-submit --master yarn --deploy-mode cluster --conf > 'spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=python3' --conf > 'spark.yarn.appMasterEnv.PYSPARK_PYTHON=python3' probe.py > the environment variable values will override the conf settings. A workaround > for some can be to unset the env vars but that is not always possible (e.g. > submitting batch via LIVY where you can only pass through the parameters to > spark-submit). > Expectation is that the conf values above override the environment variables. > Fix is to change the order of application of conf and env vars in the yarn > client. > Related discussion: https://issues.cloudera.org/browse/LIVY-159 > Backporting this to 1.6 would be great and unblocking for me.
[jira] [Commented] (SPARK-16110) Can't set Python via spark-submit for YARN cluster mode when PYSPARK_PYTHON & PYSPARK_DRIVER_PYTHON are set
[ https://issues.apache.org/jira/browse/SPARK-16110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342831#comment-15342831 ] Apache Spark commented on SPARK-16110: -- User 'KevinGrealish' has created a pull request for this issue: https://github.com/apache/spark/pull/13824 > Can't set Python via spark-submit for YARN cluster mode when PYSPARK_PYTHON & > PYSPARK_DRIVER_PYTHON are set > --- > > Key: SPARK-16110 > URL: https://issues.apache.org/jira/browse/SPARK-16110 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 1.6.1 > Environment: Ubuntu 14.04.4 LTS (GNU/Linux 4.2.0-38-generic x86_64), > Spark 1.6.1, Azure HDInsight 3.4 >Reporter: Kevin Grealish > Labels: easyfix > Original Estimate: 2h > Remaining Estimate: 2h > > When a cluster has PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment > variables set (needed for using non-system Python e.g. > /usr/bin/anaconda/bin/python), then you are unable to override this per > submission in YARN cluster mode. > When using spark-submit (in this case via LIVY) to submit with an override: > spark-submit --master yarn --deploy-mode cluster --conf > 'spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=python3' --conf > 'spark.yarn.appMasterEnv.PYSPARK_PYTHON=python3' probe.py > the environment variable values will override the conf settings. A workaround > for some can be to unset the env vars but that is not always possible (e.g. > submitting batch via LIVY where you can only pass through the parameters to > spark-submit). > Expectation is that the conf values above override the environment variables. > Fix is to change the order of application of conf and env vars in the yarn > client. > Related discussion: https://issues.cloudera.org/browse/LIVY-159 > Backporting this to 1.6 would be great and unblocking for me.
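The expected precedence — a submission-time {{spark.yarn.appMasterEnv.*}} conf value beating the cluster-wide environment variable — amounts to a simple lookup order; a Python sketch with a hypothetical helper (not Spark's actual yarn client code):

```python
def resolve_pyspark_python(cluster_env, submit_conf):
    # Desired order: an explicit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON
    # wins over whatever PYSPARK_PYTHON the cluster environment set; a final
    # default applies when neither is present.
    return (submit_conf.get("spark.yarn.appMasterEnv.PYSPARK_PYTHON")
            or cluster_env.get("PYSPARK_PYTHON")
            or "python")
```

The reported bug is that the yarn client applies the two sources in the opposite order, so the cluster's env var clobbers the per-submission conf; the fix is just swapping the application order.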
[jira] [Commented] (SPARK-16106) TaskSchedulerImpl does not correctly handle new executors on existing hosts
[ https://issues.apache.org/jira/browse/SPARK-16106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342821#comment-15342821 ] Imran Rashid commented on SPARK-16106: -- cc [~kayousterhout] After taking a closer look at this, I don't think it's really a very big issue. Since you've already got some executor on the host, your locality level will already include *at least* NODE_LOCAL. It's possible that when you add another executor, now you can go to PROCESS_LOCAL -- but I can't think of how you'd have a task which wants to be PROCESS_LOCAL in an executor which doesn't exist yet. There is also the issue w/ failedEpoch; I haven't wrapped my head around what the ramifications of that are yet. In any case, it's confusing at the very least, so we might as well fix this. But it might be reasonable to simply change {{executorAdded()}} to {{hostAdded()}} to clear up the naming, and leave the current behavior as well. > TaskSchedulerImpl does not correctly handle new executors on existing hosts > --- > > Key: SPARK-16106 > URL: https://issues.apache.org/jira/browse/SPARK-16106 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.0 >Reporter: Imran Rashid >Priority: Trivial > > The TaskSchedulerImpl updates the set of executors and hosts in each call to > {{resourceOffers}}. During this call, it also tracks whether there are any > new executors observed in {{newExecAvail}}: > {code} > executorIdToHost(o.executorId) = o.host > executorIdToTaskCount.getOrElseUpdate(o.executorId, 0) > if (!executorsByHost.contains(o.host)) { > executorsByHost(o.host) = new HashSet[String]() > executorAdded(o.executorId, o.host) > newExecAvail = true > } > {code} > However, this only detects when a new *host* is added, not when an additional > executor is added to an existing host (a relatively common event in dynamic > allocation). > The end result is that task locality and {{failedEpochs}} are not updated > correctly for new executors.
[jira] [Updated] (SPARK-16106) TaskSchedulerImpl does not correctly handle new executors on existing hosts
[ https://issues.apache.org/jira/browse/SPARK-16106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-16106: - Priority: Trivial (was: Major) > TaskSchedulerImpl does not correctly handle new executors on existing hosts > --- > > Key: SPARK-16106 > URL: https://issues.apache.org/jira/browse/SPARK-16106 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.0 >Reporter: Imran Rashid >Priority: Trivial > > The TaskSchedulerImpl updates the set of executors and hosts in each call to > {{resourceOffers}}. During this call, it also tracks whether there are any > new executors observed in {{newExecAvail}}: > {code} > executorIdToHost(o.executorId) = o.host > executorIdToTaskCount.getOrElseUpdate(o.executorId, 0) > if (!executorsByHost.contains(o.host)) { > executorsByHost(o.host) = new HashSet[String]() > executorAdded(o.executorId, o.host) > newExecAvail = true > } > {code} > However, this only detects when a new *host* is added, not when an additional > executor is added to an existing host (a relatively common event in dynamic > allocation). > The end result is that task locality and {{failedEpochs}} are not updated > correctly for new executors.
[jira] [Updated] (SPARK-15606) Driver hang in o.a.s.DistributedSuite on 2 core machine
[ https://issues.apache.org/jira/browse/SPARK-15606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-15606: - Fix Version/s: 1.6.2 > Driver hang in o.a.s.DistributedSuite on 2 core machine > --- > > Key: SPARK-15606 > URL: https://issues.apache.org/jira/browse/SPARK-15606 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 > Environment: AMD64 box with only 2 cores >Reporter: Pete Robbins >Assignee: Pete Robbins > Fix For: 1.6.2, 2.0.0 > > > repeatedly failing task that crashes JVM *** FAILED *** > The code passed to failAfter did not complete within 10 milliseconds. > (DistributedSuite.scala:128) > This test started failing and DistributedSuite started hanging following > https://github.com/apache/spark/pull/13055 > It looks like the extra message to remove the BlockManager deadlocks as there > are only 2 message processing loop threads. Related to > https://issues.apache.org/jira/browse/SPARK-13906 > {code} > /** Thread pool used for dispatching messages. */ > private val threadpool: ThreadPoolExecutor = { > val numThreads = > nettyEnv.conf.getInt("spark.rpc.netty.dispatcher.numThreads", > math.max(2, Runtime.getRuntime.availableProcessors())) > val pool = ThreadUtils.newDaemonFixedThreadPool(numThreads, > "dispatcher-event-loop") > for (i <- 0 until numThreads) { > pool.execute(new MessageLoop) > } > pool > } > {code} > Setting a minimum of 3 threads alleviates this issue, but I'm not sure there > isn't another underlying problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15606) Driver hang in o.a.s.DistributedSuite on 2 core machine
[ https://issues.apache.org/jira/browse/SPARK-15606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-15606: - Affects Version/s: 1.6.2 > Driver hang in o.a.s.DistributedSuite on 2 core machine > --- > > Key: SPARK-15606 > URL: https://issues.apache.org/jira/browse/SPARK-15606 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 > Environment: AMD64 box with only 2 cores >Reporter: Pete Robbins >Assignee: Pete Robbins > Fix For: 1.6.2, 2.0.0 > > > repeatedly failing task that crashes JVM *** FAILED *** > The code passed to failAfter did not complete within 10 milliseconds. > (DistributedSuite.scala:128) > This test started failing and DistributedSuite started hanging following > https://github.com/apache/spark/pull/13055 > It looks like the extra message to remove the BlockManager deadlocks as there > are only 2 message processing loop threads. Related to > https://issues.apache.org/jira/browse/SPARK-13906 > {code} > /** Thread pool used for dispatching messages. */ > private val threadpool: ThreadPoolExecutor = { > val numThreads = > nettyEnv.conf.getInt("spark.rpc.netty.dispatcher.numThreads", > math.max(2, Runtime.getRuntime.availableProcessors())) > val pool = ThreadUtils.newDaemonFixedThreadPool(numThreads, > "dispatcher-event-loop") > for (i <- 0 until numThreads) { > pool.execute(new MessageLoop) > } > pool > } > {code} > Setting a minimum of 3 threads alleviates this issue, but I'm not sure there > isn't another underlying problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
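The sizing logic in the quoted dispatcher snippet, and the mitigation Pete describes, can be sketched in plain Python (a stand-in for illustration, not Spark's code): with only 2 cores the pool gets exactly 2 message loops, so two mutually blocking handlers exhaust it, while a floor of 3 leaves one loop free.

```python
# Plain-Python mirror of the quoted thread-count expression:
#   conf.getInt("spark.rpc.netty.dispatcher.numThreads",
#               math.max(2, Runtime.getRuntime.availableProcessors()))
# The names and the `floor` parameter are illustrative, not Spark's API.

def dispatcher_threads(available_processors, configured=None, floor=2):
    """Explicit config wins; otherwise use max(floor, core count)."""
    if configured is not None:
        return configured
    return max(floor, available_processors)

print(dispatcher_threads(2))            # 2 -- the deadlock-prone case on a 2-core box
print(dispatcher_threads(2, floor=3))   # 3 -- the proposed minimum
print(dispatcher_threads(8))            # 8 -- unchanged on larger machines
```

On the 2-core AMD64 box from the report, both message loop threads can end up blocked on each other's replies, which matches the observed hang; raising the floor only masks that scheduling dependency rather than removing it.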
[jira] [Created] (SPARK-16119) Support "DROP TABLE ... PURGE" if Hive client supports it
Marcelo Vanzin created SPARK-16119: -- Summary: Support "DROP TABLE ... PURGE" if Hive client supports it Key: SPARK-16119 URL: https://issues.apache.org/jira/browse/SPARK-16119 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Marcelo Vanzin There's currently code that explicitly disables the "PURGE" flag when dropping a table: {code} if (ctx.PURGE != null) { throw operationNotAllowed("DROP TABLE ... PURGE", ctx) } {code} That flag is necessary in certain situations where the table data cannot be moved to the trash (which will be tried unless "PURGE" is requested). If the client supports it (Hive >= 0.14.0 according to the Hive docs), we should allow that option to be defined. For non-Hive tables, as far as I can understand, "PURGE" is the current behavior of Spark. The same limitation currently exists for "ALTER TABLE ... DROP PARTITION", so it should probably be covered as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16115) Improve output column name for SHOW PARTITIONS command and improve an error message
[ https://issues.apache.org/jira/browse/SPARK-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16115: Assignee: (was: Apache Spark) > Improve output column name for SHOW PARTITIONS command and improve an error > message > --- > > Key: SPARK-16115 > URL: https://issues.apache.org/jira/browse/SPARK-16115 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sunitha Kambhampati >Priority: Minor > > Opening this issue to address the following: > 1. For the SHOW PARTITIONS command, the column name in the output is > 'result'. In Hive, the column name is 'partition' which is more > descriptive. Change the output schema, column name from 'result' to > 'partition' > 2. Corner case: Improve the error message for a non-existent table > Right now, the below error is thrown: > scala> spark.sql("show partitions t1"); > org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on > a table that is not partitioned: default.t1; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16115) Improve output column name for SHOW PARTITIONS command and improve an error message
[ https://issues.apache.org/jira/browse/SPARK-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16115: Assignee: Apache Spark > Improve output column name for SHOW PARTITIONS command and improve an error > message > --- > > Key: SPARK-16115 > URL: https://issues.apache.org/jira/browse/SPARK-16115 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sunitha Kambhampati >Assignee: Apache Spark >Priority: Minor > > Opening this issue to address the following: > 1. For the SHOW PARTITIONS command, the column name in the output is > 'result'. In Hive, the column name is 'partition' which is more > descriptive. Change the output schema, column name from 'result' to > 'partition' > 2. Corner case: Improve the error message for a non-existent table > Right now, the below error is thrown: > scala> spark.sql("show partitions t1"); > org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on > a table that is not partitioned: default.t1; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16115) Improve output column name for SHOW PARTITIONS command and improve an error message
[ https://issues.apache.org/jira/browse/SPARK-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342796#comment-15342796 ] Apache Spark commented on SPARK-16115: -- User 'skambha' has created a pull request for this issue: https://github.com/apache/spark/pull/13822 > Improve output column name for SHOW PARTITIONS command and improve an error > message > --- > > Key: SPARK-16115 > URL: https://issues.apache.org/jira/browse/SPARK-16115 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sunitha Kambhampati >Priority: Minor > > Opening this issue to address the following: > 1. For the SHOW PARTITIONS command, the column name in the output is > 'result'. In Hive, the column name is 'partition' which is more > descriptive. Change the output schema, column name from 'result' to > 'partition' > 2. Corner case: Improve the error message for a non-existent table > Right now, the below error is thrown: > scala> spark.sql("show partitions t1"); > org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on > a table that is not partitioned: default.t1; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16118) getDropLast is missing in OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-16118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342790#comment-15342790 ] Apache Spark commented on SPARK-16118: -- User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/13821 > getDropLast is missing in OneHotEncoder > --- > > Key: SPARK-16118 > URL: https://issues.apache.org/jira/browse/SPARK-16118 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We forgot the getter of dropLast in OneHotEncoder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16118) getDropLast is missing in OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-16118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16118: Assignee: Xiangrui Meng (was: Apache Spark) > getDropLast is missing in OneHotEncoder > --- > > Key: SPARK-16118 > URL: https://issues.apache.org/jira/browse/SPARK-16118 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We forgot the getter of dropLast in OneHotEncoder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16118) getDropLast is missing in OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-16118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16118: Assignee: Apache Spark (was: Xiangrui Meng) > getDropLast is missing in OneHotEncoder > --- > > Key: SPARK-16118 > URL: https://issues.apache.org/jira/browse/SPARK-16118 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark > > We forgot the getter of dropLast in OneHotEncoder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16118) getDropLast is missing in OneHotEncoder
Xiangrui Meng created SPARK-16118: - Summary: getDropLast is missing in OneHotEncoder Key: SPARK-16118 URL: https://issues.apache.org/jira/browse/SPARK-16118 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.6.1, 1.5.2, 2.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We forgot the getter of dropLast in OneHotEncoder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16107) Group GLM-related methods in generated doc
[ https://issues.apache.org/jira/browse/SPARK-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342773#comment-15342773 ] Apache Spark commented on SPARK-16107: -- User 'junyangq' has created a pull request for this issue: https://github.com/apache/spark/pull/13820 > Group GLM-related methods in generated doc > -- > > Key: SPARK-16107 > URL: https://issues.apache.org/jira/browse/SPARK-16107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Junyang Qian > Labels: starter > > Group API docs of spark.glm, glm, predict(GLM), summary(GLM), > read/write.ml(GLM) under Rd spark.glm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16107) Group GLM-related methods in generated doc
[ https://issues.apache.org/jira/browse/SPARK-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16107: Assignee: Junyang Qian (was: Apache Spark) > Group GLM-related methods in generated doc > -- > > Key: SPARK-16107 > URL: https://issues.apache.org/jira/browse/SPARK-16107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Junyang Qian > Labels: starter > > Group API docs of spark.glm, glm, predict(GLM), summary(GLM), > read/write.ml(GLM) under Rd spark.glm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16107) Group GLM-related methods in generated doc
[ https://issues.apache.org/jira/browse/SPARK-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16107: Assignee: Apache Spark (was: Junyang Qian) > Group GLM-related methods in generated doc > -- > > Key: SPARK-16107 > URL: https://issues.apache.org/jira/browse/SPARK-16107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark > Labels: starter > > Group API docs of spark.glm, glm, predict(GLM), summary(GLM), > read/write.ml(GLM) under Rd spark.glm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16117) Hide LibSVMFileFormat in public API docs
[ https://issues.apache.org/jira/browse/SPARK-16117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342741#comment-15342741 ] Apache Spark commented on SPARK-16117: -- User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/13819 > Hide LibSVMFileFormat in public API docs > > > Key: SPARK-16117 > URL: https://issues.apache.org/jira/browse/SPARK-16117 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > LibSVMFileFormat implements a data source for the LIBSVM format. However, users do > not need to call its APIs to use it. So we should hide it in the public API > docs. The main issue is that we still need to put the documentation and > example code somewhere. The proposal is to have a dummy object to hold the > documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16117) Hide LibSVMFileFormat in public API docs
[ https://issues.apache.org/jira/browse/SPARK-16117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16117: Assignee: Apache Spark (was: Xiangrui Meng) > Hide LibSVMFileFormat in public API docs > > > Key: SPARK-16117 > URL: https://issues.apache.org/jira/browse/SPARK-16117 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark > > LibSVMFileFormat implements a data source for the LIBSVM format. However, users do > not need to call its APIs to use it. So we should hide it in the public API > docs. The main issue is that we still need to put the documentation and > example code somewhere. The proposal is to have a dummy object to hold the > documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16117) Hide LibSVMFileFormat in public API docs
[ https://issues.apache.org/jira/browse/SPARK-16117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16117: Assignee: Xiangrui Meng (was: Apache Spark) > Hide LibSVMFileFormat in public API docs > > > Key: SPARK-16117 > URL: https://issues.apache.org/jira/browse/SPARK-16117 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > LibSVMFileFormat implements a data source for the LIBSVM format. However, users do > not need to call its APIs to use it. So we should hide it in the public API > docs. The main issue is that we still need to put the documentation and > example code somewhere. The proposal is to have a dummy object to hold the > documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache
[ https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342731#comment-15342731 ] Michael Allman commented on SPARK-15968: I've created a revised PR for this issue, https://github.com/apache/spark/pull/13818. > HiveMetastoreCatalog does not correctly validate partitioned metastore > relation when searching the internal table cache > --- > > Key: SPARK-15968 > URL: https://issues.apache.org/jira/browse/SPARK-15968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Michael Allman > Labels: hive, metastore > > The {{getCached}} method of {{HiveMetastoreCatalog}} computes > {{pathsInMetastore}} from the metastore relation's catalog table. This only > returns the table base path, which is not correct for non-empty partitioned > tables. As a result, cache lookups on non-empty partitioned tables always > miss and these relations are always recomputed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache
[ https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342730#comment-15342730 ] Apache Spark commented on SPARK-15968: -- User 'mallman' has created a pull request for this issue: https://github.com/apache/spark/pull/13818 > HiveMetastoreCatalog does not correctly validate partitioned metastore > relation when searching the internal table cache > --- > > Key: SPARK-15968 > URL: https://issues.apache.org/jira/browse/SPARK-15968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Michael Allman > Labels: hive, metastore > > The {{getCached}} method of {{HiveMetastoreCatalog}} computes > {{pathsInMetastore}} from the metastore relation's catalog table. This only > returns the table base path, which is not correct for non-empty partitioned > tables. As a result, cache lookups on non-empty partitioned tables always > miss and these relations are always recomputed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15997) Audit ml.feature Update documentation for ml feature transformers
[ https://issues.apache.org/jira/browse/SPARK-15997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342728#comment-15342728 ] Apache Spark commented on SPARK-15997: -- User 'GayathriMurali' has created a pull request for this issue: https://github.com/apache/spark/pull/13745 > Audit ml.feature Update documentation for ml feature transformers > - > > Key: SPARK-15997 > URL: https://issues.apache.org/jira/browse/SPARK-15997 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Gayathri Murali >Assignee: Gayathri Murali > > This JIRA is a subtask of SPARK-15100 and improves documentation for new > features added to > 1. HashingTF > 2. Countvectorizer > 3. QuantileDiscretizer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16116) ConsoleSink should not require checkpointLocation
[ https://issues.apache.org/jira/browse/SPARK-16116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16116: Assignee: Shixiong Zhu (was: Apache Spark) > ConsoleSink should not require checkpointLocation > - > > Key: SPARK-16116 > URL: https://issues.apache.org/jira/browse/SPARK-16116 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > ConsoleSink should not require checkpointLocation since it's for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16116) ConsoleSink should not require checkpointLocation
[ https://issues.apache.org/jira/browse/SPARK-16116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342708#comment-15342708 ] Apache Spark commented on SPARK-16116: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/13817 > ConsoleSink should not require checkpointLocation > - > > Key: SPARK-16116 > URL: https://issues.apache.org/jira/browse/SPARK-16116 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > ConsoleSink should not require checkpointLocation since it's for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16116) ConsoleSink should not require checkpointLocation
[ https://issues.apache.org/jira/browse/SPARK-16116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16116: Assignee: Apache Spark (was: Shixiong Zhu) > ConsoleSink should not require checkpointLocation > - > > Key: SPARK-16116 > URL: https://issues.apache.org/jira/browse/SPARK-16116 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Shixiong Zhu >Assignee: Apache Spark > > ConsoleSink should not require checkpointLocation since it's for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16117) Hide LibSVMFileFormat in public API docs
Xiangrui Meng created SPARK-16117: - Summary: Hide LibSVMFileFormat in public API docs Key: SPARK-16117 URL: https://issues.apache.org/jira/browse/SPARK-16117 Project: Spark Issue Type: Improvement Components: Documentation, MLlib Affects Versions: 2.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng LibSVMFileFormat implements a data source for the LIBSVM format. However, users do not need to call its APIs to use it. So we should hide it in the public API docs. The main issue is that we still need to put the documentation and example code somewhere. The proposal is to have a dummy object to hold the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16116) ConsoleSink should not require checkpointLocation
Shixiong Zhu created SPARK-16116: Summary: ConsoleSink should not require checkpointLocation Key: SPARK-16116 URL: https://issues.apache.org/jira/browse/SPARK-16116 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Shixiong Zhu Assignee: Shixiong Zhu ConsoleSink should not require checkpointLocation since it's for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16114) Add network word count example
[ https://issues.apache.org/jira/browse/SPARK-16114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16114: Assignee: (was: Apache Spark) > Add network word count example > -- > > Key: SPARK-16114 > URL: https://issues.apache.org/jira/browse/SPARK-16114 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: James Thomas > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16090) Improve method grouping in SparkR generated docs
[ https://issues.apache.org/jira/browse/SPARK-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342696#comment-15342696 ] Felix Cheung edited comment on SPARK-16090 at 6/21/16 9:08 PM: --- This is for example the html output for gapply {code} # S4 method for signature 'GroupedData' gapply(x, func, schema) ## S4 method for signature 'SparkDataFrame' gapply(x, cols, func, schema) Arguments x a GroupedData func A function to be applied to each group partition specified by GroupedData. The function 'func' takes as argument a key - grouping columns and a data frame - a local R data.frame. The output of 'func' is a local R data.frame. schema The schema of the resulting SparkDataFrame after the function is applied. The schema must match to output of 'func'. It has to be defined for each output column with preferred output column name and corresponding data type. cols Grouping columns x A SparkDataFrame func A function to be applied to each group partition specified by grouping column of the SparkDataFrame. The function 'func' takes as argument a key - grouping columns and a data frame - a local R data.frame. The output of 'func' is a local R data.frame. schema The schema of the resulting SparkDataFrame after the function is applied. The schema must match to output of 'func'. It has to be defined for each output column with preferred output column name and corresponding data type. {code} As you can see, func and schema (and x) are listed twice with different wording under Arguments. We should see if we could explain it one way and list them once only. (ie. one copy of "@param func") was (Author: felixcheung): This is for example the html output for gapply {code} # S4 method for signature 'GroupedData' gapply(x, func, schema) ## S4 method for signature 'SparkDataFrame' gapply(x, cols, func, schema) Arguments x a GroupedData func A function to be applied to each group partition specified by GroupedData. 
The function 'func' takes as argument a key - grouping columns and a data frame - a local R data.frame. The output of 'func' is a local R data.frame. schema The schema of the resulting SparkDataFrame after the function is applied. The schema must match to output of 'func'. It has to be defined for each output column with preferred output column name and corresponding data type. cols Grouping columns x A SparkDataFrame func A function to be applied to each group partition specified by grouping column of the SparkDataFrame. The function 'func' takes as argument a key - grouping columns and a data frame - a local R data.frame. The output of 'func' is a local R data.frame. schema The schema of the resulting SparkDataFrame after the function is applied. The schema must match to output of 'func'. It has to be defined for each output column with preferred output column name and corresponding data type. {code} As you can see, func and schema are listed twice with different wording under Arguments. We should see if we could explain it one way and list them once only. (ie. one copy of "@param func") > Improve method grouping in SparkR generated docs > > > Key: SPARK-16090 > URL: https://issues.apache.org/jira/browse/SPARK-16090 > Project: Spark > Issue Type: Umbrella > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Critical > > This JIRA follows the discussion on > https://github.com/apache/spark/pull/13109 to improve method grouping in > SparkR generated docs. Having one method per doc page is not an R convention. > However, having many methods per doc page would hurt the readability. So a > proper grouping would help. Since we use roxygen2 instead of writing Rd files > directly, we should consider smaller groups to avoid confusion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16114) Add network word count example
[ https://issues.apache.org/jira/browse/SPARK-16114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342700#comment-15342700 ] Apache Spark commented on SPARK-16114: -- User 'jjthomas' has created a pull request for this issue: https://github.com/apache/spark/pull/13816 > Add network word count example > -- > > Key: SPARK-16114 > URL: https://issues.apache.org/jira/browse/SPARK-16114 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: James Thomas > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16114) Add network word count example
[ https://issues.apache.org/jira/browse/SPARK-16114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16114: Assignee: Apache Spark > Add network word count example > -- > > Key: SPARK-16114 > URL: https://issues.apache.org/jira/browse/SPARK-16114 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: James Thomas >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16090) Improve method grouping in SparkR generated docs
[ https://issues.apache.org/jira/browse/SPARK-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342696#comment-15342696 ] Felix Cheung commented on SPARK-16090: -- This is, for example, the HTML output for gapply:
{code}
# S4 method for signature 'GroupedData'
gapply(x, func, schema)

## S4 method for signature 'SparkDataFrame'
gapply(x, cols, func, schema)

Arguments

x       a GroupedData
func    A function to be applied to each group partition specified by GroupedData. The function 'func' takes as argument a key - grouping columns - and a data frame - a local R data.frame. The output of 'func' is a local R data.frame.
schema  The schema of the resulting SparkDataFrame after the function is applied. The schema must match the output of 'func'. It has to be defined for each output column with the preferred output column name and corresponding data type.
cols    Grouping columns
x       A SparkDataFrame
func    A function to be applied to each group partition specified by the grouping column of the SparkDataFrame. The function 'func' takes as argument a key - grouping columns - and a data frame - a local R data.frame. The output of 'func' is a local R data.frame.
schema  The schema of the resulting SparkDataFrame after the function is applied. The schema must match the output of 'func'. It has to be defined for each output column with the preferred output column name and corresponding data type.
{code}
As you can see, func and schema are listed twice with different wording under Arguments. We should see if we can explain each of them one way and list it only once (i.e., keep one copy of "@param func"). > Improve method grouping in SparkR generated docs > > > Key: SPARK-16090 > URL: https://issues.apache.org/jira/browse/SPARK-16090 > Project: Spark > Issue Type: Umbrella > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Critical > > This JIRA follows the discussion on > https://github.com/apache/spark/pull/13109 to improve method grouping in > SparkR generated docs. Having one method per doc page is not an R convention. > However, having many methods per doc page would hurt readability, so a > proper grouping would help. Since we use roxygen2 instead of writing Rd files > directly, we should consider smaller groups to avoid confusion.
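For readers unfamiliar with gapply, the per-group semantics the doc text describes (call a user function with the grouping key and a local data frame, then bind the per-group results together) can be sketched in plain Python. This is an illustrative analogue only, not SparkR API; the function and field names here are made up for the example.

```python
from collections import defaultdict

def gapply_sketch(rows, key_fn, func):
    """Illustrative analogue of SparkR's gapply: group rows by key,
    apply func(key, group) to each group, and concatenate the results."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    out = []
    for key, group in groups.items():
        # func receives the key and the local group, and returns result rows
        out.extend(func(key, group))
    return out

# Example: sum the 'value' field per 'id' group.
rows = [{"id": 1, "value": 2}, {"id": 1, "value": 3}, {"id": 2, "value": 5}]
result = gapply_sketch(
    rows,
    key_fn=lambda r: r["id"],
    func=lambda key, grp: [{"id": key, "total": sum(r["value"] for r in grp)}],
)
```

In the real gapply, the user function receives the key and a local R data.frame and must return a local R data.frame matching the declared schema; the sketch above only mirrors the control flow.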
[jira] [Updated] (SPARK-16105) PCA Reverse Transformer
[ https://issues.apache.org/jira/browse/SPARK-16105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16105: -- Flags: (was: Important) Priority: Minor (was: Major) The transformation is to a lower-dimensional space, so it's not possible in general to reverse it. You can map back, but in your example you still end up with a 15-dimensional subspace of the 96-dimensional space. The matrix is all the information available to make this transformation, and it's already available. > PCA Reverse Transformer > --- > > Key: SPARK-16105 > URL: https://issues.apache.org/jira/browse/SPARK-16105 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.6.1 >Reporter: Stefan Panayotov >Priority: Minor > > The PCA class has a fit method that returns a PCAModel. One of the members of > the PCAModel is pc (the Principal Components Matrix). This matrix is available > for inspection, but there is no method to use it for the reverse > transformation back to the original dimension. For example, if I use PCA to > reduce the dimensionality of my space from 96 to 15, I get a 96x15 pc Matrix. > I can do some modeling in my reduced space, and then I need to map back > to the original 96-dimensional space. Basically, I need to multiply my > 15-dimensional vectors by the 96x15 pc Matrix to get back 96-dimensional > vectors. Such a method is missing from the PCA model.
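The reverse mapping the reporter describes is a single matrix multiply: with pc of shape d x k (96x15 here), a k-dimensional reduced vector y maps back as pc . y. A minimal pure-Python sketch, assuming pc stores original dimensions as rows and components as columns (the function name is illustrative, not Spark API):

```python
def pca_inverse_transform(reduced, pc):
    """Map a k-dimensional reduced vector back into the original
    d-dimensional space: original ~= pc . reduced, where pc is d x k.
    Note: this recovers only the projection onto the principal
    subspace, not the variance discarded by the reduction."""
    d, k = len(pc), len(pc[0])
    assert len(reduced) == k, "reduced vector must have k components"
    return [sum(pc[i][j] * reduced[j] for j in range(k)) for i in range(d)]

# Toy example: d=3, k=2, with orthonormal component columns.
pc = [[1.0, 0.0],
      [0.0, 1.0],
      [0.0, 0.0]]
back = pca_inverse_transform([2.0, 3.0], pc)  # -> [2.0, 3.0, 0.0]
```

This is exactly why the result still lies in a 15-dimensional subspace of the 96-dimensional space: the multiply can only reach linear combinations of the 15 component columns.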
[jira] [Commented] (SPARK-16115) Improve output column name for SHOW PARTITIONS command and improve an error message
[ https://issues.apache.org/jira/browse/SPARK-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342689#comment-15342689 ] Sunitha Kambhampati commented on SPARK-16115: - I will submit a PR shortly. > Improve output column name for SHOW PARTITIONS command and improve an error > message > --- > > Key: SPARK-16115 > URL: https://issues.apache.org/jira/browse/SPARK-16115 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sunitha Kambhampati >Priority: Minor > > Opening this issue to address the following: > 1. For the SHOW PARTITIONS command, the column name in the output is > 'result'. In Hive, the column name is 'partition', which is more > descriptive. Change the output schema column name from 'result' to > 'partition'. > 2. Corner case: Improve the error message for a non-existent table. > Right now, the error below is thrown: > scala> spark.sql("show partitions t1"); > org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on > a table that is not partitioned: default.t1;
[jira] [Updated] (SPARK-16115) Improve output column name for SHOW PARTITIONS command and improve an error message
[ https://issues.apache.org/jira/browse/SPARK-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunitha Kambhampati updated SPARK-16115: Description: Opening this issue to address the following: 1. For the SHOW PARTITIONS command, the column name in the output is 'result'. In Hive, the column name is 'partition' which is more descriptive. Change the output schema, column name from 'result' to 'partition' 2. Corner case: Improve the error message for a non-existent table Right now, the below error is thrown: scala> spark.sql("show partitions t1"); org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on a table that is not partitioned: default.t1; was: Opening this issue to address the following: 1. For the SHOW PARTITIONS command, the column name in the output is 'result'. In Hive, the column name is 'partition' which is more descriptive. 2. Corner case: Improve the error message for a non-existent table Right now, the below error is thrown: scala> spark.sql("show partitions t1"); org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on a table that is not partitioned: default.t1; > Improve output column name for SHOW PARTITIONS command and improve an error > message > --- > > Key: SPARK-16115 > URL: https://issues.apache.org/jira/browse/SPARK-16115 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sunitha Kambhampati >Priority: Minor > > Opening this issue to address the following: > 1. For the SHOW PARTITIONS command, the column name in the output is > 'result'. In Hive, the column name is 'partition' which is more > descriptive. Change the output schema, column name from 'result' to > 'partition' > 2. 
Corner case: Improve the error message for a non-existent table > Right now, the below error is thrown: > scala> spark.sql("show partitions t1"); > org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on > a table that is not partitioned: default.t1;
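The corner-case fix amounts to distinguishing "table does not exist" from "table exists but is not partitioned" before raising, so a query against a missing table no longer produces the misleading "not partitioned" message. A hypothetical sketch of that error-handling order (all names here are illustrative, not Spark's actual implementation):

```python
class AnalysisException(Exception):
    """Stand-in for org.apache.spark.sql.AnalysisException."""
    pass

def check_show_partitions(table_name, catalog):
    """Sketch of the proposed SHOW PARTITIONS checks: report a missing
    table explicitly before checking whether it is partitioned."""
    if table_name not in catalog:
        raise AnalysisException("Table or view not found: %s" % table_name)
    partition_columns = catalog[table_name].get("partition_columns", [])
    if not partition_columns:
        raise AnalysisException(
            "SHOW PARTITIONS is not allowed on a table that is not "
            "partitioned: %s" % table_name)
    return partition_columns

# Toy catalog: default.t1 does not exist, default.t2 is partitioned by ds.
catalog = {"default.t2": {"partition_columns": ["ds"]}}
```

With this ordering, the existing "not partitioned" message is reserved for tables that actually exist, which is the behavior the issue asks for.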