[jira] [Commented] (SPARK-16095) Yarn cluster mode should return consistent result for command line and SparkLauncher
[ https://issues.apache.org/jira/browse/SPARK-16095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343760#comment-15343760 ] Peng Zhang commented on SPARK-16095: I wanted to write a unit test to demonstrate this issue, but found that YarnClusterSuite never enters the block *if (conf.get("spark.master") == "yarn-cluster")*. I filed SPARK-16125 to fix that first. > Yarn cluster mode should return consistent result for command line and > SparkLauncher > > > Key: SPARK-16095 > URL: https://issues.apache.org/jira/browse/SPARK-16095 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Peng Zhang > > For an application with YarnApplicationState.FINISHED and > FinalApplicationStatus.FAILED, invoking spark-submit from the command line > throws an exception, while submitting with SparkLauncher reports the state > FINISHED, which implies the app succeeded. > Because of this, an assertion with a false condition in the test > YarnClusterSuite will not fail the test. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
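The inconsistency can be illustrated with a short SparkLauncher sketch. This is a hedged illustration only: the jar path and main class are placeholders, not from this issue.

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LauncherStateCheck {
  // The issue: the launcher's terminal state can read FINISHED even when
  // YARN's FinalApplicationStatus is FAILED, so this check looks successful.
  def looksSuccessful(state: SparkAppHandle.State): Boolean =
    state == SparkAppHandle.State.FINISHED

  def main(args: Array[String]): Unit = {
    // Placeholder app resource and main class, not from this issue.
    val handle = new SparkLauncher()
      .setAppResource("/path/to/app.jar")
      .setMainClass("com.example.MyApp")
      .setMaster("yarn")
      .setDeployMode("cluster")
      .startApplication()

    // Block until the application reaches a terminal state.
    while (!handle.getState.isFinal) Thread.sleep(1000)

    // In contrast, spark-submit on the command line throws for the same app.
    println(s"final state: ${handle.getState}, looksSuccessful = ${looksSuccessful(handle.getState)}")
  }
}
```

The point of contention is that `looksSuccessful` above returns true for an app whose YARN final status was FAILED, so launcher-based callers silently treat a failed app as succeeded.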
[jira] [Commented] (SPARK-16125) YarnClusterSuite tests cluster mode incorrectly
[ https://issues.apache.org/jira/browse/SPARK-16125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343747#comment-15343747 ] Sean Owen commented on SPARK-16125: --- Looks like we might have related issues in UtilsSuite, context.py and SparkILoop; can you adjust those too? > YarnClusterSuite tests cluster mode incorrectly > -- > > Key: SPARK-16125 > URL: https://issues.apache.org/jira/browse/SPARK-16125 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Peng Zhang >Priority: Minor > > In YarnClusterSuite, cluster mode is tested with: > {code} > if (conf.get("spark.master") == "yarn-cluster") { > {code} > But since the SPARK-13220 change, conf.get("spark.master") returns "yarn", so > this *if* condition is always *false* and the code in this block is never > executed. I think it should be changed to: > {code} > if (conf.get("spark.submit.deployMode") == "cluster") { > {code}
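The proposed fix can be sketched outside of Spark with a plain Map standing in for SparkConf (an illustrative sketch, not the actual suite code):

```scala
// A plain Map stands in for SparkConf in this sketch.
// Since SPARK-13220, "spark.master" is normalized to "yarn", so the old
// comparison against "yarn-cluster" can never match.
val conf = Map(
  "spark.master" -> "yarn",
  "spark.submit.deployMode" -> "cluster"
)

val oldCheck = conf.get("spark.master").contains("yarn-cluster")
val newCheck = conf.get("spark.submit.deployMode").contains("cluster")

assert(!oldCheck) // the block guarded by the old check is never entered
assert(newCheck)  // the deployMode check correctly detects cluster mode
```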
[jira] [Commented] (SPARK-16064) Fix the GLM error caused by NA produced by reweight function
[ https://issues.apache.org/jira/browse/SPARK-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343745#comment-15343745 ] Sean Owen commented on SPARK-16064: --- Based on what, though? You should compile the information publicly. > Fix the GLM error caused by NA produced by reweight function > > > Key: SPARK-16064 > URL: https://issues.apache.org/jira/browse/SPARK-16064 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.0 >Reporter: Zhang Mengqi >Priority: Minor > > This case happens when users run GLM with SparkR; the same dataset runs > GLM fine in native R. > When users fit a GLM model using glm with the poisson family, it raises an > assertion error caused by NA values produced by the reweight function: > 16/06/20 16:40:22 ERROR RBackendHandler: fit on > org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper failed > Error in invokeJava(isStatic = TRUE, className, methodName, ...) : > java.lang.AssertionError: assertion failed: Sum of weights cannot be zero.
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.ml.optim.WeightedLeastSquares$Aggregator.validate(WeightedLeastSquares.scala:248) > at > org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:82) > at > org.apache.spark.ml.optim.IterativelyReweightedLeastSquares.fit(IterativelyReweightedLeastSquares.scala:85) > at > org.apache.spark.ml.regression.GeneralizedLinearRegression.train(GeneralizedLinearRegression.scala:276) > at > org.apache.spark.ml.regression.GeneralizedLinearRegression.train(GeneralizedLinearRegression.scala:134) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:148) > at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:144) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.Abstra > P.S. The dataset is about city ride flows between several planning areas in > Singapore. > ride_flow_exp <- glm(flow~Origin+Destination+distance,ride_flow,family = > poisson(link = "log")) > SparkDataFrame[Origin:string, Destination:string, flow:double, Oi:int, > Dj:int, distance:double]
[jira] [Created] (SPARK-16125) YarnClusterSuite tests cluster mode incorrectly
Peng Zhang created SPARK-16125: -- Summary: YarnClusterSuite tests cluster mode incorrectly Key: SPARK-16125 URL: https://issues.apache.org/jira/browse/SPARK-16125 Project: Spark Issue Type: Bug Components: YARN Reporter: Peng Zhang Priority: Minor In YarnClusterSuite, cluster mode is tested with: {code} if (conf.get("spark.master") == "yarn-cluster") { {code} But since the SPARK-13220 change, conf.get("spark.master") returns "yarn", so this *if* condition is always *false* and the code in this block is never executed. I think it should be changed to: {code} if (conf.get("spark.submit.deployMode") == "cluster") { {code}
[jira] [Updated] (SPARK-16024) add tests for table creation with column comment
[ https://issues.apache.org/jira/browse/SPARK-16024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-16024: Description: (was: CREATE TABLE src(a INT COMMENT 'bla') USING parquet. When we describe table, the column comment is not there.) > add tests for table creation with column comment > > > Key: SPARK-16024 > URL: https://issues.apache.org/jira/browse/SPARK-16024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >
[jira] [Updated] (SPARK-16024) add tests for table creation with column comment
[ https://issues.apache.org/jira/browse/SPARK-16024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-16024: Description: should test both hive serde tables and datasource tables > add tests for table creation with column comment > > > Key: SPARK-16024 > URL: https://issues.apache.org/jira/browse/SPARK-16024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan > > should test both hive serde tables and datasource tables
[jira] [Updated] (SPARK-16104) Do not create CSV writer object for every flush when writing
[ https://issues.apache.org/jira/browse/SPARK-16104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-16104: --- Assignee: Hyukjin Kwon > Do not create CSV writer object for every flush when writing > - > > Key: SPARK-16104 > URL: https://issues.apache.org/jira/browse/SPARK-16104 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon > Fix For: 2.1.0 > > > Initially, the CSV data source created a {{CsvWriter}} for each record; that was > fixed in SPARK-14031. > However, it still creates a writer for each flush in {{LineCsvWriter}}. This > is unnecessary; it would be better to use a single {{CsvWriter}}.
[jira] [Resolved] (SPARK-16104) Do not create CSV writer object for every flush when writing
[ https://issues.apache.org/jira/browse/SPARK-16104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-16104. Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 13809 [https://github.com/apache/spark/pull/13809] > Do not create CSV writer object for every flush when writing > - > > Key: SPARK-16104 > URL: https://issues.apache.org/jira/browse/SPARK-16104 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Hyukjin Kwon > Fix For: 2.1.0 > > > Initially, the CSV data source created a {{CsvWriter}} for each record; that was > fixed in SPARK-14031. > However, it still creates a writer for each flush in {{LineCsvWriter}}. This > is unnecessary; it would be better to use a single {{CsvWriter}}.
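The pattern behind the fix can be sketched with univocity-parsers, the CSV library Spark's data source builds on. This is an illustrative sketch of reusing one writer across flushes, not the actual {{LineCsvWriter}} code:

```scala
import java.io.StringWriter
import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings}

val out = new StringWriter()
// Construct the CsvWriter once, up front...
val writer = new CsvWriter(out, new CsvWriterSettings())

// ...and reuse it for every batch: flushing does not require a fresh writer.
def writeBatch(rows: Seq[Seq[String]]): Unit = {
  rows.foreach(row => writer.writeRow(row.toArray[AnyRef]: _*))
  writer.flush()
}

writeBatch(Seq(Seq("a", "1"), Seq("b", "2")))
writeBatch(Seq(Seq("c", "3")))
writer.close()
```

Avoiding a fresh `CsvWriter` per flush saves the repeated object construction and settings parsing that motivated this sub-task.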
[jira] [Commented] (SPARK-12173) Consider supporting DataSet API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343699#comment-15343699 ] Reynold Xin commented on SPARK-12173: - I don't think you are looking for the dataset API. You are just looking for some way to do UDF / aggregation? > Consider supporting DataSet API in SparkR > - > > Key: SPARK-12173 > URL: https://issues.apache.org/jira/browse/SPARK-12173 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Felix Cheung >
[jira] [Updated] (SPARK-16041) Disallow Duplicate Columns in `partitionBy`, `bucketBy` and `sortBy`
[ https://issues.apache.org/jira/browse/SPARK-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-16041: Description: Duplicate columns are not allowed in `partitionBy`, `bucketBy`, `sortBy` in DataFrameWriter. Duplicate columns can cause unpredictable results, for example resolution failures. We should detect the duplicates and throw exceptions with appropriate messages. was: Duplicate columns are not allowed in `partitionBy`, `blockBy`, `sortBy` in DataFrameWriter. The duplicate columns could cause unpredictable results. For example, the resolution failure. We should detect the duplicates and issue exceptions with appropriate messages. > Disallow Duplicate Columns in `partitionBy`, `bucketBy` and `sortBy` > > > Key: SPARK-16041 > URL: https://issues.apache.org/jira/browse/SPARK-16041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Duplicate columns are not allowed in `partitionBy`, `bucketBy`, `sortBy` in > DataFrameWriter. Duplicate columns can cause unpredictable results, for > example resolution failures. > We should detect the duplicates and throw exceptions with appropriate > messages.
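The proposed validation can be sketched with a small hypothetical helper (`assertNoDuplicates` is a made-up name for illustration, not the actual DataFrameWriter code):

```scala
// Hypothetical helper illustrating the proposed check: reject duplicate
// column names passed to partitionBy/bucketBy/sortBy with a clear message.
def assertNoDuplicates(api: String, cols: Seq[String]): Unit = {
  val dups = cols.groupBy(identity).collect { case (c, vs) if vs.size > 1 => c }
  if (dups.nonEmpty) {
    throw new IllegalArgumentException(
      s"Found duplicate column(s) in $api: ${dups.mkString(", ")}")
  }
}

assertNoDuplicates("partitionBy", Seq("year", "month")) // passes
// assertNoDuplicates("sortBy", Seq("id", "id"))        // would throw
```

Failing fast here surfaces the problem at the call site instead of letting it appear later as an opaque resolution failure.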
[jira] [Updated] (SPARK-16041) Disallow Duplicate Columns in `partitionBy`, `bucketBy` and `sortBy`
[ https://issues.apache.org/jira/browse/SPARK-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-16041: Summary: Disallow Duplicate Columns in `partitionBy`, `bucketBy` and `sortBy` (was: Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy`) > Disallow Duplicate Columns in `partitionBy`, `bucketBy` and `sortBy` > > > Key: SPARK-16041 > URL: https://issues.apache.org/jira/browse/SPARK-16041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Duplicate columns are not allowed in `partitionBy`, `blockBy`, `sortBy` in > DataFrameWriter. The duplicate columns could cause unpredictable results. For > example, the resolution failure. > We should detect the duplicates and issue exceptions with appropriate > messages.
[jira] [Updated] (SPARK-16041) Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy`
[ https://issues.apache.org/jira/browse/SPARK-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-16041: Summary: Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` (was: Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` in DataFrameWriter) > Disallow Duplicate Columns in `partitionBy`, `blockBy` and `sortBy` > --- > > Key: SPARK-16041 > URL: https://issues.apache.org/jira/browse/SPARK-16041 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Duplicate columns are not allowed in `partitionBy`, `blockBy`, `sortBy` in > DataFrameWriter. The duplicate columns could cause unpredictable results. For > example, the resolution failure. > We should detect the duplicates and issue exceptions with appropriate > messages.
[jira] [Commented] (SPARK-16100) Aggregator fails with Tungsten error when complex types are used for results and partial sum
[ https://issues.apache.org/jira/browse/SPARK-16100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343656#comment-15343656 ] Apache Spark commented on SPARK-16100: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/13835 > Aggregator fails with Tungsten error when complex types are used for results > and partial sum > > > Key: SPARK-16100 > URL: https://issues.apache.org/jira/browse/SPARK-16100 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Deenar Toraskar > > I get a similar error when using complex types in Aggregator. Not sure if > this is the same issue or something else. > {code:Agg.scala} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.TypedColumn > import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder > import org.apache.spark.sql.expressions.Aggregator > import org.apache.spark.sql.{Encoder,Row} > import sqlContext.implicits._ > object CustomSummer extends Aggregator[Valuation, Map[Int, Seq[Double]], > Seq[Seq[Double]]] with Serializable { > def zero: Map[Int, Seq[Double]] = Map() > def reduce(b: Map[Int, Seq[Double]], a:Valuation): Map[Int, Seq[Double]] > = { >val timeInterval: Int = a.timeInterval >val currentSum: Seq[Double] = b.get(timeInterval).getOrElse(Nil) >val currentRow: Seq[Double] = a.pvs >b.updated(timeInterval, sumArray(currentSum, currentRow)) > } > def sumArray(a: Seq[Double], b: Seq[Double]): Seq[Double] = Nil > def merge(b1: Map[Int, Seq[Double]], b2: Map[Int, Seq[Double]]): > Map[Int, Seq[Double]] = { > /* merges two maps together ++ replaces any (k,v) from the map on the > left > side of ++ (here map1) by (k,v) from the right side map, if (k,_) > already > exists in the left side map (here map1), e.g. 
Map(1->1) ++ Map(1->2) > results in Map(1->2) */ > b1 ++ b2.map { case (timeInterval, exposures) => > timeInterval -> sumArray(exposures, b1.getOrElse(timeInterval, Nil)) > } > } > def finish(exposures: Map[Int, Seq[Double]]): Seq[Seq[Double]] = > { > exposures.size match { > case 0 => null > case _ => { > val range = exposures.keySet.max > // convert map to 2 dimensional array, (timeInterval x > Seq[expScn1, expScn2, ...] > (0 to range).map(x => exposures.getOrElse(x, Nil)) > } > } > } > override def bufferEncoder: Encoder[Map[Int,Seq[Double]]] = > ExpressionEncoder() > override def outputEncoder: Encoder[Seq[Seq[Double]]] = ExpressionEncoder() >} > case class Valuation(timeInterval : Int, pvs : Seq[Double]) > val valns = sc.parallelize(Seq(Valuation(0, Seq(1.0,2.0,3.0)), > Valuation(2, Seq(1.0,2.0,3.0)), > Valuation(1, Seq(1.0,2.0,3.0)),Valuation(2, Seq(1.0,2.0,3.0)),Valuation(0, > Seq(1.0,2.0,3.0)) > )).toDS > val g_c1 = > valns.groupByKey(_.timeInterval).agg(CustomSummer.toColumn).show(false) > {code} > I get the following error > {quote} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 > (TID 19, localhost): java.lang.IndexOutOfBoundsException: 0 > at > scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43) > at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47) > at scala.collection.mutable.ArrayBuffer.remove(ArrayBuffer.scala:167) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:244) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:214) > at > org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:156) > at > org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:154) > at > 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:155) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:155) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:155) > at >
[jira] [Commented] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343243#comment-15343243 ] Ryan Blue commented on SPARK-16032: --- [~cloud_fan], while I think by-name insertion is important in the long run, my concern on this issue isn't that it is missing. It would be nice and it would help mitigate what I think is a problem for people that currently use the Hive integration. But, the point I'm trying to make is that I think this set of changes makes it very difficult to use Spark with Hive. The first thing I will do to move on to Spark 2.0 is restore support for partitionBy with insertInto. I need to make a good case for that and write it up more clearly than I have so far, which I'll do tomorrow. But I want to let you guys know that I see these changes as a major blocker for us and probably others. My -1 isn't binding and I realize there are deadlines that you guys are aiming for. I'm not trying to mess that up, I just want to let you know that I think the Hive changes need more discussion. Consequently, I would rather they wait until 2.1, even if we have to deal with the incompatibility. An aside on the by-name resolution: I noted above that, rather than changing Hive/saveAsTable, we could use the InsertIntoTable with by-name resolution from my PR. My point isn't to try to push that feature into the release, it's that I think it supports the point that this is a big change set that was done in very little time. It doesn't look like there was a significant change around saveAsTable for Hive in the actual code, though. 
> Audit semantics of various insertion operations related to partitioned tables > - > > Key: SPARK-16032 > URL: https://issues.apache.org/jira/browse/SPARK-16032 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan >Priority: Critical > Attachments: [SPARK-16032] Spark SQL table insertion auditing - > Google Docs.pdf > > > We found that semantics of various insertion operations related to partition > tables can be inconsistent. This is an umbrella ticket for all related > tickets.
[jira] [Commented] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns
[ https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343235#comment-15343235 ] Gayathri Murali commented on SPARK-16000: - [~yuhaoyan] I can help with this. > Make model loading backward compatible with saved models using old vector > columns > - > > Key: SPARK-16000 > URL: https://issues.apache.org/jira/browse/SPARK-16000 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: yuhao yang > > To help users migrate from Spark 1.6 to 2.0, we should make model loading > backward compatible with models saved in 1.6. The main incompatibility is the > vector column type change.
[jira] [Commented] (SPARK-16121) ListingFileCatalog does not list in parallel anymore
[ https://issues.apache.org/jira/browse/SPARK-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343232#comment-15343232 ] Xiao Li commented on SPARK-16121: - I also saw this, but I thought it was by design. :) > ListingFileCatalog does not list in parallel anymore > > > Key: SPARK-16121 > URL: https://issues.apache.org/jira/browse/SPARK-16121 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Blocker > > In ListingFileCatalog, the implementation of {{listLeafFiles}} is shown > below. When the number of user-provided paths is less than the value of > {{sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold}}, we > do not use parallel listing, which differs from what 1.6 does (in 1.6, if > the number of children of any inner dir is larger than the threshold, we > use parallel listing). > {code} > protected def listLeafFiles(paths: Seq[Path]): > mutable.LinkedHashSet[FileStatus] = { > if (paths.length >= > sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) { > HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, > sparkSession) > } else { > // Dummy jobconf to get to the pathFilter defined in configuration > val jobConf = new JobConf(hadoopConf, this.getClass) > val pathFilter = FileInputFormat.getInputPathFilter(jobConf) > val statuses: Seq[FileStatus] = paths.flatMap { path => > val fs = path.getFileSystem(hadoopConf) > logInfo(s"Listing $path on driver") > Try { > HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), > pathFilter) > }.getOrElse(Array.empty[FileStatus]) > } > mutable.LinkedHashSet(statuses: _*) > } > } > {code}
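The behavioral difference can be sketched with plain numbers (a hedged illustration of the two policies with a made-up threshold value; no filesystem calls):

```scala
val threshold = 32 // stand-in for parallelPartitionDiscoveryThreshold

// 1.6 behavior: go parallel if ANY inner directory has more children
// than the threshold.
def parallel16(childCounts: Seq[Int]): Boolean = childCounts.exists(_ > threshold)

// 2.0 behavior described above: go parallel only if the NUMBER OF
// user-provided input paths reaches the threshold.
def parallel20(numInputPaths: Int): Boolean = numInputPaths >= threshold

// One input path containing 10,000 files: 1.6 lists in parallel, 2.0 does not.
assert(parallel16(Seq(10000)))
assert(!parallel20(1))
```

This is why a single root directory with a huge number of children, common in practice, regresses to serial driver-side listing under the 2.0 code.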
[jira] [Commented] (SPARK-12172) Consider removing SparkR internal RDD APIs
[ https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343215#comment-15343215 ] Sun Rui commented on SPARK-12172: - Currently spark.lapply() internally depends on RDDs; we have to change this before removing the RDD API. > Consider removing SparkR internal RDD APIs > -- > > Key: SPARK-12172 > URL: https://issues.apache.org/jira/browse/SPARK-12172 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Felix Cheung >
[jira] [Commented] (SPARK-12173) Consider supporting DataSet API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343212#comment-15343212 ] Sun Rui commented on SPARK-12173: - [~rxin] Yes, R doesn't need compile-time type safety, but map/reduce functions are popular in R; for example, lapply() applies a function to each item of a list or vector. For now, SparkR supports spark.lapply(), similar to lapply(). The implementation internally depends on RDDs. We could change the implementation to use Dataset without exposing the Dataset API, something like: 1) convert the R vector/list to a Dataset; 2) call Dataset functions on it; 3) collect the result back as an R vector/list. Not exposing the Dataset API means SparkR does not provide a distributed vector/list abstraction; SparkR users would have to use DataFrame for distributed vectors/lists, which seems inconvenient for R users. [~shivaram] what do you think? > Consider supporting DataSet API in SparkR > - > > Key: SPARK-12173 > URL: https://issues.apache.org/jira/browse/SPARK-12173 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Felix Cheung >
[jira] [Commented] (SPARK-16124) Throws exception when executing query on `build/sbt hive/console`
[ https://issues.apache.org/jira/browse/SPARK-16124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343176#comment-15343176 ] Apache Spark commented on SPARK-16124: -- User 'tilumi' has created a pull request for this issue: https://github.com/apache/spark/pull/13833 > Throws exception when executing query on `build/sbt hive/console` > - > > Key: SPARK-16124 > URL: https://issues.apache.org/jira/browse/SPARK-16124 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: MIN-FU YANG >Priority: Minor > > When I execute `val query = sql("SELECT * FROM src WHERE key = 92 ")` on the Hive > console started from `build/sbt hive/console`, it throws an exception: > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > file:/Users/xxx/git/spark/sql/hive/target/scala-2.11/spark-hive_2.11-2.0.0-SNAPSHOT.jar!/data/files/kv1.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:242) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at > org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:128) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession$SqlCmd$$anonfun$cmd$1.apply$mcV$sp(TestHive.scala:192) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession.loadTestTable(TestHive.scala:376) > at > 
org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462) > at > org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed$lzycompute(TestHive.scala:462) > at > org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed(TestHive.scala:450) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682) > ... 42 elided
[jira] [Closed] (SPARK-15326) Doing multiple unions on a Dataframe will result in a very inefficient query plan
[ https://issues.apache.org/jira/browse/SPARK-15326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell closed SPARK-15326. - Resolution: Not A Problem > Doing multiple unions on a Dataframe will result in a very inefficient query > plan > - > > Key: SPARK-15326 > URL: https://issues.apache.org/jira/browse/SPARK-15326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Jurriaan Pruis > Attachments: Query Plan.pdf, skewed_join.py, skewed_join_plan.txt > > > While working with a very skewed dataset I noticed that repeated unions on a > dataframe will result in a query plan with 2^(union) - 1 unions. With large > datasets this will be very inefficient. > I tried to replicate this behaviour using a PySpark example (I've attached > the output of the explain() to this JIRA): > {code} > df = sqlCtx.range(1000) > def r(name, max_val=100): > return F.round(F.lit(max_val) * F.pow(F.rand(), > 4)).cast('integer').alias(name) > # Create a skewed dataset > skewed = df.select('id', r('a'), r('b'), r('c'), r('d'), r('e'), r('f')) > # Find the skewed values in the dataset > top_10_percent = skewed.freqItems(['a', 'b', 'c', 'd', 'e', 'f'], > 0.10).collect()[0] > def skewjoin(skewed, right, column, freqItems): > freqItems = freqItems[column + '_freqItems'] > skewed = skewed.alias('skewed') > cond = F.col(column).isin(freqItems) > # First broadcast join the frequent (skewed) values > filtered = skewed.filter(cond).join(F.broadcast(right.filter(cond)), > column, 'left_outer') > # Use a regular join for the non skewed values (with big tables this will > use a SortMergeJoin) > non_skewed = skewed.filter(cond == False).join(right.filter(cond == > False), column, 'left_outer') > # join them together and replace the column with the column found in the > right DataFrame > return filtered.unionAll(non_skewed).select('skewed.*', > right['id'].alias(column + '_key')).drop(column) > # Create the dataframes that will be 
joined to the skewed dataframe > right_size = 100 > df_a = sqlCtx.range(right_size).select('id', F.col('id').alias('a')) > df_b = sqlCtx.range(right_size).select('id', F.col('id').alias('b')) > df_c = sqlCtx.range(right_size).select('id', F.col('id').alias('c')) > df_d = sqlCtx.range(right_size).select('id', F.col('id').alias('d')) > df_e = sqlCtx.range(right_size).select('id', F.col('id').alias('e')) > df_f = sqlCtx.range(right_size).select('id', F.col('id').alias('f')) > # Join everything together > df = skewed > df = skewjoin(df, df_a, 'a', top_10_percent) > df = skewjoin(df, df_b, 'b', top_10_percent) > df = skewjoin(df, df_c, 'c', top_10_percent) > df = skewjoin(df, df_d, 'd', top_10_percent) > df = skewjoin(df, df_e, 'e', top_10_percent) > df = skewjoin(df, df_f, 'f', top_10_percent) > # df.explain() shows the plan where it does 63 unions > (2^(number_of_skewjoins) - 1) > # which will be very inefficient and slow > df.explain(True) > # Evaluate the plan > # You'd expect this to return 1000, but it does not, it returned 1140 > on my system > # (probably because it will recalculate the random columns? Not sure though) > print(df.count()) > {code}
[jira] [Assigned] (SPARK-16124) Throws exception when executing query on `build/sbt hive/console`
[ https://issues.apache.org/jira/browse/SPARK-16124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16124: Assignee: Apache Spark > Throws exception when executing query on `build/sbt hive/console` > - > > Key: SPARK-16124 > URL: https://issues.apache.org/jira/browse/SPARK-16124 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: MIN-FU YANG >Assignee: Apache Spark >Priority: Minor > > When I execute `val query = sql("SELECT * FROM src WHERE key = 92 ")` on hive > console which is from `build/sbt hive/console`, It throws exception: > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > file:/Users/xxx/git/spark/sql/hive/target/scala-2.11/spark-hive_2.11-2.0.0-SNAPSHOT.jar!/data/files/kv1.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:242) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at > org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:128) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession$SqlCmd$$anonfun$cmd$1.apply$mcV$sp(TestHive.scala:192) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession.loadTestTable(TestHive.scala:376) > at > 
org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462) > at > org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed$lzycompute(TestHive.scala:462) > at > org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed(TestHive.scala:450) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682) > ... 42 elided -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16124) Throws exception when executing query on `build/sbt hive/console`
[ https://issues.apache.org/jira/browse/SPARK-16124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16124: Assignee: (was: Apache Spark) > Throws exception when executing query on `build/sbt hive/console` > - > > Key: SPARK-16124 > URL: https://issues.apache.org/jira/browse/SPARK-16124 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: MIN-FU YANG >Priority: Minor > > When I execute `val query = sql("SELECT * FROM src WHERE key = 92 ")` on hive > console which is from `build/sbt hive/console`, It throws exception: > org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: > file:/Users/xxx/git/spark/sql/hive/target/scala-2.11/spark-hive_2.11-2.0.0-SNAPSHOT.jar!/data/files/kv1.txt; > at > org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:242) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at > org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:128) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession$SqlCmd$$anonfun$cmd$1.apply$mcV$sp(TestHive.scala:192) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) > at > org.apache.spark.sql.hive.test.TestHiveSparkSession.loadTestTable(TestHive.scala:376) > at > org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462) > at > 
org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed$lzycompute(TestHive.scala:462) > at > org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed(TestHive.scala:450) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682) > ... 42 elided -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15326) Doing multiple unions on a Dataframe will result in a very inefficient query plan
[ https://issues.apache.org/jira/browse/SPARK-15326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343175#comment-15343175 ] Herman van Hovell commented on SPARK-15326: --- So we have code that will flatten nested Unions. For example, {{UNION(UNION(a, b), UNION(c, d))}} will be converted into {{UNION(a, b, c, d)}}. This case, however, is a bit different. You are joining the {{right}} table twice here: once to the skewed values and once to the regular values, and then you are unioning the results back together. So you are incorporating the right-hand side twice, which will cause a nice blow-up if you do this a few times. You may be able to rewrite this without the union by doing something along the lines of {{select * from right left join broadcast(skewed) left join not-skewed}}. > Doing multiple unions on a Dataframe will result in a very inefficient query > plan > - > > Key: SPARK-15326 > URL: https://issues.apache.org/jira/browse/SPARK-15326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Jurriaan Pruis > Attachments: Query Plan.pdf, skewed_join.py, skewed_join_plan.txt > > > While working with a very skewed dataset I noticed that repeated unions on a > dataframe will result in a query plan with 2^(union) - 1 unions. With large > datasets this will be very inefficient. 
> I tried to replicate this behaviour using a PySpark example (I've attached > the output of the explain() to this JIRA): > {code} > df = sqlCtx.range(1000) > def r(name, max_val=100): > return F.round(F.lit(max_val) * F.pow(F.rand(), > 4)).cast('integer').alias(name) > # Create a skewed dataset > skewed = df.select('id', r('a'), r('b'), r('c'), r('d'), r('e'), r('f')) > # Find the skewed values in the dataset > top_10_percent = skewed.freqItems(['a', 'b', 'c', 'd', 'e', 'f'], > 0.10).collect()[0] > def skewjoin(skewed, right, column, freqItems): > freqItems = freqItems[column + '_freqItems'] > skewed = skewed.alias('skewed') > cond = F.col(column).isin(freqItems) > # First broadcast join the frequent (skewed) values > filtered = skewed.filter(cond).join(F.broadcast(right.filter(cond)), > column, 'left_outer') > # Use a regular join for the non skewed values (with big tables this will > use a SortMergeJoin) > non_skewed = skewed.filter(cond == False).join(right.filter(cond == > False), column, 'left_outer') > # join them together and replace the column with the column found in the > right DataFrame > return filtered.unionAll(non_skewed).select('skewed.*', > right['id'].alias(column + '_key')).drop(column) > # Create the dataframes that will be joined to the skewed dataframe > right_size = 100 > df_a = sqlCtx.range(right_size).select('id', F.col('id').alias('a')) > df_b = sqlCtx.range(right_size).select('id', F.col('id').alias('b')) > df_c = sqlCtx.range(right_size).select('id', F.col('id').alias('c')) > df_d = sqlCtx.range(right_size).select('id', F.col('id').alias('d')) > df_e = sqlCtx.range(right_size).select('id', F.col('id').alias('e')) > df_f = sqlCtx.range(right_size).select('id', F.col('id').alias('f')) > # Join everything together > df = skewed > df = skewjoin(df, df_a, 'a', top_10_percent) > df = skewjoin(df, df_b, 'b', top_10_percent) > df = skewjoin(df, df_c, 'c', top_10_percent) > df = skewjoin(df, df_d, 'd', top_10_percent) > df = skewjoin(df, df_e, 
'e', top_10_percent) > df = skewjoin(df, df_f, 'f', top_10_percent) > # df.explain() shows the plan where it does 63 unions > (2^(number_of_skewjoins) - 1) > # which will be very inefficient and slow > df.explain(True) > # Evaluate the plan > # You'd expect this to return 1000, but it does not, it returned 1140 > on my system > # (probably because it will recalculate the random columns? Not sure though) > print(df.count()) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
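The unexpected {{count()}} (1140 instead of 1000) is consistent with the reporter's suspicion that the {{rand()}}-derived columns are re-evaluated independently in the two filter branches of each skewjoin: if a row's value is re-drawn per branch, the row can satisfy both {{cond}} and its negation, or neither. A tiny illustration in plain Python (not Spark code):

```python
# Each tuple holds the value one row's random column takes when it is
# re-evaluated in branch 1 (cond) and in branch 2 (~cond) respectively.
rows = [
    (0.3, 0.7),  # < 0.5 in branch 1 AND >= 0.5 in branch 2: counted twice
    (0.6, 0.2),  # fails the predicate in both branches: never counted
]

branch1 = sum(1 for d1, _ in rows if d1 < 0.5)
branch2 = sum(1 for _, d2 in rows if d2 >= 0.5)
per_row_counts = [(d1 < 0.5) + (d2 >= 0.5) for d1, d2 in rows]

print(branch1 + branch2)  # 2: the total happens to match here, but...
print(per_row_counts)     # [2, 0]: one row was duplicated, one dropped
```

Materializing the random columns once (for example, persisting the skewed dataframe before branching) is the usual way to pin the drawn values, though that is a general suggestion rather than something established in this thread.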
[jira] [Created] (SPARK-16124) Throws exception when executing query on `build/sbt hive/console`
MIN-FU YANG created SPARK-16124: --- Summary: Throws exception when executing query on `build/sbt hive/console` Key: SPARK-16124 URL: https://issues.apache.org/jira/browse/SPARK-16124 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: MIN-FU YANG Priority: Minor When I execute `val query = sql("SELECT * FROM src WHERE key = 92 ")` on hive console which is from `build/sbt hive/console`, It throws exception: org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: file:/Users/xxx/git/spark/sql/hive/target/scala-2.11/spark-hive_2.11-2.0.0-SNAPSHOT.jar!/data/files/kv1.txt; at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:242) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:128) at org.apache.spark.sql.hive.test.TestHiveSparkSession$SqlCmd$$anonfun$cmd$1.apply$mcV$sp(TestHive.scala:192) at org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376) at org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) at org.apache.spark.sql.hive.test.TestHiveSparkSession.loadTestTable(TestHive.scala:376) at org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462) at org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462) at scala.collection.immutable.List.foreach(List.scala:381) at 
org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed$lzycompute(TestHive.scala:462) at org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed(TestHive.scala:450) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682) ... 42 elided -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343029#comment-15343029 ] Wenchen Fan edited comment on SPARK-16032 at 6/22/16 1:15 AM: -- I think it doesn't make sense to use `partitionBy` with `insertInto`, as we cannot map `DataFrameWriter.insertInto` to SQL INSERT for two reasons: 1. `DataFrameWriter` doesn't support static partitions. 2. `DataFrameWriter` specifies the partition columns of the data being inserted, not of the table being inserted into. And it's already broken (mostly) in 1.6: according to the test cases at https://gist.github.com/cloud-fan/14ada3f2b3225b5db52ccaa12aacfbd4 , the only case that seems reasonable in 1.6 is when the data to insert has the same schema as the target table and `partitionBy` specifies the correct partition columns. But I think it's worth breaking this to make the overall semantics clearer. Maybe we are wrong; it would be good if we could come up with clean semantics that explain the behavior of `DataFrame.insertInto`, but after spending a lot of time on it we failed, and that's why we want to make these changes and rush them into 2.0. was (Author: cloud_fan): I think it's nonsense to use `partitionBy` with `insertInto`, as we can not map `DataFrameWriter.insertInto` to SQL INSERT for 2 reasons: 1. `DataFrameWriter` doesn't support static partition 2. `DataFrameWriter` specifies the partition columns of the data to insert, not the table to be inserted. And it's already broken(mostly) in 1.6, according to the test cases at https://gist.github.com/cloud-fan/14ada3f2b3225b5db52ccaa12aacfbd4 , the only case that seems reasonable in 1.6 is when the data to insert has same schema with the table to be inserted and the `partitionBy` specifies the correct partition columns. But I think it's worth to break it and make the overall semantics more clear. 
Maybe we are wrong, it will be good if we come up with a clean semantics to explain the behavior of `DataFrame.insertInto`, but after spent a lot of time on it, we failed, and that's why we wanna make these changes and rush in into 2.0. > Audit semantics of various insertion operations related to partitioned tables > - > > Key: SPARK-16032 > URL: https://issues.apache.org/jira/browse/SPARK-16032 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan >Priority: Critical > Attachments: [SPARK-16032] Spark SQL table insertion auditing - > Google Docs.pdf > > > We found that the semantics of various insertion operations related to partitioned > tables can be inconsistent. This is an umbrella ticket for all related > tickets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
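The second point above can be made concrete: `insertInto` resolves the incoming columns positionally against the target table's schema rather than by name, so the column names `partitionBy` refers to say nothing about where the data actually lands. A hypothetical, Spark-free sketch of positional resolution (the variable names are illustrative):

```python
# Hypothetical illustration: insertInto-style resolution is positional,
# so reordering the incoming columns silently changes which table column
# receives which data; naming partition columns cannot fix that.
table_schema = ["id", "data", "part"]  # target table's column order
incoming     = ["part", "id", "data"]  # DataFrame's column order

# positional mapping: table column -> incoming column that feeds it
positional = dict(zip(table_schema, incoming))
print(positional)  # {'id': 'part', 'data': 'id', 'part': 'data'}
```

Every table column here silently receives the wrong data, which is why mapping this API onto SQL INSERT (where static partitions are named) has no clean semantics.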
[jira] [Assigned] (SPARK-16123) Avoid NegativeArraySizeException while reserving additional capacity in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-16123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16123: Assignee: (was: Apache Spark) > Avoid NegativeArraySizeException while reserving additional capacity in > VectorizedColumnReader > -- > > Key: SPARK-16123 > URL: https://issues.apache.org/jira/browse/SPARK-16123 > Project: Spark > Issue Type: Bug >Reporter: Sameer Agarwal > > Both off-heap and on-heap variants of ColumnVector.reserve() can > unfortunately overflow while reserving additional capacity during reads. > {code} > Caused by: java.lang.NegativeArraySizeException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserveInternal(OnHeapColumnVector.java:461) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserve(OnHeapColumnVector.java:397) > at > org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:675) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:389) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:372) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:194) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:169) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16123) Avoid NegativeArraySizeException while reserving additional capacity in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-16123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16123: Assignee: Apache Spark > Avoid NegativeArraySizeException while reserving additional capacity in > VectorizedColumnReader > -- > > Key: SPARK-16123 > URL: https://issues.apache.org/jira/browse/SPARK-16123 > Project: Spark > Issue Type: Bug >Reporter: Sameer Agarwal >Assignee: Apache Spark > > Both off-heap and on-heap variants of ColumnVector.reserve() can > unfortunately overflow while reserving additional capacity during reads. > {code} > Caused by: java.lang.NegativeArraySizeException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserveInternal(OnHeapColumnVector.java:461) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserve(OnHeapColumnVector.java:397) > at > org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:675) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:389) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:372) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:194) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:169) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16123) Avoid NegativeArraySizeException while reserving additional capacity in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-16123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343139#comment-15343139 ] Apache Spark commented on SPARK-16123: -- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/13832 > Avoid NegativeArraySizeException while reserving additional capacity in > VectorizedColumnReader > -- > > Key: SPARK-16123 > URL: https://issues.apache.org/jira/browse/SPARK-16123 > Project: Spark > Issue Type: Bug >Reporter: Sameer Agarwal > > Both off-heap and on-heap variants of ColumnVector.reserve() can > unfortunately overflow while reserving additional capacity during reads. > {code} > Caused by: java.lang.NegativeArraySizeException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserveInternal(OnHeapColumnVector.java:461) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserve(OnHeapColumnVector.java:397) > at > org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:675) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:389) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:372) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:194) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:169) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16123) Avoid NegativeArraySizeException while reserving additional capacity in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-16123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal updated SPARK-16123: --- Description: Both off-heap and on-heap variants of ColumnVector.reserve() can unfortunately overflow while reserving additional capacity during reads. {code} Caused by: java.lang.NegativeArraySizeException at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserveInternal(OnHeapColumnVector.java:461) at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserve(OnHeapColumnVector.java:397) at org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:675) at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:389) at org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167) at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:372) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:194) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:173) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:169) {code} > Avoid NegativeArraySizeException while reserving additional capacity in > 
VectorizedColumnReader > -- > > Key: SPARK-16123 > URL: https://issues.apache.org/jira/browse/SPARK-16123 > Project: Spark > Issue Type: Bug >Reporter: Sameer Agarwal > > Both off-heap and on-heap variants of ColumnVector.reserve() can > unfortunately overflow while reserving additional capacity during reads. > {code} > Caused by: java.lang.NegativeArraySizeException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserveInternal(OnHeapColumnVector.java:461) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserve(OnHeapColumnVector.java:397) > at > org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:675) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:389) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:372) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:194) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:169) > {code} -- This message was sent by 
Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
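The NegativeArraySizeException above is the classic symptom of a geometric capacity-doubling strategy overflowing Java's 32-bit int: the doubled size wraps negative and array allocation rejects it. A Spark-free Python sketch that emulates the wraparound and shows the usual style of fix (a sketch only, not necessarily the actual patch):

```python
# Emulate Java's 32-bit int arithmetic to show how a capacity-doubling
# strategy (as used when reserving more room in a column vector) can wrap
# negative, which array allocation then rejects with
# NegativeArraySizeException.
INT_MAX = 2**31 - 1

def to_int32(n: int) -> int:
    """Wrap an arbitrary integer into Java's signed 32-bit int range."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n > INT_MAX else n

def naive_new_capacity(required: int) -> int:
    # doubling without an overflow check, as in the reported bug
    return to_int32(required * 2)

def clamped_new_capacity(required: int) -> int:
    # sketch of the usual fix: grow geometrically but never past INT_MAX
    return min(required * 2, INT_MAX)

big = 1_200_000_000  # a large but legal array size
print(naive_new_capacity(big))    # -1894967296: would trigger the exception
print(clamped_new_capacity(big))  # 2147483647
```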
[jira] [Updated] (SPARK-16102) Use Record API from Univocity rather than current data cast API.
[ https://issues.apache.org/jira/browse/SPARK-16102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-16102: - Affects Version/s: 2.0.0 > Use Record API from Univocity rather than current data cast API. > > > Key: SPARK-16102 > URL: https://issues.apache.org/jira/browse/SPARK-16102 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > There is a Record API for the Univocity parser. > This API provides typed data, whereas Spark currently tries to compare and cast each > value. > Using this API should reduce the code in Spark and may improve > performance. > It seems a benchmark should be conducted first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16123) Avoid NegativeArraySizeException while reserving additional capacity in VectorizedColumnReader
Sameer Agarwal created SPARK-16123: -- Summary: Avoid NegativeArraySizeException while reserving additional capacity in VectorizedColumnReader Key: SPARK-16123 URL: https://issues.apache.org/jira/browse/SPARK-16123 Project: Spark Issue Type: Bug Reporter: Sameer Agarwal -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly
[ https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343121#comment-15343121 ] MIN-FU YANG commented on SPARK-14172: - I cannot reproduce it on 1.6.1 either. Could you give a more detailed description? > Hive table partition predicate not passed down correctly > > > Key: SPARK-14172 > URL: https://issues.apache.org/jira/browse/SPARK-14172 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Yingji Zhang >Priority: Critical > > When the Hive SQL contains nondeterministic expressions, the Spark plan will not push > down the partition predicate to the HiveTableScan. For example: > {code} > -- consider the following query, which uses a random function to sample rows > SELECT * > FROM table_a > WHERE partition_col = 'some_value' > AND rand() < 0.01; > {code} > The Spark plan will not push down the partition predicate to the HiveTableScan, > which ends up scanning all partition data from the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14172) Hive table partition predicate not passed down correctly
[ https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343121#comment-15343121 ] MIN-FU YANG edited comment on SPARK-14172 at 6/22/16 12:48 AM: --- I cannot reproduce it on 1.6.1 either. Could you give a more detailed description, or verify it? was (Author: lucasmf): I cannot reproduce it on 1.6.1 either. Could you give more detailed description? > Hive table partition predicate not passed down correctly > > > Key: SPARK-14172 > URL: https://issues.apache.org/jira/browse/SPARK-14172 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Yingji Zhang >Priority: Critical > > When the Hive SQL contains nondeterministic expressions, the Spark plan will not push > down the partition predicate to the HiveTableScan. For example: > {code} > -- consider the following query, which uses a random function to sample rows > SELECT * > FROM table_a > WHERE partition_col = 'some_value' > AND rand() < 0.01; > {code} > The Spark plan will not push down the partition predicate to the HiveTableScan, > which ends up scanning all partition data from the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
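The behavior one would expect from the optimizer is a split of the WHERE conjunction: deterministic conjuncts (such as the partition predicate) get pushed into the scan, while nondeterministic ones ({{rand()}}) stay above it. The report is that the nondeterministic conjunct currently blocks pushing the deterministic one as well. An illustrative, Spark-free sketch of that split:

```python
# Illustrative sketch: split a WHERE conjunction so deterministic conjuncts
# (like partition predicates) can be pushed into the scan while
# nondeterministic ones (rand()) must stay above it.
predicates = [
    ("partition_col = 'some_value'", True),   # deterministic: pushable
    ("rand() < 0.01", False),                 # nondeterministic: not pushable
]

pushed   = [p for p, deterministic in predicates if deterministic]
residual = [p for p, deterministic in predicates if not deterministic]

print(pushed)    # ["partition_col = 'some_value'"]
print(residual)  # ['rand() < 0.01']
```

Until this is fixed, a workaround along these lines is to apply the partition filter in an inner query and the {{rand()}} sampling in an outer query; that is an assumption about a viable rewrite, not something verified in this thread.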
[jira] [Assigned] (SPARK-16119) Support "DROP TABLE ... PURGE" if Hive client supports it
[ https://issues.apache.org/jira/browse/SPARK-16119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16119: Assignee: (was: Apache Spark) > Support "DROP TABLE ... PURGE" if Hive client supports it > - > > Key: SPARK-16119 > URL: https://issues.apache.org/jira/browse/SPARK-16119 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > > There's currently code that explicitly disables the "PURGE" flag when > dropping a table: > {code} > if (ctx.PURGE != null) { > throw operationNotAllowed("DROP TABLE ... PURGE", ctx) > } > {code} > That flag is necessary in certain situations where the table data cannot be > moved to the trash (which will be tried unless "PURGE" is requested). If the > client supports it (Hive >= 0.14.0 according to the Hive docs), we should > allow that option to be defined. > For non-Hive tables, as far as I can understand, "PURGE" is the current > behavior of Spark. > The same limitation exists currently for "ALTER TABLE ... DROP PARTITION" so > should probably also be covered by this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16119) Support "DROP TABLE ... PURGE" if Hive client supports it
[ https://issues.apache.org/jira/browse/SPARK-16119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343106#comment-15343106 ] Apache Spark commented on SPARK-16119: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/13831 > Support "DROP TABLE ... PURGE" if Hive client supports it > - > > Key: SPARK-16119 > URL: https://issues.apache.org/jira/browse/SPARK-16119 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > > There's currently code that explicitly disables the "PURGE" flag when > dropping a table: > {code} > if (ctx.PURGE != null) { > throw operationNotAllowed("DROP TABLE ... PURGE", ctx) > } > {code} > That flag is necessary in certain situations where the table data cannot be > moved to the trash (which will be tried unless "PURGE" is requested). If the > client supports it (Hive >= 0.14.0 according to the Hive docs), we should > allow that option to be defined. > For non-Hive tables, as far as I can understand, "PURGE" is the current > behavior of Spark. > The same limitation exists currently for "ALTER TABLE ... DROP PARTITION" so > should probably also be covered by this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16119) Support "DROP TABLE ... PURGE" if Hive client supports it
[ https://issues.apache.org/jira/browse/SPARK-16119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16119: Assignee: Apache Spark > Support "DROP TABLE ... PURGE" if Hive client supports it > - > > Key: SPARK-16119 > URL: https://issues.apache.org/jira/browse/SPARK-16119 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > There's currently code that explicitly disables the "PURGE" flag when > dropping a table: > {code} > if (ctx.PURGE != null) { > throw operationNotAllowed("DROP TABLE ... PURGE", ctx) > } > {code} > That flag is necessary in certain situations where the table data cannot be > moved to the trash (which will be tried unless "PURGE" is requested). If the > client supports it (Hive >= 0.14.0 according to the Hive docs), we should > allow that option to be defined. > For non-Hive tables, as far as I can understand, "PURGE" is the current > behavior of Spark. > The same limitation exists currently for "ALTER TABLE ... DROP PARTITION" so > should probably also be covered by this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16121) ListingFileCatalog does not list in parallel anymore
[ https://issues.apache.org/jira/browse/SPARK-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16121: Assignee: Apache Spark > ListingFileCatalog does not list in parallel anymore > > > Key: SPARK-16121 > URL: https://issues.apache.org/jira/browse/SPARK-16121 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Blocker > > In ListingFileCatalog, the implementation of {{listLeafFiles}} is shown > below. When the number of user-provided paths is less than the value of > {{sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold}}, we > will not use parallel listing, which is different from what 1.6 does (for > 1.6, if the number of children of any inner dir is larger than the threshold, > we will use the parallel listing). > {code} > protected def listLeafFiles(paths: Seq[Path]): > mutable.LinkedHashSet[FileStatus] = { > if (paths.length >= > sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) { > HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, > sparkSession) > } else { > // Dummy jobconf to get to the pathFilter defined in configuration > val jobConf = new JobConf(hadoopConf, this.getClass) > val pathFilter = FileInputFormat.getInputPathFilter(jobConf) > val statuses: Seq[FileStatus] = paths.flatMap { path => > val fs = path.getFileSystem(hadoopConf) > logInfo(s"Listing $path on driver") > Try { > HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), > pathFilter) > }.getOrElse(Array.empty[FileStatus]) > } > mutable.LinkedHashSet(statuses: _*) > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
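The behavioral difference described above — 2.0 decides on parallel listing from the number of top-level paths only, while 1.6 also went parallel when any inner directory had more children than the threshold — can be sketched in plain Python (directory dictionaries stand in for HDFS paths; the function names are invented for illustration):

```python
# Sketch: contrast the two listing strategies. In 2.0's ListingFileCatalog
# only the count of user-provided paths is compared to the threshold; in
# 1.6 the width of any inner directory could also trigger parallel listing.
# Names and the tree representation are illustrative.

def should_parallelize_v20(paths, threshold):
    return len(paths) >= threshold

def should_parallelize_v16(paths, children_of, threshold):
    # children_of maps a directory to its child directories.
    def too_wide(p):
        kids = children_of.get(p, [])
        return len(kids) > threshold or any(too_wide(k) for k in kids)
    return len(paths) > threshold or any(too_wide(p) for p in paths)

# One table root with 100 partition directories underneath it.
tree = {"/table": [f"/table/part={i}" for i in range(100)]}
```

With a single table root and 100 partition directories, the 2.0 logic lists sequentially while the 1.6 logic would have gone parallel — which is exactly the regression this issue reports.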
[jira] [Assigned] (SPARK-16121) ListingFileCatalog does not list in parallel anymore
[ https://issues.apache.org/jira/browse/SPARK-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16121: Assignee: (was: Apache Spark) > ListingFileCatalog does not list in parallel anymore > > > Key: SPARK-16121 > URL: https://issues.apache.org/jira/browse/SPARK-16121 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Blocker > > In ListingFileCatalog, the implementation of {{listLeafFiles}} is shown > below. When the number of user-provided paths is less than the value of > {{sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold}}, we > will not use parallel listing, which is different from what 1.6 does (for > 1.6, if the number of children of any inner dir is larger than the threshold, > we will use the parallel listing). > {code} > protected def listLeafFiles(paths: Seq[Path]): > mutable.LinkedHashSet[FileStatus] = { > if (paths.length >= > sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) { > HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, > sparkSession) > } else { > // Dummy jobconf to get to the pathFilter defined in configuration > val jobConf = new JobConf(hadoopConf, this.getClass) > val pathFilter = FileInputFormat.getInputPathFilter(jobConf) > val statuses: Seq[FileStatus] = paths.flatMap { path => > val fs = path.getFileSystem(hadoopConf) > logInfo(s"Listing $path on driver") > Try { > HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), > pathFilter) > }.getOrElse(Array.empty[FileStatus]) > } > mutable.LinkedHashSet(statuses: _*) > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16121) ListingFileCatalog does not list in parallel anymore
[ https://issues.apache.org/jira/browse/SPARK-16121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343069#comment-15343069 ] Apache Spark commented on SPARK-16121: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/13830 > ListingFileCatalog does not list in parallel anymore > > > Key: SPARK-16121 > URL: https://issues.apache.org/jira/browse/SPARK-16121 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Blocker > > In ListingFileCatalog, the implementation of {{listLeafFiles}} is shown > below. When the number of user-provided paths is less than the value of > {{sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold}}, we > will not use parallel listing, which is different from what 1.6 does (for > 1.6, if the number of children of any inner dir is larger than the threshold, > we will use the parallel listing). > {code} > protected def listLeafFiles(paths: Seq[Path]): > mutable.LinkedHashSet[FileStatus] = { > if (paths.length >= > sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) { > HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, > sparkSession) > } else { > // Dummy jobconf to get to the pathFilter defined in configuration > val jobConf = new JobConf(hadoopConf, this.getClass) > val pathFilter = FileInputFormat.getInputPathFilter(jobConf) > val statuses: Seq[FileStatus] = paths.flatMap { path => > val fs = path.getFileSystem(hadoopConf) > logInfo(s"Listing $path on driver") > Try { > HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), > pathFilter) > }.getOrElse(Array.empty[FileStatus]) > } > mutable.LinkedHashSet(statuses: _*) > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten
[ https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16071: Assignee: Apache Spark > Not sufficient array size checks to avoid integer overflows in Tungsten > --- > > Key: SPARK-16071 > URL: https://issues.apache.org/jira/browse/SPARK-16071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark >Priority: Critical > > Several bugs have been found caused by integer overflows in Tungsten. This > JIRA is for taking a final pass before 2.0 release to reduce potential bugs > and issues. We should do at least the following: > * Raise exception early instead of later throwing NegativeArraySize (which is > slow and might cause silent errors) > * Document clearly the largest array size we support in DataFrames. > To reproduce one of the issues: > {code} > val n = 1e8.toInt // try 2e8, 3e8 > sc.parallelize(0 until 1, 1).map(i => new > Array[Int](n)).toDS.map(_.size).show() > {code} > Result: > * n=1e8: correct but slow (see SPARK-16043) > * n=2e8: NegativeArraySize exception > {code:none} > java.lang.NegativeArraySizeException > at > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123) > at > org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > 
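The NegativeArraySizeException above comes from computing a buffer size in 32-bit signed arithmetic: once the required byte count passes 2^31 - 1, it wraps negative and the JVM's array allocation fails. A Python sketch of the wraparound and of the early check the issue proposes (the element sizes are illustrative; the real arithmetic happens inside code like BufferHolder.grow):

```python
# Sketch: emulate Java's 32-bit signed int to show how a large buffer
# size wraps negative, and raise a descriptive error early instead of
# letting the JVM throw NegativeArraySizeException later.

JVM_INT_MAX = 2**31 - 1  # largest Java int, and an upper bound on array length

def to_int32(x):
    # Wrap x the way Java int arithmetic would.
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

def checked_buffer_size(num_elements, bytes_per_element):
    needed = num_elements * bytes_per_element  # exact in Python
    if needed > JVM_INT_MAX:
        # Early, descriptive failure instead of a negative array size.
        raise ValueError(f"buffer of {needed} bytes exceeds JVM int limit")
    return needed

# 3e8 elements at 8 bytes each overflows a Java int and goes negative:
wrapped = to_int32(300_000_000 * 8)
```

This is the "raise exception early" item from the description: validate the exact size before allocation rather than after 32-bit arithmetic has already wrapped.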
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > * n=3e8: NegativeArraySize exception but raised at a different location > {code:none} > java.lang.RuntimeException: Error while encoding: > java.lang.NegativeArraySizeException > newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS > value#108 > +- newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) >+- input[0, [I, true] > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:257) > at > org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) > at > org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > 
at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at >
[jira] [Assigned] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten
[ https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16071: Assignee: (was: Apache Spark) > Not sufficient array size checks to avoid integer overflows in Tungsten > --- > > Key: SPARK-16071 > URL: https://issues.apache.org/jira/browse/SPARK-16071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Critical > > Several bugs have been found caused by integer overflows in Tungsten. This > JIRA is for taking a final pass before 2.0 release to reduce potential bugs > and issues. We should do at least the following: > * Raise exception early instead of later throwing NegativeArraySize (which is > slow and might cause silent errors) > * Document clearly the largest array size we support in DataFrames. > To reproduce one of the issues: > {code} > val n = 1e8.toInt // try 2e8, 3e8 > sc.parallelize(0 until 1, 1).map(i => new > Array[Int](n)).toDS.map(_.size).show() > {code} > Result: > * n=1e8: correct but slow (see SPARK-16043) > * n=2e8: NegativeArraySize exception > {code:none} > java.lang.NegativeArraySizeException > at > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123) > at > org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > * n=3e8: NegativeArraySize exception but raised at a different location > {code:none} > java.lang.RuntimeException: Error while encoding: > java.lang.NegativeArraySizeException > newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS > value#108 > +- newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) >+- input[0, [I, true] > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:257) > at > org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) > at > org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > 
at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at >
[jira] [Commented] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten
[ https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343067#comment-15343067 ] Apache Spark commented on SPARK-16071: -- User 'clockfly' has created a pull request for this issue: https://github.com/apache/spark/pull/13829 > Not sufficient array size checks to avoid integer overflows in Tungsten > --- > > Key: SPARK-16071 > URL: https://issues.apache.org/jira/browse/SPARK-16071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Critical > > Several bugs have been found caused by integer overflows in Tungsten. This > JIRA is for taking a final pass before 2.0 release to reduce potential bugs > and issues. We should do at least the following: > * Raise exception early instead of later throwing NegativeArraySize (which is > slow and might cause silent errors) > * Document clearly the largest array size we support in DataFrames. > To reproduce one of the issues: > {code} > val n = 1e8.toInt // try 2e8, 3e8 > sc.parallelize(0 until 1, 1).map(i => new > Array[Int](n)).toDS.map(_.size).show() > {code} > Result: > * n=1e8: correct but slow (see SPARK-16043) > * n=2e8: NegativeArraySize exception > {code:none} > java.lang.NegativeArraySizeException > at > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123) > at > org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > * n=3e8: NegativeArraySize exception but raised at a different location > {code:none} > java.lang.RuntimeException: Error while encoding: > java.lang.NegativeArraySizeException > newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS > value#108 > +- newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) >+- input[0, [I, true] > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:257) > at > org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) > at > org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at >
[jira] [Created] (SPARK-16122) Spark History Server REST API missing an environment endpoint per application
Neelesh Srinivas Salian created SPARK-16122: --- Summary: Spark History Server REST API missing an environment endpoint per application Key: SPARK-16122 URL: https://issues.apache.org/jira/browse/SPARK-16122 Project: Spark Issue Type: New Feature Components: Documentation, Web UI Affects Versions: 1.6.1 Reporter: Neelesh Srinivas Salian Priority: Minor The Web UI for the Spark History Server has an Environment tab that allows you to view the environment for that application, with runtime information, Spark properties, etc. How about adding an endpoint to the REST API that points to this Environment tab for that application? /applications/[app-id]/environment I added the Documentation component too, so that a subsequent documentation change can get this included in the API docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
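For context, a hedged sketch of how a client would address the proposed endpoint. The endpoint does not exist yet; the URL shape merely follows the existing History Server REST API layout under /api/v1:

```python
# Sketch: build the URL for the *proposed* per-application environment
# endpoint. Purely illustrative -- the endpoint is the subject of this
# feature request, not an existing API.

def environment_url(history_server, app_id):
    return f"{history_server}/api/v1/applications/{app_id}/environment"

# Hypothetical History Server address and application id:
url = environment_url("http://localhost:18080", "app-20160621-0001")
```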
[jira] [Commented] (SPARK-15643) ML 2.0 QA: migration guide update
[ https://issues.apache.org/jira/browse/SPARK-15643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343059#comment-15343059 ] Joseph K. Bradley commented on SPARK-15643: --- A few more deprecations to add from current PRs: * [https://github.com/apache/spark/pull/13823] * [https://github.com/apache/spark/pull/13380] > ML 2.0 QA: migration guide update > - > > Key: SPARK-15643 > URL: https://issues.apache.org/jira/browse/SPARK-15643 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, MLlib >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Blocker > > Update spark.ml and spark.mllib migration guide from 1.6 to 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343036#comment-15343036 ] Wenchen Fan commented on SPARK-16032: - [~rdblue] I think the biggest problem is that we don't have by-name insertion for Hive tables now. Although this feature is not in 1.6, we did have it before our recent commits. Does it make more sense to support Hive tables in `DataFrameWriter.saveAsTable` before 2.0? > Audit semantics of various insertion operations related to partitioned tables > - > > Key: SPARK-16032 > URL: https://issues.apache.org/jira/browse/SPARK-16032 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan >Priority: Critical > Attachments: [SPARK-16032] Spark SQL table insertion auditing - > Google Docs.pdf > > > We found that semantics of various insertion operations related to partition > tables can be inconsistent. This is an umbrella ticket for all related > tickets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343029#comment-15343029 ] Wenchen Fan commented on SPARK-16032: - I think it makes no sense to use `partitionBy` with `insertInto`, as we cannot map `DataFrameWriter.insertInto` to SQL INSERT, for 2 reasons: 1. `DataFrameWriter` doesn't support static partitions. 2. `DataFrameWriter` specifies the partition columns of the data to insert, not those of the table being inserted into. And it's already (mostly) broken in 1.6: according to the test cases at https://gist.github.com/cloud-fan/14ada3f2b3225b5db52ccaa12aacfbd4 , the only case that seems reasonable in 1.6 is when the data to insert has the same schema as the target table and `partitionBy` specifies the correct partition columns. But I think it's worth breaking it to make the overall semantics clearer. Maybe we are wrong; it would be good if we could come up with clean semantics that explain the behavior of `DataFrame.insertInto`, but after spending a lot of time on it we failed, and that's why we want to make these changes and rush them into 2.0. > Audit semantics of various insertion operations related to partitioned tables > - > > Key: SPARK-16032 > URL: https://issues.apache.org/jira/browse/SPARK-16032 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan >Priority: Critical > Attachments: [SPARK-16032] Spark SQL table insertion auditing - > Google Docs.pdf > > > We found that semantics of various insertion operations related to partition > tables can be inconsistent. This is an umbrella ticket for all related > tickets. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
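The distinction the thread is debating — positional column resolution for `insertInto` versus by-name resolution — can be sketched in plain Python (dictionaries stand in for DataFrames; none of this is Spark's actual implementation):

```python
# Sketch: positional vs by-name column resolution when writing rows into
# a table whose partition column comes last. Illustrative only; the real
# resolution happens inside Spark SQL's analyzer.

table_cols = ["a", "b", "part"]  # hypothetical target table schema

def insert_by_position(df_cols, row):
    # insertInto-style: trust the order of the incoming columns,
    # ignoring their names entirely.
    return dict(zip(table_cols, row))

def insert_by_name(df_cols, row):
    # by-name style: match incoming columns to table columns by name.
    by_name = dict(zip(df_cols, row))
    return {c: by_name[c] for c in table_cols}

# Incoming data ordered (part, a, b) -- a different order than the table.
row = (2016, 1, "x")
positional = insert_by_position(["part", "a", "b"], row)
named = insert_by_name(["part", "a", "b"], row)
```

Positional resolution silently misplaces every value when the incoming order differs from the table's, while by-name resolution lands each value in the right column — which is why losing by-name insertion for Hive tables is the sticking point here.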
[jira] [Resolved] (SPARK-16117) Hide LibSVMFileFormat in public API docs
[ https://issues.apache.org/jira/browse/SPARK-16117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16117. - Resolution: Fixed Fix Version/s: 2.0.0 > Hide LibSVMFileFormat in public API docs > > > Key: SPARK-16117 > URL: https://issues.apache.org/jira/browse/SPARK-16117 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 2.0.0 > > > LibSVMFileFormat implements data source for LIBSVM format. However, users do > not need to call its APIs to use it. So we should hide it in the public API > docs. The main issue is that we still need to put the documentation and > example code somewhere. The proposal it to have a dummy object to hold the > documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16118) getDropLast is missing in OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-16118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-16118. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13821 [https://github.com/apache/spark/pull/13821] > getDropLast is missing in OneHotEncoder > --- > > Key: SPARK-16118 > URL: https://issues.apache.org/jira/browse/SPARK-16118 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 2.0.0 > > > We forgot the getter of dropLast in OneHotEncoder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16119) Support "DROP TABLE ... PURGE" if Hive client supports it
[ https://issues.apache.org/jira/browse/SPARK-16119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-16119: --- Description: There's currently code that explicitly disables the "PURGE" flag when dropping a table: {code} if (ctx.PURGE != null) { throw operationNotAllowed("DROP TABLE ... PURGE", ctx) } {code} That flag is necessary in certain situations where the table data cannot be moved to the trash (which will be tried unless "PURGE" is requested). If the client supports it (Hive >= 0.14.0 according to the Hive docs), we should allow that option to be defined. For non-Hive tables, as far as I can understand, "PURGE" is the current behavior of Spark. The same limitation exists currently for "ALTER TABLE ... DROP PARTITION" so should probably also be covered by this change. was: There's currently code that explicitly disables the "PURGE" flag when dropping a table: {code} if (ctx.PURGE != null) { throw operationNotAllowed("DROP TABLE ... PURGE", ctx) } {code} That flag is necessary in certain situations where the table data cannot be moved to the trash (which will be tried unless "PURGE" is requested). If the client supports it (Hive >= 0.14.0 according to the Hive docs), we should allow that option to be defined. For non-Hive tables, as far as I can understand, "PURGE" is the current behavior of Spark. The same limitation exists currently for "ALTER TABLE ... DROP PARTITION" so should probably be covered. > Support "DROP TABLE ... PURGE" if Hive client supports it > - > > Key: SPARK-16119 > URL: https://issues.apache.org/jira/browse/SPARK-16119 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > > There's currently code that explicitly disables the "PURGE" flag when > dropping a table: > {code} > if (ctx.PURGE != null) { > throw operationNotAllowed("DROP TABLE ... 
PURGE", ctx) > } > {code} > That flag is necessary in certain situations where the table data cannot be > moved to the trash (which will be tried unless "PURGE" is requested). If the > client supports it (Hive >= 0.14.0 according to the Hive docs), we should > allow that option to be defined. > For non-Hive tables, as far as I can understand, "PURGE" is the current > behavior of Spark. > The same limitation exists currently for "ALTER TABLE ... DROP PARTITION" so > should probably also be covered by this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342963#comment-15342963 ] Ryan Blue commented on SPARK-16032: --- I'm referring to disabling the use of {{partitionBy}} with {{insertInto}} and removing support for {{saveAsTable}} (from the doc: "I think it's fine we drop the support in 2.0"). In 1.6, {{partitionBy}} can be used to set up partition columns as they are expected by {{insertInto}}. What change caused the confusing and inconsistent behavior? Before this set of changes, {{partitionBy}} was validated against the table's partitioning (at least for Hive) like this change set suggests doing when it is used with {{saveAsTable}}. The insert SQL case is inconsistent, but there are other ways to solve that problem. bq. I do not think we should ship 2.0 without fixing these behaviors and try to fix them in future releases (the fix will possibly change the behaviors again). I know that we want to get this out, but I don't think it is a good idea to put it in 2.0 before it's ready. This codifies that the "correct" way to write to a Hive table is to put the partition columns at the end rather than explicitly marking them, and it disallows marking those columns as you would using "PARTITION" in SQL. That's going to break jobs and I'm not confident that it's the right choice. > Audit semantics of various insertion operations related to partitioned tables > - > > Key: SPARK-16032 > URL: https://issues.apache.org/jira/browse/SPARK-16032 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan >Priority: Critical > Attachments: [SPARK-16032] Spark SQL table insertion auditing - > Google Docs.pdf > > > We found that semantics of various insertion operations related to partition > tables can be inconsistent. This is an umbrella ticket for all related > tickets.
[jira] [Commented] (SPARK-16075) Make VectorUDT/MatrixUDT singleton under spark.ml package
[ https://issues.apache.org/jira/browse/SPARK-16075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342930#comment-15342930 ] Miao Wang commented on SPARK-16075: --- I will follow up on this one. Thanks! > Make VectorUDT/MatrixUDT singleton under spark.ml package > - > > Key: SPARK-16075 > URL: https://issues.apache.org/jira/browse/SPARK-16075 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > Both VectorUDT and MatrixUDT are implemented as normal classes and there > could be multiple instances of them, which makes equality checking and > pattern matching harder to implement. Even though the APIs are private, switching to > a singleton pattern could simplify the development. > Required changes: > * singleton VectorUDT/MatrixUDT (created by VectorUDT.getOrCreate) > * update UDTRegistration > * update code generation to support singleton UDTs > * update existing code to use getOrCreate
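The proposed {{getOrCreate}} singleton pattern can be sketched in Python (a hypothetical simplification; the actual change is to the Scala UDT classes under spark.ml):

```python
class VectorUDT:
    """Sketch of a singleton UDT: one shared instance obtained via get_or_create."""
    _instance = None

    @classmethod
    def get_or_create(cls):
        # Lazily create the single shared instance on first use.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

# With a singleton, equality checking and pattern matching reduce to an
# identity check on the one shared instance:
assert VectorUDT.get_or_create() is VectorUDT.get_or_create()
```

With multiple freely constructed instances, code has to compare UDTs structurally; the singleton makes identity comparison sufficient, which is what simplifies code generation and {{UDTRegistration}}.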
[jira] [Assigned] (SPARK-16106) TaskSchedulerImpl does not correctly handle new executors on existing hosts
[ https://issues.apache.org/jira/browse/SPARK-16106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16106: Assignee: Apache Spark > TaskSchedulerImpl does not correctly handle new executors on existing hosts > --- > > Key: SPARK-16106 > URL: https://issues.apache.org/jira/browse/SPARK-16106 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.0 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Trivial > > The TaskSchedulerImpl updates the set of executors and hosts in each call to > {{resourceOffers}}. During this call, it also tracks whether there are any > new executors observed in {{newExecAvail}}: > {code} > executorIdToHost(o.executorId) = o.host > executorIdToTaskCount.getOrElseUpdate(o.executorId, 0) > if (!executorsByHost.contains(o.host)) { > executorsByHost(o.host) = new HashSet[String]() > executorAdded(o.executorId, o.host) > newExecAvail = true > } > {code} > However, this only detects when a new *host* is added, not when an additional > executor is added to an existing host (a relatively common event in dynamic > allocation). > The end result is that task locality and {{failedEpochs}} are not updated > correctly for new executors.
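The missed case can be illustrated with a small Python analogue of the tracking logic (names are illustrative, not Spark's actual scheduler code): the fix is to check per-executor membership rather than only per-host membership.

```python
def resource_offers(executors_by_host, offers):
    """Track executors per host and flag when any *executor* is new,
    not just when a new host appears (sketch of the reported fix).

    executors_by_host: dict mapping host -> set of executor ids
    offers: list of (executor_id, host) pairs
    """
    new_exec_avail = False
    for executor_id, host in offers:
        host_execs = executors_by_host.setdefault(host, set())
        if executor_id not in host_execs:
            # New executor detected, even when the host was already known --
            # the original code only fired on a previously unseen host.
            host_execs.add(executor_id)
            new_exec_avail = True
    return new_exec_avail
```

A second executor arriving on an already-known host (common under dynamic allocation) now flips the flag, whereas the per-host check in the quoted snippet would silently ignore it.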
[jira] [Assigned] (SPARK-16106) TaskSchedulerImpl does not correctly handle new executors on existing hosts
[ https://issues.apache.org/jira/browse/SPARK-16106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16106: Assignee: (was: Apache Spark) > TaskSchedulerImpl does not correctly handle new executors on existing hosts > --- > > Key: SPARK-16106 > URL: https://issues.apache.org/jira/browse/SPARK-16106 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.0 >Reporter: Imran Rashid >Priority: Trivial > > The TaskSchedulerImpl updates the set of executors and hosts in each call to > {{resourceOffers}}. During this call, it also tracks whether there are any > new executors observed in {{newExecAvail}}: > {code} > executorIdToHost(o.executorId) = o.host > executorIdToTaskCount.getOrElseUpdate(o.executorId, 0) > if (!executorsByHost.contains(o.host)) { > executorsByHost(o.host) = new HashSet[String]() > executorAdded(o.executorId, o.host) > newExecAvail = true > } > {code} > However, this only detects when a new *host* is added, not when an additional > executor is added to an existing host (a relatively common event in dynamic > allocation). > The end result is that task locality and {{failedEpochs}} are not updated > correctly for new executors.
[jira] [Commented] (SPARK-16106) TaskSchedulerImpl does not correctly handle new executors on existing hosts
[ https://issues.apache.org/jira/browse/SPARK-16106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342926#comment-15342926 ] Apache Spark commented on SPARK-16106: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/13826 > TaskSchedulerImpl does not correctly handle new executors on existing hosts > --- > > Key: SPARK-16106 > URL: https://issues.apache.org/jira/browse/SPARK-16106 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.0 >Reporter: Imran Rashid >Priority: Trivial > > The TaskSchedulerImpl updates the set of executors and hosts in each call to > {{resourceOffers}}. During this call, it also tracks whether there are any > new executors observed in {{newExecAvail}}: > {code} > executorIdToHost(o.executorId) = o.host > executorIdToTaskCount.getOrElseUpdate(o.executorId, 0) > if (!executorsByHost.contains(o.host)) { > executorsByHost(o.host) = new HashSet[String]() > executorAdded(o.executorId, o.host) > newExecAvail = true > } > {code} > However, this only detects when a new *host* is added, not when an additional > executor is added to an existing host (a relatively common event in dynamic > allocation). > The end result is that task locality and {{failedEpochs}} are not updated > correctly for new executors.
[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly
[ https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342917#comment-15342917 ] MIN-FU YANG commented on SPARK-14172: - Hi, I cannot reproduce the problem in the master branch. Maybe it's resolved in a newer version. I'll try to reproduce it in 1.6.1 > Hive table partition predicate not passed down correctly > > > Key: SPARK-14172 > URL: https://issues.apache.org/jira/browse/SPARK-14172 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Yingji Zhang >Priority: Critical > > When the Hive SQL contains nondeterministic fields, the Spark plan will not push > down the partition predicate to the HiveTableScan. For example: > {code} > -- consider following query which uses a random function to sample rows > SELECT * > FROM table_a > WHERE partition_col = 'some_value' > AND rand() < 0.01; > {code} > The Spark plan will not push down the partition predicate to HiveTableScan > which ends up scanning all partitions data from the table.
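The expected behavior — pruning partitions with the deterministic conjuncts while keeping nondeterministic ones (like {{rand() < 0.01}}) above the scan — can be modeled with a toy Python sketch (an illustrative model with hypothetical names, not Spark's optimizer):

```python
from dataclasses import dataclass

@dataclass
class Predicate:
    expr: str
    deterministic: bool        # e.g. rand() < 0.01 is nondeterministic
    is_partition_pred: bool    # references only partition columns

def pushable(predicates):
    # Only deterministic predicates on partition columns may prune partitions.
    # Nondeterministic conjuncts must stay above the scan, but their presence
    # should not block pushing the deterministic ones down.
    return [p for p in predicates if p.deterministic and p.is_partition_pred]

preds = [
    Predicate("partition_col = 'some_value'", True, True),
    Predicate("rand() < 0.01", False, False),
]
```

The reported bug is the opposite: one nondeterministic conjunct in the WHERE clause prevents the whole predicate set from reaching the HiveTableScan, so every partition gets scanned.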
[jira] [Created] (SPARK-16121) ListingFileCatalog does not list in parallel anymore
Yin Huai created SPARK-16121: Summary: ListingFileCatalog does not list in parallel anymore Key: SPARK-16121 URL: https://issues.apache.org/jira/browse/SPARK-16121 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Blocker In ListingFileCatalog, the implementation of {{listLeafFiles}} is shown below. When the number of user-provided paths is less than the value of {{sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold}}, we will not use parallel listing, which is different from what 1.6 does (for 1.6, if the number of children of any inner dir is larger than the threshold, we will use the parallel listing). {code} protected def listLeafFiles(paths: Seq[Path]): mutable.LinkedHashSet[FileStatus] = { if (paths.length >= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) { HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sparkSession) } else { // Dummy jobconf to get to the pathFilter defined in configuration val jobConf = new JobConf(hadoopConf, this.getClass) val pathFilter = FileInputFormat.getInputPathFilter(jobConf) val statuses: Seq[FileStatus] = paths.flatMap { path => val fs = path.getFileSystem(hadoopConf) logInfo(s"Listing $path on driver") Try { HadoopFsRelation.listLeafFiles(fs, fs.getFileStatus(path), pathFilter) }.getOrElse(Array.empty[FileStatus]) } mutable.LinkedHashSet(statuses: _*) } } {code}
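The behavioral difference described above can be modeled in Python: the 2.0 code decides once from the count of user-provided top-level paths, while the 1.6 behavior triggered parallel listing whenever any directory's fan-out reached the threshold during the recursive walk (a hypothetical sketch, not the real listing code):

```python
def should_parallelize_2_0(paths, threshold):
    # 2.0 behavior per the snippet: decided once, from the number of
    # user-provided paths only.
    return len(paths) >= threshold

def should_parallelize_1_6(tree, threshold):
    # 1.6 behavior as described in the report: parallel listing kicks in if
    # ANY directory's child count reaches the threshold while walking.
    # `tree` maps a directory path to its list of child entries.
    return any(len(children) >= threshold for children in tree.values())

# One top-level path, but 100 partition directories under it: 2.0 lists
# serially on the driver, while 1.6 would have switched to parallel listing.
tree = {"/table": ["/table/p=%d" % i for i in range(100)]}
```

This is why a single table path with many partitions regresses: `paths.length` is 1, so the parallel branch is never taken no matter how wide the directory tree is.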
[jira] [Commented] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342856#comment-15342856 ] Yin Huai commented on SPARK-16032: -- Regarding {{disabling Hive features}}, can you be more specific? Looking at the summary, we actually changed the behaviors from 1.6, which caused confusing and inconsistent behaviors. I do not think we should ship 2.0 without fixing these behaviors and try to fix them in future releases (the fix will possibly change the behaviors again). > Audit semantics of various insertion operations related to partitioned tables > - > > Key: SPARK-16032 > URL: https://issues.apache.org/jira/browse/SPARK-16032 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan >Priority: Blocker > Attachments: [SPARK-16032] Spark SQL table insertion auditing - > Google Docs.pdf > > > We found that semantics of various insertion operations related to partition > tables can be inconsistent. This is an umbrella ticket for all related > tickets.
[jira] [Updated] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16032: - Priority: Critical (was: Blocker) > Audit semantics of various insertion operations related to partitioned tables > - > > Key: SPARK-16032 > URL: https://issues.apache.org/jira/browse/SPARK-16032 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan >Priority: Critical > Attachments: [SPARK-16032] Spark SQL table insertion auditing - > Google Docs.pdf > > > We found that semantics of various insertion operations related to partition > tables can be inconsistent. This is an umbrella ticket for all related > tickets.
[jira] [Assigned] (SPARK-16120) getCurrentLogFiles method in ReceiverSuite "WAL - generating and cleaning" case uses external variable instead of the passed parameter
[ https://issues.apache.org/jira/browse/SPARK-16120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16120: Assignee: (was: Apache Spark) > getCurrentLogFiles method in ReceiverSuite "WAL - generating and cleaning" > case uses external variable instead of the passed parameter > -- > > Key: SPARK-16120 > URL: https://issues.apache.org/jira/browse/SPARK-16120 > Project: Spark > Issue Type: Test > Components: Streaming, Tests >Reporter: Ahmed Mahran >Priority: Trivial > Labels: easyfix, newbie, test > > In ReceiverSuite.scala, in the test case "write ahead log - generating and > cleaning", the inner method getCurrentLogFiles uses external variable > logDirectory1 instead of the passed parameter logDirectory > {code} > def getCurrentLogFiles(logDirectory: File): Seq[String] = { > try { > if (logDirectory.exists()) { > logDirectory1.listFiles().filter { _.getName.startsWith("log") }.map { > _.toString } > } else { > Seq.empty > } > } catch { > case e: Exception => > Seq.empty > } > } > {code}
[jira] [Assigned] (SPARK-16120) getCurrentLogFiles method in ReceiverSuite "WAL - generating and cleaning" case uses external variable instead of the passed parameter
[ https://issues.apache.org/jira/browse/SPARK-16120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16120: Assignee: Apache Spark > getCurrentLogFiles method in ReceiverSuite "WAL - generating and cleaning" > case uses external variable instead of the passed parameter > -- > > Key: SPARK-16120 > URL: https://issues.apache.org/jira/browse/SPARK-16120 > Project: Spark > Issue Type: Test > Components: Streaming, Tests >Reporter: Ahmed Mahran >Assignee: Apache Spark >Priority: Trivial > Labels: easyfix, newbie, test > > In ReceiverSuite.scala, in the test case "write ahead log - generating and > cleaning", the inner method getCurrentLogFiles uses external variable > logDirectory1 instead of the passed parameter logDirectory > {code} > def getCurrentLogFiles(logDirectory: File): Seq[String] = { > try { > if (logDirectory.exists()) { > logDirectory1.listFiles().filter { _.getName.startsWith("log") }.map { > _.toString } > } else { > Seq.empty > } > } catch { > case e: Exception => > Seq.empty > } > } > {code}
[jira] [Commented] (SPARK-16120) getCurrentLogFiles method in ReceiverSuite "WAL - generating and cleaning" case uses external variable instead of the passed parameter
[ https://issues.apache.org/jira/browse/SPARK-16120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342852#comment-15342852 ] Apache Spark commented on SPARK-16120: -- User 'ahmed-mahran' has created a pull request for this issue: https://github.com/apache/spark/pull/13825 > getCurrentLogFiles method in ReceiverSuite "WAL - generating and cleaning" > case uses external variable instead of the passed parameter > -- > > Key: SPARK-16120 > URL: https://issues.apache.org/jira/browse/SPARK-16120 > Project: Spark > Issue Type: Test > Components: Streaming, Tests >Reporter: Ahmed Mahran >Priority: Trivial > Labels: easyfix, newbie, test > > In ReceiverSuite.scala, in the test case "write ahead log - generating and > cleaning", the inner method getCurrentLogFiles uses external variable > logDirectory1 instead of the passed parameter logDirectory > {code} > def getCurrentLogFiles(logDirectory: File): Seq[String] = { > try { > if (logDirectory.exists()) { > logDirectory1.listFiles().filter { _.getName.startsWith("log") }.map { > _.toString } > } else { > Seq.empty > } > } catch { > case e: Exception => > Seq.empty > } > } > {code}
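The bug pattern — a helper closing over an outer variable instead of using its own parameter — is easy to reproduce in any language; a minimal Python analogue (with hypothetical names mirroring the Scala ones):

```python
log_directory1 = "/tmp/wal-1"  # outer variable, analogous to logDirectory1

def get_current_log_files_buggy(log_directory):
    # Bug: ignores the parameter and reads the outer variable instead,
    # so every call inspects the same directory regardless of the argument.
    return log_directory1

def get_current_log_files_fixed(log_directory):
    # Fix: use the parameter that was actually passed in.
    return log_directory
```

The test still passed in Spark because both call sites happened to pass the same directory the outer variable referred to; the shadowing only matters once a second log directory is involved.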
[jira] [Commented] (SPARK-15326) Doing multiple unions on a Dataframe will result in a very inefficient query plan
[ https://issues.apache.org/jira/browse/SPARK-15326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342850#comment-15342850 ] MIN-FU YANG commented on SPARK-15326: - I'll look into it. > Doing multiple unions on a Dataframe will result in a very inefficient query > plan > - > > Key: SPARK-15326 > URL: https://issues.apache.org/jira/browse/SPARK-15326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Jurriaan Pruis > Attachments: Query Plan.pdf, skewed_join.py, skewed_join_plan.txt > > > While working with a very skewed dataset I noticed that repeated unions on a > dataframe will result in a query plan with 2^(union) - 1 unions. With large > datasets this will be very inefficient. > I tried to replicate this behaviour using a PySpark example (I've attached > the output of the explain() to this JIRA): > {code} > df = sqlCtx.range(1000) > def r(name, max_val=100): > return F.round(F.lit(max_val) * F.pow(F.rand(), > 4)).cast('integer').alias(name) > # Create a skewed dataset > skewed = df.select('id', r('a'), r('b'), r('c'), r('d'), r('e'), r('f')) > # Find the skewed values in the dataset > top_10_percent = skewed.freqItems(['a', 'b', 'c', 'd', 'e', 'f'], > 0.10).collect()[0] > def skewjoin(skewed, right, column, freqItems): > freqItems = freqItems[column + '_freqItems'] > skewed = skewed.alias('skewed') > cond = F.col(column).isin(freqItems) > # First broadcast join the frequent (skewed) values > filtered = skewed.filter(cond).join(F.broadcast(right.filter(cond)), > column, 'left_outer') > # Use a regular join for the non skewed values (with big tables this will > use a SortMergeJoin) > non_skewed = skewed.filter(cond == False).join(right.filter(cond == > False), column, 'left_outer') > # join them together and replace the column with the column found in the > right DataFrame > return filtered.unionAll(non_skewed).select('skewed.*', > right['id'].alias(column + '_key')).drop(column) > # Create the 
dataframes that will be joined to the skewed dataframe > right_size = 100 > df_a = sqlCtx.range(right_size).select('id', F.col('id').alias('a')) > df_b = sqlCtx.range(right_size).select('id', F.col('id').alias('b')) > df_c = sqlCtx.range(right_size).select('id', F.col('id').alias('c')) > df_d = sqlCtx.range(right_size).select('id', F.col('id').alias('d')) > df_e = sqlCtx.range(right_size).select('id', F.col('id').alias('e')) > df_f = sqlCtx.range(right_size).select('id', F.col('id').alias('f')) > # Join everything together > df = skewed > df = skewjoin(df, df_a, 'a', top_10_percent) > df = skewjoin(df, df_b, 'b', top_10_percent) > df = skewjoin(df, df_c, 'c', top_10_percent) > df = skewjoin(df, df_d, 'd', top_10_percent) > df = skewjoin(df, df_e, 'e', top_10_percent) > df = skewjoin(df, df_f, 'f', top_10_percent) > # df.explain() shows the plan where it does 63 unions > (2^(number_of_skewjoins) - 1) > # which will be very inefficient and slow > df.explain(True) > # Evaluate the plan > # You'd expect this to return 1000, but it does not, it returned 1140 > on my system > # (probably because it will recalculate the random columns? Not sure though) > print(df.count())
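The 2^n − 1 growth reported above can be reproduced with a toy plan model: each skew-join step replaces the current plan P with Union(filter(P), filter(P)), so the union count follows the recurrence unions(n) = 2 · unions(n−1) + 1 (an illustrative sketch of the plan arithmetic, not Spark's planner):

```python
def union_count_after(steps):
    # Each skewjoin step unions two filtered copies of the current plan,
    # doubling the existing union nodes and adding one more:
    # unions(n) = 2 * unions(n-1) + 1, starting from 0, i.e. 2^n - 1.
    unions = 0
    for _ in range(steps):
        unions = 2 * unions + 1
    return unions
```

Six skew joins therefore yield 63 union nodes, matching the attached `df.explain()` output; the underlying problem is that the plan tree duplicates the whole upstream plan at every step instead of referencing it once.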
[jira] [Created] (SPARK-16120) getCurrentLogFiles method in ReceiverSuite "WAL - generating and cleaning" case uses external variable instead of the passed parameter
Ahmed Mahran created SPARK-16120: Summary: getCurrentLogFiles method in ReceiverSuite "WAL - generating and cleaning" case uses external variable instead of the passed parameter Key: SPARK-16120 URL: https://issues.apache.org/jira/browse/SPARK-16120 Project: Spark Issue Type: Test Components: Streaming, Tests Reporter: Ahmed Mahran Priority: Trivial In ReceiverSuite.scala, in the test case "write ahead log - generating and cleaning", the inner method getCurrentLogFiles uses external variable logDirectory1 instead of the passed parameter logDirectory {code} def getCurrentLogFiles(logDirectory: File): Seq[String] = { try { if (logDirectory.exists()) { logDirectory1.listFiles().filter { _.getName.startsWith("log") }.map { _.toString } } else { Seq.empty } } catch { case e: Exception => Seq.empty } } {code}
[jira] [Assigned] (SPARK-16110) Can't set Python via spark-submit for YARN cluster mode when PYSPARK_PYTHON & PYSPARK_DRIVER_PYTHON are set
[ https://issues.apache.org/jira/browse/SPARK-16110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16110: Assignee: (was: Apache Spark) > Can't set Python via spark-submit for YARN cluster mode when PYSPARK_PYTHON & > PYSPARK_DRIVER_PYTHON are set > --- > > Key: SPARK-16110 > URL: https://issues.apache.org/jira/browse/SPARK-16110 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 1.6.1 > Environment: Ubuntu 14.04.4 LTS (GNU/Linux 4.2.0-38-generic x86_64), > Spark 1.6.1, Azure HDInsight 3.4 >Reporter: Kevin Grealish > Labels: easyfix > Original Estimate: 2h > Remaining Estimate: 2h > > When a cluster has PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment > variables set (needed for using non-system Python e.g. > /usr/bin/anaconda/bin/python), then you are unable to override this per > submission in YARN cluster mode. > When using spark-submit (in this case via LIVY) to submit with an override: > spark-submit --master yarn --deploy-mode cluster --conf > 'spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=python3' --conf > 'spark.yarn.appMasterEnv.PYSPARK_PYTHON=python3' probe.py > the environment variable values will override the conf settings. A workaround > for some can be to unset the env vars but that is not always possible (e.g. > submitting batch via LIVY where you can only pass through the parameters to > spark-submit). > Expectation is that the conf values above override the environment variables. > Fix is to change the order of application of conf and env vars in the yarn > client. > Related discussion: https://issues.cloudera.org/browse/LIVY-159 > Backporting this to 1.6 would be great and unblocking for me.
[jira] [Assigned] (SPARK-16110) Can't set Python via spark-submit for YARN cluster mode when PYSPARK_PYTHON & PYSPARK_DRIVER_PYTHON are set
[ https://issues.apache.org/jira/browse/SPARK-16110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16110: Assignee: Apache Spark > Can't set Python via spark-submit for YARN cluster mode when PYSPARK_PYTHON & > PYSPARK_DRIVER_PYTHON are set > --- > > Key: SPARK-16110 > URL: https://issues.apache.org/jira/browse/SPARK-16110 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 1.6.1 > Environment: Ubuntu 14.04.4 LTS (GNU/Linux 4.2.0-38-generic x86_64), > Spark 1.6.1, Azure HDInsight 3.4 >Reporter: Kevin Grealish >Assignee: Apache Spark > Labels: easyfix > Original Estimate: 2h > Remaining Estimate: 2h > > When a cluster has PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment > variables set (needed for using non-system Python e.g. > /usr/bin/anaconda/bin/python), then you are unable to override this per > submission in YARN cluster mode. > When using spark-submit (in this case via LIVY) to submit with an override: > spark-submit --master yarn --deploy-mode cluster --conf > 'spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=python3' --conf > 'spark.yarn.appMasterEnv.PYSPARK_PYTHON=python3' probe.py > the environment variable values will override the conf settings. A workaround > for some can be to unset the env vars but that is not always possible (e.g. > submitting batch via LIVY where you can only pass through the parameters to > spark-submit). > Expectation is that the conf values above override the environment variables. > Fix is to change the order of application of conf and env vars in the yarn > client. > Related discussion: https://issues.cloudera.org/browse/LIVY-159 > Backporting this to 1.6 would be great and unblocking for me.
[jira] [Commented] (SPARK-16110) Can't set Python via spark-submit for YARN cluster mode when PYSPARK_PYTHON & PYSPARK_DRIVER_PYTHON are set
[ https://issues.apache.org/jira/browse/SPARK-16110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342831#comment-15342831 ] Apache Spark commented on SPARK-16110: -- User 'KevinGrealish' has created a pull request for this issue: https://github.com/apache/spark/pull/13824 > Can't set Python via spark-submit for YARN cluster mode when PYSPARK_PYTHON & > PYSPARK_DRIVER_PYTHON are set > --- > > Key: SPARK-16110 > URL: https://issues.apache.org/jira/browse/SPARK-16110 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 1.6.1 > Environment: Ubuntu 14.04.4 LTS (GNU/Linux 4.2.0-38-generic x86_64), > Spark 1.6.1, Azure HDInsight 3.4 >Reporter: Kevin Grealish > Labels: easyfix > Original Estimate: 2h > Remaining Estimate: 2h > > When a cluster has PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment > variables set (needed for using non-system Python e.g. > /usr/bin/anaconda/bin/python), then you are unable to override this per > submission in YARN cluster mode. > When using spark-submit (in this case via LIVY) to submit with an override: > spark-submit --master yarn --deploy-mode cluster --conf > 'spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=python3' --conf > 'spark.yarn.appMasterEnv.PYSPARK_PYTHON=python3' probe.py > the environment variable values will override the conf settings. A workaround > for some can be to unset the env vars but that is not always possible (e.g. > submitting batch via LIVY where you can only pass through the parameters to > spark-submit). > Expectation is that the conf values above override the environment variables. > Fix is to change the order of application of conf and env vars in the yarn > client. > Related discussion: https://issues.cloudera.org/browse/LIVY-159 > Backporting this to 1.6 would be great and unblocking for me.
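The expected precedence — a submission-time {{spark.yarn.appMasterEnv.*}} conf value beating the cluster-wide environment variable — amounts to a simple lookup order; a Python sketch with a hypothetical helper (not Spark's actual yarn client code):

```python
def resolve_pyspark_python(cluster_env, submit_conf):
    # Desired order: an explicit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON
    # wins over whatever PYSPARK_PYTHON the cluster environment set; a final
    # default applies when neither is present.
    return (submit_conf.get("spark.yarn.appMasterEnv.PYSPARK_PYTHON")
            or cluster_env.get("PYSPARK_PYTHON")
            or "python")
```

The reported bug is that the yarn client applies the two sources in the opposite order, so the cluster's env var clobbers the per-submission conf; the fix is just swapping the application order.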
[jira] [Commented] (SPARK-16106) TaskSchedulerImpl does not correctly handle new executors on existing hosts
[ https://issues.apache.org/jira/browse/SPARK-16106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342821#comment-15342821 ] Imran Rashid commented on SPARK-16106: -- cc [~kayousterhout] After taking a closer look at this, I don't think it's really a very big issue. Since you've already got some executor on the host, your locality level will already include *at least* NODE_LOCAL. It's possible that when you add another executor, now you can go to PROCESS_LOCAL -- but I can't think of how you'd have a task which wants to be PROCESS_LOCAL in an executor which doesn't exist yet. There is also the issue w/ failedEpoch; I haven't wrapped my head around what the ramifications of that are yet. In any case, it's confusing at the very least, so we might as well fix this. But it might be reasonable to simply change {{executorAdded()}} to {{hostAdded()}} to clear up the naming, and leave the current behavior as well. > TaskSchedulerImpl does not correctly handle new executors on existing hosts > --- > > Key: SPARK-16106 > URL: https://issues.apache.org/jira/browse/SPARK-16106 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.0 >Reporter: Imran Rashid >Priority: Trivial > > The TaskSchedulerImpl updates the set of executors and hosts in each call to > {{resourceOffers}}. During this call, it also tracks whether there are any > new executors observed in {{newExecAvail}}: > {code} > executorIdToHost(o.executorId) = o.host > executorIdToTaskCount.getOrElseUpdate(o.executorId, 0) > if (!executorsByHost.contains(o.host)) { > executorsByHost(o.host) = new HashSet[String]() > executorAdded(o.executorId, o.host) > newExecAvail = true > } > {code} > However, this only detects when a new *host* is added, not when an additional > executor is added to an existing host (a relatively common event in dynamic > allocation). > The end result is that task locality and {{failedEpochs}} are not updated > correctly for new executors.
[jira] [Updated] (SPARK-16106) TaskSchedulerImpl does not correctly handle new executors on existing hosts
[ https://issues.apache.org/jira/browse/SPARK-16106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-16106: - Priority: Trivial (was: Major) > TaskSchedulerImpl does not correctly handle new executors on existing hosts > --- > > Key: SPARK-16106 > URL: https://issues.apache.org/jira/browse/SPARK-16106 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.0 >Reporter: Imran Rashid >Priority: Trivial > > The TaskSchedulerImpl updates the set of executors and hosts in each call to > {{resourceOffers}}. During this call, it also tracks whether there are any > new executors observed in {{newExecAvail}}: > {code} > executorIdToHost(o.executorId) = o.host > executorIdToTaskCount.getOrElseUpdate(o.executorId, 0) > if (!executorsByHost.contains(o.host)) { > executorsByHost(o.host) = new HashSet[String]() > executorAdded(o.executorId, o.host) > newExecAvail = true > } > {code} > However, this only detects when a new *host* is added, not when an additional > executor is added to an existing host (a relatively common event in dynamic > allocation). > The end result is that task locality and {{failedEpochs}} are not updated > correctly for new executors.
[jira] [Updated] (SPARK-15606) Driver hang in o.a.s.DistributedSuite on 2 core machine
[ https://issues.apache.org/jira/browse/SPARK-15606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-15606: - Fix Version/s: 1.6.2 > Driver hang in o.a.s.DistributedSuite on 2 core machine > --- > > Key: SPARK-15606 > URL: https://issues.apache.org/jira/browse/SPARK-15606 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 > Environment: AMD64 box with only 2 cores >Reporter: Pete Robbins >Assignee: Pete Robbins > Fix For: 1.6.2, 2.0.0 > > > repeatedly failing task that crashes JVM *** FAILED *** > The code passed to failAfter did not complete within 10 milliseconds. > (DistributedSuite.scala:128) > This test started failing and DistributedSuite started hanging following > https://github.com/apache/spark/pull/13055 > It looks like the extra message to remove the BlockManager deadlocks as there > are only 2 message processing loop threads. Related to > https://issues.apache.org/jira/browse/SPARK-13906 > {code} > /** Thread pool used for dispatching messages. */ > private val threadpool: ThreadPoolExecutor = { > val numThreads = > nettyEnv.conf.getInt("spark.rpc.netty.dispatcher.numThreads", > math.max(2, Runtime.getRuntime.availableProcessors())) > val pool = ThreadUtils.newDaemonFixedThreadPool(numThreads, > "dispatcher-event-loop") > for (i <- 0 until numThreads) { > pool.execute(new MessageLoop) > } > pool > } > {code} > Setting a minimum of 3 threads alleviates this issue, but I'm not sure there > isn't another underlying problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15606) Driver hang in o.a.s.DistributedSuite on 2 core machine
[ https://issues.apache.org/jira/browse/SPARK-15606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-15606: - Affects Version/s: 1.6.2 > Driver hang in o.a.s.DistributedSuite on 2 core machine > --- > > Key: SPARK-15606 > URL: https://issues.apache.org/jira/browse/SPARK-15606 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 > Environment: AMD64 box with only 2 cores >Reporter: Pete Robbins >Assignee: Pete Robbins > Fix For: 1.6.2, 2.0.0 > > > repeatedly failing task that crashes JVM *** FAILED *** > The code passed to failAfter did not complete within 10 milliseconds. > (DistributedSuite.scala:128) > This test started failing and DistributedSuite started hanging following > https://github.com/apache/spark/pull/13055 > It looks like the extra message to remove the BlockManager deadlocks as there > are only 2 message processing loop threads. Related to > https://issues.apache.org/jira/browse/SPARK-13906 > {code} > /** Thread pool used for dispatching messages. */ > private val threadpool: ThreadPoolExecutor = { > val numThreads = > nettyEnv.conf.getInt("spark.rpc.netty.dispatcher.numThreads", > math.max(2, Runtime.getRuntime.availableProcessors())) > val pool = ThreadUtils.newDaemonFixedThreadPool(numThreads, > "dispatcher-event-loop") > for (i <- 0 until numThreads) { > pool.execute(new MessageLoop) > } > pool > } > {code} > Setting a minimum of 3 threads alleviates this issue, but I'm not sure there > isn't another underlying problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
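The sizing logic in the quoted dispatcher snippet, and the mitigation Pete describes, can be sketched in plain Python (a stand-in for illustration, not Spark's code): with only 2 cores the pool gets exactly 2 message loops, so two mutually blocking handlers exhaust it, while a floor of 3 leaves one loop free.

```python
# Plain-Python mirror of the quoted thread-count expression:
#   conf.getInt("spark.rpc.netty.dispatcher.numThreads",
#               math.max(2, Runtime.getRuntime.availableProcessors()))
# The names and the `floor` parameter are illustrative, not Spark's API.

def dispatcher_threads(available_processors, configured=None, floor=2):
    """Explicit config wins; otherwise use max(floor, core count)."""
    if configured is not None:
        return configured
    return max(floor, available_processors)

print(dispatcher_threads(2))            # 2 -- the deadlock-prone case on a 2-core box
print(dispatcher_threads(2, floor=3))   # 3 -- the proposed minimum
print(dispatcher_threads(8))            # 8 -- unchanged on larger machines
```

On the 2-core AMD64 box from the report, both message loop threads can end up blocked on each other's replies, which matches the observed hang; raising the floor only masks that scheduling dependency rather than removing it.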
[jira] [Created] (SPARK-16119) Support "DROP TABLE ... PURGE" if Hive client supports it
Marcelo Vanzin created SPARK-16119: -- Summary: Support "DROP TABLE ... PURGE" if Hive client supports it Key: SPARK-16119 URL: https://issues.apache.org/jira/browse/SPARK-16119 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Marcelo Vanzin There's currently code that explicitly disables the "PURGE" flag when dropping a table: {code} if (ctx.PURGE != null) { throw operationNotAllowed("DROP TABLE ... PURGE", ctx) } {code} That flag is necessary in certain situations where the table data cannot be moved to the trash (which will be tried unless "PURGE" is requested). If the client supports it (Hive >= 0.14.0 according to the Hive docs), we should allow that option to be defined. For non-Hive tables, as far as I can understand, "PURGE" is the current behavior of Spark. The same limitation currently exists for "ALTER TABLE ... DROP PARTITION", so it should probably be covered as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16115) Improve output column name for SHOW PARTITIONS command and improve an error message
[ https://issues.apache.org/jira/browse/SPARK-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16115: Assignee: (was: Apache Spark) > Improve output column name for SHOW PARTITIONS command and improve an error > message > --- > > Key: SPARK-16115 > URL: https://issues.apache.org/jira/browse/SPARK-16115 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sunitha Kambhampati >Priority: Minor > > Opening this issue to address the following: > 1. For the SHOW PARTITIONS command, the column name in the output is > 'result'. In Hive, the column name is 'partition' which is more > descriptive. Change the output schema, column name from 'result' to > 'partition' > 2. Corner case: Improve the error message for a non-existent table > Right now, the below error is thrown: > scala> spark.sql("show partitions t1"); > org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on > a table that is not partitioned: default.t1; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16115) Improve output column name for SHOW PARTITIONS command and improve an error message
[ https://issues.apache.org/jira/browse/SPARK-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16115: Assignee: Apache Spark > Improve output column name for SHOW PARTITIONS command and improve an error > message > --- > > Key: SPARK-16115 > URL: https://issues.apache.org/jira/browse/SPARK-16115 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sunitha Kambhampati >Assignee: Apache Spark >Priority: Minor > > Opening this issue to address the following: > 1. For the SHOW PARTITIONS command, the column name in the output is > 'result'. In Hive, the column name is 'partition' which is more > descriptive. Change the output schema, column name from 'result' to > 'partition' > 2. Corner case: Improve the error message for a non-existent table > Right now, the below error is thrown: > scala> spark.sql("show partitions t1"); > org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on > a table that is not partitioned: default.t1; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16115) Improve output column name for SHOW PARTITIONS command and improve an error message
[ https://issues.apache.org/jira/browse/SPARK-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342796#comment-15342796 ] Apache Spark commented on SPARK-16115: -- User 'skambha' has created a pull request for this issue: https://github.com/apache/spark/pull/13822 > Improve output column name for SHOW PARTITIONS command and improve an error > message > --- > > Key: SPARK-16115 > URL: https://issues.apache.org/jira/browse/SPARK-16115 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sunitha Kambhampati >Priority: Minor > > Opening this issue to address the following: > 1. For the SHOW PARTITIONS command, the column name in the output is > 'result'. In Hive, the column name is 'partition' which is more > descriptive. Change the output schema, column name from 'result' to > 'partition' > 2. Corner case: Improve the error message for a non-existent table > Right now, the below error is thrown: > scala> spark.sql("show partitions t1"); > org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on > a table that is not partitioned: default.t1; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16118) getDropLast is missing in OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-16118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342790#comment-15342790 ] Apache Spark commented on SPARK-16118: -- User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/13821 > getDropLast is missing in OneHotEncoder > --- > > Key: SPARK-16118 > URL: https://issues.apache.org/jira/browse/SPARK-16118 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We forgot the getter of dropLast in OneHotEncoder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16118) getDropLast is missing in OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-16118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16118: Assignee: Xiangrui Meng (was: Apache Spark) > getDropLast is missing in OneHotEncoder > --- > > Key: SPARK-16118 > URL: https://issues.apache.org/jira/browse/SPARK-16118 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We forgot the getter of dropLast in OneHotEncoder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16118) getDropLast is missing in OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-16118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16118: Assignee: Apache Spark (was: Xiangrui Meng) > getDropLast is missing in OneHotEncoder > --- > > Key: SPARK-16118 > URL: https://issues.apache.org/jira/browse/SPARK-16118 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark > > We forgot the getter of dropLast in OneHotEncoder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16118) getDropLast is missing in OneHotEncoder
Xiangrui Meng created SPARK-16118: - Summary: getDropLast is missing in OneHotEncoder Key: SPARK-16118 URL: https://issues.apache.org/jira/browse/SPARK-16118 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.6.1, 1.5.2, 2.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We forgot the getter of dropLast in OneHotEncoder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16107) Group GLM-related methods in generated doc
[ https://issues.apache.org/jira/browse/SPARK-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342773#comment-15342773 ] Apache Spark commented on SPARK-16107: -- User 'junyangq' has created a pull request for this issue: https://github.com/apache/spark/pull/13820 > Group GLM-related methods in generated doc > -- > > Key: SPARK-16107 > URL: https://issues.apache.org/jira/browse/SPARK-16107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Junyang Qian > Labels: starter > > Group API docs of spark.glm, glm, predict(GLM), summary(GLM), > read/write.ml(GLM) under Rd spark.glm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16107) Group GLM-related methods in generated doc
[ https://issues.apache.org/jira/browse/SPARK-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16107: Assignee: Junyang Qian (was: Apache Spark) > Group GLM-related methods in generated doc > -- > > Key: SPARK-16107 > URL: https://issues.apache.org/jira/browse/SPARK-16107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Junyang Qian > Labels: starter > > Group API docs of spark.glm, glm, predict(GLM), summary(GLM), > read/write.ml(GLM) under Rd spark.glm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16107) Group GLM-related methods in generated doc
[ https://issues.apache.org/jira/browse/SPARK-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16107: Assignee: Apache Spark (was: Junyang Qian) > Group GLM-related methods in generated doc > -- > > Key: SPARK-16107 > URL: https://issues.apache.org/jira/browse/SPARK-16107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark > Labels: starter > > Group API docs of spark.glm, glm, predict(GLM), summary(GLM), > read/write.ml(GLM) under Rd spark.glm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16117) Hide LibSVMFileFormat in public API docs
[ https://issues.apache.org/jira/browse/SPARK-16117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342741#comment-15342741 ] Apache Spark commented on SPARK-16117: -- User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/13819 > Hide LibSVMFileFormat in public API docs > > > Key: SPARK-16117 > URL: https://issues.apache.org/jira/browse/SPARK-16117 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > LibSVMFileFormat implements a data source for the LIBSVM format. However, users do > not need to call its APIs to use it. So we should hide it in the public API > docs. The main issue is that we still need to put the documentation and > example code somewhere. The proposal is to have a dummy object to hold the > documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16117) Hide LibSVMFileFormat in public API docs
[ https://issues.apache.org/jira/browse/SPARK-16117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16117: Assignee: Apache Spark (was: Xiangrui Meng) > Hide LibSVMFileFormat in public API docs > > > Key: SPARK-16117 > URL: https://issues.apache.org/jira/browse/SPARK-16117 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark > > LibSVMFileFormat implements a data source for the LIBSVM format. However, users do > not need to call its APIs to use it. So we should hide it in the public API > docs. The main issue is that we still need to put the documentation and > example code somewhere. The proposal is to have a dummy object to hold the > documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16117) Hide LibSVMFileFormat in public API docs
[ https://issues.apache.org/jira/browse/SPARK-16117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16117: Assignee: Xiangrui Meng (was: Apache Spark) > Hide LibSVMFileFormat in public API docs > > > Key: SPARK-16117 > URL: https://issues.apache.org/jira/browse/SPARK-16117 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > LibSVMFileFormat implements a data source for the LIBSVM format. However, users do > not need to call its APIs to use it. So we should hide it in the public API > docs. The main issue is that we still need to put the documentation and > example code somewhere. The proposal is to have a dummy object to hold the > documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache
[ https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342731#comment-15342731 ] Michael Allman commented on SPARK-15968: I've created a revised PR for this issue, https://github.com/apache/spark/pull/13818. > HiveMetastoreCatalog does not correctly validate partitioned metastore > relation when searching the internal table cache > --- > > Key: SPARK-15968 > URL: https://issues.apache.org/jira/browse/SPARK-15968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Michael Allman > Labels: hive, metastore > > The {{getCached}} method of {{HiveMetastoreCatalog}} computes > {{pathsInMetastore}} from the metastore relation's catalog table. This only > returns the table base path, which is not correct for non-empty partitioned > tables. As a result, cache lookups on non-empty partitioned tables always > miss and these relations are always recomputed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache
[ https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342730#comment-15342730 ] Apache Spark commented on SPARK-15968: -- User 'mallman' has created a pull request for this issue: https://github.com/apache/spark/pull/13818 > HiveMetastoreCatalog does not correctly validate partitioned metastore > relation when searching the internal table cache > --- > > Key: SPARK-15968 > URL: https://issues.apache.org/jira/browse/SPARK-15968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Michael Allman > Labels: hive, metastore > > The {{getCached}} method of {{HiveMetastoreCatalog}} computes > {{pathsInMetastore}} from the metastore relation's catalog table. This only > returns the table base path, which is not correct for non-empty partitioned > tables. As a result, cache lookups on non-empty partitioned tables always > miss and these relations are always recomputed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15997) Audit ml.feature Update documentation for ml feature transformers
[ https://issues.apache.org/jira/browse/SPARK-15997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342728#comment-15342728 ] Apache Spark commented on SPARK-15997: -- User 'GayathriMurali' has created a pull request for this issue: https://github.com/apache/spark/pull/13745 > Audit ml.feature Update documentation for ml feature transformers > - > > Key: SPARK-15997 > URL: https://issues.apache.org/jira/browse/SPARK-15997 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Gayathri Murali >Assignee: Gayathri Murali > > This JIRA is a subtask of SPARK-15100 and improves documentation for new > features added to > 1. HashingTF > 2. Countvectorizer > 3. QuantileDiscretizer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16116) ConsoleSink should not require checkpointLocation
[ https://issues.apache.org/jira/browse/SPARK-16116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16116: Assignee: Shixiong Zhu (was: Apache Spark) > ConsoleSink should not require checkpointLocation > - > > Key: SPARK-16116 > URL: https://issues.apache.org/jira/browse/SPARK-16116 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > ConsoleSink should not require checkpointLocation since it's for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16116) ConsoleSink should not require checkpointLocation
[ https://issues.apache.org/jira/browse/SPARK-16116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342708#comment-15342708 ] Apache Spark commented on SPARK-16116: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/13817 > ConsoleSink should not require checkpointLocation > - > > Key: SPARK-16116 > URL: https://issues.apache.org/jira/browse/SPARK-16116 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > ConsoleSink should not require checkpointLocation since it's for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16116) ConsoleSink should not require checkpointLocation
[ https://issues.apache.org/jira/browse/SPARK-16116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16116: Assignee: Apache Spark (was: Shixiong Zhu) > ConsoleSink should not require checkpointLocation > - > > Key: SPARK-16116 > URL: https://issues.apache.org/jira/browse/SPARK-16116 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Shixiong Zhu >Assignee: Apache Spark > > ConsoleSink should not require checkpointLocation since it's for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16117) Hide LibSVMFileFormat in public API docs
Xiangrui Meng created SPARK-16117: - Summary: Hide LibSVMFileFormat in public API docs Key: SPARK-16117 URL: https://issues.apache.org/jira/browse/SPARK-16117 Project: Spark Issue Type: Improvement Components: Documentation, MLlib Affects Versions: 2.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng LibSVMFileFormat implements a data source for the LIBSVM format. However, users do not need to call its APIs to use it. So we should hide it in the public API docs. The main issue is that we still need to put the documentation and example code somewhere. The proposal is to have a dummy object to hold the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16116) ConsoleSink should not require checkpointLocation
Shixiong Zhu created SPARK-16116: Summary: ConsoleSink should not require checkpointLocation Key: SPARK-16116 URL: https://issues.apache.org/jira/browse/SPARK-16116 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Shixiong Zhu Assignee: Shixiong Zhu ConsoleSink should not require checkpointLocation since it's for debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16114) Add network word count example
[ https://issues.apache.org/jira/browse/SPARK-16114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16114: Assignee: (was: Apache Spark) > Add network word count example > -- > > Key: SPARK-16114 > URL: https://issues.apache.org/jira/browse/SPARK-16114 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: James Thomas > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16090) Improve method grouping in SparkR generated docs
[ https://issues.apache.org/jira/browse/SPARK-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342696#comment-15342696 ] Felix Cheung edited comment on SPARK-16090 at 6/21/16 9:08 PM: --- This is for example the html output for gapply {code} # S4 method for signature 'GroupedData' gapply(x, func, schema) ## S4 method for signature 'SparkDataFrame' gapply(x, cols, func, schema) Arguments x a GroupedData func A function to be applied to each group partition specified by GroupedData. The function 'func' takes as argument a key - grouping columns and a data frame - a local R data.frame. The output of 'func' is a local R data.frame. schema The schema of the resulting SparkDataFrame after the function is applied. The schema must match to output of 'func'. It has to be defined for each output column with preferred output column name and corresponding data type. cols Grouping columns x A SparkDataFrame func A function to be applied to each group partition specified by grouping column of the SparkDataFrame. The function 'func' takes as argument a key - grouping columns and a data frame - a local R data.frame. The output of 'func' is a local R data.frame. schema The schema of the resulting SparkDataFrame after the function is applied. The schema must match to output of 'func'. It has to be defined for each output column with preferred output column name and corresponding data type. {code} As you can see, func and schema (and x) are listed twice with different wording under Arguments. We should see if we could explain it one way and list them once only. (ie. one copy of "@param func") was (Author: felixcheung): This is for example the html output for gapply {code} # S4 method for signature 'GroupedData' gapply(x, func, schema) ## S4 method for signature 'SparkDataFrame' gapply(x, cols, func, schema) Arguments x a GroupedData func A function to be applied to each group partition specified by GroupedData. 
The function 'func' takes as argument a key - grouping columns and a data frame - a local R data.frame. The output of 'func' is a local R data.frame. schema The schema of the resulting SparkDataFrame after the function is applied. The schema must match to output of 'func'. It has to be defined for each output column with preferred output column name and corresponding data type. cols Grouping columns x A SparkDataFrame func A function to be applied to each group partition specified by grouping column of the SparkDataFrame. The function 'func' takes as argument a key - grouping columns and a data frame - a local R data.frame. The output of 'func' is a local R data.frame. schema The schema of the resulting SparkDataFrame after the function is applied. The schema must match to output of 'func'. It has to be defined for each output column with preferred output column name and corresponding data type. {code} As you can see, func and schema are listed twice with different wording under Arguments. We should see if we could explain it one way and list them once only. (ie. one copy of "@param func") > Improve method grouping in SparkR generated docs > > > Key: SPARK-16090 > URL: https://issues.apache.org/jira/browse/SPARK-16090 > Project: Spark > Issue Type: Umbrella > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Critical > > This JIRA follows the discussion on > https://github.com/apache/spark/pull/13109 to improve method grouping in > SparkR generated docs. Having one method per doc page is not an R convention. > However, having many methods per doc page would hurt the readability. So a > proper grouping would help. Since we use roxygen2 instead of writing Rd files > directly, we should consider smaller groups to avoid confusion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16114) Add network word count example
[ https://issues.apache.org/jira/browse/SPARK-16114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342700#comment-15342700 ] Apache Spark commented on SPARK-16114: -- User 'jjthomas' has created a pull request for this issue: https://github.com/apache/spark/pull/13816 > Add network word count example > -- > > Key: SPARK-16114 > URL: https://issues.apache.org/jira/browse/SPARK-16114 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: James Thomas > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16114) Add network word count example
[ https://issues.apache.org/jira/browse/SPARK-16114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16114: Assignee: Apache Spark > Add network word count example > -- > > Key: SPARK-16114 > URL: https://issues.apache.org/jira/browse/SPARK-16114 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: James Thomas >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16090) Improve method grouping in SparkR generated docs
[ https://issues.apache.org/jira/browse/SPARK-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342696#comment-15342696 ] Felix Cheung commented on SPARK-16090: -- This is, for example, the HTML output for gapply:
{code}
# S4 method for signature 'GroupedData'
gapply(x, func, schema)

## S4 method for signature 'SparkDataFrame'
gapply(x, cols, func, schema)

Arguments

x       a GroupedData
func    A function to be applied to each group partition specified by GroupedData. The function 'func' takes as argument a key - grouping columns - and a data frame - a local R data.frame. The output of 'func' is a local R data.frame.
schema  The schema of the resulting SparkDataFrame after the function is applied. The schema must match the output of 'func'. It has to be defined for each output column with the preferred output column name and corresponding data type.
cols    Grouping columns
x       A SparkDataFrame
func    A function to be applied to each group partition specified by the grouping column of the SparkDataFrame. The function 'func' takes as argument a key - grouping columns - and a data frame - a local R data.frame. The output of 'func' is a local R data.frame.
schema  The schema of the resulting SparkDataFrame after the function is applied. The schema must match the output of 'func'. It has to be defined for each output column with the preferred output column name and corresponding data type.
{code}
As you can see, func and schema are listed twice with different wording under Arguments. We should see if we can explain each of them one way and list it only once (i.e., keep one copy of "@param func"). > Improve method grouping in SparkR generated docs > > > Key: SPARK-16090 > URL: https://issues.apache.org/jira/browse/SPARK-16090 > Project: Spark > Issue Type: Umbrella > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Critical > > This JIRA follows the discussion on > https://github.com/apache/spark/pull/13109 to improve method grouping in > SparkR generated docs. Having one method per doc page is not an R convention. > However, having many methods per doc page would hurt readability, so a > proper grouping would help. Since we use roxygen2 instead of writing Rd files > directly, we should consider smaller groups to avoid confusion.
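For readers unfamiliar with gapply, the per-group semantics the doc text describes (call a user function with the grouping key and a local data frame, then bind the per-group results together) can be sketched in plain Python. This is an illustrative analogue only, not SparkR API; the function and field names here are made up for the example.

```python
from collections import defaultdict

def gapply_sketch(rows, key_fn, func):
    """Illustrative analogue of SparkR's gapply: group rows by key,
    apply func(key, group) to each group, and concatenate the results."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    out = []
    for key, group in groups.items():
        # func receives the key and the local group, and returns result rows
        out.extend(func(key, group))
    return out

# Example: sum the 'value' field per 'id' group.
rows = [{"id": 1, "value": 2}, {"id": 1, "value": 3}, {"id": 2, "value": 5}]
result = gapply_sketch(
    rows,
    key_fn=lambda r: r["id"],
    func=lambda key, grp: [{"id": key, "total": sum(r["value"] for r in grp)}],
)
```

In the real gapply, the user function receives the key and a local R data.frame and must return a local R data.frame matching the declared schema; the sketch above only mirrors the control flow.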
[jira] [Updated] (SPARK-16105) PCA Reverse Transformer
[ https://issues.apache.org/jira/browse/SPARK-16105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16105: -- Flags: (was: Important) Priority: Minor (was: Major) The transformation is to a lower-dimensional space, so it's not possible in general to reverse it. You can map back, but in your example you still end up with a 15-dimensional subspace of the 96-dimensional space. The matrix is all the information available to make this transformation, and it's already available. > PCA Reverse Transformer > --- > > Key: SPARK-16105 > URL: https://issues.apache.org/jira/browse/SPARK-16105 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 1.6.1 >Reporter: Stefan Panayotov >Priority: Minor > > The PCA class has a fit method that returns a PCAModel. One of the members of > the PCAModel is pc (the Principal Components Matrix). This matrix is available > for inspection, but there is no method to use it for the reverse > transformation back to the original dimension. For example, if I use PCA to > reduce the dimensionality of my space from 96 to 15, I get a 96x15 pc Matrix. > I can do some modeling in my reduced space, and then I need to map back > to the original 96-dimensional space. Basically, I need to multiply my > 15-dimensional vectors by the 96x15 pc Matrix to get back 96-dimensional > vectors. Such a method is missing from the PCA model.
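The reverse mapping the reporter describes is a single matrix multiply: with pc of shape d x k (96x15 here), a k-dimensional reduced vector y maps back as pc . y. A minimal pure-Python sketch, assuming pc stores original dimensions as rows and components as columns (the function name is illustrative, not Spark API):

```python
def pca_inverse_transform(reduced, pc):
    """Map a k-dimensional reduced vector back into the original
    d-dimensional space: original ~= pc . reduced, where pc is d x k.
    Note: this recovers only the projection onto the principal
    subspace, not the variance discarded by the reduction."""
    d, k = len(pc), len(pc[0])
    assert len(reduced) == k, "reduced vector must have k components"
    return [sum(pc[i][j] * reduced[j] for j in range(k)) for i in range(d)]

# Toy example: d=3, k=2, with orthonormal component columns.
pc = [[1.0, 0.0],
      [0.0, 1.0],
      [0.0, 0.0]]
back = pca_inverse_transform([2.0, 3.0], pc)  # -> [2.0, 3.0, 0.0]
```

This is exactly why the result still lies in a 15-dimensional subspace of the 96-dimensional space: the multiply can only reach linear combinations of the 15 component columns.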
[jira] [Commented] (SPARK-16115) Improve output column name for SHOW PARTITIONS command and improve an error message
[ https://issues.apache.org/jira/browse/SPARK-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342689#comment-15342689 ] Sunitha Kambhampati commented on SPARK-16115: - I will submit a PR shortly. > Improve output column name for SHOW PARTITIONS command and improve an error > message > --- > > Key: SPARK-16115 > URL: https://issues.apache.org/jira/browse/SPARK-16115 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sunitha Kambhampati >Priority: Minor > > Opening this issue to address the following: > 1. For the SHOW PARTITIONS command, the column name in the output is > 'result'. In Hive, the column name is 'partition', which is more > descriptive. Change the output schema column name from 'result' to > 'partition'. > 2. Corner case: Improve the error message for a non-existent table. > Right now, the error below is thrown: > scala> spark.sql("show partitions t1"); > org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on > a table that is not partitioned: default.t1;
[jira] [Updated] (SPARK-16115) Improve output column name for SHOW PARTITIONS command and improve an error message
[ https://issues.apache.org/jira/browse/SPARK-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunitha Kambhampati updated SPARK-16115: Description: Opening this issue to address the following: 1. For the SHOW PARTITIONS command, the column name in the output is 'result'. In Hive, the column name is 'partition' which is more descriptive. Change the output schema, column name from 'result' to 'partition' 2. Corner case: Improve the error message for a non-existent table Right now, the below error is thrown: scala> spark.sql("show partitions t1"); org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on a table that is not partitioned: default.t1; was: Opening this issue to address the following: 1. For the SHOW PARTITIONS command, the column name in the output is 'result'. In Hive, the column name is 'partition' which is more descriptive. 2. Corner case: Improve the error message for a non-existent table Right now, the below error is thrown: scala> spark.sql("show partitions t1"); org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on a table that is not partitioned: default.t1; > Improve output column name for SHOW PARTITIONS command and improve an error > message > --- > > Key: SPARK-16115 > URL: https://issues.apache.org/jira/browse/SPARK-16115 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sunitha Kambhampati >Priority: Minor > > Opening this issue to address the following: > 1. For the SHOW PARTITIONS command, the column name in the output is > 'result'. In Hive, the column name is 'partition' which is more > descriptive. Change the output schema, column name from 'result' to > 'partition' > 2. 
Corner case: Improve the error message for a non-existent table > Right now, the below error is thrown: > scala> spark.sql("show partitions t1"); > org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on > a table that is not partitioned: default.t1;
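The corner-case fix amounts to distinguishing "table does not exist" from "table exists but is not partitioned" before raising, so a query against a missing table no longer produces the misleading "not partitioned" message. A hypothetical sketch of that error-handling order (all names here are illustrative, not Spark's actual implementation):

```python
class AnalysisException(Exception):
    """Stand-in for org.apache.spark.sql.AnalysisException."""
    pass

def check_show_partitions(table_name, catalog):
    """Sketch of the proposed SHOW PARTITIONS checks: report a missing
    table explicitly before checking whether it is partitioned."""
    if table_name not in catalog:
        raise AnalysisException("Table or view not found: %s" % table_name)
    partition_columns = catalog[table_name].get("partition_columns", [])
    if not partition_columns:
        raise AnalysisException(
            "SHOW PARTITIONS is not allowed on a table that is not "
            "partitioned: %s" % table_name)
    return partition_columns

# Toy catalog: default.t1 does not exist, default.t2 is partitioned by ds.
catalog = {"default.t2": {"partition_columns": ["ds"]}}
```

With this ordering, the existing "not partitioned" message is reserved for tables that actually exist, which is the behavior the issue asks for.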