[jira] [Updated] (SPARK-19623) Take rows from DataFrame with empty first partition
[ https://issues.apache.org/jira/browse/SPARK-19623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jaeboo Jung updated SPARK-19623: Description: I use Spark 1.6.2 with 1 master and 6 workers. When a DataFrame has an empty first partition, the DataFrame and its underlying RDD behave differently when taking rows from it. If we take only 1000 rows from the DataFrame, it causes an OOME, but the RDD is OK. In detail, DataFrame without an empty first partition => OK DataFrame with an empty first partition => OOME RDD of the DataFrame with an empty first partition => OK The code below reproduces this error.
{code}
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val rdd = sc.parallelize(1 to 1, 1000).map(i => Row.fromSeq(Array.fill(100)(i)))
val schema = StructType(for (i <- 1 to 100) yield {
  StructField("COL" + i, IntegerType, true)
})
// Empty out the first two partitions.
val rdd2 = rdd.mapPartitionsWithIndex((idx, iter) => if (idx == 0 || idx == 1) Iterator[Row]() else iter)

val df1 = sqlContext.createDataFrame(rdd, schema)
df1.take(1000)     // OK

val df2 = sqlContext.createDataFrame(rdd2, schema)
df2.rdd.take(1000) // OK
df2.take(1000)     // OOME
{code}
I tested it on Spark 1.6.2 with 2 GB of driver memory and 5 GB of executor memory. was: I use Spark 1.6.2 with 1 master and 6 workers. When a DataFrame has an empty first partition, the DataFrame and its underlying RDD behave differently when taking rows from it. If we take only 1000 rows from the DataFrame, it causes an OOME, but the RDD is OK. In detail, DataFrame without an empty first partition => OK DataFrame with an empty first partition => OOME RDD of the DataFrame with an empty first partition => OK The code below reproduces this error.
{code}
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val rdd = sc.parallelize(1 to 1, 1000).map(i => Row.fromSeq(Array.fill(100)(i)))
val schema = StructType(for (i <- 1 to 100) yield {
  StructField("COL" + i, IntegerType, true)
})
val rdd2 = rdd.mapPartitionsWithIndex((idx, iter) => if (idx == 0 || idx == 1) Iterator[Row]() else iter)

val df1 = sqlContext.createDataFrame(rdd, schema)
df1.take(1000)     // OK

val df2 = sqlContext.createDataFrame(rdd2, schema)
df2.rdd.take(1000) // OK
df2.take(1000)     // OOME
{code}
> Take rows from DataFrame with empty first partition
> ---
>
> Key: SPARK-19623
> URL: https://issues.apache.org/jira/browse/SPARK-19623
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.2
> Reporter: Jaeboo Jung
> Priority: Minor
>
> I use Spark 1.6.2 with 1 master and 6 workers. When a DataFrame has an empty first partition, the DataFrame and its underlying RDD behave differently when taking rows from it. If we take only 1000 rows from the DataFrame, it causes an OOME, but the RDD is OK. In detail, DataFrame without an empty first partition => OK DataFrame with an empty first partition => OOME RDD of the DataFrame with an empty first partition => OK The code below reproduces this error.
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> val rdd = sc.parallelize(1 to 1, 1000).map(i => Row.fromSeq(Array.fill(100)(i)))
> val schema = StructType(for (i <- 1 to 100) yield {
>   StructField("COL" + i, IntegerType, true)
> })
> val rdd2 = rdd.mapPartitionsWithIndex((idx, iter) => if (idx == 0 || idx == 1) Iterator[Row]() else iter)
> val df1 = sqlContext.createDataFrame(rdd, schema)
> df1.take(1000) // OK
> val df2 = sqlContext.createDataFrame(rdd2, schema)
> df2.rdd.take(1000) // OK
> df2.take(1000) // OOME
> {code}
> I tested it on Spark 1.6.2 with 2 GB of driver memory and 5 GB of executor memory.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
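A plausible driver-side mechanism for the asymmetry above, sketched as a toy simulation: take(n) collects partitions in rounds, and the two code paths may escalate differently when the first round comes back empty. Everything below is invented for illustration (it is not Spark source code); it only shows how an "if nothing was found, fetch all remaining partitions at once" policy pulls nearly every partition into the driver in a single round, while a gradual 4x-growth policy keeps each round bounded.

```python
# Hypothetical simulation of two driver-side take(n) escalation policies
# (invented for illustration; not Spark source code). Each partition is
# represented only by its row count.
def simulate_take(partitions, n, aggressive_on_empty):
    """Return the largest number of partitions fetched in a single round."""
    collected, scanned, max_batch = 0, 0, 0
    num_to_try = 1  # both policies start by scanning one partition
    while collected < n and scanned < len(partitions):
        batch = partitions[scanned:scanned + num_to_try]
        max_batch = max(max_batch, len(batch))
        collected += sum(batch)
        scanned += len(batch)
        if collected == 0 and aggressive_on_empty:
            # Nothing found yet: grab every remaining partition at once.
            num_to_try = len(partitions) - scanned
        else:
            # Grow the scan gradually, 4x the partitions scanned so far.
            num_to_try = scanned * 4
    return max_batch

parts = [0, 0] + [1] * 998  # two empty partitions, then one row each
print(simulate_take(parts, 1000, aggressive_on_empty=False))  # 500
print(simulate_take(parts, 1000, aggressive_on_empty=True))   # 999
```

Under the aggressive policy almost all 1000 partitions of wide 100-column rows would land in the driver in one round, which is consistent with a 2 GB driver running out of memory while the gradual policy survives.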
[jira] [Created] (SPARK-19623) Take rows from DataFrame with empty first partition
Jaeboo Jung created SPARK-19623: --- Summary: Take rows from DataFrame with empty first partition Key: SPARK-19623 URL: https://issues.apache.org/jira/browse/SPARK-19623 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.2 Reporter: Jaeboo Jung Priority: Minor I use Spark 1.6.2 with 1 master and 6 workers. When a DataFrame has an empty first partition, the DataFrame and its underlying RDD behave differently when taking rows from it. If we take only 1000 rows from the DataFrame, it causes an OOME, but the RDD is OK. In detail, DataFrame without an empty first partition => OK DataFrame with an empty first partition => OOME RDD of the DataFrame with an empty first partition => OK The code below reproduces this error.
{code}
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val rdd = sc.parallelize(1 to 1, 1000).map(i => Row.fromSeq(Array.fill(100)(i)))
val schema = StructType(for (i <- 1 to 100) yield {
  StructField("COL" + i, IntegerType, true)
})
// Empty out the first two partitions.
val rdd2 = rdd.mapPartitionsWithIndex((idx, iter) => if (idx == 0 || idx == 1) Iterator[Row]() else iter)

val df1 = sqlContext.createDataFrame(rdd, schema)
df1.take(1000)     // OK

val df2 = sqlContext.createDataFrame(rdd2, schema)
df2.rdd.take(1000) // OK
df2.take(1000)     // OOME
{code}
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18285) approxQuantile in R support multi-column
[ https://issues.apache.org/jira/browse/SPARK-18285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18285: Assignee: (was: Apache Spark) > approxQuantile in R support multi-column > > > Key: SPARK-18285 > URL: https://issues.apache.org/jira/browse/SPARK-18285 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: zhengruifeng > > approxQuantile in R should support multi-column. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18285) approxQuantile in R support multi-column
[ https://issues.apache.org/jira/browse/SPARK-18285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869400#comment-15869400 ] Apache Spark commented on SPARK-18285: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/16951 > approxQuantile in R support multi-column > > > Key: SPARK-18285 > URL: https://issues.apache.org/jira/browse/SPARK-18285 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: zhengruifeng > > approxQuantile in R should support multi-column. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-19619) SparkR approxQuantile supports input multiple columns
[ https://issues.apache.org/jira/browse/SPARK-19619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang closed SPARK-19619. --- Resolution: Duplicate > SparkR approxQuantile supports input multiple columns > - > > Key: SPARK-19619 > URL: https://issues.apache.org/jira/browse/SPARK-19619 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Yanbo Liang >Priority: Minor > > SparkR approxQuantile supports input multiple columns. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18285) approxQuantile in R support multi-column
[ https://issues.apache.org/jira/browse/SPARK-18285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18285: Assignee: Apache Spark > approxQuantile in R support multi-column > > > Key: SPARK-18285 > URL: https://issues.apache.org/jira/browse/SPARK-18285 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: zhengruifeng >Assignee: Apache Spark > > approxQuantile in R should support multi-column. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19622) Fix an HTTP error in a paged table when using a `Go` button to search.
[ https://issues.apache.org/jira/browse/SPARK-19622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19622: Assignee: Apache Spark > Fix an HTTP error in a paged table when using a `Go` button to search. > - > > Key: SPARK-19622 > URL: https://issues.apache.org/jira/browse/SPARK-19622 > Project: Spark > Issue Type: Bug > Components: Web UI > Affects Versions: 2.1.0 > Reporter: StanZhai > Assignee: Apache Spark > Priority: Minor > Attachments: screenshot-1.png > > > The search function of the paged table is not available because we don't skip > the hash data of the request path. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19622) Fix an HTTP error in a paged table when using a `Go` button to search.
[ https://issues.apache.org/jira/browse/SPARK-19622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19622: Assignee: (was: Apache Spark) > Fix an HTTP error in a paged table when using a `Go` button to search. > - > > Key: SPARK-19622 > URL: https://issues.apache.org/jira/browse/SPARK-19622 > Project: Spark > Issue Type: Bug > Components: Web UI > Affects Versions: 2.1.0 > Reporter: StanZhai > Priority: Minor > Attachments: screenshot-1.png > > > The search function of the paged table is not available because we don't skip > the hash data of the request path. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19622) Fix an HTTP error in a paged table when using a `Go` button to search.
[ https://issues.apache.org/jira/browse/SPARK-19622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869389#comment-15869389 ] Apache Spark commented on SPARK-19622: -- User 'stanzhai' has created a pull request for this issue: https://github.com/apache/spark/pull/16953 > Fix an HTTP error in a paged table when using a `Go` button to search. > - > > Key: SPARK-19622 > URL: https://issues.apache.org/jira/browse/SPARK-19622 > Project: Spark > Issue Type: Bug > Components: Web UI > Affects Versions: 2.1.0 > Reporter: StanZhai > Priority: Minor > Attachments: screenshot-1.png > > > The search function of the paged table is not available because we don't skip > the hash data of the request path. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19622) Fix an HTTP error in a paged table when using a `Go` button to search.
[ https://issues.apache.org/jira/browse/SPARK-19622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-19622: - Attachment: screenshot-1.png > Fix an HTTP error in a paged table when using a `Go` button to search. > - > > Key: SPARK-19622 > URL: https://issues.apache.org/jira/browse/SPARK-19622 > Project: Spark > Issue Type: Bug > Components: Web UI > Affects Versions: 2.1.0 > Reporter: StanZhai > Priority: Minor > Attachments: screenshot-1.png > > > The search function of the paged table is not available because we don't skip > the hash data of the request path. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19622) Fix an HTTP error in a paged table when using a `Go` button to search.
StanZhai created SPARK-19622: Summary: Fix an HTTP error in a paged table when using a `Go` button to search. Key: SPARK-19622 URL: https://issues.apache.org/jira/browse/SPARK-19622 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.1.0 Reporter: StanZhai Priority: Minor The search function of the paged table is not available because we don't skip the hash data of the request path. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
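The fix described, skipping the hash ("#...") portion of the current URL before building the `Go` search request, can be sketched independently of the Web UI. The helper below is a hypothetical Python analogue of the JavaScript fix (names invented; this is not the actual patch in PR 16953): it drops the fragment first, so the hash data never leaks into the request path or query string.

```python
from urllib.parse import urlencode

def build_search_url(current_url, params):
    # Drop the fragment ("#...") first; otherwise the hash data ends up in
    # the request path/query and the server rejects or mis-parses it.
    base = current_url.split("#", 1)[0]
    sep = "&" if "?" in base else "?"
    return base + sep + urlencode(params)

# The fragment "#tasks" is discarded before the page parameter is appended:
print(build_search_url("http://host:4040/stages/stage/?id=5&attempt=0#tasks",
                       {"task.page": 2}))
# http://host:4040/stages/stage/?id=5&attempt=0&task.page=2
```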
[jira] [Commented] (SPARK-19594) StreamingQueryListener fails to handle QueryTerminatedEvent if more than one listener exists
[ https://issues.apache.org/jira/browse/SPARK-19594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869368#comment-15869368 ] Eyal Zituny commented on SPARK-19594: - That will work, but I will have to remove the "final" from the "postToAll" method, which is part of Spark core. Another option can be to change the post(event: StreamingQueryListener.Event) method:
{code}
def post(event: StreamingQueryListener.Event) {
  event match {
    case s: QueryStartedEvent =>
      activeQueryRunIds.synchronized { activeQueryRunIds += s.runId }
      sparkListenerBus.post(s)
      // post to local listeners to trigger callbacks
      postToAll(s)
    case t: QueryTerminatedEvent =>
      // run all the listeners synchronized before removing the id from the list
      postToAll(t)
      activeQueryRunIds.synchronized { activeQueryRunIds -= t.runId }
    case _ =>
      sparkListenerBus.post(event)
  }
}
{code}
> StreamingQueryListener fails to handle QueryTerminatedEvent if more than one listener exists > - > > Key: SPARK-19594 > URL: https://issues.apache.org/jira/browse/SPARK-19594 > Project: Spark > Issue Type: Bug > Components: Structured Streaming > Affects Versions: 2.1.0 > Reporter: Eyal Zituny > Priority: Minor > > To reproduce: > * create a Spark session > * add multiple streaming query listeners > * create a simple query > * stop the query > Result -> only the first listener handles the QueryTerminatedEvent. > This might happen because the query run id is removed from activeQueryRunIds once onQueryTerminated is called (StreamingQueryListenerBus:115). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
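The race described above can be modeled without Spark at all. In this toy bus (class and method names are invented for illustration; this is not Spark's StreamingQueryListenerBus), forgetting the run id inside the per-listener dispatch loop means only the first listener observes the termination; deferring the removal until after the loop, as the comment above proposes, delivers the event to everyone.

```python
# Toy model of a listener bus (names invented; not Spark's actual classes).
class Bus:
    def __init__(self, n_listeners):
        self.received = [[] for _ in range(n_listeners)]  # one inbox per listener
        self.active = set()  # run ids of active queries

    def post_started(self, run_id):
        self.active.add(run_id)
        for inbox in self.received:
            inbox.append(("started", run_id))

    def post_terminated_buggy(self, run_id):
        # Bug: the run id is forgotten as soon as the first listener handles
        # the event, so every later listener sees the query as inactive.
        for inbox in self.received:
            if run_id in self.active:
                inbox.append(("terminated", run_id))
                self.active.discard(run_id)

    def post_terminated_fixed(self, run_id):
        # Fix: dispatch to all listeners first, then forget the run id.
        for inbox in self.received:
            if run_id in self.active:
                inbox.append(("terminated", run_id))
        self.active.discard(run_id)

bus = Bus(3)
bus.post_started("q1")
bus.post_terminated_buggy("q1")
print([len(inbox) for inbox in bus.received])  # [2, 1, 1]: only the first listener saw it
```

With post_terminated_fixed every inbox ends up with both events, matching the expected behavior for multiple listeners.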
[jira] [Created] (SPARK-19621) R Windows AppVeyor test should run CRAN checks
Felix Cheung created SPARK-19621: Summary: R Windows AppVeyor test should run CRAN checks Key: SPARK-19621 URL: https://issues.apache.org/jira/browse/SPARK-19621 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.0 Reporter: Felix Cheung We should run CRAN checks (see check-cran.sh) even on Windows, since cross-platform testing is part of the CRAN release requirement. check-cran.sh, however, is a bash script as of now. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL
[ https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19618. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16948 [https://github.com/apache/spark/pull/16948] > Inconsistency wrt max. buckets allowed from Dataframe API vs SQL > > > Key: SPARK-19618 > URL: https://issues.apache.org/jira/browse/SPARK-19618 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.1.0 > Reporter: Tejas Patil > Fix For: 2.2.0 > > > A high number of buckets is allowed while creating a table via SQL query:
> {code}
> sparkSession.sql("""
>   CREATE TABLE bucketed_table(col1 INT) USING parquet
>   CLUSTERED BY (col1) SORTED BY (col1) INTO 147483647 BUCKETS
> """)
> sparkSession.sql("DESC FORMATTED bucketed_table").collect.foreach(println)
>
> [Num Buckets:,147483647,]
> [Bucket Columns:,[col1],]
> [Sort Columns:,[col1],]
>
> {code}
> Trying the same via the DataFrame API does not work:
> {code}
> df.write.format("orc").bucketBy(147483647, "j","k").sortBy("j","k").saveAsTable("bucketed_table")
> java.lang.IllegalArgumentException: requirement failed: Bucket number must be greater than 0 and less than 10.
>   at scala.Predef$.require(Predef.scala:224)
>   at org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:293)
>   at org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:291)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.sql.DataFrameWriter.getBucketSpec(DataFrameWriter.scala:291)
>   at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:429)
>   at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:410)
>   at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:365)
>   ... 50 elided
> {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
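The root inconsistency is that each entry point enforced its own bucket-count check. A fix in the spirit of the one above routes both the SQL path and the DataFrame API through a single shared validation. The sketch below is a hypothetical Python rendering of that idea; the constant and the error message are assumptions for illustration, not Spark's actual values.

```python
MAX_BUCKETS = 100000  # assumed ceiling for illustration; Spark defines the real limit

def check_bucket_count(n):
    # One shared check used by every entry point, so the SQL path and the
    # DataFrame API can no longer disagree about what a legal bucket count is.
    if not 0 < n <= MAX_BUCKETS:
        raise ValueError(
            f"Bucket number must be greater than 0 and at most {MAX_BUCKETS}, got {n}")
    return n

print(check_bucket_count(8))       # a sane bucket count passes
try:
    check_bucket_count(147483647)  # the value from the report is rejected everywhere
except ValueError as err:
    print(err)
```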
[jira] [Assigned] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL
[ https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19618: --- Assignee: Tejas Patil > Inconsistency wrt max. buckets allowed from Dataframe API vs SQL > > > Key: SPARK-19618 > URL: https://issues.apache.org/jira/browse/SPARK-19618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Tejas Patil >Assignee: Tejas Patil > Fix For: 2.2.0 > > > High number of buckets is allowed while creating a table via SQL query: > {code} > sparkSession.sql(""" > CREATE TABLE bucketed_table(col1 INT) USING parquet > CLUSTERED BY (col1) SORTED BY (col1) INTO 147483647 BUCKETS > """) > sparkSession.sql("DESC FORMATTED bucketed_table").collect.foreach(println) > > [Num Buckets:,147483647,] > [Bucket Columns:,[col1],] > [Sort Columns:,[col1],] > > {code} > Trying the same via dataframe API does not work: > {code} > > df.write.format("orc").bucketBy(147483647, > > "j","k").sortBy("j","k").saveAsTable("bucketed_table") > java.lang.IllegalArgumentException: requirement failed: Bucket number must be > greater than 0 and less than 10. > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:293) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:291) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.DataFrameWriter.getBucketSpec(DataFrameWriter.scala:291) > at > org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:429) > at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:410) > at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:365) > ... 50 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19619) SparkR approxQuantile supports input multiple columns
[ https://issues.apache.org/jira/browse/SPARK-19619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869337#comment-15869337 ] Felix Cheung commented on SPARK-19619: -- dup of SPARK-18285 > SparkR approxQuantile supports input multiple columns > - > > Key: SPARK-19619 > URL: https://issues.apache.org/jira/browse/SPARK-19619 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Yanbo Liang >Priority: Minor > > SparkR approxQuantile supports input multiple columns. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19620) Incorrect exchange coordinator Id in physical plan
[ https://issues.apache.org/jira/browse/SPARK-19620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869336#comment-15869336 ] Apache Spark commented on SPARK-19620: -- User 'carsonwang' has created a pull request for this issue: https://github.com/apache/spark/pull/16952 > Incorrect exchange coordinator Id in physical plan > -- > > Key: SPARK-19620 > URL: https://issues.apache.org/jira/browse/SPARK-19620 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.1.0 > Reporter: Carson Wang > Priority: Minor > > When adaptive execution is enabled, an exchange coordinator is used in the Exchange operators. For a Join, the same exchange coordinator is used for its two Exchanges, but the physical plan shows two different coordinator Ids, which is confusing. > Here is an example:
> {code}
> == Physical Plan ==
> *Project [key1#3L, value2#12L]
> +- *SortMergeJoin [key1#3L], [key2#11L], Inner
>    :- *Sort [key1#3L ASC NULLS FIRST], false, 0
>    :  +- Exchange(coordinator id: 1804587700) hashpartitioning(key1#3L, 10), coordinator[target post-shuffle partition size: 67108864]
>    :     +- *Project [(id#0L % 500) AS key1#3L]
>    :        +- *Filter isnotnull((id#0L % 500))
>    :           +- *Range (0, 1000, step=1, splits=Some(10))
>    +- *Sort [key2#11L ASC NULLS FIRST], false, 0
>       +- Exchange(coordinator id: 793927319) hashpartitioning(key2#11L, 10), coordinator[target post-shuffle partition size: 67108864]
>          +- *Project [(id#8L % 500) AS key2#11L, id#8L AS value2#12L]
>             +- *Filter isnotnull((id#8L % 500))
>                +- *Range (0, 1000, step=1, splits=Some(10))
> {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19620) Incorrect exchange coordinator Id in physical plan
[ https://issues.apache.org/jira/browse/SPARK-19620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19620: Assignee: Apache Spark > Incorrect exchange coordinator Id in physical plan > -- > > Key: SPARK-19620 > URL: https://issues.apache.org/jira/browse/SPARK-19620 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.1.0 > Reporter: Carson Wang > Assignee: Apache Spark > Priority: Minor > > When adaptive execution is enabled, an exchange coordinator is used in the Exchange operators. For a Join, the same exchange coordinator is used for its two Exchanges, but the physical plan shows two different coordinator Ids, which is confusing. > Here is an example:
> {code}
> == Physical Plan ==
> *Project [key1#3L, value2#12L]
> +- *SortMergeJoin [key1#3L], [key2#11L], Inner
>    :- *Sort [key1#3L ASC NULLS FIRST], false, 0
>    :  +- Exchange(coordinator id: 1804587700) hashpartitioning(key1#3L, 10), coordinator[target post-shuffle partition size: 67108864]
>    :     +- *Project [(id#0L % 500) AS key1#3L]
>    :        +- *Filter isnotnull((id#0L % 500))
>    :           +- *Range (0, 1000, step=1, splits=Some(10))
>    +- *Sort [key2#11L ASC NULLS FIRST], false, 0
>       +- Exchange(coordinator id: 793927319) hashpartitioning(key2#11L, 10), coordinator[target post-shuffle partition size: 67108864]
>          +- *Project [(id#8L % 500) AS key2#11L, id#8L AS value2#12L]
>             +- *Filter isnotnull((id#8L % 500))
>                +- *Range (0, 1000, step=1, splits=Some(10))
> {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19620) Incorrect exchange coordinator Id in physical plan
[ https://issues.apache.org/jira/browse/SPARK-19620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19620: Assignee: (was: Apache Spark) > Incorrect exchange coordinator Id in physical plan > -- > > Key: SPARK-19620 > URL: https://issues.apache.org/jira/browse/SPARK-19620 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.1.0 > Reporter: Carson Wang > Priority: Minor > > When adaptive execution is enabled, an exchange coordinator is used in the Exchange operators. For a Join, the same exchange coordinator is used for its two Exchanges, but the physical plan shows two different coordinator Ids, which is confusing. > Here is an example:
> {code}
> == Physical Plan ==
> *Project [key1#3L, value2#12L]
> +- *SortMergeJoin [key1#3L], [key2#11L], Inner
>    :- *Sort [key1#3L ASC NULLS FIRST], false, 0
>    :  +- Exchange(coordinator id: 1804587700) hashpartitioning(key1#3L, 10), coordinator[target post-shuffle partition size: 67108864]
>    :     +- *Project [(id#0L % 500) AS key1#3L]
>    :        +- *Filter isnotnull((id#0L % 500))
>    :           +- *Range (0, 1000, step=1, splits=Some(10))
>    +- *Sort [key2#11L ASC NULLS FIRST], false, 0
>       +- Exchange(coordinator id: 793927319) hashpartitioning(key2#11L, 10), coordinator[target post-shuffle partition size: 67108864]
>          +- *Project [(id#8L % 500) AS key2#11L, id#8L AS value2#12L]
>             +- *Filter isnotnull((id#8L % 500))
>                +- *Range (0, 1000, step=1, splits=Some(10))
> {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19620) Incorrect exchange coordinator Id in physical plan
Carson Wang created SPARK-19620: --- Summary: Incorrect exchange coordinator Id in physical plan Key: SPARK-19620 URL: https://issues.apache.org/jira/browse/SPARK-19620 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Carson Wang Priority: Minor When adaptive execution is enabled, an exchange coordinator is used in the Exchange operators. For a Join, the same exchange coordinator is used for its two Exchanges, but the physical plan shows two different coordinator Ids, which is confusing. Here is an example:
{code}
== Physical Plan ==
*Project [key1#3L, value2#12L]
+- *SortMergeJoin [key1#3L], [key2#11L], Inner
   :- *Sort [key1#3L ASC NULLS FIRST], false, 0
   :  +- Exchange(coordinator id: 1804587700) hashpartitioning(key1#3L, 10), coordinator[target post-shuffle partition size: 67108864]
   :     +- *Project [(id#0L % 500) AS key1#3L]
   :        +- *Filter isnotnull((id#0L % 500))
   :           +- *Range (0, 1000, step=1, splits=Some(10))
   +- *Sort [key2#11L ASC NULLS FIRST], false, 0
      +- Exchange(coordinator id: 793927319) hashpartitioning(key2#11L, 10), coordinator[target post-shuffle partition size: 67108864]
         +- *Project [(id#8L % 500) AS key2#11L, id#8L AS value2#12L]
            +- *Filter isnotnull((id#8L % 500))
               +- *Range (0, 1000, step=1, splits=Some(10))
{code}
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
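Since both Exchanges hold the same coordinator object, the id rendered in the plan string should come from that one shared instance. The toy classes below (invented names, not Spark's operators) illustrate the expected behavior: describing each operator via the identity of the shared coordinator yields one id for both, whereas the bug report shows two different ids for what is logically one coordinator.

```python
# Toy model of the rendering issue (class names invented, not Spark's):
# two Exchange operators sharing one coordinator should print the same
# coordinator id in the plan string.
class Coordinator:
    pass

class Exchange:
    def __init__(self, coordinator):
        self.coordinator = coordinator

    def describe(self):
        # Render the identity of the *shared* coordinator object, so both
        # operators backed by it show one id.
        return f"Exchange(coordinator id: {id(self.coordinator)})"

coord = Coordinator()
left, right = Exchange(coord), Exchange(coord)
print(left.describe() == right.describe())  # True: one coordinator, one id
```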
[jira] [Commented] (SPARK-19326) Speculated task attempts do not get launched in a few scenarios
[ https://issues.apache.org/jira/browse/SPARK-19326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869314#comment-15869314 ] Tejas Patil commented on SPARK-19326: - > You might be able to just write an `if` case that checks whether speculation > is enabled and run some logic in the listener to detect speculated tasks. For ExecutorAllocationManager to detect that there needs to be speculation, it would basically have to duplicate what TaskSetManager does to find candidates for speculation (unless you have some better way). That's bad because there would be two entities making decisions about speculation. > Speculated task attempts do not get launched in a few scenarios > - > > Key: SPARK-19326 > URL: https://issues.apache.org/jira/browse/SPARK-19326 > Project: Spark > Issue Type: Bug > Components: Scheduler > Affects Versions: 2.0.2, 2.1.0 > Reporter: Tejas Patil > > Speculated copies of tasks do not get launched in some cases. > Examples: > - All the running executors have no CPU slots left to accommodate a speculated copy of the task(s). If all the running executors reside on a set of slow / bad hosts, they will keep the job running for a long time. > - `spark.task.cpus` > 1 and the running executor has not filled up all its CPU slots, since the [speculated copies of tasks should run on a different host|https://github.com/apache/spark/blob/2e139eed3194c7b8814ff6cf007d4e8a874c1e4d/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L283] and not the host where the first copy was launched. > In both these cases, `ExecutorAllocationManager` does not know about pending speculative task attempts and thinks that all the resource demands are well taken care of ([relevant code|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L265]). > This adds variation to job completion times and, more importantly, causes SLA misses :( In prod, with a large number of jobs, I see this happening more often than one would think. Chasing the bad hosts or the reason for slowness doesn't scale. > Here is a tiny repro. Note that you need to launch this with (Mesos or YARN or standalone deploy mode) along with `--conf spark.speculation=true --conf spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=100`
> {code}
> val n = 100
> val someRDD = sc.parallelize(1 to n, n)
> someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
>   if (index == 1) {
>     Thread.sleep(Long.MaxValue) // fake long running task(s)
>   }
>   it.toList.map(x => index + ", " + x).iterator
> }).collect
> {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
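The gap described above can be made concrete with a small demand calculation. The function below is a hypothetical sketch (the name, signature, and formula are invented for illustration, not Spark's actual ExecutorAllocationManager logic): when pending speculative attempts are excluded from the demand, the manager concludes the current executors suffice and the speculated copy never gets a slot.

```python
import math

# Hypothetical executor-demand calculation: count how many executors are
# needed for the outstanding work, optionally including pending speculative
# task attempts (all names/formulas here are assumptions for illustration).
def target_executors(pending, running, pending_speculative,
                     tasks_per_executor, include_speculative):
    demand = pending + running
    if include_speculative:
        demand += pending_speculative
    return math.ceil(demand / tasks_per_executor)

# One fully busy executor (4 running tasks, 4 slots per executor) and one
# pending speculative copy of a straggler:
print(target_executors(0, 4, 1, 4, include_speculative=False))  # 1 -> no room for the copy
print(target_executors(0, 4, 1, 4, include_speculative=True))   # 2 -> a slot opens up
```

Counting speculative attempts raises the target from 1 to 2 executors, which is exactly the extra capacity the speculated copy needs.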
[jira] [Updated] (SPARK-19619) SparkR approxQuantile supports input multiple columns
[ https://issues.apache.org/jira/browse/SPARK-19619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-19619: Description: SparkR approxQuantile supports input multiple columns. (was: SparkR approxQuantile support multiple columns) > SparkR approxQuantile supports input multiple columns > - > > Key: SPARK-19619 > URL: https://issues.apache.org/jira/browse/SPARK-19619 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Yanbo Liang >Priority: Minor > > SparkR approxQuantile supports input multiple columns. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19619) SparkR approxQuantile support multiple columns
[ https://issues.apache.org/jira/browse/SPARK-19619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19619: Assignee: (was: Apache Spark) > SparkR approxQuantile support multiple columns > -- > > Key: SPARK-19619 > URL: https://issues.apache.org/jira/browse/SPARK-19619 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Yanbo Liang >Priority: Minor > > SparkR approxQuantile support multiple columns -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19619) SparkR approxQuantile support multiple columns
[ https://issues.apache.org/jira/browse/SPARK-19619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19619: Assignee: Apache Spark > SparkR approxQuantile support multiple columns > -- > > Key: SPARK-19619 > URL: https://issues.apache.org/jira/browse/SPARK-19619 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > SparkR approxQuantile support multiple columns -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19619) SparkR approxQuantile supports input multiple columns
[ https://issues.apache.org/jira/browse/SPARK-19619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-19619: Summary: SparkR approxQuantile supports input multiple columns (was: SparkR approxQuantile support multiple columns) > SparkR approxQuantile supports input multiple columns > - > > Key: SPARK-19619 > URL: https://issues.apache.org/jira/browse/SPARK-19619 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Yanbo Liang >Priority: Minor > > SparkR approxQuantile support multiple columns
[jira] [Commented] (SPARK-19619) SparkR approxQuantile support multiple columns
[ https://issues.apache.org/jira/browse/SPARK-19619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869283#comment-15869283 ] Apache Spark commented on SPARK-19619: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/16951 > SparkR approxQuantile support multiple columns > -- > > Key: SPARK-19619 > URL: https://issues.apache.org/jira/browse/SPARK-19619 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Yanbo Liang >Priority: Minor > > SparkR approxQuantile support multiple columns
[jira] [Created] (SPARK-19619) SparkR approxQuantile support multiple columns
Yanbo Liang created SPARK-19619: --- Summary: SparkR approxQuantile support multiple columns Key: SPARK-19619 URL: https://issues.apache.org/jira/browse/SPARK-19619 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 2.1.0 Reporter: Yanbo Liang Priority: Minor SparkR approxQuantile support multiple columns
[jira] [Commented] (SPARK-19326) Speculated task attempts do not get launched in few scenarios
[ https://issues.apache.org/jira/browse/SPARK-19326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869244#comment-15869244 ] Andrew Or commented on SPARK-19326: --- I would say it's a bad idea to make ExecutorAllocationManager talk to the TaskSetManager. The existing listener interface is relatively isolated. I'm not sure if you need to introduce a new event to capture speculation. You might be able to just write an `if` case that checks whether speculation is enabled and run some logic in the listener to detect speculated tasks. > Speculated task attempts do not get launched in few scenarios > - > > Key: SPARK-19326 > URL: https://issues.apache.org/jira/browse/SPARK-19326 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.2, 2.1.0 >Reporter: Tejas Patil > > Speculated copies of tasks do not get launched in some cases. > Examples: > - All the running executors have no CPU slots left to accommodate a > speculated copy of the task(s). If the all running executors reside over a > set of slow / bad hosts, they will keep the job running for long time > - `spark.task.cpus` > 1 and the running executor has not filled up all its > CPU slots. Since the [speculated copies of tasks should run on different > host|https://github.com/apache/spark/blob/2e139eed3194c7b8814ff6cf007d4e8a874c1e4d/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L283] > and not the host where the first copy was launched. > In both these cases, `ExecutorAllocationManager` does not know about pending > speculation task attempts and thinks that all the resource demands are well > taken care of. 
([relevant > code|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L265]) > This adds variation in the job completion times and more importantly SLA > misses :( In prod, with a large number of jobs, I see this happening more > often than one would think. Chasing the bad hosts or reason for slowness > doesn't scale. > Here is a tiny repro. Note that you need to launch this with (Mesos or YARN > or standalone deploy mode) along with `--conf spark.speculation=true --conf > spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=100` > {code} > val n = 100 > val someRDD = sc.parallelize(1 to n, n) > someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => { > if (index == 1) { > Thread.sleep(Long.MaxValue) // fake long running task(s) > } > it.toList.map(x => index + ", " + x).iterator > }).collect > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
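The listener-based approach suggested in the comment above (detecting speculated tasks from the existing listener interface rather than coupling `ExecutorAllocationManager` to `TaskSetManager`) could be sketched roughly as follows. This is a hypothetical illustration, not the actual fix; it assumes `TaskInfo.speculative` is populated for speculated attempts, and the class name is made up:

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd, SparkListenerTaskStart}

// Hypothetical sketch: count running speculative attempts purely from
// listener events, so an allocation policy could fold them into its
// pending-resource estimate without talking to TaskSetManager directly.
class SpeculationTrackingListener extends SparkListener {
  private val runningSpeculative = new AtomicInteger(0)

  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit =
    if (taskStart.taskInfo.speculative) runningSpeculative.incrementAndGet()

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    if (taskEnd.taskInfo.speculative) runningSpeculative.decrementAndGet()

  def speculativeTaskCount: Int = runningSpeculative.get()
}
```

Such a listener would be registered with `sc.addSparkListener(...)`; whether a simple running count is enough for the allocation math is exactly the design question discussed in this thread.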
[jira] [Resolved] (SPARK-19603) Fix StreamingQuery explain command
[ https://issues.apache.org/jira/browse/SPARK-19603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19603. -- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 > Fix StreamingQuery explain command > -- > > Key: SPARK-19603 > URL: https://issues.apache.org/jira/browse/SPARK-19603 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.1.1, 2.2.0 > > > Right now StreamingQuery.explain doesn't show the correct streaming physical > plan.
[jira] [Commented] (SPARK-19326) Speculated task attempts do not get launched in few scenarios
[ https://issues.apache.org/jira/browse/SPARK-19326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869112#comment-15869112 ] Tejas Patil commented on SPARK-19326: - Thanks for the info! [~andrewor14] / [~kayousterhout]: I am happy to work on this. Two approaches I can think of are: - Add an event to the listener to inform `ExecutorAllocationManager` about tasks launched via speculation. - `ExecutorAllocationManager` should not depend on the listener; instead, some other event-based mechanism should drive communication between `ExecutorAllocationManager` and `TaskSetManager`. This is a cleaner solution but it would be a bigger change. What do you think? > Speculated task attempts do not get launched in few scenarios > - > > Key: SPARK-19326 > URL: https://issues.apache.org/jira/browse/SPARK-19326 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.2, 2.1.0 >Reporter: Tejas Patil > > Speculated copies of tasks do not get launched in some cases. > Examples: > - All the running executors have no CPU slots left to accommodate a > speculated copy of the task(s). If the all running executors reside over a > set of slow / bad hosts, they will keep the job running for long time > - `spark.task.cpus` > 1 and the running executor has not filled up all its > CPU slots. Since the [speculated copies of tasks should run on different > host|https://github.com/apache/spark/blob/2e139eed3194c7b8814ff6cf007d4e8a874c1e4d/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L283] > and not the host where the first copy was launched. > In both these cases, `ExecutorAllocationManager` does not know about pending > speculation task attempts and thinks that all the resource demands are well > taken care of. 
([relevant > code|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L265]) > This adds variation in the job completion times and more importantly SLA > misses :( In prod, with a large number of jobs, I see this happening more > often than one would think. Chasing the bad hosts or reason for slowness > doesn't scale. > Here is a tiny repro. Note that you need to launch this with (Mesos or YARN > or standalone deploy mode) along with `--conf spark.speculation=true --conf > spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=100` > {code} > val n = 100 > val someRDD = sc.parallelize(1 to n, n) > someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => { > if (index == 1) { > Thread.sleep(Long.MaxValue) // fake long running task(s) > } > it.toList.map(x => index + ", " + x).iterator > }).collect > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19399) R Coalesce on DataFrame and coalesce on column
[ https://issues.apache.org/jira/browse/SPARK-19399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869067#comment-15869067 ] Apache Spark commented on SPARK-19399: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/16950 > R Coalesce on DataFrame and coalesce on column > -- > > Key: SPARK-19399 > URL: https://issues.apache.org/jira/browse/SPARK-19399 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.1.1, 2.2.0 > > > coalesce on DataFrame is different from repartition, where shuffling is > avoided. We should have that in SparkR. > coalesce on Column is convenient to have in expression. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16122) Spark History Server REST API missing an environment endpoint per application
[ https://issues.apache.org/jira/browse/SPARK-16122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16122: Assignee: (was: Apache Spark) > Spark History Server REST API missing an environment endpoint per application > - > > Key: SPARK-16122 > URL: https://issues.apache.org/jira/browse/SPARK-16122 > Project: Spark > Issue Type: New Feature > Components: Documentation, Web UI >Affects Versions: 1.6.1 >Reporter: Neelesh Srinivas Salian >Priority: Minor > Labels: Docs, WebUI > > The WebUI for the Spark History Server has the Environment tab that allows > you to view the Environment for that job. > With Runtime , Spark properties...etc. > How about adding an endpoint to the REST API that looks and points to this > environment tab for that application? > /applications/[app-id]/environment > Added Docs too so that we can spawn a subsequent Documentation addition to > get it included in the API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16122) Spark History Server REST API missing an environment endpoint per application
[ https://issues.apache.org/jira/browse/SPARK-16122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16122: Assignee: Apache Spark > Spark History Server REST API missing an environment endpoint per application > - > > Key: SPARK-16122 > URL: https://issues.apache.org/jira/browse/SPARK-16122 > Project: Spark > Issue Type: New Feature > Components: Documentation, Web UI >Affects Versions: 1.6.1 >Reporter: Neelesh Srinivas Salian >Assignee: Apache Spark >Priority: Minor > Labels: Docs, WebUI > > The WebUI for the Spark History Server has the Environment tab that allows > you to view the Environment for that job. > With Runtime , Spark properties...etc. > How about adding an endpoint to the REST API that looks and points to this > environment tab for that application? > /applications/[app-id]/environment > Added Docs too so that we can spawn a subsequent Documentation addition to > get it included in the API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16122) Spark History Server REST API missing an environment endpoint per application
[ https://issues.apache.org/jira/browse/SPARK-16122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869062#comment-15869062 ] Apache Spark commented on SPARK-16122: -- User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/16949 > Spark History Server REST API missing an environment endpoint per application > - > > Key: SPARK-16122 > URL: https://issues.apache.org/jira/browse/SPARK-16122 > Project: Spark > Issue Type: New Feature > Components: Documentation, Web UI >Affects Versions: 1.6.1 >Reporter: Neelesh Srinivas Salian >Priority: Minor > Labels: Docs, WebUI > > The WebUI for the Spark History Server has the Environment tab that allows > you to view the Environment for that job. > With Runtime , Spark properties...etc. > How about adding an endpoint to the REST API that looks and points to this > environment tab for that application? > /applications/[app-id]/environment > Added Docs too so that we can spawn a subsequent Documentation addition to > get it included in the API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19460) Update dataset used in R documentation, examples to reduce warning noise and confusions
[ https://issues.apache.org/jira/browse/SPARK-19460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-19460: - Yes- it's better to address the root issue with column name but it wouldn't hurt to avoid confusing everyone by not using iris every where. > Update dataset used in R documentation, examples to reduce warning noise and > confusions > --- > > Key: SPARK-19460 > URL: https://issues.apache.org/jira/browse/SPARK-19460 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > > Running build we have a bunch of warnings from using the `iris` dataset, for > example. > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > Warning in FUN(X[[4L]], ...) : > Use Petal_Width instead of Petal.Width as column name > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > Warning in FUN(X[[4L]], ...) : > Use Petal_Width instead of Petal.Width as column name > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > These are the results of having `.` in the column name. For reference, see > SPARK-12191, SPARK-11976. Since it involves changing SQL, if we couldn't > support that there then we should strongly consider using other dataset > without `.`, eg. 
`cars` > And we should update this in API doc (roxygen2 doc string), vignettes, > programming guide, R code example.
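The warnings above all trace back to `.` in column names. A small Scala illustration (not from the ticket; data and names are made up) of why dotted names are awkward in Spark SQL, which is why SparkR rewrites `Sepal.Length` to `Sepal_Length` and warns:

```scala
import org.apache.spark.sql.SparkSession

// Illustration only: a dot in a column name collides with Spark SQL's
// struct-field accessor syntax (parent.child), so the name must be
// backtick-escaped to be treated as a single literal column name.
val spark = SparkSession.builder().master("local[1]").appName("dotted-columns").getOrCreate()
import spark.implicits._

val df = Seq((5.1, 3.5)).toDF("Sepal.Length", "Sepal.Width")

// df.select("Sepal.Length") fails to resolve: "Sepal" is parsed as a struct.
// Escaping with backticks references the column literally:
df.select($"`Sepal.Length`").show()
```

Switching the docs to a dataset like `cars`, whose columns contain no dots, sidesteps the escaping and the renaming warnings entirely.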
[jira] [Commented] (SPARK-19604) Log the start of every Python test
[ https://issues.apache.org/jira/browse/SPARK-19604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869023#comment-15869023 ] Yin Huai commented on SPARK-19604: -- It has been resolved by https://github.com/apache/spark/pull/16935. > Log the start of every Python test > -- > > Key: SPARK-19604 > URL: https://issues.apache.org/jira/browse/SPARK-19604 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.0 >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.3, 2.1.1 > > > Right now, we only have info level log after we finish the tests of a Python > test file. We should also log the start of a test. So, if a test is hanging, > we can tell which test file is running.
[jira] [Resolved] (SPARK-19604) Log the start of every Python test
[ https://issues.apache.org/jira/browse/SPARK-19604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-19604. -- Resolution: Fixed Fix Version/s: 2.1.1 2.0.3 > Log the start of every Python test > -- > > Key: SPARK-19604 > URL: https://issues.apache.org/jira/browse/SPARK-19604 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.0 >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.3, 2.1.1 > > > Right now, we only have info level log after we finish the tests of a Python > test file. We should also log the start of a test. So, if a test is hanging, > we can tell which test file is running.
[jira] [Commented] (SPARK-19326) Speculated task attempts do not get launched in few scenarios
[ https://issues.apache.org/jira/browse/SPARK-19326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869020#comment-15869020 ] Andrew Or commented on SPARK-19326: --- Sorry for slipping on this. When I was implementing the feature the goal was to get it working for normal cases first, so I wouldn't be surprised if it doesn't work with speculation. I don't think there's a fundamental reason why it can't be supported. Someone just needs to implement it. > Speculated task attempts do not get launched in few scenarios > - > > Key: SPARK-19326 > URL: https://issues.apache.org/jira/browse/SPARK-19326 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.2, 2.1.0 >Reporter: Tejas Patil > > Speculated copies of tasks do not get launched in some cases. > Examples: > - All the running executors have no CPU slots left to accommodate a > speculated copy of the task(s). If the all running executors reside over a > set of slow / bad hosts, they will keep the job running for long time > - `spark.task.cpus` > 1 and the running executor has not filled up all its > CPU slots. Since the [speculated copies of tasks should run on different > host|https://github.com/apache/spark/blob/2e139eed3194c7b8814ff6cf007d4e8a874c1e4d/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L283] > and not the host where the first copy was launched. > In both these cases, `ExecutorAllocationManager` does not know about pending > speculation task attempts and thinks that all the resource demands are well > taken care of. ([relevant > code|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L265]) > This adds variation in the job completion times and more importantly SLA > misses :( In prod, with a large number of jobs, I see this happening more > often than one would think. Chasing the bad hosts or reason for slowness > doesn't scale. 
> Here is a tiny repro. Note that you need to launch this with (Mesos or YARN > or standalone deploy mode) along with `--conf spark.speculation=true --conf > spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=100` > {code} > val n = 100 > val someRDD = sc.parallelize(1 to n, n) > someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => { > if (index == 1) { > Thread.sleep(Long.MaxValue) // fake long running task(s) > } > it.toList.map(x => index + ", " + x).iterator > }).collect > {code}
[jira] [Comment Edited] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
[ https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867124#comment-15867124 ] xukun edited comment on SPARK-18113 at 2/16/17 1:58 AM: [~aash] According to my scenario and the [https://github.com/palantir/spark/pull/94] code: for task 678.0, outputCommitCoordinator.canCommit will match CommitState(NO_AUTHORIZED_COMMITTER, _, Uncommitted) => CommitState(attemptNumber, System.nanoTime(), MidCommit), and outputCommitCoordinator.commitDone matches CommitState(existingCommitter, startTime, MidCommit) if attemptNumber == existingCommitter => CommitState(attemptNumber, startTime, Committed). Task 678.1's outputCommitCoordinator.canCommit then matches CommitState(existingCommitter, _, Committed). If the executor is preempted after outputCommitCoordinator.commitDone, the driver still enters a task dead loop was (Author: xukun): [~aash] According my scenario and [https://github.com/palantir/spark/pull/94] code task 678.0 outputCommitCoordinator.canCommit will match CommitState(NO_AUTHORIZED_COMMITTER, _, Uncommitted) => CommitState(attemptNumber, System.nanoTime(), MidCommit) outputCommitCoordinator.commitDone match CommitState(existingCommitter, startTime, MidCommit) if attemptNumber == existingCommitter => CommitState(attemptNumber, startTime, Committed) task 678.1 outputCommitCoordinator.canCommit match CommitState(existingCommitter, _, Committed) then driver enter into task deadloop > Sending AskPermissionToCommitOutput failed, driver enter into task deadloop > --- > > Key: SPARK-18113 > URL: https://issues.apache.org/jira/browse/SPARK-18113 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.1 > Environment: # cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.2 (Maipo) >Reporter: xuqing >Assignee: jin xing > Fix For: 2.2.0 > > > Executor sends *AskPermissionToCommitOutput* to driver failed, and retry > another sending. 
Driver receives 2 AskPermissionToCommitOutput messages and > handles them. But executor ignores the first response(true) and receives the > second response(false). The TaskAttemptNumber for this partition in > authorizedCommittersByStage is locked forever. Driver enters into infinite > loop. > h4. Driver Log: > {noformat} > 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID > 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 0 > ... > 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 0 > ... > 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID > 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 1 > 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 1 > ... > 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 > (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > {noformat} > h4. Executor Log: > {noformat} > ... > 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110) > ... > 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = > AskPermissionToCommitOutput(2,24,0)] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. 
This timeout is controlled by spark.rpc.askTimeout > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) > at > org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95) > at > org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73) > at >
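The commit-phase transitions described in the comment above can be modeled as a tiny state machine. This is a hypothetical sketch mirroring the names in the discussion (with the `startTime` field dropped for brevity), not actual Spark code; it shows why a retried attempt is denied forever once a partition reaches Committed:

```scala
// Hypothetical model of the commit phases discussed above. Once a partition
// is Committed by one attempt, every other attempt is denied; if the
// committing executor is then preempted and its output lost, the retried
// task can never be authorized and the driver keeps rescheduling it.
sealed trait Phase
case object Uncommitted extends Phase
case object MidCommit extends Phase
case object Committed extends Phase

val NoAuthorizedCommitter = -1
final case class CommitState(committer: Int, phase: Phase)

def canCommit(state: CommitState, attempt: Int): (CommitState, Boolean) = state match {
  case CommitState(NoAuthorizedCommitter, Uncommitted) =>
    (CommitState(attempt, MidCommit), true)   // first asker becomes the committer
  case CommitState(owner, MidCommit) =>
    (state, owner == attempt)                 // only the committer may proceed
  case _ =>
    (state, false)                            // Committed: all later attempts denied
}
```

Under this model, `canCommit(CommitState(NoAuthorizedCommitter, Uncommitted), 0)` authorizes attempt 0, while any attempt asked after the state reaches `Committed` is refused, matching the dead-loop scenario.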
[jira] [Assigned] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL
[ https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19618: Assignee: (was: Apache Spark) > Inconsistency wrt max. buckets allowed from Dataframe API vs SQL > > > Key: SPARK-19618 > URL: https://issues.apache.org/jira/browse/SPARK-19618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Tejas Patil > > High number of buckets is allowed while creating a table via SQL query: > {code} > sparkSession.sql(""" > CREATE TABLE bucketed_table(col1 INT) USING parquet > CLUSTERED BY (col1) SORTED BY (col1) INTO 147483647 BUCKETS > """) > sparkSession.sql("DESC FORMATTED bucketed_table").collect.foreach(println) > > [Num Buckets:,147483647,] > [Bucket Columns:,[col1],] > [Sort Columns:,[col1],] > > {code} > Trying the same via dataframe API does not work: > {code} > > df.write.format("orc").bucketBy(147483647, > > "j","k").sortBy("j","k").saveAsTable("bucketed_table") > java.lang.IllegalArgumentException: requirement failed: Bucket number must be > greater than 0 and less than 10. > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:293) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:291) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.DataFrameWriter.getBucketSpec(DataFrameWriter.scala:291) > at > org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:429) > at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:410) > at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:365) > ... 50 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL
[ https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868963#comment-15868963 ] Apache Spark commented on SPARK-19618: -- User 'tejasapatil' has created a pull request for this issue: https://github.com/apache/spark/pull/16948 > Inconsistency wrt max. buckets allowed from Dataframe API vs SQL > > > Key: SPARK-19618 > URL: https://issues.apache.org/jira/browse/SPARK-19618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Tejas Patil > > High number of buckets is allowed while creating a table via SQL query: > {code} > sparkSession.sql(""" > CREATE TABLE bucketed_table(col1 INT) USING parquet > CLUSTERED BY (col1) SORTED BY (col1) INTO 147483647 BUCKETS > """) > sparkSession.sql("DESC FORMATTED bucketed_table").collect.foreach(println) > > [Num Buckets:,147483647,] > [Bucket Columns:,[col1],] > [Sort Columns:,[col1],] > > {code} > Trying the same via dataframe API does not work: > {code} > > df.write.format("orc").bucketBy(147483647, > > "j","k").sortBy("j","k").saveAsTable("bucketed_table") > java.lang.IllegalArgumentException: requirement failed: Bucket number must be > greater than 0 and less than 10. > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:293) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:291) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.DataFrameWriter.getBucketSpec(DataFrameWriter.scala:291) > at > org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:429) > at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:410) > at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:365) > ... 
50 elided > {code}
[jira] [Assigned] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL
[ https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19618: Assignee: Apache Spark > Inconsistency wrt max. buckets allowed from Dataframe API vs SQL > > > Key: SPARK-19618 > URL: https://issues.apache.org/jira/browse/SPARK-19618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Tejas Patil >Assignee: Apache Spark > > High number of buckets is allowed while creating a table via SQL query: > {code} > sparkSession.sql(""" > CREATE TABLE bucketed_table(col1 INT) USING parquet > CLUSTERED BY (col1) SORTED BY (col1) INTO 147483647 BUCKETS > """) > sparkSession.sql("DESC FORMATTED bucketed_table").collect.foreach(println) > > [Num Buckets:,147483647,] > [Bucket Columns:,[col1],] > [Sort Columns:,[col1],] > > {code} > Trying the same via dataframe API does not work: > {code} > > df.write.format("orc").bucketBy(147483647, > > "j","k").sortBy("j","k").saveAsTable("bucketed_table") > java.lang.IllegalArgumentException: requirement failed: Bucket number must be > greater than 0 and less than 10. > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:293) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:291) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.DataFrameWriter.getBucketSpec(DataFrameWriter.scala:291) > at > org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:429) > at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:410) > at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:365) > ... 50 elided > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL
Tejas Patil created SPARK-19618: --- Summary: Inconsistency wrt max. buckets allowed from Dataframe API vs SQL Key: SPARK-19618 URL: https://issues.apache.org/jira/browse/SPARK-19618 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Tejas Patil High number of buckets is allowed while creating a table via SQL query: {code} sparkSession.sql(""" CREATE TABLE bucketed_table(col1 INT) USING parquet CLUSTERED BY (col1) SORTED BY (col1) INTO 147483647 BUCKETS """) sparkSession.sql("DESC FORMATTED bucketed_table").collect.foreach(println) [Num Buckets:,147483647,] [Bucket Columns:,[col1],] [Sort Columns:,[col1],] {code} Trying the same via dataframe API does not work: {code} > df.write.format("orc").bucketBy(147483647, > "j","k").sortBy("j","k").saveAsTable("bucketed_table") java.lang.IllegalArgumentException: requirement failed: Bucket number must be greater than 0 and less than 10. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:293) at org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:291) at scala.Option.map(Option.scala:146) at org.apache.spark.sql.DataFrameWriter.getBucketSpec(DataFrameWriter.scala:291) at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:429) at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:410) at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:365) ... 50 elided {code}
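The inconsistency is that `DataFrameWriter` enforces a bucket-count range while the SQL DDL path does not. One way both paths could agree is to funnel through a single validation, sketched here as a hypothetical helper (not the actual SPARK-19618 patch; the limit constant is an assumption for illustration):

```scala
// Hypothetical shared check: if both DataFrameWriter.bucketBy and the SQL
// "CLUSTERED BY ... INTO n BUCKETS" path called this, the two APIs would
// accept exactly the same range instead of disagreeing as reported above.
object BucketSpecValidation {
  val MaxBuckets = 100000  // assumed upper bound, for illustration only

  def validate(numBuckets: Int): Unit =
    require(numBuckets > 0 && numBuckets < MaxBuckets,
      s"Bucket number must be greater than 0 and less than $MaxBuckets, got $numBuckets")
}
```

With such a helper, the `147483647 BUCKETS` DDL above would fail at analysis time with the same message the DataFrame API already produces.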
[jira] [Assigned] (SPARK-19617) Fix a case that a query may not stop due to HADOOP-14084
[ https://issues.apache.org/jira/browse/SPARK-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19617: Assignee: Shixiong Zhu (was: Apache Spark) > Fix a case that a query may not stop due to HADOOP-14084 > > > Key: SPARK-19617 > URL: https://issues.apache.org/jira/browse/SPARK-19617 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Saw the following exception in some test log: > {code} > 17/02/14 21:20:10.987 stream execution thread for this_query [id = > 09fd5d6d-bea3-4891-88c7-0d0f1909188d, runId = > a564cb52-bc3d-47f1-8baf-7e0e4fa79a5e] WARN Shell: Interrupted while joining > on: Thread[Thread-48,5,main] > java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Thread.join(Thread.java:1249) > at java.lang.Thread.join(Thread.java:1323) > at org.apache.hadoop.util.Shell.joinThread(Shell.java:626) > at org.apache.hadoop.util.Shell.runCommand(Shell.java:577) > at org.apache.hadoop.util.Shell.run(Shell.java:479) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:866) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:849) > at > org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733) > at > org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:491) > at > org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:532) > at > org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:509) > at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1066) > at > org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:176) > at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) > at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:730) > at 
org.apache.hadoop.fs.FileContext$4.next(FileContext.java:726) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:733) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.mkdirs(HDFSMetadataLog.scala:385) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.<init>(HDFSMetadataLog.scala:75) > at > org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.<init>(CompactibleFileStreamLog.scala:46) > at > org.apache.spark.sql.execution.streaming.FileStreamSourceLog.<init>(FileStreamSourceLog.scala:36) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.<init>(FileStreamSource.scala:59) > at > org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:246) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:145) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:141) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257) > at > org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan$lzycompute(StreamExecution.scala:141) > at > org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan(StreamExecution.scala:136) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:191) > {code} > This is the cause of some test timeout
failures on Jenkins.
[jira] [Commented] (SPARK-19617) Fix a case that a query may not stop due to HADOOP-14084
[ https://issues.apache.org/jira/browse/SPARK-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868913#comment-15868913 ] Apache Spark commented on SPARK-19617: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16947 > Fix a case that a query may not stop due to HADOOP-14084 > > > Key: SPARK-19617 > URL: https://issues.apache.org/jira/browse/SPARK-19617 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Saw the following exception in some test log: > {code} > 17/02/14 21:20:10.987 stream execution thread for this_query [id = > 09fd5d6d-bea3-4891-88c7-0d0f1909188d, runId = > a564cb52-bc3d-47f1-8baf-7e0e4fa79a5e] WARN Shell: Interrupted while joining > on: Thread[Thread-48,5,main] > java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Thread.join(Thread.java:1249) > at java.lang.Thread.join(Thread.java:1323) > at org.apache.hadoop.util.Shell.joinThread(Shell.java:626) > at org.apache.hadoop.util.Shell.runCommand(Shell.java:577) > at org.apache.hadoop.util.Shell.run(Shell.java:479) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:866) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:849) > at > org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733) > at > org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:491) > at > org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:532) > at > org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:509) > at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1066) > at > org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:176) > at 
org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) > at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:730) > at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:726) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:733) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.mkdirs(HDFSMetadataLog.scala:385) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.<init>(HDFSMetadataLog.scala:75) > at > org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.<init>(CompactibleFileStreamLog.scala:46) > at > org.apache.spark.sql.execution.streaming.FileStreamSourceLog.<init>(FileStreamSourceLog.scala:36) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.<init>(FileStreamSource.scala:59) > at > org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:246) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:145) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:141) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257) > at > org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan$lzycompute(StreamExecution.scala:141) > at > org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan(StreamExecution.scala:136) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252) > at >
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:191) > {code} > This is the cause of some test timeout failures on Jenkins.
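The HADOOP-14084 hazard above is a library method that catches `InterruptedException` internally, so a stop request delivered as a thread interrupt can be silently swallowed and the streaming query never observes it. The usual defensive pattern, sketched here as a hypothetical wrapper rather than Spark's actual workaround, is to re-assert the interrupt flag whenever the exception surfaces:

```scala
object InterruptSafe {
  // Hypothetical helper: run a block and ensure an InterruptedException
  // leaves the thread's interrupt status set for callers that poll it.
  def runPreservingInterrupt[A](body: => A): A =
    try body
    catch {
      case e: InterruptedException =>
        // Restore the flag so loops checking Thread.currentThread.isInterrupted
        // (like a query's run loop) still notice the stop request.
        Thread.currentThread().interrupt()
        throw e
    }
}
```

When the interrupted call sits inside third-party code that cannot be wrapped, as here, the remaining option is the kind of workaround the ticket keeps: checking an explicit stop flag in addition to the interrupt status.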
[jira] [Commented] (SPARK-19460) Update dataset used in R documentation, examples to reduce warning noise and confusions
[ https://issues.apache.org/jira/browse/SPARK-19460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868914#comment-15868914 ] Miao Wang commented on SPARK-19460: --- By the way, I remembered that you had discussion about fixing the underlying issue on some PR review. > Update dataset used in R documentation, examples to reduce warning noise and > confusions > --- > > Key: SPARK-19460 > URL: https://issues.apache.org/jira/browse/SPARK-19460 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > > Running build we have a bunch of warnings from using the `iris` dataset, for > example. > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > Warning in FUN(X[[4L]], ...) : > Use Petal_Width instead of Petal.Width as column name > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > Warning in FUN(X[[4L]], ...) : > Use Petal_Width instead of Petal.Width as column name > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > These are the results of having `.` in the column name. For reference, see > SPARK-12191, SPARK-11976. Since it involves changing SQL, if we couldn't > support that there then we should strongly consider using other dataset > without `.`, eg. 
`cars` > And we should update this in API doc (roxygen2 doc string), vignettes, > programming guide, R code example.
[jira] [Assigned] (SPARK-19617) Fix a case that a query may not stop due to HADOOP-14084
[ https://issues.apache.org/jira/browse/SPARK-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19617: Assignee: Apache Spark (was: Shixiong Zhu) > Fix a case that a query may not stop due to HADOOP-14084 > > > Key: SPARK-19617 > URL: https://issues.apache.org/jira/browse/SPARK-19617 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Saw the following exception in some test log: > {code} > 17/02/14 21:20:10.987 stream execution thread for this_query [id = > 09fd5d6d-bea3-4891-88c7-0d0f1909188d, runId = > a564cb52-bc3d-47f1-8baf-7e0e4fa79a5e] WARN Shell: Interrupted while joining > on: Thread[Thread-48,5,main] > java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Thread.join(Thread.java:1249) > at java.lang.Thread.join(Thread.java:1323) > at org.apache.hadoop.util.Shell.joinThread(Shell.java:626) > at org.apache.hadoop.util.Shell.runCommand(Shell.java:577) > at org.apache.hadoop.util.Shell.run(Shell.java:479) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:866) > at org.apache.hadoop.util.Shell.execCommand(Shell.java:849) > at > org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733) > at > org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:491) > at > org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:532) > at > org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:509) > at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1066) > at > org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:176) > at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) > at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:730) > at 
org.apache.hadoop.fs.FileContext$4.next(FileContext.java:726) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:733) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.mkdirs(HDFSMetadataLog.scala:385) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLog.<init>(HDFSMetadataLog.scala:75) > at > org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.<init>(CompactibleFileStreamLog.scala:46) > at > org.apache.spark.sql.execution.streaming.FileStreamSourceLog.<init>(FileStreamSourceLog.scala:36) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.<init>(FileStreamSource.scala:59) > at > org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:246) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:145) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:141) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257) > at > org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan$lzycompute(StreamExecution.scala:141) > at > org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan(StreamExecution.scala:136) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:191) > {code} > This is the cause of some test timeout
failures on Jenkins.
[jira] [Commented] (SPARK-19460) Update dataset used in R documentation, examples to reduce warning noise and confusions
[ https://issues.apache.org/jira/browse/SPARK-19460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868911#comment-15868911 ] Miao Wang commented on SPARK-19460: --- Seems like a lot of work. :) I can give a try. > Update dataset used in R documentation, examples to reduce warning noise and > confusions > --- > > Key: SPARK-19460 > URL: https://issues.apache.org/jira/browse/SPARK-19460 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > > Running build we have a bunch of warnings from using the `iris` dataset, for > example. > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > Warning in FUN(X[[4L]], ...) : > Use Petal_Width instead of Petal.Width as column name > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > Warning in FUN(X[[4L]], ...) : > Use Petal_Width instead of Petal.Width as column name > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > These are the results of having `.` in the column name. For reference, see > SPARK-12191, SPARK-11976. Since it involves changing SQL, if we couldn't > support that there then we should strongly consider using other dataset > without `.`, eg. `cars` > And we should update this in API doc (roxygen2 doc string), vignettes, > programming guide, R code example.
[jira] [Created] (SPARK-19617) Fix a case that a query may not stop due to HADOOP-14084
Shixiong Zhu created SPARK-19617: Summary: Fix a case that a query may not stop due to HADOOP-14084 Key: SPARK-19617 URL: https://issues.apache.org/jira/browse/SPARK-19617 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0, 2.0.2 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Saw the following exception in some test log: {code} 17/02/14 21:20:10.987 stream execution thread for this_query [id = 09fd5d6d-bea3-4891-88c7-0d0f1909188d, runId = a564cb52-bc3d-47f1-8baf-7e0e4fa79a5e] WARN Shell: Interrupted while joining on: Thread[Thread-48,5,main] java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1249) at java.lang.Thread.join(Thread.java:1323) at org.apache.hadoop.util.Shell.joinThread(Shell.java:626) at org.apache.hadoop.util.Shell.runCommand(Shell.java:577) at org.apache.hadoop.util.Shell.run(Shell.java:479) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773) at org.apache.hadoop.util.Shell.execCommand(Shell.java:866) at org.apache.hadoop.util.Shell.execCommand(Shell.java:849) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733) at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:491) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:532) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:509) at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1066) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:176) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:197) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:730) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:726) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:733) at 
org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.mkdirs(HDFSMetadataLog.scala:385) at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.<init>(HDFSMetadataLog.scala:75) at org.apache.spark.sql.execution.streaming.CompactibleFileStreamLog.<init>(CompactibleFileStreamLog.scala:46) at org.apache.spark.sql.execution.streaming.FileStreamSourceLog.<init>(FileStreamSourceLog.scala:36) at org.apache.spark.sql.execution.streaming.FileStreamSource.<init>(FileStreamSource.scala:59) at org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:246) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:145) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:141) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257) at org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan$lzycompute(StreamExecution.scala:141) at org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan(StreamExecution.scala:136) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:191) {code} This is the cause of some test timeout failures on Jenkins.
[jira] [Commented] (SPARK-19326) Speculated task attempts do not get launched in few scenarios
[ https://issues.apache.org/jira/browse/SPARK-19326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868889#comment-15868889 ] Tejas Patil commented on SPARK-19326: - [~andrewor14] : Ping !! > Speculated task attempts do not get launched in few scenarios > - > > Key: SPARK-19326 > URL: https://issues.apache.org/jira/browse/SPARK-19326 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.2, 2.1.0 >Reporter: Tejas Patil > > Speculated copies of tasks do not get launched in some cases. > Examples: > - All the running executors have no CPU slots left to accommodate a > speculated copy of the task(s). If the all running executors reside over a > set of slow / bad hosts, they will keep the job running for long time > - `spark.task.cpus` > 1 and the running executor has not filled up all its > CPU slots. Since the [speculated copies of tasks should run on different > host|https://github.com/apache/spark/blob/2e139eed3194c7b8814ff6cf007d4e8a874c1e4d/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L283] > and not the host where the first copy was launched. > In both these cases, `ExecutorAllocationManager` does not know about pending > speculation task attempts and thinks that all the resource demands are well > taken care of. ([relevant > code|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L265]) > This adds variation in the job completion times and more importantly SLA > misses :( In prod, with a large number of jobs, I see this happening more > often than one would think. Chasing the bad hosts or reason for slowness > doesn't scale. > Here is a tiny repro. 
Note that you need to launch this with (Mesos or YARN > or standalone deploy mode) along with `--conf spark.speculation=true --conf > spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=100` > {code} > val n = 100 > val someRDD = sc.parallelize(1 to n, n) > someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => { > if (index == 1) { > Thread.sleep(Long.MaxValue) // fake long running task(s) > } > it.toList.map(x => index + ", " + x).iterator > }).collect > {code}
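The accounting gap the report describes can be modeled in a few lines. In this hypothetical, simplified model (field and method names are illustrative, not Spark's actual internals), an allocation manager that ignores pending speculative attempts computes too low an executor target whenever the only outstanding work is speculative copies:

```scala
// Simplified model of executor-demand accounting; names are illustrative.
case class StageDemand(pendingTasks: Int, pendingSpeculative: Int, runningTasks: Int)

object AllocationModel {
  val tasksPerExecutor = 4

  // Behavior the report describes: speculative attempts are invisible,
  // so demand looks fully satisfied by the executors already running.
  def targetIgnoringSpeculation(d: StageDemand): Int =
    math.ceil((d.pendingTasks + d.runningTasks).toDouble / tasksPerExecutor).toInt

  // The fix the ticket implies: count pending speculative copies as demand too,
  // so extra capacity (on a different host) can be requested.
  def targetWithSpeculation(d: StageDemand): Int =
    math.ceil((d.pendingTasks + d.pendingSpeculative + d.runningTasks).toDouble /
      tasksPerExecutor).toInt
}
```

With four running tasks filling one executor and two speculative copies pending, the first formula asks for one executor, so no free slot ever appears for the speculated attempts; the second asks for two.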
[jira] [Comment Edited] (SPARK-18080) Locality Sensitive Hashing (LSH) Python API
[ https://issues.apache.org/jira/browse/SPARK-18080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868887#comment-15868887 ] Yanbo Liang edited comment on SPARK-18080 at 2/16/17 12:32 AM: --- [~josephkb] I'm sorry that I did not notice that you are shepherding this task, and I have committed the PR. I will take a look in advance next time. Thanks. was (Author: yanboliang): [~josephkb] I'm sorry that I did not notice that you are shepherding this task, and I have committed it. I will take a look in advance next time. Thanks. > Locality Sensitive Hashing (LSH) Python API > --- > > Key: SPARK-18080 > URL: https://issues.apache.org/jira/browse/SPARK-18080 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Yun Ni > Fix For: 2.2.0 > >
[jira] [Resolved] (SPARK-18080) Locality Sensitive Hashing (LSH) Python API
[ https://issues.apache.org/jira/browse/SPARK-18080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-18080. - Resolution: Fixed Fix Version/s: 2.2.0 > Locality Sensitive Hashing (LSH) Python API > --- > > Key: SPARK-18080 > URL: https://issues.apache.org/jira/browse/SPARK-18080 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Yun Ni > Fix For: 2.2.0 > >
[jira] [Commented] (SPARK-18080) Locality Sensitive Hashing (LSH) Python API
[ https://issues.apache.org/jira/browse/SPARK-18080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868887#comment-15868887 ] Yanbo Liang commented on SPARK-18080: - [~josephkb] I'm sorry that I did not notice that you are shepherding this task, and I have committed it. I will take a look in advance next time. Thanks. > Locality Sensitive Hashing (LSH) Python API > --- > > Key: SPARK-18080 > URL: https://issues.apache.org/jira/browse/SPARK-18080 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Yun Ni > >
[jira] [Resolved] (SPARK-19599) Clean up HDFSMetadataLog
[ https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19599. -- Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 2.2.0 2.1.1 > Clean up HDFSMetadataLog > > > Key: SPARK-19599 > URL: https://issues.apache.org/jira/browse/SPARK-19599 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.1.1, 2.2.0 > > > SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some > cleanup for HDFSMetadataLog > Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us > from removing the workaround code. Anyway, I still did some cleanup and also > updated the comments to point to HADOOP-14084.
[jira] [Assigned] (SPARK-19554) YARN backend should use history server URL for tracking when UI is disabled
[ https://issues.apache.org/jira/browse/SPARK-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19554: Assignee: (was: Apache Spark) > YARN backend should use history server URL for tracking when UI is disabled > --- > > Key: SPARK-19554 > URL: https://issues.apache.org/jira/browse/SPARK-19554 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Priority: Minor > > Currently, if the app has disabled its UI, Spark does not set a tracking URL > in YARN. The UI is still available, even if with a lag, in the history > server, if it's configured. We should use that as the tracking URL in these > cases, instead of letting YARN show its default page for applications without > a UI.
[jira] [Assigned] (SPARK-19554) YARN backend should use history server URL for tracking when UI is disabled
[ https://issues.apache.org/jira/browse/SPARK-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19554: Assignee: Apache Spark > YARN backend should use history server URL for tracking when UI is disabled > --- > > Key: SPARK-19554 > URL: https://issues.apache.org/jira/browse/SPARK-19554 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > Currently, if the app has disabled its UI, Spark does not set a tracking URL > in YARN. The UI is still available, even if with a lag, in the history > server, if it's configured. We should use that as the tracking URL in these > cases, instead of letting YARN show its default page for applications without > a UI.
[jira] [Commented] (SPARK-19554) YARN backend should use history server URL for tracking when UI is disabled
[ https://issues.apache.org/jira/browse/SPARK-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868843#comment-15868843 ] Apache Spark commented on SPARK-19554: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/16946 > YARN backend should use history server URL for tracking when UI is disabled > --- > > Key: SPARK-19554 > URL: https://issues.apache.org/jira/browse/SPARK-19554 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Priority: Minor > > Currently, if the app has disabled its UI, Spark does not set a tracking URL > in YARN. The UI is still available, even if with a lag, in the history > server, if it's configured. We should use that as the tracking URL in these > cases, instead of letting YARN show its default page for applications without > a UI.
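The proposed behavior reduces to a small fallback rule: use the app's own UI URL when the UI is enabled, otherwise fall back to a history-server link if one is configured, otherwise leave YARN's default page. A hedged sketch of that decision; the helper name and the `/history/<appId>` path shape are assumptions for illustration, not taken from the actual patch:

```scala
object TrackingUrl {
  // Hypothetical decision helper for the YARN tracking URL.
  // appUiUrl is None when the app has disabled its UI.
  def choose(appUiUrl: Option[String],
             historyServerUrl: Option[String],
             appId: String): Option[String] =
    appUiUrl.orElse(historyServerUrl.map(base => s"$base/history/$appId"))
}
```

Returning `None` when neither URL is available preserves the current behavior of letting YARN show its default application page.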
[jira] [Assigned] (SPARK-19616) weightCol and aggregationDepth should be improved for some SparkR APIs
[ https://issues.apache.org/jira/browse/SPARK-19616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19616: Assignee: Apache Spark > weightCol and aggregationDepth should be improved for some SparkR APIs > --- > > Key: SPARK-19616 > URL: https://issues.apache.org/jira/browse/SPARK-19616 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.0, 2.2.0 >Reporter: Miao Wang >Assignee: Apache Spark >Priority: Minor > > When doing SPARK-19456, we found that "" should be considered a NULL column > name and should not be set. aggregationDepth should be exposed as an expert > parameter.
[jira] [Commented] (SPARK-19616) weightCol and aggregationDepth should be improved for some SparkR APIs
[ https://issues.apache.org/jira/browse/SPARK-19616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868782#comment-15868782 ] Apache Spark commented on SPARK-19616: -- User 'wangmiao1981' has created a pull request for this issue: https://github.com/apache/spark/pull/16945 > weightCol and aggregationDepth should be improved for some SparkR APIs > --- > > Key: SPARK-19616 > URL: https://issues.apache.org/jira/browse/SPARK-19616 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.0, 2.2.0 >Reporter: Miao Wang >Priority: Minor > > When doing SPARK-19456, we found that "" should be considered a NULL column > name and should not be set. aggregationDepth should be exposed as an expert > parameter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19616) weightCol and aggregationDepth should be improved for some SparkR APIs
[ https://issues.apache.org/jira/browse/SPARK-19616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19616: Assignee: (was: Apache Spark) > weightCol and aggregationDepth should be improved for some SparkR APIs > --- > > Key: SPARK-19616 > URL: https://issues.apache.org/jira/browse/SPARK-19616 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.0, 2.2.0 >Reporter: Miao Wang >Priority: Minor > > When doing SPARK-19456, we found that "" should be considered a NULL column > name and should not be set. aggregationDepth should be exposed as an expert > parameter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19616) weightCol and aggregationDepth should be improved for some SparkR APIs
Miao Wang created SPARK-19616: - Summary: weightCol and aggregationDepth should be improved for some SparkR APIs Key: SPARK-19616 URL: https://issues.apache.org/jira/browse/SPARK-19616 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 2.1.0, 2.2.0 Reporter: Miao Wang Priority: Minor When doing SPARK-19456, we found that "" should be considered a NULL column name and should not be set. aggregationDepth should be exposed as an expert parameter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)
[ https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunitha Kambhampati updated SPARK-19602: Attachment: (was: Design_ColResolution_JIRA19602.docx) > Unable to query using the fully qualified column name of the form ( > ..) > -- > > Key: SPARK-19602 > URL: https://issues.apache.org/jira/browse/SPARK-19602 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Sunitha Kambhampati > Attachments: Design_ColResolution_JIRA19602.docx > > > 1) Spark SQL fails to analyze this query: select db1.t1.i1 from db1.t1, > db2.t1 > Most of the other database systems support this ( e.g DB2, Oracle, MySQL). > Note: In DB2, Oracle, the notion is of .. > 2) Another scenario where this fully qualified name is useful is as follows: > // current database is db1. > select t1.i1 from t1, db2.t1 > If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an > error during column resolution in the analyzer, as it is ambiguous. > Lets say the user intended to retrieve i1 from db1.t1 but in the example > db2.t1 only has i1 column. The query would still succeed instead of throwing > an error. > One way to avoid confusion would be to explicitly specify using the fully > qualified name db1.t1.i1 > For e.g: select db1.t1.i1 from t1, db2.t1 > Workarounds: > There is a workaround for these issues, which is to use an alias. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)
[ https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunitha Kambhampati updated SPARK-19602: Attachment: Design_ColResolution_JIRA19602.docx > Unable to query using the fully qualified column name of the form ( > ..) > -- > > Key: SPARK-19602 > URL: https://issues.apache.org/jira/browse/SPARK-19602 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Sunitha Kambhampati > Attachments: Design_ColResolution_JIRA19602.docx, > Design_ColResolution_JIRA19602.docx > > > 1) Spark SQL fails to analyze this query: select db1.t1.i1 from db1.t1, > db2.t1 > Most of the other database systems support this ( e.g DB2, Oracle, MySQL). > Note: In DB2, Oracle, the notion is of .. > 2) Another scenario where this fully qualified name is useful is as follows: > // current database is db1. > select t1.i1 from t1, db2.t1 > If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an > error during column resolution in the analyzer, as it is ambiguous. > Lets say the user intended to retrieve i1 from db1.t1 but in the example > db2.t1 only has i1 column. The query would still succeed instead of throwing > an error. > One way to avoid confusion would be to explicitly specify using the fully > qualified name db1.t1.i1 > For e.g: select db1.t1.i1 from t1, db2.t1 > Workarounds: > There is a workaround for these issues, which is to use an alias. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19615) Provide Dataset union convenience for divergent schema
Nick Dimiduk created SPARK-19615: Summary: Provide Dataset union convenience for divergent schema Key: SPARK-19615 URL: https://issues.apache.org/jira/browse/SPARK-19615 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Nick Dimiduk Priority: Minor Creating a union DataFrame over two sources that have different schema definitions is surprisingly complex. Provide a version of the union method that will infer a target schema as the result of merging the sources. Automatically extend either side with {{null}} columns for any missing columns that are nullable. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
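[Editor's note] The merge the ticket asks for can be sketched with the existing public API. This is an illustrative helper, not the convenience method being proposed: the name {{unionByMergedSchema}} is hypothetical, and it assumes every missing column is nullable (the condition stated in the ticket) so that it can be filled with a typed null via {{lit(null).cast(...)}}.

```scala
// Sketch only: union two DataFrames with divergent schemas by extending each
// side with typed null columns for whatever the other side has. Assumes all
// missing columns are nullable; helper name is hypothetical.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

def unionByMergedSchema(left: DataFrame, right: DataFrame): DataFrame = {
  // Target column order: left's columns first, then right's extras.
  val merged = (left.columns ++ right.columns).distinct

  // Project one side onto the merged schema, taking missing column types
  // from the other side and filling them with typed nulls.
  def align(df: DataFrame, other: DataFrame): DataFrame = {
    val present = df.columns.toSet
    val cols = merged.map { name =>
      if (present(name)) col(name)
      else lit(null).cast(other.schema(name).dataType).as(name)
    }
    df.select(cols: _*)
  }

  // union resolves by position, so both sides must use the same merged order.
  align(left, right).union(align(right, left))
}
```

Because {{union}} resolves columns by position rather than name, both sides are projected into the same {{merged}} order before unioning.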
[jira] [Created] (SPARK-19614) add type-preserving null function
Nick Dimiduk created SPARK-19614: Summary: add type-preserving null function Key: SPARK-19614 URL: https://issues.apache.org/jira/browse/SPARK-19614 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Nick Dimiduk Priority: Trivial There's currently no easy way to extend the columns of a DataFrame with null columns that also preserves the type. {{lit(null)}} evaluates to {{Literal(null, NullType)}}, despite any subsequent hinting, for instance with {{Column.as(String, Metadata)}}. This comes up when programmatically munging data from disparate sources. A function such as {{null(DataType)}} would be nice. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
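[Editor's note] Pending a built-in {{null(DataType)}} function, the commonly suggested workaround is to cast the untyped null literal, which yields a column of the requested type instead of {{NullType}}. The helper name below is hypothetical; only {{lit}} and {{Column.cast}} are existing API:

```scala
// Sketch of the shape the ticket asks for, built on the cast workaround.
// lit(null) alone is Literal(null, NullType); casting gives it a real type.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.DataType

def typedNull(dt: DataType): Column = lit(null).cast(dt)

// usage (hypothetical column name): df.withColumn("i1", typedNull(IntegerType))
```

Note this addresses the type only; metadata hinting via {{Column.as(String, Metadata)}}, as described above, may still need to be applied separately.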
[jira] [Commented] (SPARK-19497) dropDuplicates with watermark
[ https://issues.apache.org/jira/browse/SPARK-19497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868736#comment-15868736 ] sam elamin commented on SPARK-19497: I would love to be able to help on this [~zsxwing], please do get in touch if there is anything I can do > dropDuplicates with watermark > - > > Key: SPARK-19497 > URL: https://issues.apache.org/jira/browse/SPARK-19497 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Michael Armbrust >Assignee: Shixiong Zhu >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19610) multi line support for CSV
[ https://issues.apache.org/jira/browse/SPARK-19610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868699#comment-15868699 ] Hyukjin Kwon commented on SPARK-19610: -- Sure, let me try. Thanks for cc'ing me. > multi line support for CSV > -- > > Key: SPARK-19610 > URL: https://issues.apache.org/jira/browse/SPARK-19610 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19610) multi line support for CSV
[ https://issues.apache.org/jira/browse/SPARK-19610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868700#comment-15868700 ] Hyukjin Kwon commented on SPARK-19610: -- Sure, let me try. Thanks for cc'ing me. > multi line support for CSV > -- > > Key: SPARK-19610 > URL: https://issues.apache.org/jira/browse/SPARK-19610 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-19610) multi line support for CSV
[ https://issues.apache.org/jira/browse/SPARK-19610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-19610: - Comment: was deleted (was: Sure, let me try. Thanks for cc'ing me.) > multi line support for CSV > -- > > Key: SPARK-19610 > URL: https://issues.apache.org/jira/browse/SPARK-19610 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
[ https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868677#comment-15868677 ] Apache Spark commented on SPARK-19611: -- User 'budde' has created a pull request for this issue: https://github.com/apache/spark/pull/16942 > Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files > --- > > Key: SPARK-19611 > URL: https://issues.apache.org/jira/browse/SPARK-19611 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Adam Budde > > This issue replaces > [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR > #16797|https://github.com/apache/spark/pull/16797] > [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the > schema inference from the HiveMetastoreCatalog class when converting a > MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in > favor of simply using the schema returned by the metastore. This results in > an optimization as the underlying file statuses no longer need to be resolved > until after the partition pruning step, reducing the number of files to be > touched significantly in some cases. The downside is that the data schema > used may no longer match the underlying file schema for case-sensitive > formats such as Parquet. > Unfortunately, this silently breaks queries over tables where the underlying > data fields are case-sensitive but a case-sensitive schema wasn't written to > the table properties by Spark. This situation will occur for any Hive table > that wasn't created by Spark or that was created prior to Spark 2.1.0. If a > user attempts to run a query over such a table containing a case-sensitive > field name in the query projection or in the query filter, the query will > return 0 results in every case. 
> The change we are proposing is to bring back the schema inference that was > used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the > table properties. > - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive > schema can be read from the table properties. Attempt to save the inferred > schema in the table properties to avoid future inference. > - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but > don't attempt to save it. > - NEVER_INFER: Fall back to using the case-insensitive schema returned by the > Hive Metastore. Useful if the user knows that none of the underlying data is > case-sensitive. > See the discussion on [PR #16797|https://github.com/apache/spark/pull/16797] > for more discussion around this issue and the proposed solution. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
[ https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868666#comment-15868666 ] Apache Spark commented on SPARK-19611: -- User 'budde' has created a pull request for this issue: https://github.com/apache/spark/pull/16944 > Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files > --- > > Key: SPARK-19611 > URL: https://issues.apache.org/jira/browse/SPARK-19611 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Adam Budde > > This issue replaces > [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR > #16797|https://github.com/apache/spark/pull/16797] > [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the > schema inference from the HiveMetastoreCatalog class when converting a > MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in > favor of simply using the schema returned by the metastore. This results in > an optimization as the underlying file statuses no longer need to be resolved > until after the partition pruning step, reducing the number of files to be > touched significantly in some cases. The downside is that the data schema > used may no longer match the underlying file schema for case-sensitive > formats such as Parquet. > Unfortunately, this silently breaks queries over tables where the underlying > data fields are case-sensitive but a case-sensitive schema wasn't written to > the table properties by Spark. This situation will occur for any Hive table > that wasn't created by Spark or that was created prior to Spark 2.1.0. If a > user attempts to run a query over such a table containing a case-sensitive > field name in the query projection or in the query filter, the query will > return 0 results in every case. 
> The change we are proposing is to bring back the schema inference that was > used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the > table properties. > - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive > schema can be read from the table properties. Attempt to save the inferred > schema in the table properties to avoid future inference. > - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but > don't attempt to save it. > - NEVER_INFER: Fall back to using the case-insensitive schema returned by the > Hive Metastore. Useful if the user knows that none of the underlying data is > case-sensitive. > See the discussion on [PR #16797|https://github.com/apache/spark/pull/16797] > for more discussion around this issue and the proposed solution. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
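[Editor's note] If the three modes above land as proposed, they would presumably be selected through a SQL configuration key. The key name below is an assumption based on the eventual fix for this ticket and is not stated in this thread:

```
# spark-defaults.conf — key name assumed; values are the modes proposed above
spark.sql.hive.caseSensitiveInferenceMode  INFER_AND_SAVE   # or INFER_ONLY / NEVER_INFER
```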
[jira] [Created] (SPARK-19613) Flaky test: StateStoreRDDSuite.versioning and immutability
Kay Ousterhout created SPARK-19613: -- Summary: Flaky test: StateStoreRDDSuite.versioning and immutability Key: SPARK-19613 URL: https://issues.apache.org/jira/browse/SPARK-19613 Project: Spark Issue Type: Bug Components: Structured Streaming, Tests Affects Versions: 2.1.1 Reporter: Kay Ousterhout Priority: Minor This test: org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite.versioning and immutability failed on a recent PR: https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72948/testReport/junit/org.apache.spark.sql.execution.streaming.state/StateStoreRDDSuite/versioning_and_immutability/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18937) Timezone support in CSV/JSON parsing
[ https://issues.apache.org/jira/browse/SPARK-18937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-18937: --- Assignee: Takuya Ueshin > Timezone support in CSV/JSON parsing > > > Key: SPARK-18937 > URL: https://issues.apache.org/jira/browse/SPARK-18937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Takuya Ueshin > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19599) Clean up HDFSMetadataLog
[ https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19599: - Description: SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some cleanup for HDFSMetadataLog Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us from removing the workaround codes. Anyway, I sill did some clean up to make HDFSMetadataLog simply. was: SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some cleanup for HDFSMetadataLog Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us from removing the workaround codes. > Clean up HDFSMetadataLog > > > Key: SPARK-19599 > URL: https://issues.apache.org/jira/browse/SPARK-19599 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu > > SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some > cleanup for HDFSMetadataLog > Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us > from removing the workaround codes. Anyway, I sill did some clean up to make > HDFSMetadataLog simply. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19599) Clean up HDFSMetadataLog
[ https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19599: - Description: SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some cleanup for HDFSMetadataLog Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us from removing the workaround codes. Anyway, I sill did some clean up and also updated the comments to point to HADOOP-14084. was: SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some cleanup for HDFSMetadataLog Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us from removing the workaround codes. Anyway, I sill did some clean up to make HDFSMetadataLog simple. > Clean up HDFSMetadataLog > > > Key: SPARK-19599 > URL: https://issues.apache.org/jira/browse/SPARK-19599 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu > > SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some > cleanup for HDFSMetadataLog > Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us > from removing the workaround codes. Anyway, I sill did some clean up and also > updated the comments to point to HADOOP-14084. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18937) Timezone support in CSV/JSON parsing
[ https://issues.apache.org/jira/browse/SPARK-18937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18937. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16750 [https://github.com/apache/spark/pull/16750] > Timezone support in CSV/JSON parsing > > > Key: SPARK-18937 > URL: https://issues.apache.org/jira/browse/SPARK-18937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19599) Clean up HDFSMetadataLog
[ https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19599: - Description: SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some cleanup for HDFSMetadataLog Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us from removing the workaround codes. Anyway, I sill did some clean up to make HDFSMetadataLog simple. was: SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some cleanup for HDFSMetadataLog Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us from removing the workaround codes. Anyway, I sill did some clean up to make HDFSMetadataLog simply. > Clean up HDFSMetadataLog > > > Key: SPARK-19599 > URL: https://issues.apache.org/jira/browse/SPARK-19599 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu > > SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some > cleanup for HDFSMetadataLog > Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us > from removing the workaround codes. Anyway, I sill did some clean up to make > HDFSMetadataLog simple. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19599) Clean up HDFSMetadataLog
[ https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19599: - Summary: Clean up HDFSMetadataLog (was: Clean up HDFSMetadataLog for Hadoop 2.6+) > Clean up HDFSMetadataLog > > > Key: SPARK-19599 > URL: https://issues.apache.org/jira/browse/SPARK-19599 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu > > SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some > cleanup for HDFSMetadataLog > Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us > from removing the workaround codes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19599) Clean up HDFSMetadataLog for Hadoop 2.6+
[ https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19599: - Description: SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some cleanup for HDFSMetadataLog Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us from removing the workaround codes. was:SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some cleanup for HDFSMetadataLog > Clean up HDFSMetadataLog for Hadoop 2.6+ > > > Key: SPARK-19599 > URL: https://issues.apache.org/jira/browse/SPARK-19599 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu > > SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some > cleanup for HDFSMetadataLog > Updated: Unfortunately, there is another issue HADOOP-14084 that prevents us > from removing the workaround codes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19329) after alter a datasource table's location to a not exist location and then insert data throw Exception
[ https://issues.apache.org/jira/browse/SPARK-19329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19329. - Resolution: Fixed Assignee: Song Jun Fix Version/s: 2.2.0 > after alter a datasource table's location to a not exist location and then > insert data throw Exception > -- > > Key: SPARK-19329 > URL: https://issues.apache.org/jira/browse/SPARK-19329 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Song Jun >Assignee: Song Jun > Fix For: 2.2.0 > > > spark.sql("create table t(a string, b int) using parquet") > spark.sql(s"alter table t set location '$notexistedlocation'") > spark.sql("insert into table t select 'c', 1") > this will throw an exception: > com.google.common.util.concurrent.UncheckedExecutionException: > org.apache.spark.sql.AnalysisException: Path does not exist: > $notexistedlocation; > at > com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814) > at > com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122) > at > org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > at scala.collection.immutable.List.foreach(List.scala:381) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19612) Tests failing with timeout
[ https://issues.apache.org/jira/browse/SPARK-19612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868565#comment-15868565 ] Kay Ousterhout commented on SPARK-19612: Does that mean we could potentially fix this by limiting the concurrency on Jenkins? > Tests failing with timeout > -- > > Key: SPARK-19612 > URL: https://issues.apache.org/jira/browse/SPARK-19612 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.1 >Reporter: Kay Ousterhout >Priority: Minor > > I've seen at least one recent test failure due to hitting the 250m timeout: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/ > Filing this JIRA to track this; if it happens repeatedly we should up the > timeout. > cc [~shaneknapp] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17689) _temporary files breaks the Spark SQL streaming job.
[ https://issues.apache.org/jira/browse/SPARK-17689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868552#comment-15868552 ] Sean Owen commented on SPARK-17689: --- This is created by for example HDFS copy jobs to hold the files before they are fully written. It exists transiently and could stick around if something failed. > _temporary files breaks the Spark SQL streaming job. > > > Key: SPARK-17689 > URL: https://issues.apache.org/jira/browse/SPARK-17689 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Prashant Sharma > > Steps to reproduce: > 1) Start a streaming job which reads from HDFS location hdfs://xyz/* > 2) Write content to hdfs://xyz/a > . > . > repeat a few times. > And then job breaks as follows. > org.apache.spark.SparkException: Job aborted due to stage failure: Task 49 in > stage 304.0 failed 1 times, most recent failure: Lost task 49.0 in stage > 304.0 (TID 14794, localhost): java.io.FileNotFoundException: File does not > exist: hdfs://localhost:9000/input/t5/_temporary > at > org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309) > at > org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:464) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:462) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1919) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1919) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19612) Tests failing with timeout
[ https://issues.apache.org/jira/browse/SPARK-19612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868549#comment-15868549 ] Sean Owen commented on SPARK-19612: --- I think this happens when Jenkins is quite busy; it probably isn't even a flaky-test situation. That has been my experience. Not that it isn't a problem, but it may not be due to a test per se. > Tests failing with timeout > -- > > Key: SPARK-19612 > URL: https://issues.apache.org/jira/browse/SPARK-19612 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.1 >Reporter: Kay Ousterhout >Priority: Minor > > I've seen at least one recent test failure due to hitting the 250m timeout: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/ > Filing this JIRA to track this; if it happens repeatedly we should up the > timeout. > cc [~shaneknapp] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19607) Finding QueryExecution that matches provided executionId
[ https://issues.apache.org/jira/browse/SPARK-19607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868532#comment-15868532 ] Apache Spark commented on SPARK-19607: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/16943 > Finding QueryExecution that matches provided executionId > > > Key: SPARK-19607 > URL: https://issues.apache.org/jira/browse/SPARK-19607 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Ala Luszczak >Assignee: Ala Luszczak > Fix For: 2.2.0 > > > Create a method for finding QueryExecution that matches provided executionId > for future use. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19584) Update Structured Streaming documentation to include Batch query description
[ https://issues.apache.org/jira/browse/SPARK-19584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-19584: - Assignee: Tyson Condie > Update Structured Streaming documentation to include Batch query description > > > Key: SPARK-19584 > URL: https://issues.apache.org/jira/browse/SPARK-19584 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Tyson Condie >Assignee: Tyson Condie > Fix For: 2.1.1, 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19492) Dataset, filter and pattern matching on elements
[ https://issues.apache.org/jira/browse/SPARK-19492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868503#comment-15868503 ] Niek Bartholomeus commented on SPARK-19492: --- I've had this issue since I started using Spark a year ago. I thought it was a minor issue that would be solved in the next update, but it's still there in 2.1.0. The workaround is indeed to create a val func as described above or, even simpler, to wrap it in a match clause: {code} departments.filter{ x => x match {case Department(_, name)=> name == "hr" }} {code} > Dataset, filter and pattern matching on elements > > > Key: SPARK-19492 > URL: https://issues.apache.org/jira/browse/SPARK-19492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Loic Descotte >Priority: Minor > > It seems it is impossible to use pattern matching to define the input parameters > of the filter function on datasets. > Example: > This one works: > {code} > val departments = Seq( > Department(1, "hr"), > Department(2, "it") > ).toDS > departments.filter{ d=> > d.name == "hr" > } > {code} > but not this one: > {code} > departments.filter{ case Department(_, name)=> > name == "hr" > } > {code} > Error: > {code} > error: missing parameter type for expanded function > The argument types of an anonymous function must be fully known. (SLS 8.5) > Expected type was: ? > departments.filter{ case Department(_, name)=> > {code} > This kind of pattern matching should work (as the departments dataset type is > known), like the Scala collections filter function or the RDD filter function, for > example. > Please note that it works with the map function: > {code} > departments.map{ case Department(_, name)=> > name > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
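The val-based workaround mentioned in the comment above can be sketched as follows (a hypothetical sketch, not part of the original thread; it assumes the Department case class and departments Dataset from the issue, and the name isHr is invented for illustration):
{code}
// Binding the pattern-matching lambda to an explicitly typed val gives the
// compiler the parameter type it needs, so the partial-function syntax compiles:
val isHr: Department => Boolean = { case Department(_, name) => name == "hr" }
departments.filter(isHr)
{code}
Both workarounds compile because the function's input type is stated explicitly rather than being inferred through Dataset.filter's overloaded signature.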
[jira] [Created] (SPARK-19612) Tests failing with timeout
Kay Ousterhout created SPARK-19612: -- Summary: Tests failing with timeout Key: SPARK-19612 URL: https://issues.apache.org/jira/browse/SPARK-19612 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.1.1 Reporter: Kay Ousterhout Priority: Minor I've seen at least one recent test failure due to hitting the 250m timeout: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/ Filing this JIRA to track this; if it happens repeatedly we should up the timeout. cc [~shaneknapp] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19594) StreamingQueryListener fails to handle QueryTerminatedEvent if more than one listener exists
[ https://issues.apache.org/jira/browse/SPARK-19594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868469#comment-15868469 ] Shixiong Zhu commented on SPARK-19594: -- I suggest overriding "def postToAll(event: E)" and removing the query id after all listeners have processed the event. > StreamingQueryListener fails to handle QueryTerminatedEvent if more than one > listener exists > - > > Key: SPARK-19594 > URL: https://issues.apache.org/jira/browse/SPARK-19594 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Eyal Zituny >Priority: Minor > > reproduce: > *create a Spark session > *add multiple streaming query listeners > *create a simple query > *stop the query > result -> only the first listener handles the QueryTerminatedEvent > this might happen because the query run id is being removed from > activeQueryRunIds once onQueryTerminated is called > (StreamingQueryListenerBus:115) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
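The suggestion in the comment above could look roughly like the following (a hypothetical sketch against Spark's internal listener bus, not the actual patch): deliver the event to every registered listener first, and only then drop the run id.
{code}
// Hypothetical sketch inside StreamingQueryListenerBus: defer the cleanup of
// activeQueryRunIds until after super.postToAll has delivered the event to
// every registered listener, instead of removing the id in the first callback.
override def postToAll(event: StreamingQueryListener.Event): Unit = {
  super.postToAll(event)
  event match {
    case t: StreamingQueryListener.QueryTerminatedEvent =>
      activeQueryRunIds.synchronized { activeQueryRunIds -= t.runId }
    case _ =>
  }
}
{code}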
[jira] [Commented] (SPARK-17689) _temporary files breaks the Spark SQL streaming job.
[ https://issues.apache.org/jira/browse/SPARK-17689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868466#comment-15868466 ] Shixiong Zhu commented on SPARK-17689: -- Just curious: who created "_temporary"? > _temporary files breaks the Spark SQL streaming job. > > > Key: SPARK-17689 > URL: https://issues.apache.org/jira/browse/SPARK-17689 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Prashant Sharma > > Steps to reproduce: > 1) Start a streaming job which reads from HDFS location hdfs://xyz/* > 2) Write content to hdfs://xyz/a > . > . > repeat a few times. > And then job breaks as follows. > org.apache.spark.SparkException: Job aborted due to stage failure: Task 49 in > stage 304.0 failed 1 times, most recent failure: Lost task 49.0 in stage > 304.0 (TID 14794, localhost): java.io.FileNotFoundException: File does not > exist: hdfs://localhost:9000/input/t5/_temporary > at > org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309) > at > org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:464) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$4.apply(fileSourceInterfaces.scala:462) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1919) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1919) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19455) Add option for case-insensitive Parquet field resolution
[ https://issues.apache.org/jira/browse/SPARK-19455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868457#comment-15868457 ] Adam Budde commented on SPARK-19455: Closing this in favor of https://issues.apache.org/jira/browse/SPARK-19611 > Add option for case-insensitive Parquet field resolution > > > Key: SPARK-19455 > URL: https://issues.apache.org/jira/browse/SPARK-19455 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Adam Budde > > [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the > schema inference from the HiveMetastoreCatalog class when converting a > MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in > favor of simply using the schema returned by the metastore. This results in > an optimization, as the underlying file status no longer needs to be resolved > until after the partition pruning step, reducing the number of files to be > touched significantly in some cases. The downside is that the data schema > used may no longer match the underlying file schema for case-sensitive > formats such as Parquet. > This change initially included a [patch to > ParquetReadSupport|https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284] > that attempted to remedy this conflict by using a case-insensitive fallback > mapping when resolving field names during the schema clipping step. > [SPARK-18333|https://issues.apache.org/jira/browse/SPARK-18333] later removed > this patch after > [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support > for embedding a case-sensitive schema as a Hive Metastore table property. > AFAIK the assumption here was that the data schema obtained from the > Metastore table property will be case-sensitive and should match the Parquet > schema exactly. 
> The problem arises when dealing with Parquet-backed tables for which this > schema has not been embedded as a table attribute and for which the > underlying files contain case-sensitive field names. This will happen for any > Hive table that was not created by Spark or was created by a version prior to > 2.1.0. We've seen Spark SQL return no results for any query containing a > case-sensitive field name for such tables. > The change we're proposing is to introduce a configuration parameter that > will re-enable case-insensitive field name resolution in ParquetReadSupport. > This option will also disable filter push-down for Parquet, as the filter > predicate constructed by Spark SQL contains the case-insensitive field names, > which Parquet will return 0 records for when filtering against a > case-sensitive column name. I was hoping to find a way to construct the > filter on-the-fly in ParquetReadSupport but Parquet doesn't propagate the > Configuration object passed to this class to the underlying > InternalParquetRecordReader class. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-19455) Add option for case-insensitive Parquet field resolution
[ https://issues.apache.org/jira/browse/SPARK-19455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Budde closed SPARK-19455. -- Resolution: Duplicate Closing in favor of https://issues.apache.org/jira/browse/SPARK-19611 > Add option for case-insensitive Parquet field resolution > > > Key: SPARK-19455 > URL: https://issues.apache.org/jira/browse/SPARK-19455 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Adam Budde > > [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the > schema inference from the HiveMetastoreCatalog class when converting a > MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in > favor of simply using the schema returned by the metastore. This results in > an optimization, as the underlying file status no longer needs to be resolved > until after the partition pruning step, reducing the number of files to be > touched significantly in some cases. The downside is that the data schema > used may no longer match the underlying file schema for case-sensitive > formats such as Parquet. > This change initially included a [patch to > ParquetReadSupport|https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284] > that attempted to remedy this conflict by using a case-insensitive fallback > mapping when resolving field names during the schema clipping step. > [SPARK-18333|https://issues.apache.org/jira/browse/SPARK-18333] later removed > this patch after > [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support > for embedding a case-sensitive schema as a Hive Metastore table property. > AFAIK the assumption here was that the data schema obtained from the > Metastore table property will be case-sensitive and should match the Parquet > schema exactly. 
> The problem arises when dealing with Parquet-backed tables for which this > schema has not been embedded as a table attribute and for which the > underlying files contain case-sensitive field names. This will happen for any > Hive table that was not created by Spark or was created by a version prior to > 2.1.0. We've seen Spark SQL return no results for any query containing a > case-sensitive field name for such tables. > The change we're proposing is to introduce a configuration parameter that > will re-enable case-insensitive field name resolution in ParquetReadSupport. > This option will also disable filter push-down for Parquet, as the filter > predicate constructed by Spark SQL contains the case-insensitive field names, > which Parquet will return 0 records for when filtering against a > case-sensitive column name. I was hoping to find a way to construct the > filter on-the-fly in ParquetReadSupport but Parquet doesn't propagate the > Configuration object passed to this class to the underlying > InternalParquetRecordReader class. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
[ https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19611: Assignee: Apache Spark > Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files > --- > > Key: SPARK-19611 > URL: https://issues.apache.org/jira/browse/SPARK-19611 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Adam Budde >Assignee: Apache Spark > > This issue replaces > [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR > #16797|https://github.com/apache/spark/pull/16797] > [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the > schema inference from the HiveMetastoreCatalog class when converting a > MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in > favor of simply using the schema returned by the metastore. This results in > an optimization, as the underlying file status no longer needs to be resolved > until after the partition pruning step, reducing the number of files to be > touched significantly in some cases. The downside is that the data schema > used may no longer match the underlying file schema for case-sensitive > formats such as Parquet. > Unfortunately, this silently breaks queries over tables where the underlying > data fields are case-sensitive but a case-sensitive schema wasn't written to > the table properties by Spark. This situation will occur for any Hive table > that wasn't created by Spark or that was created prior to Spark 2.1.0. If a > user attempts to run a query over such a table containing a case-sensitive > field name in the query projection or in the query filter, the query will > return 0 results in every case. > The change we are proposing is to bring back the schema inference that was > used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the > table properties. 
> - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive > schema can be read from the table properties. Attempt to save the inferred > schema in the table properties to avoid future inference. > - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but > don't attempt to save it. > - NEVER_INFER: Fall back to using the case-insensitive schema returned by the > Hive Metastore. Useful if the user knows that none of the underlying data is > case-sensitive. > See [PR #16797|https://github.com/apache/spark/pull/16797] > for more discussion of this issue and the proposed solution. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-19455) Add option for case-insensitive Parquet field resolution
[ https://issues.apache.org/jira/browse/SPARK-19455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Budde updated SPARK-19455: --- Comment: was deleted (was: Closing this in favor of https://issues.apache.org/jira/browse/SPARK-19611) > Add option for case-insensitive Parquet field resolution > > > Key: SPARK-19455 > URL: https://issues.apache.org/jira/browse/SPARK-19455 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Adam Budde > > [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the > schema inference from the HiveMetastoreCatalog class when converting a > MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in > favor of simply using the schema returned by the metastore. This results in > an optimization, as the underlying file status no longer needs to be resolved > until after the partition pruning step, reducing the number of files to be > touched significantly in some cases. The downside is that the data schema > used may no longer match the underlying file schema for case-sensitive > formats such as Parquet. > This change initially included a [patch to > ParquetReadSupport|https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284] > that attempted to remedy this conflict by using a case-insensitive fallback > mapping when resolving field names during the schema clipping step. > [SPARK-18333|https://issues.apache.org/jira/browse/SPARK-18333] later removed > this patch after > [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support > for embedding a case-sensitive schema as a Hive Metastore table property. > AFAIK the assumption here was that the data schema obtained from the > Metastore table property will be case-sensitive and should match the Parquet > schema exactly. 
> The problem arises when dealing with Parquet-backed tables for which this > schema has not been embedded as a table attribute and for which the > underlying files contain case-sensitive field names. This will happen for any > Hive table that was not created by Spark or was created by a version prior to > 2.1.0. We've seen Spark SQL return no results for any query containing a > case-sensitive field name for such tables. > The change we're proposing is to introduce a configuration parameter that > will re-enable case-insensitive field name resolution in ParquetReadSupport. > This option will also disable filter push-down for Parquet, as the filter > predicate constructed by Spark SQL contains the case-insensitive field names, > which Parquet will return 0 records for when filtering against a > case-sensitive column name. I was hoping to find a way to construct the > filter on-the-fly in ParquetReadSupport but Parquet doesn't propagate the > Configuration object passed to this class to the underlying > InternalParquetRecordReader class. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
[ https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868458#comment-15868458 ] Apache Spark commented on SPARK-19611: -- User 'budde' has created a pull request for this issue: https://github.com/apache/spark/pull/16942 > Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files > --- > > Key: SPARK-19611 > URL: https://issues.apache.org/jira/browse/SPARK-19611 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Adam Budde > > This issue replaces > [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR > #16797|https://github.com/apache/spark/pull/16797] > [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the > schema inference from the HiveMetastoreCatalog class when converting a > MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in > favor of simply using the schema returned by the metastore. This results in > an optimization, as the underlying file status no longer needs to be resolved > until after the partition pruning step, reducing the number of files to be > touched significantly in some cases. The downside is that the data schema > used may no longer match the underlying file schema for case-sensitive > formats such as Parquet. > Unfortunately, this silently breaks queries over tables where the underlying > data fields are case-sensitive but a case-sensitive schema wasn't written to > the table properties by Spark. This situation will occur for any Hive table > that wasn't created by Spark or that was created prior to Spark 2.1.0. If a > user attempts to run a query over such a table containing a case-sensitive > field name in the query projection or in the query filter, the query will > return 0 results in every case. 
> The change we are proposing is to bring back the schema inference that was > used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the > table properties. > - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive > schema can be read from the table properties. Attempt to save the inferred > schema in the table properties to avoid future inference. > - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but > don't attempt to save it. > - NEVER_INFER: Fall back to using the case-insensitive schema returned by the > Hive Metastore. Useful if the user knows that none of the underlying data is > case-sensitive. > See [PR #16797|https://github.com/apache/spark/pull/16797] > for more discussion of this issue and the proposed solution. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
[ https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19611: Assignee: (was: Apache Spark) > Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files > --- > > Key: SPARK-19611 > URL: https://issues.apache.org/jira/browse/SPARK-19611 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Adam Budde > > This issue replaces > [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR > #16797|https://github.com/apache/spark/pull/16797] > [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the > schema inference from the HiveMetastoreCatalog class when converting a > MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in > favor of simply using the schema returned by the metastore. This results in > an optimization, as the underlying file status no longer needs to be resolved > until after the partition pruning step, reducing the number of files to be > touched significantly in some cases. The downside is that the data schema > used may no longer match the underlying file schema for case-sensitive > formats such as Parquet. > Unfortunately, this silently breaks queries over tables where the underlying > data fields are case-sensitive but a case-sensitive schema wasn't written to > the table properties by Spark. This situation will occur for any Hive table > that wasn't created by Spark or that was created prior to Spark 2.1.0. If a > user attempts to run a query over such a table containing a case-sensitive > field name in the query projection or in the query filter, the query will > return 0 results in every case. > The change we are proposing is to bring back the schema inference that was > used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the > table properties. 
> - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive > schema can be read from the table properties. Attempt to save the inferred > schema in the table properties to avoid future inference. > - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but > don't attempt to save it. > - NEVER_INFER: Fall back to using the case-insensitive schema returned by the > Hive Metastore. Useful if the user knows that none of the underlying data is > case-sensitive. > See [PR #16797|https://github.com/apache/spark/pull/16797] > for more discussion of this issue and the proposed solution. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19610) multi line support for CSV
[ https://issues.apache.org/jira/browse/SPARK-19610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868424#comment-15868424 ] Wenchen Fan commented on SPARK-19610: - [~hyukjin.kwon] do you have time to work on it? > multi line support for CSV > -- > > Key: SPARK-19610 > URL: https://issues.apache.org/jira/browse/SPARK-19610 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19610) multi line support for CSV
Wenchen Fan created SPARK-19610: --- Summary: multi line support for CSV Key: SPARK-19610 URL: https://issues.apache.org/jira/browse/SPARK-19610 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.2.0 Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
Adam Budde created SPARK-19611:
------------------------------

Summary: Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
Key: SPARK-19611
URL: https://issues.apache.org/jira/browse/SPARK-19611
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.1.0
Reporter: Adam Budde

This issue replaces [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR #16797|https://github.com/apache/spark/pull/16797].

[SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the schema inference from the HiveMetastoreCatalog class when converting a MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in favor of simply using the schema returned by the metastore. This is an optimization, as the underlying file statuses no longer need to be resolved until after the partition pruning step, significantly reducing the number of files to be touched in some cases. The downside is that the data schema used may no longer match the underlying file schema for case-sensitive formats such as Parquet.

Unfortunately, this silently breaks queries over tables where the underlying data fields are case-sensitive but a case-sensitive schema wasn't written to the table properties by Spark. This situation will occur for any Hive table that wasn't created by Spark or that was created prior to Spark 2.1.0. If a user attempts to run a query over such a table with a case-sensitive field name in the query projection or in the query filter, the query will return 0 results in every case.

The change we are proposing is to bring back the schema inference that was used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the table properties:

- INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive schema can be read from the table properties. Attempt to save the inferred schema in the table properties to avoid future inference.
- INFER_ONLY: Infer the schema if no case-sensitive schema can be read, but don't attempt to save it.
- NEVER_INFER: Fall back to using the case-insensitive schema returned by the Hive Metastore. Useful if the user knows that none of the underlying data is case-sensitive.

See [PR #16797|https://github.com/apache/spark/pull/16797] for more discussion of this issue and the proposed solution.
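The three modes above suggest a single configuration knob selecting the behavior. As a sketch, this is how it might look in {{spark-defaults.conf}}, assuming the property name proposed in [PR #16797|https://github.com/apache/spark/pull/16797] survives review:

```
# Hypothetical property name taken from PR #16797; verify against the merged
# change before relying on it. Accepted values per the proposal:
# INFER_AND_SAVE, INFER_ONLY, NEVER_INFER.
spark.sql.hive.caseSensitiveInferenceMode  INFER_AND_SAVE
```

INFER_AND_SAVE would be the natural default, since it pays the inference cost at most once per table and then behaves like the fast post-SPARK-16980 path.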
[jira] [Commented] (SPARK-19568) Must include class/method documentation for CRAN check
[ https://issues.apache.org/jira/browse/SPARK-19568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868376#comment-15868376 ]

Felix Cheung commented on SPARK-19568:
--------------------------------------

that would be great - it looks like the nightly build is a Jenkins config - I don't find anything in the git repo on how that is set up

> Must include class/method documentation for CRAN check
> ------------------------------------------------------
>
> Key: SPARK-19568
> URL: https://issues.apache.org/jira/browse/SPARK-19568
> Project: Spark
> Issue Type: Sub-task
> Components: SparkR
> Affects Versions: 2.1.0
> Reporter: Felix Cheung
> Assignee: Felix Cheung
>
> While tests are running, R CMD check --as-cran is still complaining
> {code}
> * checking for missing documentation entries ... WARNING
> Undocumented code objects:
> ‘add_months’ ‘agg’ ‘approxCountDistinct’ ‘approxQuantile’ ‘arrange’
> ‘array_contains’ ‘as.DataFrame’ ‘as.data.frame’ ‘asc’ ‘ascii’ ‘avg’
> ‘base64’ ‘between’ ‘bin’ ‘bitwiseNOT’ ‘bround’ ‘cache’ ‘cacheTable’
> ‘cancelJobGroup’ ‘cast’ ‘cbrt’ ‘ceil’ ‘clearCache’ ‘clearJobGroup’
> ‘collect’ ‘colnames’ ‘colnames<-’ ‘coltypes’ ‘coltypes<-’ ‘column’
> ‘columns’ ‘concat’ ‘concat_ws’ ‘contains’ ‘conv’ ‘corr’ ‘count’
> ‘countDistinct’ ‘cov’ ‘covar_pop’ ‘covar_samp’ ‘crc32’
> ‘createDataFrame’ ‘createExternalTable’ ‘createOrReplaceTempView’
> ‘crossJoin’ ‘crosstab’ ‘cume_dist’ ‘dapply’ ‘dapplyCollect’
> ‘date_add’ ‘date_format’ ‘date_sub’ ‘datediff’ ‘dayofmonth’
> ‘dayofyear’ ‘decode’ ‘dense_rank’ ‘desc’ ‘describe’ ‘distinct’ ‘drop’
> ...
> {code}
> This is because of the lack of .Rd files in a clean environment when running
> against the content of the R source package.
> I think we need to generate the .Rd files under man\ when building the
> release and then include them in the package.
[jira] [Commented] (SPARK-12957) Derive and propagate data constraints in logical plan
[ https://issues.apache.org/jira/browse/SPARK-12957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868366#comment-15868366 ]

Nick Dimiduk commented on SPARK-12957:
--------------------------------------

Filed SPARK-19609. IMHO, it would be another subtask on this ticket.

> Derive and propagate data constraints in logical plan
> -----------------------------------------------------
>
> Key: SPARK-12957
> URL: https://issues.apache.org/jira/browse/SPARK-12957
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Yin Huai
> Assignee: Sameer Agarwal
> Attachments: ConstraintPropagationinSparkSQL.pdf
>
> Based on the semantics of a query plan, we can derive data constraints (e.g. if
> a filter defines {{a > 10}}, we know that the output data of this filter
> satisfies the constraints {{a > 10}} and {{a is not null}}). We should build a
> framework to derive and propagate constraints in the logical plan, which can
> help us build more advanced optimizations.
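To make the {{a > 10}} implies {{a is not null}} example from the description concrete, here is a toy, self-contained Scala sketch of constraint derivation. It is not Catalyst code, just an illustration of the kind of inference the proposed framework would perform:

```scala
// Toy constraint model (hypothetical names; Spark's actual implementation
// lives in Catalyst's expression/plan classes).
sealed trait Constraint
case class GreaterThan(attr: String, value: Int) extends Constraint
case class IsNotNull(attr: String) extends Constraint

// Deriving implied constraints: a comparison like "a > 10" can only be
// satisfied by non-null values of "a", so IsNotNull(a) follows from it.
def derive(c: Constraint): Set[Constraint] = c match {
  case g @ GreaterThan(a, _) => Set(g, IsNotNull(a))
  case other                 => Set(other)
}

// A filter with predicate "a > 10" implicitly guarantees "a IS NOT NULL".
assert(derive(GreaterThan("a", 10)) ==
  Set[Constraint](GreaterThan("a", 10), IsNotNull("a")))
```

The optimizer can then exploit such derived facts, for instance to skip redundant null checks downstream or to prove a join input non-nullable.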