[jira] [Created] (SPARK-18184) INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot handle partitions with custom locations
Eric Liang created SPARK-18184: -- Summary: INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot handle partitions with custom locations Key: SPARK-18184 URL: https://issues.apache.org/jira/browse/SPARK-18184 Project: Spark Issue Type: Bug Components: SQL Reporter: Eric Liang As part of https://issues.apache.org/jira/browse/SPARK-17861 we should support this case as well now that Datasource table partitions can have custom locations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
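A minimal spark-shell reproduction of the unsupported case might look like the following (the table name and path are hypothetical, for illustration only):

```scala
// Hypothetical reproduction: a Datasource table with a partition at a custom location.
sql("CREATE TABLE t (a INT, p INT) USING parquet PARTITIONED BY (p)")
sql("ALTER TABLE t ADD PARTITION (p = 1) LOCATION '/custom/path/p1'")
// The inserted row should land under /custom/path/p1, not under the
// table's default partition directory; this is the case being tracked here.
sql("INSERT OVERWRITE TABLE t PARTITION (p = 1) SELECT 100")
```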
[jira] [Updated] (SPARK-18183) INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition
[ https://issues.apache.org/jira/browse/SPARK-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18183: --- Component/s: SQL > INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource > table instead of just the specified partition > --- > > Key: SPARK-18183 > URL: https://issues.apache.org/jira/browse/SPARK-18183 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Eric Liang > > In Hive, this will only overwrite the specified partition. We should fix this > for Datasource tables to be more in line with the Hive behavior.
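A sketch of the expected (Hive-like) semantics, using a hypothetical table:

```scala
// Hypothetical setup: a partitioned Datasource table with two partitions.
sql("CREATE TABLE sales (id INT) USING parquet PARTITIONED BY (country STRING)")
sql("INSERT INTO sales PARTITION (country = 'US') SELECT 1")
sql("INSERT INTO sales PARTITION (country = 'KR') SELECT 2")
// Expected (Hive behavior): only the country='US' partition is replaced and
// the 'KR' partition survives. The reported bug is that the entire table's
// contents are overwritten instead.
sql("INSERT OVERWRITE TABLE sales PARTITION (country = 'US') SELECT 3")
```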
[jira] [Updated] (SPARK-17972) Query planning slows down dramatically for large query plans even when sub-trees are cached
[ https://issues.apache.org/jira/browse/SPARK-17972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-17972: - Target Version/s: 2.1.0 (was: 2.0.2, 2.1.0) > Query planning slows down dramatically for large query plans even when > sub-trees are cached > --- > > Key: SPARK-17972 > URL: https://issues.apache.org/jira/browse/SPARK-17972 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.0.1 >Reporter: Cheng Lian >Assignee: Cheng Lian > > The following Spark shell snippet creates a series of query plans that grow > exponentially. The {{i}}-th plan is created using 4 *cached* copies of the > {{i - 1}}-th plan. > {code} > (0 until 6).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) => > val start = System.currentTimeMillis() > val result = plan.join(plan, "value").join(plan, "value").join(plan, > "value").join(plan, "value") > result.cache() > System.out.println(s"Iteration $iteration takes time > ${System.currentTimeMillis() - start} ms") > result.as[Int] > } > {code} > We can see that although all plans are cached, the query planning time still > grows exponentially and quickly becomes unbearable. > {noformat} > Iteration 0 takes time 9 ms > Iteration 1 takes time 19 ms > Iteration 2 takes time 61 ms > Iteration 3 takes time 219 ms > Iteration 4 takes time 830 ms > Iteration 5 takes time 4080 ms > {noformat} > Similar scenarios can be found in iterative ML code and significantly affects > usability.
[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623661#comment-15623661 ] Seth Hendrickson commented on SPARK-15784: -- This seems like it fits the framework of a feature transformer. We could generate a real-valued feature column using the PIC algorithm, where the values are the components of the pseudo-eigenvector. Alternatively, we could pipeline a KMeans clustering on the end, but I think it makes more sense to let users do that themselves - though that's up for debate. > Add Power Iteration Clustering to spark.ml > -- > > Key: SPARK-15784 > URL: https://issues.apache.org/jira/browse/SPARK-15784 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xinh Huynh > > Adding this algorithm is required as part of SPARK-4591: Algorithm/model > parity for spark.ml. The review JIRA for clustering is SPARK-14380.
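The transformer-plus-KMeans idea from the comment above could be sketched roughly as follows. Note that the `PowerIterationClustering` stage and its column names are hypothetical here, since no spark.ml implementation exists yet:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans

// Hypothetical PIC transformer that emits the pseudo-eigenvector
// components as a real-valued feature column named "picFeature".
val pic = new PowerIterationClustering() // not a real spark.ml class yet
  .setOutputCol("picFeature")

// Users who want discrete cluster assignments could pipeline KMeans on
// the end themselves, rather than the transformer doing it for them.
val kmeans = new KMeans().setFeaturesCol("picFeature").setK(3)
val pipeline = new Pipeline().setStages(Array(pic, kmeans))
```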
[jira] [Resolved] (SPARK-18167) Flaky test when hive partition pruning is enabled
[ https://issues.apache.org/jira/browse/SPARK-18167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18167. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15701 [https://github.com/apache/spark/pull/15701] > Flaky test when hive partition pruning is enabled > - > > Key: SPARK-18167 > URL: https://issues.apache.org/jira/browse/SPARK-18167 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Eric Liang > Fix For: 2.1.0 > > > org.apache.spark.sql.hive.execution.SQLQuerySuite is flaking when hive > partition pruning is enabled. > Based on the stack traces, it seems to be an old issue where Hive fails to > cast a numeric partition column ("Invalid character string format for type > DECIMAL"). There are two possibilities here: either we are somehow corrupting > the partition table to have non-decimal values in that column, or there is a > transient issue with Derby. > {code} > Error Message java.lang.reflect.InvocationTargetException: null Stacktrace > sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null >at sun.reflect.GeneratedMethodAccessor263.invoke(Unknown Source)at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:588) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:544) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:542) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:542) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:702) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:686) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91) >at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:686) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:769) > at > org.apache.spark.sql.execution.datasources.TableFileCatalog.filterPartitions(TableFileCatalog.scala:67) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:59) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:26) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291) >at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:25) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > at > 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) >at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > at scala.collection.immutable.List.foreach(List.scala:381) at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) > at >
[jira] [Updated] (SPARK-18167) Flaky test when hive partition pruning is enabled
[ https://issues.apache.org/jira/browse/SPARK-18167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-18167: - Assignee: Eric Liang > Flaky test when hive partition pruning is enabled > - > > Key: SPARK-18167 > URL: https://issues.apache.org/jira/browse/SPARK-18167 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang > Fix For: 2.1.0 > > > org.apache.spark.sql.hive.execution.SQLQuerySuite is flaking when hive > partition pruning is enabled. > Based on the stack traces, it seems to be an old issue where Hive fails to > cast a numeric partition column ("Invalid character string format for type > DECIMAL"). There are two possibilities here: either we are somehow corrupting > the partition table to have non-decimal values in that column, or there is a > transient issue with Derby.
[jira] [Created] (SPARK-18185) Should disallow INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
Eric Liang created SPARK-18185: -- Summary: Should disallow INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions Key: SPARK-18185 URL: https://issues.apache.org/jira/browse/SPARK-18185 Project: Spark Issue Type: Bug Components: SQL Reporter: Eric Liang As of current 2.1, INSERT OVERWRITE with dynamic partitions against a Datasource table will overwrite the entire table instead of only the updated partitions as in Hive. This is non-trivial to fix in 2.1, so we should throw an exception here.
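The dynamic-partition form that should raise an exception, sketched with hypothetical table names:

```scala
// Hypothetical tables for illustration.
sql("CREATE TABLE t (a INT, p INT) USING parquet PARTITIONED BY (p)")
// No value is given for p below, making this a *dynamic* partition insert.
// Until the overwrite semantics are fixed, this statement should throw
// rather than silently overwrite the entire table.
sql("INSERT OVERWRITE TABLE t PARTITION (p) SELECT a, p FROM src")
```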
[jira] [Commented] (SPARK-18167) Flaky test when hive partition pruning is enabled
[ https://issues.apache.org/jira/browse/SPARK-18167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623499#comment-15623499 ] Apache Spark commented on SPARK-18167: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/15701 > Flaky test when hive partition pruning is enabled > - > > Key: SPARK-18167 > URL: https://issues.apache.org/jira/browse/SPARK-18167 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Eric Liang > > org.apache.spark.sql.hive.execution.SQLQuerySuite is flaking when hive > partition pruning is enabled. > Based on the stack traces, it seems to be an old issue where Hive fails to > cast a numeric partition column ("Invalid character string format for type > DECIMAL"). There are two possibilities here: either we are somehow corrupting > the partition table to have non-decimal values in that column, or there is a > transient issue with Derby.
[jira] [Created] (SPARK-18186) Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation support
Cheng Lian created SPARK-18186: -- Summary: Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation support Key: SPARK-18186 URL: https://issues.apache.org/jira/browse/SPARK-18186 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.1, 1.6.2 Reporter: Cheng Lian Assignee: Cheng Lian Currently, Hive UDAFs in Spark SQL don't support partial aggregation. Any query involving any Hive UDAFs has to fall back to {{SortAggregateExec}} without partial aggregation. This issue can be fixed by migrating {{HiveUDAFFunction}} to {{TypedImperativeAggregate}}, which already provides partial aggregation support for aggregate functions that may use arbitrary Java objects as aggregation states.
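For context, the relevant contract: {{TypedImperativeAggregate[T]}} lets an aggregate keep an arbitrary Java object of type T as its buffer, provided it can serialize that buffer for partial-aggregation shuffles. A rough sketch of the methods involved (simplified signatures from memory of the 2.1-era API; the actual class in org.apache.spark.sql.catalyst.expressions.aggregate may differ):

```scala
import org.apache.spark.sql.catalyst.InternalRow

// Simplified sketch of the TypedImperativeAggregate contract, not the real class.
abstract class TypedImperativeAggregateSketch[T] {
  def createAggregationBuffer(): T             // fresh per-group state
  def update(buffer: T, input: InternalRow): T // fold one input row into the buffer
  def merge(buffer: T, other: T): T            // combine two partial results
  def eval(buffer: T): Any                     // produce the final aggregate value
  def serialize(buffer: T): Array[Byte]        // ship partial state across the shuffle
  def deserialize(bytes: Array[Byte]): T
}
```

Because merge operates on serialized-then-restored partial buffers, implementing it for {{HiveUDAFFunction}} is what would unlock map-side partial aggregation for Hive UDAFs.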
[jira] [Commented] (SPARK-17732) ALTER TABLE DROP PARTITION should support comparators
[ https://issues.apache.org/jira/browse/SPARK-17732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623707#comment-15623707 ] Apache Spark commented on SPARK-17732: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/15704 > ALTER TABLE DROP PARTITION should support comparators > - > > Key: SPARK-17732 > URL: https://issues.apache.org/jira/browse/SPARK-17732 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Dongjoon Hyun > > This issue aims to support `comparators`, e.g. '<', '<=', '>', '>=', again in > Apache Spark 2.0 for backward compatibility. > *Spark 1.6.2* > {code} > scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, > quarter STRING)") > res0: org.apache.spark.sql.DataFrame = [result: string] > scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')") > res1: org.apache.spark.sql.DataFrame = [result: string] > {code} > *Spark 2.0* > {code} > scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, > quarter STRING)") > res0: org.apache.spark.sql.DataFrame = [] > scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')") > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '<' expecting {')', ','}(line 1, pos 42) > {code}
[jira] [Commented] (SPARK-17964) Enable SparkR with Mesos client mode
[ https://issues.apache.org/jira/browse/SPARK-17964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623427#comment-15623427 ] Apache Spark commented on SPARK-17964: -- User 'susanxhuynh' has created a pull request for this issue: https://github.com/apache/spark/pull/15700 > Enable SparkR with Mesos client mode > > > Key: SPARK-17964 > URL: https://issues.apache.org/jira/browse/SPARK-17964 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.0.1 >Reporter: Sun Rui >
[jira] [Assigned] (SPARK-17964) Enable SparkR with Mesos client mode
[ https://issues.apache.org/jira/browse/SPARK-17964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17964: Assignee: Apache Spark > Enable SparkR with Mesos client mode > > > Key: SPARK-17964 > URL: https://issues.apache.org/jira/browse/SPARK-17964 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.0.1 >Reporter: Sun Rui >Assignee: Apache Spark >
[jira] [Assigned] (SPARK-18186) Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation support
[ https://issues.apache.org/jira/browse/SPARK-18186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18186: Assignee: Apache Spark (was: Cheng Lian) > Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation > support > > > Key: SPARK-18186 > URL: https://issues.apache.org/jira/browse/SPARK-18186 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.0.1 >Reporter: Cheng Lian >Assignee: Apache Spark > > Currently, Hive UDAFs in Spark SQL don't support partial aggregation. Any > query involving any Hive UDAFs has to fall back to {{SortAggregateExec}} > without partial aggregation. > This issue can be fixed by migrating {{HiveUDAFFunction}} to > {{TypedImperativeAggregate}}, which already provides partial aggregation > support for aggregate functions that may use arbitrary Java objects as > aggregation states.
[jira] [Assigned] (SPARK-18186) Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation support
[ https://issues.apache.org/jira/browse/SPARK-18186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18186: Assignee: Cheng Lian (was: Apache Spark) > Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation > support > > > Key: SPARK-18186 > URL: https://issues.apache.org/jira/browse/SPARK-18186 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.0.1 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Currently, Hive UDAFs in Spark SQL don't support partial aggregation. Any > query involving any Hive UDAFs has to fall back to {{SortAggregateExec}} > without partial aggregation. > This issue can be fixed by migrating {{HiveUDAFFunction}} to > {{TypedImperativeAggregate}}, which already provides partial aggregation > support for aggregate functions that may use arbitrary Java objects as > aggregation states.
[jira] [Commented] (SPARK-18186) Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation support
[ https://issues.apache.org/jira/browse/SPARK-18186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623688#comment-15623688 ] Apache Spark commented on SPARK-18186: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/15703 > Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation > support > > > Key: SPARK-18186 > URL: https://issues.apache.org/jira/browse/SPARK-18186 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.0.1 >Reporter: Cheng Lian >Assignee: Cheng Lian > > Currently, Hive UDAFs in Spark SQL don't support partial aggregation. Any > query involving any Hive UDAFs has to fall back to {{SortAggregateExec}} > without partial aggregation. > This issue can be fixed by migrating {{HiveUDAFFunction}} to > {{TypedImperativeAggregate}}, which already provides partial aggregation > support for aggregate functions that may use arbitrary Java objects as > aggregation states.
[jira] [Resolved] (SPARK-18030) Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite
[ https://issues.apache.org/jira/browse/SPARK-18030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-18030. -- Resolution: Fixed Fix Version/s: 2.1.0 2.0.2 > Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite > - > > Key: SPARK-18030 > URL: https://issues.apache.org/jira/browse/SPARK-18030 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Davies Liu >Assignee: Shixiong Zhu > Fix For: 2.0.2, 2.1.0 > > > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.FileStreamSourceSuite_name=when+schema+inference+is+turned+on%2C+should+read+partition+data
[jira] [Created] (SPARK-18188) Add checksum for block in Spark
Davies Liu created SPARK-18188: -- Summary: Add checksum for block in Spark Key: SPARK-18188 URL: https://issues.apache.org/jira/browse/SPARK-18188 Project: Spark Issue Type: Improvement Reporter: Davies Liu Assignee: Davies Liu There has been a long-standing issue (https://issues.apache.org/jira/browse/SPARK-4105): without any checksum for blocks, it is very hard to identify where a corruption bug came from. We should have a way to check whether a block fetched from a remote node or read from disk is correct.
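One plausible approach, sketched here purely as an illustration (not the design actually proposed in the ticket): compute a cheap checksum when a block is written and verify it when the block is fetched from a remote node or read back from disk.

```scala
import java.util.zip.CRC32

// Illustrative only: a CRC32 checksum over a block's bytes lets the reader
// detect corruption introduced on disk or over the network.
def blockChecksum(bytes: Array[Byte]): Long = {
  val crc = new CRC32()
  crc.update(bytes, 0, bytes.length)
  crc.getValue
}

// Compare the recomputed checksum against the one recorded at write time.
def verifyBlock(bytes: Array[Byte], expected: Long): Boolean =
  blockChecksum(bytes) == expected
```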
[jira] [Created] (SPARK-18187) CompactibleFileStreamLog should not rely on "compactInterval" to detect a compaction batch
Shixiong Zhu created SPARK-18187: Summary: CompactibleFileStreamLog should not rely on "compactInterval" to detect a compaction batch Key: SPARK-18187 URL: https://issues.apache.org/jira/browse/SPARK-18187 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.1 Reporter: Shixiong Zhu Right now CompactibleFileStreamLog uses compactInterval to check if a batch is a compaction batch. However, since this conf is controlled by the user, they may just change it, and CompactibleFileStreamLog will just silently return the wrong answer.
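For reference, the kind of check involved is roughly the following modular arithmetic (a sketch, not the exact Spark code): batch N is treated as a compaction batch when N + 1 is a multiple of compactInterval, so changing the interval between runs silently reclassifies past batches.

```scala
// Sketch of compaction-batch detection by interval alone (not the exact Spark code).
def isCompactionBatch(batchId: Long, compactInterval: Int): Boolean =
  (batchId + 1) % compactInterval == 0

// With interval 10, batch 9 is a compaction batch; if the user later changes
// the interval to 7, the same batch 9 is no longer recognized as one, which
// is the silent-wrong-answer failure mode described above.
```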
[jira] [Assigned] (SPARK-18183) INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition
[ https://issues.apache.org/jira/browse/SPARK-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18183: Assignee: (was: Apache Spark) > INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource > table instead of just the specified partition > --- > > Key: SPARK-18183 > URL: https://issues.apache.org/jira/browse/SPARK-18183 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Eric Liang > > In Hive, this will only overwrite the specified partition. We should fix this > for Datasource tables to be more in line with the Hive behavior.
[jira] [Assigned] (SPARK-18183) INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition
[ https://issues.apache.org/jira/browse/SPARK-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18183: Assignee: Apache Spark > INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource > table instead of just the specified partition > --- > > Key: SPARK-18183 > URL: https://issues.apache.org/jira/browse/SPARK-18183 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Eric Liang >Assignee: Apache Spark > > In Hive, this will only overwrite the specified partition. We should fix this > for Datasource tables to be more in line with the Hive behavior.
[jira] [Commented] (SPARK-18184) INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot handle partitions with custom locations
[ https://issues.apache.org/jira/browse/SPARK-18184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623811#comment-15623811 ] Apache Spark commented on SPARK-18184: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/15705 > INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot > handle partitions with custom locations > > > Key: SPARK-18184 > URL: https://issues.apache.org/jira/browse/SPARK-18184 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Eric Liang > > As part of https://issues.apache.org/jira/browse/SPARK-17861 we should > support this case as well now that Datasource table partitions can have > custom locations.
[jira] [Assigned] (SPARK-18184) INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot handle partitions with custom locations
[ https://issues.apache.org/jira/browse/SPARK-18184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18184: Assignee: Apache Spark > INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot > handle partitions with custom locations > > > Key: SPARK-18184 > URL: https://issues.apache.org/jira/browse/SPARK-18184 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Eric Liang >Assignee: Apache Spark > > As part of https://issues.apache.org/jira/browse/SPARK-17861 we should > support this case as well now that Datasource table partitions can have > custom locations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18184) INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot handle partitions with custom locations
[ https://issues.apache.org/jira/browse/SPARK-18184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18184: Assignee: (was: Apache Spark) > INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot > handle partitions with custom locations > > > Key: SPARK-18184 > URL: https://issues.apache.org/jira/browse/SPARK-18184 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Eric Liang > > As part of https://issues.apache.org/jira/browse/SPARK-17861 we should > support this case as well now that Datasource table partitions can have > custom locations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18183) INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition
[ https://issues.apache.org/jira/browse/SPARK-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623809#comment-15623809 ] Apache Spark commented on SPARK-18183: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/15705 > INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource > table instead of just the specified partition > --- > > Key: SPARK-18183 > URL: https://issues.apache.org/jira/browse/SPARK-18183 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Eric Liang > > In Hive, this will only overwrite the specified partition. We should fix this > for Datasource tables to be more in line with the Hive behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17964) Enable SparkR with Mesos client mode
[ https://issues.apache.org/jira/browse/SPARK-17964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17964: Assignee: (was: Apache Spark) > Enable SparkR with Mesos client mode > > > Key: SPARK-17964 > URL: https://issues.apache.org/jira/browse/SPARK-17964 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.0.1 >Reporter: Sun Rui > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18143) History Server is broken because of the refactoring work in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-18143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18143. - Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 2.1.0 2.0.3 > History Server is broken because of the refactoring work in Structured > Streaming > > > Key: SPARK-18143 > URL: https://issues.apache.org/jira/browse/SPARK-18143 > Project: Spark > Issue Type: Sub-task >Affects Versions: 2.0.2 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Blocker > Fix For: 2.0.3, 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18183) INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition
Eric Liang created SPARK-18183: -- Summary: INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition Key: SPARK-18183 URL: https://issues.apache.org/jira/browse/SPARK-18183 Project: Spark Issue Type: Bug Reporter: Eric Liang In Hive, this will only overwrite the specified partition. We should fix this for Datasource tables to be more in line with the Hive behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17972) Query planning slows down dramatically for large query plans even when sub-trees are cached
[ https://issues.apache.org/jira/browse/SPARK-17972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623363#comment-15623363 ] Yin Huai commented on SPARK-17972: -- This issue has been resolved by https://github.com/apache/spark/pull/15651. > Query planning slows down dramatically for large query plans even when > sub-trees are cached > --- > > Key: SPARK-17972 > URL: https://issues.apache.org/jira/browse/SPARK-17972 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.0.1 >Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.1.0 > > > The following Spark shell snippet creates a series of query plans that grow > exponentially. The {{i}}-th plan is created using 4 *cached* copies of the > {{i - 1}}-th plan. > {code} > (0 until 6).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) => > val start = System.currentTimeMillis() > val result = plan.join(plan, "value").join(plan, "value").join(plan, > "value").join(plan, "value") > result.cache() > System.out.println(s"Iteration $iteration takes time > ${System.currentTimeMillis() - start} ms") > result.as[Int] > } > {code} > We can see that although all plans are cached, the query planning time still > grows exponentially and quickly becomes unbearable. > {noformat} > Iteration 0 takes time 9 ms > Iteration 1 takes time 19 ms > Iteration 2 takes time 61 ms > Iteration 3 takes time 219 ms > Iteration 4 takes time 830 ms > Iteration 5 takes time 4080 ms > {noformat} > Similar scenarios can be found in iterative ML code and significantly affects > usability. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17972) Query planning slows down dramatically for large query plans even when sub-trees are cached
[ https://issues.apache.org/jira/browse/SPARK-17972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-17972. -- Resolution: Fixed Fix Version/s: 2.1.0 > Query planning slows down dramatically for large query plans even when > sub-trees are cached > --- > > Key: SPARK-17972 > URL: https://issues.apache.org/jira/browse/SPARK-17972 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.0.1 >Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.1.0 > > > The following Spark shell snippet creates a series of query plans that grow > exponentially. The {{i}}-th plan is created using 4 *cached* copies of the > {{i - 1}}-th plan. > {code} > (0 until 6).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) => > val start = System.currentTimeMillis() > val result = plan.join(plan, "value").join(plan, "value").join(plan, > "value").join(plan, "value") > result.cache() > System.out.println(s"Iteration $iteration takes time > ${System.currentTimeMillis() - start} ms") > result.as[Int] > } > {code} > We can see that although all plans are cached, the query planning time still > grows exponentially and quickly becomes unbearable. > {noformat} > Iteration 0 takes time 9 ms > Iteration 1 takes time 19 ms > Iteration 2 takes time 61 ms > Iteration 3 takes time 219 ms > Iteration 4 takes time 830 ms > Iteration 5 takes time 4080 ms > {noformat} > Similar scenarios can be found in iterative ML code and significantly affects > usability. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
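The planning blow-up above can be reasoned about without Spark. The following pure-Scala sketch (an illustration of the effect that recognizing cached sub-trees should have, not Spark's actual planner) builds the same 4-way self-join tree as the shell snippet and compares naive re-traversal against memoizing already-seen subtrees:

```scala
// Minimal model: each "plan" joins 4 copies of the previous one.
sealed trait Plan
case class Leaf(id: Int) extends Plan
case class Join(children: Seq[Plan]) extends Plan

def build(depth: Int): Plan =
  if (depth == 0) Leaf(0)
  else { val child = build(depth - 1); Join(Seq.fill(4)(child)) }

// Cost model: one unit of "planning work" per node visit.
// Naive traversal revisits every occurrence of every subtree: O(4^depth).
def naiveCost(p: Plan): Long = p match {
  case Leaf(_)  => 1L
  case Join(cs) => 1L + cs.map(naiveCost).sum
}

// Memoized traversal visits each distinct subtree once: linear in depth.
def memoCost(p: Plan): Long = {
  val seen = scala.collection.mutable.Set.empty[Plan]
  def go(q: Plan): Long =
    if (!seen.add(q)) 0L
    else q match {
      case Leaf(_)  => 1L
      case Join(cs) => 1L + cs.map(go).sum
    }
  go(p)
}

val deep = build(6)
println(naiveCost(deep)) // 5461 visits: exponential in depth
println(memoCost(deep))  // 7 visits: one per distinct subtree
```

This mirrors the measured iteration times: without a planning barrier at cached subtrees the work quadruples per iteration, exactly the 9 → 19 → 61 → 219 ms progression reported.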
[jira] [Commented] (SPARK-18124) Implement watermarking for handling late data
[ https://issues.apache.org/jira/browse/SPARK-18124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623603#comment-15623603 ] Apache Spark commented on SPARK-18124: -- User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/15702 > Implement watermarking for handling late data > - > > Key: SPARK-18124 > URL: https://issues.apache.org/jira/browse/SPARK-18124 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Michael Armbrust > > Whenever we aggregate data by event time, we want to handle data that is late > and out-of-order in terms of its event time. Since we keep aggregates keyed by > event time as state, the state will grow unbounded if we keep around all old > aggregates in an attempt to consider arbitrarily late data. Since the state is > stored in-memory, we have to prevent this unbounded state from building up. > Hence, we need a watermarking mechanism by which we will mark data that is > older than a threshold as “too late”, and stop updating the aggregates with > it. This would allow us to remove old aggregates that are never going to be > updated, thus bounding the size of the state. > Here is the design doc - > https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18124) Implement watermarking for handling late data
[ https://issues.apache.org/jira/browse/SPARK-18124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18124: Assignee: Apache Spark (was: Michael Armbrust) > Implement watermarking for handling late data > - > > Key: SPARK-18124 > URL: https://issues.apache.org/jira/browse/SPARK-18124 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Apache Spark > > Whenever we aggregate data by event time, we want to handle data that is late > and out-of-order in terms of its event time. Since we keep aggregates keyed by > event time as state, the state will grow unbounded if we keep around all old > aggregates in an attempt to consider arbitrarily late data. Since the state is > stored in-memory, we have to prevent this unbounded state from building up. > Hence, we need a watermarking mechanism by which we will mark data that is > older than a threshold as “too late”, and stop updating the aggregates with > it. This would allow us to remove old aggregates that are never going to be > updated, thus bounding the size of the state. > Here is the design doc - > https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18124) Implement watermarking for handling late data
[ https://issues.apache.org/jira/browse/SPARK-18124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18124: Assignee: Michael Armbrust (was: Apache Spark) > Implement watermarking for handling late data > - > > Key: SPARK-18124 > URL: https://issues.apache.org/jira/browse/SPARK-18124 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Michael Armbrust > > Whenever we aggregate data by event time, we want to handle data that is late > and out-of-order in terms of its event time. Since we keep aggregates keyed by > event time as state, the state will grow unbounded if we keep around all old > aggregates in an attempt to consider arbitrarily late data. Since the state is > stored in-memory, we have to prevent this unbounded state from building up. > Hence, we need a watermarking mechanism by which we will mark data that is > older than a threshold as “too late”, and stop updating the aggregates with > it. This would allow us to remove old aggregates that are never going to be > updated, thus bounding the size of the state. > Here is the design doc - > https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
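The policy described above can be sketched in a few lines of plain Scala. All names here are hypothetical and this is not Spark's implementation; it only illustrates the mechanism: track the max event time seen, derive a watermark by subtracting the lateness threshold, stop updating state for data behind the watermark, and evict finalized state so it stays bounded:

```scala
import scala.collection.mutable

// Toy event-time aggregation with a watermark (illustrative, not Spark API).
class WatermarkedCounts(thresholdMs: Long) {
  private val counts = mutable.Map.empty[Long, Long] // event-time bucket -> count
  private var maxEventTime = Long.MinValue

  def watermark: Long =
    if (maxEventTime == Long.MinValue) Long.MinValue else maxEventTime - thresholdMs

  def add(eventTimeMs: Long): Unit = {
    maxEventTime = math.max(maxEventTime, eventTimeMs)
    if (eventTimeMs >= watermark) // "too late" data no longer updates the aggregates
      counts(eventTimeMs) = counts.getOrElse(eventTimeMs, 0L) + 1
    // Evict buckets that can never be updated again, bounding the state size.
    counts.keys.filter(_ < watermark).toList.foreach(counts.remove)
  }

  def state: Map[Long, Long] = counts.toMap
}

val agg = new WatermarkedCounts(thresholdMs = 10L)
Seq(100L, 105L, 120L, 95L).foreach(agg.add) // 95 arrives behind watermark 110
println(agg.state)     // Map(120 -> 1): earlier buckets were evicted
println(agg.watermark) // 110
```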
[jira] [Commented] (SPARK-15472) Add support for writing partitioned `csv`, `json`, `text` formats in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-15472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15624475#comment-15624475 ] Apache Spark commented on SPARK-15472: -- User 'lw-lin' has created a pull request for this issue: https://github.com/apache/spark/pull/13575 > Add support for writing partitioned `csv`, `json`, `text` formats in > Structured Streaming > - > > Key: SPARK-15472 > URL: https://issues.apache.org/jira/browse/SPARK-15472 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Liwei Lin (Inactive) >Assignee: Reynold Xin > > Support for the partitioned `parquet` format in FileStreamSink was added in > SPARK-14716; now let's add support for the partitioned `csv`, `json`, and `text` > formats. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17992) HiveClient.getPartitionsByFilter throws an exception for some unsupported filters when hive.metastore.try.direct.sql=false
[ https://issues.apache.org/jira/browse/SPARK-17992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17992: Issue Type: Sub-task (was: Bug) Parent: SPARK-17861 > HiveClient.getPartitionsByFilter throws an exception for some unsupported > filters when hive.metastore.try.direct.sql=false > -- > > Key: SPARK-17992 > URL: https://issues.apache.org/jira/browse/SPARK-17992 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Allman > > We recently added (and enabled by default) table partition pruning for > partitioned Hive tables converted to using {{TableFileCatalog}}. When the > Hive configuration option {{hive.metastore.try.direct.sql}} is set to > {{false}}, Hive will throw an exception for unsupported filter expressions. > For example, attempting to filter on an integer partition column will throw a > {{org.apache.hadoop.hive.metastore.api.MetaException}}. > I discovered this behavior because VideoAmp uses the CDH version of Hive with > a Postgresql metastore DB. In this configuration, CDH sets > {{hive.metastore.try.direct.sql}} to {{false}} by default, and queries that > filter on a non-string partition column will fail. That would be a rather > rude surprise for these Spark 2.1 users... > I'm not sure exactly what behavior we should expect, but I suggest that > {{HiveClientImpl.getPartitionsByFilter}} catch this metastore exception and > return all partitions instead. This is what Spark does for Hive 0.12 users, > which does not support this feature at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
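The suggested fallback can be sketched generically. The types and names below are hypothetical stand-ins, not the real `HiveClientImpl` API: attempt metastore-side pruning and, when the metastore throws for an unsupported filter, fall back to fetching all partitions and pruning them client-side instead of failing the query:

```scala
// Illustrative stand-ins for the Hive metastore types.
case class Partition(spec: Map[String, String])
class MetaException(msg: String) extends Exception(msg)

// Try pushdown first; on a metastore error, degrade to client-side pruning
// (the same behavior Spark already uses for Hive 0.12, per the description).
def getPartitionsByFilter(
    metastorePrune: String => Seq[Partition], // may throw MetaException
    allPartitions: => Seq[Partition],
    filter: String,
    clientSidePredicate: Partition => Boolean): Seq[Partition] =
  try metastorePrune(filter)
  catch { case _: MetaException => allPartitions.filter(clientSidePredicate) }

val parts = Seq(Partition(Map("p" -> "1")), Partition(Map("p" -> "2")))

// Simulates hive.metastore.try.direct.sql=false rejecting an integer filter:
val pruned = getPartitionsByFilter(
  _ => throw new MetaException("unsupported filter expression"),
  parts, "p = 1", _.spec("p") == "1")
println(pruned) // falls back and keeps only p=1 instead of throwing
```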
[jira] [Updated] (SPARK-18184) INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot handle partitions with custom locations
[ https://issues.apache.org/jira/browse/SPARK-18184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18184: Issue Type: Sub-task (was: Bug) Parent: SPARK-17861 > INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot > handle partitions with custom locations > > > Key: SPARK-18184 > URL: https://issues.apache.org/jira/browse/SPARK-18184 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > > As part of https://issues.apache.org/jira/browse/SPARK-17861 we should > support this case as well now that Datasource table partitions can have > custom locations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18173) data source tables should support truncating partition
[ https://issues.apache.org/jira/browse/SPARK-18173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18173: Issue Type: Sub-task (was: New Feature) Parent: SPARK-17861 > data source tables should support truncating partition > -- > > Key: SPARK-18173 > URL: https://issues.apache.org/jira/browse/SPARK-18173 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15624114#comment-15624114 ] Vincent commented on SPARK-17055: - [~srowen] May I ask the reason why we close this issue? It'd be helpful for us to understand current guideline if we are to implement more features in ML/MLLIB, thanks. > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18189) task not serializable with groupByKey() + mapGroups() + map
Yang Yang created SPARK-18189: - Summary: task not serializable with groupByKey() + mapGroups() + map Key: SPARK-18189 URL: https://issues.apache.org/jira/browse/SPARK-18189 Project: Spark Issue Type: Bug Reporter: Yang Yang Just run the following code: val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)] val grouped = a.groupByKey({x:(Int,Int)=>x._1}) val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)}) val yyy = sc.broadcast(1) val last = mappedGroups.rdd.map(xx=>{ val simpley = yyy.value 1 }) Spark says Task not serializable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18183) INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition
[ https://issues.apache.org/jira/browse/SPARK-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18183: Issue Type: Sub-task (was: Bug) Parent: SPARK-17861 > INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource > table instead of just the specified partition > --- > > Key: SPARK-18183 > URL: https://issues.apache.org/jira/browse/SPARK-18183 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > > In Hive, this will only overwrite the specified partition. We should fix this > for Datasource tables to be more in line with the Hive behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
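The intended Hive-compatible semantics can be modeled in a few lines of Scala (a toy model with illustrative names, not Spark or Hive code): a partitioned table maps partition values to rows, and INSERT OVERWRITE ... PARTITION should replace only the matching entry, whereas the reported bug amounts to replacing the whole map:

```scala
// Expected behavior: only the named partition is rewritten.
def insertOverwritePartition(
    table: Map[String, Seq[Int]],
    partition: String,
    rows: Seq[Int]): Map[String, Seq[Int]] =
  table.updated(partition, rows) // other partitions are untouched

// The reported bug, modeled: the entire table is replaced.
def buggyOverwrite(
    table: Map[String, Seq[Int]],
    partition: String,
    rows: Seq[Int]): Map[String, Seq[Int]] =
  Map(partition -> rows) // every other partition's data is lost

val t = Map("ds=2016-10-30" -> Seq(1, 2), "ds=2016-10-31" -> Seq(3))
val fixed  = insertOverwritePartition(t, "ds=2016-10-31", Seq(9))
val broken = buggyOverwrite(t, "ds=2016-10-31", Seq(9))
println(fixed)  // keeps ds=2016-10-30 alongside the rewritten partition
println(broken) // ds=2016-10-30 is gone
```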
[jira] [Updated] (SPARK-12469) Data Property Accumulators for Spark (formerly Consistent Accumulators)
[ https://issues.apache.org/jira/browse/SPARK-12469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12469: Target Version/s: (was: 2.1.0) > Data Property Accumulators for Spark (formerly Consistent Accumulators) > --- > > Key: SPARK-12469 > URL: https://issues.apache.org/jira/browse/SPARK-12469 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: holdenk > > Tasks executed on Spark workers are unable to modify values from the driver, > and accumulators are the one exception to this. Accumulators in Spark are > implemented in such a way that when a stage is recomputed (say for cache > eviction) the accumulator will be updated a second time. This makes > accumulators inside of transformations more difficult to use for things like > counting invalid records (one of the primary potential use cases for > collecting side information during a transformation). However, in some cases > this counting during re-evaluation is exactly the behaviour we want (say in > tracking total execution time for a particular function). Spark would benefit > from a version of accumulators which did not double count even if stages were > re-executed. > Motivating example: > {code} > val parseTime = sc.accumulator(0L) > val parseFailures = sc.accumulator(0L) > val parsedData = sc.textFile(...).flatMap { line => > val start = System.currentTimeMillis() > val parsed = Try(parse(line)) > if (parsed.isFailure) parseFailures += 1 > parseTime += System.currentTimeMillis() - start > parsed.toOption > } > parsedData.cache() > val resultA = parsedData.map(...).filter(...).count() > // some intervening code.
> Almost anything could happen here -- some of > parsedData may > // get kicked out of the cache, or an executor where data was cached might > get lost > val resultB = parsedData.filter(...).map(...).flatMap(...).count() > // now we look at the accumulators > {code} > Here we would want parseFailures to only have been added to once for every > line which failed to parse. Unfortunately, the current Spark accumulator API > doesn’t support the parseFailures use case since if some data had > been evicted it’s possible that it will be double counted. > See the full design document at > https://docs.google.com/document/d/1lR_l1g3zMVctZXrcVjFusq2iQVpr4XvRK_UUDsDr6nk/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
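The non-double-counting behavior the proposal asks for can be sketched as follows. This is an assumed design based on the summary above (see the linked design doc for the actual proposal), not Spark code: record updates per partition and overwrite rather than add, so recomputing an evicted partition is idempotent:

```scala
import scala.collection.mutable

// Illustrative "data property" accumulator: updates are keyed by the
// partition that produced them, and a re-reported value overwrites the old
// one instead of being added, so recomputation does not double count.
class DataPropertyAcc {
  private val perPartition = mutable.Map.empty[Int, Long]
  def update(partitionId: Int, value: Long): Unit =
    perPartition(partitionId) = value
  def value: Long = perPartition.values.sum
}

var naiveCount = 0L // models today's accumulator, which just adds
val acc = new DataPropertyAcc

// First computation of partition 0 reports 3 parse failures:
naiveCount += 3; acc.update(0, 3L)
// Partition 0 is evicted from the cache and recomputed; the same
// 3 failures are reported again:
naiveCount += 3; acc.update(0, 3L)

println(naiveCount) // 6: the classic accumulator double counts
println(acc.value)  // 3: the per-partition version does not
```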
[jira] [Commented] (SPARK-18025) Port streaming to use the commit protocol API
[ https://issues.apache.org/jira/browse/SPARK-18025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15624420#comment-15624420 ] Apache Spark commented on SPARK-18025: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/15710 > Port streaming to use the commit protocol API > - > > Key: SPARK-18025 > URL: https://issues.apache.org/jira/browse/SPARK-18025 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18025) Port streaming to use the commit protocol API
[ https://issues.apache.org/jira/browse/SPARK-18025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18025: Assignee: (was: Apache Spark) > Port streaming to use the commit protocol API > - > > Key: SPARK-18025 > URL: https://issues.apache.org/jira/browse/SPARK-18025 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18025) Port streaming to use the commit protocol API
[ https://issues.apache.org/jira/browse/SPARK-18025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18025: Assignee: Apache Spark > Port streaming to use the commit protocol API > - > > Key: SPARK-18025 > URL: https://issues.apache.org/jira/browse/SPARK-18025 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18190) Fix R version to not the latest in AppVeyor
Hyukjin Kwon created SPARK-18190: Summary: Fix R version to not the latest in AppVeyor Key: SPARK-18190 URL: https://issues.apache.org/jira/browse/SPARK-18190 Project: Spark Issue Type: Improvement Components: Build, SparkR Reporter: Hyukjin Kwon Currently, Spark supports testing on Windows via AppVeyor, but now it seems to be failing to download R 3.3.1 after R 3.3.2 was released. It downloads the given R version after checking via http://rversions.r-pkg.org/r-release whether that version is the latest, because the URL differs. For example, the latest version has a URL as below: https://cran.r-project.org/bin/windows/base/R-3.3.1-win.exe and an old version has a URL as below. https://cran.r-project.org/bin/windows/base/old/3.3.0/R-3.3.0-win.exe The problem is that the R versions available for Windows are not always synced with the latest versions. Please check https://cloud.r-project.org So, currently, AppVeyor tries to find https://cran.r-project.org/bin/windows/base/old/3.3.1/R-3.3.1-win.exe (which is the URL form for old versions) now that 3.3.2 is released, but it does not exist because R 3.3.2 for Windows is not there yet. It seems safer to pin a lower version, as SparkR supports R 3.1+ if I remember correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18189) task not serializable with groupByKey() + mapGroups() + map
[ https://issues.apache.org/jira/browse/SPARK-18189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ergin Seyfe updated SPARK-18189: Component/s: SQL > task not serializable with groupByKey() + mapGroups() + map > --- > > Key: SPARK-18189 > URL: https://issues.apache.org/jira/browse/SPARK-18189 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yang Yang > > Just run the following code: > {code} > val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)] > val grouped = a.groupByKey({x:(Int,Int)=>x._1}) > val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)}) > val yyy = sc.broadcast(1) > val last = mappedGroups.rdd.map(xx=>{ > val simpley = yyy.value > 1 > }) > {code} > Spark says Task not serializable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
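The failure mode can be illustrated without Spark at all: Spark ships task closures via Java serialization, so a closure that captures a non-serializable enclosing object fails in exactly the way the report describes. The `Driver` class below is an illustrative stand-in (not Spark code) for whatever non-serializable object the lambda drags in; copying the needed value into a local val first is the usual workaround:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Same check Spark performs before shipping a task: can the closure
// be Java-serialized?
def isSerializable(f: Int => Int): Boolean =
  try {
    val oos = new ObjectOutputStream(new ByteArrayOutputStream())
    oos.writeObject(f); oos.close(); true
  } catch { case _: java.io.NotSerializableException => false }

class Driver { // NOT Serializable, like an object holding a SparkContext
  val offset = 1
  def badClosure: Int => Int = x => x + offset // captures `this` -> fails
  def goodClosure: Int => Int = {
    val localOffset = offset                   // copy into a local val first
    x => x + localOffset                       // captures only an Int -> ok
  }
}

val d = new Driver
println(isSerializable(d.badClosure))  // false: Task not serializable
println(isSerializable(d.goodClosure)) // true
```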
[jira] [Assigned] (SPARK-18190) Fix R version to not the latest in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-18190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18190: Assignee: (was: Apache Spark) > Fix R version to not the latest in AppVeyor > --- > > Key: SPARK-18190 > URL: https://issues.apache.org/jira/browse/SPARK-18190 > Project: Spark > Issue Type: Improvement > Components: Build, SparkR >Reporter: Hyukjin Kwon > > Currently, Spark supports testing on Windows via AppVeyor, but now it seems > to be failing to download R 3.3.1 after R 3.3.2 was released. > It downloads the given R version after checking via > http://rversions.r-pkg.org/r-release whether that version is the latest, > because the URL differs. > For example, the latest version has a URL as below: > https://cran.r-project.org/bin/windows/base/R-3.3.1-win.exe > and an old version has a URL as below. > https://cran.r-project.org/bin/windows/base/old/3.3.0/R-3.3.0-win.exe > The problem is that the R versions available for Windows are not always synced > with the latest versions. > Please check https://cloud.r-project.org > So, currently, AppVeyor tries to find > https://cran.r-project.org/bin/windows/base/old/3.3.1/R-3.3.1-win.exe (which > is the URL form for old versions) now that 3.3.2 is released, but it does not > exist because R 3.3.2 for Windows is not there yet. > It seems safer to pin a lower version, as SparkR supports R 3.1+ if I remember > correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18190) Fix R version to not the latest in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-18190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15624236#comment-15624236 ] Apache Spark commented on SPARK-18190: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/15709 > Fix R version to not the latest in AppVeyor > --- > > Key: SPARK-18190 > URL: https://issues.apache.org/jira/browse/SPARK-18190 > Project: Spark > Issue Type: Improvement > Components: Build, SparkR >Reporter: Hyukjin Kwon > > Currently, Spark supports testing on Windows via AppVeyor, but now it seems > to be failing to download R 3.3.1 after R 3.3.2 was released. > It downloads the given R version after checking via > http://rversions.r-pkg.org/r-release whether that version is the latest, > because the URL differs. > For example, the latest version has a URL as below: > https://cran.r-project.org/bin/windows/base/R-3.3.1-win.exe > and an old version has a URL as below. > https://cran.r-project.org/bin/windows/base/old/3.3.0/R-3.3.0-win.exe > The problem is that the R versions available for Windows are not always synced > with the latest versions. > Please check https://cloud.r-project.org > So, currently, AppVeyor tries to find > https://cran.r-project.org/bin/windows/base/old/3.3.1/R-3.3.1-win.exe (which > is the URL form for old versions) now that 3.3.2 is released, but it does not > exist because R 3.3.2 for Windows is not there yet. > It seems safer to pin a lower version, as SparkR supports R 3.1+ if I remember > correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
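The URL scheme the description walks through can be sketched as a small helper (illustrative Scala; the actual AppVeyor install script is not shown here): the latest Windows release lives directly under `.../base/`, while older releases move under `.../base/old/<version>/`, so a version's URL flips the moment a newer release appears:

```scala
// Illustrative reconstruction of the CRAN-for-Windows URL scheme.
def windowsRUrl(version: String, latestVersion: String): String = {
  val base = "https://cran.r-project.org/bin/windows/base"
  if (version == latestVersion) s"$base/R-$version-win.exe"
  else s"$base/old/$version/R-$version-win.exe"
}

// While 3.3.1 was the latest, the build fetched the direct URL:
println(windowsRUrl("3.3.1", latestVersion = "3.3.1"))
// Once 3.3.2 became the latest, 3.3.1 resolved to the "old" path -- which
// did not exist yet for Windows, hence the download failures described above.
println(windowsRUrl("3.3.1", latestVersion = "3.3.2"))
```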
[jira] [Updated] (SPARK-16827) Stop reporting spill metrics as shuffle metrics
[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16827: Target Version/s: 2.0.3, 2.1.0 Fix Version/s: (was: 2.0.2) (was: 2.1.0) > Stop reporting spill metrics as shuffle metrics > --- > > Key: SPARK-16827 > URL: https://issues.apache.org/jira/browse/SPARK-16827 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.0.0 >Reporter: Sital Kedia >Assignee: Brian Cho > Labels: performance > > One of our Hive jobs, which looks like this - > {code} > SELECT userid > FROM table1 a > JOIN table2 b > ON a.ds = '2016-07-15' > AND b.ds = '2016-07-15' > AND a.source_id = b.id > {code} > is significantly slower after upgrading to Spark 2.0. Digging a little > into it, we found that one of the stages produces an excessive amount of > shuffle data. Please note that this is a regression from Spark 1.6. Stage 2 > of the job, which used to produce 32KB of shuffle data with 1.6, now produces > more than 400GB with Spark 2.0. We also tried turning off whole-stage code > generation, but that did not help. > PS - Even though the intermediate shuffle data size is huge, the job still > produces accurate output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
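The ticket title points at the root cause: bytes spilled to disk during execution were being counted toward the shuffle-write metric. A toy sketch of that misattribution (hypothetical classes, not Spark's actual TaskMetrics API):

```python
class ShuffleWriteMetrics:
    """Toy metrics holder illustrating the bug class: spill bytes folded
    into the shuffle-write counter inflate the reported shuffle size even
    when the real shuffle output is tiny."""

    def __init__(self):
        self.shuffle_bytes_written = 0
        self.spill_bytes = 0

    def record_shuffle_write(self, n):
        self.shuffle_bytes_written += n

    def record_spill_buggy(self, n):
        # Bug: spill bytes reported as shuffle writes
        self.shuffle_bytes_written += n

    def record_spill_fixed(self, n):
        # Fix: spill bytes kept in their own counter
        self.spill_bytes += n


buggy = ShuffleWriteMetrics()
buggy.record_shuffle_write(32 * 1024)        # the real 32 KB of shuffle output
buggy.record_spill_buggy(400 * 1024**3)      # 400 GB of spills, misreported

fixed = ShuffleWriteMetrics()
fixed.record_shuffle_write(32 * 1024)
fixed.record_spill_fixed(400 * 1024**3)

print(buggy.shuffle_bytes_written)  # wildly inflated
print(fixed.shuffle_bytes_written)  # 32768, matching the real shuffle output
```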
[jira] [Updated] (SPARK-18087) Optimize insert to not require REPAIR TABLE
[ https://issues.apache.org/jira/browse/SPARK-18087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18087: Target Version/s: 2.1.0 > Optimize insert to not require REPAIR TABLE > --- > > Key: SPARK-18087 > URL: https://issues.apache.org/jira/browse/SPARK-18087 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17637) Packed scheduling for Spark tasks across executors
[ https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17637: Affects Version/s: (was: 2.1.0) Target Version/s: 2.1.0 > Packed scheduling for Spark tasks across executors > -- > > Key: SPARK-17637 > URL: https://issues.apache.org/jira/browse/SPARK-17637 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Zhan Zhang >Assignee: Zhan Zhang >Priority: Minor > > Currently the Spark scheduler implements round-robin scheduling of tasks to > executors, which is great as it distributes the load evenly across the > cluster, but it leads to significant resource waste in some cases, > especially when dynamic allocation is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
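The contrast with a packed policy can be sketched with plain lists; this is a hypothetical illustration of the two assignment strategies, not the Spark scheduler's API:

```python
def assign_round_robin(num_tasks, executors, slots_per_executor):
    """Spread tasks evenly across executors (the current behavior)."""
    assignment = {e: 0 for e in executors}
    for i in range(num_tasks):
        assignment[executors[i % len(executors)]] += 1
    return assignment


def assign_packed(num_tasks, executors, slots_per_executor):
    """Fill each executor's slots before touching the next one, so fully
    idle executors can be released by dynamic allocation."""
    assignment = {e: 0 for e in executors}
    remaining = num_tasks
    for e in executors:
        take = min(remaining, slots_per_executor)
        assignment[e] = take
        remaining -= take
    return assignment


executors = ["exec-1", "exec-2", "exec-3", "exec-4"]
rr = assign_round_robin(4, executors, slots_per_executor=4)
pk = assign_packed(4, executors, slots_per_executor=4)
print(sum(1 for n in rr.values() if n > 0))  # 4: every executor kept busy
print(sum(1 for n in pk.values() if n > 0))  # 1: three executors stay idle and releasable
```

With four tasks and four 4-slot executors, round-robin keeps all four executors alive while packing uses only one, which is exactly the resource-waste scenario the description mentions under dynamic allocation.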
[jira] [Resolved] (SPARK-18024) Introduce a commit protocol API along with OutputCommitter implementation
[ https://issues.apache.org/jira/browse/SPARK-18024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18024. - Resolution: Fixed Fix Version/s: 2.1.0 > Introduce a commit protocol API along with OutputCommitter implementation > - > > Key: SPARK-18024 > URL: https://issues.apache.org/jira/browse/SPARK-18024 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.1.0 > > > This commit protocol API should wrap around Hadoop's output committer. Later > we can expand the API to cover streaming commits. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
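A minimal sketch of the shape such a commit protocol can take, with a trivial in-memory implementation; the method names here are illustrative approximations, not necessarily the API the patch introduced:

```python
from abc import ABC, abstractmethod


class CommitProtocol(ABC):
    """Hypothetical commit-protocol surface: job-level setup/commit wrapping
    per-task attempts, mirroring how a Hadoop OutputCommitter operates."""

    @abstractmethod
    def setup_job(self): ...

    @abstractmethod
    def new_task_temp_file(self, task_id): ...

    @abstractmethod
    def commit_task(self, task_id): ...

    @abstractmethod
    def commit_job(self, task_commits): ...


class InMemoryCommitProtocol(CommitProtocol):
    """Tasks write to a staging area; commit_job publishes only committed
    tasks, so failed or speculative attempts leave no partial output."""

    def setup_job(self):
        self.staging, self.published = {}, {}

    def new_task_temp_file(self, task_id):
        self.staging[task_id] = f"/tmp/staging/{task_id}"
        return self.staging[task_id]

    def commit_task(self, task_id):
        return self.staging.pop(task_id)

    def commit_job(self, task_commits):
        for i, path in enumerate(task_commits):
            self.published[i] = path


p = InMemoryCommitProtocol()
p.setup_job()
p.new_task_temp_file("task-0")
commit = p.commit_task("task-0")
p.commit_job([commit])
print(p.published)  # {0: '/tmp/staging/task-0'}
```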
[jira] [Commented] (SPARK-18189) task not serializable with groupByKey() + mapGroups() + map
[ https://issues.apache.org/jira/browse/SPARK-18189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623869#comment-15623869 ] Apache Spark commented on SPARK-18189: -- User 'seyfe' has created a pull request for this issue: https://github.com/apache/spark/pull/15706 > task not serializable with groupByKey() + mapGroups() + map > --- > > Key: SPARK-18189 > URL: https://issues.apache.org/jira/browse/SPARK-18189 > Project: Spark > Issue Type: Bug >Reporter: Yang Yang > > just run the following code > val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)] > val grouped = a.groupByKey({x:(Int,Int)=>x._1}) > val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)}) > val yyy = sc.broadcast(1) > val last = mappedGroups.rdd.map(xx=>{ > val simpley = yyy.value > > 1 > }) > Spark says Task not serializable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18189) task not serializable with groupByKey() + mapGroups() + map
[ https://issues.apache.org/jira/browse/SPARK-18189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18189: Assignee: Apache Spark > task not serializable with groupByKey() + mapGroups() + map > --- > > Key: SPARK-18189 > URL: https://issues.apache.org/jira/browse/SPARK-18189 > Project: Spark > Issue Type: Bug >Reporter: Yang Yang >Assignee: Apache Spark > > just run the following code > val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)] > val grouped = a.groupByKey({x:(Int,Int)=>x._1}) > val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)}) > val yyy = sc.broadcast(1) > val last = mappedGroups.rdd.map(xx=>{ > val simpley = yyy.value > > 1 > }) > Spark says Task not serializable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18189) task not serializable with groupByKey() + mapGroups() + map
[ https://issues.apache.org/jira/browse/SPARK-18189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18189: Assignee: (was: Apache Spark) > task not serializable with groupByKey() + mapGroups() + map > --- > > Key: SPARK-18189 > URL: https://issues.apache.org/jira/browse/SPARK-18189 > Project: Spark > Issue Type: Bug >Reporter: Yang Yang > > just run the following code > val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)] > val grouped = a.groupByKey({x:(Int,Int)=>x._1}) > val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)}) > val yyy = sc.broadcast(1) > val last = mappedGroups.rdd.map(xx=>{ > val simpley = yyy.value > > 1 > }) > Spark says Task not serializable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18189) task not serializable with groupByKey() + mapGroups() + map
[ https://issues.apache.org/jira/browse/SPARK-18189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18189: Target Version/s: 2.1.0 > task not serializable with groupByKey() + mapGroups() + map > --- > > Key: SPARK-18189 > URL: https://issues.apache.org/jira/browse/SPARK-18189 > Project: Spark > Issue Type: Bug >Reporter: Yang Yang > > just run the following code > {code} > val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)] > val grouped = a.groupByKey({x:(Int,Int)=>x._1}) > val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)}) > val yyy = sc.broadcast(1) > val last = mappedGroups.rdd.map(xx=>{ > val simpley = yyy.value > > 1 > }) > {code} > Spark says Task not serializable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18189) task not serializable with groupByKey() + mapGroups() + map
[ https://issues.apache.org/jira/browse/SPARK-18189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18189: Target Version/s: 2.0.3, 2.1.0 (was: 2.1.0) > task not serializable with groupByKey() + mapGroups() + map > --- > > Key: SPARK-18189 > URL: https://issues.apache.org/jira/browse/SPARK-18189 > Project: Spark > Issue Type: Bug >Reporter: Yang Yang > > just run the following code > {code} > val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)] > val grouped = a.groupByKey({x:(Int,Int)=>x._1}) > val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)}) > val yyy = sc.broadcast(1) > val last = mappedGroups.rdd.map(xx=>{ > val simpley = yyy.value > > 1 > }) > {code} > Spark says Task not serializable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce
[ https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-14393: Target Version/s: 2.1.0 > monotonicallyIncreasingId not monotonically increasing with downstream > coalesce > --- > > Key: SPARK-14393 > URL: https://issues.apache.org/jira/browse/SPARK-14393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0, 2.0.1 >Reporter: Jason Piper >Assignee: Xiangrui Meng > Labels: correctness > > When utilising monotonicallyIncreasingId with a coalesce, it appears that > every partition uses the same offset (0) leading to non-monotonically > increasing IDs. > See examples below > {code} > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show() > +---+ > |monotonicallyincreasingid()| > +---+ > |25769803776| > |51539607552| > |77309411328| > | 103079215104| > | 128849018880| > | 163208757248| > | 188978561024| > | 214748364800| > | 240518168576| > | 266287972352| > +---+ > >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > | 0| > +---+ > >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show() > +---+ > |monotonicallyincreasingid()| > +---+ > | 0| > | 1| > | 0| > | 0| > | 1| > | 2| > | 3| > | 0| > | 1| > | 2| > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
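The ids in the first output follow the documented encoding for this function: the partition index goes into the upper bits and the per-partition record number into the lower 33 bits, so after coalesce every row is re-evaluated in partition 0 and the sequence restarts. A sketch of that encoding:

```python
def monotonically_increasing_id(partition_index, row_in_partition):
    # Upper 31 bits: partition index; lower 33 bits: record number within
    # the partition (as documented for the Spark function of this name).
    return (partition_index << 33) + row_in_partition

# With the 10 rows spread over many partitions, partition 3's first row gets:
print(monotonically_increasing_id(3, 0))  # 25769803776, matching the output above
# After coalesce(1), every row is evaluated in partition 0, so ids restart:
print(monotonically_increasing_id(0, 0))  # 0
```

This also explains the third output: coalesce(1) over 5 upstream partitions re-runs the expression per upstream partition, so each source partition restarts its row counter at 0.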
[jira] [Resolved] (SPARK-18087) Optimize insert to not require REPAIR TABLE
[ https://issues.apache.org/jira/browse/SPARK-18087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18087. - Resolution: Fixed Assignee: Eric Liang Fix Version/s: 2.1.0 > Optimize insert to not require REPAIR TABLE > --- > > Key: SPARK-18087 > URL: https://issues.apache.org/jira/browse/SPARK-18087 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18190) Fix R version to not the latest in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-18190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18190: Assignee: Apache Spark > Fix R version to not the latest in AppVeyor > --- > > Key: SPARK-18190 > URL: https://issues.apache.org/jira/browse/SPARK-18190 > Project: Spark > Issue Type: Improvement > Components: Build, SparkR >Reporter: Hyukjin Kwon >Assignee: Apache Spark > > Currently, Spark runs its tests on Windows via AppVeyor, but it now seems to > fail to download R 3.3.1 after R 3.3.2 was released. > It downloads the given R version after checking via > http://rversions.r-pkg.org/r-release whether that version is the latest, > because the download URL differs. > For example, the latest release has a URL like this: > https://cran.r-project.org/bin/windows/base/R-3.3.1-win.exe > and an old release has a URL like this: > https://cran.r-project.org/bin/windows/base/old/3.3.0/R-3.3.0-win.exe > The problem is that the R builds for Windows are not always synced with the > latest releases. > Please check https://cloud.r-project.org > So, currently, AppVeyor tries to find > https://cran.r-project.org/bin/windows/base/old/3.3.1/R-3.3.1-win.exe (the > URL pattern for old versions) because 3.3.2 has been released, but the file > does not exist since R 3.3.2 for Windows has not been published there. > It seems safer to pin a lower version, as SparkR supports R 3.1+ if I > remember correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18189) task not serializable with groupByKey() + mapGroups() + map
[ https://issues.apache.org/jira/browse/SPARK-18189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18189: Description: just run the following code {code} val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)] val grouped = a.groupByKey({x:(Int,Int)=>x._1}) val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)}) val yyy = sc.broadcast(1) val last = mappedGroups.rdd.map(xx=>{ val simpley = yyy.value 1 }) {code} Spark says Task not serializable was: just run the following code val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)] val grouped = a.groupByKey({x:(Int,Int)=>x._1}) val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)}) val yyy = sc.broadcast(1) val last = mappedGroups.rdd.map(xx=>{ val simpley = yyy.value 1 }) Spark says Task not serializable > task not serializable with groupByKey() + mapGroups() + map > --- > > Key: SPARK-18189 > URL: https://issues.apache.org/jira/browse/SPARK-18189 > Project: Spark > Issue Type: Bug >Reporter: Yang Yang > > just run the following code > {code} > val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)] > val grouped = a.groupByKey({x:(Int,Int)=>x._1}) > val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)}) > val yyy = sc.broadcast(1) > val last = mappedGroups.rdd.map(xx=>{ > val simpley = yyy.value > > 1 > }) > {code} > Spark says Task not serializable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
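"Task not serializable" means Spark's closure-serialization check failed on something captured by the mapGroups/map closures. The same failure mode can be reproduced with plain pickling; this is a pure-Python analogue of the mechanism, not Spark code:

```python
import pickle


class DriverOnlyResource:
    """Stand-in for a non-serializable object captured by a task closure."""

    def __reduce__(self):
        raise TypeError("DriverOnlyResource cannot be serialized")


resource = DriverOnlyResource()


def task(row):
    # The function body references `resource`, so shipping this task to an
    # executor would require serializing `resource` as well, which fails.
    return (row, resource)


try:
    pickle.dumps(resource)  # what serializing the closure's environment requires
    serializable = True
except TypeError:
    serializable = False

print(serializable)  # False, the analogue of "Task not serializable"
```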
[jira] [Commented] (SPARK-18167) Flaky test when hive partition pruning is enabled
[ https://issues.apache.org/jira/browse/SPARK-18167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623959#comment-15623959 ] Apache Spark commented on SPARK-18167: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/15708 > Flaky test when hive partition pruning is enabled > - > > Key: SPARK-18167 > URL: https://issues.apache.org/jira/browse/SPARK-18167 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang > Fix For: 2.1.0 > > > org.apache.spark.sql.hive.execution.SQLQuerySuite is flaking when hive > partition pruning is enabled. > Based on the stack traces, it seems to be an old issue where Hive fails to > cast a numeric partition column ("Invalid character string format for type > DECIMAL"). There are two possibilities here: either we are somehow corrupting > the partition table to have non-decimal values in that column, or there is a > transient issue with Derby. > {code} > Error Message java.lang.reflect.InvocationTargetException: null Stacktrace > sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null >at sun.reflect.GeneratedMethodAccessor263.invoke(Unknown Source)at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:588) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:544) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:542) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228) > at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:542) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:702) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:686) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91) >at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:686) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:769) > at > org.apache.spark.sql.execution.datasources.TableFileCatalog.filterPartitions(TableFileCatalog.scala:67) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:59) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:26) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291) >at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:25) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > at > 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) >at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > at scala.collection.immutable.List.foreach(List.scala:381) at >
[jira] [Commented] (SPARK-18024) Introduce a commit protocol API along with OutputCommitter implementation
[ https://issues.apache.org/jira/browse/SPARK-18024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623957#comment-15623957 ] Apache Spark commented on SPARK-18024: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/15707 > Introduce a commit protocol API along with OutputCommitter implementation > - > > Key: SPARK-18024 > URL: https://issues.apache.org/jira/browse/SPARK-18024 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > This commit protocol API should wrap around Hadoop's output committer. Later > we can expand the API to cover streaming commits. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18190) Fix R version to not the latest in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-18190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15624176#comment-15624176 ] Shivaram Venkataraman commented on SPARK-18190: --- Yeah using a fixed version sounds good to me. > Fix R version to not the latest in AppVeyor > --- > > Key: SPARK-18190 > URL: https://issues.apache.org/jira/browse/SPARK-18190 > Project: Spark > Issue Type: Improvement > Components: Build, SparkR >Reporter: Hyukjin Kwon > > Currently, Spark runs its tests on Windows via AppVeyor, but it now seems to > fail to download R 3.3.1 after R 3.3.2 was released. > It downloads the given R version after checking via > http://rversions.r-pkg.org/r-release whether that version is the latest, > because the download URL differs. > For example, the latest release has a URL like this: > https://cran.r-project.org/bin/windows/base/R-3.3.1-win.exe > and an old release has a URL like this: > https://cran.r-project.org/bin/windows/base/old/3.3.0/R-3.3.0-win.exe > The problem is that the R builds for Windows are not always synced with the > latest releases. > Please check https://cloud.r-project.org > So, currently, AppVeyor tries to find > https://cran.r-project.org/bin/windows/base/old/3.3.1/R-3.3.1-win.exe (the > URL pattern for old versions) because 3.3.2 has been released, but the file > does not exist since R 3.3.2 for Windows has not been published there. > It seems safer to pin a lower version, as SparkR supports R 3.1+ if I > remember correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18160) spark.files should not passed to driver in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-18160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-18160: --- Summary: spark.files should not passed to driver in yarn-cluster mode (was: SparkContext.addFile doesn't work in yarn-cluster mode) > spark.files should not passed to driver in yarn-cluster mode > > > Key: SPARK-18160 > URL: https://issues.apache.org/jira/browse/SPARK-18160 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.6.2, 2.0.1 >Reporter: Jeff Zhang >Priority: Critical > > The following command will fail for Spark 2.0 > {noformat} > bin/spark-submit --class org.apache.spark.examples.SparkPi --master > yarn-cluster --conf spark.files=/usr/spark-client/conf/hive-site.xml > examples/target/original-spark-examples_2.11.jar > {noformat} > The above command reproduces the error shown below in a multi-node > cluster. Note that this issue only happens in a multi-node cluster, as in a > single-node cluster the AM uses the same local filesystem as the driver. > {noformat} > 16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext.
> java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml > does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409) > at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443) > at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415) > at > org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462) > at > org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.apache.spark.SparkContext.(SparkContext.scala:462) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835) > at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31) > at org.apache.spark.examples.SparkPi.main(SparkPi.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18160) spark.files should not be passed to driver in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-18160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-18160: --- Summary: spark.files should not be passed to driver in yarn-cluster mode (was: spark.files should not passed to driver in yarn-cluster mode) > spark.files should not be passed to driver in yarn-cluster mode > --- > > Key: SPARK-18160 > URL: https://issues.apache.org/jira/browse/SPARK-18160 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.6.2, 2.0.1 >Reporter: Jeff Zhang >Priority: Critical > > The following command will fail for Spark 2.0 > {noformat} > bin/spark-submit --class org.apache.spark.examples.SparkPi --master > yarn-cluster --conf spark.files=/usr/spark-client/conf/hive-site.xml > examples/target/original-spark-examples_2.11.jar > {noformat} > The above command reproduces the error shown below in a multi-node > cluster. Note that this issue only happens in a multi-node cluster, as in a > single-node cluster the AM uses the same local filesystem as the driver. > {noformat} > 16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext.
> java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml > does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409) > at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443) > at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415) > at > org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462) > at > org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.apache.spark.SparkContext.(SparkContext.scala:462) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835) > at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31) > at org.apache.spark.examples.SparkPi.main(SparkPi.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
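The failure above comes down to where a spark.files path is resolvable: it names a file on the submitting client, but in yarn-cluster mode the driver runs on an arbitrary cluster node, so re-adding the raw client-local path via SparkContext.addFile fails. A hypothetical sketch of that distinction (not Spark's actual code):

```python
def files_visible_to_driver(deploy_mode, client_local_files, distributed_cache):
    """In yarn-client mode the driver shares the client's filesystem, so
    client-local paths resolve directly. In yarn-cluster mode only files
    that YARN has already shipped to its distributed cache are visible to
    the driver; raw client paths like the one in the stack trace are not."""
    if deploy_mode == "yarn-client":
        return list(client_local_files)
    return [f for f in client_local_files if f in distributed_cache]


files = ["/usr/spark-client/conf/hive-site.xml"]
print(files_visible_to_driver("yarn-cluster", files, distributed_cache=set()))  # []
print(files_visible_to_driver("yarn-client", files, distributed_cache=set()))
```

This is why the ticket argues the driver should not receive spark.files at all in yarn-cluster mode: distribution is YARN's job, done from the client side.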
[jira] [Commented] (SPARK-18187) CompactibleFileStreamLog should not rely on "compactInterval" to detect a compaction batch
[ https://issues.apache.org/jira/browse/SPARK-18187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15624263#comment-15624263 ] Liwei Lin commented on SPARK-18187: --- hi [~zsxwing] how are we planning to fix this? thanks > CompactibleFileStreamLog should not rely on "compactInterval" to detect a > compaction batch > -- > > Key: SPARK-18187 > URL: https://issues.apache.org/jira/browse/SPARK-18187 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.1 >Reporter: Shixiong Zhu > > Right now CompactibleFileStreamLog uses compactInterval to check if a batch > is a compaction batch. However, since this conf is controlled by the user, > they may just change it, and CompactibleFileStreamLog will just silently > return the wrong answer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
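CompactibleFileStreamLog decides whether batch N is a compaction batch purely from the configured interval (roughly, whether N + 1 is a multiple of it), so changing the conf silently reclassifies batches that were already written. A sketch of the misclassification:

```python
def is_compaction_batch(batch_id, compact_interval):
    # A batch is treated as a compaction batch when (batch_id + 1) is a
    # multiple of the configured interval (a simplification of the check
    # CompactibleFileStreamLog performs).
    return (batch_id + 1) % compact_interval == 0


# Batch 9 was written as a compaction batch under the default interval of 10:
print(is_compaction_batch(9, 10))  # True
# If the user later lowers the interval to 3, the same on-disk batch is
# silently no longer recognized as a compaction batch:
print(is_compaction_batch(9, 3))   # False
```

A more robust detection would record in the log itself which batches are compactions, instead of re-deriving that from a mutable user conf.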
[jira] [Assigned] (SPARK-18030) Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite
[ https://issues.apache.org/jira/browse/SPARK-18030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-18030: Assignee: Shixiong Zhu (was: Tathagata Das) > Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite > - > > Key: SPARK-18030 > URL: https://issues.apache.org/jira/browse/SPARK-18030 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Davies Liu >Assignee: Shixiong Zhu > > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.FileStreamSourceSuite_name=when+schema+inference+is+turned+on%2C+should+read+partition+data -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18030) Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite
[ https://issues.apache.org/jira/browse/SPARK-18030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623225#comment-15623225 ] Apache Spark commented on SPARK-18030: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/15699 > Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite > - > > Key: SPARK-18030 > URL: https://issues.apache.org/jira/browse/SPARK-18030 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Davies Liu >Assignee: Shixiong Zhu > > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.FileStreamSourceSuite_name=when+schema+inference+is+turned+on%2C+should+read+partition+data -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18160) SparkContext.addFile doesn't work in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-18160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-18160: --- Description: The following command will fail for Spark 2.0 {noformat} bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --conf spark.files=/usr/spark-client/conf/hive-site.xml examples/target/original-spark-examples_2.11.jar {noformat} and this command fails for Spark 1.6 {noformat} bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --files /usr/spark-client/conf/hive-site.xml examples/target/original-spark-examples_2.11.jar {noformat} The above commands reproduce the error shown below in a multi-node cluster. Note that this issue only happens in a multi-node cluster, as in a single-node cluster the AM uses the same local filesystem as the driver. {noformat} 16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext. java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409) at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443) at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415) at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462) at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.SparkContext.(SparkContext.scala:462) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296) at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843) at
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835) at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31) at org.apache.spark.examples.SparkPi.main(SparkPi.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) {noformat} was: {noformat} bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --conf spark.files=/usr/spark-client/conf/hive-site.xml examples/target/original-spark-examples_2.11.jar {noformat} The above command can reproduce the error as following in a multiple node cluster. To be noticed, this issue only happens in multiple node cluster. As in the single node cluster, AM use the same local filesystem as the the driver. {noformat} 16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext. 
java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409) at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443) at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415) at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462) at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.SparkContext.(SparkContext.scala:462) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296) at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843) at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835) at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31) at org.apache.spark.examples.SparkPi.main(SparkPi.scala) at
[jira] [Updated] (SPARK-18160) SparkContext.addFile doesn't work in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-18160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-18160: --- Description: The following command fails for Spark 2.0 {noformat} bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --conf spark.files=/usr/spark-client/conf/hive-site.xml examples/target/original-spark-examples_2.11.jar {noformat} The above command reproduces the following error in a multi-node cluster. Note that this issue only happens in a multi-node cluster: in a single-node cluster the AM uses the same local filesystem as the driver. {noformat} 16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext. java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409) at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443) at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415) at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462) at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.SparkContext.<init>(SparkContext.scala:462) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296) at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843) at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835) at
org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31) at org.apache.spark.examples.SparkPi.main(SparkPi.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) {noformat} was: The following command will fails for spark 2.0 {noformat} bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --conf spark.files=/usr/spark-client/conf/hive-site.xml examples/target/original-spark-examples_2.11.jar {noformat} and this command fails for spark 1.6 {noformat} bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --files /usr/spark-client/conf/hive-site.xml examples/target/original-spark-examples_2.11.jar {noformat} The above command can reproduce the error as following in a multiple node cluster. To be noticed, this issue only happens in multiple node cluster. As in the single node cluster, AM use the same local filesystem as the the driver. {noformat} 16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext. 
java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409) at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443) at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415) at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462) at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.SparkContext.(SparkContext.scala:462) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296) at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843) at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835) at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31) at
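The stack trace above shows SparkContext.addFile calling getFileStatus on the driver's local filesystem; in yarn-cluster mode the driver runs inside the AM, where a file that exists only on the submitting machine is absent. A minimal pure-Python sketch of that existence check (illustrative only, not Spark's actual implementation):

```python
import os

def add_file(path):
    """Mimic the local-filesystem existence check that
    SparkContext.addFile performs for file:/ URIs."""
    local_path = path[len("file:"):] if path.startswith("file:") else path
    if not os.path.exists(local_path):
        # In yarn-cluster mode this check runs on the AM host, where a
        # file local to the submitting machine typically does not exist.
        raise FileNotFoundError("File file:%s does not exist" % local_path)
    return local_path

# A path that exists only on the submitting machine fails the check here:
try:
    add_file("file:/usr/spark-client/conf/hive-site.xml")
except FileNotFoundError as err:
    print(err)
```

This is why the same command succeeds on a single-node cluster: the check happens to run on the machine that actually holds the file.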
[jira] [Commented] (SPARK-18177) Add missing 'subsamplingRate' of pyspark GBTClassifier
[ https://issues.apache.org/jira/browse/SPARK-18177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15621651#comment-15621651 ] Apache Spark commented on SPARK-18177: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/15692 > Add missing 'subsamplingRate' of pyspark GBTClassifier > -- > > Key: SPARK-18177 > URL: https://issues.apache.org/jira/browse/SPARK-18177 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: zhengruifeng > > Add missing 'subsamplingRate' of pyspark GBTClassifier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18177) Add missing 'subsamplingRate' of pyspark GBTClassifier
[ https://issues.apache.org/jira/browse/SPARK-18177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18177: Assignee: (was: Apache Spark) > Add missing 'subsamplingRate' of pyspark GBTClassifier > -- > > Key: SPARK-18177 > URL: https://issues.apache.org/jira/browse/SPARK-18177 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: zhengruifeng > > Add missing 'subsamplingRate' of pyspark GBTClassifier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18177) Add missing 'subsamplingRate' of pyspark GBTClassifier
[ https://issues.apache.org/jira/browse/SPARK-18177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18177: Assignee: Apache Spark > Add missing 'subsamplingRate' of pyspark GBTClassifier > -- > > Key: SPARK-18177 > URL: https://issues.apache.org/jira/browse/SPARK-18177 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: zhengruifeng >Assignee: Apache Spark > > Add missing 'subsamplingRate' of pyspark GBTClassifier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18176) Kafka010 .createRDD() scala API should expect scala Map
[ https://issues.apache.org/jira/browse/SPARK-18176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18176: -- Priority: Minor (was: Major) > Kafka010 .createRDD() scala API should expect scala Map > --- > > Key: SPARK-18176 > URL: https://issues.apache.org/jira/browse/SPARK-18176 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 2.0.0, 2.0.1 >Reporter: Liwei Lin >Priority: Minor > > Throughout {{external/kafka-010}}, Java APIs expect {{java.util.Map}}s > and Scala APIs expect {{scala.collection.Map}}s, with the exception of > the {{KafkaUtils.createRDD()}} Scala API, which expects a {{java.util.Map}}. > But please note that this is a public API change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17055. --- Resolution: Won't Fix > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > The current CrossValidator only supports k-fold, which randomly divides all the > samples into k groups. But when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from the training data and put them into the validation > fold, i.e. we want to ensure that the same label is not in both the testing and > training sets. > Mainstream packages like scikit-learn already support such a cross-validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
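For readers unfamiliar with LabelKFold, the requested behavior can be sketched in plain Python: assign all samples sharing a label to the same fold, then rotate each fold as the validation set. This is an illustrative sketch of the splitting idea, not the proposed CrossValidator API:

```python
from collections import defaultdict

def label_k_fold(labels, k):
    """Split sample indices into k folds so that all samples sharing a
    label land in the same fold (greedy size balancing), then yield
    (train_indices, validation_indices) pairs."""
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    folds = [[] for _ in range(k)]
    # Place the largest label groups first, each into the smallest fold.
    for members in sorted(groups.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    splits = []
    for fold in folds:
        train = sorted(i for other in folds if other is not fold for i in other)
        splits.append((train, sorted(fold)))
    return splits

labels = ["a", "a", "b", "c", "b", "d"]
for train, validation in label_k_fold(labels, k=2):
    # No label ever appears on both sides of a split.
    assert {labels[i] for i in train}.isdisjoint(labels[i] for i in validation)
```

The key invariant, checked by the assertion, is exactly the one the reporter asks for: the same label never appears in both the training and validation sets.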
[jira] [Resolved] (SPARK-882) Have link for feedback/suggestions in docs
[ https://issues.apache.org/jira/browse/SPARK-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-882. - Resolution: Not A Problem > Have link for feedback/suggestions in docs > -- > > Key: SPARK-882 > URL: https://issues.apache.org/jira/browse/SPARK-882 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Patrick Wendell >Assignee: Patrick Cogan >Priority: Minor > > It would be cool to have a link at the top of the docs for > feedback/suggestions/errors. I bet we'd get a lot of interesting stuff from > that and it could be a good way to crowdsource correctness checking, since a > lot of us that write them never have to use them. > Something to the right of the main top nav might be good. [~andyk] [~matei] - > what do you guys think? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601778#comment-15601778 ] zhangxinyu edited comment on SPARK-17935 at 10/31/16 7:54 AM: -- h2. KafkaSink Design Doc h4. Goal Output results to a Kafka cluster (version 0.10.0.0) in the structured streaming module. h4. Implement Three classes are implemented to output data to the Kafka cluster in the structured streaming module. * *KafkaSinkProvider* This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* and overrides functions *shortName* and *createSink*. In function *createSink*, a *KafkaSink* is created. * *KafkaSink* KafkaSink extends *Sink* and overrides function *addBatch*. * *CachedKafkaProducer* *CachedKafkaProducer* is used to store producers in the executors so that these producers can be reused. h4. Configuration * *Kafka Producer Configuration* "*.option()*" is used to set Kafka producer configurations, which all start with "*kafka.*". For example, the producer configuration *bootstrap.servers* can be set via *.option("kafka.bootstrap.servers", kafka-servers)*. * *Other Configuration* Other configuration is also set via ".option()"; the difference is that these keys don't start with "kafka.". h4. Usage val query = input.writeStream .format("kafka-sink-10") .outputMode("append") .option("kafka.bootstrap.servers", kafka-servers) .option("topic", topic) .start() was (Author: zhangxinyu): h2. KafkaSink Design Doc h4. Goal Output results to kafka cluster(version 0.10.0.0) in structured streaming module. h4. Implement Four classes are implemented to output data to kafka cluster in structured streaming module. * *KafkaSinkProvider* This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* and overrides function *shortName* and *createSink*. In function *createSink*, *KafkaSink* is created. * *KafkaSink* KafkaSink extends *Sink* and overrides function *addBatch*.
*KafkaSinkRDD* will be created in function *addBatch*. * *CachedKafkaProducer* *CachedKafkaProducer* is used to store producers in the executors so that these producers can be reused. h4. Configuration * *Kafka Producer Configuration* "*.option()*" is used to configure kafka producer configurations which are all starting with "*kafka.*". For example, producer configuration *bootstrap.servers* can be configured by *.option("kafka.bootstrap.servers", kafka-servers)*. * *Other Configuration* Other configuration is also set by ".option()". The difference is these configurations don't start with "kafka.". h4. Usage val query = input.writeStream .format("kafka-sink-10") .outputMode("append") .option("kafka.bootstrap.servers", kafka-servers) .option(“topic”, topic) .start() > Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module > -- > > Key: SPARK-17935 > URL: https://issues.apache.org/jira/browse/SPARK-17935 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: zhangxinyu > > Now spark already supports kafkaInputStream. It would be useful that we add > `KafkaForeachWriter` to output results to kafka in structured streaming > module. > `KafkaForeachWriter.scala` is put in external kafka-0.8.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601778#comment-15601778 ] zhangxinyu edited comment on SPARK-17935 at 10/31/16 7:53 AM: -- h2. KafkaSink Design Doc h4. Goal Output results to a Kafka cluster (version 0.10.0.0) in the structured streaming module. h4. Implement Three classes are implemented to output data to the Kafka cluster in the structured streaming module. * *KafkaSinkProvider* This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* and overrides functions *shortName* and *createSink*. In function *createSink*, a *KafkaSink* is created. * *KafkaSink* KafkaSink extends *Sink* and overrides function *addBatch*. A *KafkaSinkRDD* will be created in *addBatch*. * *CachedKafkaProducer* *CachedKafkaProducer* is used to store producers in the executors so that these producers can be reused. h4. Configuration * *Kafka Producer Configuration* "*.option()*" is used to set Kafka producer configurations, which all start with "*kafka.*". For example, the producer configuration *bootstrap.servers* can be set via *.option("kafka.bootstrap.servers", kafka-servers)*. * *Other Configuration* Other configuration is also set via ".option()"; the difference is that these keys don't start with "kafka.". h4. Usage val query = input.writeStream .format("kafka-sink-10") .outputMode("append") .option("kafka.bootstrap.servers", kafka-servers) .option("topic", topic) .start() was (Author: zhangxinyu): h2. KafkaSink Design Doc h4. Goal Output results to kafka cluster(version 0.10.0.0) in structured streaming module. h4. Implement Four classes are implemented to output data to kafka cluster in structured streaming module. * *KafkaSinkProvider* This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* and overrides function *shortName* and *createSink*.
* *KafkaSink* KafkaSink extends *Sink* and overrides function *addBatch*. *KafkaSinkRDD* will be created in function *addBatch*. * *KafkaSinkRDD* *KafkaSinkRDD* is designed to distributedly send results to kafka clusters. It extends *RDD*. In function *compute*, *CachedKafkaProducer* will be called to get or create producer to send data * *CachedKafkaProducer* *CachedKafkaProducer* is used to store producers in the executors so that these producers can be reused. h4. Configuration * *Kafka Producer Configuration* "*.option()*" is used to configure kafka producer configurations which are all starting with "*kafka.*". For example, producer configuration *bootstrap.servers* can be configured by *.option("kafka.bootstrap.servers", kafka-servers)*. * *Other Configuration* Other configuration is also set by ".option()". The difference is these configurations don't start with "kafka.". h4. Usage val query = input.writeStream .format("kafka-sink-10") .outputMode("append") .option("kafka.bootstrap.servers", kafka-servers) .option(“topic”, topic) .start() > Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module > -- > > Key: SPARK-17935 > URL: https://issues.apache.org/jira/browse/SPARK-17935 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: zhangxinyu > > Now spark already supports kafkaInputStream. It would be useful that we add > `KafkaForeachWriter` to output results to kafka in structured streaming > module. > `KafkaForeachWriter.scala` is put in external kafka-0.8.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18177) Add missing 'subsamplingRate' of pyspark GBTClassifier
zhengruifeng created SPARK-18177: Summary: Add missing 'subsamplingRate' of pyspark GBTClassifier Key: SPARK-18177 URL: https://issues.apache.org/jira/browse/SPARK-18177 Project: Spark Issue Type: Bug Components: ML, PySpark Reporter: zhengruifeng Add missing 'subsamplingRate' of pyspark GBTClassifier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
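For context, subsamplingRate in gradient-boosted trees is the fraction of the training rows sampled (without replacement) to fit each boosting iteration. A plain-Python illustration of what the parameter controls (not the PySpark API itself):

```python
import random

def subsample(rows, subsampling_rate, seed=None):
    """Sample a fraction of the training rows without replacement,
    as subsamplingRate does per boosting iteration in a GBT learner."""
    rng = random.Random(seed)
    n = int(len(rows) * subsampling_rate)
    return rng.sample(rows, n)

rows = list(range(100))
batch = subsample(rows, subsampling_rate=0.5, seed=42)
print(len(batch))  # 50 of the 100 rows feed this iteration
```

Exposing the parameter in the PySpark wrapper simply forwards this fraction to the underlying Scala estimator, where it already exists.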
[jira] [Comment Edited] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601778#comment-15601778 ] zhangxinyu edited comment on SPARK-17935 at 10/31/16 8:04 AM: -- h2. KafkaSink Design Doc h4. Goal Output results to a Kafka cluster (version 0.10.0.0) in the structured streaming module. h4. Implement Three classes are implemented to output data to the Kafka cluster in the structured streaming module. * *KafkaSinkProvider* This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* and overrides functions *shortName* and *createSink*. In function *createSink*, a *KafkaSink* is created. * *KafkaSink* KafkaSink extends *Sink* and overrides function *addBatch*. A KafkaProducer is created once per JVM for the same producer parameters. Data is sent to the Kafka cluster in a distributed way, similar to *ForeachSink*. * *CachedKafkaProducer* *CachedKafkaProducer* is used to store producers in the executors so that these producers can be reused. h4. Configuration * *Kafka Producer Configuration* "*.option()*" is used to set Kafka producer configurations, which all start with "*kafka.*". For example, the producer configuration *bootstrap.servers* can be set via *.option("kafka.bootstrap.servers", kafka-servers)*. * *Other Configuration* Other configuration is also set via ".option()"; the difference is that these keys don't start with "kafka.". h4. Usage val query = input.writeStream .format("kafka-sink-10") .outputMode("append") .option("kafka.bootstrap.servers", kafka-servers) .option("topic", topic) .start() was (Author: zhangxinyu): h2. KafkaSink Design Doc h4. Goal Output results to kafka cluster(version 0.10.0.0) in structured streaming module. h4. Implement Four classes are implemented to output data to kafka cluster in structured streaming module. * *KafkaSinkProvider* This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* and overrides function *shortName* and *createSink*.
In function *createSink*, *KafkaSink* is created. * *KafkaSink* KafkaSink extends *Sink* and overrides function *addBatch*. KafkaProducer is created once in a jvm for the same producer paragrams. Data will be send to kafka cluster distributedly which is like *ForeachSink* * *CachedKafkaProducer* *CachedKafkaProducer* is used to store producers in the executors so that these producers can be reused. h4. Configuration * *Kafka Producer Configuration* "*.option()*" is used to configure kafka producer configurations which are all starting with "*kafka.*". For example, producer configuration *bootstrap.servers* can be configured by *.option("kafka.bootstrap.servers", kafka-servers)*. * *Other Configuration* Other configuration is also set by ".option()". The difference is these configurations don't start with "kafka.". h4. Usage val query = input.writeStream .format("kafka-sink-10") .outputMode("append") .option("kafka.bootstrap.servers", kafka-servers) .option(“topic”, topic) .start() > Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module > -- > > Key: SPARK-17935 > URL: https://issues.apache.org/jira/browse/SPARK-17935 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: zhangxinyu > > Now spark already supports kafkaInputStream. It would be useful that we add > `KafkaForeachWriter` to output results to kafka in structured streaming > module. > `KafkaForeachWriter.scala` is put in external kafka-0.8.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601778#comment-15601778 ] zhangxinyu edited comment on SPARK-17935 at 10/31/16 8:03 AM: -- h2. KafkaSink Design Doc h4. Goal Output results to a Kafka cluster (version 0.10.0.0) in the structured streaming module. h4. Implement Three classes are implemented to output data to the Kafka cluster in the structured streaming module. * *KafkaSinkProvider* This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* and overrides functions *shortName* and *createSink*. In function *createSink*, a *KafkaSink* is created. * *KafkaSink* KafkaSink extends *Sink* and overrides function *addBatch*. A KafkaProducer is created once per JVM for the same producer parameters. Data is sent to the Kafka cluster in a distributed way, similar to *ForeachSink*. * *CachedKafkaProducer* *CachedKafkaProducer* is used to store producers in the executors so that these producers can be reused. h4. Configuration * *Kafka Producer Configuration* "*.option()*" is used to set Kafka producer configurations, which all start with "*kafka.*". For example, the producer configuration *bootstrap.servers* can be set via *.option("kafka.bootstrap.servers", kafka-servers)*. * *Other Configuration* Other configuration is also set via ".option()"; the difference is that these keys don't start with "kafka.". h4. Usage val query = input.writeStream .format("kafka-sink-10") .outputMode("append") .option("kafka.bootstrap.servers", kafka-servers) .option("topic", topic) .start() was (Author: zhangxinyu): h2. KafkaSink Design Doc h4. Goal Output results to kafka cluster(version 0.10.0.0) in structured streaming module. h4. Implement Four classes are implemented to output data to kafka cluster in structured streaming module. * *KafkaSinkProvider* This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* and overrides function *shortName* and *createSink*.
In function *createSink*, *KafkaSink* is created. * *KafkaSink* KafkaSink extends *Sink* and overrides function *addBatch*. Like *ForeachSink*, dataset will be transformed to RDD, so that data can be send to kafka cluster distributedly * *CachedKafkaProducer* *CachedKafkaProducer* is used to store producers in the executors so that these producers can be reused. h4. Configuration * *Kafka Producer Configuration* "*.option()*" is used to configure kafka producer configurations which are all starting with "*kafka.*". For example, producer configuration *bootstrap.servers* can be configured by *.option("kafka.bootstrap.servers", kafka-servers)*. * *Other Configuration* Other configuration is also set by ".option()". The difference is these configurations don't start with "kafka.". h4. Usage val query = input.writeStream .format("kafka-sink-10") .outputMode("append") .option("kafka.bootstrap.servers", kafka-servers) .option(“topic”, topic) .start() > Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module > -- > > Key: SPARK-17935 > URL: https://issues.apache.org/jira/browse/SPARK-17935 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: zhangxinyu > > Now spark already supports kafkaInputStream. It would be useful that we add > `KafkaForeachWriter` to output results to kafka in structured streaming > module. > `KafkaForeachWriter.scala` is put in external kafka-0.8.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
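The *CachedKafkaProducer* described in the comments above is a per-process cache keyed by producer configuration, so executors reuse one producer per distinct config instead of creating one per batch. A minimal Python sketch of that caching pattern; FakeProducer is a hypothetical stand-in for the real Kafka producer object:

```python
_producer_cache = {}

class FakeProducer:
    """Hypothetical stand-in for a Kafka producer object."""
    def __init__(self, config):
        self.config = dict(config)

def get_or_create(config):
    """Return the cached producer for this config, creating it at most
    once per process, so repeated batches reuse the same connection."""
    # The cache key must be hashable, so freeze the config dict.
    key = tuple(sorted(config.items()))
    if key not in _producer_cache:
        _producer_cache[key] = FakeProducer(config)
    return _producer_cache[key]

p1 = get_or_create({"bootstrap.servers": "kafka-servers"})
p2 = get_or_create({"bootstrap.servers": "kafka-servers"})
print(p1 is p2)  # True: the same config reuses one producer
```

Caching by config rather than by task avoids reopening TCP connections to the brokers on every micro-batch, which is the point of the design.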
[jira] [Commented] (SPARK-18125) Spark generated code causes CompileException when groupByKey, reduceGroups and map(_._2) are used
[ https://issues.apache.org/jira/browse/SPARK-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15621681#comment-15621681 ] Apache Spark commented on SPARK-18125: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/15693 > Spark generated code causes CompileException when groupByKey, reduceGroups > and map(_._2) are used > - > > Key: SPARK-18125 > URL: https://issues.apache.org/jira/browse/SPARK-18125 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1 >Reporter: Ray Qiu >Priority: Critical > > Code logic looks like this: > {noformat} > .groupByKey > .reduceGroups > .map(_._2) > {noformat} > Works fine with 2.0.0. > 2.0.1 error Message: > {noformat} > Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: > failed to compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 206, Column 123: Unknown variable or type "value4" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificMutableProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificMutableProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private MutableRow mutableRow; > /* 009 */ private Object[] values; > /* 010 */ private java.lang.String errMsg; > /* 011 */ private java.lang.String errMsg1; > /* 012 */ private boolean MapObjects_loopIsNull1; > /* 013 */ private io.mistnet.analytics.lib.ConnLog MapObjects_loopValue0; > /* 014 */ private java.lang.String errMsg2; > /* 015 */ private Object[] values1; > /* 016 */ private boolean MapObjects_loopIsNull3; > /* 017 */ private java.lang.String MapObjects_loopValue2; > /* 018 */ private boolean isNull_0; > /* 019 */ private boolean value_0; > /* 020 */ private boolean isNull_1; > /* 021 */ private InternalRow value_1; > /* 022 */ > /* 023 */ private void 
apply_4(InternalRow i) { > /* 024 */ > /* 025 */ boolean isNull52 = MapObjects_loopIsNull1; > /* 026 */ final double value52 = isNull52 ? -1.0 : > MapObjects_loopValue0.ts(); > /* 027 */ if (isNull52) { > /* 028 */ values1[8] = null; > /* 029 */ } else { > /* 030 */ values1[8] = value52; > /* 031 */ } > /* 032 */ boolean isNull54 = MapObjects_loopIsNull1; > /* 033 */ final java.lang.String value54 = isNull54 ? null : > (java.lang.String) MapObjects_loopValue0.uid(); > /* 034 */ isNull54 = value54 == null; > /* 035 */ boolean isNull53 = isNull54; > /* 036 */ final UTF8String value53 = isNull53 ? null : > org.apache.spark.unsafe.types.UTF8String.fromString(value54); > /* 037 */ isNull53 = value53 == null; > /* 038 */ if (isNull53) { > /* 039 */ values1[9] = null; > /* 040 */ } else { > /* 041 */ values1[9] = value53; > /* 042 */ } > /* 043 */ boolean isNull56 = MapObjects_loopIsNull1; > /* 044 */ final java.lang.String value56 = isNull56 ? null : > (java.lang.String) MapObjects_loopValue0.src(); > /* 045 */ isNull56 = value56 == null; > /* 046 */ boolean isNull55 = isNull56; > /* 047 */ final UTF8String value55 = isNull55 ? null : > org.apache.spark.unsafe.types.UTF8String.fromString(value56); > /* 048 */ isNull55 = value55 == null; > /* 049 */ if (isNull55) { > /* 050 */ values1[10] = null; > /* 051 */ } else { > /* 052 */ values1[10] = value55; > /* 053 */ } > /* 054 */ } > /* 055 */ > /* 056 */ > /* 057 */ private void apply_7(InternalRow i) { > /* 058 */ > /* 059 */ boolean isNull69 = MapObjects_loopIsNull1; > /* 060 */ final scala.Option value69 = isNull69 ? null : (scala.Option) > MapObjects_loopValue0.orig_bytes(); > /* 061 */ isNull69 = value69 == null; > /* 062 */ > /* 063 */ final boolean isNull68 = isNull69 || value69.isEmpty(); > /* 064 */ long value68 = isNull68 ? 
> /* 065 */ -1L : (Long) value69.get(); > /* 066 */ if (isNull68) { > /* 067 */ values1[17] = null; > /* 068 */ } else { > /* 069 */ values1[17] = value68; > /* 070 */ } > /* 071 */ boolean isNull71 = MapObjects_loopIsNull1; > /* 072 */ final scala.Option value71 = isNull71 ? null : (scala.Option) > MapObjects_loopValue0.resp_bytes(); > /* 073 */ isNull71 = value71 == null; > /* 074 */ > /* 075 */ final boolean isNull70 = isNull71 || value71.isEmpty(); > /* 076 */ long value70 = isNull70 ? > /* 077 */ -1L : (Long) value71.get(); > /* 078 */ if (isNull70) { > /* 079 */ values1[18] = null; > /* 080 */ } else { > /* 081 */ values1[18] = value70; > /* 082 */ } > /* 083
[jira] [Assigned] (SPARK-18125) Spark generated code causes CompileException when groupByKey, reduceGroups and map(_._2) are used
[ https://issues.apache.org/jira/browse/SPARK-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18125: Assignee: Apache Spark > Spark generated code causes CompileException when groupByKey, reduceGroups > and map(_._2) are used > - > > Key: SPARK-18125 > URL: https://issues.apache.org/jira/browse/SPARK-18125 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1 >Reporter: Ray Qiu >Assignee: Apache Spark >Priority: Critical
[jira] [Assigned] (SPARK-18125) Spark generated code causes CompileException when groupByKey, reduceGroups and map(_._2) are used
[ https://issues.apache.org/jira/browse/SPARK-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18125: Assignee: (was: Apache Spark) > Spark generated code causes CompileException when groupByKey, reduceGroups > and map(_._2) are used > - > > Key: SPARK-18125 > URL: https://issues.apache.org/jira/browse/SPARK-18125 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.1 >Reporter: Ray Qiu >Priority: Critical
[jira] [Commented] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15621541#comment-15621541 ] zhangxinyu commented on SPARK-17935: Thanks, Cody Koeninger! Here is my opinion: * The Kafka producer isn't idempotent. Yes, we can only guarantee "at least once" delivery in KafkaSink. However, that problem shouldn't be solved here. * KafkaSinkRDD isn't necessary. Yes, it is indeed unnecessary; let me update the KafkaSink design doc. This suggestion is very useful! * CachedKafkaProducer is necessary, in case users need to create two or more producers with different producer parameters. > Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module > -- > > Key: SPARK-17935 > URL: https://issues.apache.org/jira/browse/SPARK-17935 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: zhangxinyu > > Now spark already supports kafkaInputStream. It would be useful that we add > `KafkaForeachWriter` to output results to kafka in structured streaming > module. > `KafkaForeachWriter.scala` is put in external kafka-0.8.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
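The CachedKafkaProducer behavior argued for above — one producer per distinct configuration, created once and reused across batches on an executor — can be sketched as follows. This is a minimal Python illustration of the caching idea only; `FakeProducer` and `get_or_create_producer` are hypothetical stand-ins, not Spark or Kafka client APIs:

```python
import threading

# FakeProducer stands in for a real Kafka producer; the params dict
# represents producer configuration such as bootstrap.servers.
class FakeProducer:
    def __init__(self, params):
        self.params = params

_cache = {}
_lock = threading.Lock()

def get_or_create_producer(params):
    """Return a cached producer for this parameter set, creating it at most once."""
    # Freeze the config into a hashable cache key; distinct configs get
    # distinct producers, identical configs share one instance.
    key = frozenset(params.items())
    with _lock:
        if key not in _cache:
            _cache[key] = FakeProducer(params)
        return _cache[key]
```

Two calls with the same parameters return the same instance, so tasks on the same executor can share a producer instead of reconnecting per batch.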
[jira] [Commented] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15621603#comment-15621603 ] zhangxinyu commented on SPARK-17935: Could you please take a look at the current Kafka sink design? > Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module > -- > > Key: SPARK-17935 > URL: https://issues.apache.org/jira/browse/SPARK-17935 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: zhangxinyu > > Now spark already supports kafkaInputStream. It would be useful that we add > `KafkaForeachWriter` to output results to kafka in structured streaming > module. > `KafkaForeachWriter.scala` is put in external kafka-0.8.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601778#comment-15601778 ] zhangxinyu edited comment on SPARK-17935 at 10/31/16 7:59 AM: -- h2. KafkaSink Design Doc h4. Goal Output results to a Kafka cluster (version 0.10.0.0) in the structured streaming module. h4. Implementation Three classes are implemented to output data to the Kafka cluster in the structured streaming module. * *KafkaSinkProvider* This class extends traits *StreamSinkProvider* and *DataSourceRegister* and overrides functions *shortName* and *createSink*. In *createSink*, a *KafkaSink* is created. * *KafkaSink* KafkaSink extends *Sink* and overrides function *addBatch*. Like *ForeachSink*, the dataset is transformed to an RDD so that data can be sent to the Kafka cluster in a distributed fashion. * *CachedKafkaProducer* *CachedKafkaProducer* is used to store producers in the executors so that these producers can be reused. h4. Configuration * *Kafka Producer Configuration* "*.option()*" is used to set Kafka producer configurations, all of which start with "*kafka.*". For example, the producer configuration *bootstrap.servers* can be set by *.option("kafka.bootstrap.servers", kafka-servers)*. * *Other Configuration* Other options are also set via ".option()"; the difference is that they don't start with "kafka.". h4. Usage val query = input.writeStream .format("kafka-sink-10") .outputMode("append") .option("kafka.bootstrap.servers", kafka-servers) .option("topic", topic) .start() was (Author: zhangxinyu): h2. KafkaSink Design Doc h4. Goal Output results to a Kafka cluster (version 0.10.0.0) in the structured streaming module. h4. Implementation Three classes are implemented to output data to the Kafka cluster in the structured streaming module. * *KafkaSinkProvider* This class extends traits *StreamSinkProvider* and *DataSourceRegister* and overrides functions *shortName* and *createSink*. In *createSink*, a *KafkaSink* is created. * *KafkaSink* KafkaSink extends *Sink* and overrides function *addBatch*. * *CachedKafkaProducer* *CachedKafkaProducer* is used to store producers in the executors so that these producers can be reused. h4. Configuration * *Kafka Producer Configuration* "*.option()*" is used to set Kafka producer configurations, all of which start with "*kafka.*". For example, the producer configuration *bootstrap.servers* can be set by *.option("kafka.bootstrap.servers", kafka-servers)*. * *Other Configuration* Other options are also set via ".option()"; the difference is that they don't start with "kafka.". h4. Usage val query = input.writeStream .format("kafka-sink-10") .outputMode("append") .option("kafka.bootstrap.servers", kafka-servers) .option("topic", topic) .start() > Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module > -- > > Key: SPARK-17935 > URL: https://issues.apache.org/jira/browse/SPARK-17935 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: zhangxinyu > > Now spark already supports kafkaInputStream. It would be useful that we add > `KafkaForeachWriter` to output results to kafka in structured streaming > module. > `KafkaForeachWriter.scala` is put in external kafka-0.8.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18160) SparkContext.addFile doesn't work in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-18160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-18160: --- Description: {noformat} bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --conf spark.files=/usr/spark-client/conf/hive-site.xml examples/target/original-spark-examples_2.11.jar {noformat} The above command reproduces the following error in a multi-node cluster. Note that this issue only happens in a multi-node cluster; in a single-node cluster the AM uses the same local filesystem as the driver. {noformat} 16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext. java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409) at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443) at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415) at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462) at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.SparkContext.(SparkContext.scala:462) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296) at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843) at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835) at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31) at
org.apache.spark.examples.SparkPi.main(SparkPi.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) {noformat} was: {noformat} bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --files /usr/spark-client/conf/hive-site.xml examples/target/original-spark-examples_2.11.jar {noformat} The above command can reproduce the error as following in a multiple node cluster. To be noticed, this issue only happens in multiple node cluster. As in the single node cluster, AM use the same local filesystem as the the driver. {noformat} 16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext. java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409) at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443) at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415) at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462) at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.SparkContext.(SparkContext.scala:462) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296) at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843) at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835) at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31) at org.apache.spark.examples.SparkPi.main(SparkPi.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at
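Given the analysis above (the ApplicationMaster cannot see the driver's local filesystem), one possible workaround is to stage the file on a filesystem every node can reach before submitting. This is a sketch, not a confirmed fix; the HDFS paths are illustrative and assume the cluster has HDFS available:

```shell
# Stage hive-site.xml on HDFS so the ApplicationMaster in yarn-cluster
# mode can resolve it, avoiding the FileNotFoundException raised for a
# driver-local file:/ path. Paths below are hypothetical examples.
hadoop fs -mkdir -p /user/spark/conf
hadoop fs -put /usr/spark-client/conf/hive-site.xml /user/spark/conf/

spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --conf spark.files=hdfs:///user/spark/conf/hive-site.xml \
  examples/target/original-spark-examples_2.11.jar
```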
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15621678#comment-15621678 ] Sean Owen commented on SPARK-18112: --- Yes, this may be an instance where Hive 2 will have to shim to be compatible with Spark 1 vs 2 simultaneously. > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: Spark Submit, SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This 
message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18179) Throws analysis exception with a proper message for unsupported argument types in reflect/java_method function
Hyukjin Kwon created SPARK-18179: Summary: Throws analysis exception with a proper message for unsupported argument types in reflect/java_method function Key: SPARK-18179 URL: https://issues.apache.org/jira/browse/SPARK-18179 Project: Spark Issue Type: Improvement Components: SQL Reporter: Hyukjin Kwon Priority: Minor {code} scala> spark.range(1).selectExpr("reflect('java.lang.String', 'valueOf', cast('1990-01-01' as timestamp))") {code} throws an exception as below: {code} java.util.NoSuchElementException: key not found: TimestampType at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:59) at scala.collection.MapLike$class.apply(MapLike.scala:141) at scala.collection.AbstractMap.apply(Map.scala:59) at org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection$$anonfun$findMethod$1$$anonfun$apply$1.apply(CallMethodViaReflection.scala:159) at org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection$$anonfun$findMethod$1$$anonfun$apply$1.apply(CallMethodViaReflection.scala:158) at scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38) {code} We should throw analysis exception with a better message when the types are unsupported rather than {{java.util.NoSuchElementException: key not found: TimestampType}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
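The improvement proposed above — failing fast with a descriptive error instead of letting a bare {{NoSuchElementException: key not found}} escape from a map lookup — can be illustrated outside Spark. This is a Python sketch of the pattern only; the type table and function name are hypothetical, not Catalyst code:

```python
# Map only the supported argument types; anything else gets an explicit,
# descriptive error rather than a raw KeyError from the dict lookup.
SUPPORTED = {
    "BooleanType": bool,
    "IntegerType": int,
    "LongType": int,
    "DoubleType": float,
    "StringType": str,
}

def resolve_arg_type(data_type: str):
    """Return the Python type for a supported SQL type, or raise a clear error."""
    if data_type not in SUPPORTED:
        raise ValueError(
            f"reflect()/java_method() does not support arguments of type "
            f"{data_type}; supported types are: {', '.join(sorted(SUPPORTED))}")
    return SUPPORTED[data_type]
```

The caller now sees which type was rejected and which types are allowed, mirroring the "analysis exception with a proper message" the issue asks for.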
[jira] [Created] (SPARK-18178) Importing Pandas Tables with Missing Values
Kevin Mader created SPARK-18178: --- Summary: Importing Pandas Tables with Missing Values Key: SPARK-18178 URL: https://issues.apache.org/jira/browse/SPARK-18178 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Reporter: Kevin Mader If you import a table with missing values (like below) and create a DataFrame from it, everything works fine until the command is actually executed (.first(), .toPandas(), etc.). The problem came up with a much larger table with values that were not NaN, just empty. {code} import pandas as pd from io import StringIO test_df = pd.read_csv(StringIO(',Scan Options\n15,SAT2\n16,\n')) sqlContext.createDataFrame(test_df).registerTempTable('Test') o_qry = sqlContext.sql("SELECT * FROM Test LIMIT 1") o_qry.first() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
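A possible workaround sketch for the report above, assuming the empty cell is what breaks execution: pandas reads an empty CSV field as NaN, and replacing NaN with None before handing the frame to createDataFrame sidesteps the problem. The cleaning step is an assumption for illustration, not a confirmed fix:

```python
import pandas as pd
from io import StringIO

# Same input as in the report: the second row's 'Scan Options' cell is empty,
# which pandas loads as NaN.
test_df = pd.read_csv(StringIO(',Scan Options\n15,SAT2\n16,\n'))

# Cast to object dtype, then swap NaN for None (the classic
# df.where(df.notna(), None) idiom), so the cleaned frame carries Python
# None values that createDataFrame can treat as SQL nulls.
clean_df = test_df.astype(object).where(test_df.notna(), None)
# clean_df can now be passed to sqlContext.createDataFrame(...)
```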
[jira] [Comment Edited] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601778#comment-15601778 ] zhangxinyu edited comment on SPARK-17935 at 10/31/16 8:25 AM: -- h2. KafkaSink Design Doc h4. Goal Output results to a Kafka cluster (version 0.10.0.0) in the structured streaming module. h4. Implementation Three classes are implemented to output data to the Kafka cluster in the structured streaming module. * *KafkaSinkProvider* This class extends traits *StreamSinkProvider* and *DataSourceRegister* and overrides functions *shortName* and *createSink*. In *createSink*, a *ForeachSink* with a *KafkaForeachWriter* is created. * *KafkaForeachWriter* *KafkaForeachWriter* is like what I proposed before. The only difference is that the KafkaProducer is the new producer and is created in *CachedKafkaProducer*. * *CachedKafkaProducer* *CachedKafkaProducer* is used to store producers in the executors so that these producers can be reused. h4. Configuration * *Kafka Producer Configuration* "*.option()*" is used to set Kafka producer configurations, all of which start with "*kafka.*". For example, the producer configuration *bootstrap.servers* can be set by *.option("kafka.bootstrap.servers", kafka-servers)*. * *Other Configuration* Other options are also set via ".option()"; the difference is that they don't start with "kafka.". h4. Usage val query = input.writeStream .format("kafka-sink-10") .outputMode("append") .option("kafka.bootstrap.servers", kafka-servers) .option("topic", topic) .start() was (Author: zhangxinyu): h2. KafkaSink Design Doc h4. Goal Output results to a Kafka cluster (version 0.10.0.0) in the structured streaming module. h4. Implementation Three classes are implemented to output data to the Kafka cluster in the structured streaming module. * *KafkaSinkProvider* This class extends traits *StreamSinkProvider* and *DataSourceRegister* and overrides functions *shortName* and *createSink*. In *createSink*, a *KafkaSink* is created. * *KafkaSink* KafkaSink extends *Sink* and overrides function *addBatch*. A KafkaProducer is created once per JVM for the same producer parameters. Data is sent to the Kafka cluster in a distributed fashion, like *ForeachSink*. * *CachedKafkaProducer* *CachedKafkaProducer* is used to store producers in the executors so that these producers can be reused. h4. Configuration * *Kafka Producer Configuration* "*.option()*" is used to set Kafka producer configurations, all of which start with "*kafka.*". For example, the producer configuration *bootstrap.servers* can be set by *.option("kafka.bootstrap.servers", kafka-servers)*. * *Other Configuration* Other options are also set via ".option()"; the difference is that they don't start with "kafka.". h4. Usage val query = input.writeStream .format("kafka-sink-10") .outputMode("append") .option("kafka.bootstrap.servers", kafka-servers) .option("topic", topic) .start() > Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module > -- > > Key: SPARK-17935 > URL: https://issues.apache.org/jira/browse/SPARK-17935 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: zhangxinyu > > Now spark already supports kafkaInputStream. It would be useful that we add > `KafkaForeachWriter` to output results to kafka in structured streaming > module. > `KafkaForeachWriter.scala` is put in external kafka-0.8.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
[ https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15620885#comment-15620885 ] Harish edited comment on SPARK-16740 at 10/31/16 12:38 PM: --- Thank you. I downloaded the 2.0.2 snapshot with Hadoop 2.7 (I think it was on 10/13). I can still reproduce this issue. If "2.0.2-rc1" was updated after 10/13, then I will take the updates and try. Can you please help me find the latest download path? I am going to try the 2.0.3 snapshot from the location below -- any suggestions? http://people.apache.org/~pwendell/spark-nightly/spark-branch-2.0-bin/latest/spark-2.0.3-SNAPSHOT-bin-hadoop2.7.tgz was (Author: harishk15): Thank you. I downloaded the 2.0.2 snapshot with Hadoop 2.7 (I think it was on 10/13). I can still reproduce this issue. If "2.0.2-rc1" was updated after 10/13, then I will take the updates and try. Can you please help me find the latest download path? > joins.LongToUnsafeRowMap crashes with NegativeArraySizeException > > > Key: SPARK-16740 > URL: https://issues.apache.org/jira/browse/SPARK-16740 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer >Assignee: Sylvain Zimmer > Fix For: 2.0.1, 2.1.0 > > > Hello, > Here is a crash in Spark SQL joins, with a minimal reproducible test case. > Interestingly, it only seems to happen when reading Parquet data (I added a > {{crash = True}} variable to show it) > This is a {{left_outer}} example, but it also crashes with a regular > {{inner}} join.
> {code} > import os > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > schema1 = SparkTypes.StructType([ > SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True) > ]) > schema2 = SparkTypes.StructType([ > SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True) > ]) > # Valid Long values (-9223372036854775808 < -5543241376386463808 , > 4661454128115150227 < 9223372036854775807) > data1 = [(4661454128115150227,), (-5543241376386463808,)] > data2 = [(650460285, )] > df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1) > df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2) > crash = True > if crash: > os.system("rm -rf /tmp/sparkbug") > df1.write.parquet("/tmp/sparkbug/vertex") > df2.write.parquet("/tmp/sparkbug/edge") > df1 = sqlc.read.load("/tmp/sparkbug/vertex") > df2 = sqlc.read.load("/tmp/sparkbug/edge") > result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer") > # Should print [Row(id2=650460285, id1=None)] > print result_df.collect() > {code} > When ran with {{spark-submit}}, the final {{collect()}} call crashes with > this: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling > o61.collectToPython. 
> : org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at >
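Until a build containing the fix is available, a common mitigation for broadcast-join crashes like the {{NegativeArraySizeException}} above is to disable automatic broadcast joins, which makes Spark plan sort-merge joins instead. This is a workaround sketch that sidesteps the bug rather than fixing it; in spark-defaults.conf (or via {{--conf}} on {{spark-submit}}):

```properties
# -1 disables automatic broadcast hash joins, so the LongToUnsafeRowMap
# that overflows here is never built; joins fall back to sort-merge.
spark.sql.autoBroadcastJoinThreshold  -1
```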
[jira] [Assigned] (SPARK-18179) Throws analysis exception with a proper message for unsupported argument types in reflect/java_method function
[ https://issues.apache.org/jira/browse/SPARK-18179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18179: Assignee: (was: Apache Spark) > Throws analysis exception with a proper message for unsupported argument > types in reflect/java_method function > -- > > Key: SPARK-18179 > URL: https://issues.apache.org/jira/browse/SPARK-18179 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Minor > > {code} > scala> spark.range(1).selectExpr("reflect('java.lang.String', 'valueOf', > cast('1990-01-01' as timestamp))") > {code} > throws an exception as below: > {code} > java.util.NoSuchElementException: key not found: TimestampType > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at scala.collection.AbstractMap.apply(Map.scala:59) > at > org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection$$anonfun$findMethod$1$$anonfun$apply$1.apply(CallMethodViaReflection.scala:159) > at > org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection$$anonfun$findMethod$1$$anonfun$apply$1.apply(CallMethodViaReflection.scala:158) > at > scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38) > {code} > We should throw analysis exception with a better message when the types are > unsupported rather than {{java.util.NoSuchElementException: key not found: > TimestampType}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
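The fix the issue requests amounts to validating argument types up front and failing with a descriptive message, rather than letting a raw map lookup raise {{key not found}}. A minimal Python sketch of that pattern (the function name and the supported-type list are illustrative assumptions, not Spark's actual code):

```python
# Illustrative sketch of failing fast with a clear message instead of a raw
# map lookup that raises "key not found" for unsupported types.
SUPPORTED_ARG_TYPES = {"boolean", "byte", "short", "integer", "long",
                       "float", "double", "string"}

def check_reflect_arg_type(type_name):
    """Validate one reflect()/java_method() argument type up front."""
    if type_name not in SUPPORTED_ARG_TYPES:
        raise ValueError(
            "reflect()/java_method() does not support argument type "
            "%r; supported types are: %s"
            % (type_name, ", ".join(sorted(SUPPORTED_ARG_TYPES))))
    return type_name
```

In the meantime, a practical workaround in SQL is to cast the unsupported value to a string before passing it to {{reflect}}, since string arguments are supported.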
[jira] [Commented] (SPARK-17695) Deserialization error when using DataFrameReader.json on JSON line that contains an empty JSON object
[ https://issues.apache.org/jira/browse/SPARK-17695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622415#comment-15622415 ] Miguel Cabrera commented on SPARK-17695: Hi, is there a way to prevent this, besides not using the {{json}} method? I am currently mapping the underlying RDD and transforming it into a {{StringRDD}} with the already-serialized JSON. I am using PySpark, though, with the default JSON serializer. > Deserialization error when using DataFrameReader.json on JSON line that > contains an empty JSON object > - > > Key: SPARK-17695 > URL: https://issues.apache.org/jira/browse/SPARK-17695 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Scala 2.11.7 >Reporter: Jonathan Simozar > > When using the {{DataFrameReader}} method {{json}} on the JSON > {noformat}{"field1":{},"field2":"a"}{noformat} > {{field1}} is removed at deserialization. > This can be reproduced in the example below. > {code:java}// create spark context > val sc: SparkContext = new SparkContext("local[*]", "My App") > // create spark session > val sparkSession: SparkSession = > SparkSession.builder().config(sc.getConf).getOrCreate() > // create rdd > val strings = sc.parallelize(Seq( > """{"field1":{},"field2":"a"}""" > )) > // create json DataSet[Row], convert back to RDD, and print lines to stdout > sparkSession.read.json(strings) > .toJSON.collect().foreach(println) > {code} > *stdout* > {noformat} > {"field2":"a"} > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
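The workaround the commenter describes -- serializing each record to a JSON string yourself (e.g. in a map over the underlying RDD) instead of round-tripping through the DataFrame JSON reader -- preserves empty objects, because Python's json module keeps them intact. A minimal sketch with plain dicts standing in for the row data (hypothetical data, no Spark dependency):

```python
import json

# Hypothetical records standing in for the DataFrame rows in the report.
records = [{"field1": {}, "field2": "a"}]

# Serialize each record yourself (in Spark this would be a map over the
# underlying RDD); the empty object is kept, unlike with the JSON reader.
lines = [json.dumps(rec, sort_keys=True) for rec in records]
assert lines == ['{"field1": {}, "field2": "a"}']

# Round-trip check: the empty object survives.
assert json.loads(lines[0])["field1"] == {}
```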
[jira] [Commented] (SPARK-18179) Throws analysis exception with a proper message for unsupported argument types in reflect/java_method function
[ https://issues.apache.org/jira/browse/SPARK-18179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622367#comment-15622367 ] Apache Spark commented on SPARK-18179: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/15694 > Throws analysis exception with a proper message for unsupported argument > types in reflect/java_method function > -- > > Key: SPARK-18179 > URL: https://issues.apache.org/jira/browse/SPARK-18179 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18180) pyspark.sql.Row does not serialize well to json
Miguel Cabrera created SPARK-18180: -- Summary: pyspark.sql.Row does not serialize well to json Key: SPARK-18180 URL: https://issues.apache.org/jira/browse/SPARK-18180 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.1 Environment: HDP 2.3.4, Spark 2.0.1 Reporter: Miguel Cabrera {{Row}} objects do not serialize well automatically. Although they are dict-like in Python, the json module does not seem to be able to serialize them. {noformat} from pyspark.sql import Row import json r = Row(field1='hello', field2='world') json.dumps(r) {noformat} Results: {noformat} '["hello", "world"]' {noformat} Expected: {noformat} '{"field1": "hello", "field2": "world"}' {noformat} The workaround is to call the {{asDict()}} method of {{Row}}. However, this makes custom serialization of nested objects really painful, as the caller has to be aware that they are serializing a {{Row}} object. In particular, with SPARK-17695 you cannot serialize DataFrames easily if you have some empty or null fields, so you have to customize the serialization process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
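The behaviour is easy to reproduce without Spark: {{Row}} is a tuple subclass, so {{json.dumps}} renders it as a JSON array of values. The sketch below uses a namedtuple as a stand-in for {{pyspark.sql.Row}} (an assumption for illustration; with a real Row the conversion call is {{asDict()}} rather than {{_asdict()}}):

```python
import json
from collections import namedtuple

# Stand-in for pyspark.sql.Row (assumption for illustration): like Row,
# a namedtuple is a tuple subclass, so json.dumps sees a plain sequence.
Row = namedtuple("Row", ["field1", "field2"])
r = Row(field1="hello", field2="world")

# The surprising behaviour from the report: field names are lost.
assert json.dumps(r) == '["hello", "world"]'

# Workaround: convert to a dict first. With a real pyspark Row this is
# r.asDict(); the namedtuple equivalent in this sketch is _asdict().
assert json.dumps(dict(r._asdict())) == '{"field1": "hello", "field2": "world"}'
```

Note that because {{Row}} subclasses tuple, the json module serializes it natively as an array and a custom {{json.JSONEncoder.default}} hook is never consulted for it; the conversion has to happen before {{dumps}}, which is why the {{asDict()}} step is unavoidable and nested structures are painful.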
[jira] [Assigned] (SPARK-18179) Throws analysis exception with a proper message for unsupported argument types in reflect/java_method function
[ https://issues.apache.org/jira/browse/SPARK-18179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18179: Assignee: Apache Spark > Throws analysis exception with a proper message for unsupported argument > types in reflect/java_method function > -- > > Key: SPARK-18179 > URL: https://issues.apache.org/jira/browse/SPARK-18179 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18143) History Server is broken because of the refactoring work in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-18143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622740#comment-15622740 ] Apache Spark commented on SPARK-18143: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/15695 > History Server is broken because of the refactoring work in Structured > Streaming > > > Key: SPARK-18143 > URL: https://issues.apache.org/jira/browse/SPARK-18143 > Project: Spark > Issue Type: Sub-task >Affects Versions: 2.0.2 >Reporter: Shixiong Zhu >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622645#comment-15622645 ] Lianhui Wang commented on SPARK-15616: -- For 2.0, I have created a new branch https://github.com/lianhuiwang/spark/tree/2.0_partition_broadcast. You can try again. > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently, if some partitions of a partitioned table are used in a join > operation, we rely on the table size returned by the Metastore to decide whether > we can convert the operation to a broadcast join. > If a Filter can prune some partitions, Hive prunes them before deciding whether > to use a broadcast join, based on the HDFS size of the partitions involved in the > query. Spark SQL should do the same, as it would improve join performance for > partitioned tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
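The proposal above -- when metastore statistics are missing, fall back to the file-system size of only the pruned partitions and broadcast if that total fits under the threshold -- can be sketched as follows (the function names, walk-based sizing, and 10 MB default are illustrative assumptions, not Spark's implementation):

```python
import os

def partitions_total_size(partition_dirs):
    """Sum on-disk file sizes for only the partitions a query touches."""
    total = 0
    for d in partition_dirs:
        for root, _subdirs, files in os.walk(d):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
    return total

def can_broadcast(partition_dirs, threshold_bytes=10 * 1024 * 1024):
    # Mirrors spark.sql.autoBroadcastJoinThreshold (default 10 MB):
    # broadcast only when the pruned partitions fit under the threshold.
    return partitions_total_size(partition_dirs) <= threshold_bytes
```

The point of the change is the input: sizing only the directories left after partition pruning, rather than the whole table, lets many more filtered joins qualify for broadcast.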