[jira] [Created] (SPARK-18184) INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot handle partitions with custom locations

2016-10-31 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18184:
--

 Summary: INSERT [INTO|OVERWRITE] TABLE ... PARTITION for 
Datasource tables cannot handle partitions with custom locations
 Key: SPARK-18184
 URL: https://issues.apache.org/jira/browse/SPARK-18184
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Eric Liang


As part of https://issues.apache.org/jira/browse/SPARK-17861 we should support 
this case as well now that Datasource table partitions can have custom 
locations.
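A minimal sketch of the affected scenario (assuming a Spark session named {{spark}}; the table name and custom partition location below are made up for illustration):

{code}
// Datasource (parquet) table partitioned by `p`, with one partition mapped to a
// custom location outside the table's default directory.
spark.sql("CREATE TABLE src (id INT, p INT) USING parquet PARTITIONED BY (p)")
spark.sql("ALTER TABLE src ADD PARTITION (p = 1) LOCATION '/tmp/custom_p1'")

// Per this issue, writes like the one below do not yet respect the custom
// partition location.
spark.sql("INSERT OVERWRITE TABLE src PARTITION (p = 1) SELECT 10")
{code}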






[jira] [Updated] (SPARK-18183) INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition

2016-10-31 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18183:
---
Component/s: SQL

> INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource 
> table instead of just the specified partition
> ---
>
> Key: SPARK-18183
> URL: https://issues.apache.org/jira/browse/SPARK-18183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>
> In Hive, this will only overwrite the specified partition. We should fix this 
> for Datasource tables to be more in line with the Hive behavior.






[jira] [Updated] (SPARK-17972) Query planning slows down dramatically for large query plans even when sub-trees are cached

2016-10-31 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17972:
-
Target Version/s: 2.1.0  (was: 2.0.2, 2.1.0)

> Query planning slows down dramatically for large query plans even when 
> sub-trees are cached
> ---
>
> Key: SPARK-17972
> URL: https://issues.apache.org/jira/browse/SPARK-17972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> The following Spark shell snippet creates a series of query plans that grow 
> exponentially. The {{i}}-th plan is created using 4 *cached* copies of the 
> {{i - 1}}-th plan.
> {code}
> (0 until 6).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) =>
>   val start = System.currentTimeMillis()
>   val result = plan.join(plan, "value").join(plan, "value").join(plan, 
> "value").join(plan, "value")
>   result.cache()
>   System.out.println(s"Iteration $iteration takes time 
> ${System.currentTimeMillis() - start} ms")
>   result.as[Int]
> }
> {code}
> We can see that although all plans are cached, the query planning time still 
> grows exponentially and quickly becomes unbearable.
> {noformat}
> Iteration 0 takes time 9 ms
> Iteration 1 takes time 19 ms
> Iteration 2 takes time 61 ms
> Iteration 3 takes time 219 ms
> Iteration 4 takes time 830 ms
> Iteration 5 takes time 4080 ms
> {noformat}
> Similar scenarios can be found in iterative ML code, where this significantly 
> affects usability.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-10-31 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623661#comment-15623661
 ] 

Seth Hendrickson commented on SPARK-15784:
--

This seems like it fits the framework of a feature transformer. We could 
generate a real-valued feature column using the PIC algorithm, where the values 
are just the components of the pseudo-eigenvector. Alternatively, we could 
pipeline a KMeans clustering on the end, but I think it makes more sense to let 
users do that themselves - that's up for debate, though.
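For context, a sketch of what is available today in spark.mllib (here {{similarities}} stands for an RDD[(Long, Long, Double)] of pairwise similarities); it only exposes the final cluster assignments, whereas the transformer idea above would surface the pseudo-eigenvector values as a feature column:

{code}
import org.apache.spark.mllib.clustering.PowerIterationClustering

// Today: spark.mllib PIC returns only (vertexId, clusterId) assignments.
val model = new PowerIterationClustering()
  .setK(3)
  .setMaxIterations(20)
  .run(similarities)
model.assignments.take(5).foreach(a => println(s"${a.id} -> ${a.cluster}"))
// A spark.ml feature transformer as discussed above would instead (or also)
// expose each vertex's pseudo-eigenvector component as a real-valued column.
{code}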

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Resolved] (SPARK-18167) Flaky test when hive partition pruning is enabled

2016-10-31 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-18167.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15701
[https://github.com/apache/spark/pull/15701]

> Flaky test when hive partition pruning is enabled
> -
>
> Key: SPARK-18167
> URL: https://issues.apache.org/jira/browse/SPARK-18167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
> Fix For: 2.1.0
>
>
> org.apache.spark.sql.hive.execution.SQLQuerySuite is flaking when hive 
> partition pruning is enabled.
> Based on the stack traces, it seems to be an old issue where Hive fails to 
> cast a numeric partition column ("Invalid character string format for type 
> DECIMAL"). There are two possibilities here: either we are somehow corrupting 
> the partition table to have non-decimal values in that column, or there is a 
> transient issue with Derby.
> {code}
> Error Message  java.lang.reflect.InvocationTargetException: null Stacktrace  
> sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null 
>at sun.reflect.GeneratedMethodAccessor263.invoke(Unknown Source)at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497) at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:588)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:544)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:542)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:542)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:702)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:686)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
>at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:686)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:769)
> at 
> org.apache.spark.sql.execution.datasources.TableFileCatalog.filterPartitions(TableFileCatalog.scala:67)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:59)
> at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:26)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
>at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:25)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
> at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
> at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)  
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>  at scala.collection.immutable.List.foreach(List.scala:381)  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> 

[jira] [Updated] (SPARK-18167) Flaky test when hive partition pruning is enabled

2016-10-31 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-18167:
-
Assignee: Eric Liang

> Flaky test when hive partition pruning is enabled
> -
>
> Key: SPARK-18167
> URL: https://issues.apache.org/jira/browse/SPARK-18167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.1.0
>
>
> org.apache.spark.sql.hive.execution.SQLQuerySuite is flaking when hive 
> partition pruning is enabled.
> Based on the stack traces, it seems to be an old issue where Hive fails to 
> cast a numeric partition column ("Invalid character string format for type 
> DECIMAL"). There are two possibilities here: either we are somehow corrupting 
> the partition table to have non-decimal values in that column, or there is a 
> transient issue with Derby.
> {code}
> Error Message  java.lang.reflect.InvocationTargetException: null Stacktrace  
> sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null 
>at sun.reflect.GeneratedMethodAccessor263.invoke(Unknown Source)at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497) at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:588)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:544)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:542)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:542)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:702)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:686)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
>at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:686)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:769)
> at 
> org.apache.spark.sql.execution.datasources.TableFileCatalog.filterPartitions(TableFileCatalog.scala:67)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:59)
> at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:26)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
>at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:25)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
> at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
> at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)  
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>  at scala.collection.immutable.List.foreach(List.scala:381)  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> 

[jira] [Created] (SPARK-18185) Should disallow INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions

2016-10-31 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18185:
--

 Summary: Should disallow INSERT OVERWRITE TABLE of Datasource 
tables with dynamic partitions
 Key: SPARK-18185
 URL: https://issues.apache.org/jira/browse/SPARK-18185
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Eric Liang


As of the current 2.1 branch, INSERT OVERWRITE with dynamic partitions against a 
Datasource table will overwrite the entire table instead of only the updated 
partitions, as Hive does.

This is non-trivial to fix in 2.1, so we should throw an exception here.
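For reference, a sketch of the kind of write that would be disallowed (table and column names are illustrative; {{src}} is a placeholder for any table with matching columns):

{code}
spark.sql("CREATE TABLE dst (id INT, p INT) USING parquet PARTITIONED BY (p)")

// `p` has no static value, so this is a dynamic-partition overwrite; per this
// issue it should fail with an exception in 2.1 rather than silently rewriting
// the whole table.
spark.sql("INSERT OVERWRITE TABLE dst PARTITION (p) SELECT id, p FROM src")
{code}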






[jira] [Commented] (SPARK-18167) Flaky test when hive partition pruning is enabled

2016-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623499#comment-15623499
 ] 

Apache Spark commented on SPARK-18167:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/15701

> Flaky test when hive partition pruning is enabled
> -
>
> Key: SPARK-18167
> URL: https://issues.apache.org/jira/browse/SPARK-18167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>
> org.apache.spark.sql.hive.execution.SQLQuerySuite is flaking when hive 
> partition pruning is enabled.
> Based on the stack traces, it seems to be an old issue where Hive fails to 
> cast a numeric partition column ("Invalid character string format for type 
> DECIMAL"). There are two possibilities here: either we are somehow corrupting 
> the partition table to have non-decimal values in that column, or there is a 
> transient issue with Derby.
> {code}
> Error Message  java.lang.reflect.InvocationTargetException: null Stacktrace  
> sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null 
>at sun.reflect.GeneratedMethodAccessor263.invoke(Unknown Source)at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497) at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:588)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:544)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:542)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:542)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:702)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:686)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
>at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:686)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:769)
> at 
> org.apache.spark.sql.execution.datasources.TableFileCatalog.filterPartitions(TableFileCatalog.scala:67)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:59)
> at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:26)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
>at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:25)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
> at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
> at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)  
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>  at scala.collection.immutable.List.foreach(List.scala:381)  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> 

[jira] [Created] (SPARK-18186) Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation support

2016-10-31 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-18186:
--

 Summary: Migrate HiveUDAFFunction to TypedImperativeAggregate for 
partial aggregation support
 Key: SPARK-18186
 URL: https://issues.apache.org/jira/browse/SPARK-18186
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.1, 1.6.2
Reporter: Cheng Lian
Assignee: Cheng Lian


Currently, Hive UDAFs in Spark SQL don't support partial aggregation. Any query 
involving any Hive UDAFs has to fall back to {{SortAggregateExec}} without 
partial aggregation.

This issue can be fixed by migrating {{HiveUDAFFunction}} to 
{{TypedImperativeAggregate}}, which already provides partial aggregation 
support for aggregate functions that may use arbitrary Java objects as 
aggregation states.
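One way to observe the current fallback (a sketch only, assuming a Hive-enabled session; the function and table names are illustrative, and {{GenericUDAFSum}} is just a convenient built-in Hive UDAF class):

{code}
spark.sql(
  """CREATE TEMPORARY FUNCTION hive_sum
    |AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum'""".stripMargin)

// The physical plan shows SortAggregate with no partial-aggregation stage; after
// migrating HiveUDAFFunction to TypedImperativeAggregate, partial aggregation
// should become possible for such queries.
spark.sql("SELECT key, hive_sum(value) FROM kv GROUP BY key").explain()
{code}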






[jira] [Commented] (SPARK-17732) ALTER TABLE DROP PARTITION should support comparators

2016-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623707#comment-15623707
 ] 

Apache Spark commented on SPARK-17732:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15704

> ALTER TABLE DROP PARTITION should support comparators
> -
>
> Key: SPARK-17732
> URL: https://issues.apache.org/jira/browse/SPARK-17732
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dongjoon Hyun
>
> This issue aims to support `comparators`, e.g. '<', '<=', '>', '>=', again in 
> Apache Spark 2.0 for backward compatibility.
> *Spark 1.6.2*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> {code}
> *Spark 2.0*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '<' expecting {')', ','}(line 1, pos 42)
> {code}






[jira] [Commented] (SPARK-17964) Enable SparkR with Mesos client mode

2016-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623427#comment-15623427
 ] 

Apache Spark commented on SPARK-17964:
--

User 'susanxhuynh' has created a pull request for this issue:
https://github.com/apache/spark/pull/15700

> Enable SparkR with Mesos client mode
> 
>
> Key: SPARK-17964
> URL: https://issues.apache.org/jira/browse/SPARK-17964
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Sun Rui
>







[jira] [Assigned] (SPARK-17964) Enable SparkR with Mesos client mode

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17964:


Assignee: Apache Spark

> Enable SparkR with Mesos client mode
> 
>
> Key: SPARK-17964
> URL: https://issues.apache.org/jira/browse/SPARK-17964
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Sun Rui
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-18186) Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation support

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18186:


Assignee: Apache Spark  (was: Cheng Lian)

> Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation 
> support
> 
>
> Key: SPARK-18186
> URL: https://issues.apache.org/jira/browse/SPARK-18186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> Currently, Hive UDAFs in Spark SQL don't support partial aggregation. Any 
> query involving any Hive UDAFs has to fall back to {{SortAggregateExec}} 
> without partial aggregation.
> This issue can be fixed by migrating {{HiveUDAFFunction}} to 
> {{TypedImperativeAggregate}}, which already provides partial aggregation 
> support for aggregate functions that may use arbitrary Java objects as 
> aggregation states.






[jira] [Assigned] (SPARK-18186) Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation support

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18186:


Assignee: Cheng Lian  (was: Apache Spark)

> Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation 
> support
> 
>
> Key: SPARK-18186
> URL: https://issues.apache.org/jira/browse/SPARK-18186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Currently, Hive UDAFs in Spark SQL don't support partial aggregation. Any 
> query involving any Hive UDAFs has to fall back to {{SortAggregateExec}} 
> without partial aggregation.
> This issue can be fixed by migrating {{HiveUDAFFunction}} to 
> {{TypedImperativeAggregate}}, which already provides partial aggregation 
> support for aggregate functions that may use arbitrary Java objects as 
> aggregation states.






[jira] [Commented] (SPARK-18186) Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation support

2016-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623688#comment-15623688
 ] 

Apache Spark commented on SPARK-18186:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/15703

> Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation 
> support
> 
>
> Key: SPARK-18186
> URL: https://issues.apache.org/jira/browse/SPARK-18186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Currently, Hive UDAFs in Spark SQL don't support partial aggregation. Any 
> query involving any Hive UDAFs has to fall back to {{SortAggregateExec}} 
> without partial aggregation.
> This issue can be fixed by migrating {{HiveUDAFFunction}} to 
> {{TypedImperativeAggregate}}, which already provides partial aggregation 
> support for aggregate functions that may use arbitrary Java objects as 
> aggregation states.






[jira] [Resolved] (SPARK-18030) Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite

2016-10-31 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-18030.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.2

> Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite 
> -
>
> Key: SPARK-18030
> URL: https://issues.apache.org/jira/browse/SPARK-18030
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Davies Liu
>Assignee: Shixiong Zhu
> Fix For: 2.0.2, 2.1.0
>
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.FileStreamSourceSuite_name=when+schema+inference+is+turned+on%2C+should+read+partition+data






[jira] [Created] (SPARK-18188) Add checksum for block in Spark

2016-10-31 Thread Davies Liu (JIRA)
Davies Liu created SPARK-18188:
--

 Summary: Add checksum for block in Spark
 Key: SPARK-18188
 URL: https://issues.apache.org/jira/browse/SPARK-18188
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu
Assignee: Davies Liu


There has been a long-standing issue here: 
https://issues.apache.org/jira/browse/SPARK-4105. Without any checksum for the 
blocks, it's very hard for us to identify where a bug came from.

We should have a way to check whether a block fetched from a remote node or read 
from disk is correct or not.
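The idea in its simplest form, as a generic sketch (this is not Spark's eventual implementation, just an illustration of checksumming a block of bytes):

{code}
import java.util.zip.CRC32

// The writer records a checksum next to the block; a reader (remote fetch or
// local disk) recomputes it and compares, so corruption can be detected and
// attributed to the right place.
def blockChecksum(bytes: Array[Byte]): Long = {
  val crc = new CRC32()
  crc.update(bytes, 0, bytes.length)
  crc.getValue
}

val block = Array.fill[Byte](1024)(42)
val stored = blockChecksum(block)
val corrupted = stored != blockChecksum(block)  // false unless the bytes changed
{code}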






[jira] [Created] (SPARK-18187) CompactibleFileStreamLog should not rely on "compactInterval" to detect a compaction batch

2016-10-31 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-18187:


 Summary: CompactibleFileStreamLog should not rely on 
"compactInterval" to detect a compaction batch
 Key: SPARK-18187
 URL: https://issues.apache.org/jira/browse/SPARK-18187
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.0.1
Reporter: Shixiong Zhu


Right now CompactibleFileStreamLog uses compactInterval to check whether a batch 
is a compaction batch. However, since this conf is controlled by the user, it may 
change between runs, and CompactibleFileStreamLog will then silently return the 
wrong answer.
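A simplified illustration of why this is fragile (the arithmetic below mirrors the kind of batch-id check the log performs; the exact helper in CompactibleFileStreamLog may differ):

{code}
// Treat batch ids whose (id + 1) is a multiple of compactInterval as
// compaction batches.
def isCompactionBatch(batchId: Long, compactInterval: Int): Boolean =
  (batchId + 1) % compactInterval == 0

// With compactInterval = 5, batches 4 and 9 were written as compactions ...
(0L to 9L).filter(isCompactionBatch(_, 5))  // 4, 9
// ... but if the user later changes the conf to 4, batch 9 is no longer
// recognized as a compaction batch.
(0L to 9L).filter(isCompactionBatch(_, 4))  // 3, 7
{code}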






[jira] [Assigned] (SPARK-18183) INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18183:


Assignee: (was: Apache Spark)

> INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource 
> table instead of just the specified partition
> ---
>
> Key: SPARK-18183
> URL: https://issues.apache.org/jira/browse/SPARK-18183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>
> In Hive, this will only overwrite the specified partition. We should fix this 
> for Datasource tables to be more in line with the Hive behavior.






[jira] [Assigned] (SPARK-18183) INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18183:


Assignee: Apache Spark

> INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource 
> table instead of just the specified partition
> ---
>
> Key: SPARK-18183
> URL: https://issues.apache.org/jira/browse/SPARK-18183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Apache Spark
>
> In Hive, this will only overwrite the specified partition. We should fix this 
> for Datasource tables to be more in line with the Hive behavior.






[jira] [Commented] (SPARK-18184) INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot handle partitions with custom locations

2016-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623811#comment-15623811
 ] 

Apache Spark commented on SPARK-18184:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/15705

> INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot 
> handle partitions with custom locations
> 
>
> Key: SPARK-18184
> URL: https://issues.apache.org/jira/browse/SPARK-18184
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>
> As part of https://issues.apache.org/jira/browse/SPARK-17861 we should 
> support this case as well now that Datasource table partitions can have 
> custom locations.






[jira] [Assigned] (SPARK-18184) INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot handle partitions with custom locations

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18184:


Assignee: Apache Spark

> INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot 
> handle partitions with custom locations
> 
>
> Key: SPARK-18184
> URL: https://issues.apache.org/jira/browse/SPARK-18184
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Apache Spark
>
> As part of https://issues.apache.org/jira/browse/SPARK-17861 we should 
> support this case as well now that Datasource table partitions can have 
> custom locations.






[jira] [Assigned] (SPARK-18184) INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot handle partitions with custom locations

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18184:


Assignee: (was: Apache Spark)

> INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot 
> handle partitions with custom locations
> 
>
> Key: SPARK-18184
> URL: https://issues.apache.org/jira/browse/SPARK-18184
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>
> As part of https://issues.apache.org/jira/browse/SPARK-17861 we should 
> support this case as well now that Datasource table partitions can have 
> custom locations.






[jira] [Commented] (SPARK-18183) INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition

2016-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623809#comment-15623809
 ] 

Apache Spark commented on SPARK-18183:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/15705

> INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource 
> table instead of just the specified partition
> ---
>
> Key: SPARK-18183
> URL: https://issues.apache.org/jira/browse/SPARK-18183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>
> In Hive, this will only overwrite the specified partition. We should fix this 
> for Datasource tables to be more in line with the Hive behavior.






[jira] [Assigned] (SPARK-17964) Enable SparkR with Mesos client mode

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17964:


Assignee: (was: Apache Spark)

> Enable SparkR with Mesos client mode
> 
>
> Key: SPARK-17964
> URL: https://issues.apache.org/jira/browse/SPARK-17964
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Sun Rui
>







[jira] [Resolved] (SPARK-18143) History Server is broken because of the refactoring work in Structured Streaming

2016-10-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18143.
-
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 2.1.0
   2.0.3

> History Server is broken because of the refactoring work in Structured 
> Streaming
> 
>
> Key: SPARK-18143
> URL: https://issues.apache.org/jira/browse/SPARK-18143
> Project: Spark
>  Issue Type: Sub-task
>Affects Versions: 2.0.2
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
> Fix For: 2.0.3, 2.1.0
>
>







[jira] [Created] (SPARK-18183) INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition

2016-10-31 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18183:
--

 Summary: INSERT OVERWRITE TABLE ... PARTITION will overwrite the 
entire Datasource table instead of just the specified partition
 Key: SPARK-18183
 URL: https://issues.apache.org/jira/browse/SPARK-18183
 Project: Spark
  Issue Type: Bug
Reporter: Eric Liang


In Hive, this will only overwrite the specified partition. We should fix this 
for Datasource tables to be more in line with the Hive behavior.
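A sketch of the behavior gap (names are illustrative): under Hive semantics only partition {{p = 1}} is rewritten, but for a Datasource table the whole table is currently overwritten.

{code}
spark.sql("CREATE TABLE t (id INT, p INT) USING parquet PARTITIONED BY (p)")
spark.sql("INSERT INTO t PARTITION (p = 1) SELECT 1")
spark.sql("INSERT INTO t PARTITION (p = 2) SELECT 2")

// Expected (Hive behavior): only partition p = 1 is replaced.
// Actual (this issue): the entire table is overwritten.
spark.sql("INSERT OVERWRITE TABLE t PARTITION (p = 1) SELECT 10")
spark.sql("SELECT * FROM t ORDER BY p").show()
{code}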






[jira] [Commented] (SPARK-17972) Query planning slows down dramatically for large query plans even when sub-trees are cached

2016-10-31 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623363#comment-15623363
 ] 

Yin Huai commented on SPARK-17972:
--

This issue has been resolved by https://github.com/apache/spark/pull/15651.

> Query planning slows down dramatically for large query plans even when 
> sub-trees are cached
> ---
>
> Key: SPARK-17972
> URL: https://issues.apache.org/jira/browse/SPARK-17972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.1.0
>
>
> The following Spark shell snippet creates a series of query plans that grow 
> exponentially. The {{i}}-th plan is created using 4 *cached* copies of the 
> {{i - 1}}-th plan.
> {code}
> (0 until 6).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) =>
>   val start = System.currentTimeMillis()
>   val result = plan.join(plan, "value").join(plan, "value").join(plan, 
> "value").join(plan, "value")
>   result.cache()
>   System.out.println(s"Iteration $iteration takes time 
> ${System.currentTimeMillis() - start} ms")
>   result.as[Int]
> }
> {code}
> We can see that although all plans are cached, the query planning time still 
> grows exponentially and quickly becomes unbearable.
> {noformat}
> Iteration 0 takes time 9 ms
> Iteration 1 takes time 19 ms
> Iteration 2 takes time 61 ms
> Iteration 3 takes time 219 ms
> Iteration 4 takes time 830 ms
> Iteration 5 takes time 4080 ms
> {noformat}
> Similar scenarios can be found in iterative ML code, where this significantly 
> affects usability.






[jira] [Resolved] (SPARK-17972) Query planning slows down dramatically for large query plans even when sub-trees are cached

2016-10-31 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-17972.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

> Query planning slows down dramatically for large query plans even when 
> sub-trees are cached
> ---
>
> Key: SPARK-17972
> URL: https://issues.apache.org/jira/browse/SPARK-17972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.1.0
>
>
> The following Spark shell snippet creates a series of query plans that grow 
> exponentially. The {{i}}-th plan is created using 4 *cached* copies of the 
> {{i - 1}}-th plan.
> {code}
> (0 until 6).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) =>
>   val start = System.currentTimeMillis()
>   val result = plan.join(plan, "value").join(plan, "value").join(plan, 
> "value").join(plan, "value")
>   result.cache()
>   System.out.println(s"Iteration $iteration takes time 
> ${System.currentTimeMillis() - start} ms")
>   result.as[Int]
> }
> {code}
> We can see that although all plans are cached, the query planning time still 
> grows exponentially and quickly becomes unbearable.
> {noformat}
> Iteration 0 takes time 9 ms
> Iteration 1 takes time 19 ms
> Iteration 2 takes time 61 ms
> Iteration 3 takes time 219 ms
> Iteration 4 takes time 830 ms
> Iteration 5 takes time 4080 ms
> {noformat}
> Similar scenarios can be found in iterative ML code, where this significantly 
> affects usability.






[jira] [Commented] (SPARK-18124) Implement watermarking for handling late data

2016-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623603#comment-15623603
 ] 

Apache Spark commented on SPARK-18124:
--

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/15702

> Implement watermarking for handling late data
> -
>
> Key: SPARK-18124
> URL: https://issues.apache.org/jira/browse/SPARK-18124
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Michael Armbrust
>
> Whenever we aggregate data by event time, we need to handle data that arrives 
> late and out of order with respect to its event time. Since we keep the 
> aggregates keyed by time as state, that state will grow without bound if we 
> keep all old aggregates around in an attempt to handle arbitrarily late data. 
> Since the state is stored in memory, we have to prevent this unbounded 
> build-up. Hence, we need a watermarking mechanism by which we mark data older 
> than a threshold as “too late” and stop updating the aggregates with it. This 
> allows us to remove old aggregates that will never be updated again, thus 
> bounding the size of the state.
> Here is the design doc - 
> https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/edit?usp=sharing
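As a sketch of the kind of API the design doc proposes (illustrative only, since the exact method and semantics are what this ticket implements; {{events}} stands for a streaming Dataset with a timestamp column {{eventTime}}):

{code}
import org.apache.spark.sql.functions.{col, window}

// Aggregate by event time, but let the engine drop state for windows older than
// a lateness threshold so that the keyed aggregation state stays bounded.
val counts = events
  .withWatermark("eventTime", "10 minutes")          // proposed lateness threshold
  .groupBy(window(col("eventTime"), "5 minutes"))
  .count()
{code}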






[jira] [Assigned] (SPARK-18124) Implement watermarking for handling late data

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18124:


Assignee: Apache Spark  (was: Michael Armbrust)

> Implement watermarking for handling late data
> -
>
> Key: SPARK-18124
> URL: https://issues.apache.org/jira/browse/SPARK-18124
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> Whenever we aggregate data by event time, we need to handle data that arrives 
> late and out of order with respect to its event time. Since we keep the 
> aggregates keyed by time as state, that state will grow without bound if we 
> keep all old aggregates around in an attempt to handle arbitrarily late data. 
> Since the state is stored in memory, we have to prevent this unbounded 
> build-up. Hence, we need a watermarking mechanism by which we mark data older 
> than a threshold as “too late” and stop updating the aggregates with it. This 
> allows us to remove old aggregates that will never be updated again, thus 
> bounding the size of the state.
> Here is the design doc - 
> https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/edit?usp=sharing






[jira] [Assigned] (SPARK-18124) Implement watermarking for handling late data

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18124:


Assignee: Michael Armbrust  (was: Apache Spark)

> Implement watermarking for handling late data
> -
>
> Key: SPARK-18124
> URL: https://issues.apache.org/jira/browse/SPARK-18124
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Michael Armbrust
>
> Whenever we aggregate data by event time, we need to handle data that arrives 
> late and out of order with respect to its event time. Since we keep the 
> aggregates keyed by time as state, that state will grow without bound if we 
> keep all old aggregates around in an attempt to handle arbitrarily late data. 
> Since the state is stored in memory, we have to prevent this unbounded 
> build-up. Hence, we need a watermarking mechanism by which we mark data older 
> than a threshold as “too late” and stop updating the aggregates with it. This 
> allows us to remove old aggregates that will never be updated again, thus 
> bounding the size of the state.
> Here is the design doc - 
> https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/edit?usp=sharing






[jira] [Commented] (SPARK-15472) Add support for writing partitioned `csv`, `json`, `text` formats in Structured Streaming

2016-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15624475#comment-15624475
 ] 

Apache Spark commented on SPARK-15472:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13575

> Add support for writing partitioned `csv`, `json`, `text` formats in 
> Structured Streaming
> -
>
> Key: SPARK-15472
> URL: https://issues.apache.org/jira/browse/SPARK-15472
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin(Inactive)
>Assignee: Reynold Xin
>
> Support for the partitioned `parquet` format in FileStreamSink was added in 
> SPARK-14716; now let's add support for the partitioned `csv`, `json`, and 
> `text` formats.
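The target usage, sketched ({{events}} stands for a streaming DataFrame; paths and the partition column are illustrative):

{code}
// The same partitionBy support the parquet file sink gained in SPARK-14716,
// but for the json (or csv/text) format.
val query = events.writeStream
  .format("json")
  .partitionBy("date")
  .option("checkpointLocation", "/tmp/checkpoints/json-sink")
  .start("/tmp/output/json")
{code}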






[jira] [Updated] (SPARK-17992) HiveClient.getPartitionsByFilter throws an exception for some unsupported filters when hive.metastore.try.direct.sql=false

2016-10-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17992:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-17861

> HiveClient.getPartitionsByFilter throws an exception for some unsupported 
> filters when hive.metastore.try.direct.sql=false
> --
>
> Key: SPARK-17992
> URL: https://issues.apache.org/jira/browse/SPARK-17992
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Allman
>
> We recently added (and enabled by default) table partition pruning for 
> partitioned Hive tables converted to using {{TableFileCatalog}}. When the 
> Hive configuration option {{hive.metastore.try.direct.sql}} is set to 
> {{false}}, Hive will throw an exception for unsupported filter expressions. 
> For example, attempting to filter on an integer partition column will throw a 
> {{org.apache.hadoop.hive.metastore.api.MetaException}}.
> I discovered this behavior because VideoAmp uses the CDH version of Hive with 
> a Postgresql metastore DB. In this configuration, CDH sets 
> {{hive.metastore.try.direct.sql}} to {{false}} by default, and queries that 
> filter on a non-string partition column will fail. That would be a rather 
> rude surprise for these Spark 2.1 users...
> I'm not sure exactly what behavior we should expect, but I suggest that 
> {{HiveClientImpl.getPartitionsByFilter}} catch this metastore exception and 
> return all partitions instead. This is what Spark does for Hive 0.12 users, 
> which does not support this feature at all.
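The suggested fallback, reduced to its generic shape (a sketch only; the actual change would live inside HiveClientImpl.getPartitionsByFilter and the Hive shim):

{code}
import org.apache.hadoop.hive.metastore.api.MetaException

// Push the filter down to the metastore; if the metastore rejects it (as it can
// when hive.metastore.try.direct.sql=false), fall back to fetching all
// partitions and let Spark prune them client-side.
def withMetastoreFallback[T](filtered: => Seq[T])(all: => Seq[T]): Seq[T] =
  try filtered catch { case _: MetaException => all }
{code}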






[jira] [Updated] (SPARK-18184) INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot handle partitions with custom locations

2016-10-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18184:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-17861

> INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot 
> handle partitions with custom locations
> 
>
> Key: SPARK-18184
> URL: https://issues.apache.org/jira/browse/SPARK-18184
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>
> As part of https://issues.apache.org/jira/browse/SPARK-17861 we should 
> support this case as well now that Datasource table partitions can have 
> custom locations.






[jira] [Updated] (SPARK-18173) data source tables should support truncating partition

2016-10-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18173:

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-17861

> data source tables should support truncating partition
> --
>
> Key: SPARK-18173
> URL: https://issues.apache.org/jira/browse/SPARK-18173
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-10-31 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15624114#comment-15624114
 ] 

Vincent commented on SPARK-17055:
-

[~srowen] May I ask why this issue was closed? It'd be helpful for us to 
understand the current guidelines if we are to implement more features in 
ML/MLlib. Thanks.

> add labelKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> The current CrossValidator only supports k-fold, which randomly divides all the 
> samples into k groups. But when data is gathered from different subjects and we 
> want to avoid over-fitting, we want to hold out samples with certain labels from 
> the training data and put them into the validation fold, i.e. we want to ensure 
> that the same label is not in both the testing and training sets.
> Mainstream packages like scikit-learn already support such a cross-validation 
> method. 
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)
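The grouping idea, sketched with plain DataFrame operations ({{data}} stands for a DataFrame with a subject/label id in a "group" column; this is not the proposed CrossValidator API):

{code}
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

val k = 3
// Assign every row of a group to the same fold, so no group ever appears in
// both the training and validation sets.
val withFold = data.withColumn("fold", pmod(hash(col("group")), lit(k)))

val validation = withFold.filter(col("fold") === 0)
val training   = withFold.filter(col("fold") =!= 0)
{code}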






[jira] [Created] (SPARK-18189) task not serializable with groupByKey() + mapGroups() + map

2016-10-31 Thread Yang Yang (JIRA)
Yang Yang created SPARK-18189:
-

 Summary: task not serializable with groupByKey() + mapGroups() + 
map
 Key: SPARK-18189
 URL: https://issues.apache.org/jira/browse/SPARK-18189
 Project: Spark
  Issue Type: Bug
Reporter: Yang Yang


Just run the following code:

{code}
import spark.implicits._  // for the tuple encoders when not in spark-shell

val a = spark.createDataFrame(sc.parallelize(Seq((1, 2), (3, 4)))).as[(Int, Int)]
val grouped = a.groupByKey({ x: (Int, Int) => x._1 })
val mappedGroups = grouped.mapGroups((k, x) => (k, 1))
val yyy = sc.broadcast(1)
// The failure is reported when mapping over the underlying RDD and referencing
// the broadcast variable.
val last = mappedGroups.rdd.map(xx => {
  val simpley = yyy.value
  1
})
{code}

Spark reports "Task not serializable".










[jira] [Updated] (SPARK-18183) INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition

2016-10-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18183:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-17861

> INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource 
> table instead of just the specified partition
> ---
>
> Key: SPARK-18183
> URL: https://issues.apache.org/jira/browse/SPARK-18183
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>
> In Hive, this will only overwrite the specified partition. We should fix this 
> for Datasource tables to be more in line with the Hive behavior.






[jira] [Updated] (SPARK-12469) Data Property Accumulators for Spark (formerly Consistent Accumulators)

2016-10-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12469:

Target Version/s:   (was: 2.1.0)

> Data Property Accumulators for Spark (formerly Consistent Accumulators)
> ---
>
> Key: SPARK-12469
> URL: https://issues.apache.org/jira/browse/SPARK-12469
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: holdenk
>
> Tasks executed on Spark workers are unable to modify values from the driver, 
> and accumulators are the one exception for this. Accumulators in Spark are 
> implemented in such a way that when a stage is recomputed (say for cache 
> eviction) the accumulator will be updated a second time. This makes 
> accumulators inside of transformations more difficult to use for things like 
> counting invalid records (one of the primary potential use cases of 
> collecting side information during a transformation). However in some cases 
> this counting during re-evaluation is exactly the behaviour we want (say in 
> tracking total execution time for a particular function). Spark would benefit 
> from a version of accumulators which did not double count even if stages were 
> re-executed.
> Motivating example:
> {code}
> val parseTime = sc.accumulator(0L)
> val parseFailures = sc.accumulator(0L)
> val parsedData = sc.textFile(...).flatMap { line =>
>   val start = System.currentTimeMillis()
>   val parsed = Try(parse(line))
>   if (parsed.isFailure) parseFailures += 1
>   parseTime += System.currentTimeMillis() - start
>   parsed.toOption
> }
> parsedData.cache()
> val resultA = parsedData.map(...).filter(...).count()
> // some intervening code.  Almost anything could happen here -- some of 
> parsedData may
> // get kicked out of the cache, or an executor where data was cached might 
> get lost
> val resultB = parsedData.filter(...).map(...).flatMap(...).count()
> // now we look at the accumulators
> {code}
> Here we would want parseFailures to be incremented only once for every line 
> that failed to parse. Unfortunately, the current Spark accumulator API doesn't 
> support the parseFailures use case, since if some data has been evicted it is 
> possible that it will be double counted.
> See the full design document at 
> https://docs.google.com/document/d/1lR_l1g3zMVctZXrcVjFusq2iQVpr4XvRK_UUDsDr6nk/edit?usp=sharing
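Purely as an illustration of the deduplication idea (not the API proposed in the design document), a driver-side merge that ignores repeated updates from the same partition could look like this:

{code}
// Illustration only: ignore repeated updates from the same (rddId, partitionId),
// which is the behaviour a "data property" accumulator needs under re-execution.
import scala.collection.mutable

final class DedupingCounter {
  private val seen = mutable.Set.empty[(Int, Int)]   // (rddId, partitionId)
  private var total = 0L

  def merge(rddId: Int, partitionId: Int, delta: Long): Unit = synchronized {
    if (seen.add((rddId, partitionId))) total += delta
  }

  def value: Long = synchronized(total)
}
{code}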



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18025) Port streaming to use the commit protocol API

2016-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15624420#comment-15624420
 ] 

Apache Spark commented on SPARK-18025:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15710

> Port streaming to use the commit protocol API
> -
>
> Key: SPARK-18025
> URL: https://issues.apache.org/jira/browse/SPARK-18025
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18025) Port streaming to use the commit protocol API

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18025:


Assignee: (was: Apache Spark)

> Port streaming to use the commit protocol API
> -
>
> Key: SPARK-18025
> URL: https://issues.apache.org/jira/browse/SPARK-18025
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18025) Port streaming to use the commit protocol API

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18025:


Assignee: Apache Spark

> Port streaming to use the commit protocol API
> -
>
> Key: SPARK-18025
> URL: https://issues.apache.org/jira/browse/SPARK-18025
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18190) Fix R version to not the latest in AppVeyor

2016-10-31 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-18190:


 Summary: Fix R version to not the latest in AppVeyor
 Key: SPARK-18190
 URL: https://issues.apache.org/jira/browse/SPARK-18190
 Project: Spark
  Issue Type: Improvement
  Components: Build, SparkR
Reporter: Hyukjin Kwon


Currently, Spark runs its tests on Windows via AppVeyor, but it now seems to be 
failing to download R 3.3.1 after R 3.3.2 was released.

It downloads the given R version after checking via 
http://rversions.r-pkg.org/r-release whether that version is the latest one, 
because the download URL differs.

For example, the latest version has a URL of the form:

https://cran.r-project.org/bin/windows/base/R-3.3.1-win.exe

while an older version has a URL of the form:

https://cran.r-project.org/bin/windows/base/old/3.3.0/R-3.3.0-win.exe

The problem is that the Windows builds of R are not always in sync with the 
latest release.

Please check https://cloud.r-project.org

So, now that 3.3.2 is released, AppVeyor tries to find 
https://cran.r-project.org/bin/windows/base/old/3.3.1/R-3.3.1-win.exe (the URL 
pattern for old versions), which does not exist because R 3.3.2 for Windows is 
not published there yet.

It seems safer to pin a lower version, as SparkR supports R 3.1+ if I remember 
correctly.
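The URL-selection logic described above can be summarised as follows (a Scala sketch purely for illustration; the actual AppVeyor install step is not Scala, and the helper name is made up):

{code}
// Illustration of the logic described above, not the actual install script.
def cranWindowsUrl(requested: String, latest: String): String =
  if (requested == latest)
    s"https://cran.r-project.org/bin/windows/base/R-$requested-win.exe"
  else
    s"https://cran.r-project.org/bin/windows/base/old/$requested/R-$requested-win.exe"

// Once 3.3.2 is the latest release, a pinned 3.3.1 resolves to the .../old/3.3.1/
// URL, which fails if the Windows build has not been moved there yet.
cranWindowsUrl("3.3.1", latest = "3.3.2")
{code}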




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18189) task not serializable with groupByKey() + mapGroups() + map

2016-10-31 Thread Ergin Seyfe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ergin Seyfe updated SPARK-18189:

Component/s: SQL

> task not serializable with groupByKey() + mapGroups() + map
> ---
>
> Key: SPARK-18189
> URL: https://issues.apache.org/jira/browse/SPARK-18189
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yang Yang
>
> just run the following code 
> {code}
> val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)]
> val grouped = a.groupByKey({x:(Int,Int)=>x._1})
> val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)})
> val yyy = sc.broadcast(1)
> val last = mappedGroups.rdd.map(xx=>{
>   val simpley = yyy.value
>   
>   1
> })
> {code}
> spark says Task not serializable



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18190) Fix R version to not the latest in AppVeyor

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18190:


Assignee: (was: Apache Spark)

> Fix R version to not the latest in AppVeyor
> ---
>
> Key: SPARK-18190
> URL: https://issues.apache.org/jira/browse/SPARK-18190
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SparkR
>Reporter: Hyukjin Kwon
>
> Currently, Spark runs its tests on Windows via AppVeyor, but it now seems to be 
> failing to download R 3.3.1 after R 3.3.2 was released.
> It downloads the given R version after checking via 
> http://rversions.r-pkg.org/r-release whether that version is the latest one, 
> because the download URL differs.
> For example, the latest version has a URL of the form:
> https://cran.r-project.org/bin/windows/base/R-3.3.1-win.exe
> while an older version has a URL of the form:
> https://cran.r-project.org/bin/windows/base/old/3.3.0/R-3.3.0-win.exe
> The problem is that the Windows builds of R are not always in sync with the 
> latest release.
> Please check https://cloud.r-project.org
> So, now that 3.3.2 is released, AppVeyor tries to find 
> https://cran.r-project.org/bin/windows/base/old/3.3.1/R-3.3.1-win.exe (the URL 
> pattern for old versions), which does not exist because R 3.3.2 for Windows is 
> not published there yet.
> It seems safer to pin a lower version, as SparkR supports R 3.1+ if I remember 
> correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18190) Fix R version to not the latest in AppVeyor

2016-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15624236#comment-15624236
 ] 

Apache Spark commented on SPARK-18190:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15709

> Fix R version to not the latest in AppVeyor
> ---
>
> Key: SPARK-18190
> URL: https://issues.apache.org/jira/browse/SPARK-18190
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SparkR
>Reporter: Hyukjin Kwon
>
> Currently, Spark runs its tests on Windows via AppVeyor, but it now seems to be 
> failing to download R 3.3.1 after R 3.3.2 was released.
> It downloads the given R version after checking via 
> http://rversions.r-pkg.org/r-release whether that version is the latest one, 
> because the download URL differs.
> For example, the latest version has a URL of the form:
> https://cran.r-project.org/bin/windows/base/R-3.3.1-win.exe
> while an older version has a URL of the form:
> https://cran.r-project.org/bin/windows/base/old/3.3.0/R-3.3.0-win.exe
> The problem is that the Windows builds of R are not always in sync with the 
> latest release.
> Please check https://cloud.r-project.org
> So, now that 3.3.2 is released, AppVeyor tries to find 
> https://cran.r-project.org/bin/windows/base/old/3.3.1/R-3.3.1-win.exe (the URL 
> pattern for old versions), which does not exist because R 3.3.2 for Windows is 
> not published there yet.
> It seems safer to pin a lower version, as SparkR supports R 3.1+ if I remember 
> correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16827) Stop reporting spill metrics as shuffle metrics

2016-10-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16827:

Target Version/s: 2.0.3, 2.1.0
   Fix Version/s: (was: 2.0.2)
  (was: 2.1.0)

> Stop reporting spill metrics as shuffle metrics
> ---
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Brian Cho
>  Labels: performance
>
> One of our hive job which looks like this -
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ONa.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrading to Spark 2.0 the job is significantly slower. Digging a little 
> into it, we found that one of the stages produces an excessive amount of 
> shuffle data. Please note that this is a regression from Spark 1.6: stage 2 
> of the job, which used to produce 32 KB of shuffle data with 1.6, now produces 
> more than 400 GB with Spark 2.0. We also tried turning off whole-stage code 
> generation, but that did not help. 
> PS - Even if the intermediate shuffle data size is huge, the job still 
> produces accurate output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18087) Optimize insert to not require REPAIR TABLE

2016-10-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18087:

Target Version/s: 2.1.0

> Optimize insert to not require REPAIR TABLE
> ---
>
> Key: SPARK-18087
> URL: https://issues.apache.org/jira/browse/SPARK-18087
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17637) Packed scheduling for Spark tasks across executors

2016-10-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17637:

Affects Version/s: (was: 2.1.0)
 Target Version/s: 2.1.0

> Packed scheduling for Spark tasks across executors
> --
>
> Key: SPARK-17637
> URL: https://issues.apache.org/jira/browse/SPARK-17637
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Zhan Zhang
>Assignee: Zhan Zhang
>Priority: Minor
>
> Currently the Spark scheduler implements round-robin scheduling of tasks to 
> executors, which is great because it distributes the load evenly across the 
> cluster, but it leads to significant resource waste in some cases, 
> especially when dynamic allocation is enabled.
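To make the contrast concrete, here is a toy sketch of the two assignment policies (names and types are invented for illustration; this is not the scheduler code):

{code}
// Toy illustration only, not Spark's scheduler.
case class Exec(id: String, freeCores: Int)

// Round-robin: spread tasks evenly across all executors.
def roundRobin(numTasks: Int, execs: Seq[Exec]): Seq[(Int, String)] =
  (0 until numTasks).map(t => t -> execs(t % execs.size).id)

// Packed: fill one executor's free slots before touching the next, so that
// unused executors stay idle and can be released under dynamic allocation.
def packed(numTasks: Int, execs: Seq[Exec]): Seq[(Int, String)] = {
  val slots = execs.flatMap(e => Seq.fill(e.freeCores)(e.id))
  (0 until numTasks).map(t => t -> slots(t % slots.size))
}
{code}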



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18024) Introduce a commit protocol API along with OutputCommitter implementation

2016-10-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18024.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Introduce a commit protocol API along with OutputCommitter implementation
> -
>
> Key: SPARK-18024
> URL: https://issues.apache.org/jira/browse/SPARK-18024
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> This commit protocol API should wrap around Hadoop's output committer. Later 
> we can expand the API to cover streaming commits.
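A minimal sketch of what such a wrapper could look like (hypothetical trait and method names; the actual API introduced by this ticket may differ):

{code}
// Hypothetical sketch only; the real commit protocol API may differ.
import org.apache.hadoop.mapreduce.{JobContext, OutputCommitter, TaskAttemptContext}

trait CommitProtocol {
  def setupJob(ctx: JobContext): Unit
  def setupTask(ctx: TaskAttemptContext): Unit
  def commitTask(ctx: TaskAttemptContext): Unit
  def abortTask(ctx: TaskAttemptContext): Unit
  def commitJob(ctx: JobContext): Unit
}

// Wrapping Hadoop's output committer, as the description suggests.
class HadoopCommitProtocol(committer: OutputCommitter) extends CommitProtocol {
  def setupJob(ctx: JobContext): Unit = committer.setupJob(ctx)
  def setupTask(ctx: TaskAttemptContext): Unit = committer.setupTask(ctx)
  def commitTask(ctx: TaskAttemptContext): Unit = committer.commitTask(ctx)
  def abortTask(ctx: TaskAttemptContext): Unit = committer.abortTask(ctx)
  def commitJob(ctx: JobContext): Unit = committer.commitJob(ctx)
}
{code}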



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18189) task not serializable with groupByKey() + mapGroups() + map

2016-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623869#comment-15623869
 ] 

Apache Spark commented on SPARK-18189:
--

User 'seyfe' has created a pull request for this issue:
https://github.com/apache/spark/pull/15706

> task not serializable with groupByKey() + mapGroups() + map
> ---
>
> Key: SPARK-18189
> URL: https://issues.apache.org/jira/browse/SPARK-18189
> Project: Spark
>  Issue Type: Bug
>Reporter: Yang Yang
>
> just run the following code 
> val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)]
> val grouped = a.groupByKey({x:(Int,Int)=>x._1})
> val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)})
> val yyy = sc.broadcast(1)
> val last = mappedGroups.rdd.map(xx=>{
>   val simpley = yyy.value
>   
>   1
> })
> spark says Task not serializable



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18189) task not serializable with groupByKey() + mapGroups() + map

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18189:


Assignee: Apache Spark

> task not serializable with groupByKey() + mapGroups() + map
> ---
>
> Key: SPARK-18189
> URL: https://issues.apache.org/jira/browse/SPARK-18189
> Project: Spark
>  Issue Type: Bug
>Reporter: Yang Yang
>Assignee: Apache Spark
>
> just run the following code 
> val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)]
> val grouped = a.groupByKey({x:(Int,Int)=>x._1})
> val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)})
> val yyy = sc.broadcast(1)
> val last = mappedGroups.rdd.map(xx=>{
>   val simpley = yyy.value
>   
>   1
> })
> spark says Task not serializable



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18189) task not serializable with groupByKey() + mapGroups() + map

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18189:


Assignee: (was: Apache Spark)

> task not serializable with groupByKey() + mapGroups() + map
> ---
>
> Key: SPARK-18189
> URL: https://issues.apache.org/jira/browse/SPARK-18189
> Project: Spark
>  Issue Type: Bug
>Reporter: Yang Yang
>
> just run the following code 
> val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)]
> val grouped = a.groupByKey({x:(Int,Int)=>x._1})
> val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)})
> val yyy = sc.broadcast(1)
> val last = mappedGroups.rdd.map(xx=>{
>   val simpley = yyy.value
>   
>   1
> })
> spark says Task not serializable



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18189) task not serializable with groupByKey() + mapGroups() + map

2016-10-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18189:

Target Version/s: 2.1.0

> task not serializable with groupByKey() + mapGroups() + map
> ---
>
> Key: SPARK-18189
> URL: https://issues.apache.org/jira/browse/SPARK-18189
> Project: Spark
>  Issue Type: Bug
>Reporter: Yang Yang
>
> just run the following code 
> {code}
> val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)]
> val grouped = a.groupByKey({x:(Int,Int)=>x._1})
> val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)})
> val yyy = sc.broadcast(1)
> val last = mappedGroups.rdd.map(xx=>{
>   val simpley = yyy.value
>   
>   1
> })
> {code}
> spark says Task not serializable



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18189) task not serializable with groupByKey() + mapGroups() + map

2016-10-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18189:

Target Version/s: 2.0.3, 2.1.0  (was: 2.1.0)

> task not serializable with groupByKey() + mapGroups() + map
> ---
>
> Key: SPARK-18189
> URL: https://issues.apache.org/jira/browse/SPARK-18189
> Project: Spark
>  Issue Type: Bug
>Reporter: Yang Yang
>
> just run the following code 
> {code}
> val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)]
> val grouped = a.groupByKey({x:(Int,Int)=>x._1})
> val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)})
> val yyy = sc.broadcast(1)
> val last = mappedGroups.rdd.map(xx=>{
>   val simpley = yyy.value
>   
>   1
> })
> {code}
> spark says Task not serializable



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce

2016-10-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14393:

Target Version/s: 2.1.0

> monotonicallyIncreasingId not monotonically increasing with downstream 
> coalesce
> ---
>
> Key: SPARK-14393
> URL: https://issues.apache.org/jira/browse/SPARK-14393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0, 2.0.1
>Reporter: Jason Piper
>Assignee: Xiangrui Meng
>  Labels: correctness
>
> When utilising monotonicallyIncreasingId with a coalesce, it appears that 
> every partition uses the same offset (0) leading to non-monotonically 
> increasing IDs.
> See examples below
> {code}
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |25769803776|
> |51539607552|
> |77309411328|
> |   103079215104|
> |   128849018880|
> |   163208757248|
> |   188978561024|
> |   214748364800|
> |   240518168576|
> |   266287972352|
> +---+
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> +---+
> >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |  0|
> |  1|
> |  0|
> |  0|
> |  1|
> |  2|
> |  3|
> |  0|
> |  1|
> |  2|
> +---+
> {code}
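For context, the generated ID packs the partition index into the upper bits and a per-partition record counter into the lower 33 bits (per the function's documentation), which is why the coalesced output above degenerates into small counters that all share partition index 0:

{code}
// Bit layout documented for monotonically_increasing_id:
// upper bits = partition index, lower 33 bits = record number within the partition.
def expectedId(partitionIndex: Int, recordNumber: Long): Long =
  (partitionIndex.toLong << 33) + recordNumber

expectedId(3, 0)   // 25769803776, the first value in the healthy output above
expectedId(0, 7)   // 7 -- small values like those the coalesced plan produces
{code}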



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18087) Optimize insert to not require REPAIR TABLE

2016-10-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18087.
-
   Resolution: Fixed
 Assignee: Eric Liang
Fix Version/s: 2.1.0

> Optimize insert to not require REPAIR TABLE
> ---
>
> Key: SPARK-18087
> URL: https://issues.apache.org/jira/browse/SPARK-18087
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18190) Fix R version to not the latest in AppVeyor

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18190:


Assignee: Apache Spark

> Fix R version to not the latest in AppVeyor
> ---
>
> Key: SPARK-18190
> URL: https://issues.apache.org/jira/browse/SPARK-18190
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SparkR
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> Currently, Spark runs its tests on Windows via AppVeyor, but it now seems to be 
> failing to download R 3.3.1 after R 3.3.2 was released.
> It downloads the given R version after checking via 
> http://rversions.r-pkg.org/r-release whether that version is the latest one, 
> because the download URL differs.
> For example, the latest version has a URL of the form:
> https://cran.r-project.org/bin/windows/base/R-3.3.1-win.exe
> while an older version has a URL of the form:
> https://cran.r-project.org/bin/windows/base/old/3.3.0/R-3.3.0-win.exe
> The problem is that the Windows builds of R are not always in sync with the 
> latest release.
> Please check https://cloud.r-project.org
> So, now that 3.3.2 is released, AppVeyor tries to find 
> https://cran.r-project.org/bin/windows/base/old/3.3.1/R-3.3.1-win.exe (the URL 
> pattern for old versions), which does not exist because R 3.3.2 for Windows is 
> not published there yet.
> It seems safer to pin a lower version, as SparkR supports R 3.1+ if I remember 
> correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18189) task not serializable with groupByKey() + mapGroups() + map

2016-10-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18189:

Description: 
just run the following code 

{code}
val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)]
val grouped = a.groupByKey({x:(Int,Int)=>x._1})
val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)})
val yyy = sc.broadcast(1)
val last = mappedGroups.rdd.map(xx=>{
  val simpley = yyy.value
  
  1
})

{code}



spark says Task not serializable





  was:
just run the following code 

val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)]
val grouped = a.groupByKey({x:(Int,Int)=>x._1})
val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)})
val yyy = sc.broadcast(1)
val last = mappedGroups.rdd.map(xx=>{
  val simpley = yyy.value
  
  1
})




spark says Task not serializable






> task not serializable with groupByKey() + mapGroups() + map
> ---
>
> Key: SPARK-18189
> URL: https://issues.apache.org/jira/browse/SPARK-18189
> Project: Spark
>  Issue Type: Bug
>Reporter: Yang Yang
>
> just run the following code 
> {code}
> val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)]
> val grouped = a.groupByKey({x:(Int,Int)=>x._1})
> val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)})
> val yyy = sc.broadcast(1)
> val last = mappedGroups.rdd.map(xx=>{
>   val simpley = yyy.value
>   
>   1
> })
> {code}
> spark says Task not serializable



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18167) Flaky test when hive partition pruning is enabled

2016-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623959#comment-15623959
 ] 

Apache Spark commented on SPARK-18167:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/15708

> Flaky test when hive partition pruning is enabled
> -
>
> Key: SPARK-18167
> URL: https://issues.apache.org/jira/browse/SPARK-18167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.1.0
>
>
> org.apache.spark.sql.hive.execution.SQLQuerySuite is flaking when hive 
> partition pruning is enabled.
> Based on the stack traces, it seems to be an old issue where Hive fails to 
> cast a numeric partition column ("Invalid character string format for type 
> DECIMAL"). There are two possibilities here: either we are somehow corrupting 
> the partition table to have non-decimal values in that column, or there is a 
> transient issue with Derby.
> {code}
> Error Message  java.lang.reflect.InvocationTargetException: null Stacktrace  
> sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null 
>at sun.reflect.GeneratedMethodAccessor263.invoke(Unknown Source)at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497) at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:588)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:544)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:542)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:542)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:702)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:686)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
>at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:686)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:769)
> at 
> org.apache.spark.sql.execution.datasources.TableFileCatalog.filterPartitions(TableFileCatalog.scala:67)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:59)
> at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:26)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
>at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:25)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
> at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
> at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)  
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>  at scala.collection.immutable.List.foreach(List.scala:381)  at 
> 

[jira] [Commented] (SPARK-18024) Introduce a commit protocol API along with OutputCommitter implementation

2016-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623957#comment-15623957
 ] 

Apache Spark commented on SPARK-18024:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15707

> Introduce a commit protocol API along with OutputCommitter implementation
> -
>
> Key: SPARK-18024
> URL: https://issues.apache.org/jira/browse/SPARK-18024
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This commit protocol API should wrap around Hadoop's output committer. Later 
> we can expand the API to cover streaming commits.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18190) Fix R version to not the latest in AppVeyor

2016-10-31 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15624176#comment-15624176
 ] 

Shivaram Venkataraman commented on SPARK-18190:
---

Yeah using a fixed version sounds good to me.

> Fix R version to not the latest in AppVeyor
> ---
>
> Key: SPARK-18190
> URL: https://issues.apache.org/jira/browse/SPARK-18190
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SparkR
>Reporter: Hyukjin Kwon
>
> Currently, Spark runs its tests on Windows via AppVeyor, but it now seems to be 
> failing to download R 3.3.1 after R 3.3.2 was released.
> It downloads the given R version after checking via 
> http://rversions.r-pkg.org/r-release whether that version is the latest one, 
> because the download URL differs.
> For example, the latest version has a URL of the form:
> https://cran.r-project.org/bin/windows/base/R-3.3.1-win.exe
> while an older version has a URL of the form:
> https://cran.r-project.org/bin/windows/base/old/3.3.0/R-3.3.0-win.exe
> The problem is that the Windows builds of R are not always in sync with the 
> latest release.
> Please check https://cloud.r-project.org
> So, now that 3.3.2 is released, AppVeyor tries to find 
> https://cran.r-project.org/bin/windows/base/old/3.3.1/R-3.3.1-win.exe (the URL 
> pattern for old versions), which does not exist because R 3.3.2 for Windows is 
> not published there yet.
> It seems safer to pin a lower version, as SparkR supports R 3.1+ if I remember 
> correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18160) spark.files should not passed to driver in yarn-cluster mode

2016-10-31 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-18160:
---
Summary: spark.files should not passed to driver in yarn-cluster mode  
(was: SparkContext.addFile doesn't work in yarn-cluster mode)

> spark.files should not passed to driver in yarn-cluster mode
> 
>
> Key: SPARK-18160
> URL: https://issues.apache.org/jira/browse/SPARK-18160
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Jeff Zhang
>Priority: Critical
>
> The following command fails for Spark 2.0
> {noformat}
> bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
> yarn-cluster --conf spark.files=/usr/spark-client/conf/hive-site.xml 
> examples/target/original-spark-examples_2.11.jar
> {noformat}
> The above command reproduces the following error in a multi-node cluster. Note 
> that this issue only happens in a multi-node cluster; in a single-node cluster, 
> the AM uses the same local filesystem as the driver.
> {noformat}
> 16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext.
> java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml 
> does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
>   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443)
>   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415)
>   at 
> org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
>   at 
> org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.apache.spark.SparkContext.(SparkContext.scala:462)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18160) spark.files should not be passed to driver in yarn-cluster mode

2016-10-31 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-18160:
---
Summary: spark.files should not be passed to driver in yarn-cluster mode  
(was: spark.files should not passed to driver in yarn-cluster mode)

> spark.files should not be passed to driver in yarn-cluster mode
> ---
>
> Key: SPARK-18160
> URL: https://issues.apache.org/jira/browse/SPARK-18160
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Jeff Zhang
>Priority: Critical
>
> The following command fails for Spark 2.0
> {noformat}
> bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
> yarn-cluster --conf spark.files=/usr/spark-client/conf/hive-site.xml 
> examples/target/original-spark-examples_2.11.jar
> {noformat}
> The above command reproduces the following error in a multi-node cluster. Note 
> that this issue only happens in a multi-node cluster; in a single-node cluster, 
> the AM uses the same local filesystem as the driver.
> {noformat}
> 16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext.
> java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml 
> does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
>   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443)
>   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415)
>   at 
> org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
>   at 
> org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.apache.spark.SparkContext.(SparkContext.scala:462)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18187) CompactibleFileStreamLog should not rely on "compactInterval" to detect a compaction batch

2016-10-31 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15624263#comment-15624263
 ] 

Liwei Lin commented on SPARK-18187:
---

Hi [~zsxwing], how are we planning to fix this? Thanks.

> CompactibleFileStreamLog should not rely on "compactInterval" to detect a 
> compaction batch
> --
>
> Key: SPARK-18187
> URL: https://issues.apache.org/jira/browse/SPARK-18187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Shixiong Zhu
>
> Right now CompactibleFileStreamLog uses compactInterval to check if a batch 
> is a compaction batch. However, since this conf is controlled by the user, 
> they may just change it, and CompactibleFileStreamLog will just silently 
> return the wrong answer.
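One possible direction, sketched only as an illustration (the file-name marker and layout are assumptions, not necessarily the intended fix), is to derive the answer from the log files themselves rather than from the user-controlled conf:

{code}
// Illustration only; the ".compact" marker and directory layout are assumptions.
import java.nio.file.{Files, Paths}

// A batch would be treated as a compaction batch if its metadata file carries the
// marker, independent of whatever value compactInterval currently has.
def isCompactionBatch(logDir: String, batchId: Long): Boolean =
  Files.exists(Paths.get(logDir, s"$batchId.compact"))
{code}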



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18030) Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite

2016-10-31 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-18030:


Assignee: Shixiong Zhu  (was: Tathagata Das)

> Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite 
> -
>
> Key: SPARK-18030
> URL: https://issues.apache.org/jira/browse/SPARK-18030
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Davies Liu
>Assignee: Shixiong Zhu
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.FileStreamSourceSuite_name=when+schema+inference+is+turned+on%2C+should+read+partition+data



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18030) Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite

2016-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623225#comment-15623225
 ] 

Apache Spark commented on SPARK-18030:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15699

> Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite 
> -
>
> Key: SPARK-18030
> URL: https://issues.apache.org/jira/browse/SPARK-18030
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Davies Liu
>Assignee: Shixiong Zhu
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.FileStreamSourceSuite_name=when+schema+inference+is+turned+on%2C+should+read+partition+data



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18160) SparkContext.addFile doesn't work in yarn-cluster mode

2016-10-31 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-18160:
---
Description: 
The following command fails for Spark 2.0
{noformat}
bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
yarn-cluster --conf spark.files=/usr/spark-client/conf/hive-site.xml 
examples/target/original-spark-examples_2.11.jar
{noformat}
and this command fails for Spark 1.6
{noformat}
bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
yarn-cluster --files /usr/spark-client/conf/hive-site.xml 
examples/target/original-spark-examples_2.11.jar
{noformat}

The above command reproduces the following error in a multi-node cluster. Note that 
this issue only happens in a multi-node cluster; in a single-node cluster, the AM 
uses the same local filesystem as the driver.
{noformat}
16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml 
does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415)
at 
org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
at 
org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.SparkContext.(SparkContext.scala:462)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
{noformat}

  was:
{noformat}
bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
yarn-cluster --conf spark.files=/usr/spark-client/conf/hive-site.xml 
examples/target/original-spark-examples_2.11.jar
{noformat}
The above command reproduces the following error in a multi-node cluster. Note that 
this issue only happens in a multi-node cluster; in a single-node cluster, the AM 
uses the same local filesystem as the driver.
{noformat}
16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml 
does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415)
at 
org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
at 
org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.SparkContext.(SparkContext.scala:462)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at 

[jira] [Updated] (SPARK-18160) SparkContext.addFile doesn't work in yarn-cluster mode

2016-10-31 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-18160:
---
Description: 
The following command fails for Spark 2.0
{noformat}
bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
yarn-cluster --conf spark.files=/usr/spark-client/conf/hive-site.xml 
examples/target/original-spark-examples_2.11.jar
{noformat}


The above command reproduces the following error in a multi-node cluster. Note that 
this issue only happens in a multi-node cluster; in a single-node cluster, the AM 
uses the same local filesystem as the driver.
{noformat}
16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml 
does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415)
at 
org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
at 
org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.SparkContext.(SparkContext.scala:462)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
{noformat}

  was:
The following command fails for Spark 2.0
{noformat}
bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
yarn-cluster --conf spark.files=/usr/spark-client/conf/hive-site.xml 
examples/target/original-spark-examples_2.11.jar
{noformat}
and this command fails for Spark 1.6
{noformat}
bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
yarn-cluster --files /usr/spark-client/conf/hive-site.xml 
examples/target/original-spark-examples_2.11.jar
{noformat}

The above command reproduces the following error in a multi-node cluster. Note that 
this issue only happens in a multi-node cluster; in a single-node cluster, the AM 
uses the same local filesystem as the driver.
{noformat}
16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml 
does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415)
at 
org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
at 
org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.SparkContext.(SparkContext.scala:462)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at 

[jira] [Commented] (SPARK-18177) Add missing 'subsamplingRate' of pyspark GBTClassifier

2016-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15621651#comment-15621651
 ] 

Apache Spark commented on SPARK-18177:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/15692

> Add missing 'subsamplingRate' of pyspark GBTClassifier
> --
>
> Key: SPARK-18177
> URL: https://issues.apache.org/jira/browse/SPARK-18177
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: zhengruifeng
>
> Add missing 'subsamplingRate' of pyspark GBTClassifier



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18177) Add missing 'subsamplingRate' of pyspark GBTClassifier

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18177:


Assignee: (was: Apache Spark)

> Add missing 'subsamplingRate' of pyspark GBTClassifier
> --
>
> Key: SPARK-18177
> URL: https://issues.apache.org/jira/browse/SPARK-18177
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: zhengruifeng
>
> Add missing 'subsamplingRate' of pyspark GBTClassifier



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18177) Add missing 'subsamplingRate' of pyspark GBTClassifier

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18177:


Assignee: Apache Spark

> Add missing 'subsamplingRate' of pyspark GBTClassifier
> --
>
> Key: SPARK-18177
> URL: https://issues.apache.org/jira/browse/SPARK-18177
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Assignee: Apache Spark
>
> Add missing 'subsamplingRate' of pyspark GBTClassifier



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18176) Kafka010 .createRDD() scala API should expect scala Map

2016-10-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18176:
--
Priority: Minor  (was: Major)

> Kafka010 .createRDD() scala API should expect scala Map
> ---
>
> Key: SPARK-18176
> URL: https://issues.apache.org/jira/browse/SPARK-18176
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Liwei Lin
>Priority: Minor
>
> Throughout {{external/kafka-010}}, Java APIs expect {{java.util.Map}}s 
> and Scala APIs expect {{scala.collection.Map}}s, with the exception of 
> the {{KafkaUtils.createRDD()}} Scala API, which expects a {{java.util.Map}}.
> Please note that this would be a public API change.
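For illustration, a minimal sketch of the conversion callers do today under the assumption above; the remaining {{createRDD()}} arguments are omitted, and the broker addresses and settings are made up. Accepting a {{scala.collection.Map}} directly would remove this boilerplate from user code.
{code}
import scala.collection.JavaConverters._

// Kafka params kept as an idiomatic Scala Map on the caller side...
val kafkaParams: Map[String, Object] = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "group.id" -> "example-group"
)

// ...then converted to the java.util.Map that the current Scala createRDD() API
// expects (the other createRDD() arguments are omitted here).
val javaKafkaParams: java.util.Map[String, Object] = kafkaParams.asJava
{code}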



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17055) add labelKFold to CrossValidator

2016-10-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17055.
---
Resolution: Won't Fix

> add labelKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> The current CrossValidator only supports k-fold, which randomly divides all the 
> samples into k groups. But when data is gathered from different subjects and we 
> want to avoid over-fitting, we want to hold out samples with certain labels from 
> the training data and put them into the validation fold, i.e. ensure that the 
> same label does not appear in both the training and test sets.
> Mainstream packages like scikit-learn already support such a cross-validation 
> method: 
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)
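For illustration only, a minimal sketch of the idea in plain Scala (this is not the Spark ML API; the helper name, types, and round-robin fold assignment are assumptions):
{code}
// Sketch of label-based k-fold: all samples sharing a label land in the same fold,
// so a label never appears in both the training and validation sets of a split.
def labelKFold[T](samples: Seq[(String, T)], k: Int): Seq[(Seq[(String, T)], Seq[(String, T)])] = {
  // Assign each distinct label to one of the k folds (round robin).
  val foldOfLabel = samples.map(_._1).distinct.zipWithIndex
    .map { case (label, i) => label -> (i % k) }
    .toMap
  (0 until k).map { fold =>
    val (validation, training) = samples.partition { case (label, _) => foldOfLabel(label) == fold }
    (training, validation)  // (training fold, validation fold)
  }
}
{code}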



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-882) Have link for feedback/suggestions in docs

2016-10-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-882.
-
Resolution: Not A Problem

> Have link for feedback/suggestions in docs
> --
>
> Key: SPARK-882
> URL: https://issues.apache.org/jira/browse/SPARK-882
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Patrick Cogan
>Priority: Minor
>
> It would be cool to have a link at the top of the docs for 
> feedback/suggestions/errors. I bet we'd get a lot of interesting stuff from 
> that and it could be a good way to crowdsource correctness checking, since a 
> lot of us that write them never have to use them.
> Something to the right of the main top nav might be good. [~andyk] [~matei] - 
> what do you guys think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-31 Thread zhangxinyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601778#comment-15601778
 ] 

zhangxinyu edited comment on SPARK-17935 at 10/31/16 7:54 AM:
--

h2. KafkaSink Design Doc

h4. Goal
Output results to a Kafka cluster (version 0.10.0.0) from the Structured Streaming 
module.

h4. Implementation
Three classes are implemented to output data to a Kafka cluster in the Structured 
Streaming module.
* *KafkaSinkProvider*
This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* 
and overrides functions *shortName* and *createSink*. In *createSink*, 
*KafkaSink* is created.
* *KafkaSink*
KafkaSink extends *Sink* and overrides function *addBatch*.
* *CachedKafkaProducer*
*CachedKafkaProducer* is used to store producers on the executors so that these 
producers can be reused.

h4. Configuration
* *Kafka Producer Configuration*
"*.option()*" is used to set Kafka producer configurations, all of which start 
with "*kafka.*". For example, the producer configuration *bootstrap.servers* can 
be set via *.option("kafka.bootstrap.servers", kafka-servers)*.
* *Other Configuration*
Other configurations are also set via ".option()"; the difference is that they 
do not start with "kafka.".

h4. Usage
val query = input.writeStream
  .format("kafka-sink-10")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafka-servers)
  .option("topic", topic)
  .start()
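To make the *KafkaSink* idea above concrete, here is a minimal sketch of what *addBatch* could do on the executors. It is not the actual implementation: for simplicity it creates and closes a producer per partition instead of reusing one from *CachedKafkaProducer*, and the serializer settings are assumed to be present in the producer params.
{code}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.DataFrame

// Illustrative sketch only: write one batch of a streaming query to Kafka
// from the executors. The real design reuses producers via CachedKafkaProducer.
def addBatchSketch(data: DataFrame, topic: String,
                   producerParams: java.util.HashMap[String, Object]): Unit = {
  data.toJSON.foreachPartition { rows =>
    val producer = new KafkaProducer[String, String](producerParams)
    try {
      rows.foreach(json => producer.send(new ProducerRecord[String, String](topic, json)))
      producer.flush()
    } finally {
      producer.close()
    }
  }
}
{code}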




was (Author: zhangxinyu):
h2. KafkaSink Design Doc

h4. Goal
Output results to a Kafka cluster (version 0.10.0.0) from the Structured Streaming 
module.

h4. Implementation
Four classes are implemented to output data to a Kafka cluster in the Structured 
Streaming module.
* *KafkaSinkProvider*
This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* 
and overrides functions *shortName* and *createSink*. In *createSink*, 
*KafkaSink* is created.
* *KafkaSink*
KafkaSink extends *Sink* and overrides function *addBatch*. A *KafkaSinkRDD* is 
created in *addBatch*.
* *CachedKafkaProducer*
*CachedKafkaProducer* is used to store producers on the executors so that these 
producers can be reused.

h4. Configuration
* *Kafka Producer Configuration*
"*.option()*" is used to set Kafka producer configurations, all of which start 
with "*kafka.*". For example, the producer configuration *bootstrap.servers* can 
be set via *.option("kafka.bootstrap.servers", kafka-servers)*.
* *Other Configuration*
Other configurations are also set via ".option()"; the difference is that they 
do not start with "kafka.".

h4. Usage
val query = input.writeStream
  .format("kafka-sink-10")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafka-servers)
  .option("topic", topic)
  .start()



> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
>
> Now spark already supports kafkaInputStream. It would be useful that we add 
> `KafkaForeachWriter` to output results to kafka in structured streaming 
> module.
> `KafkaForeachWriter.scala` is put in external kafka-0.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-31 Thread zhangxinyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601778#comment-15601778
 ] 

zhangxinyu edited comment on SPARK-17935 at 10/31/16 7:53 AM:
--

h2. KafkaSink Design Doc

h4. Goal
Output results to a Kafka cluster (version 0.10.0.0) from the Structured Streaming 
module.

h4. Implementation
Four classes are implemented to output data to a Kafka cluster in the Structured 
Streaming module.
* *KafkaSinkProvider*
This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* 
and overrides functions *shortName* and *createSink*. In *createSink*, 
*KafkaSink* is created.
* *KafkaSink*
KafkaSink extends *Sink* and overrides function *addBatch*. A *KafkaSinkRDD* is 
created in *addBatch*.
* *CachedKafkaProducer*
*CachedKafkaProducer* is used to store producers on the executors so that these 
producers can be reused.

h4. Configuration
* *Kafka Producer Configuration*
"*.option()*" is used to set Kafka producer configurations, all of which start 
with "*kafka.*". For example, the producer configuration *bootstrap.servers* can 
be set via *.option("kafka.bootstrap.servers", kafka-servers)*.
* *Other Configuration*
Other configurations are also set via ".option()"; the difference is that they 
do not start with "kafka.".

h4. Usage
val query = input.writeStream
  .format("kafka-sink-10")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafka-servers)
  .option("topic", topic)
  .start()




was (Author: zhangxinyu):
h2. KafkaSink Design Doc

h4. Goal
Output results to a Kafka cluster (version 0.10.0.0) from the Structured Streaming 
module.

h4. Implementation
Four classes are implemented to output data to a Kafka cluster in the Structured 
Streaming module.
* *KafkaSinkProvider*
This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* 
and overrides functions *shortName* and *createSink*. In *createSink*, 
*KafkaSink* is created.
* *KafkaSink*
KafkaSink extends *Sink* and overrides function *addBatch*. A *KafkaSinkRDD* is 
created in *addBatch*.
* *KafkaSinkRDD*
*KafkaSinkRDD* is designed to send results to Kafka clusters in a distributed 
way. It extends *RDD*. In function *compute*, *CachedKafkaProducer* is called to 
get or create a producer to send the data.
* *CachedKafkaProducer*
*CachedKafkaProducer* is used to store producers on the executors so that these 
producers can be reused.

h4. Configuration
* *Kafka Producer Configuration*
"*.option()*" is used to set Kafka producer configurations, all of which start 
with "*kafka.*". For example, the producer configuration *bootstrap.servers* can 
be set via *.option("kafka.bootstrap.servers", kafka-servers)*.
* *Other Configuration*
Other configurations are also set via ".option()"; the difference is that they 
do not start with "kafka.".

h4. Usage
val query = input.writeStream
  .format("kafka-sink-10")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafka-servers)
  .option("topic", topic)
  .start()



> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
>
> Now spark already supports kafkaInputStream. It would be useful that we add 
> `KafkaForeachWriter` to output results to kafka in structured streaming 
> module.
> `KafkaForeachWriter.scala` is put in external kafka-0.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18177) Add missing 'subsamplingRate' of pyspark GBTClassifier

2016-10-31 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-18177:


 Summary: Add missing 'subsamplingRate' of pyspark GBTClassifier
 Key: SPARK-18177
 URL: https://issues.apache.org/jira/browse/SPARK-18177
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Reporter: zhengruifeng


Add missing 'subsamplingRate' of pyspark GBTClassifier
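For context, the Scala ML API already exposes this parameter; a minimal spark-shell sketch (assuming Spark 2.x):
{code}
import org.apache.spark.ml.classification.GBTClassifier

// subsamplingRate is the fraction of the training data used for learning each tree.
val gbt = new GBTClassifier()
  .setMaxIter(20)
  .setSubsamplingRate(0.8)

// Print the parameter's documentation and its current value.
println(gbt.explainParam(gbt.subsamplingRate))
{code}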



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-31 Thread zhangxinyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601778#comment-15601778
 ] 

zhangxinyu edited comment on SPARK-17935 at 10/31/16 8:04 AM:
--

h2. KafkaSink Design Doc

h4. Goal
Output results to a Kafka cluster (version 0.10.0.0) from the Structured Streaming 
module.

h4. Implementation
Three classes are implemented to output data to a Kafka cluster in the Structured 
Streaming module.
* *KafkaSinkProvider*
This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* 
and overrides functions *shortName* and *createSink*. In *createSink*, 
*KafkaSink* is created.
* *KafkaSink*
KafkaSink extends *Sink* and overrides function *addBatch*. A KafkaProducer is 
created only once per JVM for the same producer parameters. Data is sent to the 
Kafka cluster in a distributed way, similar to *ForeachSink*.
* *CachedKafkaProducer*
*CachedKafkaProducer* is used to store producers on the executors so that these 
producers can be reused.

h4. Configuration
* *Kafka Producer Configuration*
"*.option()*" is used to set Kafka producer configurations, all of which start 
with "*kafka.*". For example, the producer configuration *bootstrap.servers* can 
be set via *.option("kafka.bootstrap.servers", kafka-servers)*.
* *Other Configuration*
Other configurations are also set via ".option()"; the difference is that they 
do not start with "kafka.".

h4. Usage
val query = input.writeStream
  .format("kafka-sink-10")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafka-servers)
  .option("topic", topic)
  .start()




was (Author: zhangxinyu):
h2. KafkaSink Design Doc

h4. Goal
Output results to a Kafka cluster (version 0.10.0.0) from the Structured Streaming 
module.

h4. Implementation
Three classes are implemented to output data to a Kafka cluster in the Structured 
Streaming module.
* *KafkaSinkProvider*
This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* 
and overrides functions *shortName* and *createSink*. In *createSink*, 
*KafkaSink* is created.
* *KafkaSink*
KafkaSink extends *Sink* and overrides function *addBatch*. A KafkaProducer is 
created only once per JVM for the same producer parameters. Data is sent to the 
Kafka cluster in a distributed way, similar to *ForeachSink*.
* *CachedKafkaProducer*
*CachedKafkaProducer* is used to store producers on the executors so that these 
producers can be reused.

h4. Configuration
* *Kafka Producer Configuration*
"*.option()*" is used to set Kafka producer configurations, all of which start 
with "*kafka.*". For example, the producer configuration *bootstrap.servers* can 
be set via *.option("kafka.bootstrap.servers", kafka-servers)*.
* *Other Configuration*
Other configurations are also set via ".option()"; the difference is that they 
do not start with "kafka.".

h4. Usage
val query = input.writeStream
  .format("kafka-sink-10")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafka-servers)
  .option("topic", topic)
  .start()



> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
>
> Now spark already supports kafkaInputStream. It would be useful that we add 
> `KafkaForeachWriter` to output results to kafka in structured streaming 
> module.
> `KafkaForeachWriter.scala` is put in external kafka-0.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-31 Thread zhangxinyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601778#comment-15601778
 ] 

zhangxinyu edited comment on SPARK-17935 at 10/31/16 8:03 AM:
--

h2. KafkaSink Design Doc

h4. Goal
Output results to a Kafka cluster (version 0.10.0.0) from the Structured Streaming 
module.

h4. Implementation
Three classes are implemented to output data to a Kafka cluster in the Structured 
Streaming module.
* *KafkaSinkProvider*
This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* 
and overrides functions *shortName* and *createSink*. In *createSink*, 
*KafkaSink* is created.
* *KafkaSink*
KafkaSink extends *Sink* and overrides function *addBatch*. A KafkaProducer is 
created only once per JVM for the same producer parameters. Data is sent to the 
Kafka cluster in a distributed way, similar to *ForeachSink*.
* *CachedKafkaProducer*
*CachedKafkaProducer* is used to store producers on the executors so that these 
producers can be reused.

h4. Configuration
* *Kafka Producer Configuration*
"*.option()*" is used to set Kafka producer configurations, all of which start 
with "*kafka.*". For example, the producer configuration *bootstrap.servers* can 
be set via *.option("kafka.bootstrap.servers", kafka-servers)*.
* *Other Configuration*
Other configurations are also set via ".option()"; the difference is that they 
do not start with "kafka.".

h4. Usage
val query = input.writeStream
  .format("kafka-sink-10")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafka-servers)
  .option("topic", topic)
  .start()




was (Author: zhangxinyu):
h2. KafkaSink Design Doc

h4. Goal
Output results to a Kafka cluster (version 0.10.0.0) from the Structured Streaming 
module.

h4. Implementation
Three classes are implemented to output data to a Kafka cluster in the Structured 
Streaming module.
* *KafkaSinkProvider*
This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* 
and overrides functions *shortName* and *createSink*. In *createSink*, 
*KafkaSink* is created.
* *KafkaSink*
KafkaSink extends *Sink* and overrides function *addBatch*. Like *ForeachSink*, 
the dataset is transformed to an RDD so that data can be sent to the Kafka 
cluster in a distributed way.
* *CachedKafkaProducer*
*CachedKafkaProducer* is used to store producers on the executors so that these 
producers can be reused.

h4. Configuration
* *Kafka Producer Configuration*
"*.option()*" is used to set Kafka producer configurations, all of which start 
with "*kafka.*". For example, the producer configuration *bootstrap.servers* can 
be set via *.option("kafka.bootstrap.servers", kafka-servers)*.
* *Other Configuration*
Other configurations are also set via ".option()"; the difference is that they 
do not start with "kafka.".

h4. Usage
val query = input.writeStream
  .format("kafka-sink-10")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafka-servers)
  .option("topic", topic)
  .start()



> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
>
> Now spark already supports kafkaInputStream. It would be useful that we add 
> `KafkaForeachWriter` to output results to kafka in structured streaming 
> module.
> `KafkaForeachWriter.scala` is put in external kafka-0.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18125) Spark generated code causes CompileException when groupByKey, reduceGroups and map(_._2) are used

2016-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15621681#comment-15621681
 ] 

Apache Spark commented on SPARK-18125:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/15693

> Spark generated code causes CompileException when groupByKey, reduceGroups 
> and map(_._2) are used
> -
>
> Key: SPARK-18125
> URL: https://issues.apache.org/jira/browse/SPARK-18125
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Ray Qiu
>Priority: Critical
>
> Code logic looks like this:
> {noformat}
> .groupByKey
> .reduceGroups
> .map(_._2)
> {noformat}
> Works fine with 2.0.0.
> 2.0.1 error Message: 
> {noformat}
> Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 206, Column 123: Unknown variable or type "value4"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificMutableProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private java.lang.String errMsg;
> /* 011 */   private java.lang.String errMsg1;
> /* 012 */   private boolean MapObjects_loopIsNull1;
> /* 013 */   private io.mistnet.analytics.lib.ConnLog MapObjects_loopValue0;
> /* 014 */   private java.lang.String errMsg2;
> /* 015 */   private Object[] values1;
> /* 016 */   private boolean MapObjects_loopIsNull3;
> /* 017 */   private java.lang.String MapObjects_loopValue2;
> /* 018 */   private boolean isNull_0;
> /* 019 */   private boolean value_0;
> /* 020 */   private boolean isNull_1;
> /* 021 */   private InternalRow value_1;
> /* 022 */
> /* 023 */   private void apply_4(InternalRow i) {
> /* 024 */
> /* 025 */ boolean isNull52 = MapObjects_loopIsNull1;
> /* 026 */ final double value52 = isNull52 ? -1.0 : 
> MapObjects_loopValue0.ts();
> /* 027 */ if (isNull52) {
> /* 028 */   values1[8] = null;
> /* 029 */ } else {
> /* 030 */   values1[8] = value52;
> /* 031 */ }
> /* 032 */ boolean isNull54 = MapObjects_loopIsNull1;
> /* 033 */ final java.lang.String value54 = isNull54 ? null : 
> (java.lang.String) MapObjects_loopValue0.uid();
> /* 034 */ isNull54 = value54 == null;
> /* 035 */ boolean isNull53 = isNull54;
> /* 036 */ final UTF8String value53 = isNull53 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value54);
> /* 037 */ isNull53 = value53 == null;
> /* 038 */ if (isNull53) {
> /* 039 */   values1[9] = null;
> /* 040 */ } else {
> /* 041 */   values1[9] = value53;
> /* 042 */ }
> /* 043 */ boolean isNull56 = MapObjects_loopIsNull1;
> /* 044 */ final java.lang.String value56 = isNull56 ? null : 
> (java.lang.String) MapObjects_loopValue0.src();
> /* 045 */ isNull56 = value56 == null;
> /* 046 */ boolean isNull55 = isNull56;
> /* 047 */ final UTF8String value55 = isNull55 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value56);
> /* 048 */ isNull55 = value55 == null;
> /* 049 */ if (isNull55) {
> /* 050 */   values1[10] = null;
> /* 051 */ } else {
> /* 052 */   values1[10] = value55;
> /* 053 */ }
> /* 054 */   }
> /* 055 */
> /* 056 */
> /* 057 */   private void apply_7(InternalRow i) {
> /* 058 */
> /* 059 */ boolean isNull69 = MapObjects_loopIsNull1;
> /* 060 */ final scala.Option value69 = isNull69 ? null : (scala.Option) 
> MapObjects_loopValue0.orig_bytes();
> /* 061 */ isNull69 = value69 == null;
> /* 062 */
> /* 063 */ final boolean isNull68 = isNull69 || value69.isEmpty();
> /* 064 */ long value68 = isNull68 ?
> /* 065 */ -1L : (Long) value69.get();
> /* 066 */ if (isNull68) {
> /* 067 */   values1[17] = null;
> /* 068 */ } else {
> /* 069 */   values1[17] = value68;
> /* 070 */ }
> /* 071 */ boolean isNull71 = MapObjects_loopIsNull1;
> /* 072 */ final scala.Option value71 = isNull71 ? null : (scala.Option) 
> MapObjects_loopValue0.resp_bytes();
> /* 073 */ isNull71 = value71 == null;
> /* 074 */
> /* 075 */ final boolean isNull70 = isNull71 || value71.isEmpty();
> /* 076 */ long value70 = isNull70 ?
> /* 077 */ -1L : (Long) value71.get();
> /* 078 */ if (isNull70) {
> /* 079 */   values1[18] = null;
> /* 080 */ } else {
> /* 081 */   values1[18] = value70;
> /* 082 */ }
> /* 083 
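The quoted report only shows fragments of the failing job, so here is a minimal self-contained sketch of the reported pattern for a Spark 2.x spark-shell (the case class and data are made up for illustration):
{code}
// Minimal sketch of the groupByKey / reduceGroups / map(_._2) pattern from the report.
case class Rec(k: String, v: Long)

import spark.implicits._

val ds = Seq(Rec("a", 1L), Rec("a", 2L), Rec("b", 3L)).toDS()

val merged = ds
  .groupByKey(_.k)                              // KeyValueGroupedDataset[String, Rec]
  .reduceGroups((x, y) => Rec(x.k, x.v + y.v))  // Dataset[(String, Rec)]
  .map(_._2)                                    // keep only the reduced values

merged.show()
{code}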

[jira] [Assigned] (SPARK-18125) Spark generated code causes CompileException when groupByKey, reduceGroups and map(_._2) are used

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18125:


Assignee: Apache Spark

> Spark generated code causes CompileException when groupByKey, reduceGroups 
> and map(_._2) are used
> -
>
> Key: SPARK-18125
> URL: https://issues.apache.org/jira/browse/SPARK-18125
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Ray Qiu
>Assignee: Apache Spark
>Priority: Critical
>
> Code logic looks like this:
> {noformat}
> .groupByKey
> .reduceGroups
> .map(_._2)
> {noformat}
> Works fine with 2.0.0.
> 2.0.1 error Message: 
> {noformat}
> Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 206, Column 123: Unknown variable or type "value4"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificMutableProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private java.lang.String errMsg;
> /* 011 */   private java.lang.String errMsg1;
> /* 012 */   private boolean MapObjects_loopIsNull1;
> /* 013 */   private io.mistnet.analytics.lib.ConnLog MapObjects_loopValue0;
> /* 014 */   private java.lang.String errMsg2;
> /* 015 */   private Object[] values1;
> /* 016 */   private boolean MapObjects_loopIsNull3;
> /* 017 */   private java.lang.String MapObjects_loopValue2;
> /* 018 */   private boolean isNull_0;
> /* 019 */   private boolean value_0;
> /* 020 */   private boolean isNull_1;
> /* 021 */   private InternalRow value_1;
> /* 022 */
> /* 023 */   private void apply_4(InternalRow i) {
> /* 024 */
> /* 025 */ boolean isNull52 = MapObjects_loopIsNull1;
> /* 026 */ final double value52 = isNull52 ? -1.0 : 
> MapObjects_loopValue0.ts();
> /* 027 */ if (isNull52) {
> /* 028 */   values1[8] = null;
> /* 029 */ } else {
> /* 030 */   values1[8] = value52;
> /* 031 */ }
> /* 032 */ boolean isNull54 = MapObjects_loopIsNull1;
> /* 033 */ final java.lang.String value54 = isNull54 ? null : 
> (java.lang.String) MapObjects_loopValue0.uid();
> /* 034 */ isNull54 = value54 == null;
> /* 035 */ boolean isNull53 = isNull54;
> /* 036 */ final UTF8String value53 = isNull53 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value54);
> /* 037 */ isNull53 = value53 == null;
> /* 038 */ if (isNull53) {
> /* 039 */   values1[9] = null;
> /* 040 */ } else {
> /* 041 */   values1[9] = value53;
> /* 042 */ }
> /* 043 */ boolean isNull56 = MapObjects_loopIsNull1;
> /* 044 */ final java.lang.String value56 = isNull56 ? null : 
> (java.lang.String) MapObjects_loopValue0.src();
> /* 045 */ isNull56 = value56 == null;
> /* 046 */ boolean isNull55 = isNull56;
> /* 047 */ final UTF8String value55 = isNull55 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value56);
> /* 048 */ isNull55 = value55 == null;
> /* 049 */ if (isNull55) {
> /* 050 */   values1[10] = null;
> /* 051 */ } else {
> /* 052 */   values1[10] = value55;
> /* 053 */ }
> /* 054 */   }
> /* 055 */
> /* 056 */
> /* 057 */   private void apply_7(InternalRow i) {
> /* 058 */
> /* 059 */ boolean isNull69 = MapObjects_loopIsNull1;
> /* 060 */ final scala.Option value69 = isNull69 ? null : (scala.Option) 
> MapObjects_loopValue0.orig_bytes();
> /* 061 */ isNull69 = value69 == null;
> /* 062 */
> /* 063 */ final boolean isNull68 = isNull69 || value69.isEmpty();
> /* 064 */ long value68 = isNull68 ?
> /* 065 */ -1L : (Long) value69.get();
> /* 066 */ if (isNull68) {
> /* 067 */   values1[17] = null;
> /* 068 */ } else {
> /* 069 */   values1[17] = value68;
> /* 070 */ }
> /* 071 */ boolean isNull71 = MapObjects_loopIsNull1;
> /* 072 */ final scala.Option value71 = isNull71 ? null : (scala.Option) 
> MapObjects_loopValue0.resp_bytes();
> /* 073 */ isNull71 = value71 == null;
> /* 074 */
> /* 075 */ final boolean isNull70 = isNull71 || value71.isEmpty();
> /* 076 */ long value70 = isNull70 ?
> /* 077 */ -1L : (Long) value71.get();
> /* 078 */ if (isNull70) {
> /* 079 */   values1[18] = null;
> /* 080 */ } else {
> /* 081 */   values1[18] = value70;
> /* 082 */ }
> /* 083 */ boolean isNull74 = MapObjects_loopIsNull1;
> /* 084 */ 

[jira] [Assigned] (SPARK-18125) Spark generated code causes CompileException when groupByKey, reduceGroups and map(_._2) are used

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18125:


Assignee: (was: Apache Spark)

> Spark generated code causes CompileException when groupByKey, reduceGroups 
> and map(_._2) are used
> -
>
> Key: SPARK-18125
> URL: https://issues.apache.org/jira/browse/SPARK-18125
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Ray Qiu
>Priority: Critical
>
> Code logic looks like this:
> {noformat}
> .groupByKey
> .reduceGroups
> .map(_._2)
> {noformat}
> Works fine with 2.0.0.
> 2.0.1 error Message: 
> {noformat}
> Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 206, Column 123: Unknown variable or type "value4"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificMutableProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private java.lang.String errMsg;
> /* 011 */   private java.lang.String errMsg1;
> /* 012 */   private boolean MapObjects_loopIsNull1;
> /* 013 */   private io.mistnet.analytics.lib.ConnLog MapObjects_loopValue0;
> /* 014 */   private java.lang.String errMsg2;
> /* 015 */   private Object[] values1;
> /* 016 */   private boolean MapObjects_loopIsNull3;
> /* 017 */   private java.lang.String MapObjects_loopValue2;
> /* 018 */   private boolean isNull_0;
> /* 019 */   private boolean value_0;
> /* 020 */   private boolean isNull_1;
> /* 021 */   private InternalRow value_1;
> /* 022 */
> /* 023 */   private void apply_4(InternalRow i) {
> /* 024 */
> /* 025 */ boolean isNull52 = MapObjects_loopIsNull1;
> /* 026 */ final double value52 = isNull52 ? -1.0 : 
> MapObjects_loopValue0.ts();
> /* 027 */ if (isNull52) {
> /* 028 */   values1[8] = null;
> /* 029 */ } else {
> /* 030 */   values1[8] = value52;
> /* 031 */ }
> /* 032 */ boolean isNull54 = MapObjects_loopIsNull1;
> /* 033 */ final java.lang.String value54 = isNull54 ? null : 
> (java.lang.String) MapObjects_loopValue0.uid();
> /* 034 */ isNull54 = value54 == null;
> /* 035 */ boolean isNull53 = isNull54;
> /* 036 */ final UTF8String value53 = isNull53 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value54);
> /* 037 */ isNull53 = value53 == null;
> /* 038 */ if (isNull53) {
> /* 039 */   values1[9] = null;
> /* 040 */ } else {
> /* 041 */   values1[9] = value53;
> /* 042 */ }
> /* 043 */ boolean isNull56 = MapObjects_loopIsNull1;
> /* 044 */ final java.lang.String value56 = isNull56 ? null : 
> (java.lang.String) MapObjects_loopValue0.src();
> /* 045 */ isNull56 = value56 == null;
> /* 046 */ boolean isNull55 = isNull56;
> /* 047 */ final UTF8String value55 = isNull55 ? null : 
> org.apache.spark.unsafe.types.UTF8String.fromString(value56);
> /* 048 */ isNull55 = value55 == null;
> /* 049 */ if (isNull55) {
> /* 050 */   values1[10] = null;
> /* 051 */ } else {
> /* 052 */   values1[10] = value55;
> /* 053 */ }
> /* 054 */   }
> /* 055 */
> /* 056 */
> /* 057 */   private void apply_7(InternalRow i) {
> /* 058 */
> /* 059 */ boolean isNull69 = MapObjects_loopIsNull1;
> /* 060 */ final scala.Option value69 = isNull69 ? null : (scala.Option) 
> MapObjects_loopValue0.orig_bytes();
> /* 061 */ isNull69 = value69 == null;
> /* 062 */
> /* 063 */ final boolean isNull68 = isNull69 || value69.isEmpty();
> /* 064 */ long value68 = isNull68 ?
> /* 065 */ -1L : (Long) value69.get();
> /* 066 */ if (isNull68) {
> /* 067 */   values1[17] = null;
> /* 068 */ } else {
> /* 069 */   values1[17] = value68;
> /* 070 */ }
> /* 071 */ boolean isNull71 = MapObjects_loopIsNull1;
> /* 072 */ final scala.Option value71 = isNull71 ? null : (scala.Option) 
> MapObjects_loopValue0.resp_bytes();
> /* 073 */ isNull71 = value71 == null;
> /* 074 */
> /* 075 */ final boolean isNull70 = isNull71 || value71.isEmpty();
> /* 076 */ long value70 = isNull70 ?
> /* 077 */ -1L : (Long) value71.get();
> /* 078 */ if (isNull70) {
> /* 079 */   values1[18] = null;
> /* 080 */ } else {
> /* 081 */   values1[18] = value70;
> /* 082 */ }
> /* 083 */ boolean isNull74 = MapObjects_loopIsNull1;
> /* 084 */ final scala.Option value74 = 

[jira] [Commented] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-31 Thread zhangxinyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15621541#comment-15621541
 ] 

zhangxinyu commented on SPARK-17935:


Thanks Cody Koeninger!

Here is my opinion:
* The Kafka producer isn't idempotent. Yes, we can only guarantee "at least 
once" delivery in KafkaSink. However, that problem shouldn't be solved here.

* KafkaSinkRDD isn't necessary. Yes, it is indeed unnecessary; let me modify the 
KafkaSink design doc. This suggestion is very useful!

* CachedKafkaProducer is necessary, in case users need to create two or more 
producers with different producer parameters (see the sketch below).
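A minimal sketch of what such a parameter-keyed producer cache could look like (not the actual implementation; the byte-array key/value types and the presence of serializer settings in the params are assumptions):
{code}
import scala.collection.JavaConverters._
import scala.collection.mutable
import org.apache.kafka.clients.producer.KafkaProducer

// One producer per distinct parameter map, shared within the executor JVM.
object CachedKafkaProducerSketch {
  private val cache =
    mutable.HashMap.empty[Map[String, Object], KafkaProducer[Array[Byte], Array[Byte]]]

  def getOrCreate(params: Map[String, Object]): KafkaProducer[Array[Byte], Array[Byte]] =
    cache.synchronized {
      cache.getOrElseUpdate(params, new KafkaProducer[Array[Byte], Array[Byte]](params.asJava))
    }
}
{code}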

> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
>
> Now spark already supports kafkaInputStream. It would be useful that we add 
> `KafkaForeachWriter` to output results to kafka in structured streaming 
> module.
> `KafkaForeachWriter.scala` is put in external kafka-0.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-31 Thread zhangxinyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15621603#comment-15621603
 ] 

zhangxinyu commented on SPARK-17935:


Could you please take a look at the current Kafka sink design?

> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
>
> Now spark already supports kafkaInputStream. It would be useful that we add 
> `KafkaForeachWriter` to output results to kafka in structured streaming 
> module.
> `KafkaForeachWriter.scala` is put in external kafka-0.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-31 Thread zhangxinyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601778#comment-15601778
 ] 

zhangxinyu edited comment on SPARK-17935 at 10/31/16 7:59 AM:
--

h2. KafkaSink Design Doc

h4. Goal
Output results to a Kafka cluster (version 0.10.0.0) from the Structured Streaming 
module.

h4. Implementation
Three classes are implemented to output data to a Kafka cluster in the Structured 
Streaming module.
* *KafkaSinkProvider*
This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* 
and overrides functions *shortName* and *createSink*. In *createSink*, 
*KafkaSink* is created.
* *KafkaSink*
KafkaSink extends *Sink* and overrides function *addBatch*. Like *ForeachSink*, 
the dataset is transformed to an RDD so that data can be sent to the Kafka 
cluster in a distributed way.
* *CachedKafkaProducer*
*CachedKafkaProducer* is used to store producers on the executors so that these 
producers can be reused.

h4. Configuration
* *Kafka Producer Configuration*
"*.option()*" is used to set Kafka producer configurations, all of which start 
with "*kafka.*". For example, the producer configuration *bootstrap.servers* can 
be set via *.option("kafka.bootstrap.servers", kafka-servers)*.
* *Other Configuration*
Other configurations are also set via ".option()"; the difference is that they 
do not start with "kafka.".

h4. Usage
val query = input.writeStream
  .format("kafka-sink-10")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafka-servers)
  .option("topic", topic)
  .start()




was (Author: zhangxinyu):
h2. KafkaSink Design Doc

h4. Goal
Output results to a Kafka cluster (version 0.10.0.0) from the Structured Streaming 
module.

h4. Implementation
Three classes are implemented to output data to a Kafka cluster in the Structured 
Streaming module.
* *KafkaSinkProvider*
This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* 
and overrides functions *shortName* and *createSink*. In *createSink*, 
*KafkaSink* is created.
* *KafkaSink*
KafkaSink extends *Sink* and overrides function *addBatch*.
* *CachedKafkaProducer*
*CachedKafkaProducer* is used to store producers on the executors so that these 
producers can be reused.

h4. Configuration
* *Kafka Producer Configuration*
"*.option()*" is used to set Kafka producer configurations, all of which start 
with "*kafka.*". For example, the producer configuration *bootstrap.servers* can 
be set via *.option("kafka.bootstrap.servers", kafka-servers)*.
* *Other Configuration*
Other configurations are also set via ".option()"; the difference is that they 
do not start with "kafka.".

h4. Usage
val query = input.writeStream
  .format("kafka-sink-10")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafka-servers)
  .option("topic", topic)
  .start()



> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
>
> Now spark already supports kafkaInputStream. It would be useful that we add 
> `KafkaForeachWriter` to output results to kafka in structured streaming 
> module.
> `KafkaForeachWriter.scala` is put in external kafka-0.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18160) SparkContext.addFile doesn't work in yarn-cluster mode

2016-10-31 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-18160:
---
Description: 
{noformat}
bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
yarn-cluster --conf spark.files=/usr/spark-client/conf/hive-site.xml 
examples/target/original-spark-examples_2.11.jar
{noformat}
The above command reproduces the following error in a multi-node cluster. Note 
that this issue only happens in a multi-node cluster; in a single-node cluster 
the AM uses the same local filesystem as the driver.
{noformat}
16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml 
does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415)
at 
org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
at 
org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.SparkContext.(SparkContext.scala:462)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
{noformat}

  was:

{noformat}
bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
yarn-cluster --files /usr/spark-client/conf/hive-site.xml 
examples/target/original-spark-examples_2.11.jar
{noformat}
The above command reproduces the following error in a multi-node cluster. Note 
that this issue only happens in a multi-node cluster; in a single-node cluster 
the AM uses the same local filesystem as the driver.
{noformat}
16/10/28 07:21:42 ERROR SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/usr/spark-client/conf/hive-site.xml 
does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:537)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:750)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:527)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1443)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1415)
at 
org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
at 
org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:462)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.SparkContext.(SparkContext.scala:462)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2296)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:843)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:835)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:835)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 

[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2016-10-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15621678#comment-15621678
 ] 

Sean Owen commented on SPARK-18112:
---

Yes, this may be an instance where Hive 2 will have to shim to be compatible 
with Spark 1 vs 2 simultaneously.

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have also 
> been available for quite some time, but Spark still only supports reading Hive 
> metastore data from Hive 1.2.1 and older versions. Since Hive 2.x has many bug 
> fixes and performance improvements, it is better and urgent to upgrade to 
> support Hive 2.x.
> Loading data from a Hive 2.x metastore fails with:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18179) Throws analysis exception with a proper message for unsupported argument types in reflect/java_method function

2016-10-31 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-18179:


 Summary: Throws analysis exception with a proper message for 
unsupported argument types in reflect/java_method function
 Key: SPARK-18179
 URL: https://issues.apache.org/jira/browse/SPARK-18179
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Hyukjin Kwon
Priority: Minor


{code}
scala> spark.range(1).selectExpr("reflect('java.lang.String', 'valueOf', 
cast('1990-01-01' as timestamp))")
{code}

throws an exception as below:

{code}
java.util.NoSuchElementException: key not found: TimestampType
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:59)
  at scala.collection.MapLike$class.apply(MapLike.scala:141)
  at scala.collection.AbstractMap.apply(Map.scala:59)
  at 
org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection$$anonfun$findMethod$1$$anonfun$apply$1.apply(CallMethodViaReflection.scala:159)
  at 
org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection$$anonfun$findMethod$1$$anonfun$apply$1.apply(CallMethodViaReflection.scala:158)
  at 
scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
{code}

We should throw an analysis exception with a better message when the types are 
unsupported, rather than {{java.util.NoSuchElementException: key not found: 
TimestampType}}.
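For illustration only (this is not the Catalyst code), the fix amounts to replacing a blind map lookup with an explicit check; the map contents and helper name below are assumptions:
{code}
// Sketch: map supported SQL type names to the Java classes reflect() can pass,
// and fail with a clear message instead of Map.apply's NoSuchElementException.
val supportedArgTypes: Map[String, Class[_]] = Map(
  "StringType"  -> classOf[String],
  "IntegerType" -> classOf[java.lang.Integer],
  "LongType"    -> classOf[java.lang.Long],
  "DoubleType"  -> classOf[java.lang.Double]
)

def javaClassFor(sqlTypeName: String): Class[_] =
  supportedArgTypes.getOrElse(
    sqlTypeName,
    throw new IllegalArgumentException(
      s"reflect()/java_method() does not support arguments of type $sqlTypeName"))
{code}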




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18178) Importing Pandas Tables with Missing Values

2016-10-31 Thread Kevin Mader (JIRA)
Kevin Mader created SPARK-18178:
---

 Summary: Importing Pandas Tables with Missing Values
 Key: SPARK-18178
 URL: https://issues.apache.org/jira/browse/SPARK-18178
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0
Reporter: Kevin Mader


If you import a table with missing values (like below) and create a DataFrame 
from it, everything works fine until the command is actually executed (.first(), 
.toPandas(), etc.). The problem came up with a much larger table with values that 
were not NaN, just empty.

```
import pandas as pd
from io import StringIO
test_df = pd.read_csv(StringIO(',Scan Options\n15,SAT2\n16,\n'))
sqlContext.createDataFrame(test_df).registerTempTable('Test')
o_qry = sqlContext.sql("SELECT * FROM Test LIMIT 1")
o_qry.first()
```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-31 Thread zhangxinyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601778#comment-15601778
 ] 

zhangxinyu edited comment on SPARK-17935 at 10/31/16 8:25 AM:
--

h2. KafkaSink Design Doc

h4. Goal
Output results to a Kafka cluster (version 0.10.0.0) from the Structured Streaming 
module.

h4. Implementation
Three classes are implemented to output data to a Kafka cluster in the Structured 
Streaming module.
* *KafkaSinkProvider*
This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* 
and overrides functions *shortName* and *createSink*. In *createSink*, a 
*ForeachSink* with a *KafkaForeachWriter* is created.
* *KafkaForeachWriter*
*KafkaForeachWriter* is like what I proposed before. The only difference is that 
the KafkaProducer is the new Kafka producer and is created in 
*CachedKafkaProducer*.
* *CachedKafkaProducer*
*CachedKafkaProducer* is used to store producers on the executors so that these 
producers can be reused.

h4. Configuration
* *Kafka Producer Configuration*
"*.option()*" is used to set Kafka producer configurations, all of which start 
with "*kafka.*". For example, the producer configuration *bootstrap.servers* can 
be set via *.option("kafka.bootstrap.servers", kafka-servers)*.
* *Other Configuration*
Other configurations are also set via ".option()"; the difference is that they 
do not start with "kafka.".

h4. Usage
val query = input.writeStream
  .format("kafka-sink-10")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafka-servers)
  .option("topic", topic)
  .start()




was (Author: zhangxinyu):
h2. KafkaSink Design Doc

h4. Goal
Output results to a Kafka cluster (version 0.10.0.0) from the Structured Streaming 
module.

h4. Implementation
Three classes are implemented to output data to a Kafka cluster in the Structured 
Streaming module.
* *KafkaSinkProvider*
This class extends trait *StreamSinkProvider* and trait *DataSourceRegister* 
and overrides functions *shortName* and *createSink*. In *createSink*, 
*KafkaSink* is created.
* *KafkaSink*
KafkaSink extends *Sink* and overrides function *addBatch*. A KafkaProducer is 
created only once per JVM for the same producer parameters. Data is sent to the 
Kafka cluster in a distributed way, similar to *ForeachSink*.
* *CachedKafkaProducer*
*CachedKafkaProducer* is used to store producers on the executors so that these 
producers can be reused.

h4. Configuration
* *Kafka Producer Configuration*
"*.option()*" is used to set Kafka producer configurations, all of which start 
with "*kafka.*". For example, the producer configuration *bootstrap.servers* can 
be set via *.option("kafka.bootstrap.servers", kafka-servers)*.
* *Other Configuration*
Other configurations are also set via ".option()"; the difference is that they 
do not start with "kafka.".

h4. Usage
val query = input.writeStream
  .format("kafka-sink-10")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafka-servers)
  .option("topic", topic)
  .start()



> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
>
> Now spark already supports kafkaInputStream. It would be useful that we add 
> `KafkaForeachWriter` to output results to kafka in structured streaming 
> module.
> `KafkaForeachWriter.scala` is put in external kafka-0.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException

2016-10-31 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15620885#comment-15620885
 ] 

Harish edited comment on SPARK-16740 at 10/31/16 12:38 PM:
---

Thank you. I downloaded the 2.0.2 snapshot with Hadoop 2.7 (I think it was from 
10/13). I can still reproduce this issue. If "2.0.2-rc1" was updated after 10/13, 
then I will pick up the updates and try. Can you please help me find the latest 
download path?
I am going to try the 2.0.3 snapshot from the location below -- any suggestions?
http://people.apache.org/~pwendell/spark-nightly/spark-branch-2.0-bin/latest/spark-2.0.3-SNAPSHOT-bin-hadoop2.7.tgz


was (Author: harishk15):
Thank you. I downloaded the 2.0.2 snapshot with Hadoop 2.7 (I think it was from 
10/13). I can still reproduce this issue. If "2.0.2-rc1" was updated after 10/13, 
then I will pick up the updates and try. Can you please help me find the latest 
download path?

> joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
> 
>
> Key: SPARK-16740
> URL: https://issues.apache.org/jira/browse/SPARK-16740
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>Assignee: Sylvain Zimmer
> Fix For: 2.0.1, 2.1.0
>
>
> Hello,
> Here is a crash in Spark SQL joins, with a minimal reproducible test case. 
> Interestingly, it only seems to happen when reading Parquet data (I added a 
> {{crash = True}} variable to show it)
> This is a {{left_outer}} example, but it also crashes with a regular 
> {{inner}} join.
> {code}
> import os
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> schema1 = SparkTypes.StructType([
> SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True)
> ])
> schema2 = SparkTypes.StructType([
> SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True)
> ])
> # Valid Long values (-9223372036854775808 < -5543241376386463808 , 
> 4661454128115150227 < 9223372036854775807)
> data1 = [(4661454128115150227,), (-5543241376386463808,)]
> data2 = [(650460285, )]
> df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1)
> df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2)
> crash = True
> if crash:
> os.system("rm -rf /tmp/sparkbug")
> df1.write.parquet("/tmp/sparkbug/vertex")
> df2.write.parquet("/tmp/sparkbug/edge")
> df1 = sqlc.read.load("/tmp/sparkbug/vertex")
> df2 = sqlc.read.load("/tmp/sparkbug/edge")
> result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer")
> # Should print [Row(id2=650460285, id1=None)]
> print result_df.collect()
> {code}
> When ran with {{spark-submit}}, the final {{collect()}} call crashes with 
> this:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o61.collectToPython.
> : org.apache.spark.SparkException: Exception thrown in awaitResult:
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225)
>   at 
> org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> 

[jira] [Assigned] (SPARK-18179) Throws analysis exception with a proper message for unsupported argument types in reflect/java_method function

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18179:


Assignee: (was: Apache Spark)

> Throws analysis exception with a proper message for unsupported argument 
> types in reflect/java_method function
> --
>
> Key: SPARK-18179
> URL: https://issues.apache.org/jira/browse/SPARK-18179
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {code}
> scala> spark.range(1).selectExpr("reflect('java.lang.String', 'valueOf', 
> cast('1990-01-01' as timestamp))")
> {code}
> throws an exception as below:
> {code}
> java.util.NoSuchElementException: key not found: TimestampType
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at 
> org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection$$anonfun$findMethod$1$$anonfun$apply$1.apply(CallMethodViaReflection.scala:159)
>   at 
> org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection$$anonfun$findMethod$1$$anonfun$apply$1.apply(CallMethodViaReflection.scala:158)
>   at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
> {code}
> We should throw an analysis exception with a better message when the types are 
> unsupported, rather than {{java.util.NoSuchElementException: key not found: 
> TimestampType}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17695) Deserialization error when using DataFrameReader.json on JSON line that contains an empty JSON object

2016-10-31 Thread Miguel Cabrera (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622415#comment-15622415
 ] 

Miguel Cabrera commented on SPARK-17695:


Hi, is there a way to prevent this, besides not using the {{json}} method? I am 
currently mapping the underlying RDD and transforming it into a {{StringRDD}} of 
already-serialized JSON. I am using PySpark, though, with the default JSON 
serializer.
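
A rough PySpark sketch of that kind of manual serialization is below (the session 
name, column names, and data are illustrative only, not taken from the report): 
each {{Row}} is converted to a plain dict before being handed to the {{json}} 
module, so empty objects survive as empty JSON objects instead of being dropped.

{code}
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("manual-json-serialization").getOrCreate()

# Illustrative data: "field1" is a map column that may be empty.
df = spark.createDataFrame([({"k": "v"}, "a"), ({}, "b")], ["field1", "field2"])

# Serialize each Row by hand instead of relying on df.toJSON()/read.json();
# asDict(recursive=True) also converts any nested Rows to plain dicts.
json_rdd = df.rdd.map(lambda row: json.dumps(row.asDict(recursive=True)))

for line in json_rdd.collect():
    print(line)  # e.g. {"field1": {"k": "v"}, "field2": "a"} and {"field1": {}, "field2": "b"}
{code}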

> Deserialization error when using DataFrameReader.json on JSON line that 
> contains an empty JSON object
> -
>
> Key: SPARK-17695
> URL: https://issues.apache.org/jira/browse/SPARK-17695
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Scala 2.11.7
>Reporter: Jonathan Simozar
>
> When using the {{DataFrameReader}} method {{json}} on the JSON
> {noformat}{"field1":{},"field2":"a"}{noformat}
> {{field1}} is removed at deserialization.
> This can be reproduced in the example below.
> {code:java}// create spark context
> val sc: SparkContext = new SparkContext("local[*]", "My App")
> // create spark session
> val sparkSession: SparkSession = 
> SparkSession.builder().config(sc.getConf).getOrCreate()
> // create rdd
> val strings = sc.parallelize(Seq(
>   """{"field1":{},"field2":"a"}"""
> ))
> // create json DataSet[Row], convert back to RDD, and print lines to stdout
> sparkSession.read.json(strings)
>   .toJSON.collect().foreach(println)
> {code}
> *stdout*
> {noformat}
> {"field2":"a"}
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18179) Throws analysis exception with a proper message for unsupported argument types in reflect/java_method function

2016-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622367#comment-15622367
 ] 

Apache Spark commented on SPARK-18179:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15694

> Throws analysis exception with a proper message for unsupported argument 
> types in reflect/java_method function
> --
>
> Key: SPARK-18179
> URL: https://issues.apache.org/jira/browse/SPARK-18179
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {code}
> scala> spark.range(1).selectExpr("reflect('java.lang.String', 'valueOf', 
> cast('1990-01-01' as timestamp))")
> {code}
> throws an exception as below:
> {code}
> java.util.NoSuchElementException: key not found: TimestampType
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at 
> org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection$$anonfun$findMethod$1$$anonfun$apply$1.apply(CallMethodViaReflection.scala:159)
>   at 
> org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection$$anonfun$findMethod$1$$anonfun$apply$1.apply(CallMethodViaReflection.scala:158)
>   at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
> {code}
> We should throw an analysis exception with a better message when the types are 
> unsupported, rather than {{java.util.NoSuchElementException: key not found: 
> TimestampType}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18180) pyspark.sql.Row does not serialize well to json

2016-10-31 Thread Miguel Cabrera (JIRA)
Miguel Cabrera created SPARK-18180:
--

 Summary: pyspark.sql.Row does not serialize well to json
 Key: SPARK-18180
 URL: https://issues.apache.org/jira/browse/SPARK-18180
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.1
 Environment: HDP 2.3.4, Spark 2.0.1, 
Reporter: Miguel Cabrera


{{Row}} does not serialize well automatically. Although a {{Row}} is dict-like in 
Python, the {{json}} module does not serialize it as a JSON object: {{Row}} is a 
subclass of {{tuple}}, so {{json.dumps}} emits a JSON array instead.

{noformat}
from pyspark.sql import Row
import json

r = Row(field1='hello', field2='world')
json.dumps(r)
{noformat}

Results:
{noformat}
'["hello", "world"]'
{noformat}

Expected:

{noformat}
{'field1': 'hello', 'field2': 'world'}
{noformat}

The workaround is to call the {{asDict()}} method of {{Row}}. However, this makes 
custom serialization of nested objects really painful, as the caller has to be 
aware that they are serializing a {{Row}} object. In particular, with SPARK-17695 
you cannot serialize DataFrames easily if you have some empty or null fields, so 
you have to customize the serialization process. 
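
A minimal sketch of that workaround is below, for illustration only (the helper 
name is not part of PySpark). Because {{Row}} subclasses {{tuple}}, the {{json}} 
module already knows how to encode it (as an array), so a custom 
{{JSONEncoder.default}} hook is never consulted; the conversion has to happen 
before {{json.dumps}} is called.

{code}
import json

from pyspark.sql import Row


def rows_to_dicts(obj):
    """Recursively convert pyspark.sql.Row objects into plain dicts/lists."""
    if isinstance(obj, Row):
        return {k: rows_to_dicts(v) for k, v in obj.asDict().items()}
    if isinstance(obj, dict):
        return {k: rows_to_dicts(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [rows_to_dicts(v) for v in obj]
    return obj


r = Row(field1='hello', field2=Row(nested='world'))

print(json.dumps(rows_to_dicts(r)))          # e.g. {"field1": "hello", "field2": {"nested": "world"}}
print(json.dumps(r.asDict(recursive=True)))  # built-in equivalent for this case
{code}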






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18179) Throws analysis exception with a proper message for unsupported argument types in reflect/java_method function

2016-10-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18179:


Assignee: Apache Spark

> Throws analysis exception with a proper message for unsupported argument 
> types in reflect/java_method function
> --
>
> Key: SPARK-18179
> URL: https://issues.apache.org/jira/browse/SPARK-18179
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> {code}
> scala> spark.range(1).selectExpr("reflect('java.lang.String', 'valueOf', 
> cast('1990-01-01' as timestamp))")
> {code}
> throws an exception as below:
> {code}
> java.util.NoSuchElementException: key not found: TimestampType
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at 
> org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection$$anonfun$findMethod$1$$anonfun$apply$1.apply(CallMethodViaReflection.scala:159)
>   at 
> org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection$$anonfun$findMethod$1$$anonfun$apply$1.apply(CallMethodViaReflection.scala:158)
>   at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
> {code}
> We should throw an analysis exception with a better message when the types are 
> unsupported, rather than {{java.util.NoSuchElementException: key not found: 
> TimestampType}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18143) History Server is broken because of the refactoring work in Structured Streaming

2016-10-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622740#comment-15622740
 ] 

Apache Spark commented on SPARK-18143:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15695

> History Server is broken because of the refactoring work in Structured 
> Streaming
> 
>
> Key: SPARK-18143
> URL: https://issues.apache.org/jira/browse/SPARK-18143
> Project: Spark
>  Issue Type: Sub-task
>Affects Versions: 2.0.2
>Reporter: Shixiong Zhu
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2016-10-31 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622645#comment-15622645
 ] 

Lianhui Wang commented on SPARK-15616:
--

For 2.0, I have created a new branch: 
https://github.com/lianhuiwang/spark/tree/2.0_partition_broadcast. You can 
retry with it.

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> Currently, if some partitions of a partitioned table are used in a join 
> operation, we rely on the table size returned by the Metastore to decide 
> whether we can convert the operation to a broadcast join.
> If a Filter can prune some partitions, Hive prunes them before deciding whether 
> to use a broadcast join, based on the HDFS size of the partitions that are 
> actually involved in the query. Spark SQL needs the same behavior, as it would 
> improve join performance for partitioned tables.
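> To make the scenario concrete, here is a rough PySpark sketch (the table name, 
> data, and threshold are illustrative only; a real deployment would involve a 
> large Hive Metastore table): the filter prunes the fact table down to a single 
> partition, but the join strategy is still chosen from the statistics of the 
> whole table.
> {code}
> from pyspark.sql import SparkSession
> 
> spark = (SparkSession.builder
>          .appName("partition-pruning-broadcast-sketch")
>          .config("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))
>          .getOrCreate())
> 
> # A small partitioned table standing in for a large Metastore-backed table.
> spark.range(0, 1000).selectExpr("id", "id % 10 AS part") \
>     .write.mode("overwrite").partitionBy("part").saveAsTable("events")
> 
> dim = spark.range(0, 100).selectExpr("id", "id AS dim_value")
> 
> # Only the part = 3 partition is scanned, but, per this issue, for a Metastore
> # relation the broadcast decision is based on the size of the whole table
> # rather than on the HDFS size of the pruned partitions.
> pruned = spark.table("events").where("part = 3")
> pruned.join(dim, "id").explain()
> {code}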



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


