[jira] [Assigned] (SPARK-18522) Create explicit contract for column stats serialization

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18522:


Assignee: Apache Spark  (was: Reynold Xin)

> Create explicit contract for column stats serialization
> ---
>
> Key: SPARK-18522
> URL: https://issues.apache.org/jira/browse/SPARK-18522
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> The current implementation of column stats uses the base64 encoding of the 
> internal UnsafeRow format to persist statistics (in table properties in Hive 
> metastore). This is an internal format that is not stable across different 
> versions of Spark and should NOT be used for persistence.
> In addition, it would be better if the statistics stored in the catalog were 
> human readable.






[jira] [Assigned] (SPARK-18522) Create explicit contract for column stats serialization

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18522:


Assignee: Reynold Xin  (was: Apache Spark)

> Create explicit contract for column stats serialization
> ---
>
> Key: SPARK-18522
> URL: https://issues.apache.org/jira/browse/SPARK-18522
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The current implementation of column stats uses the base64 encoding of the 
> internal UnsafeRow format to persist statistics (in table properties in Hive 
> metastore). This is an internal format that is not stable across different 
> versions of Spark and should NOT be used for persistence.
> In addition, it would be better if the statistics stored in the catalog were 
> human readable.






[jira] [Commented] (SPARK-18522) Create explicit contract for column stats serialization

2016-11-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682815#comment-15682815
 ] 

Apache Spark commented on SPARK-18522:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15959

> Create explicit contract for column stats serialization
> ---
>
> Key: SPARK-18522
> URL: https://issues.apache.org/jira/browse/SPARK-18522
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The current implementation of column stats uses the base64 encoding of the 
> internal UnsafeRow format to persist statistics (in table properties in Hive 
> metastore). This is an internal format that is not stable across different 
> versions of Spark and should NOT be used for persistence.
> In addition, it would be better if the statistics stored in the catalog were 
> human readable.






[jira] [Assigned] (SPARK-17932) Failed to run SQL "show table extended like table_name" in Spark2.0.0

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17932:


Assignee: Apache Spark

> Failed to run SQL "show table extended  like table_name"  in Spark2.0.0
> ---
>
> Key: SPARK-17932
> URL: https://issues.apache.org/jira/browse/SPARK-17932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: pin_zhang
>Assignee: Apache Spark
>
> SQL "show table extended  like table_name " doesn't work in spark 2.0.0
> that works in spark1.5.2
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> missing 'FUNCTIONS' at 'extended'(line 1, pos 11)
> == SQL ==
> show table extended  like test
> ---^^^ (state=,code=0)
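For context, a hedged spark-shell reproduction sketch (the quoted pattern and the {{SHOW TABLES}} workaround are assumptions, not taken from the report):

{code}
// On 2.0.0 this throws ParseException: missing 'FUNCTIONS' at 'extended';
// the same statement parses on 1.5.2.
spark.sql("SHOW TABLE EXTENDED LIKE 'test'")

// The plain listing form still works on 2.0.x:
spark.sql("SHOW TABLES LIKE 'test'").show()
{code}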






[jira] [Assigned] (SPARK-17932) Failed to run SQL "show table extended like table_name" in Spark2.0.0

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17932:


Assignee: (was: Apache Spark)

> Failed to run SQL "show table extended  like table_name"  in Spark2.0.0
> ---
>
> Key: SPARK-17932
> URL: https://issues.apache.org/jira/browse/SPARK-17932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: pin_zhang
>
> SQL "show table extended  like table_name " doesn't work in spark 2.0.0
> that works in spark1.5.2
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> missing 'FUNCTIONS' at 'extended'(line 1, pos 11)
> == SQL ==
> show table extended  like test
> ---^^^ (state=,code=0)






[jira] [Commented] (SPARK-17932) Failed to run SQL "show table extended like table_name" in Spark2.0.0

2016-11-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682810#comment-15682810
 ] 

Apache Spark commented on SPARK-17932:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/15958

> Failed to run SQL "show table extended  like table_name"  in Spark2.0.0
> ---
>
> Key: SPARK-17932
> URL: https://issues.apache.org/jira/browse/SPARK-17932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: pin_zhang
>
> SQL "show table extended  like table_name " doesn't work in spark 2.0.0
> that works in spark1.5.2
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> missing 'FUNCTIONS' at 'extended'(line 1, pos 11)
> == SQL ==
> show table extended  like test
> ---^^^ (state=,code=0)






[jira] [Created] (SPARK-18522) Create explicit contract for column stats serialization

2016-11-20 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18522:
---

 Summary: Create explicit contract for column stats serialization
 Key: SPARK-18522
 URL: https://issues.apache.org/jira/browse/SPARK-18522
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


The current implementation of column stats uses the base64 encoding of the 
internal UnsafeRow format to persist statistics (in table properties in Hive 
metastore). This is an internal format that is not stable across different 
versions of Spark and should NOT be used for persistence.

In addition, it would be better if the statistics stored in the catalog were 
human readable.
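As an illustration of the kind of explicit contract the ticket asks for, here is a hedged sketch (the property-key prefix and the {{ColumnStat}} fields are assumptions for illustration, not Spark's actual schema) of persisting per-column statistics as plain string key/value pairs in table properties rather than a base64-encoded UnsafeRow:

{code}
// Hedged sketch only: an explicit, human-readable serialization contract for
// column statistics stored in table properties.
case class ColumnStat(
    distinctCount: Long,
    nullCount: Long,
    min: Option[String],
    max: Option[String])

object ColumnStatSerde {
  private val Prefix = "spark.sql.statistics.colStats."   // assumed key prefix

  def toProperties(colName: String, stat: ColumnStat): Map[String, String] = {
    val p = Prefix + colName + "."
    Map(
      p + "distinctCount" -> stat.distinctCount.toString,
      p + "nullCount"     -> stat.nullCount.toString
    ) ++
      stat.min.map(v => p + "min" -> v) ++
      stat.max.map(v => p + "max" -> v)
  }
}
{code}

Reading the statistics back is then plain string parsing, which keeps the persisted form stable across Spark versions and readable to anyone inspecting the metastore.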






[jira] [Commented] (SPARK-17755) Master may ask a worker to launch an executor before the worker actually got the response of registration

2016-11-20 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682778#comment-15682778
 ] 

Shixiong Zhu commented on SPARK-17755:
--

It's not easy to fix. The root cause is messages are sent via two different 
channels and their order is not guaranteed.

> Master may ask a worker to launch an executor before the worker actually got 
> the response of registration
> -
>
> Key: SPARK-17755
> URL: https://issues.apache.org/jira/browse/SPARK-17755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Yin Huai
>
> I somehow saw a failed test {{org.apache.spark.DistributedSuite.caching in 
> memory, serialized, replicated}}. Its log shows that the Spark master asked the 
> worker to launch an executor before the worker actually got the registration 
> response. So the master knew that the worker had been registered, but the 
> worker did not know whether it itself had been registered.
> {code}
> 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Registering worker 
> localhost:38262 with 1 cores, 1024.0 MB RAM
> 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Launching executor 
> app-20160930145353-/1 on worker worker-20160930145353-localhost-38262
> 16/09/30 14:53:53.682 dispatcher-event-loop-3 INFO 
> StandaloneAppClient$ClientEndpoint: Executor added: app-20160930145353-/1 
> on worker-20160930145353-localhost-38262 (localhost:38262) with 1 cores
> 16/09/30 14:53:53.683 dispatcher-event-loop-3 INFO 
> StandaloneSchedulerBackend: Granted executor ID app-20160930145353-/1 on 
> hostPort localhost:38262 with 1 cores, 1024.0 MB RAM
> 16/09/30 14:53:53.683 dispatcher-event-loop-0 WARN Worker: Invalid Master 
> (spark://localhost:46460) attempted to launch executor.
> 16/09/30 14:53:53.687 worker-register-master-threadpool-0 INFO Worker: 
> Successfully registered with master spark://localhost:46460
> {code}
> Then, it seems the worker did not launch any executor.
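The ordering problem is easy to see in miniature; the following is a hedged toy sketch in plain Scala (not Spark's actual RPC code) of two independent channels with no delivery-order guarantee:

{code}
import java.util.concurrent.ConcurrentLinkedQueue
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object RegistrationRaceSketch extends App {
  val workerInbox = new ConcurrentLinkedQueue[String]()

  // Channel 1: the master's reply to the worker's registration request.
  Future { workerInbox.add("RegisteredWorker") }
  // Channel 2: the scheduler's LaunchExecutor message for the same worker.
  Future { workerInbox.add("LaunchExecutor") }

  Thread.sleep(100)
  // Either order can appear here. If LaunchExecutor arrives first, the worker
  // still believes it is unregistered and rejects it ("Invalid Master ...").
  println(workerInbox)
}
{code}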






[jira] [Created] (SPARK-18521) Add `NoRedundantStringInterpolator` Scala rule

2016-11-20 Thread Weiqing Yang (JIRA)
Weiqing Yang created SPARK-18521:


 Summary: Add `NoRedundantStringInterpolator` Scala rule
 Key: SPARK-18521
 URL: https://issues.apache.org/jira/browse/SPARK-18521
 Project: Spark
  Issue Type: Improvement
Reporter: Weiqing Yang


Currently the {{s}} string interpolator is used in many places where there is no 
embedded variable reference in the processed string literal.
For example:
core/src/main/scala/org/apache/spark/deploy/Client.scala
{code}
logError(s"Error processing messages, exiting.")
{code}

examples/src/main/scala/org/apache/spark/examples/graphx/SynthBenchmark.scala
{code}
println(s"Creating graph...")
{code}

examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala
{code}
println(s"Corpus summary:")
{code}

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLViewSuite.scala
{code}
test(s"correctly handle CREATE OR REPLACE TEMPORARY VIEW") {
{code}

We can add a new Scala style rule, 'NoRedundantStringInterpolator', to prevent 
unnecessary string interpolators.
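For clarity, a minimal before/after sketch of what such a rule would flag; the fix is simply dropping the redundant {{s}} prefix:

{code}
println(s"Creating graph...")   // flagged: interpolator without any ${...} reference
println("Creating graph...")    // equivalent plain string literal
{code}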






[jira] [Comment Edited] (SPARK-18515) AlterTableDropPartitions fails for non-string columns

2016-11-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682732#comment-15682732
 ] 

Dongjoon Hyun edited comment on SPARK-18515 at 11/21/16 7:10 AM:
-

~While digging this, option 1 seems to increase complexity. As another option 
(Option 3), I want to propose converting arbitrary-type constant input literals 
into *string* literals at the beginning.~
~I'll make a PR for this for the Option 3 review with all-atomic type test.~

Hmm, it has a side effect and is also different from Hive. Sorry, never mind 
this.


was (Author: dongjoon):
~~While digging this, option 1 seems to increase complexity. As another option 
(Option 3), I want to propose converting arbitrary-type constant input literals 
into *string* literals at the beginning.~~
~~I'll make a PR for this for the Option 3 review with all-atomic type test.~~

Hmm, it has a side effect and is also different from Hive. Sorry, never mind 
this.

> AlterTableDropPartitions fails for non-string columns
> -
>
> Key: SPARK-18515
> URL: https://issues.apache.org/jira/browse/SPARK-18515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Dongjoon Hyun
>
> AlterTableDropPartitions fails with a Scala MatchError if you use non-string 
> partitioning columns:
> {noformat}
> spark.sql("drop table if exists tbl_x")
> spark.sql("create table tbl_x (a int) partitioned by (p int)")
> spark.sql("alter table tbl_x add partition (p=10)")
> spark.sql("alter table tbl_x drop partition (p=10)")
> {noformat}
> Yields the following error:
> {noformat}
> scala.MatchError: (cast(p#8 as int) = 10) (of class 
> org.apache.spark.sql.catalyst.expressions.EqualTo)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:461)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand.run(ddl.scala:461)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:591)
>   ... 39 elided
> {noformat}






[jira] [Comment Edited] (SPARK-18515) AlterTableDropPartitions fails for non-string columns

2016-11-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682732#comment-15682732
 ] 

Dongjoon Hyun edited comment on SPARK-18515 at 11/21/16 7:10 AM:
-

~~While digging this, option 1 seems to increase complexity. As another option 
(Option 3), I want to propose converting arbitrary-type constant input literals 
into *string* literals at the beginning.~~
~~I'll make a PR for this for the Option 3 review with all-atomic type test.~~

Hmm, it has a side effect and is also different from Hive. Sorry, never mind 
this.


was (Author: dongjoon):
While digging into this, option 1 seems to increase complexity. As another option 
(Option 3), I want to propose converting arbitrary-type constant input literals 
into *string* literals at the beginning.

The rationale is as follows:
- We only use the *name* of the attribute and the *value* of the literal.
- Generally, the user-given literal values are assumed to be used as a 
partition directory name without any conversion.

I'll make a PR for the Option 3 review with an all-atomic-type test.

> AlterTableDropPartitions fails for non-string columns
> -
>
> Key: SPARK-18515
> URL: https://issues.apache.org/jira/browse/SPARK-18515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Dongjoon Hyun
>
> AlterTableDropPartitions fails with a Scala MatchError if you use non-string 
> partitioning columns:
> {noformat}
> spark.sql("drop table if exists tbl_x")
> spark.sql("create table tbl_x (a int) partitioned by (p int)")
> spark.sql("alter table tbl_x add partition (p=10)")
> spark.sql("alter table tbl_x drop partition (p=10)")
> {noformat}
> Yields the following error:
> {noformat}
> scala.MatchError: (cast(p#8 as int) = 10) (of class 
> org.apache.spark.sql.catalyst.expressions.EqualTo)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:461)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand.run(ddl.scala:461)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:591)
>   ... 39 elided
> {noformat}




[jira] [Comment Edited] (SPARK-18515) AlterTableDropPartitions fails for non-string columns

2016-11-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682732#comment-15682732
 ] 

Dongjoon Hyun edited comment on SPARK-18515 at 11/21/16 7:10 AM:
-

-While digging this, option 1 seems to increase complexity. As another option 
(Option 3), I want to propose converting arbitrary-type constant input literals 
into *string* literals at the beginning.-
-I'll make a PR for this for the Option 3 review with all-atomic type test.-

Hmm, it has a side effect and is also different from Hive. Sorry, never mind 
this.


was (Author: dongjoon):
~While digging this, option 1 seems to increase complexity. As another option 
(Option 3), I want to propose converting arbitrary-type constant input literals 
into *string* literals at the beginning.~
~I'll make a PR for this for the Option 3 review with all-atomic type test.~

Hmm, it has a side effect and is also different from Hive. Sorry, never mind 
this.

> AlterTableDropPartitions fails for non-string columns
> -
>
> Key: SPARK-18515
> URL: https://issues.apache.org/jira/browse/SPARK-18515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Dongjoon Hyun
>
> AlterTableDropPartitions fails with a Scala MatchError if you use non-string 
> partitioning columns:
> {noformat}
> spark.sql("drop table if exists tbl_x")
> spark.sql("create table tbl_x (a int) partitioned by (p int)")
> spark.sql("alter table tbl_x add partition (p=10)")
> spark.sql("alter table tbl_x drop partition (p=10)")
> {noformat}
> Yields the following error:
> {noformat}
> scala.MatchError: (cast(p#8 as int) = 10) (of class 
> org.apache.spark.sql.catalyst.expressions.EqualTo)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:461)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand.run(ddl.scala:461)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:591)
>   ... 39 elided
> {noformat}






[jira] [Updated] (SPARK-18450) Add AND-amplification to Locality Sensitive Hashing

2016-11-20 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-18450:
---
Component/s: ML

> Add AND-amplification to Locality Sensitive Hashing
> ---
>
> Key: SPARK-18450
> URL: https://issues.apache.org/jira/browse/SPARK-18450
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yun Ni
>
> We are changing the LSH transform API from {{Vector}} to {{Array of Vector}}. 
> Once the change is applied, we can add AND-amplification to the LSH 
> implementation.






[jira] [Updated] (SPARK-18454) Changes to fix Nearest Neighbor Search for LSH

2016-11-20 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-18454:
---
Component/s: ML

> Changes to fix Nearest Neighbor Search for LSH
> --
>
> Key: SPARK-18454
> URL: https://issues.apache.org/jira/browse/SPARK-18454
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yun Ni
>
> We all agree to do the following improvement to Multi-Probe NN Search:
> (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing 
> a full sort on the whole dataset.
> Currently we are still discussing the following:
> (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}}
> (2) What are the issues and how we should change the current Nearest Neighbor 
> implementation






[jira] [Updated] (SPARK-18408) API Improvements for LSH

2016-11-20 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-18408:
---
Component/s: ML

> API Improvements for LSH
> 
>
> Key: SPARK-18408
> URL: https://issues.apache.org/jira/browse/SPARK-18408
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yun Ni
>
> As the first improvements to the current LSH implementation, we are planning 
> to do the following:
>  - Change output schema to {{Array of Vector}} instead of {{Vectors}}
>  - Use {{numHashTables}} as the dimension of {{Array}} and 
> {{numHashFunctions}} as the dimension of {{Vector}}
>  - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, 
> {{MinHash}} to {{MinHashLSH}}
>  - Make randUnitVectors/randCoefficients private
>  - Make Multi-Probe NN Search and {{hashDistance}} private for future 
> discussion
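A hedged usage sketch of the renamed API described above (assumes a DataFrame {{df}} with a {{Vector}} column named "features"; the setter names follow the usual ML param conventions and are not quoted from this ticket):

{code}
import org.apache.spark.ml.feature.MinHashLSH

val mh = new MinHashLSH()
  .setNumHashTables(3)        // number of hash tables => length of the output Array
  .setInputCol("features")
  .setOutputCol("hashes")     // Array of Vector, one Vector per hash table

val model = mh.fit(df)        // assumes `df` has a sparse Vector column "features"
model.transform(df).show(truncate = false)
{code}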






[jira] [Commented] (SPARK-18515) AlterTableDropPartitions fails for non-string columns

2016-11-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682732#comment-15682732
 ] 

Dongjoon Hyun commented on SPARK-18515:
---

While digging into this, option 1 seems to increase complexity. As another option 
(Option 3), I want to propose converting arbitrary-type constant input literals 
into *string* literals at the beginning.

The rationale is as follows:
- We only use the *name* of the attribute and the *value* of the literal.
- Generally, the user-given literal values are assumed to be used as a 
partition directory name without any conversion.

I'll make a PR for the Option 3 review with an all-atomic-type test.
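To make the failure concrete, here is a hedged sketch (not the merged fix) of where the {{MatchError}} comes from and of the shape Option 3 suggests: strip the cast and keep the attribute *name* plus the literal *value* rendered as a string. The helper name is hypothetical:

{code}
import org.apache.spark.sql.catalyst.expressions.{Attribute, Cast, EqualTo, Expression, Literal}

// Hypothetical helper, for illustration only (e.g. in spark-shell).
def partitionValue(expr: Expression): (String, String) = expr match {
  case EqualTo(attr: Attribute, value: Literal) =>
    attr.name -> value.value.toString            // string partition columns hit this case
  case EqualTo(cast: Cast, value: Literal) if cast.child.isInstanceOf[Attribute] =>
    // Non-string columns arrive wrapped in a cast, e.g. (cast(p#8 as int) = 10);
    // without a case for it the match fails with scala.MatchError.
    cast.child.asInstanceOf[Attribute].name -> value.value.toString
  case other =>
    throw new IllegalArgumentException(s"Unsupported partition predicate: $other")
}
{code}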

> AlterTableDropPartitions fails for non-string columns
> -
>
> Key: SPARK-18515
> URL: https://issues.apache.org/jira/browse/SPARK-18515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Dongjoon Hyun
>
> AlterTableDropPartitions fails with a Scala MatchError if you use non-string 
> partitioning columns:
> {noformat}
> spark.sql("drop table if exists tbl_x")
> spark.sql("create table tbl_x (a int) partitioned by (p int)")
> spark.sql("alter table tbl_x add partition (p=10)")
> spark.sql("alter table tbl_x drop partition (p=10)")
> {noformat}
> Yields the following error:
> {noformat}
> scala.MatchError: (cast(p#8 as int) = 10) (of class 
> org.apache.spark.sql.catalyst.expressions.EqualTo)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:461)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand.run(ddl.scala:461)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:591)
>   ... 39 elided
> {noformat}






[jira] [Commented] (SPARK-18134) SQL: MapType in Group BY and Joins not working

2016-11-20 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682730#comment-15682730
 ] 

Takeshi Yamamuro commented on SPARK-18134:
--

In v1.5, comparison and ordering of map types was dropped in SPARK-9415. I'm not 
exactly sure why this feature was dropped, though; ISTM one of the reasons is a 
performance issue: we couldn't find an efficient way to compare map-typed data 
(see https://github.com/apache/spark/pull/13847).
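Until map-typed grouping returns, a hedged workaround sketch (not a Spark feature; assumes a SparkSession {{spark}} and a DataFrame {{df}} with a {{MapType}} column {{props}}) is to group on a canonical string rendering of the map:

{code}
import org.apache.spark.sql.functions.udf
import spark.implicits._   // assumes a SparkSession named `spark`

// Canonicalize the map to a sorted "k=v" string and group on that instead.
val mapKey = udf { m: Map[String, String] =>
  m.toSeq.sorted.map { case (k, v) => s"$k=$v" }.mkString(",")
}

df.groupBy(mapKey($"props").as("props_key")).count().show()
{code}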

> SQL: MapType in Group BY and Joins not working
> --
>
> Key: SPARK-18134
> URL: https://issues.apache.org/jira/browse/SPARK-18134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Christian Zorneck
>
> Since version 1.5 and issue SPARK-9415, MapTypes can no longer be used in 
> GROUP BY and join clauses. This makes it incompatible with HiveQL: a Hive 
> feature was removed from Spark, which makes Spark incompatible with various 
> HiveQL statements.






[jira] [Updated] (SPARK-12978) Skip unnecessary final group-by when input data already clustered with group-by keys

2016-11-20 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-12978:

Affects Version/s: (was: 1.6.0)
 Target Version/s: 2.2.0

> Skip unnecessary final group-by when input data already clustered with 
> group-by keys
> 
>
> Key: SPARK-12978
> URL: https://issues.apache.org/jira/browse/SPARK-12978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>
> This ticket targets an optimization that skips the unnecessary final group-by 
> operation shown below:
> Without opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
>  output=[col0#159,sum#200,sum#201,count#202L])
>+- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
> InMemoryRelation [col0#159,col1#160,col2#161], true, 1, 
> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
> {code}
> With opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
> [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
> true, 1), ConvertToUnsafe, None
> {code}
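For reference, a hedged sketch of the kind of query this targets: the input is already hash-partitioned by the group-by key, so the partial aggregate and the extra exchange add nothing. The table name and the column names are assumptions taken from the plans above:

{code}
import org.apache.spark.sql.functions.{avg, sum}
import spark.implicits._   // assumes a SparkSession named `spark`

val clustered = spark.table("src")   // assumed source table
  .repartition($"col0")              // already clustered by the group-by key
  .cache()

clustered.groupBy($"col0")
  .agg(sum($"col1"), avg($"col2"))
  .explain()   // with the optimization, a single Complete/Final aggregate suffices
{code}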






[jira] [Assigned] (SPARK-18520) Add missing setXXXCol methods for BisectingKMeansModel and GaussianMixtureModel

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18520:


Assignee: (was: Apache Spark)

> Add missing setXXXCol methods for BisectingKMeansModel and 
> GaussianMixtureModel
> ---
>
> Key: SPARK-18520
> URL: https://issues.apache.org/jira/browse/SPARK-18520
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>
> Add the missing methods for BisectingKMeansModel and GaussianMixtureModel:
> {{setFeaturesCol}} {{setPredictionCol}} {{setProbabilityCol}}






[jira] [Assigned] (SPARK-18520) Add missing setXXXCol methods for BisectingKMeansModel and GaussianMixtureModel

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18520:


Assignee: Apache Spark

> Add missing setXXXCol methods for BisectingKMeansModel and 
> GaussianMixtureModel
> ---
>
> Key: SPARK-18520
> URL: https://issues.apache.org/jira/browse/SPARK-18520
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Apache Spark
>
> Add the missing methods for BisectingKMeansModel and GaussianMixtureModel:
> {{setFeaturesCol}} {{setPredictionCol}} {{setProbabilityCol}}






[jira] [Commented] (SPARK-18520) Add missing setXXXCol methods for BisectingKMeansModel and GaussianMixtureModel

2016-11-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682692#comment-15682692
 ] 

Apache Spark commented on SPARK-18520:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/15957

> Add missing setXXXCol methods for BisectingKMeansModel and 
> GaussianMixtureModel
> ---
>
> Key: SPARK-18520
> URL: https://issues.apache.org/jira/browse/SPARK-18520
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>
> Add the missing methods for BisectingKMeansModel and GaussianMixtureModel:
> {{setFeaturesCol}} {{setPredictionCol}} {{setProbabilityCol}}






[jira] [Created] (SPARK-18520) Add missing setXXXCol methods for BisectingKMeansModel and GaussianMixtureModel

2016-11-20 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-18520:


 Summary: Add missing setXXXCol methods for BisectingKMeansModel 
and GaussianMixtureModel
 Key: SPARK-18520
 URL: https://issues.apache.org/jira/browse/SPARK-18520
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: zhengruifeng


Add the missing methods for BisectingKMeansModel and GaussianMixtureModel:
{{setFeaturesCol}} {{setPredictionCol}} {{setProbabilityCol}}
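For illustration, a hedged sketch of the standard ML setter shape the ticket asks for, shown as a toy trait (in Spark the params themselves come from shared traits such as {{HasFeaturesCol}}, so only the {{setXXXCol}} forwarders need to be added to the two models):

{code}
import org.apache.spark.ml.param.{Param, Params}

// Toy trait for illustration; not the actual model classes.
trait HasFeaturesColSetter extends Params {
  final val featuresCol: Param[String] =
    new Param[String](this, "featuresCol", "features column name")

  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
}
{code}

The setters only forward to {{set(...)}}, so adding them to the two models is mechanical.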






[jira] [Assigned] (SPARK-18519) map type can not be used in binary comparison

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18519:


Assignee: Wenchen Fan  (was: Apache Spark)

> map type can not be used in binary comparison
> -
>
> Key: SPARK-18519
> URL: https://issues.apache.org/jira/browse/SPARK-18519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Assigned] (SPARK-18519) map type can not be used in binary comparison

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18519:


Assignee: Apache Spark  (was: Wenchen Fan)

> map type can not be used in binary comparison
> -
>
> Key: SPARK-18519
> URL: https://issues.apache.org/jira/browse/SPARK-18519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-18519) map type can not be used in binary comparison

2016-11-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682677#comment-15682677
 ] 

Apache Spark commented on SPARK-18519:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/15956

> map type can not be used in binary comparison
> -
>
> Key: SPARK-18519
> URL: https://issues.apache.org/jira/browse/SPARK-18519
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Created] (SPARK-18519) map type can not be used in binary comparison

2016-11-20 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-18519:
---

 Summary: map type can not be used in binary comparison
 Key: SPARK-18519
 URL: https://issues.apache.org/jira/browse/SPARK-18519
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Resolved] (SPARK-18456) Use matrix abstraction for LogisticRegression coefficients during training

2016-11-20 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-18456.

   Resolution: Fixed
 Assignee: Seth Hendrickson
Fix Version/s: 2.1.0

> Use matrix abstraction for LogisticRegression coefficients during training
> -
>
> Key: SPARK-18456
> URL: https://issues.apache.org/jira/browse/SPARK-18456
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
> Fix For: 2.1.0
>
>
> This is a follow-up from 
> [SPARK-18060|https://issues.apache.org/jira/browse/SPARK-18060]. The current 
> code for logistic regression relies on manually indexing flat arrays of 
> column-major coefficients, which can be messy and is hard to maintain. We can 
> use a matrix abstraction instead of a flat array to simplify things. This 
> will make the code easier to read and maintain.






[jira] [Created] (SPARK-18518) HasSolver should support allowed values

2016-11-20 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-18518:


 Summary: HasSolver should support allowed values
 Key: SPARK-18518
 URL: https://issues.apache.org/jira/browse/SPARK-18518
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: zhengruifeng
Priority: Minor


{{HasSolver}} currently doesn't support value validation, so it's not easy to use:
GLR and LiR inherit {{HasSolver}} but need to do param validation explicitly, like:
{code}
require(Set("auto", "l-bfgs", "normal").contains(value),
  s"Solver $value was not supported. Supported options: auto, l-bfgs, 
normal")
set(solver, value)
{code}

MLPC doesn't even inherit {{HasSolver}}, and creates its own solver param with 
supportedSolvers:
{code}
  final val solver: Param[String] = new Param[String](this, "solver",
    "The solver algorithm for optimization. Supported options: " +
      s"${MultilayerPerceptronClassifier.supportedSolvers.mkString(", ")}. (Default l-bfgs)",
    ParamValidators.inArray[String](MultilayerPerceptronClassifier.supportedSolvers))
{code}
It may be reasonable to modify {{HasSolver}} after 2.1.
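A hedged sketch of the direction this suggests, i.e. a shared trait that validates against a caller-supplied list of allowed values (the trait and member names are hypothetical, not the merged change):

{code}
import org.apache.spark.ml.param.{Param, Params, ParamValidators}

trait HasValidatedSolver extends Params {
  // Implementors supply the allowed values, e.g. Array("auto", "l-bfgs", "normal").
  protected def supportedSolvers: Array[String]

  final val solver: Param[String] = new Param[String](this, "solver",
    s"The solver algorithm for optimization. Supported options: ${supportedSolvers.mkString(", ")}.",
    ParamValidators.inArray[String](supportedSolvers))

  def setSolver(value: String): this.type = set(solver, value)
}
{code}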






[jira] [Commented] (SPARK-18434) Add missing ParamValidations for ML algos

2016-11-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682601#comment-15682601
 ] 

Apache Spark commented on SPARK-18434:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/15955

> Add missing ParamValidations for ML algos
> -
>
> Key: SPARK-18434
> URL: https://issues.apache.org/jira/browse/SPARK-18434
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.1.0
>
>
> Add ParamValidations to make the following lines fail:
> {code}
> scala> val idf = new IDF().setMinDocFreq(-100)
> idf: org.apache.spark.ml.feature.IDF = idf_4d2e6a4f2361
> scala> val pca = new PCA().setK(-100)
> pca: org.apache.spark.ml.feature.PCA = pca_7b22fbec5e97
> scala> val w2v = new Word2Vec().setVectorSize(-100)
> w2v: org.apache.spark.ml.feature.Word2Vec = w2v_06be869a20d9
> scala> val iso = new IsotonicRegression().setFeatureIndex(-100)
> iso: org.apache.spark.ml.regression.IsotonicRegression = isoReg_b9c59e9b6cbd
> scala> val lir = new LinearRegression().setSolver("1234")
> lir: org.apache.spark.ml.regression.LinearRegression = linReg_4e3c1c5e2904
> scala> val rfc = new RandomForestClassifier().setMinInfoGain(-100)
> rfc: org.apache.spark.ml.classification.RandomForestClassifier = 
> rfc_6db27a737216
> {code}






[jira] [Commented] (SPARK-17591) Fix/investigate the failure of tests in Scala On Windows

2016-11-20 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682578#comment-15682578
 ] 

Hyukjin Kwon commented on SPARK-17591:
--

I will close this when I am able to proceed further with the tests on Windows and 
see more error logs than the ones described in the description.

> Fix/investigate the failure of tests in Scala On Windows
> 
>
> Key: SPARK-17591
> URL: https://issues.apache.org/jira/browse/SPARK-17591
> Project: Spark
>  Issue Type: Test
>  Components: Build, DStreams, Spark Core, SQL
>Reporter: Hyukjin Kwon
>
> {code}
> Tests run: 90, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.53 sec 
> <<< FAILURE! - in org.apache.spark.JavaAPISuite
> wholeTextFiles(org.apache.spark.JavaAPISuite)  Time elapsed: 0.313 sec  <<< 
> FAILURE!
> java.lang.AssertionError: 
> expected: > but was:
>   at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1089)
> {code}
> {code}
> Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.062 sec <<< 
> FAILURE! - in org.apache.spark.launcher.SparkLauncherSuite
> testChildProcLauncher(org.apache.spark.launcher.SparkLauncherSuite)  Time 
> elapsed: 0.047 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<1>
>   at 
> org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher(SparkLauncherSuite.java:177)
> {code}
> {code}
> Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec 
> <<< FAILURE! - in org.apache.spark.streaming.JavaAPISuite
> testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite)  Time 
> elapsed: 3.418 sec  <<< ERROR!
> java.io.IOException: Failed to delete: 
> C:\projects\spark\streaming\target\tmp\1474255953021-0
>   at 
> org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808)
> Running org.apache.spark.streaming.JavaDurationSuite
> {code}
> {code}
> Running org.apache.spark.streaming.JavaAPISuite
> Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec 
> <<< FAILURE! - in org.apache.spark.streaming.JavaAPISuite
> testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite)  Time 
> elapsed: 3.418 sec  <<< ERROR!
> java.io.IOException: Failed to delete: 
> C:\projects\spark\streaming\target\tmp\1474255953021-0
>   at 
> org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808)
> {code}
> {code}
> Results :
> Tests in error: 
>   JavaAPISuite.testCheckpointMasterRecovery:1808 � IO Failed to delete: 
> C:\proje...
> Tests run: 74, Failures: 0, Errors: 1, Skipped: 0
> {code}
> The tests were aborted for an unknown reason during the SQL tests, with 
> {{BroadcastJoinSuite}} emitting the exceptions below continuously:
> {code}
> 20:48:09.876 ERROR org.apache.spark.deploy.worker.ExecutorRunner: Error 
> running executor
> java.io.IOException: Cannot run program "C:\Progra~1\Java\jdk1.8.0\bin\java" 
> (in directory "C:\projects\spark\work\app-20160918204809-\0"): 
> CreateProcess error=206, The filename or extension is too long
>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
>   at 
> org.apache.spark.deploy.worker.ExecutorRunner.org$apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:167)
>   at 
> org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)
> Caused by: java.io.IOException: CreateProcess error=206, The filename or 
> extension is too long
>   at java.lang.ProcessImpl.create(Native Method)
>   at java.lang.ProcessImpl.(ProcessImpl.java:386)
>   at java.lang.ProcessImpl.start(ProcessImpl.java:137)
>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
>   ... 2 more
> {code}
> Here is the full log for the test - 
> https://ci.appveyor.com/project/spark-test/spark/build/15-scala-tests
> We may have to create sub-tasks if these are actual issues on Windows rather 
> than just mistakes in tests.
> I am willing to test this again after fixing some issues here, in particular 
> the last one.
> I triggered the build with the commands below:
> {code}
> mvn -DskipTests -Phadoop-2.6 -Phive -Phive-thriftserver package
> mvn -Phadoop-2.6 -Phive -Phive-thriftserver --fail-never test
> {code}






[jira] [Assigned] (SPARK-18516) Separate instantaneous state from progress performance statistics

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18516:


Assignee: Michael Armbrust  (was: Apache Spark)

> Separate instantaneous state from progress performance statistics
> -
>
> Key: SPARK-18516
> URL: https://issues.apache.org/jira/browse/SPARK-18516
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>
> There are two types of information that you want to be able to extract from a 
> running query: instantaneous _status_ and metrics about performance as the 
> query makes _progress_.
> Today, these are conflated in a single {{StreamingQueryStatus}} object.  The 
> downside to this approach is that a user now needs to reason about what state 
> the query is in any time they retrieve a status object.  Fields like 
> {{statusMessage}} don't appear in updates that come from the listener bus.  
> Similarly, {{inputRate}}/{{processingRate}} statistics are usually {{0}} when 
> you retrieve a status object from the query itself.
> I propose we make the following changes:
>  - Make {{status}} report only instantaneous things, such as whether data is 
> available or a human-readable message about what phase we are currently in.
>  - Have a separate {{progress}} message that we report for each trigger with 
> the other performance information that lives in status today.  You should be 
> able to easily retrieve a configurable number of the most recent progress 
> messages instead of just the most recent one.
> While we are making these changes, I propose that we also change {{id}} to be 
> a globally unique identifier, rather than a JVM-unique one.  Without this it's 
> hard to correlate performance across restarts.
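From the user side, the proposed split could look like the hedged sketch below (the method names reflect where the API later landed and are assumptions here, not text from this ticket; assumes a streaming DataFrame {{streamingDF}}):

{code}
val query = streamingDF.writeStream
  .format("console")
  .start()

query.status           // instantaneous: message, whether data is available, trigger activity
query.lastProgress     // per-trigger performance metrics (rates, durations, sources)
query.recentProgress   // a bounded window of the most recent progress updates
{code}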






[jira] [Assigned] (SPARK-18516) Separate instantaneous state from progress performance statistics

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18516:


Assignee: Apache Spark  (was: Michael Armbrust)

> Separate instantaneous state from progress performance statistics
> -
>
> Key: SPARK-18516
> URL: https://issues.apache.org/jira/browse/SPARK-18516
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>Priority: Blocker
>
> There are two types of information that you want to be able to extract from a 
> running query: instantaneous _status_ and metrics about performance as the 
> query makes _progress_.
> Today, these are conflated in a single {{StreamingQueryStatus}} object.  The 
> downside to this approach is that a user now needs to reason about what state 
> the query is in any time they retrieve a status object.  Fields like 
> {{statusMessage}} don't appear in updates that come from the listener bus.  
> Similarly, {{inputRate}}/{{processingRate}} statistics are usually {{0}} when 
> you retrieve a status object from the query itself.
> I propose we make the following changes:
>  - Make {{status}} report only instantaneous things, such as whether data is 
> available or a human-readable message about what phase we are currently in.
>  - Have a separate {{progress}} message that we report for each trigger with 
> the other performance information that lives in status today.  You should be 
> able to easily retrieve a configurable number of the most recent progress 
> messages instead of just the most recent one.
> While we are making these changes, I propose that we also change {{id}} to be 
> a globally unique identifier, rather than a JVM-unique one.  Without this it's 
> hard to correlate performance across restarts.






[jira] [Commented] (SPARK-18516) Separate instantaneous state from progress performance statistics

2016-11-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682564#comment-15682564
 ] 

Apache Spark commented on SPARK-18516:
--

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/15954

> Separate instantaneous state from progress performance statistics
> -
>
> Key: SPARK-18516
> URL: https://issues.apache.org/jira/browse/SPARK-18516
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>
> There are two types of information that you want to be able to extract from a 
> running query: instantaneous _status_ and metrics about the performance as we 
> make _progress_ in query processing.
> Today, these are conflated in a single {{StreamingQueryStatus}} object.  The 
> downside to this approach is that a user now needs to reason about what state 
> the query is in anytime they retrieve a status object.  Fields like 
> {{statusMessage}} don't appear in updates that come from listener bus.  
> Similarly, {{inputRate}}/{{processingRate}} statistics are usually {{0}} when 
> you retrieve a status object from the query itself.
> I propose we make the following changes:
>  - Make {{status}} only report instantaneous things, such as if data is 
> available or a human readable message about what phase we are currently in.
>  - Have a separate {{progress}} message that we report for each trigger with 
> the other performance information that lives in status today.  You should be 
> able to easily retrieve a configurable number of the most recent progress 
> messages instead of just the most recent.
> While we are making these changes, I propose that we also change {{id}} to be 
> a globally unique identifier, rather than a JVM unique one.  Without this it's 
> hard to correlate performance across restarts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18517) DROP TABLE IF EXISTS should not warn for non-existing tables

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18517:


Assignee: Apache Spark

> DROP TABLE IF EXISTS should not warn for non-existing tables
> 
>
> Key: SPARK-18517
> URL: https://issues.apache.org/jira/browse/SPARK-18517
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, `DROP TABLE IF EXISTS` shows a warning for non-existing tables. 
> However, by the definition of the command, it should be quiet in this case.
> {code}
> scala> sql("DROP TABLE IF EXISTS nonexist")
> 16/11/20 20:48:26 WARN DropTableCommand: 
> org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 
> 'nonexist' not found in database 'default';
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18517) DROP TABLE IF EXISTS should not warn for non-existing tables

2016-11-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682523#comment-15682523
 ] 

Apache Spark commented on SPARK-18517:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15953

> DROP TABLE IF EXISTS should not warn for non-existing tables
> 
>
> Key: SPARK-18517
> URL: https://issues.apache.org/jira/browse/SPARK-18517
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Currently, `DROP TABLE IF EXISTS` shows a warning for non-existing tables. 
> However, by the definition of the command, it should be quiet in this case.
> {code}
> scala> sql("DROP TABLE IF EXISTS nonexist")
> 16/11/20 20:48:26 WARN DropTableCommand: 
> org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 
> 'nonexist' not found in database 'default';
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18517) DROP TABLE IF EXISTS should not warn for non-existing tables

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18517:


Assignee: (was: Apache Spark)

> DROP TABLE IF EXISTS should not warn for non-existing tables
> 
>
> Key: SPARK-18517
> URL: https://issues.apache.org/jira/browse/SPARK-18517
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Currently, `DROP TABLE IF EXISTS` shows a warning for non-existing tables. 
> However, by the definition of the command, it should be quiet in this case.
> {code}
> scala> sql("DROP TABLE IF EXISTS nonexist")
> 16/11/20 20:48:26 WARN DropTableCommand: 
> org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 
> 'nonexist' not found in database 'default';
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18517) DROP TABLE IF EXISTS should not warn for non-existing tables

2016-11-20 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-18517:
-

 Summary: DROP TABLE IF EXISTS should not warn for non-existing 
tables
 Key: SPARK-18517
 URL: https://issues.apache.org/jira/browse/SPARK-18517
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
Reporter: Dongjoon Hyun
Priority: Minor


Currently, `DROP TABLE IF EXISTS` shows a warning for non-existing tables. 
However, by the definition of the command, it should be quiet in this case.

{code}
scala> sql("DROP TABLE IF EXISTS nonexist")
16/11/20 20:48:26 WARN DropTableCommand: 
org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 
'nonexist' not found in database 'default';
{code}
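
For reference, the user-visible contract after a fix would be roughly the following. This is an illustrative sketch of the expected behavior only, not the actual patch:

{code}
// Illustrative expectation, not the actual fix:
spark.sql("DROP TABLE IF EXISTS nonexist")  // should complete silently, with no WARN log
spark.sql("DROP TABLE nonexist")            // should still fail, since IF EXISTS was not given
{code}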



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18023) Adam optimizer

2016-11-20 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682491#comment-15682491
 ] 

Vincent commented on SPARK-18023:
-

Thanks [~mlnick], that's really what we need. When I wrote the code for 
Adagrad, I did find some conflicts with the original design. These new 
optimizers do not share a common API with what we have now in mllib, and they 
also follow a different workflow, so it's hard to fit them in and make a good 
PR without changing the original design. For now I just made a separate 
package instead.

> Adam optimizer
> --
>
> Key: SPARK-18023
> URL: https://issues.apache.org/jira/browse/SPARK-18023
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Vincent
>Priority: Minor
>
> SGD methods can converge very slowly, or even diverge, if the learning rate 
> alpha is set inappropriately. Many alternative methods have been proposed to 
> produce desirable convergence with less dependence on hyperparameter settings 
> and to help escape local optima, e.g. Momentum, NAG (Nesterov's Accelerated 
> Gradient), Adagrad, RMSProp, etc.
> Among these, Adam is a popular algorithm for first-order gradient-based 
> optimization of stochastic objective functions. It has proven to be well 
> suited for problems with large data and/or parameters and for problems with 
> noisy and/or sparse gradients, and it is computationally efficient. Refer to 
> this paper for details.
> In fact, TensorFlow has implemented most of the adaptive optimization methods 
> mentioned, and we have seen Adam outperform most SGD methods in certain 
> cases, such as a very sparse dataset in an FM model.
> It would be nice for Spark to have these adaptive optimization methods. 
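
For reference, the Adam update rule itself is compact. Below is a standalone Scala sketch of the per-step update with bias-corrected moment estimates; it is illustrative only and not tied to any existing MLlib optimizer interface.

{code}
// Minimal sketch of the Adam update rule (illustrative; not an MLlib API).
class AdamSketch(dim: Int, alpha: Double = 0.001,
                 beta1: Double = 0.9, beta2: Double = 0.999, eps: Double = 1e-8) {
  private val m = Array.fill(dim)(0.0)  // first moment estimate
  private val v = Array.fill(dim)(0.0)  // second moment estimate
  private var t = 0                     // time step

  def step(weights: Array[Double], grad: Array[Double]): Unit = {
    t += 1
    var i = 0
    while (i < dim) {
      m(i) = beta1 * m(i) + (1 - beta1) * grad(i)
      v(i) = beta2 * v(i) + (1 - beta2) * grad(i) * grad(i)
      val mHat = m(i) / (1 - math.pow(beta1, t))  // bias correction
      val vHat = v(i) / (1 - math.pow(beta2, t))
      weights(i) -= alpha * mHat / (math.sqrt(vHat) + eps)
      i += 1
    }
  }
}
{code}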



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18023) Adam optimizer

2016-11-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682471#comment-15682471
 ] 

Nick Pentreath commented on SPARK-18023:


Linking SPARK-17136 which is really a blocker for adding any optimization 
methods. We first need to design a good API for pluggable optimizers, then work 
on adding some more advanced options. We can take a look at other libs in R, 
Python and e.g. TensorFlow to get some ideas on how they have designed these 
interfaces.

> Adam optimizer
> --
>
> Key: SPARK-18023
> URL: https://issues.apache.org/jira/browse/SPARK-18023
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Vincent
>Priority: Minor
>
> SGD methods can converge very slowly, or even diverge, if the learning rate 
> alpha is set inappropriately. Many alternative methods have been proposed to 
> produce desirable convergence with less dependence on hyperparameter settings 
> and to help escape local optima, e.g. Momentum, NAG (Nesterov's Accelerated 
> Gradient), Adagrad, RMSProp, etc.
> Among these, Adam is a popular algorithm for first-order gradient-based 
> optimization of stochastic objective functions. It has proven to be well 
> suited for problems with large data and/or parameters and for problems with 
> noisy and/or sparse gradients, and it is computationally efficient. Refer to 
> this paper for details.
> In fact, TensorFlow has implemented most of the adaptive optimization methods 
> mentioned, and we have seen Adam outperform most SGD methods in certain 
> cases, such as a very sparse dataset in an FM model.
> It would be nice for Spark to have these adaptive optimization methods. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16377) Spark MLlib: MultilayerPerceptronClassifier - error while training

2016-11-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682458#comment-15682458
 ] 

Nick Pentreath commented on SPARK-16377:


Is this still a bug? As per your comment above, it seems we can close this as 
"Not an Issue"?

> Spark MLlib: MultilayerPerceptronClassifier - error while training
> --
>
> Key: SPARK-16377
> URL: https://issues.apache.org/jira/browse/SPARK-16377
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.5.2
>Reporter: Mikhail Shiryaev
>
> Hi, 
> I am trying to train a model with MultilayerPerceptronClassifier. 
> It works on sample data from 
> data/mllib/sample_multiclass_classification_data.txt with 4 features, 3 
> classes and layers [4, 4, 3]. 
> But when I try to use other input files with other features and classes (from 
> here for example: 
> https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html) 
> then I get errors. 
> Example: 
> Input file aloi (128 features, 1000 classes, layers [128, 128, 1000]): 
> with block size = 1: 
> ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. 
> Decreasing step size to Infinity 
> ERROR LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: 
> Line search failed 
> ERROR LBFGS: Failure again! Giving up and returning. Maybe the objective is 
> just poorly behaved? 
> with default block size = 128: 
>  java.lang.ArrayIndexOutOfBoundsException 
>   at java.lang.System.arraycopy(Native Method) 
>   at 
> org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:629)
>  
>   at 
> org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:628)
>  
>at scala.collection.immutable.List.foreach(List.scala:381) 
>at 
> org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3.apply(Layer.scala:628)
>  
>at 
> org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3.apply(Layer.scala:624)
>  
> Even if I modify the sample_multiclass_classification_data.txt file (rename 
> all 4th features to 5th) and run with layers [5, 5, 3], I also get the same 
> errors as for the file above. 
> So to summarize: 
> I can't run training with the default block size and more than 4 features. 
> If I set the block size to 1, then some processing happens but I get errors 
> from LBFGS. 
> It is reproducible with Spark 1.5.2 and with the master branch on GitHub (as 
> of July 4th). 
> Has anybody already run into this behavior? 
> Is there a bug in MultilayerPerceptronClassifier, or am I using it incorrectly? 
> Thanks.
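
For anyone trying to reproduce this, a minimal setup along the lines of the report might look like the sketch below. It is written against the 2.x ML API; the data path is a placeholder and the layer sizes match the aloi case described above.

{code}
// Reproduction sketch only; the path is a placeholder for the aloi LIBSVM file.
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val data = spark.read.format("libsvm").load("/path/to/aloi.libsvm")
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(128, 128, 1000))  // input features, hidden layer, output classes
  .setBlockSize(1)                   // the report sees different failures for 1 vs the default 128
  .setMaxIter(100)
val model = mlp.fit(data)            // this is where the reported errors occur
{code}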



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6346) Use faster converging optimization method in MLlib

2016-11-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682455#comment-15682455
 ] 

Nick Pentreath commented on SPARK-6346:
---

I think we can close this ticket? It's pretty old, and everything in {{ml}} 
that can use L-BFGS now does, yes?

> Use faster converging optimization method in MLlib
> --
>
> Key: SPARK-6346
> URL: https://issues.apache.org/jira/browse/SPARK-6346
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Reza Zadeh
>
> According to experiments in SPARK-1503, the LBFGS algorithm converges much 
> faster than our current proximal gradient, which is used throughout MLlib. 
> This ticket is to track replacing slower-converging algorithms with faster 
> components, e.g. LBFGS.
> This needs unification of the Optimization interface. For example, the LBFGS 
> implementation should not know about RDDs.
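
As a point of reference for this discussion, the L-BFGS-backed path that already exists in mllib can be exercised as in the sketch below; the data path is a placeholder.

{code}
// Example of the existing L-BFGS-backed classifier in mllib; the path is a placeholder.
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

val training = MLUtils.loadLibSVMFile(sc, "/path/to/training.libsvm")
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training)
{code}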



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-12978) Skip unnecessary final group-by when input data already clustered with group-by keys

2016-11-20 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-12978:
-

> Skip unnecessary final group-by when input data already clustered with 
> group-by keys
> 
>
> Key: SPARK-12978
> URL: https://issues.apache.org/jira/browse/SPARK-12978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>
> This ticket targets the optimization to skip an unnecessary group-by 
> operation below:
> Without opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
>  output=[col0#159,sum#200,sum#201,count#202L])
>+- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
> InMemoryRelation [col0#159,col1#160,col2#161], true, 1, 
> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
> {code}
> With opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
> [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
> true, 1), ConvertToUnsafe, None
> {code}
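
A query of the shape that produces the plans above can be written as in the sketch below; the column names are taken from the plans, while the input data and its caching are assumptions for illustration.

{code}
// Illustrative query matching the plan shape above; the input data is made up.
import org.apache.spark.sql.functions.{avg, sum}

val df = spark.range(1000).selectExpr("id % 10 AS col0", "id AS col1", "id * 2 AS col2")
df.cache()  // the plans above scan an InMemoryRelation

val agg = df.groupBy("col0").agg(sum("col1"), avg("col2"))
agg.explain()
// Without the optimization: partial aggregate -> exchange -> final aggregate.
// With it, when the input is already clustered by col0, a single Complete
// aggregate replaces the partial/final pair.
{code}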



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18467) Refactor StaticInvoke, Invoke and NewInstance.

2016-11-20 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18467.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Refactor StaticInvoke, Invoke and NewInstance.
> --
>
> Key: SPARK-18467
> URL: https://issues.apache.org/jira/browse/SPARK-18467
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 2.1.0
>
>
> Refactor {{StaticInvoke}}, {{Invoke}} and {{NewInstance}} as:
> - Introduce {{InvokeLike}} to extract common logic from {{StaticInvoke}}, 
> {{Invoke}} and {{NewInstance}} to prepare arguments.
> - Remove unneeded null checking and fix nullability of {{NewInstance}}.
> - Modify to short-circuit if arguments have {{null}} when {{propagateNull == 
> true}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18467) Refactor StaticInvoke, Invoke and NewInstance.

2016-11-20 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-18467:

Assignee: Takuya Ueshin

> Refactor StaticInvoke, Invoke and NewInstance.
> --
>
> Key: SPARK-18467
> URL: https://issues.apache.org/jira/browse/SPARK-18467
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 2.1.0
>
>
> Refactor {{StaticInvoke}}, {{Invoke}} and {{NewInstance}} as:
> - Introduce {{InvokeLike}} to extract common logic from {{StaticInvoke}}, 
> {{Invoke}} and {{NewInstance}} to prepare arguments.
> - Remove unneeded null checking and fix nullability of {{NewInstance}}.
> - Modify to short-circuit if arguments have {{null}} when {{propagateNull == 
> true}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2941) Add config option to support NIO vs OIO in Netty network module

2016-11-20 Thread Qiao Yifan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682355#comment-15682355
 ] 

Qiao Yifan commented on SPARK-2941:
---

Where is the config file located? I can't find it.

> Add config option to support NIO vs OIO in Netty network module
> ---
>
> Key: SPARK-2941
> URL: https://issues.apache.org/jira/browse/SPARK-2941
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.1.0
>
>
> Add config option spark.shuffle.io.mode
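
In answer to the question above: there is no single file that lists Spark configuration keys. A property such as the one proposed here would be set programmatically via SparkConf (sketch below), with --conf on spark-submit, or in conf/spark-defaults.conf. The key name is taken from this issue's description; the accepted values are an assumption.

{code}
// Illustrative only; the key comes from this issue and the value is an assumption.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("shuffle-io-mode-example")
  .set("spark.shuffle.io.mode", "nio")  // e.g. "nio" vs "oio", per this issue's proposal
{code}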



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18514) Fix `Note:`/`NOTE:`/`Note that` across R API documentation

2016-11-20 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-18514:
-
Priority: Trivial  (was: Minor)

> Fix `Note:`/`NOTE:`/`Note that` across R API documentation
> --
>
> Key: SPARK-18514
> URL: https://issues.apache.org/jira/browse/SPARK-18514
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> Same contents as the parent, but only for R.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18514) Fix `Note:`/`NOTE:`/`Note that` across R API documentation

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18514:


Assignee: Apache Spark

> Fix `Note:`/`NOTE:`/`Note that` across R API documentation
> --
>
> Key: SPARK-18514
> URL: https://issues.apache.org/jira/browse/SPARK-18514
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> Same contents as the parent, but only for R.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18514) Fix `Note:`/`NOTE:`/`Note that` across R API documentation

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18514:


Assignee: (was: Apache Spark)

> Fix `Note:`/`NOTE:`/`Note that` across R API documentation
> --
>
> Key: SPARK-18514
> URL: https://issues.apache.org/jira/browse/SPARK-18514
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Same contents as the parent, but only for R.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18514) Fix `Note:`/`NOTE:`/`Note that` across R API documentation

2016-11-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682286#comment-15682286
 ] 

Apache Spark commented on SPARK-18514:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15952

> Fix `Note:`/`NOTE:`/`Note that` across R API documentation
> --
>
> Key: SPARK-18514
> URL: https://issues.apache.org/jira/browse/SPARK-18514
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Same contents as the parent, but only for R.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18516) Separate instantaneous state from progress performance statistics

2016-11-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18516:
-
Description: 
There are two types of information that you want to be able to extract from a 
running query: instantaneous _status_ and metrics about the performance as we make 
_progress_ in query processing.

Today, these are conflated in a single {{StreamingQueryStatus}} object.  The 
downside to this approach is that a user now needs to reason about what state 
the query is in anytime they retrieve a status object.  Fields like 
{{statusMessage}} don't appear in updates that come from listener bus.  And 
inputRate/processingRate statistics are usually {{0}} when you retrieve a 
status object from the query itself.

I propose we make the following changes:
 - Make {{status}} only report instantaneous things, such as if data is 
available or a human readable message about what phase we are currently in.
 - Have a separate {{progress}} message that we report for each trigger with 
the other performance information that lives in status today.  You should be 
able to easily retrieve a configurable number of the most recent progress 
messages instead of just the most recent.

While we are making these changes, I propose that we also change {{id}} to be a 
globally unique identifier, rather than a JVM unique one.  Without this it's 
hard to correlate performance across restarts.

  was:
There are two types of information that you want to be able to extract from a 
running query: instantaneous _status_ and metrics about the performance as we make 
_progress_ in query processing.

Today, these are conflated in a single {{StreamingQueryStatus}} object.  The 
downside to this approach is that a user now needs to reason about what state 
the query is in anytime they retrieve a status object.  Fields like 
{{statusMessage}} don't appear in messages that come from listener bus.  And 
inputRate/processingRate statistics are usually {{0}} when you retrieve a 
status object from the query itself.

I propose we make the following changes:
 - Make {{status}} only report instantaneous things, such as if data is 
available or a human readable message about what phase we are currently in.
 - Have a separate {{progress}} message that we report for each trigger with 
the other performance information that lives in status today.  You should be 
able to easily retrieve a configurable number of the most recent progress 
messages instead of just the most recent.

While we are making these changes, I propose that we also change {{id}} to be a 
globally unique identifier, rather than a JVM unique one.  Without this it's 
hard to correlate performance across restarts.


> Separate instantaneous state from progress performance statistics
> -
>
> Key: SPARK-18516
> URL: https://issues.apache.org/jira/browse/SPARK-18516
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>
> There are two types of information that you want to be able to extract from a 
> running query: instantaneous _status_ and metrics about the performance as we 
> make _progress_ in query processing.
> Today, these are conflated in a single {{StreamingQueryStatus}} object.  The 
> downside to this approach is that a user now needs to reason about what state 
> the query is in anytime they retrieve a status object.  Fields like 
> {{statusMessage}} don't appear in updates that come from listener bus.  And 
> inputRate/processingRate statistics are usually {{0}} when you retrieve a 
> status object from the query itself.
> I propose we make the following changes:
>  - Make {{status}} only report instantaneous things, such as if data is 
> available or a human readable message about what phase we are currently in.
>  - Have a separate {{progress}} message that we report for each trigger with 
> the other performance information that lives in status today.  You should be 
> able to easily retrieve a configurable number of the most recent progress 
> messages instead of just the most recent.
> While we are making these changes, I propose that we also change {{id}} to be 
> a globally unique identifier, rather than a JVM unique one.  Without this it's 
> hard to correlate performance across restarts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18516) Separate instantaneous state from progress performance statistics

2016-11-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18516:
-
Description: 
There are two types of information that you want to be able to extract from a 
running query: instantaneous _status_ and metrics about the performance as we make 
_progress_ in query processing.

Today, these are conflated in a single {{StreamingQueryStatus}} object.  The 
downside to this approach is that a user now needs to reason about what state 
the query is in anytime they retrieve a status object.  Fields like 
{{statusMessage}} don't appear in updates that come from listener bus.  
Similarly, {{inputRate}}/{{processingRate}} statistics are usually {{0}} when 
you retrieve a status object from the query itself.

I propose we make the following changes:
 - Make {{status}} only report instantaneous things, such as if data is 
available or a human readable message about what phase we are currently in.
 - Have a separate {{progress}} message that we report for each trigger with 
the other performance information that lives in status today.  You should be 
able to easily retrieve a configurable number of the most recent progress 
messages instead of just the most recent.

While we are making these changes, I propose that we also change {{id}} to be a 
globally unique identifier, rather than a JVM unique one.  Without this it's 
hard to correlate performance across restarts.

  was:
There are two types of information that you want to be able to extract from a 
running query: instantaneous _status_ and metrics about the performance as we make 
_progress_ in query processing.

Today, these are conflated in a single {{StreamingQueryStatus}} object.  The 
downside to this approach is that a user now needs to reason about what state 
the query is in anytime they retrieve a status object.  Fields like 
{{statusMessage}} don't appear in updates that come from listener bus.  And 
inputRate/processingRate statistics are usually {{0}} when you retrieve a 
status object from the query itself.

I propose we make the following changes:
 - Make {{status}} only report instantaneous things, such as if data is 
available or a human readable message about what phase we are currently in.
 - Have a separate {{progress}} message that we report for each trigger with 
the other performance information that lives in status today.  You should be 
able to easily retrieve a configurable number of the most recent progress 
messages instead of just the most recent.

While we are making these changes, I propose that we also change {{id}} to be a 
globally unique identifier, rather than a JVM unique one.  Without this it's 
hard to correlate performance across restarts.


> Separate instantaneous state from progress performance statistics
> -
>
> Key: SPARK-18516
> URL: https://issues.apache.org/jira/browse/SPARK-18516
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>
> There are two types of information that you want to be able to extract from a 
> running query: instantaneous _status_ and metrics about the performance as we 
> make _progress_ in query processing.
> Today, these are conflated in a single {{StreamingQueryStatus}} object.  The 
> downside to this approach is that a user now needs to reason about what state 
> the query is in anytime they retrieve a status object.  Fields like 
> {{statusMessage}} don't appear in updates that come from listener bus.  
> Similarly, {{inputRate}}/{{processingRate}} statistics are usually {{0}} when 
> you retrieve a status object from the query itself.
> I propose we make the following changes:
>  - Make {{status}} only report instantaneous things, such as if data is 
> available or a human readable message about what phase we are currently in.
>  - Have a separate {{progress}} message that we report for each trigger with 
> the other performance information that lives in status today.  You should be 
> able to easily retrieve a configurable number of the most recent progress 
> messages instead of just the most recent.
> While we are making these changes, I propose that we also change {{id}} to be 
> a globally unique identifier, rather than a JVM unique one.  Without this it's 
> hard to correlate performance across restarts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18506) kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-11-20 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682216#comment-15682216
 ] 

Cody Koeninger commented on SPARK-18506:


I tried your example code on an AWS 2-node spark standalone cluster, still not 
able to reproduce the issue.

[ec2-user@ip-10-0-2-58 spark-2.0.2-bin-hadoop2.7]$ ./bin/spark-submit --master 
spark://ip-10-0-2-58.ec2.internal:7077 --class example.SimpleKafkaLoggingDriver 
~/kafka-example-assembly-2.0.0.jar 10.0.2.96:9092 simple_logtest mygroup 
earliest

16/11/21 01:41:31 INFO JobScheduler: Added jobs for time 147969249 ms
simple_logtest 3 offsets: 0 to 62
simple_logtest 0 offsets: 0 to 61
simple_logtest 1 offsets: 0 to 62
simple_logtest 2 offsets: 0 to 61
simple_logtest 4 offsets: 0 to 62
16/11/21 01:41:31 INFO JobScheduler: Finished job streaming job 147969249 
ms.0 from job set of time 147969249 ms
16/11/21 01:41:31 INFO ReceivedBlockTracker: Deleting batches: 
16/11/21 01:41:31 INFO JobScheduler: Total delay: 1.946 s for time 
147969249 ms (execution: 0.009 s)
16/11/21 01:41:32 INFO InputInfoTracker: remove old batch metadata: 
simple_logtest 3 offsets: 62 to 62
simple_logtest 0 offsets: 61 to 61
simple_logtest 1 offsets: 62 to 62
simple_logtest 2 offsets: 61 to 61
simple_logtest 4 offsets: 62 to 62
16/11/21 01:41:35 INFO JobScheduler: Starting job streaming job 1479692495000 
ms.0 from job set of time 1479692495000 ms

What happens when you use ConsumerStrategies.Assign to start at 0 for the 
partitions in question?
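
For concreteness, the suggestion above (explicitly assigning the partitions and starting each at offset 0) might look like the sketch below. The broker address, topic, group id and partition count are taken from the report, and `ssc` is an existing StreamingContext that is assumed here.

{code}
// Sketch of ConsumerStrategies.Assign with explicit starting offsets; ssc is assumed.
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val topic = "simple_logtest"
val partitions = (0 until 5).map(new TopicPartition(topic, _))
val fromOffsets = partitions.map(tp => tp -> 0L).toMap
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "10.0.2.96:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "mygroup")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,  // existing StreamingContext, assumed
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Assign[String, String](partitions, kafkaParams, fromOffsets))
{code}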

> kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a 
> single partition on a multi partition topic
> ---
>
> Key: SPARK-18506
> URL: https://issues.apache.org/jira/browse/SPARK-18506
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Problem occurs both in Hadoop/YARN 2.7.3 and Spark 
> standalone mode 2.0.2 
> with Kafka 0.10.1.0.   
>Reporter: Heji Kim
>
> Our team is trying to upgrade to Spark 2.0.2/Kafka 
> 0.10.1.0/spark-streaming-kafka-0-10_2.11 (v 2.0.2) and we cannot get our 
> drivers to read all partitions of a single stream when kafka 
> auto.offset.reset=earliest running on a real cluster (separate VM nodes). 
> When we run our drivers with auto.offset.reset=latest ingesting from a single 
> kafka topic with multiple partitions (usually 10 but problem shows up  with 
> only 3 partitions), the driver reads correctly from all partitions.  
> Unfortunately, we need "earliest" for exactly once semantics.
> In the same kafka 0.10.1.0/spark 2.x setup, our legacy driver using 
> spark-streaming-kafka-0-8_2.11 with the prior setting 
> auto.offset.reset=smallest runs correctly.
> We have tried the following configurations in trying to isolate our problem 
> but it is only auto.offset.reset=earliest on a "real multi-machine cluster" 
> which causes this problem.
> 1. Ran with spark standalone cluster(4 Debian nodes, 8vCPU/30GB each)  
> instead of YARN 2.7.3. Single partition read problem persists both cases. 
> Please note this problem occurs on an actual cluster of separate VM nodes 
> (but not when our engineer runs it as a cluster on his own Mac).
> 2. Ran with spark 2.1 nightly build for the last 10 days. Problem persists.
> 3. Turned off checkpointing. Problem persists with or without checkpointing.
> 4. Turned off backpressure. Problem persists with or without backpressure.
> 5. Tried both partition.assignment.strategy RangeAssignor and 
> RoundRobinAssignor. Broken with both.
> 6. Tried both LocationStrategies (PreferConsistent/PreferFixed). Broken with 
> both.
> 7. Tried the simplest scala driver that only logs.  (Our team uses java.) 
> Broken with both.
> 8. Tried increasing GCE capacity for cluster but already we were highly 
> overprovisioned for cores and memory. Also tried ramping up executors and 
> cores.  Since driver works with auto.offset.reset=latest, we have ruled out 
> GCP cloud infrastructure issues.
> When we turn on the debug logs, we sometimes see partitions being set to 
> different offset configuration even though the consumer config correctly 
> indicates auto.offset.reset=earliest. 
> {noformat}
> 8 DEBUG Resetting offset for partition simple_test-8 to earliest offset. 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 DEBUG Resetting offset for partition simple_test-9 to latest offset. 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 8 TRACE Sending ListOffsetRequest 
> {replica_id=-1,topics=[{topic=simple_test,partitions=[{partition=8,timestamp=-2}]}]}
>  to broker 10.102.20.12:9092 (id: 12 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 TRACE Sending ListOffsetRequest 
> 

[jira] [Created] (SPARK-18516) Separate instantaneous state from progress performance statistics

2016-11-20 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-18516:


 Summary: Separate instantaneous state from progress performance 
statistics
 Key: SPARK-18516
 URL: https://issues.apache.org/jira/browse/SPARK-18516
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker


There are two types of information that you want to be able to extract from a 
running query: instantaneous _status_ and metrics about the performance as we make 
_progress_ in query processing.

Today, these are conflated in a single {{StreamingQueryStatus}} object.  The 
downside to this approach is that a user now needs to reason about what state 
the query is in anytime they retrieve a status object.  Fields like 
{{statusMessage}} don't appear in messages that come from listener bus.  And 
inputRate/processingRate statistics are usually {{0}} when you retrieve a 
status object from the query itself.

I propose we make the following changes:
 - Make {{status}} only report instantaneous things, such as if data is 
available or a human readable message about what phase we are currently in.
 - Have a separate {{progress}} message that we report for each trigger with 
the other performance information that lives in status today.  You should be 
able to easily retrieve a configurable number of the most recent progress 
messages instead of just the most recent.

While we are making these changes, I propose that we also change {{id}} to be a 
globally unique identifier, rather than a JVM unique one.  Without this it's 
hard to correlate performance across restarts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-20 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682171#comment-15682171
 ] 

Cody Koeninger commented on SPARK-18475:


An iterator certainly does have an ordering guarantee, and it's pretty 
straightforward to figure out whether a given operation shuffles.  Plenty of 
jobs have been written depending on that ordering guarantee, and it's 
documented for the Direct Stream.

The only reason it's a significant performance improvement is that the OP is 
misusing Kafka.  If he had reasonably even production into a reasonable number 
of partitions, there would be no performance improvement.

You guys might be able to convince Michael this is a good idea, but as I said, 
this isn't the first time this has come up, and my answer isn't likely to 
change.  I'm not "blocking" anything, I'm not a gatekeeper and have no more 
rights than you do.  I just think it's a really bad idea.

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> --
>
> Key: SPARK-18475
> URL: https://issues.apache.org/jira/browse/SPARK-18475
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> What this will mean is that we won't be able to use the "CachedKafkaConsumer" 
> for what it is defined for (being cached) in this use case, but the extra 
> overhead is worth it for handling data skew and increasing parallelism, especially in 
> ETL use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18510) Partition schema inference corrupts data

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18510:


Assignee: (was: Apache Spark)

> Partition schema inference corrupts data
> 
>
> Key: SPARK-18510
> URL: https://issues.apache.org/jira/browse/SPARK-18510
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> Not sure if this is a regression from 2.0 to 2.1. I was investigating this 
> for Structured Streaming, but it seems it affects batch data as well.
> Here's the issue:
> If I specify my schema when doing
> {code}
> spark.read
>   .schema(someSchemaWherePartitionColumnsAreStrings)
> {code}
> but the partition inference can infer it as IntegerType or, I assume, 
> LongType or DoubleType (basically fixed size types), then once UnsafeRows are 
> generated, your data will be corrupted.
> Reproduction:
> {code}
> val createArray = udf { (length: Long) =>
> for (i <- 1 to length.toInt) yield i.toString
> }
> spark.range(10).select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 
> 'part).coalesce(1).write
> .partitionBy("part", "id")
> .mode("overwrite")
> .parquet(src.toString)
> val schema = new StructType()
> .add("id", StringType)
> .add("part", IntegerType)
> .add("ex", ArrayType(StringType))
> spark.read
>   .schema(schema)
>   .format("parquet")
>   .load(src.toString)
>   .show()
> {code}
> The UDF is useful for creating a row long enough so that you don't hit other 
> weird NullPointerExceptions caused for the same reason I believe.
> Output:
> {code}
> +-+++
> |   id|part|  ex|
> +-+++
> |�|   1|[1, 2, 3, 4, 5, 6...|
> | |   0|[1, 2, 3, 4, 5, 6...|
> |  |   3|[1, 2, 3, 4, 5, 6...|
> |   |   2|[1, 2, 3, 4, 5, 6...|
> ||   1|  [1, 2, 3, 4, 5, 6]|
> | |   0| [1, 2, 3, 4, 5]|
> |  |   3|[1, 2, 3, 4]|
> |   |   2|   [1, 2, 3]|
> ||   1|  [1, 2]|
> | |   0| [1]|
> +-+++
> {code}
> I was hoping to fix the issue as part of SPARK-18407 but it seems it's not 
> only applicable to StructuredStreaming and deserves its own JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18510) Partition schema inference corrupts data

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18510:


Assignee: Apache Spark

> Partition schema inference corrupts data
> 
>
> Key: SPARK-18510
> URL: https://issues.apache.org/jira/browse/SPARK-18510
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>Priority: Blocker
>
> Not sure if this is a regression from 2.0 to 2.1. I was investigating this 
> for Structured Streaming, but it seems it affects batch data as well.
> Here's the issue:
> If I specify my schema when doing
> {code}
> spark.read
>   .schema(someSchemaWherePartitionColumnsAreStrings)
> {code}
> but the partition inference can infer it as IntegerType or, I assume, 
> LongType or DoubleType (basically fixed size types), then once UnsafeRows are 
> generated, your data will be corrupted.
> Reproduction:
> {code}
> val createArray = udf { (length: Long) =>
> for (i <- 1 to length.toInt) yield i.toString
> }
> spark.range(10).select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 
> 'part).coalesce(1).write
> .partitionBy("part", "id")
> .mode("overwrite")
> .parquet(src.toString)
> val schema = new StructType()
> .add("id", StringType)
> .add("part", IntegerType)
> .add("ex", ArrayType(StringType))
> spark.read
>   .schema(schema)
>   .format("parquet")
>   .load(src.toString)
>   .show()
> {code}
> The UDF is useful for creating a row long enough so that you don't hit other 
> weird NullPointerExceptions caused for the same reason I believe.
> Output:
> {code}
> +-+++
> |   id|part|  ex|
> +-+++
> |�|   1|[1, 2, 3, 4, 5, 6...|
> | |   0|[1, 2, 3, 4, 5, 6...|
> |  |   3|[1, 2, 3, 4, 5, 6...|
> |   |   2|[1, 2, 3, 4, 5, 6...|
> ||   1|  [1, 2, 3, 4, 5, 6]|
> | |   0| [1, 2, 3, 4, 5]|
> |  |   3|[1, 2, 3, 4]|
> |   |   2|   [1, 2, 3]|
> ||   1|  [1, 2]|
> | |   0| [1]|
> +-+++
> {code}
> I was hoping to fix the issue as part of SPARK-18407 but it seems it's not 
> only applicable to StructuredStreaming and deserves its own JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18510) Partition schema inference corrupts data

2016-11-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15681953#comment-15681953
 ] 

Apache Spark commented on SPARK-18510:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/15951

> Partition schema inference corrupts data
> 
>
> Key: SPARK-18510
> URL: https://issues.apache.org/jira/browse/SPARK-18510
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> Not sure if this is a regression from 2.0 to 2.1. I was investigating this 
> for Structured Streaming, but it seems it affects batch data as well.
> Here's the issue:
> If I specify my schema when doing
> {code}
> spark.read
>   .schema(someSchemaWherePartitionColumnsAreStrings)
> {code}
> but the partition inference can infer it as IntegerType or, I assume, 
> LongType or DoubleType (basically fixed size types), then once UnsafeRows are 
> generated, your data will be corrupted.
> Reproduction:
> {code}
> val createArray = udf { (length: Long) =>
> for (i <- 1 to length.toInt) yield i.toString
> }
> spark.range(10).select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 
> 'part).coalesce(1).write
> .partitionBy("part", "id")
> .mode("overwrite")
> .parquet(src.toString)
> val schema = new StructType()
> .add("id", StringType)
> .add("part", IntegerType)
> .add("ex", ArrayType(StringType))
> spark.read
>   .schema(schema)
>   .format("parquet")
>   .load(src.toString)
>   .show()
> {code}
> The UDF is useful for creating a row long enough so that you don't hit other 
> weird NullPointerExceptions caused for the same reason I believe.
> Output:
> {code}
> +-+++
> |   id|part|  ex|
> +-+++
> |�|   1|[1, 2, 3, 4, 5, 6...|
> | |   0|[1, 2, 3, 4, 5, 6...|
> |  |   3|[1, 2, 3, 4, 5, 6...|
> |   |   2|[1, 2, 3, 4, 5, 6...|
> ||   1|  [1, 2, 3, 4, 5, 6]|
> | |   0| [1, 2, 3, 4, 5]|
> |  |   3|[1, 2, 3, 4]|
> |   |   2|   [1, 2, 3]|
> ||   1|  [1, 2]|
> | |   0| [1]|
> +-+++
> {code}
> I was hoping to fix the issue as part of SPARK-18407 but it seems it's not 
> only applicable to StructuredStreaming and deserves its own JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17062) Add --conf to mesos dispatcher process

2016-11-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15681890#comment-15681890
 ] 

Apache Spark commented on SPARK-17062:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/15950

> Add --conf to mesos dispatcher process
> --
>
> Key: SPARK-17062
> URL: https://issues.apache.org/jira/browse/SPARK-17062
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
> Fix For: 2.2.0
>
>
> Sometimes we simply need to add a property in Spark Config for the Mesos 
> Dispatcher. The only option right now is to create a property file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17732) ALTER TABLE DROP PARTITION should support comparators

2016-11-20 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17732:
--
Fix Version/s: (was: 2.1.0)
   2.2.0

> ALTER TABLE DROP PARTITION should support comparators
> -
>
> Key: SPARK-17732
> URL: https://issues.apache.org/jira/browse/SPARK-17732
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.2.0
>
>
> This issue aims to support `comparators`, e.g. '<', '<=', '>', '>=', again in 
> Apache Spark 2.0 for backward compatibility.
> *Spark 1.6.2*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> {code}
> *Spark 2.0*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '<' expecting {')', ','}(line 1, pos 42)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18339) Don't push down current_timestamp for filters in StructuredStreaming

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18339:


Assignee: Tyson Condie  (was: Apache Spark)

> Don't push down current_timestamp for filters in StructuredStreaming
> 
>
> Key: SPARK-18339
> URL: https://issues.apache.org/jira/browse/SPARK-18339
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Burak Yavuz
>Assignee: Tyson Condie
>
> For the following workflow:
> 1. I have a column called time which is at minute level precision in a 
> Streaming DataFrame
> 2. I want to perform groupBy time, count
> 3. Then I want my MemorySink to only have the last 30 minutes of counts and I 
> perform this by
> {code}
> .where('time >= current_timestamp().cast("long") - 30 * 60)
> {code}
> what happens is that the `filter` gets pushed down before the aggregation, 
> and the filter happens on the source data for the aggregation instead of the 
> result of the aggregation (where I actually want to filter).
> I guess the main issue here is that `current_timestamp` is non-deterministic 
> in the streaming context, so the filter shouldn't be pushed down.
> Whether this requires us to store the `current_timestamp` for each trigger of 
> the streaming job is something to discuss.
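
For reference, the workflow described above can be sketched as follows; the streaming DataFrame and column name are assumptions for illustration.

{code}
// Illustrative sketch of the workflow above; streamingDf is an assumed streaming
// DataFrame with a 'time' column holding epoch seconds at minute precision.
import org.apache.spark.sql.functions.{col, current_timestamp}

val counts = streamingDf
  .groupBy("time")
  .count()
  .where(col("time") >= current_timestamp().cast("long") - 30 * 60)
// The reported problem: this filter is pushed below the aggregation, so it is
// applied to the aggregation's input rather than to its result.
{code}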



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18339) Don't push down current_timestamp for filters in StructuredStreaming

2016-11-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15681821#comment-15681821
 ] 

Apache Spark commented on SPARK-18339:
--

User 'tcondie' has created a pull request for this issue:
https://github.com/apache/spark/pull/15949

> Don't push down current_timestamp for filters in StructuredStreaming
> 
>
> Key: SPARK-18339
> URL: https://issues.apache.org/jira/browse/SPARK-18339
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Burak Yavuz
>Assignee: Tyson Condie
>
> For the following workflow:
> 1. I have a column called time which is at minute level precision in a 
> Streaming DataFrame
> 2. I want to perform groupBy time, count
> 3. Then I want my MemorySink to only have the last 30 minutes of counts and I 
> perform this by
> {code}
> .where('time >= current_timestamp().cast("long") - 30 * 60)
> {code}
> what happens is that the `filter` gets pushed down before the aggregation, 
> and the filter happens on the source data for the aggregation instead of the 
> result of the aggregation (where I actually want to filter).
> I guess the main issue here is that `current_timestamp` is non-deterministic 
> in the streaming context, so the filter shouldn't be pushed down.
> Whether this requires us to store the `current_timestamp` for each trigger of 
> the streaming job is something to discuss.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18339) Don't push down current_timestamp for filters in StructuredStreaming

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18339:


Assignee: Apache Spark  (was: Tyson Condie)

> Don't push down current_timestamp for filters in StructuredStreaming
> 
>
> Key: SPARK-18339
> URL: https://issues.apache.org/jira/browse/SPARK-18339
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> For the following workflow:
> 1. I have a column called time which is at minute level precision in a 
> Streaming DataFrame
> 2. I want to perform groupBy time, count
> 3. Then I want my MemorySink to only have the last 30 minutes of counts and I 
> perform this by
> {code}
> .where('time >= current_timestamp().cast("long") - 30 * 60)
> {code}
> what happens is that the `filter` gets pushed down before the aggregation, 
> and the filter happens on the source data for the aggregation instead of the 
> result of the aggregation (where I actually want to filter).
> I guess the main issue here is that `current_timestamp` is non-deterministic 
> in the streaming context, so the filter shouldn't be pushed down.
> Whether this requires us to store the `current_timestamp` for each trigger of 
> the streaming job is something to discuss.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18510) Partition schema inference corrupts data

2016-11-20 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15681677#comment-15681677
 ] 

Burak Yavuz commented on SPARK-18510:
-

No. Working on a separate fix

> Partition schema inference corrupts data
> 
>
> Key: SPARK-18510
> URL: https://issues.apache.org/jira/browse/SPARK-18510
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> Not sure if this is a regression from 2.0 to 2.1. I was investigating this 
> for Structured Streaming, but it seems it affects batch data as well.
> Here's the issue:
> If I specify my schema when doing
> {code}
> spark.read
>   .schema(someSchemaWherePartitionColumnsAreStrings)
> {code}
> but the partition inference can infer it as IntegerType (or, I assume, 
> LongType or DoubleType; basically fixed-size types), then once UnsafeRows are 
> generated, your data will be corrupted.
> Reproduction:
> {code}
> val createArray = udf { (length: Long) =>
>   for (i <- 1 to length.toInt) yield i.toString
> }
> spark.range(10)
>   .select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 'part)
>   .coalesce(1)
>   .write
>   .partitionBy("part", "id")
>   .mode("overwrite")
>   .parquet(src.toString)
> val schema = new StructType()
>   .add("id", StringType)
>   .add("part", IntegerType)
>   .add("ex", ArrayType(StringType))
> spark.read
>   .schema(schema)
>   .format("parquet")
>   .load(src.toString)
>   .show()
> {code}
> The UDF is useful for creating a row long enough that you don't hit other 
> weird NullPointerExceptions, which I believe are caused for the same reason.
> Output:
> {code}
> +-+++
> |   id|part|  ex|
> +-+++
> |�|   1|[1, 2, 3, 4, 5, 6...|
> | |   0|[1, 2, 3, 4, 5, 6...|
> |  |   3|[1, 2, 3, 4, 5, 6...|
> |   |   2|[1, 2, 3, 4, 5, 6...|
> ||   1|  [1, 2, 3, 4, 5, 6]|
> | |   0| [1, 2, 3, 4, 5]|
> |  |   3|[1, 2, 3, 4]|
> |   |   2|   [1, 2, 3]|
> ||   1|  [1, 2]|
> | |   0| [1]|
> +-+++
> {code}
> I was hoping to fix the issue as part of SPARK-18407, but it seems it's not 
> only applicable to Structured Streaming and deserves its own JIRA.
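
As a stopgap while a proper fix is worked out, disabling partition-column type
inference may sidestep the mismatch in the reproduction above, since the partition
values then stay strings and agree with the string-typed column in the supplied
schema. This is a workaround sketch only (reusing `schema` and `src` from the
reproduction), not the fix itself:

{code}
// Turn off partition-column type inference before reading back the data.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

spark.read
  .schema(schema)
  .format("parquet")
  .load(src.toString)
  .show()
{code}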



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18515) AlterTableDropPartitions fails for non-string columns

2016-11-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15681626#comment-15681626
 ] 

Dongjoon Hyun commented on SPARK-18515:
---

Yes, indeed. Thank you for the advice. Also, I think we need a test that 
enumerates all types to make sure of this.

> AlterTableDropPartitions fails for non-string columns
> -
>
> Key: SPARK-18515
> URL: https://issues.apache.org/jira/browse/SPARK-18515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Dongjoon Hyun
>
> AlterTableDropPartitions fails with a scala MatchError if you use non-string 
> partitioning columns:
> {noformat}
> spark.sql("drop table if exists tbl_x")
> spark.sql("create table tbl_x (a int) partitioned by (p int)")
> spark.sql("alter table tbl_x add partition (p=10)")
> spark.sql("alter table tbl_x drop partition (p=10)")
> {noformat}
> Yields the following error:
> {noformat}
> scala.MatchError: (cast(p#8 as int) = 10) (of class 
> org.apache.spark.sql.catalyst.expressions.EqualTo)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:461)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand.run(ddl.scala:461)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:591)
>   ... 39 elided
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18515) AlterTableDropPartitions fails for non-string columns

2016-11-20 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15681605#comment-15681605
 ] 

Herman van Hovell commented on SPARK-18515:
---

The Analyzer is injecting casts because the dataType of the partition attribute 
is a String: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L223

This is not trivial to solve. If we want to properly fix this, we need to use 
UnresolvedAttributes and add some analysis for this command. Another - hackier - 
way of getting there would be to explicitly exclude this command from analysis.

I am personally a proponent of option 1. 
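
To make the failure mode concrete, here is a simplified, hypothetical sketch of
the partition-spec extraction (not the actual code in ddl.scala). The first case
is the shape the command expects; the second is the shape the Analyzer actually
produces for non-string partition columns, e.g. (cast(p#8 as int) = 10), which is
why the current match falls through:

{code}
import org.apache.spark.sql.catalyst.expressions.{Attribute, Cast, EqualTo, Expression, Literal}

// Hypothetical helper: turn an `attribute = literal` predicate into a partition spec entry.
def toPartitionSpec(predicate: Expression): (String, String) = predicate match {
  case EqualTo(a: Attribute, Literal(value, _)) =>
    a.name -> value.toString
  // The analyzed predicate wraps the attribute in a Cast for non-string columns.
  case EqualTo(c: Cast, Literal(value, _)) if c.child.isInstanceOf[Attribute] =>
    c.child.asInstanceOf[Attribute].name -> value.toString
  case other =>
    throw new IllegalArgumentException(s"Unsupported partition predicate: $other")
}
{code}

Option 1 above would instead keep the attribute unresolved in the command and
resolve it (and cast the literal) against the table's partition schema during
analysis, so no Cast ends up wrapping the attribute at all.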

> AlterTableDropPartitions fails for non-string columns
> -
>
> Key: SPARK-18515
> URL: https://issues.apache.org/jira/browse/SPARK-18515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Dongjoon Hyun
>
> AlterTableDropPartitions fails with a scala MatchError if you use non-string 
> partitioning columns:
> {noformat}
> spark.sql("drop table if exists tbl_x")
> spark.sql("create table tbl_x (a int) partitioned by (p int)")
> spark.sql("alter table tbl_x add partition (p=10)")
> spark.sql("alter table tbl_x drop partition (p=10)")
> {noformat}
> Yields the following error:
> {noformat}
> scala.MatchError: (cast(p#8 as int) = 10) (of class 
> org.apache.spark.sql.catalyst.expressions.EqualTo)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:461)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand.run(ddl.scala:461)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:591)
>   ... 39 elided
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17732) ALTER TABLE DROP PARTITION should support comparators

2016-11-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15681570#comment-15681570
 ] 

Apache Spark commented on SPARK-17732:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/15948

> ALTER TABLE DROP PARTITION should support comparators
> -
>
> Key: SPARK-17732
> URL: https://issues.apache.org/jira/browse/SPARK-17732
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.1.0
>
>
> This issue aims to support `comparators`, e.g. '<', '<=', '>', '>=', again in 
> Apache Spark 2.0 for backward compatibility.
> *Spark 1.6.2*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> {code}
> *Spark 2.0*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '<' expecting {')', ','}(line 1, pos 42)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18515) AlterTableDropPartitions fails for non-string columns

2016-11-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15681480#comment-15681480
 ] 

Dongjoon Hyun commented on SPARK-18515:
---

Ah, not only does a `Cast` occur, but the literal is also changed.
{code}
(cast(p#224 as double) = 10.0) (of class 
org.apache.spark.sql.catalyst.expressions.EqualTo)
scala.MatchError: (cast(p#224 as double) = 10.0) (of class 
org.apache.spark.sql.catalyst.expressions.EqualTo)
{code}

> AlterTableDropPartitions fails for non-string columns
> -
>
> Key: SPARK-18515
> URL: https://issues.apache.org/jira/browse/SPARK-18515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Dongjoon Hyun
>
> AlterTableDropPartitions fails with a scala MatchError if you use non-string 
> partitioning columns:
> {noformat}
> spark.sql("drop table if exists tbl_x")
> spark.sql("create table tbl_x (a int) partitioned by (p int)")
> spark.sql("alter table tbl_x add partition (p=10)")
> spark.sql("alter table tbl_x drop partition (p=10)")
> {noformat}
> Yields the following error:
> {noformat}
> scala.MatchError: (cast(p#8 as int) = 10) (of class 
> org.apache.spark.sql.catalyst.expressions.EqualTo)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:461)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand.run(ddl.scala:461)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:591)
>   ... 39 elided
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18515) AlterTableDropPartitions fails for non-string columns

2016-11-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15681450#comment-15681450
 ] 

Dongjoon Hyun edited comment on SPARK-18515 at 11/20/16 4:50 PM:
-

I see, [~hvanhovell].
Anyway, I'll fix this today in master.
Thank you, [~hvanhovell].


was (Author: dongjoon):
I see, [~hvanhovell].
Anyway, I'll fix this today.
Thank you, [~hvanhovell].

> AlterTableDropPartitions fails for non-string columns
> -
>
> Key: SPARK-18515
> URL: https://issues.apache.org/jira/browse/SPARK-18515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Dongjoon Hyun
>
> AlterTableDropPartitions fails with a scala MatchError if you use non-string 
> partitioning columns:
> {noformat}
> spark.sql("drop table if exists tbl_x")
> spark.sql("create table tbl_x (a int) partitioned by (p int)")
> spark.sql("alter table tbl_x add partition (p=10)")
> spark.sql("alter table tbl_x drop partition (p=10)")
> {noformat}
> Yields the following error:
> {noformat}
> scala.MatchError: (cast(p#8 as int) = 10) (of class 
> org.apache.spark.sql.catalyst.expressions.EqualTo)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:461)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand.run(ddl.scala:461)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:591)
>   ... 39 elided
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18515) AlterTableDropPartitions fails for non-string columns

2016-11-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15681450#comment-15681450
 ] 

Dongjoon Hyun commented on SPARK-18515:
---

I see, [~hvanhovell].
Anyway, I'll fix this today.
Thank you, [~hvanhovell].

> AlterTableDropPartitions fails for non-string columns
> -
>
> Key: SPARK-18515
> URL: https://issues.apache.org/jira/browse/SPARK-18515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Dongjoon Hyun
>
> AlterTableDropPartitions fails with a scala MatchError if you use non-string 
> partitioning columns:
> {noformat}
> spark.sql("drop table if exists tbl_x")
> spark.sql("create table tbl_x (a int) partitioned by (p int)")
> spark.sql("alter table tbl_x add partition (p=10)")
> spark.sql("alter table tbl_x drop partition (p=10)")
> {noformat}
> Yields the following error:
> {noformat}
> scala.MatchError: (cast(p#8 as int) = 10) (of class 
> org.apache.spark.sql.catalyst.expressions.EqualTo)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:461)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand.run(ddl.scala:461)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:591)
>   ... 39 elided
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18515) AlterTableDropPartitions fails for non-string columns

2016-11-20 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15681435#comment-15681435
 ] 

Herman van Hovell commented on SPARK-18515:
---

[~dongjoon] I am reverting this from branch-2.1: 
https://github.com/apache/spark/commit/1126c3194ee1c79015cf1d3808bc963aa93dcadf

> AlterTableDropPartitions fails for non-string columns
> -
>
> Key: SPARK-18515
> URL: https://issues.apache.org/jira/browse/SPARK-18515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Dongjoon Hyun
>
> AlterTableDropPartitions fails with a scala MatchError if you use non-string 
> partitioning columns:
> {noformat}
> spark.sql("drop table if exists tbl_x")
> spark.sql("create table tbl_x (a int) partitioned by (p int)")
> spark.sql("alter table tbl_x add partition (p=10)")
> spark.sql("alter table tbl_x drop partition (p=10)")
> {noformat}
> Yields the following error:
> {noformat}
> scala.MatchError: (cast(p#8 as int) = 10) (of class 
> org.apache.spark.sql.catalyst.expressions.EqualTo)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:462)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:461)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand.run(ddl.scala:461)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:185)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:591)
>   ... 39 elided
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18515) AlterTableDropPartitions fails for non-string columns

2016-11-20 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-18515:
-

 Summary: AlterTableDropPartitions fails for non-string columns
 Key: SPARK-18515
 URL: https://issues.apache.org/jira/browse/SPARK-18515
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Herman van Hovell
Assignee: Dongjoon Hyun


AlterTableDropPartitions fails with a scala MatchError if you use non-string 
partitioning columns:
{noformat}
spark.sql("drop table if exists tbl_x")
spark.sql("create table tbl_x (a int) partitioned by (p int)")
spark.sql("alter table tbl_x add partition (p=10)")
spark.sql("alter table tbl_x drop partition (p=10)")
{noformat}
Yields the following error:
{noformat}
scala.MatchError: (cast(p#8 as int) = 10) (of class 
org.apache.spark.sql.catalyst.expressions.EqualTo)
  at 
org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
  at 
org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10$$anonfun$11.apply(ddl.scala:462)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at 
org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:462)
  at 
org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand$$anonfun$10.apply(ddl.scala:461)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at 
org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand.run(ddl.scala:461)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
  at org.apache.spark.sql.Dataset.(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:591)
  ... 39 elided
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18514) Fix `Note:`/`NOTE:`/`Note that` across R API documentation

2016-11-20 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-18514:


 Summary: Fix `Note:`/`NOTE:`/`Note that` across R API documentation
 Key: SPARK-18514
 URL: https://issues.apache.org/jira/browse/SPARK-18514
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Reporter: Hyukjin Kwon
Priority: Minor


Same content as the parent issue, but only for R.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18447) Fix `Note:`/`NOTE:`/`Note that` across Python API documentation

2016-11-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15681396#comment-15681396
 ] 

Apache Spark commented on SPARK-18447:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15947

> Fix `Note:`/`NOTE:`/`Note that` across Python API documentation
> ---
>
> Key: SPARK-18447
> URL: https://issues.apache.org/jira/browse/SPARK-18447
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Aditya
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18513) Record and recover watermark

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18513:


Assignee: (was: Apache Spark)

> Record and recover watermark
> 
>
> Key: SPARK-18513
> URL: https://issues.apache.org/jira/browse/SPARK-18513
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Liwei Lin
>
> We should record the watermark into the persistent log and recover it to 
> ensure determinism.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18513) Record and recover watermark

2016-11-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15681385#comment-15681385
 ] 

Apache Spark commented on SPARK-18513:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15946

> Record and recover watermark
> 
>
> Key: SPARK-18513
> URL: https://issues.apache.org/jira/browse/SPARK-18513
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Liwei Lin
>
> We should record the watermark into the persistent log and recover it to 
> ensure determinism.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18513) Record and recover watermark

2016-11-20 Thread Liwei Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin updated SPARK-18513:
--
Summary: Record and recover watermark  (was: Record and recover waterwark)

> Record and recover watermark
> 
>
> Key: SPARK-18513
> URL: https://issues.apache.org/jira/browse/SPARK-18513
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Liwei Lin
>
> We should record the watermark into the persistent log and recover it to 
> ensure determinism.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18513) Record and recover watermark

2016-11-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18513:


Assignee: Apache Spark

> Record and recover watermark
> 
>
> Key: SPARK-18513
> URL: https://issues.apache.org/jira/browse/SPARK-18513
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Liwei Lin
>Assignee: Apache Spark
>
> We should record the watermark into the persistent log and recover it to 
> ensure determinism.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18513) Record and recover waterwark

2016-11-20 Thread Liwei Lin (JIRA)
Liwei Lin created SPARK-18513:
-

 Summary: Record and recover waterwark
 Key: SPARK-18513
 URL: https://issues.apache.org/jira/browse/SPARK-18513
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Reporter: Liwei Lin


We should record the watermark into the persistent log and recover it to ensure 
determinism.
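
A tiny illustrative sketch of the idea, with assumed names (the real change would
go through Spark's existing offset/metadata log rather than a new format): persist
the watermark with each batch's metadata and resume from the latest recorded value
on restart.

{code}
// Illustrative only: the watermark rides along with per-batch metadata that is
// already persisted, so recovery restores it instead of recomputing it.
case class BatchCheckpoint(batchId: Long, watermarkMs: Long)

def recoverWatermark(checkpoints: Seq[BatchCheckpoint]): Long =
  if (checkpoints.isEmpty) 0L else checkpoints.maxBy(_.batchId).watermarkMs
{code}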



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16998) select($"column1", explode($"column2")) is extremely slow

2016-11-20 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-16998.
---
   Resolution: Fixed
 Assignee: Herman van Hovell
Fix Version/s: 2.2.0

> select($"column1", explode($"column2")) is extremely slow
> -
>
> Key: SPARK-16998
> URL: https://issues.apache.org/jira/browse/SPARK-16998
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: TobiasP
>Assignee: Herman van Hovell
> Fix For: 2.2.0
>
>
> Using a Dataset containing 10.000 rows, each containing null and an array of 
> 5.000 Ints, I observe the following performance (in local mode):
> {noformat}
> scala> time(ds.select(explode($"value")).sample(false, 0.001, 1).collect)
> 1.219052 seconds  
>   
> res9: Array[org.apache.spark.sql.Row] = Array([3761], [3766], [3196])
> scala> time(ds.select($"dummy", explode($"value")).sample(false, 0.001, 
> 1).collect)
> 20.219447 seconds 
>   
> res5: Array[org.apache.spark.sql.Row] = Array([null,3761], [null,3766], 
> [null,3196])
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16998) select($"column1", explode($"column2")) is extremely slow

2016-11-20 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15681374#comment-15681374
 ] 

Takeshi Yamamuro commented on SPARK-16998:
--

[~hvanhovell] Since SPARK-15214 improves this query by ~11x, I think we can 
also close this ticket;
https://github.com/apache/spark/pull/13065/files#diff-b7bf86a20a79d572f81093300568db6eR152

> select($"column1", explode($"column2")) is extremely slow
> -
>
> Key: SPARK-16998
> URL: https://issues.apache.org/jira/browse/SPARK-16998
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: TobiasP
>
> Using a Dataset containing 10.000 rows, each containing null and an array of 
> 5.000 Ints, I observe the following performance (in local mode):
> {noformat}
> scala> time(ds.select(explode($"value")).sample(false, 0.001, 1).collect)
> 1.219052 seconds  
>   
> res9: Array[org.apache.spark.sql.Row] = Array([3761], [3766], [3196])
> scala> time(ds.select($"dummy", explode($"value")).sample(false, 0.001, 
> 1).collect)
> 20.219447 seconds 
>   
> res5: Array[org.apache.spark.sql.Row] = Array([null,3761], [null,3766], 
> [null,3196])
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12978) Skip unnecessary final group-by when input data already clustered with group-by keys

2016-11-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15681349#comment-15681349
 ] 

Apache Spark commented on SPARK-12978:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/15945

> Skip unnecessary final group-by when input data already clustered with 
> group-by keys
> 
>
> Key: SPARK-12978
> URL: https://issues.apache.org/jira/browse/SPARK-12978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>
> This ticket targets the optimization to skip an unnecessary group-by 
> operation, as shown below:
> Without opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
>  output=[col0#159,sum#200,sum#201,count#202L])
>+- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
> InMemoryRelation [col0#159,col1#160,col2#161], true, 1, 
> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
> {code}
> With opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
> [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
> true, 1), ConvertToUnsafe, None
> {code}
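
For context, a hedged sketch of the kind of query that produces the plans above
(the DataFrame `df` and the columns col0/col1/col2 are assumed): once the data is
already hash-partitioned by the grouping key, splitting the aggregation into a
partial and a final step is redundant, and a single complete aggregate per
partition suffices - which is exactly the step the optimized plan drops.

{code}
import org.apache.spark.sql.functions.{avg, sum}

// `df` is assumed to have the columns col0, col1, col2 used in the plans above.
// Repartitioning by the grouping key clusters the input; the proposed optimization
// would then plan a single complete aggregate instead of a partial + final pair.
val clustered = df.repartition(df("col0"))
val aggregated = clustered.groupBy("col0").agg(sum("col1"), avg("col2"))
{code}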



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting

2016-11-20 Thread Ran Haim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15680994#comment-15680994
 ] 

Ran Haim edited comment on SPARK-17436 at 11/20/16 12:06 PM:
-

Sure, I propose to stop using UnsafeKVExternalSorter and just use a 
HashMap[String, ArrayBuffer[UnsafeRow]] - that is basically it.

It seems that in the Spark 2.1 code, the sorting issue is resolved.
The sorter does consider inner sorting in the sorting key - but I think it will 
be faster to just insert the rows into a list in a hash map.
In any case, I suggest changing this issue to minor.



was (Author: ran.h...@optimalplus.com):
Sure , I propose to stop using UnsafeKVExternalSorter, and just use a 
HashMap[String, ArrayBuffer[UnsafeRow]] - that is it basically.

It seems that in spark 2.0 code, the sorting issue is resolved.
The sorter does consider inner sorting in the sorting key - but I think it will 
be faster to just insert the rows to a list in a hash map.
In any case I suggest to change this issue to minor.


> dataframe.write sometimes does not keep sorting
> ---
>
> Key: SPARK-17436
> URL: https://issues.apache.org/jira/browse/SPARK-17436
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Ran Haim
>Priority: Minor
>
> update
> ***
> It seems that in spark 2.1 code, the sorting issue is resolved.
> The sorter does consider inner sorting in the sorting key - but I think it 
> will be faster to just insert the rows to a list in a hash map.
> ***
> When using partition by,  datawriter can sometimes mess up an ordered 
> dataframe.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method when too many files are opened (configurable), it 
> starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows 
> again from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes mess up the original sort (or secondary sort if 
> you will).
> I think the best way to fix it is to stop using a sorter, and just put the 
> rows in a map using key as partition key and value as an arraylist, and then 
> just walk through all the keys and write it in the original order - this will 
> probably be faster as there is no need for ordering.
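
A minimal sketch of the buffering approach proposed above, with hypothetical
bufferRow/flush helpers (illustrative only, not the actual
DynamicPartitionWriterContainer code): group rows by partition key while keeping
their arrival order, then write each group out without re-sorting, so any
pre-existing secondary sort survives.

{code}
import scala.collection.mutable

import org.apache.spark.sql.catalyst.expressions.UnsafeRow

// LinkedHashMap keeps partition keys in first-seen order; each buffer keeps rows
// in arrival order, so the write preserves the input's original sort.
val buffered = mutable.LinkedHashMap.empty[String, mutable.ArrayBuffer[UnsafeRow]]

def bufferRow(partitionKey: String, row: UnsafeRow): Unit = {
  // copy() because the incoming row object is typically reused by the upstream iterator
  buffered.getOrElseUpdate(partitionKey, mutable.ArrayBuffer.empty[UnsafeRow]) += row.copy()
}

def flush(write: (String, Seq[UnsafeRow]) => Unit): Unit = {
  buffered.foreach { case (key, rows) => write(key, rows.toSeq) }
  buffered.clear()
}
{code}

The trade-off versus the external sorter is memory: this keeps all buffered rows
on the heap, which is presumably why the original code falls back to
UnsafeKVExternalSorter once too many files are open.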



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17436) dataframe.write sometimes does not keep sorting

2016-11-20 Thread Ran Haim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ran Haim updated SPARK-17436:
-
Description: 
update
***
It seems that in spark 2.1 code, the sorting issue is resolved.
The sorter does consider inner sorting in the sorting key - but I think it will 
be faster to just insert the rows to a list in a hash map.
***

When using partition by,  datawriter can sometimes mess up an ordered dataframe.

The problem originates in 
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
In the writeRows method when too many files are opened (configurable), it 
starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows 
again from the sorter and writes them to the corresponding files.
The problem is that the sorter actually sorts the rows using the partition key, 
and that can sometimes mess up the original sort (or secondary sort if you 
will).

I think the best way to fix it is to stop using a sorter, and just put the rows 
in a map using key as partition key and value as an arraylist, and then just 
walk through all the keys and write it in the original order - this will 
probably be faster as there is no need for ordering.



  was:
update
***
It seems that in spark 2.0 code, the sorting issue is resolved.
The sorter does consider inner sorting in the sorting key - but I think it will 
be faster to just insert the rows to a list in a hash map.
***

When using partition by,  datawriter can sometimes mess up an ordered dataframe.

The problem originates in 
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
In the writeRows method when too many files are opened (configurable), it 
starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows 
again from the sorter and writes them to the corresponding files.
The problem is that the sorter actually sorts the rows using the partition key, 
and that can sometimes mess up the original sort (or secondary sort if you 
will).

I think the best way to fix it is to stop using a sorter, and just put the rows 
in a map using key as partition key and value as an arraylist, and then just 
walk through all the keys and write it in the original order - this will 
probably be faster as there no need for ordering.




> dataframe.write sometimes does not keep sorting
> ---
>
> Key: SPARK-17436
> URL: https://issues.apache.org/jira/browse/SPARK-17436
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Ran Haim
>Priority: Minor
>
> update
> ***
> It seems that in spark 2.1 code, the sorting issue is resolved.
> The sorter does consider inner sorting in the sorting key - but I think it 
> will be faster to just insert the rows to a list in a hash map.
> ***
> When using partition by,  datawriter can sometimes mess up an ordered 
> dataframe.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method when too many files are opened (configurable), it 
> starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows 
> again from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes mess up the original sort (or secondary sort if 
> you will).
> I think the best way to fix it is to stop using a sorter, and just put the 
> rows in a map using key as partition key and value as an arraylist, and then 
> just walk through all the keys and write it in the original order - this will 
> probably be faster as there is no need for ordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17436) dataframe.write sometimes does not keep sorting

2016-11-20 Thread Ran Haim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ran Haim updated SPARK-17436:
-
Description: 
*** update
It seems that in spark 2.0 code, the sorting issue is resolved.
The sorter does consider inner sorting in the sorting key - but I think it will 
be faster to just insert the rows to a list in a hash map.
***

When using partition by,  datawriter can sometimes mess up an ordered dataframe.

The problem originates in 
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
In the writeRows method when too many files are opened (configurable), it 
starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows 
again from the sorter and writes them to the corresponding files.
The problem is that the sorter actually sorts the rows using the partition key, 
and that can sometimes mess up the original sort (or secondary sort if you 
will).

I think the best way to fix it is to stop using a sorter, and just put the rows 
in a map using key as partition key and value as an arraylist, and then just 
walk through all the keys and write it in the original order - this will 
probably be faster as there is no need for ordering.



  was:
When using partition by,  datawriter can sometimes mess up an ordered dataframe.

The problem originates in 
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
In the writeRows method when too many files are opened (configurable), it 
starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows 
again from the sorter and writes them to the corresponding files.
The problem is that the sorter actually sorts the rows using the partition key, 
and that can sometimes mess up the original sort (or secondary sort if you 
will).

I think the best way to fix it is to stop using a sorter, and just put the rows 
in a map using key as partition key and value as an arraylist, and then just 
walk through all the keys and write it in the original order - this will 
probably be faster as there no need for ordering.




> dataframe.write sometimes does not keep sorting
> ---
>
> Key: SPARK-17436
> URL: https://issues.apache.org/jira/browse/SPARK-17436
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Ran Haim
>Priority: Minor
>
> *** update
> It seems that in spark 2.0 code, the sorting issue is resolved.
> The sorter does consider inner sorting in the sorting key - but I think it 
> will be faster to just insert the rows to a list in a hash map.
> ***
> When using partition by,  datawriter can sometimes mess up an ordered 
> dataframe.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method when too many files are opened (configurable), it 
> starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows 
> again from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes mess up the original sort (or secondary sort if 
> you will).
> I think the best way to fix it is to stop using a sorter, and just put the 
> rows in a map using key as partition key and value as an arraylist, and then 
> just walk through all the keys and write it in the original order - this will 
> probably be faster as there is no need for ordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17436) dataframe.write sometimes does not keep sorting

2016-11-20 Thread Ran Haim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ran Haim updated SPARK-17436:
-
Description: 
update
***
It seems that in spark 2.0 code, the sorting issue is resolved.
The sorter does consider inner sorting in the sorting key - but I think it will 
be faster to just insert the rows to a list in a hash map.
***

When using partition by,  datawriter can sometimes mess up an ordered dataframe.

The problem originates in 
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
In the writeRows method when too many files are opened (configurable), it 
starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows 
again from the sorter and writes them to the corresponding files.
The problem is that the sorter actually sorts the rows using the partition key, 
and that can sometimes mess up the original sort (or secondary sort if you 
will).

I think the best way to fix it is to stop using a sorter, and just put the rows 
in a map using key as partition key and value as an arraylist, and then just 
walk through all the keys and write it in the original order - this will 
probably be faster as there is no need for ordering.



  was:
*** update
It seems that in the Spark 2.0 code, the sorting issue is resolved.
The sorter does consider inner sorting in the sorting key - but I think it will 
be faster to just insert the rows into a list in a hash map.
***

When using partition by, the data writer can sometimes mess up an ordered dataframe.

The problem originates in 
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
In the writeRows method, when too many files are open (configurable), it 
starts inserting rows into UnsafeKVExternalSorter, then reads all the rows 
back from the sorter and writes them to the corresponding files.
The problem is that the sorter actually sorts the rows using the partition key, 
and that can sometimes mess up the original sort (or secondary sort, if you 
will).

I think the best way to fix it is to stop using a sorter and instead put the 
rows in a map, keyed by partition key with an array list as the value, then 
walk through all the keys and write the rows in their original order - this 
will probably be faster, as there is no need for sorting.




> dataframe.write sometimes does not keep sorting
> ---
>
> Key: SPARK-17436
> URL: https://issues.apache.org/jira/browse/SPARK-17436
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Ran Haim
>Priority: Minor
>
> update
> ***
> It seems that in the Spark 2.0 code, the sorting issue is resolved.
> The sorter does consider inner sorting in the sorting key - but I think it 
> will be faster to just insert the rows into a list in a hash map.
> ***
> When using partition by, the data writer can sometimes mess up an ordered 
> dataframe.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are open (configurable), it 
> starts inserting rows into UnsafeKVExternalSorter, then reads all the rows 
> back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes mess up the original sort (or secondary sort, if 
> you will).
> I think the best way to fix it is to stop using a sorter and instead put the 
> rows in a map, keyed by partition key with an array list as the value, then 
> walk through all the keys and write the rows in their original order - this 
> will probably be faster, as there is no need for sorting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17436) dataframe.write sometimes does not keep sorting

2016-11-20 Thread Ran Haim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ran Haim updated SPARK-17436:
-
Priority: Minor  (was: Major)

> dataframe.write sometimes does not keep sorting
> ---
>
> Key: SPARK-17436
> URL: https://issues.apache.org/jira/browse/SPARK-17436
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Ran Haim
>Priority: Minor
>
> When using partition by, the data writer can sometimes mess up an ordered 
> dataframe.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are open (configurable), it 
> starts inserting rows into UnsafeKVExternalSorter, then reads all the rows 
> back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes mess up the original sort (or secondary sort, if 
> you will).
> I think the best way to fix it is to stop using a sorter and instead put the 
> rows in a map, keyed by partition key with an array list as the value, then 
> walk through all the keys and write the rows in their original order - this 
> will probably be faster, as there is no need for sorting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17436) dataframe.write sometimes does not keep sorting

2016-11-20 Thread Ran Haim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ran Haim updated SPARK-17436:
-
Description: 
When using partition by, the data writer can sometimes mess up an ordered dataframe.

The problem originates in 
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
In the writeRows method, when too many files are open (configurable), it 
starts inserting rows into UnsafeKVExternalSorter, then reads all the rows 
back from the sorter and writes them to the corresponding files.
The problem is that the sorter actually sorts the rows using the partition key, 
and that can sometimes mess up the original sort (or secondary sort, if you 
will).

I think the best way to fix it is to stop using a sorter and instead put the 
rows in a map, keyed by partition key with an array list as the value, then 
walk through all the keys and write the rows in their original order - this 
will probably be faster, as there is no need for sorting.



  was:
When using partition by, the data writer can sometimes mess up an ordered dataframe.

The problem originates in 
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
In the writeRows method, when too many files are open (configurable), it 
starts inserting rows into UnsafeKVExternalSorter, then reads all the rows 
back from the sorter and writes them to the corresponding files.
The problem is that the sorter actually sorts the rows using the partition key, 
and that can sometimes mess up the original sort (or secondary sort, if you 
will).

I think the best way to fix it is to stop using a sorter and instead put the 
rows in a map, keyed by partition key with an array list as the value, then 
walk through all the keys and write the rows in their original order - this 
will probably be faster, as there is no need for sorting.


> dataframe.write sometimes does not keep sorting
> ---
>
> Key: SPARK-17436
> URL: https://issues.apache.org/jira/browse/SPARK-17436
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Ran Haim
>Priority: Minor
>
> When using partition by, the data writer can sometimes mess up an ordered 
> dataframe.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are open (configurable), it 
> starts inserting rows into UnsafeKVExternalSorter, then reads all the rows 
> back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes mess up the original sort (or secondary sort, if 
> you will).
> I think the best way to fix it is to stop using a sorter and instead put the 
> rows in a map, keyed by partition key with an array list as the value, then 
> walk through all the keys and write the rows in their original order - this 
> will probably be faster, as there is no need for sorting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting

2016-11-20 Thread Ran Haim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15680994#comment-15680994
 ] 

Ran Haim edited comment on SPARK-17436 at 11/20/16 11:30 AM:
-

Sure, I propose to stop using UnsafeKVExternalSorter and just use a 
HashMap[String, ArrayBuffer[UnsafeRow]] - that is basically it.

It seems that in the Spark 2.0 code, the sorting issue is resolved.
The sorter does consider inner sorting in the sorting key - but I think it will 
be faster to just insert the rows into a list in a hash map.
In any case, I suggest changing this issue to minor.



was (Author: ran.h...@optimalplus.com):
Sure - basically, I propose to stop using UnsafeKVExternalSorter and just use a 
HashMap[String, ArrayBuffer[UnsafeRow]].

It seems that in the Spark 2.0 code, the sorting issue is resolved.
The sorter does consider inner sorting in the sorting key - but I think it will 
be faster to just insert the rows into a list in a hash map.
In any case, I suggest changing this issue to minor.


> dataframe.write sometimes does not keep sorting
> ---
>
> Key: SPARK-17436
> URL: https://issues.apache.org/jira/browse/SPARK-17436
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Ran Haim
>
> When using partition by, the data writer can sometimes mess up an ordered 
> dataframe.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are open (configurable), it 
> starts inserting rows into UnsafeKVExternalSorter, then reads all the rows 
> back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes mess up the original sort (or secondary sort, if 
> you will).
> I think the best way to fix it is to stop using a sorter and instead put the 
> rows in a map, keyed by partition key with an array list as the value, then 
> walk through all the keys and write the rows in their original order - this 
> will probably be faster, as there is no need for sorting.
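
A minimal sketch, in Scala, of the buffering approach proposed above. The names 
(bufferAndWriteRows, writeRow) and the simplified Row type are hypothetical 
placeholders, not the actual DynamicPartitionWriterContainer code; it only 
illustrates that grouping rows per partition key in insertion order avoids 
re-sorting them.

{code}
import scala.collection.mutable

// Hypothetical sketch: buffer rows per partition key instead of sorting them.
// Rows are appended to each key's buffer in arrival order, so the original
// (secondary) ordering within every partition is preserved when writing.
object BufferedPartitionWriteSketch {
  type Row = Seq[Any]

  def bufferAndWriteRows(rows: Iterator[Row],
                         partitionKey: Row => String,
                         writeRow: (String, Row) => Unit): Unit = {
    // LinkedHashMap keeps keys in first-seen order; each ArrayBuffer keeps
    // its rows in the order they arrived.
    val buffers = mutable.LinkedHashMap.empty[String, mutable.ArrayBuffer[Row]]
    rows.foreach { row =>
      buffers.getOrElseUpdate(partitionKey(row), mutable.ArrayBuffer.empty[Row]) += row
    }
    // Walk the buffered keys and write each partition's rows without re-sorting.
    for ((key, buffered) <- buffers; row <- buffered) writeRow(key, row)
  }
}
{code}

One trade-off worth noting: unlike UnsafeKVExternalSorter, a plain in-memory map 
cannot spill to disk, so this sketch assumes the buffered rows fit in memory.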



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting

2016-11-20 Thread Ran Haim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15680994#comment-15680994
 ] 

Ran Haim edited comment on SPARK-17436 at 11/20/16 11:29 AM:
-

Sure - basically, I propose to stop using UnsafeKVExternalSorter and just use a 
HashMap[String, ArrayBuffer[UnsafeRow]].

It seems that in the Spark 2.0 code, the sorting issue is resolved.
The sorter does consider inner sorting in the sorting key - but I think it will 
be faster to just insert the rows into a list in a hash map.
In any case, I suggest changing this issue to minor.



was (Author: ran.h...@optimalplus.com):
Sure.
Basically, I propose to stop using UnsafeKVExternalSorter and just use a 
HashMap[String, ArrayBuffer[UnsafeRow]].

> dataframe.write sometimes does not keep sorting
> ---
>
> Key: SPARK-17436
> URL: https://issues.apache.org/jira/browse/SPARK-17436
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Ran Haim
>
> When using partition by, the data writer can sometimes mess up an ordered 
> dataframe.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are open (configurable), it 
> starts inserting rows into UnsafeKVExternalSorter, then reads all the rows 
> back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes mess up the original sort (or secondary sort, if 
> you will).
> I think the best way to fix it is to stop using a sorter and instead put the 
> rows in a map, keyed by partition key with an array list as the value, then 
> walk through all the keys and write the rows in their original order - this 
> will probably be faster, as there is no need for sorting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18512) FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and S3A

2016-11-20 Thread Giuseppe Bonaccorso (JIRA)
Giuseppe Bonaccorso created SPARK-18512:
---

 Summary: FileNotFoundException on _temporary directory with Spark 
Streaming 2.0.1 and S3A
 Key: SPARK-18512
 URL: https://issues.apache.org/jira/browse/SPARK-18512
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.0.1
 Environment: AWS EMR 5.0.1
Spark 2.0.1
S3 EU-West-1 (S3A with read-after-write consistency)
Reporter: Giuseppe Bonaccorso


After a few hours of streaming processing and saving data in Parquet format, I 
always get this exception:

{code:java}
java.io.FileNotFoundException: No such file or directory: s3a://xxx/_temporary/0/task_
  at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1004)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:745)
  at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:426)
  at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:362)
  at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:334)
  at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
  at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
  at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:510)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
  at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:488)
{code}

I've also tried s3:// and s3n://, but it always happens after 3-5 hours.
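
For reference, a minimal sketch of the kind of write that goes through the 
commit path shown in the stack trace above (DataFrameWriter.parquet down to 
FileOutputCommitter.commitJob). The session setup, schema and bucket/path are 
placeholders, not details from this report.

{code}
import org.apache.spark.sql.SparkSession

// Sketch (placeholder data and S3A path) of a Parquet write whose job commit
// lists the _temporary directory on S3A, the step that fails in the report.
object S3AParquetWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3a-parquet-write-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // DataFrameWriter.parquet -> InsertIntoHadoopFsRelationCommand -> commitJob,
    // which merges files from s3a://<bucket>/<path>/_temporary/0/... on commit.
    df.write.mode("overwrite").parquet("s3a://my-bucket/output/parquet")

    spark.stop()
  }
}
{code}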




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting

2016-11-20 Thread Ran Haim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15680994#comment-15680994
 ] 

Ran Haim commented on SPARK-17436:
--

Sure.
Basically, I propose to stop using UnsafeKVExternalSorter and just use a 
HashMap[String, ArrayBuffer[UnsafeRow]].

> dataframe.write sometimes does not keep sorting
> ---
>
> Key: SPARK-17436
> URL: https://issues.apache.org/jira/browse/SPARK-17436
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Ran Haim
>
> When using partition by, the data writer can sometimes mess up an ordered 
> dataframe.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are open (configurable), it 
> starts inserting rows into UnsafeKVExternalSorter, then reads all the rows 
> back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes mess up the original sort (or secondary sort, if 
> you will).
> I think the best way to fix it is to stop using a sorter and instead put the 
> rows in a map, keyed by partition key with an array list as the value, then 
> walk through all the keys and write the rows in their original order - this 
> will probably be faster, as there is no need for sorting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3359) `sbt/sbt unidoc` doesn't work with Java 8

2016-11-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15680887#comment-15680887
 ] 

Sean Owen commented on SPARK-3359:
--

Current state: lots of PRs over time have removed most of the errors/warnings, 
but some still remain. Getting closer.

> `sbt/sbt unidoc` doesn't work with Java 8
> -
>
> Key: SPARK-3359
> URL: https://issues.apache.org/jira/browse/SPARK-3359
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> It seems that Java 8 is stricter on JavaDoc. I got many error messages like
> {code}
> [error] 
> /Users/meng/src/spark-mengxr/core/target/java/org/apache/hadoop/mapred/SparkHadoopMapRedUtil.java:2:
>  error: modifier private not allowed here
> [error] private abstract interface SparkHadoopMapRedUtil {
> [error]  ^
> {code}
> This is minor because we can always use Java 6/7 to generate the doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18511) Add an api for join operation with just the column name and join type.

2016-11-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15680822#comment-15680822
 ] 

Apache Spark commented on SPARK-18511:
--

User 'shiv4nsh' has created a pull request for this issue:
https://github.com/apache/spark/pull/15944

> Add an api for join operation with just the column name and join type.
> --
>
> Key: SPARK-18511
> URL: https://issues.apache.org/jira/browse/SPARK-18511
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 2.0.2
>Reporter: Shivansh
>Priority: Minor
>
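
For context, a hedged sketch of the request in Scala. In Spark 2.0.x a 
single-column join with a non-inner join type has to go through the Seq-based 
overload; the helper below only illustrates the convenience being asked for and 
is not necessarily what the linked pull request implements.

{code}
import org.apache.spark.sql.DataFrame

// Today: left.join(right, Seq("id"), "left_outer") - the column name must be
// wrapped in a Seq even when there is only one of it.
// Hypothetical convenience helper matching the request (column name + join type):
object JoinHelpers {
  def joinOn(left: DataFrame, right: DataFrame,
             column: String, joinType: String): DataFrame =
    left.join(right, Seq(column), joinType)
}
{code}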




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-20 Thread Ofir Manor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15680780#comment-15680780
 ] 

Ofir Manor commented on SPARK-18475:


I understand, but I disagree with you. I think you are mixing two very 
different things.
Kafka does provide an ordering guarantee within a Kafka partition. That is a 
basic design attribute of it.
Spark RDDs, however, do not have any ordering guarantees. When you create and 
manipulate RDDs / DataFrames (regardless of their origin, Kafka or otherwise), 
Spark does not explicitly tell you what the order of elements is. For example, 
operators on RDDs do not specify whether they keep or break the internal order 
(and some of them do break it), there are no order-sensitive operators, there 
is no general-purpose ordering metadata, etc.
I agree that with the Kafka source, under some circumstances (like 
non-shuffling operators), the output of RDD manipulation may very well be in 
accordance with the input order (within an input Kafka partition). But I have 
never seen anywhere that this is an explicit guarantee users can rely on, 
rather than an artifact of the current internal implementation.
This is relevant because you want to block a potentially significant 
performance improvement here (one relying on explicit, non-default 
configuration) just to maintain a property that Spark does not guarantee to 
keep.
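
A small sketch of the ordering point, assuming a local SparkSession with toy 
data: nothing in the RDD API promises that a shuffling operator such as 
repartition preserves the per-partition order of its input, so whatever order 
comes out is an implementation artifact rather than a guarantee.

{code}
import org.apache.spark.sql.SparkSession

// Sketch (local session, toy data): after a shuffling operator the order seen
// inside each output partition is whatever the implementation happened to
// produce, not something the API guarantees.
object OrderingNotGuaranteedSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ordering-sketch")
      .master("local[4]")
      .getOrCreate()

    val ordered = spark.sparkContext.parallelize(1 to 20, numSlices = 2) // ordered input
    val shuffled = ordered.repartition(4)                                // shuffling operator

    // Print the contents of each output partition; no ordering is promised here.
    shuffled.glom().collect().zipWithIndex.foreach { case (part, i) =>
      println(s"partition $i: ${part.mkString(", ")}")
    }

    spark.stop()
  }
}
{code}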

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> --
>
> Key: SPARK-18475
> URL: https://issues.apache.org/jira/browse/SPARK-18475
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> What this will mean is that we won't be able to use the "CachedKafkaConsumer" 
> for what it is designed for (being cached) in this use case, but the extra 
> overhead is worth it for handling data skew and increasing parallelism, 
> especially in ETL use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org