[jira] [Commented] (SPARK-17501) Re-register BlockManager again and again

2016-09-11 Thread Jagadeesan A S (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15483144#comment-15483144
 ] 

Jagadeesan A S commented on SPARK-17501:


 Does this fail consistently? What is your configuration?
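
For reference, the timeouts involved in the failure below can be tuned through SparkConf. This is only a diagnostic sketch with illustrative values, not a fix for the underlying re-registration loop:

{code}
import org.apache.spark.SparkConf

// Illustrative values only; pick numbers appropriate for the cluster.
val conf = new SparkConf()
  .set("spark.rpc.askTimeout", "300s")            // the timeout named in the stack trace below
  .set("spark.network.timeout", "600s")           // fallback for several network-related timeouts
  .set("spark.executor.heartbeatInterval", "20s") // should stay well below spark.network.timeout
{code}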

> Re-register BlockManager again and again
> 
>
> Key: SPARK-17501
> URL: https://issues.apache.org/jira/browse/SPARK-17501
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>
> After re-registering many times, the executor will exit because of a timeout 
> exception:
> {code}
> 16/09/11 04:02:42 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:02:42 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:02:42 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:02:52 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:02:52 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:02:52 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:02 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:02 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:02 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:12 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:12 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:12 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:22 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:22 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:22 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:32 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:32 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:32 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:42 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:42 INFO storage.BlockManager: BlockManager re-registering with 
> master
> 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:42 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:45 ERROR executor.CoarseGrainedExecutorBackend: Cannot 
> register with driver: 
> spark://coarsegrainedschedu...@bigdata-arch-jms05.xg01.diditaxi.com:22168
> org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 
> seconds. This timeout is controlled by spark.rpc.askTimeout
> at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
> at scala.util.Try$.apply(Try.scala:161)
> at scala.util.Failure.recover(Try.scala:185)
> at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
> at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> at 
> org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
> at 
> 

[jira] [Updated] (SPARK-17501) Re-register BlockManager again and again

2016-09-11 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-17501:
--
Description: 
After re-registering many times, the executor will exit because of a timeout 
exception:

{code}
16/09/11 04:02:42 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:02:42 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:02:42 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:02:42 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:02:42 INFO storage.BlockManager: Reporting 0 blocks to the master.
16/09/11 04:02:52 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:02:52 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:02:52 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:02:52 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:02:52 INFO storage.BlockManager: Reporting 0 blocks to the master.
16/09/11 04:03:02 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:03:02 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:03:02 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:03:02 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:03:02 INFO storage.BlockManager: Reporting 0 blocks to the master.
16/09/11 04:03:12 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:03:12 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:03:12 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:03:12 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:03:12 INFO storage.BlockManager: Reporting 0 blocks to the master.
16/09/11 04:03:22 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:03:22 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:03:22 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:03:22 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:03:22 INFO storage.BlockManager: Reporting 0 blocks to the master.
16/09/11 04:03:32 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:03:32 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:03:32 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:03:32 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:03:32 INFO storage.BlockManager: Reporting 0 blocks to the master.
16/09/11 04:03:42 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:03:42 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:03:42 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:03:42 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:03:42 INFO storage.BlockManager: Reporting 0 blocks to the master.



16/09/11 04:03:45 ERROR executor.CoarseGrainedExecutorBackend: Cannot register 
with driver: 
spark://coarsegrainedschedu...@bigdata-arch-jms05.xg01.diditaxi.com:22168
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 
seconds. This timeout is controlled by spark.rpc.askTimeout
at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Failure.recover(Try.scala:185)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at 
org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at 
scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.complete(Promise.scala:55)
at 
scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at 

[jira] [Updated] (SPARK-17501) Re-register BlockManager again and again

2016-09-11 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-17501:
--
Description: 
{code}
16/09/11 04:02:42 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:02:42 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:02:42 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:02:42 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:02:42 INFO storage.BlockManager: Reporting 0 blocks to the master.
16/09/11 04:02:52 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:02:52 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:02:52 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:02:52 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:02:52 INFO storage.BlockManager: Reporting 0 blocks to the master.
16/09/11 04:03:02 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:03:02 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:03:02 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:03:02 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:03:02 INFO storage.BlockManager: Reporting 0 blocks to the master.
16/09/11 04:03:12 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:03:12 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:03:12 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:03:12 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:03:12 INFO storage.BlockManager: Reporting 0 blocks to the master.
16/09/11 04:03:22 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:03:22 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:03:22 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:03:22 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:03:22 INFO storage.BlockManager: Reporting 0 blocks to the master.
16/09/11 04:03:32 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:03:32 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:03:32 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:03:32 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:03:32 INFO storage.BlockManager: Reporting 0 blocks to the master.
16/09/11 04:03:42 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:03:42 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:03:42 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:03:42 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:03:42 INFO storage.BlockManager: Reporting 0 blocks to the master.



16/09/11 04:03:45 ERROR executor.CoarseGrainedExecutorBackend: Cannot register 
with driver: 
spark://coarsegrainedschedu...@bigdata-arch-jms05.xg01.diditaxi.com:22168
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 
seconds. This timeout is controlled by spark.rpc.askTimeout
at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Failure.recover(Try.scala:185)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at 
org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at 
scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.complete(Promise.scala:55)
at 
scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at 

[jira] [Created] (SPARK-17501) Re-register BlockManager again and again

2016-09-11 Thread cen yuhai (JIRA)
cen yuhai created SPARK-17501:
-

 Summary: Re-register BlockManager again and again
 Key: SPARK-17501
 URL: https://issues.apache.org/jira/browse/SPARK-17501
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.2
Reporter: cen yuhai


16/09/11 04:02:42 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:02:42 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:02:42 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:02:42 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:02:42 INFO storage.BlockManager: Reporting 0 blocks to the master.
16/09/11 04:02:52 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:02:52 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:02:52 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:02:52 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:02:52 INFO storage.BlockManager: Reporting 0 blocks to the master.
16/09/11 04:03:02 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:03:02 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:03:02 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:03:02 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:03:02 INFO storage.BlockManager: Reporting 0 blocks to the master.
16/09/11 04:03:12 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:03:12 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:03:12 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:03:12 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:03:12 INFO storage.BlockManager: Reporting 0 blocks to the master.
16/09/11 04:03:22 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:03:22 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:03:22 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:03:22 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:03:22 INFO storage.BlockManager: Reporting 0 blocks to the master.
16/09/11 04:03:32 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:03:32 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:03:32 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:03:32 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:03:32 INFO storage.BlockManager: Reporting 0 blocks to the master.
16/09/11 04:03:42 INFO executor.Executor: Told to re-register on heartbeat
16/09/11 04:03:42 INFO storage.BlockManager: BlockManager re-registering with 
master
16/09/11 04:03:42 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
16/09/11 04:03:42 INFO storage.BlockManagerMaster: Registered BlockManager
16/09/11 04:03:42 INFO storage.BlockManager: Reporting 0 blocks to the master.



16/09/11 04:03:45 ERROR executor.CoarseGrainedExecutorBackend: Cannot register 
with driver: 
spark://coarsegrainedschedu...@bigdata-arch-jms05.xg01.diditaxi.com:22168
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 
seconds. This timeout is controlled by spark.rpc.askTimeout
at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Failure.recover(Try.scala:185)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at 
org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at 
scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.complete(Promise.scala:55)
at 
scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at 

[jira] [Resolved] (SPARK-17486) Remove unused TaskMetricsUIData.updatedBlockStatuses field

2016-09-11 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-17486.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

> Remove unused TaskMetricsUIData.updatedBlockStatuses field
> --
>
> Key: SPARK-17486
> URL: https://issues.apache.org/jira/browse/SPARK-17486
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.1, 2.1.0
>
>
> The {{TaskMetricsUIData.updatedBlockStatuses}} field is assigned to but never 
> read, increasing the memory consumption of the web UI. We should remove this 
> field.






[jira] [Commented] (SPARK-12008) Spark hive security authorization doesn't work as Apache hive's

2016-09-11 Thread pin_zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15482828#comment-15482828
 ] 

pin_zhang commented on SPARK-12008:
---

Does Spark SQL have any plan to support authorization in the near future?

> Spark hive security authorization doesn't work as Apache hive's
> ---
>
> Key: SPARK-12008
> URL: https://issues.apache.org/jira/browse/SPARK-12008
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: pin_zhang
>
> Spark Hive security authorization is not consistent with Apache Hive.
> The same hive-site.xml is used:
> <property>
>   <name>hive.security.authorization.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.security.authorization.manager</name>
>   <value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory</value>
> </property>
> <property>
>   <name>hive.security.authenticator.manager</name>
>   <value>org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator</value>
> </property>
> <property>
>   <name>hive.server2.enable.doAs</name>
>   <value>true</value>
> </property>
> 1. Run Spark's start-thriftserver.sh; running SQL then hits this exception:
>SQL standards based authorization should not be enabled from hive 
> cli. Instead the use of storage based authorization in hive metastore is 
> reccomended. 
>Set hive.security.authorization.enabled=false to disable authz within cli
> 2. Instead, start start-thriftserver.sh with the Hive configurations passed as --conf options:
> ./start-thriftserver.sh --conf 
> hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
>  --conf 
> hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
>  
> 3. Connect with Beeline as userA and create table tableA.
> 4. Connect with Beeline as userB and truncate tableA.
>   A) In Apache Hive, the truncate fails with an exception:
>   Error while compiling statement: FAILED: HiveAccessControlException 
> Permission denied: Principal [name=userB, type=USER] does not have following 
> privileges for operation TRUNCATETABLE [[OBJECT OWNERSHIP] on Object 
> [type=TABLE_OR_VIEW, name=default.tablea]] (state=42000,code=4)
>   B) In Spark's Hive integration, any user that can connect can truncate the table, as 
> long as the Spark service user has the privileges.






[jira] [Commented] (SPARK-11374) skip.header.line.count is ignored in HiveContext

2016-09-11 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15482812#comment-15482812
 ] 

Dongjoon Hyun commented on SPARK-11374:
---

Which version of Spark are you using now? For **one**-line header removal, the 
`spark-csv` package has a workaround for Spark 1.6.x and below. In addition, 
Spark 2.0 also supports that functionality natively.

https://github.com/databricks/spark-csv

If you want this as a SQL table option, we don't have a workaround.
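
As an illustration only (the path is hypothetical), the one-line-header workaround looks roughly like this:

{code}
// Spark 1.6.x, with the spark-csv package on the classpath
val df16 = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")   // treat the first line as a header and drop it from the data
  .load("/path/to/table/data.csv")

// Spark 2.0: the same option is built in
val df20 = spark.read.option("header", "true").csv("/path/to/table/data.csv")
{code}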

> skip.header.line.count is ignored in HiveContext
> 
>
> Key: SPARK-11374
> URL: https://issues.apache.org/jira/browse/SPARK-11374
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Daniel Haviv
>
> A CSV table in Hive is configured to skip the header row using 
> TBLPROPERTIES("skip.header.line.count"="1").
> When querying from Hive the header row is not included in the data, but when 
> running the same query via HiveContext I get the header row.
> "show create table " via the HiveContext confirms that it is aware of the 
> setting.






[jira] [Commented] (SPARK-17454) Add option to specify Mesos resource offer constraints

2016-09-11 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15482682#comment-15482682
 ] 

Michael Gummelt commented on SPARK-17454:
-

As of Spark 2.0, Mesos mode supports spark.executor.cores.

And the scheduler doesn't reserve any disk.  It just writes to the local 
workspace.  Do you have a need for disk reservation?
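
For context, here is a rough sketch of the knobs that exist today (values are illustrative); the request in this ticket is essentially a constraints-like filter on offered resources rather than on offer attributes:

{code}
import org.apache.spark.SparkConf

// Illustrative values only.
val conf = new SparkConf()
  .set("spark.executor.cores", "4")           // per-executor CPUs, honored by Mesos mode since 2.0
  .set("spark.cores.max", "32")               // cap on total cores accepted from offers
  .set("spark.executor.memory", "8g")
  .set("spark.mesos.constraints", "rack:r1")  // today this matches offer attributes, not resources
{code}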

> Add option to specify Mesos resource offer constraints
> --
>
> Key: SPARK-17454
> URL: https://issues.apache.org/jira/browse/SPARK-17454
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Chris Bannister
>
> Currently the driver will accept offers from Mesos that have enough RAM for 
> the executor, until its max cores is reached. There is no way to control 
> the required CPUs or disk for each executor; it would be very useful to be 
> able to apply something similar to spark.mesos.constraints to resource offers 
> instead of to attributes on the offer.






[jira] [Updated] (SPARK-17500) The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is not right

2016-09-11 Thread DjvuLee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DjvuLee updated SPARK-17500:

Description: The DiskBytesSpilled metric in ExternalMerger && 
ExternalGroupBy is increased by the file size on each spill, but only the file 
size from the last spill is needed.  (was: The DiskBytesSpilled metric in 
ExternalMerger && ExternalGroupBy is increment by file size in each spill, but 
we only need the file size at the last time.)

> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is not right
> -
>
> Key: SPARK-17500
> URL: https://issues.apache.org/jira/browse/SPARK-17500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>Reporter: DjvuLee
>
> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is increased 
> by the file size on each spill, but only the file size from the last spill is needed.






[jira] [Updated] (SPARK-17500) The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is not right

2016-09-11 Thread DjvuLee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DjvuLee updated SPARK-17500:

Summary: The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy 
is not right  (was: The DiskBytesSpilled metric in ExternalMerger && 
ExternalGroupBy is wrong)

> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is not right
> -
>
> Key: SPARK-17500
> URL: https://issues.apache.org/jira/browse/SPARK-17500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>Reporter: DjvuLee
>
> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is incremented 
> by the file size on each spill, but only the file size from the last spill is needed.






[jira] [Assigned] (SPARK-17500) The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is wrong

2016-09-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17500:


Assignee: Apache Spark

> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is wrong
> -
>
> Key: SPARK-17500
> URL: https://issues.apache.org/jira/browse/SPARK-17500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>Reporter: DjvuLee
>Assignee: Apache Spark
>
> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is incremented 
> by the file size on each spill, but only the file size from the last spill is needed.






[jira] [Assigned] (SPARK-17500) The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is wrong

2016-09-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17500:


Assignee: (was: Apache Spark)

> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is wrong
> -
>
> Key: SPARK-17500
> URL: https://issues.apache.org/jira/browse/SPARK-17500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>Reporter: DjvuLee
>
> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is incremented 
> by the file size on each spill, but only the file size from the last spill is needed.






[jira] [Commented] (SPARK-17500) The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is wrong

2016-09-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15482303#comment-15482303
 ] 

Apache Spark commented on SPARK-17500:
--

User 'djvulee' has created a pull request for this issue:
https://github.com/apache/spark/pull/15052

> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is wrong
> -
>
> Key: SPARK-17500
> URL: https://issues.apache.org/jira/browse/SPARK-17500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>Reporter: DjvuLee
>
> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is incremented 
> by the file size on each spill, but only the file size from the last spill is needed.






[jira] [Created] (SPARK-17500) The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is wrong

2016-09-11 Thread DjvuLee (JIRA)
DjvuLee created SPARK-17500:
---

 Summary: The DiskBytesSpilled metric in ExternalMerger && 
ExternalGroupBy is wrong
 Key: SPARK-17500
 URL: https://issues.apache.org/jira/browse/SPARK-17500
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0, 1.6.2, 1.6.1, 1.6.0
Reporter: DjvuLee


The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is incremented 
by the file size on each spill, but only the file size from the last spill is needed.
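
The accounting issue can be illustrated with a small sketch. This is not the PySpark shuffle code, just the idea: if spills append to the same files and the metric adds the whole current file size on every spill, earlier bytes are counted repeatedly, whereas adding only the growth since the previous spill makes the metric end up equal to the final file size.

{code}
// Illustrative sketch only, not the actual PySpark ExternalMerger code.
final class SpillMetrics {
  private var lastFileSize = 0L
  var diskBytesSpilled = 0L

  // currentFileSize is the size of the spill file after this spill completes.
  def onSpill(currentFileSize: Long): Unit = {
    diskBytesSpilled += currentFileSize - lastFileSize // count only the newly written bytes
    lastFileSize = currentFileSize
  }
}
{code}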






[jira] [Assigned] (SPARK-17499) make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier

2016-09-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17499:


Assignee: Apache Spark

> make the default params in sparkR spark.mlp consistent with 
> MultilayerPerceptronClassifier
> --
>
> Key: SPARK-17499
> URL: https://issues.apache.org/jira/browse/SPARK-17499
> Project: Spark
>  Issue Type: Improvement
>Reporter: Weichen Xu
>Assignee: Apache Spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> several default params in sparkR spark.mlp are wrong:
> layers should be null
> tol should be 1e-6
> stepSize should be 0.03
> seed should be -763139545






[jira] [Commented] (SPARK-17499) make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier

2016-09-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15481973#comment-15481973
 ] 

Apache Spark commented on SPARK-17499:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/15051

> make the default params in sparkR spark.mlp consistent with 
> MultilayerPerceptronClassifier
> --
>
> Key: SPARK-17499
> URL: https://issues.apache.org/jira/browse/SPARK-17499
> Project: Spark
>  Issue Type: Improvement
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> several default params in sparkR spark.mlp are wrong:
> layers should be null
> tol should be 1e-6
> stepSize should be 0.03
> seed should be -763139545






[jira] [Assigned] (SPARK-17499) make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier

2016-09-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17499:


Assignee: (was: Apache Spark)

> make the default params in sparkR spark.mlp consistent with 
> MultilayerPerceptronClassifier
> --
>
> Key: SPARK-17499
> URL: https://issues.apache.org/jira/browse/SPARK-17499
> Project: Spark
>  Issue Type: Improvement
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> several default params in sparkR spark.mlp are wrong:
> layers should be null
> tol should be 1e-6
> stepSize should be 0.03
> seed should be -763139545






[jira] [Created] (SPARK-17499) make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier

2016-09-11 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-17499:
--

 Summary: make the default params in sparkR spark.mlp consistent 
with MultilayerPerceptronClassifier
 Key: SPARK-17499
 URL: https://issues.apache.org/jira/browse/SPARK-17499
 Project: Spark
  Issue Type: Improvement
Reporter: Weichen Xu


Several default params in sparkR spark.mlp are wrong (a quick way to check the Scala-side defaults is sketched after this list):

layers should be null
tol should be 1e-6
stepSize should be 0.03
seed should be -763139545
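
An illustrative way to read the Scala-side defaults for comparison; nothing below asserts specific values, it just prints whatever the running Spark version defines:

{code}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val mlp = new MultilayerPerceptronClassifier()
// Print the defaults defined by the running Spark version.
println(s"tol=${mlp.getTol}, stepSize=${mlp.getStepSize}, seed=${mlp.getSeed}")
println(s"layers set? ${mlp.isSet(mlp.layers)}")  // layers has no default, so this prints false
{code}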






[jira] [Created] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'

2016-09-11 Thread Miroslav Balaz (JIRA)
Miroslav Balaz created SPARK-17498:
--

 Summary: StringIndexer.setHandleInvalid should have another option 
'new'
 Key: SPARK-17498
 URL: https://issues.apache.org/jira/browse/SPARK-17498
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Miroslav Balaz


That would map an unseen label to the maximum known label index + 1; IndexToString 
would map it back to "" or NA, if Spark has something like that.
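
For context, a minimal sketch of today's API; only "error" (the default) and "skip" exist, and the proposed option is hypothetical here:

{code}
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("skip")         // current behavior: rows with unseen labels are dropped

// The proposal: a new handleInvalid option that keeps such rows and assigns them
// index (maximum known label index) + 1 instead of dropping them.

val converter = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("originalCategory") // would need a placeholder label ("" or NA) for the new index
{code}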






[jira] [Updated] (SPARK-17497) Preserve order when scanning ordered buckets over multiple partitions

2016-09-11 Thread Fridtjof Sander (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fridtjof Sander updated SPARK-17497:

Description: 
Non-associative aggregations (like ```collect_list```) require the data to be 
sorted on the grouping key in order to extract aggregation-groups.

Let `table` be a Hive-table, that is partitioned on `p` and bucketed and sorted 
on `id`. Let `q` be a query, that executes a non-associative aggregation on 
`table.id` over multiple partitions `p`.

Currently, when executing `q`, Spark creates as many RDD-partitions as there 
are buckets. Each RDD-partition is created in `FileScanRDD`, by fetching the 
associated buckets in all requested Hive-partitions. Because the buckets are 
read one-by-one, the resulting RDD-partition is no longer sorted on `id` and 
has to be explicitly sorted before performing the aggregation. Therefore an 
execution-pipeline-block is introduced.

In this Jira I propose to offer an alternative bucket-fetching strategy to the 
optimizer, that preserves the internal sorting in a situation described above.

One way to achieve this is to open all buckets over all partitions 
simultaneously when fetching the data. Since each bucket is internally sorted, 
we can perform basically a merge-sort on the collection of bucket-iterators, 
and directly emit a sorted RDD-partition, that can be piped into the next 
operator.

While there should be no question about the theoretical feasibility of this 
idea, there are some obvious implications, e.g. with regard to IO handling.

I would like to investigate the practical feasibility, limits, gains and 
drawbacks of this optimization in my master's thesis and, of course, contribute 
the implementation. Before I start, however, I wanted to kindly ask you, the 
community, for any thoughts, opinions, corrections or other kinds of feedback, 
which is much appreciated.

  was:
Non-associative aggregations (like `collect_list`) require the data to be 
sorted on the grouping key in order to extract aggregation-groups.

Let `table` be a Hive-table, that is partitioned on `p` and bucketed and sorted 
on `id`. Let `q` be a query, that executes a non-associative aggregation on 
`table.id` over multiple partitions `p`.

Currently, when executing `q`, Spark creates as many RDD-partitions as there 
are buckets. Each RDD-partition is created in `FileScanRDD`, by fetching the 
associated buckets in all requested Hive-partitions. Because the buckets are 
read one-by-one, the resulting RDD-partition is no longer sorted on `id` and 
has to be explicitly sorted before performing the aggregation. Therefore an 
execution-pipeline-block is introduced.

In this Jira I propose to offer an alternative bucket-fetching strategy to the 
optimizer, that preserves the internal sorting in a situation described above.

One way to achieve this, is to open all buckets over all partitions 
simultaneously when fetching the data. Since each bucket is internally sorted, 
we can perform basically a merge-sort on the collection of bucket-iterators, 
and directly emit a sorted RDD-partition, that can be piped into the next 
operator.

While there should be no question about the theoretical feasibility of this 
idea, there are some obvious implications i.e. with regards to IO-handling.

I would like to investigate the practical feasibility, limits, gains and 
drawbacks of this optimization in my masters-thesis and, of course, contribute 
the implementation. Before I start, however, I wanted to kindly ask you, the 
community, for any thoughts, opinions, corrections or other kinds of feedback, 
which is much appreciated.


> Preserve order when scanning ordered buckets over multiple partitions
> -
>
> Key: SPARK-17497
> URL: https://issues.apache.org/jira/browse/SPARK-17497
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Fridtjof Sander
>Priority: Minor
>
> Non-associative aggregations (like ```collect_list```) require the data to be 
> sorted on the grouping key in order to extract aggregation-groups.
> Let `table` be a Hive-table, that is partitioned on `p` and bucketed and 
> sorted on `id`. Let `q` be a query, that executes a non-associative 
> aggregation on `table.id` over multiple partitions `p`.
> Currently, when executing `q`, Spark creates as many RDD-partitions as there 
> are buckets. Each RDD-partition is created in `FileScanRDD`, by fetching the 
> associated buckets in all requested Hive-partitions. Because the buckets are 
> read one-by-one, the resulting RDD-partition is no longer sorted on `id` and 
> has to be explicitly sorted before performing the aggregation. Therefore an 
> execution-pipeline-block is introduced.
> In this Jira I propose to offer an alternative bucket-fetching strategy 

[jira] [Updated] (SPARK-17497) Preserve order when scanning ordered buckets over multiple partitions

2016-09-11 Thread Fridtjof Sander (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fridtjof Sander updated SPARK-17497:

Description: 
Non-associative aggregations (like `collect_list`) require the data to be 
sorted on the grouping key in order to extract aggregation-groups.

Let `table` be a Hive-table, that is partitioned on `p` and bucketed and sorted 
on `id`. Let `q` be a query, that executes a non-associative aggregation on 
`table.id` over multiple partitions `p`.

Currently, when executing `q`, Spark creates as many RDD-partitions as there 
are buckets. Each RDD-partition is created in `FileScanRDD`, by fetching the 
associated buckets in all requested Hive-partitions. Because the buckets are 
read one-by-one, the resulting RDD-partition is no longer sorted on `id` and 
has to be explicitly sorted before performing the aggregation. Therefore an 
execution-pipeline-block is introduced.

In this Jira I propose to offer an alternative bucket-fetching strategy to the 
optimizer, that preserves the internal sorting in a situation described above.

One way to achieve this is to open all buckets over all partitions 
simultaneously when fetching the data. Since each bucket is internally sorted, 
we can perform basically a merge-sort on the collection of bucket-iterators, 
and directly emit a sorted RDD-partition, that can be piped into the next 
operator.

While there should be no question about the theoretical feasibility of this 
idea, there are some obvious implications, e.g. with regard to IO handling.

I would like to investigate the practical feasibility, limits, gains and 
drawbacks of this optimization in my master's thesis and, of course, contribute 
the implementation. Before I start, however, I wanted to kindly ask you, the 
community, for any thoughts, opinions, corrections or other kinds of feedback, 
which is much appreciated.

  was:
Non-associative aggregations (like ```collect_list```) require the data to be 
sorted on the grouping key in order to extract aggregation-groups.

Let `table` be a Hive-table, that is partitioned on `p` and bucketed and sorted 
on `id`. Let `q` be a query, that executes a non-associative aggregation on 
`table.id` over multiple partitions `p`.

Currently, when executing `q`, Spark creates as many RDD-partitions as there 
are buckets. Each RDD-partition is created in `FileScanRDD`, by fetching the 
associated buckets in all requested Hive-partitions. Because the buckets are 
read one-by-one, the resulting RDD-partition is no longer sorted on `id` and 
has to be explicitly sorted before performing the aggregation. Therefore an 
execution-pipeline-block is introduced.

In this Jira I propose to offer an alternative bucket-fetching strategy to the 
optimizer, that preserves the internal sorting in a situation described above.

One way to achieve this, is to open all buckets over all partitions 
simultaneously when fetching the data. Since each bucket is internally sorted, 
we can perform basically a merge-sort on the collection of bucket-iterators, 
and directly emit a sorted RDD-partition, that can be piped into the next 
operator.

While there should be no question about the theoretical feasibility of this 
idea, there are some obvious implications i.e. with regards to IO-handling.

I would like to investigate the practical feasibility, limits, gains and 
drawbacks of this optimization in my masters-thesis and, of course, contribute 
the implementation. Before I start, however, I wanted to kindly ask you, the 
community, for any thoughts, opinions, corrections or other kinds of feedback, 
which is much appreciated.


> Preserve order when scanning ordered buckets over multiple partitions
> -
>
> Key: SPARK-17497
> URL: https://issues.apache.org/jira/browse/SPARK-17497
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Fridtjof Sander
>Priority: Minor
>
> Non-associative aggregations (like `collect_list`) require the data to be 
> sorted on the grouping key in order to extract aggregation-groups.
> Let `table` be a Hive-table, that is partitioned on `p` and bucketed and 
> sorted on `id`. Let `q` be a query, that executes a non-associative 
> aggregation on `table.id` over multiple partitions `p`.
> Currently, when executing `q`, Spark creates as many RDD-partitions as there 
> are buckets. Each RDD-partition is created in `FileScanRDD`, by fetching the 
> associated buckets in all requested Hive-partitions. Because the buckets are 
> read one-by-one, the resulting RDD-partition is no longer sorted on `id` and 
> has to be explicitly sorted before performing the aggregation. Therefore an 
> execution-pipeline-block is introduced.
> In this Jira I propose to offer an alternative bucket-fetching strategy to 

[jira] [Created] (SPARK-17497) Preserve order when scanning ordered buckets over multiple partitions

2016-09-11 Thread Fridtjof Sander (JIRA)
Fridtjof Sander created SPARK-17497:
---

 Summary: Preserve order when scanning ordered buckets over 
multiple partitions
 Key: SPARK-17497
 URL: https://issues.apache.org/jira/browse/SPARK-17497
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Fridtjof Sander
Priority: Minor


Non-associative aggregations (like `collect_list`) require the data to be 
sorted on the grouping key in order to extract aggregation-groups.

Let `table` be a Hive-table, that is partitioned on `p` and bucketed and sorted 
on `id`. Let `q` be a query, that executes a non-associative aggregation on 
`table.id` over multiple partitions `p`.

Currently, when executing `q`, Spark creates as many RDD-partitions as there 
are buckets. Each RDD-partition is created in `FileScanRDD`, by fetching the 
associated buckets in all requested Hive-partitions. Because the buckets are 
read one-by-one, the resulting RDD-partition is no longer sorted on `id` and 
has to be explicitly sorted before performing the aggregation. Therefore an 
execution-pipeline-block is introduced.

In this Jira I propose to offer an alternative bucket-fetching strategy to the 
optimizer, that preserves the internal sorting in a situation described above.

One way to achieve this is to open all buckets over all partitions 
simultaneously when fetching the data. Since each bucket is internally sorted, 
we can perform basically a merge-sort on the collection of bucket-iterators, 
and directly emit a sorted RDD-partition, that can be piped into the next 
operator.
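
A minimal sketch of the merge step described above, assuming each bucket iterator is already sorted according to the key ordering; this is purely illustrative and not the proposed FileScanRDD change:

{code}
import scala.collection.mutable

// K-way merge over per-bucket iterators that are each sorted according to `ord`.
def mergeSortedBuckets[T](buckets: Seq[Iterator[T]])(implicit ord: Ordering[T]): Iterator[T] =
  new Iterator[T] {
    // Order heap entries by their head element; reverse so the smallest head is
    // dequeued first (Scala's PriorityQueue is a max-heap).
    private implicit val headOrdering: Ordering[(T, Iterator[T])] =
      Ordering.by[(T, Iterator[T]), T](_._1).reverse
    private val heap = mutable.PriorityQueue.empty[(T, Iterator[T])]
    buckets.foreach(it => if (it.hasNext) heap.enqueue((it.next(), it)))

    override def hasNext: Boolean = heap.nonEmpty
    override def next(): T = {
      val (value, it) = heap.dequeue()
      if (it.hasNext) heap.enqueue((it.next(), it)) // refill from the bucket just consumed
      value
    }
  }
{code}

Feeding it the iterators for one bucket id across all requested partitions would yield a single iterator still sorted on the bucket key, which could then be piped into the aggregation without an extra sort.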

While there should be no question about the theoretical feasibility of this 
idea, there are some obvious implications, e.g. with regard to IO handling.

I would like to investigate the practical feasibility, limits, gains and 
drawbacks of this optimization in my master's thesis and, of course, contribute 
the implementation. Before I start, however, I wanted to kindly ask you, the 
community, for any thoughts, opinions, corrections or other kinds of feedback, 
which is much appreciated.






[jira] [Commented] (SPARK-17389) KMeans speedup with better choice of k-means|| init steps = 2

2016-09-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15481490#comment-15481490
 ] 

Apache Spark commented on SPARK-17389:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/15050

> KMeans speedup with better choice of k-means|| init steps = 2
> -
>
> Key: SPARK-17389
> URL: https://issues.apache.org/jira/browse/SPARK-17389
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.1.0
>
>
> As reported in 
> http://stackoverflow.com/questions/39260820/is-sparks-kmeans-unable-to-handle-bigdata#39260820
>  KMeans can be surprisingly slow, and it's easy to see that most of the time 
> spent is in kmeans|| initialization. For example, in this simple example...
> {code}
> import org.apache.spark.mllib.random.RandomRDDs
> import org.apache.spark.mllib.clustering.KMeans
> val data = RandomRDDs.uniformVectorRDD(sc, 100, 64, 
> sc.defaultParallelism).cache()
> data.count()
> new KMeans().setK(1000).setMaxIterations(5).run(data)
> {code}
> Init takes 5:54, and iterations take about 0:15 each, on my laptop. Init 
> takes about as long as 24 iterations, which is a typical run, meaning half 
> the time is just in picking cluster centers. This seems excessive.
> There are two ways to speed this up significantly. First, the implementation 
> has an old "runs" parameter that is always 1 now. It used to allow multiple 
> clusterings to be computed at once. The code can be simplified significantly 
> now that runs=1 always. This is already covered by SPARK-11560, but just a 
> simple refactoring results in about a 13% init speedup, from 5:54 to 5:09 in 
> this example. That's not what this change is about though.
> By default, k-means|| makes 5 passes over the data. The original paper at 
> http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf actually shows 
> that 2 is plenty, certainly when l=2k as is the case in our implementation. 
> (See Figure 5.2/5.3; I believe the default of 5 was taken from Table 6 but 
> it's not suggesting 5 is an optimal value.) Defaulting to 2 brings it down to 
> 1:41 -- much improved over 5:54.
> Lastly, small thing, but the code will perform a local k-means++ step to 
> reduce the number of centers to k even if there are already only <= k 
> centers. This can be short-circuited. However this is really the topic of 
> SPARK-3261 because this can cause fewer than k clusters to be returned where 
> that would actually be correct, too.
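
For reference only, the proposed default can be tried today on the example quoted above by setting the init steps explicitly (same setup, one extra setter):

{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.random.RandomRDDs

val data = RandomRDDs.uniformVectorRDD(sc, 100, 64, sc.defaultParallelism).cache()
data.count()

// Force 2 k-means|| passes instead of the current default of 5.
new KMeans().setK(1000).setMaxIterations(5).setInitializationSteps(2).run(data)
{code}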






[jira] [Resolved] (SPARK-17336) Repeated calls sbin/spark-config.sh file Causes ${PYTHONPATH} Value duplicate

2016-09-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17336.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 15028
[https://github.com/apache/spark/pull/15028]

> Repeated calls sbin/spark-config.sh file Causes ${PYTHONPATH} Value duplicate
> -
>
> Key: SPARK-17336
> URL: https://issues.apache.org/jira/browse/SPARK-17336
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: anxu
> Fix For: 2.0.1, 2.1.0
>
>
> On Spark startup via sbin/start-all.sh, sbin/spark-config.sh is 
> called repeatedly. In the sbin/spark-config.sh code:
> {code:title=sbin/spark-config.sh|borderStyle=solid}
> # Add the PySpark classes to the PYTHONPATH:
> export PYTHONPATH="${SPARK_HOME}/python:${PYTHONPATH}"
> export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.3-src.zip:${PYTHONPATH}"
> {code}
> {color:red}PYTHONPATH{color} ends up with duplicate values.
> For example:
> {code:borderStyle=solid}
> axu4iMac:spark-2.0.0-hadoop2.4 axu$ sbin/start-all.sh  | grep PYTHONPATH
> axu.print [Log] [6,16,31] [sbin/spark-config.sh] 定义PYTHONPATH
> axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(1): 
> [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:]
> axu.print [Log] [7,17,32] [sbin/spark-config.sh] 再次定义PYTHONPATH
> axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(2): 
> [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:]
> axu.print [Log] [6,16,31] [sbin/spark-config.sh] 定义PYTHONPATH
> axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(1): 
> [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:]
> axu.print [Log] [7,17,32] [sbin/spark-config.sh] 再次定义PYTHONPATH
> axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(2): 
> [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:]
> axu.print [Log] [6,16,31] [sbin/spark-config.sh] 定义PYTHONPATH
> axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(1): 
> [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:]
> axu.print [Log] [7,17,32] [sbin/spark-config.sh] 再次定义PYTHONPATH
> axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(2): 
> [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:]
> {code}






[jira] [Updated] (SPARK-17336) Repeated calls sbin/spark-config.sh file Causes ${PYTHONPATH} Value duplicate

2016-09-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17336:
--
Assignee: Bryan Cutler
Priority: Minor  (was: Major)

> Repeated calls sbin/spark-config.sh file Causes ${PYTHONPATH} Value duplicate
> -
>
> Key: SPARK-17336
> URL: https://issues.apache.org/jira/browse/SPARK-17336
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: anxu
>Assignee: Bryan Cutler
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> On Spark startup via sbin/start-all.sh, sbin/spark-config.sh is 
> called repeatedly. In the sbin/spark-config.sh code:
> {code:title=sbin/spark-config.sh|borderStyle=solid}
> # Add the PySpark classes to the PYTHONPATH:
> export PYTHONPATH="${SPARK_HOME}/python:${PYTHONPATH}"
> export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.3-src.zip:${PYTHONPATH}"
> {code}
> {color:red}PYTHONPATH{color} ends up with duplicate values.
> For example:
> {code:borderStyle=solid}
> axu4iMac:spark-2.0.0-hadoop2.4 axu$ sbin/start-all.sh  | grep PYTHONPATH
> axu.print [Log] [6,16,31] [sbin/spark-config.sh] 定义PYTHONPATH
> axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(1): 
> [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:]
> axu.print [Log] [7,17,32] [sbin/spark-config.sh] 再次定义PYTHONPATH
> axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(2): 
> [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:]
> axu.print [Log] [6,16,31] [sbin/spark-config.sh] define PYTHONPATH
> axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(1): 
> [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:]
> axu.print [Log] [7,17,32] [sbin/spark-config.sh] define PYTHONPATH again
> axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(2): 
> [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:]
> axu.print [Log] [6,16,31] [sbin/spark-config.sh] define PYTHONPATH
> axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(1): 
> [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:]
> axu.print [Log] [7,17,32] [sbin/spark-config.sh] define PYTHONPATH again
> axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(2): 
> [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:]
> {code}
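>
> The duplicated entries are harmless to Python itself (the first match on the 
> path wins), but they make the environment confusing to read. Below is a minimal 
> Python sketch, not the actual fix, of the uniqueness that a guard in 
> sbin/spark-config.sh needs to preserve; the paths in it are hypothetical:
> {code}
> def dedupe_path(value, sep=":"):
>     """Drop repeated entries from a separator-delimited path, keeping first occurrences."""
>     seen = set()
>     unique = []
>     for entry in value.split(sep):
>         if entry and entry not in seen:
>             seen.add(entry)
>             unique.append(entry)
>     return sep.join(unique)
>
> # Hypothetical PYTHONPATH after sourcing spark-config.sh twice.
> dup = ("/opt/spark/python/lib/py4j-0.10.1-src.zip:/opt/spark/python:"
>        "/opt/spark/python/lib/py4j-0.10.1-src.zip:/opt/spark/python")
> print(dedupe_path(dup))   # prints each entry once
> {code}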






[jira] [Resolved] (SPARK-17330) Clean up spark-warehouse in UT

2016-09-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17330.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14894
[https://github.com/apache/spark/pull/14894]

> Clean up spark-warehouse in UT
> --
>
> Key: SPARK-17330
> URL: https://issues.apache.org/jira/browse/SPARK-17330
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: tone
>Assignee: tone
>Priority: Minor
> Fix For: 2.1.0
>
>
> When running the Spark unit tests on the latest master branch, the test case 
> (SPARK-8368) passes the first time but always fails when run again. The error 
> log is below:
> [info]   2016-08-31 09:35:51.967 - stderr> 16/08/31 09:35:51 ERROR 
> RetryingHMSHandler: AlreadyExistsException(message:Database default already 
> exists)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:891)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> com.sun.proxy.$Proxy18.create_database(Unknown Source)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:644)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> com.sun.proxy.$Proxy19.createDatabase(Unknown Source)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.hadoop.hive.ql.metadata.Hive.createDatabase(Hive.java:306)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply$mcV$sp(HiveClientImpl.scala:310)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:310)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:310)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:281)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:228)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:227)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:270)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createDatabase(HiveClientImpl.scala:309)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply$mcV$sp(HiveExternalCatalog.scala:120)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:120)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:120)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:87)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> 

[jira] [Updated] (SPARK-17330) Clean up spark-warehouse in UT

2016-09-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17330:
--
  Assignee: tone
Issue Type: Improvement  (was: Bug)

> Clean up spark-warehouse in UT
> --
>
> Key: SPARK-17330
> URL: https://issues.apache.org/jira/browse/SPARK-17330
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: tone
>Assignee: tone
>Priority: Minor
> Fix For: 2.1.0
>
>
> When running the Spark unit tests on the latest master branch, the test case 
> (SPARK-8368) passes the first time but always fails when run again. The error 
> log is below; a manual clean-up sketch follows the trace.
> [info]   2016-08-31 09:35:51.967 - stderr> 16/08/31 09:35:51 ERROR 
> RetryingHMSHandler: AlreadyExistsException(message:Database default already 
> exists)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:891)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> com.sun.proxy.$Proxy18.create_database(Unknown Source)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:644)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> [info]   2016-08-31 09:35:51.967 - stderr>  at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> com.sun.proxy.$Proxy19.createDatabase(Unknown Source)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.hadoop.hive.ql.metadata.Hive.createDatabase(Hive.java:306)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply$mcV$sp(HiveClientImpl.scala:310)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:310)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:310)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:281)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:228)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:227)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:270)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createDatabase(HiveClientImpl.scala:309)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply$mcV$sp(HiveExternalCatalog.scala:120)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:120)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:120)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:87)
> [info]   2016-08-31 09:35:51.968 - stderr>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:119)
> 
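>
> A minimal manual clean-up sketch for re-running the test locally, assuming the 
> failure comes from state left behind by the first run; the directory names are 
> the local defaults, and the proper fix is for the test itself to clean up:
> {code}
> import os
> import shutil
>
> # Hypothetical manual cleanup between runs: remove the leftover warehouse
> # directory (and, if present, the embedded Derby metastore) before re-running.
> for leftover in ("spark-warehouse", "metastore_db"):
>     if os.path.isdir(leftover):
>         shutil.rmtree(leftover)
> {code}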

[jira] [Resolved] (SPARK-16834) TrainValildationSplit and direct evaluation produce different scores

2016-09-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16834.
---
Resolution: Won't Fix

I think the answer is the same as in 
https://issues.apache.org/jira/browse/SPARK-16832 because it's apparently by 
design that these classes are deterministic by default. If you seed the split 
randomly it'll work as you expect.
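
A minimal PySpark sketch of that suggestion, reusing the reporter's dataset and 
paramGrid and drawing one fresh seed per iteration for both the manual split and 
TrainValidationSplit:

{code}
import random
import pyspark.ml.evaluation
import pyspark.ml.regression
import pyspark.ml.tuning

for i in range(20):
    seed = random.randrange(1 << 30)   # new seed each iteration
    train, test = dataset.randomSplit([0.8, 0.2], seed=seed)
    tvs = pyspark.ml.tuning.TrainValidationSplit(
        estimator=pyspark.ml.regression.LinearRegression(),
        estimatorParamMaps=paramGrid,
        evaluator=pyspark.ml.evaluation.RegressionEvaluator(),
        trainRatio=0.5,
        seed=seed)                     # randomize the internal 50/50 split too
    model = tvs.fit(train)
{code}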

> TrainValildationSplit and direct evaluation produce different scores
> 
>
> Key: SPARK-16834
> URL: https://issues.apache.org/jira/browse/SPARK-16834
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Max Moroz
>
> The two segments of code below are supposed to do the same thing: one is 
> using TrainValidationSplit, the other performs the same evaluation manually. 
> However, their results are statistically different (in my case, in a loop of 
> 20, I regularly get ~19 True values). 
> Unfortunately, I didn't find the bug in the source code.
> {code}
> dataset = spark.createDataFrame(
>   [(Vectors.dense([0.0]), 0.0),
>(Vectors.dense([0.4]), 1.0),
>(Vectors.dense([0.5]), 0.0),
>(Vectors.dense([0.6]), 1.0),
>(Vectors.dense([1.0]), 1.0)] * 1000,
>   ["features", "label"]).cache()
> paramGrid = pyspark.ml.tuning.ParamGridBuilder().build()
> # note that test is NEVER used in this code
> # I create it only to utilize randomSplit
> for i in range(20):
>   train, test = dataset.randomSplit([0.8, 0.2])
>   tvs = pyspark.ml.tuning.TrainValidationSplit(
>       estimator=pyspark.ml.regression.LinearRegression(),
>       estimatorParamMaps=paramGrid,
>       evaluator=pyspark.ml.evaluation.RegressionEvaluator(),
>       trainRatio=0.5)
>   model = tvs.fit(train)
>   train, val, test = dataset.randomSplit([0.4, 0.4, 0.2])
>   lr=pyspark.ml.regression.LinearRegression()
>   evaluator=pyspark.ml.evaluation.RegressionEvaluator()
>   lrModel = lr.fit(train)
>   predicted = lrModel.transform(val)
>   print(model.validationMetrics[0] < evaluator.evaluate(predicted))
> {code}






[jira] [Resolved] (SPARK-17439) QuantilesSummaries returns the wrong result after compression

2016-09-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17439.
---
   Resolution: Fixed
 Assignee: Tim Hunter
Fix Version/s: 2.1.0
   2.0.1

Resolved by https://github.com/apache/spark/pull/15002

> QuantilesSummaries returns the wrong result after compression
> -
>
> Key: SPARK-17439
> URL: https://issues.apache.org/jira/browse/SPARK-17439
> Project: Spark
>  Issue Type: Bug
>Reporter: Tim Hunter
>Assignee: Tim Hunter
>  Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
>
> [~clockfly] found the following corner case that returns the wrong quantile 
> (off by 1):
> {code}
> test("test QuantileSummaries compression") {
> var left = new QuantileSummaries(1, 0.0001)
> System.out.println("LEFT  RIGHT")
> System.out.println("")
> (0 to 10).foreach { index =>
>   left = left.insert(index)
>   left = left.compress()
>   var right = new QuantileSummaries(1, 0.0001)
>   (0 to index).foreach(right.insert(_))
>   right = right.compress()
>   System.out.println(s"${left.query(0.5)}   ${right.query(0.5)}")
> }
>   }
> {code}
> The result is:
> {code}
> LEFT  RIGHT
> 
> 0.0   0.0
> 0.0   1.0
> 0.0   1.0
> 0.0   1.0
> 1.0   2.0
> 1.0   2.0
> 2.0   3.0
> 2.0   3.0
> 3.0   4.0
> 3.0   4.0
> 4.0   5.0
> {code}
> The "LEFT" column shows the output when QuantileSummaries is used in a Window 
> function, while the "RIGHT" column shows the expected result. The difference 
> between the two is that the "LEFT" column performs intermediate compression on 
> the storage of QuantileSummaries.






[jira] [Resolved] (SPARK-17306) QuantileSummaries doesn't compress

2016-09-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17306.
---
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 2.1.0
   2.0.1

Resolved by https://github.com/apache/spark/pull/15002 (based on 
https://github.com/apache/spark/pull/14976)

> QuantileSummaries doesn't compress
> --
>
> Key: SPARK-17306
> URL: https://issues.apache.org/jira/browse/SPARK-17306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Sean Owen
> Fix For: 2.0.1, 2.1.0
>
>
> compressThreshold was not referenced anywhere
> {code}
> class QuantileSummaries(
> val compressThreshold: Int,
> val relativeError: Double,
> val sampled: ArrayBuffer[Stats] = ArrayBuffer.empty,
> private[stat] var count: Long = 0L,
> val headSampled: ArrayBuffer[Double] = ArrayBuffer.empty) extends 
> Serializable
> {code}
> And it causes a memory leak: QuantileSummaries takes unbounded memory
> {code}
> val summary = new QuantileSummaries(1, relativeError = 0.001)
> // Results in creating an array of size 1 !!! 
> (1 to 1).foreach(summary.insert(_))
> {code}
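>
> A conceptual Python sketch, deliberately not Spark's implementation, of what 
> the threshold is for: insert() has to consult it, otherwise the buffer of raw 
> samples grows without bound, which is the behaviour reported above.
> {code}
> class BoundedSummary(object):
>     """Toy summary that compresses whenever the raw buffer reaches the threshold."""
>
>     def __init__(self, compress_threshold):
>         self.compress_threshold = compress_threshold
>         self.head = []      # uncompressed samples
>         self.sampled = []   # compressed, bounded summary
>
>     def insert(self, x):
>         self.head.append(x)
>         if len(self.head) >= self.compress_threshold:   # the check that was missing
>             self.compress()
>
>     def compress(self):
>         # Stand-in for the real compression: fold the buffer into a bounded subsample.
>         merged = sorted(self.sampled + self.head)
>         step = max(1, len(merged) // self.compress_threshold)
>         self.sampled = merged[::step]
>         self.head = []
>
> s = BoundedSummary(compress_threshold=1000)
> for i in range(100000):
>     s.insert(i)
> print(len(s.head), len(s.sampled))   # both stay small instead of growing to 100000
> {code}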






[jira] [Commented] (SPARK-16834) TrainValildationSplit and direct evaluation produce different scores

2016-09-11 Thread Max Moroz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15481341#comment-15481341
 ] 

Max Moroz commented on SPARK-16834:
---

[~bryanc] thanks for looking into this. I have no doubt that my code can be 
modified to yield the same result as TrainValidationSplit by ensuring that it 
is (algorithmically) identical. But I thought it should behave nearly 
identically as it stands, without modification.

In each iteration, the two metrics should of course differ. But they should be 
random numbers drawn from two identical distributions, and unfortunately that is 
not the case.

I was thinking there might be a problem with the permutation of the data (but 
there doesn't seem to be), or perhaps that TVS, unlike randomSplit, produces 
precisely sized individual slices (which I doubt).

> TrainValildationSplit and direct evaluation produce different scores
> 
>
> Key: SPARK-16834
> URL: https://issues.apache.org/jira/browse/SPARK-16834
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Max Moroz
>
> The two segments of code below are supposed to do the same thing: one is 
> using TrainValidationSplit, the other performs the same evaluation manually. 
> However, their results are statistically different (in my case, in a loop of 
> 20, I regularly get ~19 True values). 
> Unfortunately, I didn't find the bug in the source code.
> {code}
> dataset = spark.createDataFrame(
>   [(Vectors.dense([0.0]), 0.0),
>(Vectors.dense([0.4]), 1.0),
>(Vectors.dense([0.5]), 0.0),
>(Vectors.dense([0.6]), 1.0),
>(Vectors.dense([1.0]), 1.0)] * 1000,
>   ["features", "label"]).cache()
> paramGrid = pyspark.ml.tuning.ParamGridBuilder().build()
> # note that test is NEVER used in this code
> # I create it only to utilize randomSplit
> for i in range(20):
>   train, test = dataset.randomSplit([0.8, 0.2])
>   tvs = pyspark.ml.tuning.TrainValidationSplit(
>       estimator=pyspark.ml.regression.LinearRegression(),
>       estimatorParamMaps=paramGrid,
>       evaluator=pyspark.ml.evaluation.RegressionEvaluator(),
>       trainRatio=0.5)
>   model = tvs.fit(train)
>   train, val, test = dataset.randomSplit([0.4, 0.4, 0.2])
>   lr=pyspark.ml.regression.LinearRegression()
>   evaluator=pyspark.ml.evaluation.RegressionEvaluator()
>   lrModel = lr.fit(train)
>   predicted = lrModel.transform(val)
>   print(model.validationMetrics[0] < evaluator.evaluate(predicted))
> {code}






[jira] [Commented] (SPARK-17493) Spark Job hangs while DataFrame writing to HDFS path with parquet mode

2016-09-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15481314#comment-15481314
 ] 

Sean Owen commented on SPARK-17493:
---

I don't think this is enough info. What does 'hang' mean here? Is there a 
reproduction? What do thread dumps show is going on?

> Spark Job hangs while DataFrame writing to HDFS path with parquet mode
> --
>
> Key: SPARK-17493
> URL: https://issues.apache.org/jira/browse/SPARK-17493
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: AWS Cluster
>Reporter: Gautam Solanki
>
> While saving an RDD to an HDFS path in Parquet format with 
> rddout.write.partitionBy("event_date").mode(org.apache.spark.sql.SaveMode.Append).parquet("hdfs:tmp//rddout_parquet_full_hdfs1//"),
> the Spark job hung because two write tasks with a Shuffle Read of size 0 could 
> not complete, even though the executors had notified the driver that these two 
> tasks were finished.
>  






[jira] [Issue Comment Deleted] (SPARK-15687) Columnar execution engine

2016-09-11 Thread Kiran Lonikar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kiran Lonikar updated SPARK-15687:
--
Comment: was deleted

(was: I agree. It will then be possible to offer off-heap (sun.misc.Unsafe), 
ByteBuffer or even memory mapped files (mmap) based implementation for taking 
advantage of NVRAM memory based systems.

Finally, it will be possible to use directly GPU RAM based implementation.)

> Columnar execution engine
> -
>
> Key: SPARK-15687
> URL: https://issues.apache.org/jira/browse/SPARK-15687
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> This ticket tracks progress in making the entire engine columnar, especially 
> in the context of nested data type support.
> In Spark 2.0, we have used the internal column batch interface in Parquet 
> reading (via a vectorized Parquet decoder) and low cardinality aggregation. 
> Other parts of the engine are already using whole-stage code generation, 
> which is in many ways more efficient than a columnar execution engine for 
> flat data types.
> The goal here is to figure out a story to work towards making column batch 
> the common data exchange format between operators outside whole-stage code 
> generation, as well as with external systems (e.g. Pandas).
> Some of the important questions to answer are:
> From the architectural perspective: 
> - What is the end state architecture?
> - Should aggregation be columnar?
> - Should sorting be columnar?
> - How do we encode nested data? What are the operations on nested data, and 
> how do we handle these operations in a columnar format?
> - What is the transition plan towards the end state?
> From an external API perspective:
> - Can we expose a more efficient column batch user-defined function API?
> - How do we leverage this to integrate with 3rd party tools?
> - Can we have a spec for a fixed version of the column batch format that can 
> be externalized and use that in data source API v2?






[jira] [Commented] (SPARK-15687) Columnar execution engine

2016-09-11 Thread Kiran Lonikar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15481272#comment-15481272
 ] 

Kiran Lonikar commented on SPARK-15687:
---

I agree. It will then be possible to offer off-heap (sun.misc.Unsafe), 
ByteBuffer, or even memory-mapped file (mmap) based implementations that take 
advantage of NVRAM-based systems.

Finally, it will be possible to use a GPU-RAM-based implementation directly.

> Columnar execution engine
> -
>
> Key: SPARK-15687
> URL: https://issues.apache.org/jira/browse/SPARK-15687
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> This ticket tracks progress in making the entire engine columnar, especially 
> in the context of nested data type support.
> In Spark 2.0, we have used the internal column batch interface in Parquet 
> reading (via a vectorized Parquet decoder) and low cardinality aggregation. 
> Other parts of the engine are already using whole-stage code generation, 
> which is in many ways more efficient than a columnar execution engine for 
> flat data types.
> The goal here is to figure out a story to work towards making column batch 
> the common data exchange format between operators outside whole-stage code 
> generation, as well as with external systems (e.g. Pandas).
> Some of the important questions to answer are:
> From the architectural perspective: 
> - What is the end state architecture?
> - Should aggregation be columnar?
> - Should sorting be columnar?
> - How do we encode nested data? What are the operations on nested data, and 
> how do we handle these operations in a columnar format?
> - What is the transition plan towards the end state?
> From an external API perspective:
> - Can we expose a more efficient column batch user-defined function API?
> - How do we leverage this to integrate with 3rd party tools?
> - Can we have a spec for a fixed version of the column batch format that can 
> be externalized and use that in data source API v2?






[jira] [Comment Edited] (SPARK-15687) Columnar execution engine

2016-09-11 Thread Kiran Lonikar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15481268#comment-15481268
 ] 

Kiran Lonikar edited comment on SPARK-15687 at 9/11/16 7:29 AM:


Evan Chan, I agree. It will then be possible to offer off-heap 
(sun.misc.Unsafe), ByteBuffer, or even memory-mapped file (mmap) based 
implementations that take advantage of NVRAM-based systems.

Finally, it will be possible to use a GPU-RAM-based implementation directly.


was (Author: klonikar):
I agree. It will then be possible to offer off-heap (sun.misc.Unsafe), 
ByteBuffer or even memory mapped files (mmap) based implementation for taking 
advantage of NVRAM memory based systems.

Finally, it will be possible to use directly GPU RAM based implementation.

> Columnar execution engine
> -
>
> Key: SPARK-15687
> URL: https://issues.apache.org/jira/browse/SPARK-15687
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> This ticket tracks progress in making the entire engine columnar, especially 
> in the context of nested data type support.
> In Spark 2.0, we have used the internal column batch interface in Parquet 
> reading (via a vectorized Parquet decoder) and low cardinality aggregation. 
> Other parts of the engine are already using whole-stage code generation, 
> which is in many ways more efficient than a columnar execution engine for 
> flat data types.
> The goal here is to figure out a story to work towards making column batch 
> the common data exchange format between operators outside whole-stage code 
> generation, as well as with external systems (e.g. Pandas).
> Some of the important questions to answer are:
> From the architectural perspective: 
> - What is the end state architecture?
> - Should aggregation be columnar?
> - Should sorting be columnar?
> - How do we encode nested data? What are the operations on nested data, and 
> how do we handle these operations in a columnar format?
> - What is the transition plan towards the end state?
> From an external API perspective:
> - Can we expose a more efficient column batch user-defined function API?
> - How do we leverage this to integrate with 3rd party tools?
> - Can we have a spec for a fixed version of the column batch format that can 
> be externalized and use that in data source API v2?






[jira] [Commented] (SPARK-17496) missing int to float coercion in df.sample() signature

2016-09-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15481270#comment-15481270
 ] 

Sean Owen commented on SPARK-17496:
---

I don't think that's a bug. The fraction is virtually always in (0, 1), so even 
being able to specify an integer 0 or 1 is not typically useful.
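
For what it's worth, the caller-side workaround is just an explicit cast; a 
minimal sketch, assuming an active spark session:

{code}
n = 1  # the fraction arrived as a Python int from elsewhere
df = spark.createDataFrame([[1], [2], [3]])
sampled = df.sample(True, float(n))   # equivalent to passing 1.0 directly
{code}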

> missing int to float coercion in df.sample() signature
> --
>
> Key: SPARK-17496
> URL: https://issues.apache.org/jira/browse/SPARK-17496
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Max Moroz
>Priority: Trivial
>
> {code}
> # works
> spark.createDataFrame([[1], [2], [3]]).sample(True, 1.0)
> # doesn't work
> spark.createDataFrame([[1], [2], [3]]).sample(True, 1)
> {code}






[jira] [Commented] (SPARK-15687) Columnar execution engine

2016-09-11 Thread Kiran Lonikar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15481268#comment-15481268
 ] 

Kiran Lonikar commented on SPARK-15687:
---

I agree. It will then be possible to offer off-heap (sun.misc.Unsafe), 
ByteBuffer, or even memory-mapped file (mmap) based implementations that take 
advantage of NVRAM-based systems.

Finally, it will be possible to use a GPU-RAM-based implementation directly.

> Columnar execution engine
> -
>
> Key: SPARK-15687
> URL: https://issues.apache.org/jira/browse/SPARK-15687
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> This ticket tracks progress in making the entire engine columnar, especially 
> in the context of nested data type support.
> In Spark 2.0, we have used the internal column batch interface in Parquet 
> reading (via a vectorized Parquet decoder) and low cardinality aggregation. 
> Other parts of the engine are already using whole-stage code generation, 
> which is in many ways more efficient than a columnar execution engine for 
> flat data types.
> The goal here is to figure out a story to work towards making column batch 
> the common data exchange format between operators outside whole-stage code 
> generation, as well as with external systems (e.g. Pandas).
> Some of the important questions to answer are:
> From the architectural perspective: 
> - What is the end state architecture?
> - Should aggregation be columnar?
> - Should sorting be columnar?
> - How do we encode nested data? What are the operations on nested data, and 
> how do we handle these operations in a columnar format?
> - What is the transition plan towards the end state?
> From an external API perspective:
> - Can we expose a more efficient column batch user-defined function API?
> - How do we leverage this to integrate with 3rd party tools?
> - Can we have a spec for a fixed version of the column batch format that can 
> be externalized and use that in data source API v2?






[jira] [Resolved] (SPARK-17389) KMeans speedup with better choice of k-means|| init steps = 2

2016-09-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17389.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14956
[https://github.com/apache/spark/pull/14956]

> KMeans speedup with better choice of k-means|| init steps = 2
> -
>
> Key: SPARK-17389
> URL: https://issues.apache.org/jira/browse/SPARK-17389
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.1.0
>
>
> As reported in 
> http://stackoverflow.com/questions/39260820/is-sparks-kmeans-unable-to-handle-bigdata#39260820
>  KMeans can be surprisingly slow, and it's easy to see that most of the time 
> spent is in kmeans|| initialization. For example, in this simple example...
> {code}
> import org.apache.spark.mllib.random.RandomRDDs
> import org.apache.spark.mllib.clustering.KMeans
> val data = RandomRDDs.uniformVectorRDD(sc, 100, 64, 
> sc.defaultParallelism).cache()
> data.count()
> new KMeans().setK(1000).setMaxIterations(5).run(data)
> {code}
> Init takes 5:54, and iterations take about 0:15 each, on my laptop. Init 
> takes about as long as 24 iterations, which is a typical run, meaning half 
> the time is just in picking cluster centers. This seems excessive.
> There are two ways to speed this up significantly. First, the implementation 
> has an old "runs" parameter that is always 1 now. It used to allow multiple 
> clusterings to be computed at once. The code can be simplified significantly 
> now that runs=1 always. This is already covered by SPARK-11560, but just a 
> simple refactoring results in about a 13% init speedup, from 5:54 to 5:09 in 
> this example. That's not what this change is about though.
> By default, k-means|| makes 5 passes over the data. The original paper at 
> http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf actually shows 
> that 2 is plenty, certainly when l=2k as is the case in our implementation. 
> (See Figure 5.2/5.3; I believe the default of 5 was taken from Table 6 but 
> it's not suggesting 5 is an optimal value.) Defaulting to 2 brings it down to 
> 1:41 -- much improved over 5:54.
> Lastly, small thing, but the code will perform a local k-means++ step to 
> reduce the number of centers to k even if there are already only <= k 
> centers. This can be short-circuited. However this is really the topic of 
> SPARK-3261 because this can cause fewer than k clusters to be returned where 
> that would actually be correct, too.
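>
> For reference, a PySpark sketch of the same experiment with the k-means|| 
> passes capped at 2 explicitly; it assumes an active SparkContext sc, and the 
> row count is an arbitrary stand-in for a large dataset:
> {code}
> from pyspark.mllib.clustering import KMeans
> from pyspark.mllib.random import RandomRDDs
>
> data = RandomRDDs.uniformVectorRDD(sc, 1000000, 64, sc.defaultParallelism).cache()
> data.count()
>
> # initializationSteps=2 caps k-means|| at two passes over the data, which is
> # where the reported reduction in init time comes from.
> model = KMeans.train(data, 1000, maxIterations=5, initializationSteps=2)
> {code}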


