[jira] [Commented] (SPARK-17501) Re-register BlockManager again and again
[ https://issues.apache.org/jira/browse/SPARK-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15483144#comment-15483144 ]

Jagadeesan A S commented on SPARK-17501:
----------------------------------------

Does this fail consistently? What is configured?

> Re-register BlockManager again and again
> ----------------------------------------
>
>                 Key: SPARK-17501
>                 URL: https://issues.apache.org/jira/browse/SPARK-17501
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.2
>            Reporter: cen yuhai
>
> After re-registering many times, the executor exits with a timeout exception:
> {code}
> 16/09/11 04:02:42 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:02:42 INFO storage.BlockManager: BlockManager re-registering with master
> 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Trying to register BlockManager
> 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:02:42 INFO storage.BlockManager: Reporting 0 blocks to the master.
> [the same five lines repeat every 10 seconds, at 04:02:52, 04:03:02, 04:03:12, 04:03:22, 04:03:32 and 04:03:42]
> 16/09/11 04:03:45 ERROR executor.CoarseGrainedExecutorBackend: Cannot register with driver: spark://coarsegrainedschedu...@bigdata-arch-jms05.xg01.diditaxi.com:22168
> org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
> 	at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> 	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> 	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> 	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> 	at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
> 	at scala.util.Try$.apply(Try.scala:161)
> 	at scala.util.Failure.recover(Try.scala:185)
> 	at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
> 	at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
> 	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> 	at org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
> 	at [trace truncated]
> {code}
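The 120-second limit in the trace comes from {{spark.rpc.askTimeout}}, which falls back to {{spark.network.timeout}} when unset. As a hedged workaround sketch only (it raises the deadline, it does not address the underlying re-registration loop), the timeouts can be increased in spark-defaults.conf; the values below are illustrative, not recommendations:

```
# spark-defaults.conf -- illustrative values only
spark.network.timeout   600s
spark.rpc.askTimeout    600s
```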
[jira] [Updated] (SPARK-17501) Re-register BlockManager again and again
[ https://issues.apache.org/jira/browse/SPARK-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

cen yuhai updated SPARK-17501:
------------------------------
    Description: After re-registering many times, the executor exits with a timeout exception, followed by the same executor log and RpcTimeoutException stack trace quoted in the comment above.
[jira] [Updated] (SPARK-17501) Re-register BlockManager again and again
[ https://issues.apache.org/jira/browse/SPARK-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

cen yuhai updated SPARK-17501:
------------------------------
    Description: the executor log quoted in the comment above (repeated BlockManager re-registration every 10 seconds, ending in an RpcTimeoutException after 120 seconds).
[jira] [Created] (SPARK-17501) Re-register BlockManager again and again
cen yuhai created SPARK-17501:
---------------------------------

             Summary: Re-register BlockManager again and again
                 Key: SPARK-17501
                 URL: https://issues.apache.org/jira/browse/SPARK-17501
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.6.2
            Reporter: cen yuhai

The report is the executor log quoted in the comment above: the executor is told to re-register on every heartbeat, re-registers successfully every 10 seconds, and finally CoarseGrainedExecutorBackend fails to register with the driver with an RpcTimeoutException after 120 seconds.
[jira] [Resolved] (SPARK-17486) Remove unused TaskMetricsUIData.updatedBlockStatuses field
[ https://issues.apache.org/jira/browse/SPARK-17486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu resolved SPARK-17486.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0
                   2.0.1

> Remove unused TaskMetricsUIData.updatedBlockStatuses field
> ----------------------------------------------------------
>
>                 Key: SPARK-17486
>                 URL: https://issues.apache.org/jira/browse/SPARK-17486
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>             Fix For: 2.0.1, 2.1.0
>
> The {{TaskMetricsUIData.updatedBlockStatuses}} field is assigned to but never
> read, increasing the memory consumption of the web UI. We should remove this
> field.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12008) Spark hive security authorization doesn't work as Apache hive's
[ https://issues.apache.org/jira/browse/SPARK-12008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15482828#comment-15482828 ]

pin_zhang commented on SPARK-12008:
-----------------------------------

Does Spark SQL have any plan to support authorization in the near future?

> Spark hive security authorization doesn't work as Apache hive's
> ---------------------------------------------------------------
>
>                 Key: SPARK-12008
>                 URL: https://issues.apache.org/jira/browse/SPARK-12008
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: pin_zhang
>
> Spark Hive security authorization isn't consistent with Apache Hive's, given the same hive-site.xml:
> {code}
> <property>
>   <name>hive.security.authorization.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.security.authorization.manager</name>
>   <value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory</value>
> </property>
> <property>
>   <name>hive.security.authenticator.manager</name>
>   <value>org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator</value>
> </property>
> <property>
>   <name>hive.server2.enable.doAs</name>
>   <value>true</value>
> </property>
> {code}
> 1. Run spark start-thriftserver.sh; running SQL then fails with:
>    SQL standards based authorization should not be enabled from hive cli. Instead the use of storage based authorization in hive metastore is reccomended. Set hive.security.authorization.enabled=false to disable authz within cli
> 2. Instead, start start-thriftserver.sh with the Hive configurations:
> {code}
> ./start-thriftserver.sh \
>   --conf hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory \
>   --conf hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator
> {code}
> 3. Connect with Beeline as userA and create table tableA.
> 4. Connect with Beeline as userB and truncate tableA:
>    A) In Apache Hive, the truncate gets an exception:
>       Error while compiling statement: FAILED: HiveAccessControlException Permission denied: Principal [name=userB, type=USER] does not have following privileges for operation TRUNCATETABLE [[OBJECT OWNERSHIP] on Object [type=TABLE_OR_VIEW, name=default.tablea]] (state=42000,code=4)
>    B) In Spark Hive, any user that can connect can truncate, as long as the Spark user has privileges.
[jira] [Commented] (SPARK-11374) skip.header.line.count is ignored in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-11374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15482812#comment-15482812 ]

Dongjoon Hyun commented on SPARK-11374:
---------------------------------------

Which version of Spark are you using now? For **one**-line header removal, the `spark-csv` package has a workaround for Spark 1.6.x and below. In addition, Spark 2.0 supports that functionality natively.
https://github.com/databricks/spark-csv
If you want this as a SQL table option, we don't have a workaround.

> skip.header.line.count is ignored in HiveContext
> ------------------------------------------------
>
>                 Key: SPARK-11374
>                 URL: https://issues.apache.org/jira/browse/SPARK-11374
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>            Reporter: Daniel Haviv
>
> A csv table in Hive is configured to skip the header row using
> TBLPROPERTIES("skip.header.line.count"="1").
> When querying from Hive the header row is not included in the data, but when
> running the same query via HiveContext I get the header row.
> "show create table" via the HiveContext confirms that it is aware of the
> setting.
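The workaround mentioned above amounts to treating the first line of the file as a header and dropping it before parsing. A minimal standalone sketch of that idea in plain Python (the helper name is hypothetical, not a Spark or spark-csv API):

```python
import csv
import io

def read_skipping_header(text, skip=1):
    """Parse CSV text, dropping the first `skip` lines,
    mirroring what skip.header.line.count=1 asks Hive to do."""
    lines = text.splitlines()[skip:]
    return list(csv.reader(io.StringIO("\n".join(lines))))

sample = "name,age\nalice,30\nbob,25\n"
rows = read_skipping_header(sample)
# rows -> [["alice", "30"], ["bob", "25"]]; the header line is gone
```

In spark-csv / Spark 2.0 the equivalent is the reader's header option, which additionally uses the dropped line for column names rather than discarding it.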
[jira] [Commented] (SPARK-17454) Add option to specify Mesos resource offer constraints
[ https://issues.apache.org/jira/browse/SPARK-17454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15482682#comment-15482682 ]

Michael Gummelt commented on SPARK-17454:
-----------------------------------------

As of Spark 2.0, Mesos mode supports spark.executor.cores. And the scheduler doesn't reserve any disk; it just writes to the local workspace. Do you have a need for disk reservation?

> Add option to specify Mesos resource offer constraints
> ------------------------------------------------------
>
>                 Key: SPARK-17454
>                 URL: https://issues.apache.org/jira/browse/SPARK-17454
>             Project: Spark
>          Issue Type: Improvement
>          Components: Mesos
>    Affects Versions: 2.0.0
>            Reporter: Chris Bannister
>
> Currently the driver will accept offers from Mesos which have enough RAM for
> the executor, until its max cores is reached. There is no way to control the
> required CPUs or disk for each executor; it would be very useful to be able
> to apply something similar to spark.mesos.constraints to resource offers
> instead of attributes on the offer.
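For context, the existing spark.mesos.constraints option the reporter refers to filters offers by agent *attributes*; a sketch of its current usage (values illustrative):

```
# spark-defaults.conf -- existing attribute-based offer filtering
spark.mesos.constraints   os:centos7;zone:us-east-1
# The request in this issue is an analogous knob for offer *resources*
# (cpus, disk); no such option exists as of this discussion.
```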
[jira] [Updated] (SPARK-17500) The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is not right
[ https://issues.apache.org/jira/browse/SPARK-17500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DjvuLee updated SPARK-17500:
----------------------------
    Description: The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is increased by the file size on each spill, but we only need the file size at the last time.  (was: The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is increment by file size in each spill, but we only need the file size at the last time.)

> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is not right
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-17500
>                 URL: https://issues.apache.org/jira/browse/SPARK-17500
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>            Reporter: DjvuLee
>
> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is increased
> by the file size on each spill, but we only need the file size at the last time.
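The accounting error described here can be illustrated with a small standalone sketch (hypothetical names, not the actual PySpark shuffle code): if the same file grows across spills, adding its *cumulative* size on every spill double-counts earlier bytes, whereas adding only the growth since the previous spill (equivalently, taking the final size once) gives the true total:

```python
def metric_wrong(sizes_after_each_spill):
    """Buggy accounting: add the file's cumulative size on every spill."""
    spilled = 0
    for total in sizes_after_each_spill:
        spilled += total          # earlier bytes are counted again
    return spilled

def metric_right(sizes_after_each_spill):
    """Correct accounting: add only the delta since the previous spill."""
    spilled, prev = 0, 0
    for total in sizes_after_each_spill:
        spilled += total - prev   # new bytes only
        prev = total
    return spilled

# A spill file that grows to 10, 30, then 60 bytes across three spills:
sizes = [10, 30, 60]
# metric_wrong(sizes) -> 100 (overcounted); metric_right(sizes) -> 60
```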
[jira] [Updated] (SPARK-17500) The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is not right
[ https://issues.apache.org/jira/browse/SPARK-17500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DjvuLee updated SPARK-17500:
----------------------------
    Summary: The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is not right  (was: The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is wrong)

> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is not right
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-17500
>                 URL: https://issues.apache.org/jira/browse/SPARK-17500
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>            Reporter: DjvuLee
>
> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is incremented
> by the file size on each spill, but we only need the file size at the last time.
[jira] [Assigned] (SPARK-17500) The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is wrong
[ https://issues.apache.org/jira/browse/SPARK-17500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17500:
------------------------------------
    Assignee: Apache Spark

> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is wrong
> -------------------------------------------------------------------------
>
>                 Key: SPARK-17500
>                 URL: https://issues.apache.org/jira/browse/SPARK-17500
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>            Reporter: DjvuLee
>            Assignee: Apache Spark
>
> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is incremented
> by the file size on each spill, but we only need the file size at the last time.
[jira] [Assigned] (SPARK-17500) The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is wrong
[ https://issues.apache.org/jira/browse/SPARK-17500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17500:
------------------------------------
    Assignee: (was: Apache Spark)

> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is wrong
> -------------------------------------------------------------------------
>
>                 Key: SPARK-17500
>                 URL: https://issues.apache.org/jira/browse/SPARK-17500
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>            Reporter: DjvuLee
>
> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is incremented
> by the file size on each spill, but we only need the file size at the last time.
[jira] [Commented] (SPARK-17500) The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is wrong
[ https://issues.apache.org/jira/browse/SPARK-17500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15482303#comment-15482303 ]

Apache Spark commented on SPARK-17500:
--------------------------------------

User 'djvulee' has created a pull request for this issue:
https://github.com/apache/spark/pull/15052

> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is wrong
> -------------------------------------------------------------------------
>
>                 Key: SPARK-17500
>                 URL: https://issues.apache.org/jira/browse/SPARK-17500
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>            Reporter: DjvuLee
>
> The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is incremented
> by the file size on each spill, but we only need the file size at the last time.
[jira] [Created] (SPARK-17500) The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is wrong
DjvuLee created SPARK-17500:
-------------------------------

             Summary: The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is wrong
                 Key: SPARK-17500
                 URL: https://issues.apache.org/jira/browse/SPARK-17500
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.0.0, 1.6.2, 1.6.1, 1.6.0
            Reporter: DjvuLee

The DiskBytesSpilled metric in ExternalMerger && ExternalGroupBy is incremented by the file size on each spill, but we only need the file size at the last time.
[jira] [Assigned] (SPARK-17499) make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-17499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17499:
------------------------------------
    Assignee: Apache Spark

> make the default params in sparkR spark.mlp consistent with
> MultilayerPerceptronClassifier
> -----------------------------------------------------------
>
>                 Key: SPARK-17499
>                 URL: https://issues.apache.org/jira/browse/SPARK-17499
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Weichen Xu
>            Assignee: Apache Spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Several default params in SparkR spark.mlp are wrong:
> layers should be null
> tol should be 1e-6
> stepSize should be 0.03
> seed should be -763139545
[jira] [Commented] (SPARK-17499) make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-17499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15481973#comment-15481973 ]

Apache Spark commented on SPARK-17499:
--------------------------------------

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/15051

> make the default params in sparkR spark.mlp consistent with
> MultilayerPerceptronClassifier
> -----------------------------------------------------------
>
>                 Key: SPARK-17499
>                 URL: https://issues.apache.org/jira/browse/SPARK-17499
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Several default params in SparkR spark.mlp are wrong:
> layers should be null
> tol should be 1e-6
> stepSize should be 0.03
> seed should be -763139545
[jira] [Assigned] (SPARK-17499) make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier
[ https://issues.apache.org/jira/browse/SPARK-17499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17499: Assignee: (was: Apache Spark) > make the default params in sparkR spark.mlp consistent with > MultilayerPerceptronClassifier > -- > > Key: SPARK-17499 > URL: https://issues.apache.org/jira/browse/SPARK-17499 > Project: Spark > Issue Type: Improvement >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > several default params in sparkR spark.mlp are wrong: > layers should be null > tol should be 1e-6 > stepSize should be 0.03 > seed should be -763139545
[jira] [Created] (SPARK-17499) make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier
Weichen Xu created SPARK-17499: -- Summary: make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier Key: SPARK-17499 URL: https://issues.apache.org/jira/browse/SPARK-17499 Project: Spark Issue Type: Improvement Reporter: Weichen Xu several default params in sparkR spark.mlp are wrong: layers should be null tol should be 1e-6 stepSize should be 0.03 seed should be -763139545
[jira] [Created] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'
Miroslav Balaz created SPARK-17498: -- Summary: StringIndexer.setHandleInvalid should have another option 'new' Key: SPARK-17498 URL: https://issues.apache.org/jira/browse/SPARK-17498 Project: Spark Issue Type: Improvement Components: ML Reporter: Miroslav Balaz That would map an unseen label to the maximum known label + 1; IndexToString would map it back to "" or NA, if Spark has something like that.
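A plain-Python sketch of the proposed mapping (hypothetical helper, not StringIndexer's actual API; the real StringIndexer assigns indices by label frequency, while insertion order is used here for brevity):

```python
def index_labels(train_labels, labels):
    """Sketch of a hypothetical handleInvalid='new' option: labels unseen
    during fitting all map to (max known label index + 1)."""
    mapping = {}
    for lab in train_labels:
        mapping.setdefault(lab, len(mapping))
    unseen_index = len(mapping)  # one past the largest known index
    return [mapping.get(lab, unseen_index) for lab in labels]
```

An IndexToString-style inverse could then translate `unseen_index` back to a sentinel value such as "" or NA.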
[jira] [Updated] (SPARK-17497) Preserve order when scanning ordered buckets over multiple partitions
[ https://issues.apache.org/jira/browse/SPARK-17497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fridtjof Sander updated SPARK-17497: Description: Non-associative aggregations (like ```collect_list```) require the data to be sorted on the grouping key in order to extract aggregation-groups. Let `table` be a Hive-table, that is partitioned on `p` and bucketed and sorted on `id`. Let `q` be a query, that executes a non-associative aggregation on `table.id` over multiple partitions `p`. Currently, when executing `q`, Spark creates as many RDD-partitions as there are buckets. Each RDD-partition is created in `FileScanRDD`, by fetching the associated buckets in all requested Hive-partitions. Because the buckets are read one-by-one, the resulting RDD-partition is no longer sorted on `id` and has to be explicitly sorted before performing the aggregation. Therefore an execution-pipeline-block is introduced. In this Jira I propose to offer an alternative bucket-fetching strategy to the optimizer, that preserves the internal sorting in a situation described above. One way to achieve this, is to open all buckets over all partitions simultaneously when fetching the data. Since each bucket is internally sorted, we can perform basically a merge-sort on the collection of bucket-iterators, and directly emit a sorted RDD-partition, that can be piped into the next operator. While there should be no question about the theoretical feasibility of this idea, there are some obvious implications i.e. with regards to IO-handling. I would like to investigate the practical feasibility, limits, gains and drawbacks of this optimization in my masters-thesis and, of course, contribute the implementation. Before I start, however, I wanted to kindly ask you, the community, for any thoughts, opinions, corrections or other kinds of feedback, which is much appreciated. 
was: Non-associative aggregations (like `collect_list`) require the data to be sorted on the grouping key in order to extract aggregation-groups. Let `table` be a Hive-table, that is partitioned on `p` and bucketed and sorted on `id`. Let `q` be a query, that executes a non-associative aggregation on `table.id` over multiple partitions `p`. Currently, when executing `q`, Spark creates as many RDD-partitions as there are buckets. Each RDD-partition is created in `FileScanRDD`, by fetching the associated buckets in all requested Hive-partitions. Because the buckets are read one-by-one, the resulting RDD-partition is no longer sorted on `id` and has to be explicitly sorted before performing the aggregation. Therefore an execution-pipeline-block is introduced. In this Jira I propose to offer an alternative bucket-fetching strategy to the optimizer, that preserves the internal sorting in a situation described above. One way to achieve this, is to open all buckets over all partitions simultaneously when fetching the data. Since each bucket is internally sorted, we can perform basically a merge-sort on the collection of bucket-iterators, and directly emit a sorted RDD-partition, that can be piped into the next operator. While there should be no question about the theoretical feasibility of this idea, there are some obvious implications i.e. with regards to IO-handling. I would like to investigate the practical feasibility, limits, gains and drawbacks of this optimization in my masters-thesis and, of course, contribute the implementation. Before I start, however, I wanted to kindly ask you, the community, for any thoughts, opinions, corrections or other kinds of feedback, which is much appreciated. 
> Preserve order when scanning ordered buckets over multiple partitions > - > > Key: SPARK-17497 > URL: https://issues.apache.org/jira/browse/SPARK-17497 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Fridtjof Sander >Priority: Minor > > Non-associative aggregations (like ```collect_list```) require the data to be > sorted on the grouping key in order to extract aggregation-groups. > Let `table` be a Hive-table, that is partitioned on `p` and bucketed and > sorted on `id`. Let `q` be a query, that executes a non-associative > aggregation on `table.id` over multiple partitions `p`. > Currently, when executing `q`, Spark creates as many RDD-partitions as there > are buckets. Each RDD-partition is created in `FileScanRDD`, by fetching the > associated buckets in all requested Hive-partitions. Because the buckets are > read one-by-one, the resulting RDD-partition is no longer sorted on `id` and > has to be explicitly sorted before performing the aggregation. Therefore an > execution-pipeline-block is introduced. > In this Jira I propose to offer an alternative bucket-fetching strategy
[jira] [Updated] (SPARK-17497) Preserve order when scanning ordered buckets over multiple partitions
[ https://issues.apache.org/jira/browse/SPARK-17497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fridtjof Sander updated SPARK-17497: Description: Non-associative aggregations (like `collect_list`) require the data to be sorted on the grouping key in order to extract aggregation-groups. Let `table` be a Hive-table, that is partitioned on `p` and bucketed and sorted on `id`. Let `q` be a query, that executes a non-associative aggregation on `table.id` over multiple partitions `p`. Currently, when executing `q`, Spark creates as many RDD-partitions as there are buckets. Each RDD-partition is created in `FileScanRDD`, by fetching the associated buckets in all requested Hive-partitions. Because the buckets are read one-by-one, the resulting RDD-partition is no longer sorted on `id` and has to be explicitly sorted before performing the aggregation. Therefore an execution-pipeline-block is introduced. In this Jira I propose to offer an alternative bucket-fetching strategy to the optimizer, that preserves the internal sorting in a situation described above. One way to achieve this, is to open all buckets over all partitions simultaneously when fetching the data. Since each bucket is internally sorted, we can perform basically a merge-sort on the collection of bucket-iterators, and directly emit a sorted RDD-partition, that can be piped into the next operator. While there should be no question about the theoretical feasibility of this idea, there are some obvious implications i.e. with regards to IO-handling. I would like to investigate the practical feasibility, limits, gains and drawbacks of this optimization in my masters-thesis and, of course, contribute the implementation. Before I start, however, I wanted to kindly ask you, the community, for any thoughts, opinions, corrections or other kinds of feedback, which is much appreciated. 
was: Non-associative aggregations (like ```collect_list```) require the data to be sorted on the grouping key in order to extract aggregation-groups. Let `table` be a Hive-table, that is partitioned on `p` and bucketed and sorted on `id`. Let `q` be a query, that executes a non-associative aggregation on `table.id` over multiple partitions `p`. Currently, when executing `q`, Spark creates as many RDD-partitions as there are buckets. Each RDD-partition is created in `FileScanRDD`, by fetching the associated buckets in all requested Hive-partitions. Because the buckets are read one-by-one, the resulting RDD-partition is no longer sorted on `id` and has to be explicitly sorted before performing the aggregation. Therefore an execution-pipeline-block is introduced. In this Jira I propose to offer an alternative bucket-fetching strategy to the optimizer, that preserves the internal sorting in a situation described above. One way to achieve this, is to open all buckets over all partitions simultaneously when fetching the data. Since each bucket is internally sorted, we can perform basically a merge-sort on the collection of bucket-iterators, and directly emit a sorted RDD-partition, that can be piped into the next operator. While there should be no question about the theoretical feasibility of this idea, there are some obvious implications i.e. with regards to IO-handling. I would like to investigate the practical feasibility, limits, gains and drawbacks of this optimization in my masters-thesis and, of course, contribute the implementation. Before I start, however, I wanted to kindly ask you, the community, for any thoughts, opinions, corrections or other kinds of feedback, which is much appreciated. 
> Preserve order when scanning ordered buckets over multiple partitions > - > > Key: SPARK-17497 > URL: https://issues.apache.org/jira/browse/SPARK-17497 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Fridtjof Sander >Priority: Minor > > Non-associative aggregations (like `collect_list`) require the data to be > sorted on the grouping key in order to extract aggregation-groups. > Let `table` be a Hive-table, that is partitioned on `p` and bucketed and > sorted on `id`. Let `q` be a query, that executes a non-associative > aggregation on `table.id` over multiple partitions `p`. > Currently, when executing `q`, Spark creates as many RDD-partitions as there > are buckets. Each RDD-partition is created in `FileScanRDD`, by fetching the > associated buckets in all requested Hive-partitions. Because the buckets are > read one-by-one, the resulting RDD-partition is no longer sorted on `id` and > has to be explicitly sorted before performing the aggregation. Therefore an > execution-pipeline-block is introduced. > In this Jira I propose to offer an alternative bucket-fetching strategy to
[jira] [Created] (SPARK-17497) Preserve order when scanning ordered buckets over multiple partitions
Fridtjof Sander created SPARK-17497: --- Summary: Preserve order when scanning ordered buckets over multiple partitions Key: SPARK-17497 URL: https://issues.apache.org/jira/browse/SPARK-17497 Project: Spark Issue Type: Improvement Components: SQL Reporter: Fridtjof Sander Priority: Minor Non-associative aggregations (like `collect_list`) require the data to be sorted on the grouping key in order to extract aggregation-groups. Let `table` be a Hive-table, that is partitioned on `p` and bucketed and sorted on `id`. Let `q` be a query, that executes a non-associative aggregation on `table.id` over multiple partitions `p`. Currently, when executing `q`, Spark creates as many RDD-partitions as there are buckets. Each RDD-partition is created in `FileScanRDD`, by fetching the associated buckets in all requested Hive-partitions. Because the buckets are read one-by-one, the resulting RDD-partition is no longer sorted on `id` and has to be explicitly sorted before performing the aggregation. Therefore an execution-pipeline-block is introduced. In this Jira I propose to offer an alternative bucket-fetching strategy to the optimizer, that preserves the internal sorting in a situation described above. One way to achieve this, is to open all buckets over all partitions simultaneously when fetching the data. Since each bucket is internally sorted, we can perform basically a merge-sort on the collection of bucket-iterators, and directly emit a sorted RDD-partition, that can be piped into the next operator. While there should be no question about the theoretical feasibility of this idea, there are some obvious implications i.e. with regards to IO-handling. I would like to investigate the practical feasibility, limits, gains and drawbacks of this optimization in my masters-thesis and, of course, contribute the implementation. 
Before I start, however, I wanted to kindly ask you, the community, for any thoughts, opinions, corrections or other kinds of feedback, which is much appreciated.
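The merge-sort over bucket iterators proposed above is essentially a k-way merge of pre-sorted streams. A minimal sketch with Python's standard library (toy data, not Spark's FileScanRDD):

```python
import heapq

def merge_sorted_buckets(buckets, key=None):
    # Each bucket iterator is already sorted on `key`; heapq.merge performs
    # the proposed merge-sort over the collection of bucket iterators,
    # yielding a globally sorted stream with no separate blocking sort step.
    return heapq.merge(*buckets, key=key)

# Toy (id, value) buckets fetched from three Hive-partitions, each sorted on id
bucket_p1 = iter([(1, "a"), (4, "d"), (7, "g")])
bucket_p2 = iter([(2, "b"), (5, "e")])
bucket_p3 = iter([(3, "c"), (6, "f")])
rows = list(merge_sorted_buckets([bucket_p1, bucket_p2, bucket_p3],
                                 key=lambda r: r[0]))  # sorted by id
```

Because `heapq.merge` is lazy, the sorted rows can be piped into the next operator as they are produced, which is the pipelining benefit the proposal is after.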
[jira] [Commented] (SPARK-17389) KMeans speedup with better choice of k-means|| init steps = 2
[ https://issues.apache.org/jira/browse/SPARK-17389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15481490#comment-15481490 ] Apache Spark commented on SPARK-17389: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/15050 > KMeans speedup with better choice of k-means|| init steps = 2 > - > > Key: SPARK-17389 > URL: https://issues.apache.org/jira/browse/SPARK-17389 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 2.1.0 > > > As reported in > http://stackoverflow.com/questions/39260820/is-sparks-kmeans-unable-to-handle-bigdata#39260820 > KMeans can be surprisingly slow, and it's easy to see that most of the time > spent is in kmeans|| initialization. For example, in this simple example... > {code} > import org.apache.spark.mllib.random.RandomRDDs > import org.apache.spark.mllib.clustering.KMeans > val data = RandomRDDs.uniformVectorRDD(sc, 100, 64, > sc.defaultParallelism).cache() > data.count() > new KMeans().setK(1000).setMaxIterations(5).run(data) > {code} > Init takes 5:54, and iterations take about 0:15 each, on my laptop. Init > takes about as long as 24 iterations, which is a typical run, meaning half > the time is just in picking cluster centers. This seems excessive. > There are two ways to speed this up significantly. First, the implementation > has an old "runs" parameter that is always 1 now. It used to allow multiple > clusterings to be computed at once. The code can be simplified significantly > now that runs=1 always. This is already covered by SPARK-11560, but just a > simple refactoring results in about a 13% init speedup, from 5:54 to 5:09 in > this example. That's not what this change is about though. > By default, k-means|| makes 5 passes over the data. 
The original paper at > http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf actually shows > that 2 is plenty, certainly when l=2k as is the case in our implementation. > (See Figure 5.2/5.3; I believe the default of 5 was taken from Table 6 but > it's not suggesting 5 is an optimal value.) Defaulting to 2 brings it down to > 1:41 -- much improved over 5:54. > Lastly, small thing, but the code will perform a local k-means++ step to > reduce the number of centers to k even if there are already only <= k > centers. This can be short-circuited. However this is really the topic of > SPARK-3261 because this can cause fewer than k clusters to be returned where > that would actually be correct, too.
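The initialization scheme under discussion can be sketched in plain Python (a 1-D toy, not Spark's implementation; `steps` corresponds to the number of k-means|| passes and the oversampling factor l = 2k follows the comment above):

```python
import random

def kmeans_parallel_candidates(points, k, steps=2, seed=7):
    """Toy 1-D sketch of k-means|| oversampling (not Spark's code).
    Each pass samples points with probability ~ l * d^2 / cost,
    using oversampling factor l = 2k."""
    rnd = random.Random(seed)
    l = 2 * k
    centers = [rnd.choice(points)]
    for _ in range(steps):
        # distances are taken against the centers chosen in earlier passes
        dists = [min((p - c) ** 2 for c in centers) for p in points]
        cost = sum(dists)
        if cost == 0:
            break
        centers.extend(p for p, d in zip(points, dists)
                       if rnd.random() < l * d / cost)
    # a local k-means++ step would then reduce these candidates to k centers
    return centers
```

With l = 2k, each pass adds roughly 2k candidates in expectation, which is why two passes already give enough material for the final local k-means++ reduction.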
[jira] [Resolved] (SPARK-17336) Repeated calls to sbin/spark-config.sh cause duplicate ${PYTHONPATH} values
[ https://issues.apache.org/jira/browse/SPARK-17336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17336. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 15028 [https://github.com/apache/spark/pull/15028] > Repeated calls to sbin/spark-config.sh cause duplicate ${PYTHONPATH} values > - > > Key: SPARK-17336 > URL: https://issues.apache.org/jira/browse/SPARK-17336 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: anxu > Fix For: 2.0.1, 2.1.0 > > > When Spark starts via sbin/start-all.sh, sbin/spark-config.sh is called repeatedly. In the sbin/spark-config.sh code: > {code:title=sbin/spark-config.sh|borderStyle=solid} > # Add the PySpark classes to the PYTHONPATH: > export PYTHONPATH="${SPARK_HOME}/python:${PYTHONPATH}" > export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.3-src.zip:${PYTHONPATH}" > {code} > {color:red}PYTHONPATH{color} accumulates duplicate values. > example: > {code:borderStyle=solid} > axu4iMac:spark-2.0.0-hadoop2.4 axu$ sbin/start-all.sh | grep PYTHONPATH > axu.print [Log] [6,16,31] [sbin/spark-config.sh] Defining PYTHONPATH > axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(1): > [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:] > axu.print [Log] [7,17,32] [sbin/spark-config.sh] Defining PYTHONPATH again > axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(2): > [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:] > axu.print [Log] [6,16,31] [sbin/spark-config.sh] Defining PYTHONPATH > axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(1): > [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:] > axu.print [Log] [7,17,32] [sbin/spark-config.sh] Defining PYTHONPATH again > axu.print [sbin/spark-config.sh] 
[Define Global] PYTHONPATH(2): > [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:] > axu.print [Log] [6,16,31] [sbin/spark-config.sh] Defining PYTHONPATH > axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(1): > [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:] > axu.print [Log] [7,17,32] [sbin/spark-config.sh] Defining PYTHONPATH again > axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(2): > [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:] > {code}
[jira] [Updated] (SPARK-17336) Repeated calls to sbin/spark-config.sh cause duplicate ${PYTHONPATH} values
[ https://issues.apache.org/jira/browse/SPARK-17336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17336: -- Assignee: Bryan Cutler Priority: Minor (was: Major) > Repeated calls to sbin/spark-config.sh cause duplicate ${PYTHONPATH} values > - > > Key: SPARK-17336 > URL: https://issues.apache.org/jira/browse/SPARK-17336 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: anxu >Assignee: Bryan Cutler >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > When Spark starts via sbin/start-all.sh, sbin/spark-config.sh is called repeatedly. In the sbin/spark-config.sh code: > {code:title=sbin/spark-config.sh|borderStyle=solid} > # Add the PySpark classes to the PYTHONPATH: > export PYTHONPATH="${SPARK_HOME}/python:${PYTHONPATH}" > export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.3-src.zip:${PYTHONPATH}" > {code} > {color:red}PYTHONPATH{color} accumulates duplicate values. > example: > {code:borderStyle=solid} > axu4iMac:spark-2.0.0-hadoop2.4 axu$ sbin/start-all.sh | grep PYTHONPATH > axu.print [Log] [6,16,31] [sbin/spark-config.sh] Defining PYTHONPATH > axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(1): > [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:] > axu.print [Log] [7,17,32] [sbin/spark-config.sh] Defining PYTHONPATH again > axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(2): > [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:] > axu.print [Log] [6,16,31] [sbin/spark-config.sh] Defining PYTHONPATH > axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(1): > [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:] > axu.print [Log] [7,17,32] [sbin/spark-config.sh] Defining PYTHONPATH again > axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(2): > 
[/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:] > axu.print [Log] [6,16,31] [sbin/spark-config.sh] Defining PYTHONPATH > axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(1): > [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:] > axu.print [Log] [7,17,32] [sbin/spark-config.sh] Defining PYTHONPATH again > axu.print [sbin/spark-config.sh] [Define Global] PYTHONPATH(2): > [/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python/lib/py4j-0.10.1-src.zip:/Users/axu/code/axuProject/spark-2.0.0-hadoop2.4/python:] > {code}
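The duplication arises because each re-sourcing of spark-config.sh prepends the same two entries again. A minimal Python sketch of the deduplication idea (an illustrative helper, not the actual spark-config.sh fix, which instead guards the exports so they only run once):

```python
def dedupe_path(path, sep=":"):
    # Drop repeated PATH-style entries while keeping first-seen order --
    # the net effect a once-only guard in spark-config.sh achieves.
    seen = []
    for entry in path.split(sep):
        if entry and entry not in seen:
            seen.append(entry)
    return sep.join(seen)
```

Running this over any of the logged PYTHONPATH values above collapses them back to the two intended entries.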
[jira] [Resolved] (SPARK-17330) Clean up spark-warehouse in UT
[ https://issues.apache.org/jira/browse/SPARK-17330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17330. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14894 [https://github.com/apache/spark/pull/14894] > Clean up spark-warehouse in UT > -- > > Key: SPARK-17330 > URL: https://issues.apache.org/jira/browse/SPARK-17330 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: tone >Assignee: tone >Priority: Minor > Fix For: 2.1.0 > > > When running the Spark unit tests on the latest master branch, the test case > (SPARK-8368) passes the first time but always fails when run again. The error log is as below: > [info] 2016-08-31 09:35:51.967 - stderr> 16/08/31 09:35:51 ERROR > RetryingHMSHandler: AlreadyExistsException(message:Database default already > exists) > [info] 2016-08-31 09:35:51.967 - stderr> at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:891) > [info] 2016-08-31 09:35:51.967 - stderr> at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > [info] 2016-08-31 09:35:51.967 - stderr> at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > [info] 2016-08-31 09:35:51.967 - stderr> at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > [info] 2016-08-31 09:35:51.967 - stderr> at > java.lang.reflect.Method.invoke(Method.java:498) > [info] 2016-08-31 09:35:51.967 - stderr> at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107) > [info] 2016-08-31 09:35:51.967 - stderr> at > com.sun.proxy.$Proxy18.create_database(Unknown Source) > [info] 2016-08-31 09:35:51.967 - stderr> at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:644) > [info] 2016-08-31 09:35:51.967 - stderr> at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method) > [info] 2016-08-31 09:35:51.967 - stderr> at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > [info] 2016-08-31 09:35:51.967 - stderr> at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > [info] 2016-08-31 09:35:51.967 - stderr> at > java.lang.reflect.Method.invoke(Method.java:498) > [info] 2016-08-31 09:35:51.967 - stderr> at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156) > [info] 2016-08-31 09:35:51.968 - stderr> at > com.sun.proxy.$Proxy19.createDatabase(Unknown Source) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.hadoop.hive.ql.metadata.Hive.createDatabase(Hive.java:306) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply$mcV$sp(HiveClientImpl.scala:310) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:310) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:310) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:281) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:228) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:227) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:270) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.createDatabase(HiveClientImpl.scala:309) > [info] 2016-08-31 09:35:51.968 - stderr> at > 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply$mcV$sp(HiveExternalCatalog.scala:120) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:120) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:120) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:87) > [info] 2016-08-31 09:35:51.968 - stderr> at >
[jira] [Updated] (SPARK-17330) Clean up spark-warehouse in UT
[ https://issues.apache.org/jira/browse/SPARK-17330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17330: -- Assignee: tone Issue Type: Improvement (was: Bug) > Clean up spark-warehouse in UT > -- > > Key: SPARK-17330 > URL: https://issues.apache.org/jira/browse/SPARK-17330 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: tone >Assignee: tone >Priority: Minor > Fix For: 2.1.0 > > > When running the Spark unit tests on the latest master branch, the test case > (SPARK-8368) passes the first time but always fails when run again. The error log is as below: > [info] 2016-08-31 09:35:51.967 - stderr> 16/08/31 09:35:51 ERROR > RetryingHMSHandler: AlreadyExistsException(message:Database default already > exists) > [info] 2016-08-31 09:35:51.967 - stderr> at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:891) > [info] 2016-08-31 09:35:51.967 - stderr> at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > [info] 2016-08-31 09:35:51.967 - stderr> at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > [info] 2016-08-31 09:35:51.967 - stderr> at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > [info] 2016-08-31 09:35:51.967 - stderr> at > java.lang.reflect.Method.invoke(Method.java:498) > [info] 2016-08-31 09:35:51.967 - stderr> at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107) > [info] 2016-08-31 09:35:51.967 - stderr> at > com.sun.proxy.$Proxy18.create_database(Unknown Source) > [info] 2016-08-31 09:35:51.967 - stderr> at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:644) > [info] 2016-08-31 09:35:51.967 - stderr> at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > [info] 2016-08-31 09:35:51.967 - stderr> at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > [info] 2016-08-31 09:35:51.967 - stderr> at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > [info] 2016-08-31 09:35:51.967 - stderr> at > java.lang.reflect.Method.invoke(Method.java:498) > [info] 2016-08-31 09:35:51.967 - stderr> at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156) > [info] 2016-08-31 09:35:51.968 - stderr> at > com.sun.proxy.$Proxy19.createDatabase(Unknown Source) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.hadoop.hive.ql.metadata.Hive.createDatabase(Hive.java:306) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply$mcV$sp(HiveClientImpl.scala:310) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:310) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:310) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:281) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:228) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:227) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:270) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.client.HiveClientImpl.createDatabase(HiveClientImpl.scala:309) > [info] 2016-08-31 09:35:51.968 - stderr> at > 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply$mcV$sp(HiveExternalCatalog.scala:120) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:120) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:120) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:87) > [info] 2016-08-31 09:35:51.968 - stderr> at > org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:119) >
[jira] [Resolved] (SPARK-16834) TrainValildationSplit and direct evaluation produce different scores
[ https://issues.apache.org/jira/browse/SPARK-16834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16834. --- Resolution: Won't Fix I think the answer is the same as in https://issues.apache.org/jira/browse/SPARK-16832 because it's apparently by design that these classes are deterministic by default. If you seed the split randomly it'll work as you expect. > TrainValildationSplit and direct evaluation produce different scores > > > Key: SPARK-16834 > URL: https://issues.apache.org/jira/browse/SPARK-16834 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Max Moroz > > The two segments of code below are supposed to do the same thing: one is > using TrainValidationSplit, the other performs the same evaluation manually. > However, their results are statistically different (in my case, in a loop of > 20, I regularly get ~19 True values). > Unfortunately, I didn't find the bug in the source code. > {code} > dataset = spark.createDataFrame( > [(Vectors.dense([0.0]), 0.0), >(Vectors.dense([0.4]), 1.0), >(Vectors.dense([0.5]), 0.0), >(Vectors.dense([0.6]), 1.0), >(Vectors.dense([1.0]), 1.0)] * 1000, > ["features", "label"]).cache() > paramGrid = pyspark.ml.tuning.ParamGridBuilder().build() > # note that test is NEVER used in this code > # I create it only to utilize randomSplit > for i in range(20): > train, test = dataset.randomSplit([0.8, 0.2]) > tvs = > pyspark.ml.tuning.TrainValidationSplit(estimator=pyspark.ml.regression.LinearRegression(), > > estimatorParamMaps=paramGrid, > > evaluator=pyspark.ml.evaluation.RegressionEvaluator(), > trainRatio=0.5) > model = tvs.fit(train) > train, val, test = dataset.randomSplit([0.4, 0.4, 0.2]) > lr=pyspark.ml.regression.LinearRegression() > evaluator=pyspark.ml.evaluation.RegressionEvaluator() > lrModel = lr.fit(train) > predicted = lrModel.transform(val) > print(model.validationMetrics[0] < evaluator.evaluate(predicted)) > {code} -- This 
message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
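Sean Owen's resolution above says the splits are deterministic by default, and that seeding the split differently restores the expected run-to-run variation. A toy, non-Spark sketch of that behavior (hypothetical `split` helper, plain Python, not the PySpark API):

```python
import random

def split(data, ratio, seed):
    """Deterministically split `data` by `seed`, loosely mimicking a seeded randomSplit."""
    rng = random.Random(seed)
    left, right = [], []
    for x in data:
        (left if rng.random() < ratio else right).append(x)
    return left, right

data = list(range(1000))

# With one fixed seed, every "random" split is identical, run after run:
a1, _ = split(data, 0.5, seed=42)
a2, _ = split(data, 0.5, seed=42)
assert a1 == a2

# Varying the seed per iteration restores the variation the reporter expected:
b1, _ = split(data, 0.5, seed=1)
b2, _ = split(data, 0.5, seed=2)
assert b1 != b2
```

The same idea applies to `TrainValidationSplit`: pass a different `seed` per iteration rather than relying on the default.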
[jira] [Resolved] (SPARK-17439) QuantilesSummaries returns the wrong result after compression
[ https://issues.apache.org/jira/browse/SPARK-17439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17439. --- Resolution: Fixed Assignee: Tim Hunter Fix Version/s: 2.1.0 2.0.1 Resolved by https://github.com/apache/spark/pull/15002 > QuantilesSummaries returns the wrong result after compression > - > > Key: SPARK-17439 > URL: https://issues.apache.org/jira/browse/SPARK-17439 > Project: Spark > Issue Type: Bug >Reporter: Tim Hunter >Assignee: Tim Hunter > Labels: correctness > Fix For: 2.0.1, 2.1.0 > > > [~clockfly] found the following corner case that returns the wrong quantile > (off by 1): > {code} > test("test QuantileSummaries compression") { > var left = new QuantileSummaries(1, 0.0001) > System.out.println("LEFT RIGHT") > System.out.println("") > (0 to 10).foreach { index => > left = left.insert(index) > left = left.compress() > var right = new QuantileSummaries(1, 0.0001) > (0 to index).foreach(right.insert(_)) > right = right.compress() > System.out.println(s"${left.query(0.5)} ${right.query(0.5)}") > } > } > {code} > The result is: > {code} > LEFT RIGHT > > 0.0 0.0 > 0.0 1.0 > 0.0 1.0 > 0.0 1.0 > 1.0 2.0 > 1.0 2.0 > 2.0 3.0 > 2.0 3.0 > 3.0 4.0 > 3.0 4.0 > 4.0 5.0 > {code} > The "LEFT" column shows the output when using > QuantileSummaries in a Window function; the "RIGHT" column shows > the expected result. The difference between "LEFT" and "RIGHT" > is that the "LEFT" column does intermediate compression on the storage > of QuantileSummaries.
[jira] [Resolved] (SPARK-17306) QuantileSummaries doesn't compress
[ https://issues.apache.org/jira/browse/SPARK-17306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17306. --- Resolution: Fixed Assignee: Sean Owen Fix Version/s: 2.1.0 2.0.1 Resolved by https://github.com/apache/spark/pull/15002 (based on https://github.com/apache/spark/pull/14976) > QuantileSummaries doesn't compress > -- > > Key: SPARK-17306 > URL: https://issues.apache.org/jira/browse/SPARK-17306 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong >Assignee: Sean Owen > Fix For: 2.0.1, 2.1.0 > > > compressThreshold was not referenced anywhere > {code} > class QuantileSummaries( > val compressThreshold: Int, > val relativeError: Double, > val sampled: ArrayBuffer[Stats] = ArrayBuffer.empty, > private[stat] var count: Long = 0L, > val headSampled: ArrayBuffer[Double] = ArrayBuffer.empty) extends > Serializable > {code} > And it causes a memory leak: QuantileSummaries takes unbounded memory > {code} > val summary = new QuantileSummaries(1, relativeError = 0.001) > // Results in creating an array of size 1 !!! > (1 to 1).foreach(summary.insert(_)) > {code}
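The fix amounts to actually consulting `compressThreshold` on insert so the sample buffer stays bounded. A toy Python stand-in for the idea (not the Greenwald-Khanna merge logic Spark actually uses, just the memory-bounding check the bug was missing):

```python
class TinySummary:
    """Toy stand-in for QuantileSummaries: illustrates only the fix's
    memory-bounding idea (compress once the buffer exceeds a threshold)."""
    def __init__(self, compress_threshold):
        self.compress_threshold = compress_threshold
        self.sampled = []

    def insert(self, x):
        self.sampled.append(x)
        # The reported bug: this check was effectively missing, so
        # `sampled` grew without bound.
        if len(self.sampled) > self.compress_threshold:
            self._compress()

    def _compress(self):
        # Crude thinning: keep every other element of the sorted buffer.
        # (Spark keeps carefully chosen samples with rank-error bounds.)
        self.sampled = sorted(self.sampled)[::2]

summary = TinySummary(compress_threshold=100)
for i in range(10000):
    summary.insert(i)
assert len(summary.sampled) <= 101  # memory stays bounded
```

With the threshold check in place, inserting 10,000 values leaves at most ~100 samples in memory instead of all 10,000.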
[jira] [Commented] (SPARK-16834) TrainValildationSplit and direct evaluation produce different scores
[ https://issues.apache.org/jira/browse/SPARK-16834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15481341#comment-15481341 ] Max Moroz commented on SPARK-16834: --- [~bryanc] thanks for looking into this. I have no doubt that my code can be modified to yield the same result as TrainValidationSplit by ensuring that it is (algorithmically) identical. But I thought it should behave nearly identically as it stands, without modification. In each iteration, the two metrics should of course differ. But the metrics should be random numbers drawn from two identical distributions. Unfortunately, it's not the case. I was thinking maybe there's a problem with permutation of the data (but it doesn't seem to be); or perhaps TVS, unlike randomSplit, results in precisely-sized individual slices (I doubt). > TrainValildationSplit and direct evaluation produce different scores > > > Key: SPARK-16834 > URL: https://issues.apache.org/jira/browse/SPARK-16834 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Max Moroz > > The two segments of code below are supposed to do the same thing: one is > using TrainValidationSplit, the other performs the same evaluation manually. > However, their results are statistically different (in my case, in a loop of > 20, I regularly get ~19 True values). > Unfortunately, I didn't find the bug in the source code. 
> {code} > dataset = spark.createDataFrame( > [(Vectors.dense([0.0]), 0.0), >(Vectors.dense([0.4]), 1.0), >(Vectors.dense([0.5]), 0.0), >(Vectors.dense([0.6]), 1.0), >(Vectors.dense([1.0]), 1.0)] * 1000, > ["features", "label"]).cache() > paramGrid = pyspark.ml.tuning.ParamGridBuilder().build() > # note that test is NEVER used in this code > # I create it only to utilize randomSplit > for i in range(20): > train, test = dataset.randomSplit([0.8, 0.2]) > tvs = > pyspark.ml.tuning.TrainValidationSplit(estimator=pyspark.ml.regression.LinearRegression(), > > estimatorParamMaps=paramGrid, > > evaluator=pyspark.ml.evaluation.RegressionEvaluator(), > trainRatio=0.5) > model = tvs.fit(train) > train, val, test = dataset.randomSplit([0.4, 0.4, 0.2]) > lr=pyspark.ml.regression.LinearRegression() > evaluator=pyspark.ml.evaluation.RegressionEvaluator() > lrModel = lr.fit(train) > predicted = lrModel.transform(val) > print(model.validationMetrics[0] < evaluator.evaluate(predicted)) > {code}
[jira] [Commented] (SPARK-17493) Spark Job hangs while DataFrame writing to HDFS path with parquet mode
[ https://issues.apache.org/jira/browse/SPARK-17493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15481314#comment-15481314 ] Sean Owen commented on SPARK-17493: --- I don't think this is enough info. What does 'hang' mean here? Is there a reproduction? What do thread dumps show is going on? > Spark Job hangs while DataFrame writing to HDFS path with parquet mode > -- > > Key: SPARK-17493 > URL: https://issues.apache.org/jira/browse/SPARK-17493 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 > Environment: AWS Cluster >Reporter: Gautam Solanki > > While saving an RDD to an HDFS path in parquet format with the following > rddout.write.partitionBy("event_date").mode(org.apache.spark.sql.SaveMode.Append).parquet("hdfs:tmp//rddout_parquet_full_hdfs1//") > , the Spark job was hanging as the two write tasks with Shuffle Read of size > 0 could not complete. But the executors notified the driver about the > completion of these two tasks. >
[jira] [Issue Comment Deleted] (SPARK-15687) Columnar execution engine
[ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kiran Lonikar updated SPARK-15687: -- Comment: was deleted (was: I agree. It will then be possible to offer off-heap (sun.misc.Unsafe), ByteBuffer or even memory mapped files (mmap) based implementation for taking advantage of NVRAM memory based systems. Finally, it will be possible to use directly GPU RAM based implementation.) > Columnar execution engine > - > > Key: SPARK-15687 > URL: https://issues.apache.org/jira/browse/SPARK-15687 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > This ticket tracks progress in making the entire engine columnar, especially > in the context of nested data type support. > In Spark 2.0, we have used the internal column batch interface in Parquet > reading (via a vectorized Parquet decoder) and low cardinality aggregation. > Other parts of the engine are already using whole-stage code generation, > which is in many ways more efficient than a columnar execution engine for > flat data types. > The goal here is to figure out a story to work towards making column batch > the common data exchange format between operators outside whole-stage code > generation, as well as with external systems (e.g. Pandas). > Some of the important questions to answer are: > From the architectural perspective: > - What is the end state architecture? > - Should aggregation be columnar? > - Should sorting be columnar? > - How do we encode nested data? What are the operations on nested data, and > how do we handle these operations in a columnar format? > - What is the transition plan towards the end state? > From an external API perspective: > - Can we expose a more efficient column batch user-defined function API? > - How do we leverage this to integrate with 3rd party tools? 
> - Can we have a spec for a fixed version of the column batch format that can > be externalized and use that in data source API v2?
[jira] [Commented] (SPARK-15687) Columnar execution engine
[ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15481272#comment-15481272 ] Kiran Lonikar commented on SPARK-15687: --- I agree. It will then be possible to offer off-heap (sun.misc.Unsafe), ByteBuffer or even memory mapped files (mmap) based implementation for taking advantage of NVRAM memory based systems. Finally, it will be possible to use directly GPU RAM based implementation. > Columnar execution engine > - > > Key: SPARK-15687 > URL: https://issues.apache.org/jira/browse/SPARK-15687 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > This ticket tracks progress in making the entire engine columnar, especially > in the context of nested data type support. > In Spark 2.0, we have used the internal column batch interface in Parquet > reading (via a vectorized Parquet decoder) and low cardinality aggregation. > Other parts of the engine are already using whole-stage code generation, > which is in many ways more efficient than a columnar execution engine for > flat data types. > The goal here is to figure out a story to work towards making column batch > the common data exchange format between operators outside whole-stage code > generation, as well as with external systems (e.g. Pandas). > Some of the important questions to answer are: > From the architectural perspective: > - What is the end state architecture? > - Should aggregation be columnar? > - Should sorting be columnar? > - How do we encode nested data? What are the operations on nested data, and > how do we handle these operations in a columnar format? > - What is the transition plan towards the end state? > From an external API perspective: > - Can we expose a more efficient column batch user-defined function API? > - How do we leverage this to integrate with 3rd party tools? 
> - Can we have a spec for a fixed version of the column batch format that can > be externalized and use that in data source API v2?
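The comment above argues that a pluggable column-batch abstraction could be backed by on-heap arrays, `ByteBuffer`s, mmap'ed files, or GPU memory. A toy Python sketch of the core layout idea (hypothetical `ColumnBatch` class, not Spark's internal `ColumnarBatch` API): each column is one contiguous typed buffer, so the same logical batch could be backed by any of those storage media.

```python
from array import array

class ColumnBatch:
    """Toy columnar batch: each column is a contiguous typed buffer,
    exposable as raw bytes to external systems (e.g. Pandas, GPU transfer)."""
    def __init__(self, num_rows):
        self.num_rows = num_rows
        self.ids = array('q', [0] * num_rows)       # int64 column
        self.scores = array('d', [0.0] * num_rows)  # float64 column

    def column_view(self, name):
        # Zero-copy view over the column's raw bytes; an off-heap or mmap
        # backend would hand out an equivalent view over its own memory.
        return memoryview(getattr(self, name))

batch = ColumnBatch(4)
for row, (i, s) in enumerate([(1, 0.5), (2, 1.5), (3, 2.5), (4, 3.5)]):
    batch.ids[row] = i
    batch.scores[row] = s

view = batch.column_view('scores')
assert view.nbytes == 4 * 8  # four float64 values, stored contiguously
```

The design point is that operators exchange whole column buffers rather than row objects, which is what makes swapping the backing storage (heap, off-heap, mmap, GPU) feasible.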
[jira] [Comment Edited] (SPARK-15687) Columnar execution engine
[ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15481268#comment-15481268 ] Kiran Lonikar edited comment on SPARK-15687 at 9/11/16 7:29 AM: Evan Chan, I agree. It will then be possible to offer off-heap (sun.misc.Unsafe), ByteBuffer or even memory mapped files (mmap) based implementation for taking advantage of NVRAM memory based systems. Finally, it will be possible to use directly GPU RAM based implementation. was (Author: klonikar): I agree. It will then be possible to offer off-heap (sun.misc.Unsafe), ByteBuffer or even memory mapped files (mmap) based implementation for taking advantage of NVRAM memory based systems. Finally, it will be possible to use directly GPU RAM based implementation. > Columnar execution engine > - > > Key: SPARK-15687 > URL: https://issues.apache.org/jira/browse/SPARK-15687 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > This ticket tracks progress in making the entire engine columnar, especially > in the context of nested data type support. > In Spark 2.0, we have used the internal column batch interface in Parquet > reading (via a vectorized Parquet decoder) and low cardinality aggregation. > Other parts of the engine are already using whole-stage code generation, > which is in many ways more efficient than a columnar execution engine for > flat data types. > The goal here is to figure out a story to work towards making column batch > the common data exchange format between operators outside whole-stage code > generation, as well as with external systems (e.g. Pandas). > Some of the important questions to answer are: > From the architectural perspective: > - What is the end state architecture? > - Should aggregation be columnar? > - Should sorting be columnar? > - How do we encode nested data? What are the operations on nested data, and > how do we handle these operations in a columnar format? 
> - What is the transition plan towards the end state? > From an external API perspective: > - Can we expose a more efficient column batch user-defined function API? > - How do we leverage this to integrate with 3rd party tools? > - Can we have a spec for a fixed version of the column batch format that can > be externalized and use that in data source API v2?
[jira] [Commented] (SPARK-17496) missing int to float coercion in df.sample() signature
[ https://issues.apache.org/jira/browse/SPARK-17496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15481270#comment-15481270 ] Sean Owen commented on SPARK-17496: --- I don't think that's a bug. The value is virtually always in (0,1), so being able to specify 0 or 1 even is not typically useful. > missing int to float coercion in df.sample() signature > -- > > Key: SPARK-17496 > URL: https://issues.apache.org/jira/browse/SPARK-17496 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Max Moroz >Priority: Trivial > > {code} > # works > spark.createDataFrame([[1], [2], [3]]).sample(True, 1.0) > # doesn't work > spark.createDataFrame([[1], [2], [3]]).sample(True, 1) > {code}
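If callers do want to pass an int literal like `1`, the coercion the reporter asks for is trivial to apply before calling `sample()`. A hypothetical wrapper (illustration only, not part of the PySpark API):

```python
def sample_fraction(fraction):
    """Coerce an int fraction (e.g. 1) to float before passing it on,
    mirroring what an int-to-float coercion in df.sample() would do.
    Hypothetical helper, not a PySpark function."""
    fraction = float(fraction)  # 1 -> 1.0, so int literals are accepted
    if fraction < 0.0:
        raise ValueError(f"fraction must be non-negative, got {fraction}")
    return fraction

assert sample_fraction(1) == 1.0    # the case that fails in df.sample()
assert sample_fraction(0.5) == 0.5
```

Note the check is only `>= 0`, since with-replacement sampling legitimately allows fractions above 1.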
[jira] [Commented] (SPARK-15687) Columnar execution engine
[ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15481268#comment-15481268 ] Kiran Lonikar commented on SPARK-15687: --- I agree. It will then be possible to offer off-heap (sun.misc.Unsafe), ByteBuffer or even memory mapped files (mmap) based implementation for taking advantage of NVRAM memory based systems. Finally, it will be possible to use directly GPU RAM based implementation. > Columnar execution engine > - > > Key: SPARK-15687 > URL: https://issues.apache.org/jira/browse/SPARK-15687 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > This ticket tracks progress in making the entire engine columnar, especially > in the context of nested data type support. > In Spark 2.0, we have used the internal column batch interface in Parquet > reading (via a vectorized Parquet decoder) and low cardinality aggregation. > Other parts of the engine are already using whole-stage code generation, > which is in many ways more efficient than a columnar execution engine for > flat data types. > The goal here is to figure out a story to work towards making column batch > the common data exchange format between operators outside whole-stage code > generation, as well as with external systems (e.g. Pandas). > Some of the important questions to answer are: > From the architectural perspective: > - What is the end state architecture? > - Should aggregation be columnar? > - Should sorting be columnar? > - How do we encode nested data? What are the operations on nested data, and > how do we handle these operations in a columnar format? > - What is the transition plan towards the end state? > From an external API perspective: > - Can we expose a more efficient column batch user-defined function API? > - How do we leverage this to integrate with 3rd party tools? 
> - Can we have a spec for a fixed version of the column batch format that can > be externalized and use that in data source API v2?
[jira] [Resolved] (SPARK-17389) KMeans speedup with better choice of k-means|| init steps = 2
[ https://issues.apache.org/jira/browse/SPARK-17389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17389. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14956 [https://github.com/apache/spark/pull/14956] > KMeans speedup with better choice of k-means|| init steps = 2 > - > > Key: SPARK-17389 > URL: https://issues.apache.org/jira/browse/SPARK-17389 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 2.1.0 > > > As reported in > http://stackoverflow.com/questions/39260820/is-sparks-kmeans-unable-to-handle-bigdata#39260820 > KMeans can be surprisingly slow, and it's easy to see that most of the time > spent is in kmeans|| initialization. For example, in this simple example... > {code} > import org.apache.spark.mllib.random.RandomRDDs > import org.apache.spark.mllib.clustering.KMeans > val data = RandomRDDs.uniformVectorRDD(sc, 100, 64, > sc.defaultParallelism).cache() > data.count() > new KMeans().setK(1000).setMaxIterations(5).run(data) > {code} > Init takes 5:54, and iterations take about 0:15 each, on my laptop. Init > takes about as long as 24 iterations, which is a typical run, meaning half > the time is just in picking cluster centers. This seems excessive. > There are two ways to speed this up significantly. First, the implementation > has an old "runs" parameter that is always 1 now. It used to allow multiple > clusterings to be computed at once. The code can be simplified significantly > now that runs=1 always. This is already covered by SPARK-11560, but just a > simple refactoring results in about a 13% init speedup, from 5:54 to 5:09 in > this example. That's not what this change is about though. > By default, k-means|| makes 5 passes over the data. 
The original paper at > http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf actually shows > that 2 is plenty, certainly when l=2k as is the case in our implementation. > (See Figure 5.2/5.3; I believe the default of 5 was taken from Table 6 but > it's not suggesting 5 is an optimal value.) Defaulting to 2 brings it down to > 1:41 -- much improved over 5:54. > Lastly, small thing, but the code will perform a local k-means++ step to > reduce the number of centers to k even if there are already only <= k > centers. This can be short-circuited. However this is really the topic of > SPARK-3261 because this can cause fewer than k clusters to be returned where > that would actually be correct, too.
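For reference, the k-means|| initialization being tuned here can be sketched in a few lines. This is a simplified 1-D pure-Python version (hypothetical `kmeans_parallel_init`, not Spark's MLlib code): each of `steps` passes samples points with probability proportional to their squared distance from the current centers, oversampling by l=2k as the issue describes.

```python
import random

def kmeans_parallel_init(points, k, steps=2, l=None, seed=0):
    """Sketch of k-means|| initialization (Bahmani et al.): `steps` passes,
    each sampling points with probability ~ l * d^2 / total cost. The paper
    (and this issue) argue steps=2 suffices versus the old default of 5."""
    rng = random.Random(seed)
    l = l or 2 * k                     # oversampling factor, l = 2k
    centers = [rng.choice(points)]     # seed with one uniformly random point

    def d2(p):
        # squared distance from p to its nearest current center (1-D toy case)
        return min((p - c) ** 2 for c in centers)

    for _ in range(steps):
        cost = sum(d2(p) for p in points) or 1e-12
        # Sample each point into the candidate set independently.
        centers += [p for p in points if rng.random() < l * d2(p) / cost]
    return centers  # a final local k-means++ pass reduces these to exactly k

pts = [float(i) for i in range(100)]
candidates = kmeans_parallel_init(pts, k=5, steps=2)
assert all(c in pts for c in candidates)
```

Each extra pass is a full scan of the data, which is why cutting the default from 5 passes to 2 roughly halves total runtime when initialization dominates, as in the 5:54 → 1:41 measurement above.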