[jira] [Commented] (SPARK-18584) multiple Spark Thrift Servers running in the same machine throws org.apache.hadoop.security.AccessControlException

2016-11-25 Thread tanxinz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15697237#comment-15697237
 ] 

tanxinz commented on SPARK-18584:
-

The two STS instances ran on different YARN queues:
the etl Spark Thrift Server ran on the root.etl queue, and
the dev Spark Thrift Server ran on the root.dev queue.

I found a Spark executor process like the one below; can it be used to identify
which user performs the operation?
3004 CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@machine_ip:33035 --executor-id 154 --hostname 
slave198 --cores 3 --app-id application_1479797390730_2433 --user-class-path 
file:/data5/yn_loc/usercache/etl/appcache/application_1479797390730_2433/container_1479797390730_2433_01_000608/__app__.jar
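For what it's worth, a hedged way to check which user each server and its executors actually run as (my own sketch, not from the ticket): compare what each Spark application reports, for example from a spark-shell started with the same properties file as the corresponding Thrift Server.

{code}
// Hedged sketch: standard Spark/Hadoop APIs for confirming the submitting and
// effective user of the running application.
scala> spark.sparkContext.sparkUser
scala> org.apache.hadoop.security.UserGroupInformation.getCurrentUser.getShortUserName
{code}

The usercache path in the executor command line above (.../usercache/etl/...) also records which user the YARN container was localized for.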

> multiple Spark Thrift Servers running in the same machine throws 
> org.apache.hadoop.security.AccessControlException
> --
>
> Key: SPARK-18584
> URL: https://issues.apache.org/jira/browse/SPARK-18584
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: hadoop-2.5.0-cdh5.2.1-och4.0.0
> spark2.0.2
>Reporter: tanxinz
>
> In Spark 2.0.2, I have two users (etl, dev) who each start a Spark Thrift Server 
> on the same machine. When I connected to the etl STS via beeline and executed a 
> command, it threw org.apache.hadoop.security.AccessControlException. I don't 
> understand why the operation is performed as the dev user instead of etl.
> ```
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
>  Permission denied: user=dev, access=EXECUTE, 
> inode="/user/hive/warehouse/tb_spark_sts/etl_cycle_id=20161122":etl:supergroup:drwxr-x---,group:etl:rwx,group:oth_dev:rwx,default:user:data_mining:r-x,default:group::rwx,default:group:etl:rwx,default:group:oth_dev:rwx,default:mask::rwx,default:other::---
> at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkAccessAcl(DefaultAuthorizationProvider.java:335)
> at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:231)
> at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkTraverse(DefaultAuthorizationProvider.java:178)
> at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:137)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:138)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6250)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3942)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:811)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getFileInfo(AuthorizationProviderProxyClientProtocol.java:502)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:815)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> ```






[jira] [Resolved] (SPARK-18583) Fix nullability of InputFileName.

2016-11-25 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18583.
-
   Resolution: Fixed
 Assignee: Takuya Ueshin
Fix Version/s: 2.1.0

> Fix nullability of InputFileName.
> -
>
> Key: SPARK-18583
> URL: https://issues.apache.org/jira/browse/SPARK-18583
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Minor
> Fix For: 2.1.0
>
>
> The nullability of {{InputFileName}} should be {{false}}.
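For reference, a minimal hedged check (my own sketch, not from the patch) of the nullability the expression reports:

{code}
// input_file_name() wraps InputFileName; after the fix the reported
// nullability should be false.
import org.apache.spark.sql.functions.input_file_name
spark.range(1).select(input_file_name()).schema.head.nullable
{code}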






[jira] [Commented] (SPARK-18584) multiple Spark Thrift Servers running in the same machine throws org.apache.hadoop.security.AccessControlException

2016-11-25 Thread tanxinz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15697216#comment-15697216
 ] 

tanxinz commented on SPARK-18584:
-

Different users have different authorizations to access different HDFS 
resources. Right now I have two users (etl, dev), and I am running two Spark 
Thrift Servers:

etl Spark Thrift Server:
/home/etl/app/spark-2.0.2-bin-spark_hadoop250/sbin/start-thriftserver.sh \
--hiveconf hive.server2.thrift.port=10111 \
--properties-file 
/home/etl/app/spark-2.0.2-bin-spark_hadoop250/conf/spark-etl.conf \
--conf spark.executor.instances=130 --name spark_etl

dev Spark Thrift Server:
/home/dev/app/spark-2.0.1-bin-spark_hadoop250/sbin/start-thriftserver.sh \
--hiveconf hive.server2.thrift.port=10001 \
--properties-file 
/home/dev/app/spark-2.0.1-bin-spark_hadoop250/conf/spark-dev.conf \
--driver-memory 10G \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.port=7337 \
--conf spark.dynamicAllocation.maxExecutors=100 \
--conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5s \
--conf spark.dynamicAllocation.executorIdleTimeout=30s \
--name sparkedw_dynamic

When I connected to the etl STS via beeline and executed a command:
beeline -u jdbc:hive2://machine_ip:10111 -n etl -p passwd --verbose=true -e "${sql_text}"

it threw org.apache.hadoop.security.AccessControlException. I don't understand 
why the operation is performed as the dev user instead of etl.



> multiple Spark Thrift Servers running in the same machine throws 
> org.apache.hadoop.security.AccessControlException
> --
>
> Key: SPARK-18584
> URL: https://issues.apache.org/jira/browse/SPARK-18584
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: hadoop-2.5.0-cdh5.2.1-och4.0.0
> spark2.0.2
>Reporter: tanxinz
>
> In Spark 2.0.2, I have two users (etl, dev) who each start a Spark Thrift Server 
> on the same machine. When I connected to the etl STS via beeline and executed a 
> command, it threw org.apache.hadoop.security.AccessControlException. I don't 
> understand why the operation is performed as the dev user instead of etl.
> ```
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
>  Permission denied: user=dev, access=EXECUTE, 
> inode="/user/hive/warehouse/tb_spark_sts/etl_cycle_id=20161122":etl:supergroup:drwxr-x---,group:etl:rwx,group:oth_dev:rwx,default:user:data_mining:r-x,default:group::rwx,default:group:etl:rwx,default:group:oth_dev:rwx,default:mask::rwx,default:other::---
> at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkAccessAcl(DefaultAuthorizationProvider.java:335)
> at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:231)
> at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkTraverse(DefaultAuthorizationProvider.java:178)
> at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:137)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:138)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6250)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3942)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:811)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getFileInfo(AuthorizationProviderProxyClientProtocol.java:502)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:815)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> ```





[jira] [Commented] (SPARK-18502) Spark does not handle columns that contain backquote (`)

2016-11-25 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15697210#comment-15697210
 ] 

Takeshi Yamamuro commented on SPARK-18502:
--

Could you give us a simple query to reproduce this?
I tried a simple query, but it passed:
{code}
scala> val df = Seq(("a", 1), ("b", 2), ("c", 1), ("d", 5)).toDF("`k`ey`", 
"value")
df: org.apache.spark.sql.DataFrame = [`k`ey`: string, value: int]

scala> df.show
+--+-+
|`k`ey`|value|
+--+-+
| a|1|
| b|2|
| c|1|
| d|5|
+--+-+
{code}
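A related hedged sketch (not from this thread; it assumes the double-backtick escape handled by UnresolvedAttribute.parseAttributeName, so worth double-checking): referencing a column whose name contains a literal backtick by doubling the backtick inside the quoted name.

{code}
scala> val df2 = Seq(("a", 1)).toDF("Invoice`Date", "value")
// The embedded backtick is doubled inside the enclosing backticks.
scala> df2.select(org.apache.spark.sql.functions.min("`Invoice``Date`")).show
{code}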

> Spark does not handle columns that contain backquote (`)
> 
>
> Key: SPARK-18502
> URL: https://issues.apache.org/jira/browse/SPARK-18502
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Barry Becker
>Priority: Minor
>
> I know that if a column contains dots or hyphens we can put 
> backquotes/backticks around it, but what if the column contains a backtick 
> (`)? Can the backtick be escaped by some means?
> Here is an example of the sort of error I see:
> {code}
> org.apache.spark.sql.AnalysisException: syntax error in attribute name: 
> `Invoice`Date`;org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99)
>  
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:109)
>  
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.quotedString(unresolved.scala:90)
>  org.apache.spark.sql.Column.(Column.scala:113) 
> org.apache.spark.sql.Column$.apply(Column.scala:36) 
> org.apache.spark.sql.functions$.min(functions.scala:407) 
> com.mineset.spark.vizagg.vizbin.strategies.DateBinStrategy.getDateExtent(DateBinStrategy.scala:158)
>  
> {code}






[jira] [Commented] (SPARK-17251) "ClassCastException: OuterReference cannot be cast to NamedExpression" for correlated subquery on the RHS of an IN operator

2016-11-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15697176#comment-15697176
 ] 

Apache Spark commented on SPARK-17251:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/16015

> "ClassCastException: OuterReference cannot be cast to NamedExpression" for 
> correlated subquery on the RHS of an IN operator
> ---
>
> Key: SPARK-17251
> URL: https://issues.apache.org/jira/browse/SPARK-17251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>
> The following test case produces a ClassCastException in the analyzer:
> {code}
> CREATE TABLE t1(a INTEGER);
> INSERT INTO t1 VALUES(1),(2);
> CREATE TABLE t2(b INTEGER);
> INSERT INTO t2 VALUES(1);
> SELECT a FROM t1 WHERE a NOT IN (SELECT a FROM t2);
> {code}
> Here's the exception:
> {code}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.OuterReference cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.NamedExpression
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$1.apply(basicLogicalOperators.scala:48)
>   at 
> scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80)
>   at scala.collection.immutable.List.exists(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.resolved$lzycompute(basicLogicalOperators.scala:44)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.resolved(basicLogicalOperators.scala:43)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQuery(Analyzer.scala:1091)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1130)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1116)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries(Analyzer.scala:1116)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1148)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1141)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> 

[jira] [Commented] (SPARK-18591) Replace hash-based aggregates with sort-based ones if inputs already sorted

2016-11-25 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15697111#comment-15697111
 ] 

Takeshi Yamamuro commented on SPARK-18591:
--

If this is worth trying, I'll work on it. I just made a prototype here: 
https://github.com/maropu/spark/commit/32b716cf02dfe8cba5b08b2dc3297bc061156630#diff-7d06cf071190dcbeda2fed6b039ec5d0R55

> Replace hash-based aggregates with sort-based ones if inputs already sorted
> ---
>
> Key: SPARK-18591
> URL: https://issues.apache.org/jira/browse/SPARK-18591
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Takeshi Yamamuro
>
> Spark currently uses sort-based aggregates only in limited conditions: the 
> cases where Spark cannot use partial aggregates and hash-based ones.
> However, if the input ordering already satisfies the requirements of 
> sort-based aggregates, sort-based ones appear to be faster.
> {code}
> ./bin/spark-shell --conf spark.sql.shuffle.partitions=1
> val df = spark.range(1000).selectExpr("id AS key", "id % 10 AS 
> value").sort($"key").cache
> def timer[R](block: => R): R = {
>   val t0 = System.nanoTime()
>   val result = block
>   val t1 = System.nanoTime()
>   println("Elapsed time: " + ((t1 - t0 + 0.0) / 1000000000.0) + "s")  // nanoseconds to seconds
>   result
> }
> timer {
>   df.groupBy("key").count().count
> }
> // codegen'd hash aggregate
> Elapsed time: 7.116962977s
> // non-codegen'd sort aggregate
> Elapsed time: 3.088816662s
> {code}
> If codegen'd sort-based aggregates are supported (SPARK-16844), the 
> performance gap seems to become even bigger:
> {code}
> - codegen'd sort aggregate
> Elapsed time: 1.645234684s
> {code} 
> Therefore, it'd be better to use sort-based ones in this case.






[jira] [Created] (SPARK-18591) Replace hash-based aggregates with sort-based ones if inputs already sorted

2016-11-25 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-18591:


 Summary: Replace hash-based aggregates with sort-based ones if 
inputs already sorted
 Key: SPARK-18591
 URL: https://issues.apache.org/jira/browse/SPARK-18591
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.2
Reporter: Takeshi Yamamuro


Spark currently uses sort-based aggregates only in limited conditions: the cases 
where Spark cannot use partial aggregates and hash-based ones.
However, if the input ordering already satisfies the requirements of sort-based 
aggregates, sort-based ones appear to be faster.
{code}
./bin/spark-shell --conf spark.sql.shuffle.partitions=1

val df = spark.range(1000).selectExpr("id AS key", "id % 10 AS 
value").sort($"key").cache

def timer[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  val t1 = System.nanoTime()
  println("Elapsed time: " + ((t1 - t0 + 0.0) / 1000000000.0) + "s")  // nanoseconds to seconds
  result
}

timer {
  df.groupBy("key").count().count
}

// codegen'd hash aggregate
Elapsed time: 7.116962977s

// non-codegen'd sort aggregate
Elapsed time: 3.088816662s
{code}

If codegen'd sort-based aggregates are supported (SPARK-16844), the performance 
gap seems to become even bigger:
{code}
- codegen'd sort aggregate
Elapsed time: 1.645234684s
{code} 

Therefore, it'd be better to use sort-based ones in this case.
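A hedged follow-up sketch (same spark-shell session as above): checking which physical aggregate the planner currently picks for the pre-sorted input.

{code}
// Today the plan is expected to contain HashAggregate nodes; the proposal is
// to plan a sort-based aggregate when the child ordering already satisfies
// the aggregate's required ordering.
df.groupBy("key").count().explain()
{code}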






[jira] [Comment Edited] (SPARK-18405) Add yarn-cluster mode support to Spark Thrift Server

2016-11-25 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696978#comment-15696978
 ] 

Jeff Zhang edited comment on SPARK-18405 at 11/26/16 1:01 AM:
--

I think he means to launch multiple Spark Thrift Servers in yarn-cluster mode so 
that we can evenly distribute the workload across them (e.g. one Spark Thrift 
Server per department). It is also much easier to request a large container for 
the AM in yarn-cluster mode, whereas in yarn-client mode the memory of the host 
may be limited and is not under YARN's capacity control.


was (Author: zjffdu):
I think he mean to launch multiple spark thrift server in yarn-cluster mode so 
each we can evenly distribute workload across them.  And it is much easier to 
request large container for AM in yarn-cluster while the the memory of the host 
in yarn-client mode may be limited and without capacity control under yarn. 

> Add yarn-cluster mode support to Spark Thrift Server
> 
>
> Key: SPARK-18405
> URL: https://issues.apache.org/jira/browse/SPARK-18405
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.2, 2.0.0, 2.0.1
>Reporter: Prabhu Kasinathan
>  Labels: Spark, ThriftServer2
>
> Currently, the Spark Thrift Server can run only in yarn-client mode. 
> Can we add yarn-cluster mode support to the Spark Thrift Server?
> This would let us launch multiple Spark Thrift Servers with different Spark 
> configurations, which really helps in large distributed clusters where there 
> is a requirement to run complex SQL through STS. With client mode, there is a 
> chance of overloading the local host with too much driver memory. 
> Please let me know your thoughts.






[jira] [Commented] (SPARK-18405) Add yarn-cluster mode support to Spark Thrift Server

2016-11-25 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696978#comment-15696978
 ] 

Jeff Zhang commented on SPARK-18405:


I think he means to launch multiple Spark Thrift Servers in yarn-cluster mode so 
that we can evenly distribute the workload across them. It is also much easier to 
request a large container for the AM in yarn-cluster mode, whereas in yarn-client 
mode the memory of the host may be limited and is not under YARN's capacity control.

> Add yarn-cluster mode support to Spark Thrift Server
> 
>
> Key: SPARK-18405
> URL: https://issues.apache.org/jira/browse/SPARK-18405
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.2, 2.0.0, 2.0.1
>Reporter: Prabhu Kasinathan
>  Labels: Spark, ThriftServer2
>
> Currently, the Spark Thrift Server can run only in yarn-client mode. 
> Can we add yarn-cluster mode support to the Spark Thrift Server?
> This would let us launch multiple Spark Thrift Servers with different Spark 
> configurations, which really helps in large distributed clusters where there 
> is a requirement to run complex SQL through STS. With client mode, there is a 
> chance of overloading the local host with too much driver memory. 
> Please let me know your thoughts.






[jira] [Assigned] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution

2016-11-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18590:


Assignee: Felix Cheung  (was: Apache Spark)

> R - Include package vignettes and help pages, build source package in Spark 
> distribution
> 
>
> Key: SPARK-18590
> URL: https://issues.apache.org/jira/browse/SPARK-18590
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Blocker
>
> We should include in Spark distribution the built source package for SparkR. 
> This will enable help and vignettes when the package is used. Also this 
> source package is what we would release to CRAN.






[jira] [Commented] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution

2016-11-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696848#comment-15696848
 ] 

Apache Spark commented on SPARK-18590:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/16014

> R - Include package vignettes and help pages, build source package in Spark 
> distribution
> 
>
> Key: SPARK-18590
> URL: https://issues.apache.org/jira/browse/SPARK-18590
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Blocker
>
> We should include in Spark distribution the built source package for SparkR. 
> This will enable help and vignettes when the package is used. Also this 
> source package is what we would release to CRAN.






[jira] [Assigned] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution

2016-11-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18590:


Assignee: Apache Spark  (was: Felix Cheung)

> R - Include package vignettes and help pages, build source package in Spark 
> distribution
> 
>
> Key: SPARK-18590
> URL: https://issues.apache.org/jira/browse/SPARK-18590
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Blocker
>
> We should include in Spark distribution the built source package for SparkR. 
> This will enable help and vignettes when the package is used. Also this 
> source package is what we would release to CRAN.






[jira] [Updated] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution

2016-11-25 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18590:
-
Description: 
We should include in Spark distribution the built source package for SparkR. 
This will enable help and vignettes when the package is used. Also this source 
package is what we would release to CRAN.


> R - Include package vignettes and help pages, build source package in Spark 
> distribution
> 
>
> Key: SPARK-18590
> URL: https://issues.apache.org/jira/browse/SPARK-18590
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Blocker
>
> We should include in Spark distribution the built source package for SparkR. 
> This will enable help and vignettes when the package is used. Also this 
> source package is what we would release to CRAN.






[jira] [Created] (SPARK-18590) R - Include package vignettes and help pages, build source package in Spark distribution

2016-11-25 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-18590:


 Summary: R - Include package vignettes and help pages, build 
source package in Spark distribution
 Key: SPARK-18590
 URL: https://issues.apache.org/jira/browse/SPARK-18590
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung
Assignee: Felix Cheung
Priority: Blocker









[jira] [Commented] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"

2016-11-25 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696717#comment-15696717
 ] 

Nicholas Chammas commented on SPARK-18589:
--

cc [~davies] [~hvanhovell]

> persist() resolves "java.lang.RuntimeException: Invalid PythonUDF 
> (...), requires attributes from more than one child"
> --
>
> Key: SPARK-18589
> URL: https://issues.apache.org/jira/browse/SPARK-18589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Smells like another optimizer bug that's similar to SPARK-17100 and 
> SPARK-18254. I'm seeing this on 2.0.2 and on master at commit 
> {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}.
> I don't have a minimal repro for this yet, but the error I'm seeing is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o247.count.
> : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires 
> attributes from more than one child.
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149)
> at scala.collection.immutable.Stream.foreach(Stream.scala:594)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> 

[jira] [Created] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"

2016-11-25 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-18589:


 Summary: persist() resolves "java.lang.RuntimeException: Invalid 
PythonUDF (...), requires attributes from more than one child"
 Key: SPARK-18589
 URL: https://issues.apache.org/jira/browse/SPARK-18589
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.0.2, 2.1.0
 Environment: Python 3.5, Java 8
Reporter: Nicholas Chammas
Priority: Minor


Smells like another optimizer bug that's similar to SPARK-17100 and 
SPARK-18254. I'm seeing this on 2.0.2 and on master at commit 
{{fb07bbe575aabe68422fd3a31865101fb7fa1722}}.

I don't have a minimal repro for this yet, but the error I'm seeing is:

{code}
py4j.protocol.Py4JJavaError: An error occurred while calling o247.count.
: java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires attributes 
from more than one child.
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150)
at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149)
at scala.collection.immutable.Stream.foreach(Stream.scala:594)
at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149)
at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
at 
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at 
org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:93)
at 
org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:83)
at 
org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:83)
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2555)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2226)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 

[jira] [Resolved] (SPARK-18436) isin causing SQL syntax error with JDBC

2016-11-25 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18436.
---
   Resolution: Fixed
 Assignee: Jiang Xingbo
Fix Version/s: 2.1.0
   2.0.3

> isin causing SQL syntax error with JDBC
> ---
>
> Key: SPARK-18436
> URL: https://issues.apache.org/jira/browse/SPARK-18436
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Linux, SQL Server 2012
>Reporter: Dan
>Assignee: Jiang Xingbo
>  Labels: jdbc, sql
> Fix For: 2.0.3, 2.1.0
>
>
> When using a JDBC data source, the "isin" function generates invalid SQL 
> syntax when called with an empty array, which causes the JDBC driver to throw 
> an exception.
> If the array is not empty, it works fine.
> In the below example you can assume that SOURCE_CONNECTION, SQL_DRIVER and 
> TABLE are all correctly defined.
> {noformat}
> scala> val filter = Array[String]()
> filter: Array[String] = Array()
> scala> val sortDF = spark.read.format("jdbc").options(Map("url" -> 
> SOURCE_CONNECTION, "driver" -> SQL_DRIVER, "dbtable" -> 
> TABLE)).load().filter($"cl_ult".isin(filter:_*))
> sortDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [ibi_bulk_id: bigint, ibi_row_id: int ... 174 more fields]
> scala> sortDF.show()
> 16/11/14 15:35:46 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 205)
> com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near ')'.
> at 
> com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:216)
> at 
> com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1515)
> at 
> com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:404)
> at 
> com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:350)
> at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:5696)
> at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1715)
> at 
> com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:180)
> at 
> com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:155)
> at 
> com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:285)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:408)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:379)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
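A hedged workaround sketch (my own, not from the ticket): guard against the empty list before building the filter, so that no empty IN () clause is pushed down to the JDBC source. {{jdbcDF}} below stands for the DataFrame loaded from the JDBC options above.

{code}
import org.apache.spark.sql.functions.lit

val sortDF =
  if (filter.isEmpty) jdbcDF.filter(lit(false))      // an empty IN-list matches nothing
  else jdbcDF.filter($"cl_ult".isin(filter: _*))
{code}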






[jira] [Commented] (SPARK-18527) UDAFPercentile (bigint, array) needs explicity cast to double

2016-11-25 Thread Thomas Sebastian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696598#comment-15696598
 ] 

Thomas Sebastian commented on SPARK-18527:
--

I am interested in working on this.

> UDAFPercentile (bigint, array) needs explicity cast to double
> -
>
> Key: SPARK-18527
> URL: https://issues.apache.org/jira/browse/SPARK-18527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark-2.0.1-bin-hadoop2.7/bin/spark-shell
>Reporter: Fabian Boehnlein
>
> Same bug as SPARK-16228, but for 
> {code}_FUNC_(bigint, array) {code}
> instead of 
> {code}_FUNC_(bigint, double){code}
> The fix for SPARK-16228 only covers the non-array case that was hit.
> {code}
> sql("select percentile(value, array(0.5,0.99)) from values 1,2,3 T(value)")
> {code}
> fails in Spark 2 shell.
> Longer example
> {code}
> case class Record(key: Long, value: String)
> val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i.toLong, 
> s"val_$i")))
> recordsDF.createOrReplaceTempView("records")
> sql("SELECT percentile(key, Array(0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 
> 0.2, 0.1)) AS test FROM records")
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': 
> org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method 
> for class org.apache.had
> oop.hive.ql.udf.UDAFPercentile with (bigint, array). Possible 
> choices: _FUNC_(bigint, array)  _FUNC_(bigint, double)  ; line 1 pos 7
>   at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getMethodInternal(FunctionRegistry.java:1164)
>   at 
> org.apache.hadoop.hive.ql.exec.DefaultUDAFEvaluatorResolver.getEvaluatorClass(DefaultUDAFEvaluatorResolver.java:83)
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge.getEvaluator(GenericUDAFBridge.java:56)
>   at 
> org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver.getEvaluator(AbstractGenericUDAFResolver.java:47){code}
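A hedged workaround sketch (untested, my own): making the cast the title refers to explicit, so the percentile fractions are passed as array<double>.

{code}
sql("SELECT percentile(value, CAST(array(0.5, 0.99) AS array<double>)) FROM values 1,2,3 T(value)")
{code}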






[jira] [Commented] (SPARK-18543) SaveAsTable(CTAS) using overwrite could change table definition

2016-11-25 Thread Thomas Sebastian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696586#comment-15696586
 ] 

Thomas Sebastian commented on SPARK-18543:
--

I would like to take a look at this fix, if you have not already started.

> SaveAsTable(CTAS) using overwrite could change table definition
> ---
>
> Key: SPARK-18543
> URL: https://issues.apache.org/jira/browse/SPARK-18543
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> When the mode is OVERWRITE, we drop the Hive serde tables and create a data 
> source table. This is not right. 
> {code}
> val tableName = "tab1"
> withTable(tableName) {
>   sql(s"CREATE TABLE $tableName STORED AS SEQUENCEFILE AS SELECT 1 AS 
> key, 'abc' AS value")
>   val df = sql(s"SELECT key, value FROM $tableName")
>   df.write.mode(SaveMode.Overwrite).saveAsTable(tableName)
>   val tableMeta = 
> spark.sessionState.catalog.getTableMetadata(TableIdentifier(tableName))
>   assert(tableMeta.provider == 
> Some(spark.sessionState.conf.defaultDataSourceName))
> }
> {code}
> Based on the definition of OVERWRITE, no change should be made to the table 
> definition. When recreating the table, we need to create a Hive serde table.






[jira] [Commented] (SPARK-18220) ClassCastException occurs when using select query on ORC file

2016-11-25 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696541#comment-15696541
 ] 

Herman van Hovell commented on SPARK-18220:
---

I tried reproducing this but to no avail. [~jerryjung] could you give us a 
reproducible example?
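For what it's worth, a hedged repro attempt sketch (my own; not confirmed to trigger the reported cast error), based on the HiveVarcharWritable in the stack trace:

{code}
// Assumes a Hive-enabled session: an ORC table with a VARCHAR column, read back with a select.
sql("CREATE TABLE varchar_orc (c VARCHAR(10)) STORED AS ORC")
sql("INSERT INTO varchar_orc VALUES ('abc')")
sql("SELECT c FROM varchar_orc").show()
{code}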

> ClassCastException occurs when using select query on ORC file
> -
>
> Key: SPARK-18220
> URL: https://issues.apache.org/jira/browse/SPARK-18220
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jerryjung
>  Labels: orcfile, sql
>
> Error message is below.
> {noformat}
> ==
> 16/11/02 16:38:09 INFO ReaderImpl: Reading ORC rows from 
> hdfs://xxx/part-00022 with {include: [true], offset: 0, length: 
> 9223372036854775807}
> 16/11/02 16:38:09 INFO Executor: Finished task 17.0 in stage 22.0 (TID 42). 
> 1220 bytes result sent to driver
> 16/11/02 16:38:09 INFO TaskSetManager: Finished task 17.0 in stage 22.0 (TID 
> 42) in 116 ms on localhost (executor driver) (19/20)
> 16/11/02 16:38:09 ERROR Executor: Exception in task 10.0 in stage 22.0 (TID 
> 35)
> java.lang.ClassCastException: 
> org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to 
> org.apache.hadoop.io.Text
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
>   at 
> org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:526)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> ORC dump info.
> ==
> File Version: 0.12 with HIVE_8732
> 16/11/02 16:39:21 INFO orc.ReaderImpl: Reading ORC rows from 
> hdfs://XXX/part-0 with {include: null, offset: 0, length: 
> 9223372036854775807}
> 16/11/02 16:39:21 INFO orc.RecordReaderFactory: Schema is not specified on 
> read. Using file schema.
> Rows: 7
> Compression: ZLIB
> Compression size: 262144
> Type: 
> struct
> {noformat}






[jira] [Commented] (SPARK-18487) Add task completion listener to HashAggregate to avoid memory leak

2016-11-25 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696458#comment-15696458
 ] 

Reynold Xin commented on SPARK-18487:
-

As discussed on the pull request, this is not an issue.


> Add task completion listener to HashAggregate to avoid memory leak
> --
>
> Key: SPARK-18487
> URL: https://issues.apache.org/jira/browse/SPARK-18487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Methods such as Dataset.show and take use Limit (CollectLimitExec), which 
> leverages SparkPlan.executeTake to efficiently collect the required number of 
> elements back to the driver.
> However, under whole-stage codegen, we usually release resources only after all 
> elements are consumed (e.g., HashAggregate). In this case, we will not 
> release the resources, causing a memory leak with Dataset.show, for example.
> We can add a task completion listener to HashAggregate to avoid the memory leak.
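For context, a hedged sketch of the general pattern the title refers to (illustrative only, not the actual patch):

{code}
import org.apache.spark.TaskContext

// Registered from an operator's per-partition setup code; the callback runs
// when the task finishes, even if the iterator is not fully consumed.
TaskContext.get().addTaskCompletionListener { _ =>
  // release operator-held resources here, e.g. free an aggregation hash map
}
{code}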






[jira] [Closed] (SPARK-18487) Add task completion listener to HashAggregate to avoid memory leak

2016-11-25 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-18487.
---
Resolution: Not A Problem

> Add task completion listener to HashAggregate to avoid memory leak
> --
>
> Key: SPARK-18487
> URL: https://issues.apache.org/jira/browse/SPARK-18487
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Methods such as Dataset.show and take use Limit (CollectLimitExec), which 
> leverages SparkPlan.executeTake to efficiently collect the required number of 
> elements back to the driver.
> However, under whole-stage codegen, we usually release resources only after all 
> elements are consumed (e.g., HashAggregate). In this case, we will not 
> release the resources, causing a memory leak with Dataset.show, for example.
> We can add a task completion listener to HashAggregate to avoid the memory leak.






[jira] [Commented] (SPARK-3359) `sbt/sbt unidoc` doesn't work with Java 8

2016-11-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696373#comment-15696373
 ] 

Apache Spark commented on SPARK-3359:
-

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/16013

> `sbt/sbt unidoc` doesn't work with Java 8
> -
>
> Key: SPARK-3359
> URL: https://issues.apache.org/jira/browse/SPARK-3359
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> It seems that Java 8 is stricter on JavaDoc. I got many error messages like
> {code}
> [error] 
> /Users/meng/src/spark-mengxr/core/target/java/org/apache/hadoop/mapred/SparkHadoopMapRedUtil.java:2:
>  error: modifier private not allowed here
> [error] private abstract interface SparkHadoopMapRedUtil {
> [error]  ^
> {code}
> This is minor because we can always use Java 6/7 to generate the doc.






[jira] [Commented] (SPARK-18407) Inferred partition columns cause assertion error

2016-11-25 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696337#comment-15696337
 ] 

Burak Yavuz commented on SPARK-18407:
-

This is also resolved as part of 
https://issues.apache.org/jira/browse/SPARK-18510

> Inferred partition columns cause assertion error
> 
>
> Key: SPARK-18407
> URL: https://issues.apache.org/jira/browse/SPARK-18407
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Michael Armbrust
>Priority: Critical
>
> [This 
> assertion|https://github.com/apache/spark/blob/16eaad9daed0b633e6a714b5704509aa7107d6e5/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L408]
>  fails when you run a stream against json data that is stored in partitioned 
> folders, if you manually specify the schema and that schema omits the 
> partitioned columns.
> My hunch is that we are inferring those columns even though the schema is 
> being passed in manually and adding them to the end.
> While we are fixing this bug, it would be nice to make the assertion better.  
> Truncating is not terribly useful as, at least in my case, it truncated the 
> most interesting part.  I changed it to this while debugging:
> {code}
>   s"""
>  |Batch does not have expected schema
>  |Expected: ${output.mkString(",")}
>  |Actual: ${newPlan.output.mkString(",")}
>  |
>  |== Original ==
>  |$logicalPlan
>  |
>  |== Batch ==
>  |$newPlan
>""".stripMargin
> {code}
> I also tried specifying the partition columns in the schema and now it 
> appears that they are filled with corrupted data.
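For reference, a hedged sketch of the setup described (paths and column names are hypothetical): streaming over a partitioned JSON directory while passing a schema that omits the partition column.

{code}
import org.apache.spark.sql.types._

// The directory is assumed to contain subfolders like part_col=.../*.json;
// the user-specified schema deliberately omits part_col.
val userSchema = new StructType().add("value", StringType)
val stream = spark.readStream.schema(userSchema).json("/path/to/partitioned/json")
{code}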






[jira] [Comment Edited] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

2016-11-25 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696321#comment-15696321
 ] 

Herman van Hovell edited comment on SPARK-17788 at 11/25/16 5:09 PM:
-

That is fair. The solution is not that straightforward, TBH:
- Always add some kind of tie-breaking value to the range. This could be random, but I'd rather add something like monotonically_increasing_id(). This always incurs some cost.
- Only add a tie-breaker when you have (or suspect) skew. Here we need to add some heavy-hitter algorithm, which is potentially much more resource-intensive than reservoir sampling. The other thing is that when we suspect skew, we would need to scan the data again (which would bring the total number of scans to three).

So I would be slightly in favor of option 1 and a flag to disable it; a minimal sketch of how option 1 could look from user code follows below.
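
For illustration only, here is a hand-rolled sketch of the option-1 idea from user code. It assumes a DataFrame {{df}} with a skewed sort column {{value}}; it is not the proposed internal change.

{code}
import org.apache.spark.sql.functions.monotonically_increasing_id

// Append a per-row tie-breaker so that rows sharing the same sort value can still be
// split across range-partition boundaries; the global order on "value" is preserved.
val salted = df.withColumn("__tiebreaker", monotonically_increasing_id())
val sorted = salted.orderBy($"value", $"__tiebreaker").drop("__tiebreaker")
{code}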


was (Author: hvanhovell):
That is fair. The solution is not that straightforward, TBH:
- Always add some kind of tie-breaking value to the range. This could be random, but I'd rather add something like monotonically_increasing_id(). This always incurs some cost.
- Only add a tie-breaker when you have (or suspect) skew. Here we need to add some heavy-hitter algorithm, which is potentially much more resource-intensive than reservoir sampling. The other thing is that when we suspect skew, we would need to scan the data again (which would bring the total number of scans to three).
So I would be slightly in favor of option 1 and a flag to disable it.

> RangePartitioner results in few very large tasks and many small to empty 
> tasks 
> ---
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
>Reporter: Babak Alipour
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in 
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B 
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") 
> ​But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE).sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE).orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner 
> trying to create equal ranges. [1]
> [1] 
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>  
>  The Double values I'm trying to sort are mostly in the range [0,1] (~70% of 
> the data, which roughly equates to 1 billion records); other numbers in the 
> dataset are as high as 2000. With the RangePartitioner trying to create equal 
> ranges, some tasks are becoming almost empty while others are extremely 
> large, due to the heavily skewed distribution. 
> This is either a bug in Apache Spark or a major limitation of the framework. 
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

2016-11-25 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696321#comment-15696321
 ] 

Herman van Hovell commented on SPARK-17788:
---

That is fair. The solution is not that straightforward, TBH:
- Always add some kind of tie-breaking value to the range. This could be random, but I'd rather add something like monotonically_increasing_id(). This always incurs some cost.
- Only add a tie-breaker when you have (or suspect) skew. Here we need to add some heavy-hitter algorithm, which is potentially much more resource-intensive than reservoir sampling. The other thing is that when we suspect skew, we would need to scan the data again (which would bring the total number of scans to three).
So I would be slightly in favor of option 1 and a flag to disable it.

> RangePartitioner results in few very large tasks and many small to empty 
> tasks 
> ---
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
>Reporter: Babak Alipour
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in 
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B 
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") 
> ​But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE).sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE).orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner 
> trying to create equal ranges. [1]
> [1] 
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>  
>  The Double values I'm trying to sort are mostly in the range [0,1] (~70% of 
> the data, which roughly equates to 1 billion records); other numbers in the 
> dataset are as high as 2000. With the RangePartitioner trying to create equal 
> ranges, some tasks are becoming almost empty while others are extremely 
> large, due to the heavily skewed distribution. 
> This is either a bug in Apache Spark or a major limitation of the framework. 
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

2016-11-25 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696134#comment-15696134
 ] 

Herman van Hovell edited comment on SPARK-17788 at 11/25/16 4:56 PM:
-

Spark makes a sketch of your data as soon as you want to order the entire dataset. Based on that sketch Spark tries to create equally sized partitions. As [~holdenk] said, your problem is caused by skew (a lot of rows with the same key), and none of the current partitioning schemes can help you with this. In the short run, you could follow her suggestion and add noise to the order (this only works for global ordering and not for joins/aggregations with skewed values). In the long run, there is an ongoing effort to reduce skew for joining; see SPARK-9862 for more information.

I have created the following little Spark program to illustrate how range partitioning works:
{noformat}
import org.apache.spark.sql.Row

// Set the partitions and parallelism to relatively low value so we can read 
the results.
spark.conf.set("spark.default.parallelism", "20")
spark.conf.set("spark.sql.shuffle.partitions", "20")

// Create a skewed data frame.
val df = spark
  .range(1000)
  .select(
$"id",
(rand(34) * when($"id" % 10 <= 7, 
lit(1.0)).otherwise(lit(10.0))).as("value"))

// Make a summary per partition. The partition intervals should not overlap and 
the number of
// elements in a partition should roughly be the same for all partitions.
case class PartitionSummary(count: Long, min: Double, max: Double, range: 
Double)
val res = df.orderBy($"value").mapPartitions { iterator =>
  val (count, min, max) = iterator.foldLeft((0L, Double.PositiveInfinity, 
Double.NegativeInfinity)) {
case ((count, min, max), Row(_, value: Double)) =>
  (count + 1L, Math.min(min, value), Math.max(max, value))
  }
  Iterator.single(PartitionSummary(count, min, max, max - min))
}

// Get results and make them look nice
res.orderBy($"min")
  .select($"count", $"min".cast("decimal(5,3)"), $"max".cast("decimal(5,3)"), 
$"range".cast("decimal(5,3)"))
  .show(30)
{noformat}

This yields the following results (notice how the partition range varies and 
the row count is relatively similar):
{noformat}
+--+-+--+-+ 
| count|  min|   max|range|
+--+-+--+-+
|484005|0.000| 0.059|0.059|
|426212|0.059| 0.111|0.052|
|381796|0.111| 0.157|0.047|
|519954|0.157| 0.221|0.063|
|496842|0.221| 0.281|0.061|
|539082|0.281| 0.347|0.066|
|516798|0.347| 0.410|0.063|
|558487|0.410| 0.478|0.068|
|419825|0.478| 0.529|0.051|
|402257|0.529| 0.578|0.049|
|557225|0.578| 0.646|0.068|
|518626|0.646| 0.710|0.063|
|611478|0.710| 0.784|0.075|
|544556|0.784| 0.851|0.066|
|454356|0.851| 0.906|0.055|
|450535|0.906| 0.961|0.055|
|575996|0.961| 2.290|1.329|
|525915|2.290| 4.920|2.630|
|518757|4.920| 7.510|2.590|
|497298|7.510|10.000|2.490|
+--+-+--+-+
{noformat}


was (Author: hvanhovell):
Spark makes a sketch of your data as soon as you want to order the entire dataset. Based on that sketch Spark tries to create equally sized partitions. As [~holdenk] said, your problem is caused by skew (a lot of rows with the same key), and none of the current partitioning schemes can help you with this. In the short run, you could follow her suggestion and add noise to the order (this only works for global ordering and not for joins/aggregations with skewed values). In the long run, there is an ongoing effort to reduce skew for joining; see SPARK-9862 for more information.

I have created the following little Spark program to illustrate how range partitioning works:
{noformat}
import org.apache.spark.sql.Row

// Set the partitions and parallelism to relatively low value so we can read 
the results.
spark.conf.set("spark.default.parallelism", "20")
spark.conf.set("spark.sql.shuffle.partitions", "20")

// Create a skewed data frame.
val df = spark
  .range(1000)
  .select(
$"id",
(rand(34) * when($"id" % 10 <= 7, 
lit(1.0)).otherwise(lit(10.0))).as("value"))

// Make a summary per partition. The partition intervals should not overlap and 
the number of
// elements in a partition should roughly be the same for all partitions.
case class PartitionSummary(count: Long, min: Double, max: Double, range: 
Double)
val res = df.orderBy($"value").mapPartitions { iterator =>
  val (count, min, max) = iterator.foldLeft((0L, Double.PositiveInfinity, 
Double.NegativeInfinity)) {
case ((count, min, max), Row(_, value: Double)) =>
  (count + 1L, Math.min(min, value), Math.max(max, value))
  }
  Iterator.single(PartitionSummary(count, min, max, max - min))
}

// Get results and make them look nice
res.orderBy($"min")
  .select($"count", $"min".cast("decimal(5,3)"), $"max".cast("decimal(5,3)"), 
$"range".cast("decimal(5,3)"))
  .show(30)
{noformat}

This yields the following 

[jira] [Comment Edited] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

2016-11-25 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696134#comment-15696134
 ] 

Herman van Hovell edited comment on SPARK-17788 at 11/25/16 4:10 PM:
-

Spark makes a sketch of your data as soon as you want to order the entire dataset. Based on that sketch Spark tries to create equally sized partitions. As [~holdenk] said, your problem is caused by skew (a lot of rows with the same key), and none of the current partitioning schemes can help you with this. In the short run, you could follow her suggestion and add noise to the order (this only works for global ordering and not for joins/aggregations with skewed values). In the long run, there is an ongoing effort to reduce skew for joining; see SPARK-9862 for more information.

I have created the following little Spark program to illustrate how range partitioning works:
{noformat}
import org.apache.spark.sql.Row

// Set the partitions and parallelism to relatively low value so we can read 
the results.
spark.conf.set("spark.default.parallelism", "20")
spark.conf.set("spark.sql.shuffle.partitions", "20")

// Create a skewed data frame.
val df = spark
  .range(1000)
  .select(
$"id",
(rand(34) * when($"id" % 10 <= 7, 
lit(1.0)).otherwise(lit(10.0))).as("value"))

// Make a summary per partition. The partition intervals should not overlap and 
the number of
// elements in a partition should roughly be the same for all partitions.
case class PartitionSummary(count: Long, min: Double, max: Double, range: 
Double)
val res = df.orderBy($"value").mapPartitions { iterator =>
  val (count, min, max) = iterator.foldLeft((0L, Double.PositiveInfinity, 
Double.NegativeInfinity)) {
case ((count, min, max), Row(_, value: Double)) =>
  (count + 1L, Math.min(min, value), Math.max(max, value))
  }
  Iterator.single(PartitionSummary(count, min, max, max - min))
}

// Get results and make them look nice
res.orderBy($"min")
  .select($"count", $"min".cast("decimal(5,3)"), $"max".cast("decimal(5,3)"), 
$"range".cast("decimal(5,3)"))
  .show(30)
{noformat}

This yields the following results (notice how the partition range varies and 
the row count is relatively similar):
{noformat}
+--+-+--+-+ 
| count|  min|   max|range|
+--+-+--+-+
|484005|0.000| 0.059|0.059|
|426212|0.059| 0.111|0.052|
|381796|0.111| 0.157|0.047|
|519954|0.157| 0.221|0.063|
|496842|0.221| 0.281|0.061|
|539082|0.281| 0.347|0.066|
|516798|0.347| 0.410|0.063|
|558487|0.410| 0.478|0.068|
|419825|0.478| 0.529|0.051|
|402257|0.529| 0.578|0.049|
|557225|0.578| 0.646|0.068|
|518626|0.646| 0.710|0.063|
|611478|0.710| 0.784|0.075|
|544556|0.784| 0.851|0.066|
|454356|0.851| 0.906|0.055|
|450535|0.906| 0.961|0.055|
|575996|0.961| 2.290|1.329|
|525915|2.290| 4.920|2.630|
|518757|4.920| 7.510|2.590|
|497298|7.510|10.000|2.490|
+--+-+--+-+
{noformat}


was (Author: hvanhovell):
Spark makes a sketch of your data as soon as you want to order the entire dataset. Based on that sketch Spark tries to create equally sized partitions. As [~holdenk] said, your problem is caused by skew (a lot of rows with the same key), and none of the current partitioning schemes can help you with this. In the short run, you could follow her suggestion and add noise to the order (this only works for global ordering and not for joins/aggregations with skewed values). In the long run, there is an ongoing effort to reduce skew for joining; see SPARK-9862 for more information.

I have created the following little Spark program to illustrate how range partitioning works:
{noformat}
import org.apache.spark.sql.Row

// Set the partitions and parallelism to relatively low value so we can read 
the results.
spark.conf.set("spark.default.parallelism", "20")
spark.conf.set("spark.sql.shuffle.partitions", "20")

// Create a skewed data frame.
val df = spark
  .range(1000)
  .select(
$"id",
(rand(34) * when($"id" % 10 <= 7, 
lit(1.0)).otherwise(lit(10.0))).as("value"))

// Make a summary per partition. The partition intervals should not overlap and 
the number of
// elements in a partition should roughly be the same for all partitions.
case class PartitionSummary(count: Long, min: Double, max: Double, range: 
Double)
val res = df.orderBy($"value").mapPartitions { iterator =>
  val (count, min, max) = iterator.foldLeft((0L, Double.PositiveInfinity, 
Double.NegativeInfinity)) {
case ((count, min, max), Row(_, value: Double)) =>
  (count + 1L, Math.min(min, value), Math.max(max, value))
  }
  Iterator.single(PartitionSummary(count, min, max, max - min))
}

// Get results and make them look nice
res.orderBy($"min")
  .select($"count", $"min".cast("decimal(5,3)"), $"max".cast("decimal(5,3)"), 
$"range".cast("decimal(5,3)"))
  .show(30)
{noformat}

This yields the following 

[jira] [Reopened] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

2016-11-25 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reopened SPARK-17788:
-

This is somewhat distinct from the join case, but certainly related.

> RangePartitioner results in few very large tasks and many small to empty 
> tasks 
> ---
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
>Reporter: Babak Alipour
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in 
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B 
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") 
> ​But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE).sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE).orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner 
> trying to create equal ranges. [1]
> [1] 
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>  
>  The Double values I'm trying to sort are mostly in the range [0,1] (~70% of 
> the data, which roughly equates to 1 billion records); other numbers in the 
> dataset are as high as 2000. With the RangePartitioner trying to create equal 
> ranges, some tasks are becoming almost empty while others are extremely 
> large, due to the heavily skewed distribution. 
> This is either a bug in Apache Spark or a major limitation of the framework. 
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

2016-11-25 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696188#comment-15696188
 ] 

holdenk commented on SPARK-17788:
-

I don't think this is a duplicate - it's related, but a join doesn't necessarily 
use a range partitioner and sortBy is a different operation. I agree the 
potential solutions could share a lot of the same underlying implementation.

> RangePartitioner results in few very large tasks and many small to empty 
> tasks 
> ---
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
>Reporter: Babak Alipour
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in 
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B 
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") 
> ​But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE).sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE).orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner 
> trying to create equal ranges. [1]
> [1] 
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>  
>  The Double values I'm trying to sort are mostly in the range [0,1] (~70% of 
> the data, which roughly equates to 1 billion records); other numbers in the 
> dataset are as high as 2000. With the RangePartitioner trying to create equal 
> ranges, some tasks are becoming almost empty while others are extremely 
> large, due to the heavily skewed distribution. 
> This is either a bug in Apache Spark or a major limitation of the framework. 
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18220) ClassCastException occurs when using select query on ORC file

2016-11-25 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-18220:
--
Description: 
Error message is below.
{noformat}
==
16/11/02 16:38:09 INFO ReaderImpl: Reading ORC rows from hdfs://xxx/part-00022 
with {include: [true], offset: 0, length: 9223372036854775807}
16/11/02 16:38:09 INFO Executor: Finished task 17.0 in stage 22.0 (TID 42). 
1220 bytes result sent to driver
16/11/02 16:38:09 INFO TaskSetManager: Finished task 17.0 in stage 22.0 (TID 
42) in 116 ms on localhost (executor driver) (19/20)
16/11/02 16:38:09 ERROR Executor: Exception in task 10.0 in stage 22.0 (TID 35)
java.lang.ClassCastException: 
org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to 
org.apache.hadoop.io.Text
at 
org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
at 
org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:526)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

ORC dump info.
==
File Version: 0.12 with HIVE_8732
16/11/02 16:39:21 INFO orc.ReaderImpl: Reading ORC rows from 
hdfs://XXX/part-0 with {include: null, offset: 0, length: 
9223372036854775807}
16/11/02 16:39:21 INFO orc.RecordReaderFactory: Schema is not specified on 
read. Using file schema.
Rows: 7
Compression: ZLIB
Compression size: 262144
Type: 
struct
{noformat}

  was:
Error message is below.
==
16/11/02 16:38:09 INFO ReaderImpl: Reading ORC rows from hdfs://xxx/part-00022 
with {include: [true], offset: 0, length: 9223372036854775807}
16/11/02 16:38:09 INFO Executor: Finished task 17.0 in stage 22.0 (TID 42). 
1220 bytes result sent to driver
16/11/02 16:38:09 INFO TaskSetManager: Finished task 17.0 in stage 22.0 (TID 
42) in 116 ms on localhost (executor driver) (19/20)
16/11/02 16:38:09 ERROR Executor: Exception in task 10.0 in stage 22.0 (TID 35)
java.lang.ClassCastException: 
org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to 
org.apache.hadoop.io.Text
at 
org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
at 
org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:526)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232)

[jira] [Commented] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

2016-11-25 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696154#comment-15696154
 ] 

Herman van Hovell commented on SPARK-17788:
---

I am closing this one as a duplicate. Feel free to reopen if you disagree.

> RangePartitioner results in few very large tasks and many small to empty 
> tasks 
> ---
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
>Reporter: Babak Alipour
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in 
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B 
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") 
> ​But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE).sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE).orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner 
> trying to create equal ranges. [1]
> [1] 
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>  
>  The Double values I'm trying to sort are mostly in the range [0,1] (~70% of 
> the data, which roughly equates to 1 billion records); other numbers in the 
> dataset are as high as 2000. With the RangePartitioner trying to create equal 
> ranges, some tasks are becoming almost empty while others are extremely 
> large, due to the heavily skewed distribution. 
> This is either a bug in Apache Spark or a major limitation of the framework. 
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

2016-11-25 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell closed SPARK-17788.
-
Resolution: Duplicate

> RangePartitioner results in few very large tasks and many small to empty 
> tasks 
> ---
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
>Reporter: Babak Alipour
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in 
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B 
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") 
> ​But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE).sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE).orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner 
> trying to create equal ranges. [1]
> [1] 
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>  
>  The Double values I'm trying to sort are mostly in the range [0,1] (~70% of 
> the data, which roughly equates to 1 billion records); other numbers in the 
> dataset are as high as 2000. With the RangePartitioner trying to create equal 
> ranges, some tasks are becoming almost empty while others are extremely 
> large, due to the heavily skewed distribution. 
> This is either a bug in Apache Spark or a major limitation of the framework. 
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

2016-11-25 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696134#comment-15696134
 ] 

Herman van Hovell commented on SPARK-17788:
---

Spark makes a sketch of your data as soon as you want to order the entire dataset. Based on that sketch Spark tries to create equally sized partitions. As [~holdenk] said, your problem is caused by skew (a lot of rows with the same key), and none of the current partitioning schemes can help you with this. In the short run, you could follow her suggestion and add noise to the order (this only works for global ordering and not for joins/aggregations with skewed values). In the long run, there is an ongoing effort to reduce skew for joining; see SPARK-9862 for more information.

I have created the following little Spark program to illustrate how range partitioning works:
{noformat}
import org.apache.spark.sql.Row

// Set the partitions and parallelism to relatively low value so we can read 
the results.
spark.conf.set("spark.default.parallelism", "20")
spark.conf.set("spark.sql.shuffle.partitions", "20")

// Create a skewed data frame.
val df = spark
  .range(1000)
  .select(
$"id",
(rand(34) * when($"id" % 10 <= 7, 
lit(1.0)).otherwise(lit(10.0))).as("value"))

// Make a summary per partition. The partition intervals should not overlap and 
the number of
// elements in a partition should roughly be the same for all partitions.
case class PartitionSummary(count: Long, min: Double, max: Double, range: 
Double)
val res = df.orderBy($"value").mapPartitions { iterator =>
  val (count, min, max) = iterator.foldLeft((0L, Double.PositiveInfinity, 
Double.NegativeInfinity)) {
case ((count, min, max), Row(_, value: Double)) =>
  (count + 1L, Math.min(min, value), Math.max(max, value))
  }
  Iterator.single(PartitionSummary(count, min, max, max - min))
}

// Get results and make them look nice
res.orderBy($"min")
  .select($"count", $"min".cast("decimal(5,3)"), $"max".cast("decimal(5,3)"), 
$"range".cast("decimal(5,3)"))
  .show(30)
{noformat}

This yields the following results (notice how the partition range varies and 
the row count is relatively similar):
{noformat}
+--+-+--+-+ 
| count|  min|   max|range|
+--+-+--+-+
|484005|0.000| 0.059|0.059|
|426212|0.059| 0.111|0.052|
|381796|0.111| 0.157|0.047|
|519954|0.157| 0.221|0.063|
|496842|0.221| 0.281|0.061|
|539082|0.281| 0.347|0.066|
|516798|0.347| 0.410|0.063|
|558487|0.410| 0.478|0.068|
|419825|0.478| 0.529|0.051|
|402257|0.529| 0.578|0.049|
|557225|0.578| 0.646|0.068|
|518626|0.646| 0.710|0.063|
|611478|0.710| 0.784|0.075|
|544556|0.784| 0.851|0.066|
|454356|0.851| 0.906|0.055|
|450535|0.906| 0.961|0.055|
|575996|0.961| 2.290|1.329|
|525915|2.290| 4.920|2.630|
|518757|4.920| 7.510|2.590|
|497298|7.510|10.000|2.490|
+--+-+--+-+
{noformat}

> RangePartitioner results in few very large tasks and many small to empty 
> tasks 
> ---
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
>Reporter: Babak Alipour
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in 
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B 
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") 
> ​But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE).sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE).orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner 
> trying to create equal ranges. [1]
> [1] 
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>  
>  The Double values I'm trying to sort are mostly in the range [0,1] (~70% of 
> the data, which roughly equates to 1 billion records); other numbers in the 
> dataset are as high as 2000. With the RangePartitioner trying to create equal 
> ranges, some tasks are becoming almost empty while others are extremely 
> large, due to the heavily skewed distribution. 
> This is either a bug in Apache Spark or a major limitation of the framework. 
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E


[jira] [Commented] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-11-25 Thread Hao Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696056#comment-15696056
 ] 

Hao Ren commented on SPARK-18581:
-

Thank you for the clarification. I totally missed that part.

I will compare the result to R.

> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training GaussianMixtureModel, I found some probabilities much larger than 
> 1. That led me to the fact that the value returned by 
> MultivariateGaussian.pdf can be 10^5, etc.
> After reviewing the code, I found that the problem lies in the computation of 
> the determinant of the covariance matrix.
> The computation is simplified by using the pseudo-determinant of a positive 
> semi-definite matrix. 
> In my case, I have a feature = 0 for all data points.
> As a result, the covariance matrix is not invertible <=> det(covariance matrix) = 
> 0 => the pseudo-determinant will be very close to zero.
> Thus, log(pseudo-determinant) will be a large negative number, which finally 
> makes logpdf very big, and the pdf will be even bigger (> 1).
> As said in the comments of MultivariateGaussian.scala, 
> """
> Singular values are considered to be non-zero only if they exceed a tolerance 
> based on machine precision.
> """
> But if a singular value is considered to be zero, that means the covariance matrix 
> is non-invertible, which contradicts the assumption that it should 
> be invertible.
> So we should check whether any singular value is smaller than the tolerance 
> before computing the pseudo-determinant.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-11-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696030#comment-15696030
 ] 

Sean Owen commented on SPARK-18581:
---

The PDF is largest at the mean, and it can be > 1 if the determinant of the 
covariance matrix is sufficiently small. This is like the univariate case where 
the variance is small - the distribution is very "peaked" at the mean and the 
PDF gets arbitrarily high.

See for example https://www.wolframalpha.com/input/?i=Plot+N(0,1e-10)

Can you compare this to the results you might get with R or something, to see if the 
numbers match? Numerical accuracy does become an issue as the matrix gets 
near-singular, but that's what this cutoff is supposed to help address.

We might be able to rearrange some of the math for better accuracy too. But 
let's first verify there's an issue.
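
As a back-of-the-envelope illustration of the univariate point (plain Scala, not Spark code):

{code}
// The density of N(mu, sigma^2) at its mean is 1 / (sigma * sqrt(2 * pi)); for a small
// sigma this is far greater than 1, even though the density still integrates to 1.
val sigma = 1e-3
val pdfAtMean = 1.0 / (sigma * math.sqrt(2.0 * math.Pi))  // roughly 398.9
{code}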

> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training GaussianMixtureModel, I found some probabilities much larger than 
> 1. That led me to the fact that the value returned by 
> MultivariateGaussian.pdf can be 10^5, etc.
> After reviewing the code, I found that the problem lies in the computation of 
> the determinant of the covariance matrix.
> The computation is simplified by using the pseudo-determinant of a positive 
> semi-definite matrix. 
> In my case, I have a feature = 0 for all data points.
> As a result, the covariance matrix is not invertible <=> det(covariance matrix) = 
> 0 => the pseudo-determinant will be very close to zero.
> Thus, log(pseudo-determinant) will be a large negative number, which finally 
> makes logpdf very big, and the pdf will be even bigger (> 1).
> As said in the comments of MultivariateGaussian.scala, 
> """
> Singular values are considered to be non-zero only if they exceed a tolerance 
> based on machine precision.
> """
> But if a singular value is considered to be zero, that means the covariance matrix 
> is non-invertible, which contradicts the assumption that it should 
> be invertible.
> So we should check whether any singular value is smaller than the tolerance 
> before computing the pseudo-determinant.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6522) Standardize Random Number Generation

2016-11-25 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696024#comment-15696024
 ] 

holdenk commented on SPARK-6522:


We have a standardized RDD generator in MLlib (see the RandomRDDs object).
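
For reference, a small usage sketch of that generator (assuming an existing {{SparkContext}} named {{sc}}):

{code}
import org.apache.spark.mllib.random.RandomRDDs

// Each partition draws from its own seeded generator, so RNG state is not shared
// across workers, and the explicit seed keeps the whole RDD reproducible.
val normals  = RandomRDDs.normalRDD(sc, 1000000L, numPartitions = 10, seed = 42L)
val uniforms = RandomRDDs.uniformRDD(sc, 1000000L, numPartitions = 10, seed = 42L)
{code}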

> Standardize Random Number Generation
> 
>
> Key: SPARK-6522
> URL: https://issues.apache.org/jira/browse/SPARK-6522
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: RJ Nowling
>Priority: Minor
> Fix For: 1.1.0
>
>
> Generation of random numbers in Spark has to be handled carefully since 
> references to RNGs copy the state to the workers.  As such, a separate RNG 
> needs to be seeded for each partition.  Each time random numbers are used in 
> Spark's libraries, the RNG seeding is re-implemented, leaving open the 
> possibility of mistakes.
> It would be useful if RNG seeding was standardized through utility functions 
> or random number generation functions that can be called in Spark pipelines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6522) Standardize Random Number Generation

2016-11-25 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-6522.
--
   Resolution: Fixed
Fix Version/s: 1.1.0

> Standardize Random Number Generation
> 
>
> Key: SPARK-6522
> URL: https://issues.apache.org/jira/browse/SPARK-6522
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: RJ Nowling
>Priority: Minor
> Fix For: 1.1.0
>
>
> Generation of random numbers in Spark has to be handled carefully since 
> references to RNGs copy the state to the workers.  As such, a separate RNG 
> needs to be seeded for each partition.  Each time random numbers are used in 
> Spark's libraries, the RNG seeding is re-implemented, leaving open the 
> possibility of mistakes.
> It would be useful if RNG seeding was standardized through utility functions 
> or random number generation functions that can be called in Spark pipelines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle

2016-11-25 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696016#comment-15696016
 ] 

holdenk commented on SPARK-5997:


That could work, although we'd probably want a different API and we'd need to 
be clear that the result doesn't have a known partitioner.
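
For context, a quick sketch of the current behaviour under discussion (illustrative RDD API usage, not a proposed change; assumes a SparkContext {{sc}}):

{code}
val rdd = sc.parallelize(1 to 1000, 4)

rdd.coalesce(16).getNumPartitions                  // still 4: without a shuffle, coalesce can only decrease
rdd.coalesce(16, shuffle = true).getNumPartitions  // 16, at the cost of a full shuffle
rdd.repartition(16).getNumPartitions               // 16, equivalent to coalesce(16, shuffle = true)
{code}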

> Increase partition count without performing a shuffle
> -
>
> Key: SPARK-5997
> URL: https://issues.apache.org/jira/browse/SPARK-5997
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Andrew Ash
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
> user has the ability to choose whether or not to perform a shuffle.  However 
> when increasing partition count there is no option of whether to perform a 
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call 
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition 
> enough that the .toLocalIterator has significantly reduced memory pressure on 
> the driver, as it loads a partition at a time into the driver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3348) Support user-defined SparkListeners properly

2016-11-25 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-3348.

Resolution: Duplicate

> Support user-defined SparkListeners properly
> 
>
> Key: SPARK-3348
> URL: https://issues.apache.org/jira/browse/SPARK-3348
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>
> Because of the current initialization ordering, user-defined SparkListeners 
> do not receive certain events that are posted before application code is run. 
> We need to expose a constructor that allows the given SparkListeners to 
> receive all events.
> There has been interest in this for a while, but I have searched through the 
> JIRA history and have not found a related issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5190) Allow spark listeners to be added before spark context gets initialized.

2016-11-25 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15696003#comment-15696003
 ] 

holdenk commented on SPARK-5190:


This seems to be fixed, but we forgot to close (cc [~joshrosen])

> Allow spark listeners to be added before spark context gets initialized.
> 
>
> Key: SPARK-5190
> URL: https://issues.apache.org/jira/browse/SPARK-5190
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Kostas Sakellis
>
> Currently, you need the spark context to add spark listener events. But, if 
> you wait until the spark context gets created before adding your listener you 
> might miss events like blockManagerAdded or executorAdded. We should fix this 
> so you can attach a listener to the spark context before it starts any 
> initialization. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18588) KafkaSourceStressForDontFailOnDataLossSuite is flaky

2016-11-25 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-18588:
-

 Summary: KafkaSourceStressForDontFailOnDataLossSuite is flaky
 Key: SPARK-18588
 URL: https://issues.apache.org/jira/browse/SPARK-18588
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Reporter: Herman van Hovell


https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite_name=stress+test+for+failOnDataLoss%3Dfalse



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18588) KafkaSourceStressForDontFailOnDataLossSuite is flaky

2016-11-25 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695975#comment-15695975
 ] 

Herman van Hovell commented on SPARK-18588:
---

cc [~zsxwing]

> KafkaSourceStressForDontFailOnDataLossSuite is flaky
> 
>
> Key: SPARK-18588
> URL: https://issues.apache.org/jira/browse/SPARK-18588
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Herman van Hovell
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite_name=stress+test+for+failOnDataLoss%3Dfalse



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-636) Add mechanism to run system management/configuration tasks on all workers

2016-11-25 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695967#comment-15695967
 ] 

holdenk commented on SPARK-636:
---

If you have a logging system you want to initialize, wouldn't using an object 
with lazy initialization on first use be sufficient?
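
A minimal sketch of that pattern (assuming some RDD {{rdd}} whose tasks need the setup; the object and property names are made up for illustration):

{code}
// A singleton whose lazy val body runs at most once per executor JVM, the first time
// any task on that executor references it.
object LoggingSetup {
  lazy val init: Unit = {
    System.setProperty("example.logging.configured", "true")  // one-time, per-JVM setup
  }
}

rdd.foreachPartition { _ => LoggingSetup.init }  // touching it initializes every executor that runs a task
{code}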

> Add mechanism to run system management/configuration tasks on all workers
> -
>
> Key: SPARK-636
> URL: https://issues.apache.org/jira/browse/SPARK-636
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Josh Rosen
>
> It would be useful to have a mechanism to run a task on all workers in order 
> to perform system management tasks, such as purging caches or changing system 
> properties.  This is useful for automated experiments and benchmarking; I 
> don't envision this being used for heavy computation.
> Right now, I can mimic this with something like
> {code}
> sc.parallelize(0 until numMachines, numMachines).foreach { } 
> {code}
> but this does not guarantee that every worker runs a task and requires my 
> user code to know the number of workers.
> One sample use case is setup and teardown for benchmark tests.  For example, 
> I might want to drop cached RDDs, purge shuffle data, and call 
> {{System.gc()}} between test runs.  It makes sense to incorporate some of 
> this functionality, such as dropping cached RDDs, into Spark itself, but it 
> might be helpful to have a general mechanism for running ad-hoc tasks like 
> {{System.gc()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

2016-11-25 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695956#comment-15695956
 ] 

holdenk commented on SPARK-17788:
-

This is semi-expected behaviour: the range partitioner (and really all Spark 
partitioners) doesn't support splitting a single key across partitions (e.g. if 70% of your 
data has the same key and you are partitioning on that key, 70% of that data is 
going to end up in the same partition).

We could try to fix this in a few ways: either by having Spark SQL do 
something special in this case, by having Spark's sortBy automatically add 
"noise" to the key when the sampling indicates there is too much data for a 
given key (a rough manual version of that idea is sketched below), or by allowing 
partitioners to be non-deterministic and updating the general sortBy logic in Spark.

I think this would be something good for us to consider - but it's probably 
going to take a while (and certainly not in time for 2.1.0).
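
A rough, hand-rolled sketch of the "noise" idea for a DataFrame {{df}} sorted on a skewed column {{value}} (column names are made up for illustration):

{code}
import org.apache.spark.sql.functions.rand

// Add a random salt as a secondary sort key: the global order on "value" is kept,
// but runs of identical values can now be split across range-partition boundaries.
val salted = df.withColumn("__salt", rand(7L))
val sorted = salted.orderBy($"value", $"__salt").drop("__salt")
{code}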

> RangePartitioner results in few very large tasks and many small to empty 
> tasks 
> ---
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
>Reporter: Babak Alipour
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in 
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B 
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") 
> ​But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE).sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE).orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner 
> trying to create equal ranges. [1]
> [1] 
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>  
>  The Double values I'm trying to sort are mostly in the range [0,1] (~70% of 
> the data, which roughly equates to 1 billion records); other numbers in the 
> dataset are as high as 2000. With the RangePartitioner trying to create equal 
> ranges, some tasks are becoming almost empty while others are extremely 
> large, due to the heavily skewed distribution. 
> This is either a bug in Apache Spark or a major limitation of the framework. 
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

2016-11-25 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-17788:

Target Version/s:   (was: 2.1.0)

> RangePartitioner results in few very large tasks and many small to empty 
> tasks 
> ---
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
>Reporter: Babak Alipour
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in 
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B 
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") 
> ​But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE).sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE).orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner 
> trying to create equal ranges. [1]
> [1] 
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>  
>  The Double values I'm trying to sort are mostly in the range [0,1] (~70% of 
> the data, which roughly equates to 1 billion records); other numbers in the 
> dataset are as high as 2000. With the RangePartitioner trying to create equal 
> ranges, some tasks are becoming almost empty while others are extremely 
> large, due to the heavily skewed distribution. 
> This is either a bug in Apache Spark or a major limitation of the framework. 
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-11-25 Thread Hao Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695942#comment-15695942
 ] 

Hao Ren edited comment on SPARK-18581 at 11/25/16 2:14 PM:
---

After reading the code comments, I find it takes into consideration the 
degenerate case of multivariate normal distribution:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case
I agree that the covariance matrix need not be invertible.

However, the pdf of a Gaussian should always be smaller than 1, shouldn't it?
According to Wikipedia, I can't see any reason the pdf could be larger than 1.
Maybe I missed something.

Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function:

The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' 
as the following:

DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 
0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 
0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 
0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 
0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 
0.19116928711151343, 0.19218984168511, 0.22044130291811304, 
0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 
0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234)

Meanwhile, the non-zero tolerance = 1.8514678433708895E-13

thus, 
{code}
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

logPseudoDetSigma = -58.40781006437829

-0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 
6.230441702072326 = u (same variable name in the code)

Knowing that

{code}
  private[mllib] def logpdf(x: BV[Double]): Double = {
val delta = x - breezeMu
val v = rootSigmaInv * delta
u + v.t * v * -0.5 // u is used here
  }
{code}

If {{v.t * v * -0.5}} is a small negative number for a given x, then the logpdf 
will be about 6, so pdf = exp(6) = 403.4287934927351.

In the Gaussian mixture model case, some of the Gaussian distributions can have 
a much bigger 'u' value, which results in a pdf around 2E10.
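For what it's worth, the value of u can be re-derived from the numbers above; the 25 
eigenvalues listed imply mu.size = 25. A quick check in a Scala REPL:
{code}
val logPseudoDetSigma = -58.40781006437829
val u = -0.5 * (25 * math.log(2.0 * math.Pi) + logPseudoDetSigma)
// u = 6.230441702072326, matching the value quoted above; exp(u) is roughly 508,
// so the density exceeds 1 whenever v.t * v is small.
math.exp(u)
{code}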



was (Author: invkrh):
After reading the code comments, I find it takes into consideration the 
degenerate case of multivariate normal distribution:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case
I agree that the covariance matrix need not be invertible.

However, the pdf of gaussian should always be smaller than 1, shouldn't it ?
According to wikipedia, I can't see any reason the pdf could be larger than 1.

Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function:

The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' 
as the following:

DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 
0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 
0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 
0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 
0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 
0.19116928711151343, 0.19218984168511, 0.22044130291811304, 
0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 
0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234)

Meanwhile, the non-zero tolerance = 1.8514678433708895E-13

thus, 
{code}
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

logPseudoDetSigma = -58.40781006437829

-0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 
6.230441702072326 = u (same variable name in the code)

Knowing that

{code}
  private[mllib] def logpdf(x: BV[Double]): Double = {
val delta = x - breezeMu
val v = rootSigmaInv * delta
u + v.t * v * -0.5 // u is used here
  }
{code}

If [ v.t * v * -0.5 ] is a small negative number for a given x, then the logpdf 
will be about 6 => pdf = exp(6) = 403.4287934927351

In the gaussian mixture model case, some of the gaussian distributions could 
have a 'u' value much bigger, which results in a pdf = 2E10


> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training GaussianMixtureModel, I found some probability much larger than 
> 1. That leads me to that fact that, the value returned by 
> MultivariateGaussian.pdf can be 10^5, etc.
> After reviewing the code, I found that problem lies in the computation of 
> determinant of the covariance matrix.
> The computation is simplified by using pseudo-determinant of a positive 
> 

[jira] [Comment Edited] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-11-25 Thread Hao Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695942#comment-15695942
 ] 

Hao Ren edited comment on SPARK-18581 at 11/25/16 2:09 PM:
---

After reading the code comments, I find it takes into consideration the 
degenerate case of multivariate normal distribution:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case
I agree that the covariance matrix need not be invertible.

However, the pdf of gaussian should always be smaller than 1, shouldn't it ?
According to wikipedia, I can't see any reason the pdf could be larger than 1.

Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function:

The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' 
as the following:

DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 
0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 
0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 
0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 
0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 
0.19116928711151343, 0.19218984168511, 0.22044130291811304, 
0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 
0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234)

Meanwhile, the non-zero tolerance = 1.8514678433708895E-13

thus, 
{code}
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

logPseudoDetSigma = -58.40781006437829

-0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 
6.230441702072326 = u (same variable name in the code)

Knowing that

{code}
  private[mllib] def logpdf(x: BV[Double]): Double = {
val delta = x - breezeMu
val v = rootSigmaInv * delta
u + v.t * v * -0.5 // u is used here
  }
{code}

If [ v.t * v * -0.5 ] is a small negative number for a given x, then the logpdf 
will be about 6 => pdf = exp(6) = 403.4287934927351

In the gaussian mixture model case, some of the gaussian distributions could 
have a 'u' value much bigger, which results in a pdf = 2E10



was (Author: invkrh):
After reading the code comments, I find it takes into consideration the 
degenerate case of multivariate normal distribution:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case
I agree that the covariance matrix need not be invertible.

However, the pdf of gaussian should always be smaller than 1, shouldn't it ?

Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function:

The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' 
as the following:

DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 
0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 
0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 
0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 
0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 
0.19116928711151343, 0.19218984168511, 0.22044130291811304, 
0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 
0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234)

Meanwhile, the non-zero tolerance = 1.8514678433708895E-13

thus, 
{code}
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

logPseudoDetSigma = -58.40781006437829

-0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 
6.230441702072326 = u (same variable name in the code)

Knowing that

{code}
  private[mllib] def logpdf(x: BV[Double]): Double = {
val delta = x - breezeMu
val v = rootSigmaInv * delta
u + v.t * v * -0.5 // u is used here
  }
{code}

If [ v.t * v * -0.5 ] is a small negative number for a given x, then the logpdf 
will be about 6 => pdf = exp(6) = 403.4287934927351

In the gaussian mixture model case, some of the gaussian distributions could 
have a 'u' value much bigger, which results in a pdf = 2E10


> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training GaussianMixtureModel, I found some probability much larger than 
> 1. That leads me to that fact that, the value returned by 
> MultivariateGaussian.pdf can be 10^5, etc.
> After reviewing the code, I found that problem lies in the computation of 
> determinant of the covariance matrix.
> The computation is simplified by using pseudo-determinant of a positive 
> defined matrix. 
> In my case, I have a feature = 0 for all data point.
> As a result, covariance matrix 

[jira] [Commented] (SPARK-18554) leader master lost the leadership, when the slave becomes master, the previous app's state displays as waiting

2016-11-25 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695947#comment-15695947
 ] 

Saisai Shao commented on SPARK-18554:
-

Nothing is actually blocking it; no one has reviewed that PR yet, and it is not a big 
issue (only the state is not shown correctly).

> leader master lost the leadership, when the slave becomes master, the 
> previous app's state displays as waiting
> --
>
> Key: SPARK-18554
> URL: https://issues.apache.org/jira/browse/SPARK-18554
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Web UI
>Affects Versions: 1.6.1
> Environment: java1.8
>Reporter: liujianhui
>Priority: Minor
>
> When the leader master loses its leadership and the standby becomes the 
> master, the state of the app in the web UI will display WAITING; the code is 
> as follows:
> {code}
> case MasterChangeAcknowledged(appId) => {
>   idToApp.get(appId) match {
>     case Some(app) =>
>       logInfo("Application has been re-registered: " + appId)
>       app.state = ApplicationState.WAITING
>     case None =>
>       logWarning("Master change ack from unknown app: " + appId)
>   }
>   if (canCompleteRecovery) { completeRecovery() }
> }
> {code}
> The state of the app should be RUNNING instead of WAITING.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-11-25 Thread Hao Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695942#comment-15695942
 ] 

Hao Ren edited comment on SPARK-18581 at 11/25/16 2:08 PM:
---

After reading the code comments, I find it takes into consideration the 
degenerate case of multivariate normal distribution:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case
I agree that the covariance matrix need not be invertible.

However, the pdf of gaussian should always be smaller than 1, shouldn't it ?

Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function:

The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' 
as the following:

DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 
0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 
0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 
0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 
0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 
0.19116928711151343, 0.19218984168511, 0.22044130291811304, 
0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 
0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234)

Meanwhile, the non-zero tolerance = 1.8514678433708895E-13

thus, 
{code}
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

logPseudoDetSigma = -58.40781006437829

-0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 
6.230441702072326 = u (same variable name in the code)

Knowing that

{code}
  private[mllib] def logpdf(x: BV[Double]): Double = {
val delta = x - breezeMu
val v = rootSigmaInv * delta
u + v.t * v * -0.5 // u is used here
  }
{code}

If [ v.t * v * -0.5 ] is a small negative number for a given x, then the logpdf 
will be about 6 => pdf = exp(6) = 403.4287934927351

In the gaussian mixture model case, some of the gaussian distributions could 
have a 'u' value much bigger, which results in a pdf = 2E10



was (Author: invkrh):
After reading the code comments, I find it takes into consideration the 
degenerate case of multivariate normal distribution:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case
I agree that the covariance matrix need not be invertible.

However, the pdf of gaussian should always be smaller than 1, shouldn't it ?

Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function:

The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' 
as the following:

DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 
0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 
0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 
0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 
0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 
0.19116928711151343, 0.19218984168511, 0.22044130291811304, 
0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 
0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234)

Meanwhile, the non-zero tolerance = 1.8514678433708895E-13

thus, 
{code}
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

logPseudoDetSigma = -58.40781006437829

-0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 
6.230441702072326 = u (same variable name in the code)

Knowing that

{code}
  private[mllib] def logpdf(x: BV[Double]): Double = {
val delta = x - breezeMu
val v = rootSigmaInv * delta
u + v.t * v * -0.5 // u is used here
  }
{code}

If  {code}v.t * v * -0.5{code} is a small negative number, then the logpdf will 
be about 6 => pdf = exp(6) = 403.4287934927351

In the gaussian mixture model case, some of the gaussian distributions could 
have a 'u' value much bigger, which results in a pdf = 2E10


> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training GaussianMixtureModel, I found some probability much larger than 
> 1. That leads me to that fact that, the value returned by 
> MultivariateGaussian.pdf can be 10^5, etc.
> After reviewing the code, I found that problem lies in the computation of 
> determinant of the covariance matrix.
> The computation is simplified by using pseudo-determinant of a positive 
> defined matrix. 
> In my case, I have a feature = 0 for all data point.
> As a result, covariance matrix is not invertible <=> det(covariance matrix) = 
> 0 => pseudo-determinant will be 

[jira] [Comment Edited] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-11-25 Thread Hao Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695942#comment-15695942
 ] 

Hao Ren edited comment on SPARK-18581 at 11/25/16 2:07 PM:
---

After reading the code comments, I find it takes into consideration the 
degenerate case of multivariate normal distribution:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case
I agree that the covariance matrix need not be invertible.

However, the pdf of gaussian should always be smaller than 1, shouldn't it ?

Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function:

The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' 
as the following:

DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 
0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 
0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 
0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 
0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 
0.19116928711151343, 0.19218984168511, 0.22044130291811304, 
0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 
0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234)

Meanwhile, the non-zero tolerance = 1.8514678433708895E-13

thus, 
{code}
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

logPseudoDetSigma = -58.40781006437829

-0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 
6.230441702072326 = u (same variable name in the code)

Knowing that

{code}
  private[mllib] def logpdf(x: BV[Double]): Double = {
val delta = x - breezeMu
val v = rootSigmaInv * delta
u + v.t * v * -0.5 // u is used here
  }
{code}

If  {code}v.t * v * -0.5{code} is a small negative number, then the logpdf will 
be about 6 => pdf = exp(6) = 403.4287934927351

In the gaussian mixture model case, some of the gaussian distributions could 
have a 'u' value much bigger, which results in a pdf = 2E10



was (Author: invkrh):
After reading the code comments, I find it takes into consideration the 
degenerate case of multivariate normal distribution:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case
I agree that the covariance matrix need not be invertible.

However, the pdf of gaussian should always be smaller than 1, shouldn't it ?

Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function:

The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' 
as the following:

DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 
0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 
0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 
0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 
0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 
0.19116928711151343, 0.19218984168511, 0.22044130291811304, 
0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 
0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234)

Meanwhile, the non-zero tolerance = 1.8514678433708895E-13

thus, 
{code}
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

logPseudoDetSigma = -58.40781006437829

-0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 
6.230441702072326 = u (this variable name in the code)

Knowing that

{code}
  private[mllib] def logpdf(x: BV[Double]): Double = {
val delta = x - breezeMu
val v = rootSigmaInv * delta
u + v.t * v * -0.5 // u is used here
  }
{code}

If  {code}v.t * v * -0.5{code} is a small negative number, then the logpdf will 
be about 6 => pdf = exp(6) = 403.4287934927351

In the gaussian mixture model case, some of the gaussian distributions could 
have a 'u' value much bigger, which results in a pdf = 2E10


> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training GaussianMixtureModel, I found some probability much larger than 
> 1. That leads me to that fact that, the value returned by 
> MultivariateGaussian.pdf can be 10^5, etc.
> After reviewing the code, I found that problem lies in the computation of 
> determinant of the covariance matrix.
> The computation is simplified by using pseudo-determinant of a positive 
> defined matrix. 
> In my case, I have a feature = 0 for all data point.
> As a result, covariance matrix is not invertible <=> det(covariance matrix) = 
> 0 => pseudo-determinant will be very 

[jira] [Comment Edited] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-11-25 Thread Hao Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695942#comment-15695942
 ] 

Hao Ren edited comment on SPARK-18581 at 11/25/16 2:06 PM:
---

After reading the code comments, I find it takes into consideration the 
degenerate case of multivariate normal distribution:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case
I agree that the covariance matrix need not be invertible.

However, the pdf of gaussian should always be smaller than 1, shouldn't it ?

Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function:

The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' 
as the following:

DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 
0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 
0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 
0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 
0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 
0.19116928711151343, 0.19218984168511, 0.22044130291811304, 
0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 
0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234)

Meanwhile, the non-zero tolerance = 1.8514678433708895E-13

thus, 
{code}
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

logPseudoDetSigma = -58.40781006437829

-0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 
6.230441702072326 = u (this variable name in the code)

Knowing that

{code}
  private[mllib] def logpdf(x: BV[Double]): Double = {
val delta = x - breezeMu
val v = rootSigmaInv * delta
u + v.t * v * -0.5 // u is used here
  }
{code}

If  {code}v.t * v * -0.5{code} is a small negative number, then the logpdf will 
be about 6 => pdf = exp(6) = 403.4287934927351

In the gaussian mixture model case, some of the gaussian distributions could 
have a 'u' value much bigger, which results in a pdf = 2E10



was (Author: invkrh):
After reading the code comments, I find it takes into consideration on 
degenerate case of multivariate normal distribution:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case
I agree that the covariance matrix need not be invertible.

However, the pdf of gaussian should always be smaller than 1, shouldn't it ?

Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function:

The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' 
as the following:

DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 
0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 
0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 
0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 
0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 
0.19116928711151343, 0.19218984168511, 0.22044130291811304, 
0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 
0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234)

Meanwhile, the non-zero tolerance = 1.8514678433708895E-13

thus, 
{code}
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

logPseudoDetSigma = -58.40781006437829

-0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 
6.230441702072326 = u (this variable name in the code)

Knowing that

{code}
  private[mllib] def logpdf(x: BV[Double]): Double = {
val delta = x - breezeMu
val v = rootSigmaInv * delta
u + v.t * v * -0.5 // u is used here
  }
{code}

If  {code}v.t * v * -0.5{code} is a small negative number, then the logpdf will 
be about 6 => pdf = exp(6) = 403.4287934927351

In the gaussian mixture model case, some of the gaussian distributions could 
have a 'u' value much bigger, which results in a pdf = 2E10


> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training GaussianMixtureModel, I found some probability much larger than 
> 1. That leads me to that fact that, the value returned by 
> MultivariateGaussian.pdf can be 10^5, etc.
> After reviewing the code, I found that problem lies in the computation of 
> determinant of the covariance matrix.
> The computation is simplified by using pseudo-determinant of a positive 
> defined matrix. 
> In my case, I have a feature = 0 for all data point.
> As a result, covariance matrix is not invertible <=> det(covariance matrix) = 
> 0 => pseudo-determinant will be very 

[jira] [Updated] (SPARK-18108) Partition discovery fails with explicitly written long partitions

2016-11-25 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-18108:

Component/s: (was: Spark Core)
 SQL

> Partition discovery fails with explicitly written long partitions
> -
>
> Key: SPARK-18108
> URL: https://issues.apache.org/jira/browse/SPARK-18108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Richard Moorhead
>Priority: Minor
> Attachments: stacktrace.out
>
>
> We have parquet data written from Spark 1.6 that, when read from 2.0.1, 
> produces errors.
> {code}
> case class A(a: Long, b: Int)
> val as = Seq(A(1,2))
> //partition explicitly written
> spark.createDataFrame(as).write.parquet("/data/a=1/")
> spark.read.parquet("/data/").collect
> {code}
> The above code fails; stack trace attached. 
> If an integer is used, explicit partition discovery succeeds.
> {code}
> case class A(a: Int, b: Int)
> val as = Seq(A(1,2))
> //partition explicitly written
> spark.createDataFrame(as).write.parquet("/data/a=1/")
> spark.read.parquet("/data/").collect
> {code}
> The action succeeds. Additionally, if 'partitionBy' is used instead of 
> explicit writes, partition discovery succeeds. 
> Question: Is the first example a reasonable use case? 
> [PartitioningUtils|https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L319]
>  seems to default to Integer types unless the partition value exceeds the 
> integer type's range.
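For reference, a minimal sketch of the partitionBy variant mentioned above, which the 
report says is handled correctly (the output path is a placeholder, and a spark-shell 
session is assumed):
{code}
case class A(a: Long, b: Int)
val as = Seq(A(1, 2))

// Let Spark lay out the partition directories itself instead of writing to an
// explicit ".../a=1/" path by hand.
spark.createDataFrame(as).write.partitionBy("a").parquet("/data/")
spark.read.parquet("/data/").collect()
{code}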



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-11-25 Thread Hao Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695942#comment-15695942
 ] 

Hao Ren edited comment on SPARK-18581 at 11/25/16 2:05 PM:
---

After reading the code comments, I see that it takes the degenerate case of the 
multivariate normal distribution into consideration:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case
I agree that the covariance matrix need not be invertible.

However, the pdf of a Gaussian should always be smaller than 1, shouldn't it?

Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function:

The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' 
as the following:

DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 
0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 
0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 
0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 
0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 
0.19116928711151343, 0.19218984168511, 0.22044130291811304, 
0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 
0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234)

Meanwhile, the non-zero tolerance = 1.8514678433708895E-13

thus, 
{code}
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

logPseudoDetSigma = -58.40781006437829

-0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 
6.230441702072326 = u (this variable name in the code)

Knowing that

{code}
  private[mllib] def logpdf(x: BV[Double]): Double = {
val delta = x - breezeMu
val v = rootSigmaInv * delta
u + v.t * v * -0.5 // u is used here
  }
{code}

If  {code}v.t * v * -0.5{code} is a small negative number, then the logpdf will 
be about 6 => pdf = exp(6) = 403.4287934927351

In the gaussian mixture model case, some of the gaussian distributions could 
have a 'u' value much bigger, which results in a pdf = 2E10



was (Author: invkrh):
After reading the code comments, I find it takes into consideration on 
degenerate case of multivariate normal distribution:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case
I agree that the covariance matrix need not be invertible.

However, the pdf of gaussian should always be smaller than 1, shouldn't it ?

Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function:

The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' 
as the following:

DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 
0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 
0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 
0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 
0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 
0.19116928711151343, 0.19218984168511, 0.22044130291811304, 
0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 
0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234)

Meanwhile, the non-zero tolerance = 1.8514678433708895E-13

thus, 
{code}
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

logPseudoDetSigma = -58.40781006437829

-0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 
6.230441702072326 = u (this variable name in the code)

Knowing that

{code}
  private[mllib] def logpdf(x: BV[Double]): Double = {
val delta = x - breezeMu
val v = rootSigmaInv * delta
u + v.t * v * -0.5 // u is used here
  }
{code}

If  `v.t * v * -0.5` is a small negative number, then the logpdf will be about 
6 => pdf = exp(6) = 403.4287934927351

In the gaussian mixture model case, some of the gaussian distributions could 
have a 'u' value much bigger, which results in a pdf = 2E10


> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training GaussianMixtureModel, I found some probability much larger than 
> 1. That leads me to that fact that, the value returned by 
> MultivariateGaussian.pdf can be 10^5, etc.
> After reviewing the code, I found that problem lies in the computation of 
> determinant of the covariance matrix.
> The computation is simplified by using pseudo-determinant of a positive 
> defined matrix. 
> In my case, I have a feature = 0 for all data point.
> As a result, covariance matrix is not invertible <=> det(covariance matrix) = 
> 0 => pseudo-determinant will be very close to 

[jira] [Comment Edited] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-11-25 Thread Hao Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695942#comment-15695942
 ] 

Hao Ren edited comment on SPARK-18581 at 11/25/16 2:04 PM:
---

After reading the code comments, I see that it takes the degenerate case of the 
multivariate normal distribution into consideration:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case
I agree that the covariance matrix need not be invertible.

However, the pdf of a Gaussian should always be smaller than 1, shouldn't it?

Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function:

The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' 
as the following:

DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 
0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 
0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 
0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 
0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 
0.19116928711151343, 0.19218984168511, 0.22044130291811304, 
0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 
0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234)

Meanwhile, the non-zero tolerance = 1.8514678433708895E-13

thus, 
{code}
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

logPseudoDetSigma = -58.40781006437829

-0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 
6.230441702072326 = u (this variable name in the code)

Knowing that

{code}
  private[mllib] def logpdf(x: BV[Double]): Double = {
val delta = x - breezeMu
val v = rootSigmaInv * delta
u + v.t * v * -0.5 // u is used here
  }
{code}

If  `v.t * v * -0.5` is a small negative number, then the logpdf will be about 
6 => pdf = exp(6) = 403.4287934927351

In the gaussian mixture model case, some of the gaussian distributions could 
have a 'u' value much bigger, which results in a pdf = 2E10



was (Author: invkrh):
After reading the code comments, I find it takes into consideration on 
degenerate case of multivariate normal distribution:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case
I agree that the covariance matrix need not be invertible.

However, the pdf of gaussian should always be smaller than 1, shouldn't it ?

Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function:

The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' 
as the following:

DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 
0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 
0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 
0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 
0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 
0.19116928711151343, 0.19218984168511, 0.22044130291811304, 
0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 
0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234)

Meanwhile, the non-zero tolerance = 1.8514678433708895E-13

thus, 
{code}
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

logPseudoDetSigma = -58.40781006437829

-0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 
6.230441702072326 = u (this variable name in the code)

Knowing that

{code}
  private[mllib] def logpdf(x: BV[Double]): Double = {
val delta = x - breezeMu
val v = rootSigmaInv * delta
u + v.t * v * -0.5 // u is used here
  }
{code}

If { v.t * v * -0.5 } is a small negative number, then the logpdf will be about 
6 => pdf = exp(6) = 403.4287934927351

In the gaussian mixture model case, some of the gaussian distributions could 
have a 'u' value much bigger, which results in a pdf = 2E10


> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training GaussianMixtureModel, I found some probability much larger than 
> 1. That leads me to that fact that, the value returned by 
> MultivariateGaussian.pdf can be 10^5, etc.
> After reviewing the code, I found that problem lies in the computation of 
> determinant of the covariance matrix.
> The computation is simplified by using pseudo-determinant of a positive 
> defined matrix. 
> In my case, I have a feature = 0 for all data point.
> As a result, covariance matrix is not invertible <=> det(covariance matrix) = 
> 0 => pseudo-determinant will be very close to zero,
> Thus, 

[jira] [Commented] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-11-25 Thread Hao Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695942#comment-15695942
 ] 

Hao Ren commented on SPARK-18581:
-

After reading the code comments, I see that it takes the degenerate case of the 
multivariate normal distribution into consideration:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case
I agree that the covariance matrix need not be invertible.

However, the pdf of a Gaussian should always be smaller than 1, shouldn't it?

Let's focus on the MultivariateGaussian.calculateCovarianceConstants() function:

The problem I faced is that my covariance matrix gives an eigenvalue vector 'd' 
as the following:

DenseVector(2.7681862718766402E-17, 9.204832153027098E-5, 8.995053289618483E-4, 
0.0030052504431952055, 0.006867041289040775, 0.030351586260721354, 
0.03499956314691966, 0.04128248388411499, 0.055530636656481766, 
0.09840067120993062, 0.13259027660865316, 0.16729084354080376, 
0.18807175387781094, 0.19009666915093745, 0.19065188805766764, 
0.19116928711151343, 0.19218984168511, 0.22044130291811304, 
0.23164643534046853, 0.32957890755845165, 0.4557354551695869, 
0.639320905646873, 0.8327082373125074, 1.7966679300383896, 2.5790389754725234)

Meanwhile, the non-zero tolerance = 1.8514678433708895E-13

thus, 
{code}
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

logPseudoDetSigma = -58.40781006437829

-0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma) = 
6.230441702072326 = u (this variable name in the code)

Knowing that

{code}
  private[mllib] def logpdf(x: BV[Double]): Double = {
val delta = x - breezeMu
val v = rootSigmaInv * delta
u + v.t * v * -0.5 // u is used here
  }
{code}

If {{v.t * v * -0.5}} is a small negative number, then the logpdf will be about 
6, so pdf = exp(6) = 403.4287934927351.

In the Gaussian mixture model case, some of the Gaussian distributions can have 
a much bigger 'u' value, which results in a pdf around 2E10.


> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training GaussianMixtureModel, I found some probabilities much larger 
> than 1. That led me to the fact that the value returned by 
> MultivariateGaussian.pdf can be 10^5, etc.
> After reviewing the code, I found that the problem lies in the computation of 
> the determinant of the covariance matrix.
> The computation is simplified by using the pseudo-determinant of a positive 
> definite matrix. 
> In my case, I have a feature = 0 for all data points.
> As a result, the covariance matrix is not invertible <=> det(covariance 
> matrix) = 0 => the pseudo-determinant will be very close to zero.
> Thus, log(pseudo-determinant) will be a large negative number, which finally 
> makes logpdf very big, and the pdf will be even bigger than 1.
> As said in the comments of MultivariateGaussian.scala,
> """
> Singular values are considered to be non-zero only if they exceed a tolerance 
> based on machine precision.
> """
> But if a singular value is considered to be zero, it means the covariance 
> matrix is not invertible, which contradicts the assumption that it should 
> be invertible.
> So we should check whether any singular value is smaller than the tolerance 
> before computing the pseudo-determinant.
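A minimal sketch of the kind of check proposed in the last paragraph (illustrative 
only, not the actual MLlib change):
{code}
// Fail fast when any eigenvalue of the covariance matrix falls below the
// non-zero tolerance, instead of silently dropping it from the pseudo-determinant.
def requireNonSingular(eigenvalues: Array[Double], tol: Double): Unit = {
  val smallest = eigenvalues.min
  require(smallest > tol,
    s"Covariance matrix is effectively singular: eigenvalue $smallest <= tolerance $tol")
}
{code}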



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18128) Add support for publishing to PyPI

2016-11-25 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695941#comment-15695941
 ] 

holdenk commented on SPARK-18128:
-

Thanks! :) I'll start working on this issue once we start work on 2.2 :)

> Add support for publishing to PyPI
> --
>
> Key: SPARK-18128
> URL: https://issues.apache.org/jira/browse/SPARK-18128
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: holdenk
>
> After SPARK-1267 is done we should add support for publishing to PyPI similar 
> to how we publish to maven central.
> Note: one of the open questions is what to do about the package name, since 
> someone has registered the package name PySpark on PyPI - we could use 
> ApachePySpark, or we could try to find who registered PySpark and get 
> them to transfer it to us (since they haven't published anything, so maybe 
> that is fine?)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18128) Add support for publishing to PyPI

2016-11-25 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695939#comment-15695939
 ] 

holdenk commented on SPARK-18128:
-

Thanks! :) I'll start working on this issue once we start work on 2.2 :)

> Add support for publishing to PyPI
> --
>
> Key: SPARK-18128
> URL: https://issues.apache.org/jira/browse/SPARK-18128
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: holdenk
>
> After SPARK-1267 is done we should add support for publishing to PyPI similar 
> to how we publish to maven central.
> Note: one of the open questions is what to do about the package name, since 
> someone has registered the package name PySpark on PyPI - we could use 
> ApachePySpark, or we could try to find who registered PySpark and get 
> them to transfer it to us (since they haven't published anything, so maybe 
> that is fine?)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18405) Add yarn-cluster mode support to Spark Thrift Server

2016-11-25 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695936#comment-15695936
 ] 

holdenk commented on SPARK-18405:
-

Even in cluster mode you could overwhelm the node running the thrift server 
interface. Or are you proposing to expose the thrift server interface on 
multiple nodes, not just the driver program?

> Add yarn-cluster mode support to Spark Thrift Server
> 
>
> Key: SPARK-18405
> URL: https://issues.apache.org/jira/browse/SPARK-18405
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.2, 2.0.0, 2.0.1
>Reporter: Prabhu Kasinathan
>  Labels: Spark, ThriftServer2
>
> Currently, the Spark Thrift Server can run only in yarn-client mode. 
> Can we add yarn-cluster mode support to the Spark Thrift Server?
> This will help us launch multiple Spark Thrift Servers with different Spark 
> configurations, and it really helps in large distributed clusters where there 
> is a requirement to run complex SQL through STS. With client mode, there is a 
> chance of overloading the local host with too much driver memory. 
> Please let me know your thoughts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18502) Spark does not handle columns that contain backquote (`)

2016-11-25 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-18502:

Component/s: (was: Spark Core)
 SQL

> Spark does not handle columns that contain backquote (`)
> 
>
> Key: SPARK-18502
> URL: https://issues.apache.org/jira/browse/SPARK-18502
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Barry Becker
>Priority: Minor
>
> I know that if a column contains dots or hyphens we can put 
> backquotes/backticks around it, but what if the column contains a backtick 
> (`)? Can the backtick be escaped by some means?
> Here is an example of the sort of error I see
> {code}
> org.apache.spark.sql.AnalysisException: syntax error in attribute name: 
> `Invoice`Date`;org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99)
>  
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:109)
>  
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.quotedString(unresolved.scala:90)
>  org.apache.spark.sql.Column.(Column.scala:113) 
> org.apache.spark.sql.Column$.apply(Column.scala:36) 
> org.apache.spark.sql.functions$.min(functions.scala:407) 
> com.mineset.spark.vizagg.vizbin.strategies.DateBinStrategy.getDateExtent(DateBinStrategy.scala:158)
>  
> {code}
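A hypothetical reproduction sketch (assuming a spark-shell session; the column names 
are made up):
{code}
import spark.implicits._

// A column whose name contains a backtick.
val df = Seq((1, 2)).toDF("Invoice`Date", "amount")

// Referring to the column by name may trigger the same
// "syntax error in attribute name" AnalysisException reported above.
df.select("Invoice`Date").show()
{code}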



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18532) Code generation memory issue

2016-11-25 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-18532:

Component/s: (was: Spark Core)
 SQL

> Code generation memory issue
> 
>
> Key: SPARK-18532
> URL: https://issues.apache.org/jira/browse/SPARK-18532
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: osx / macbook pro / local spark
>Reporter: Georg Heiler
>
> Trying to create a Spark data frame with multiple additional columns based on 
> conditions like this
> {code}
> df
>   .withColumn("name1", someCondition1)
>   .withColumn("name2", someCondition2)
>   .withColumn("name3", someCondition3)
>   .withColumn("name4", someCondition4)
>   .withColumn("name5", someCondition5)
>   .withColumn("name6", someCondition6)
>   .withColumn("name7", someCondition7)
> {code}
> I am faced with the following exception if more than 6 .withColumn 
> clauses are added:
> {code}
> org.codehaus.janino.JaninoRuntimeException: Code of method "()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> {code}
> If even more columns are created, e.g. around 20, I no longer receive the 
> aforementioned exception, but rather get the following error after 5 minutes 
> of waiting:
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> What I want to perform is a spelling/error correction. Some simple cases 
> could be handled easily via a map & replace in a UDF. Still, several other 
> cases with multiple chained conditions remain.
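One mitigation that is sometimes suggested for long withColumn chains is to build all 
the derived columns in a single select, so the plan contains one projection instead of 
one per withColumn. A sketch under the assumption that the conditions are plain Column 
expressions (the dataframe and conditions below are illustrative, and there is no 
guarantee this stays under the codegen limit):
{code}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("teh", "adress", "hello").toDF("word")   // stand-in for the real dataframe

// Illustrative conditions; the real ones come from the spelling-correction logic.
val corrections = Seq(
  when(col("word") === "teh", "the").otherwise(col("word")).as("name1"),
  when(col("word") === "adress", "address").otherwise(col("word")).as("name2"))

// One projection with all derived columns, rather than one projection per withColumn.
val result = df.select(col("*") +: corrections: _*)
{code}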



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18541) Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management in pyspark SQL API

2016-11-25 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695922#comment-15695922
 ] 

holdenk commented on SPARK-18541:
-

Making it easier for PySpark SQL users to specify metadata sounds 
interesting/useful. I'd probably try and choose something closer to the scala 
API (e.g. implement `as` instead of `aliasWithMetadata`). What do [~davies] / 
[~marmbrus] think?

> Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management 
> in pyspark SQL API
> 
>
> Key: SPARK-18541
> URL: https://issues.apache.org/jira/browse/SPARK-18541
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.2
> Environment: all
>Reporter: Shea Parkes
>Priority: Minor
>  Labels: newbie
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the Scala SQL API, you can pass in new metadata when you alias a field.  
> That functionality is not available in the Python API.   Right now, you have 
> to painfully utilize {{SparkSession.createDataFrame}} to manipulate the 
> metadata for even a single column.  I would propose to add the following 
> method to {{pyspark.sql.Column}}:
> {code}
> def aliasWithMetadata(self, name, metadata):
>     """
>     Make a new Column that has the provided alias and metadata.
>     Metadata will be processed with json.dumps()
>     """
>     _context = pyspark.SparkContext._active_spark_context
>     _metadata_str = json.dumps(metadata)
>     _metadata_jvm = _context._jvm.org.apache.spark.sql.types.Metadata.fromJson(_metadata_str)
>     _new_java_column = getattr(self._jc, 'as')(name, _metadata_jvm)
>     return Column(_new_java_column)
> {code}
> I can likely complete this request myself if there is any interest for it.  
> Just have to dust off my knowledge of doctest and the location of the python 
> tests.
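For comparison, a rough sketch of what the existing Scala API already allows (the 
dataframe and metadata key here are illustrative, assuming a spark-shell session):
{code}
import org.apache.spark.sql.types.MetadataBuilder
import spark.implicits._

val df = Seq((1, "a")).toDF("id", "value")
val md = new MetadataBuilder().putString("comment", "cleaned value").build()

// Column.as(alias, metadata) attaches metadata while aliasing; this is the
// capability the proposed aliasWithMetadata would expose on the Python side.
val withMeta = df.select(df("id"), df("value").as("value", md))
withMeta.schema("value").metadata
{code}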



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18538) Concurrent Fetching DataFrameReader JDBC APIs Do Not Work

2016-11-25 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-18538:
--
Priority: Blocker  (was: Critical)

> Concurrent Fetching DataFrameReader JDBC APIs Do Not Work
> -
>
> Key: SPARK-18538
> URL: https://issues.apache.org/jira/browse/SPARK-18538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> {code}
>   def jdbc(
>   url: String,
>   table: String,
>   columnName: String,
>   lowerBound: Long,
>   upperBound: Long,
>   numPartitions: Int,
>   connectionProperties: Properties): DataFrame
> {code}
> {code}
>   def jdbc(
>   url: String,
>   table: String,
>   predicates: Array[String],
>   connectionProperties: Properties): DataFrame
> {code}
> The above two DataFrameReader JDBC APIs ignore the user-specified 
> parallelism parameters.
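For reference, a sketch of how the first overload above is meant to be called (the URL, 
table, bounds and credentials are placeholders):
{code}
import java.util.Properties

val props = new Properties()
props.setProperty("user", "spark")        // placeholder credentials
props.setProperty("password", "secret")

// numPartitions = 8 is supposed to yield 8 concurrent partitioned reads over the
// id range; the bug is that this parallelism setting is ignored.
val df = spark.read.jdbc(
  "jdbc:postgresql://db-host:5432/mydb", "events",
  "id", 0L, 1000000L, 8, props)
{code}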



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)

2016-11-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18356:
--
Assignee: zakaria hili

> Issue + Resolution: Kmeans Spark Performances (ML package)
> --
>
> Key: SPARK-18356
> URL: https://issues.apache.org/jira/browse/SPARK-18356
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0, 2.0.1
>Reporter: zakaria hili
>Assignee: zakaria hili
>Priority: Minor
>  Labels: easyfix
> Fix For: 2.2.0
>
>
> Hello,
> I'm a newbie in Spark, but I think I found a small problem that can affect 
> Spark KMeans performance.
> Before starting to explain the problem, I want to explain the warning that I 
> faced.
> I tried to use Spark KMeans with DataFrames to cluster my data:
> {code}
> df_Part = assembler.transform(df_Part)
> df_Part.cache()
> while (k <= max_cluster) and (wssse > seuilStop):
>     kmeans = KMeans().setK(k)
>     model = kmeans.fit(df_Part)
>     wssse = model.computeCost(df_Part)
>     k = k + 1
> {code}
> but when I run the code I receive the warning:
> WARN KMeans: The input data is not directly cached, which may hurt 
> performance if its parent RDDs are also uncached.
> I searched the Spark source code to find the source of this problem, and 
> realized there are two classes responsible for this warning: 
> (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala )
> (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala )
>  
> When my dataframe is cached, the fit method transforms my dataframe into an 
> internal RDD which is not cached.
> Dataframe -> rdd -> run Training Kmeans Algo(rdd)
> -> The first class (ml package) is responsible for converting the dataframe 
> into an RDD and then calling the KMeans algorithm.
> -> The second class (mllib package) implements the KMeans algorithm, and here 
> Spark verifies whether the RDD is cached; if not, a warning is generated.  
> So, the solution to this problem is to cache the RDD before running the 
> KMeans algorithm.
> https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
> All we need is to add two lines:
> Cache the RDD just after the dataframe transformation, then uncache it after 
> the training algorithm.
> I hope that I was clear.
> If you think that I was wrong, please let me know.
> Sincerely,
> Zakaria HILI
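A minimal, self-contained sketch of the caching pattern described above, using the 
RDD-based mllib API in a spark-shell session (an illustration of the idea, not the 
actual ml.KMeans internals):
{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

val data = spark.sparkContext.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1), Vectors.dense(9.0, 9.0)))

data.persist(StorageLevel.MEMORY_AND_DISK)   // cache before the iterative training
val model = new KMeans().setK(2).run(data)
data.unpersist()                             // release the cache once training is done
{code}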



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)

2016-11-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18356.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 15965
[https://github.com/apache/spark/pull/15965]

> Issue + Resolution: Kmeans Spark Performances (ML package)
> --
>
> Key: SPARK-18356
> URL: https://issues.apache.org/jira/browse/SPARK-18356
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0, 2.0.1
>Reporter: zakaria hili
>Assignee: zakaria hili
>Priority: Minor
>  Labels: easyfix
> Fix For: 2.2.0
>
>
> Hello,
> I'm a newbie in Spark, but I think I found a small problem that can affect 
> Spark KMeans performance.
> Before starting to explain the problem, I want to explain the warning that I 
> faced.
> I tried to use Spark KMeans with DataFrames to cluster my data:
> {code}
> df_Part = assembler.transform(df_Part)
> df_Part.cache()
> while (k <= max_cluster) and (wssse > seuilStop):
>     kmeans = KMeans().setK(k)
>     model = kmeans.fit(df_Part)
>     wssse = model.computeCost(df_Part)
>     k = k + 1
> {code}
> but when I run the code I receive the warning:
> WARN KMeans: The input data is not directly cached, which may hurt 
> performance if its parent RDDs are also uncached.
> I searched the Spark source code to find the source of this problem, and 
> realized there are two classes responsible for this warning: 
> (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala )
> (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala )
>  
> When my dataframe is cached, the fit method transforms my dataframe into an 
> internal RDD which is not cached.
> Dataframe -> rdd -> run Training Kmeans Algo(rdd)
> -> The first class (ml package) is responsible for converting the dataframe 
> into an RDD and then calling the KMeans algorithm.
> -> The second class (mllib package) implements the KMeans algorithm, and here 
> Spark verifies whether the RDD is cached; if not, a warning is generated.  
> So, the solution to this problem is to cache the RDD before running the 
> KMeans algorithm.
> https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
> All we need is to add two lines:
> Cache the RDD just after the dataframe transformation, then uncache it after 
> the training algorithm.
> I hope that I was clear.
> If you think that I was wrong, please let me know.
> Sincerely,
> Zakaria HILI



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18565) subtractByKey modifies values in the source RDD

2016-11-25 Thread Dmitry Dzhus (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Dzhus resolved SPARK-18565.
--
Resolution: Invalid

> subtractByKey modifies values in the source RDD
> --
>
> Key: SPARK-18565
> URL: https://issues.apache.org/jira/browse/SPARK-18565
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: Amazon Elastic MapReduce (emr-5.2.0)
>Reporter: Dmitry Dzhus
>
> I'm experiencing a problem with subtractByKey using Spark 2.0.2 with Scala 
> 2.11.x:
> Relevant code:
> {code}
> object Types {
>type ContentId = Int
>type ContentKey = Tuple2[Int, ContentId]
>type InternalContentId = Int
> }
> val inverseItemIDMap: RDD[(InternalContentId, ContentKey)] = 
> itemIDMap.map(_.swap).cache()
> logger.info(s"Built an inverse map of ${inverseItemIDMap.count()} item IDs")
> logger.info(inverseItemIDMap.collect().mkString("I->E ", "\nI->E ", ""))
> val superfluousItems: RDD[(InternalContentId, Int)] = .. .cache()
> logger.info(superfluousItems.collect().mkString("SI ", "\nSI ", ""))
> val filteredInverseItemIDMap: RDD[(InternalContentId, ContentKey)] = 
>   inverseItemIDMap.subtractByKey(superfluousItems).cache() // <<===!!!
> logger.info(s"${filteredInverseItemIDMap.count()} items in the filtered 
> inverse ID mapping")
> logger.info(filteredInverseItemIDMap.collect().mkString("F I->E ", "\nF I->E 
> ", ""))
> {code}
> The operation in question is {{.subtractByKey}}. Both RDDs involved are 
> cached and forced via {{count()}} prior to calling {{subtractByKey}}, so I 
> would expect the result to be unaffected by how exactly superfluousItems is 
> built.
> I added debugging output and filtered the resulting logs by relevant 
> InternalContentId values (829911, 830071). Output:
> {code}
> Built an inverse map of 827354 item IDs
> .
> .
> I->E (829911,(2,1135081))
> I->E (830071,(1,2295102))
> .
> .
> 748190 items in the training set had less than 28 ratings
> SI (829911,3)
> .
> .
> 79164 items in the filtered inverse ID mapping
> F I->E (830071,(2,1135081))
> {code}
> There's no element with key 830071 in {{superfluousItems}} (SI), so it's not 
> removed from the source RDD. However, its value is for some reason replaced 
> with the one from key 829911. How could this be? I cannot reproduce it 
> locally - only when running on a multi-machine cluster. Is this a bug, or am I
> missing something?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18565) subtractByKey modifies values in the source RDD

2016-11-25 Thread Dmitry Dzhus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695809#comment-15695809
 ] 

Dmitry Dzhus commented on SPARK-18565:
--

The problem was that I assumed that caching and forcing an RDD guarantees that
it will never be re-evaluated. inverseItemIDMap is built using a
non-deterministic operation, and multiple uses of it give different
results:

{code}
val itemIDMap: RDD[(ContentKey, InternalContentId)] =
  rawEvents
  .map(_.content)
  .distinct
  .zipWithUniqueId()
  .map(u => (u._1, u._2.toInt))
  .cache()
logger.info(s"Built a map of ${itemIDMap.count()} item IDs")
val inverseItemIDMap: RDD[(InternalContentId, ContentKey)] =
  itemIDMap.map(_.swap).cache()
{code}

I made the operation stable by adding {{.sortBy(c => c)}} and this solved the 
issue.
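
A minimal, self-contained sketch of that stabilization follows; the event type, field name, and toy data are stand-ins for the reporter's actual classes, not code from the original job.

{code}
import org.apache.spark.sql.SparkSession

object StableIdMap {
  final case class Event(content: (Int, Int))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stable-id-map").master("local[*]").getOrCreate()
    val sc = spark.sparkContext
    val rawEvents = sc.parallelize(Seq(Event((1, 10)), Event((2, 20)), Event((1, 10))))

    val itemIDMap = rawEvents
      .map(_.content)
      .distinct()
      .sortBy(c => c)        // pin the ordering so the IDs assigned below are reproducible
      .zipWithUniqueId()
      .map(u => (u._1, u._2.toInt))
      .cache()

    println(s"Built a map of ${itemIDMap.count()} item IDs")
    spark.stop()
  }
}
{code}

Without the sort, distinct() followed by zipWithUniqueId() can assign different IDs if the RDD is ever recomputed, which is what produced the inconsistent subtractByKey result described above.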

> subtractByKey modifies values in the source RDD
> --
>
> Key: SPARK-18565
> URL: https://issues.apache.org/jira/browse/SPARK-18565
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: Amazon Elastic MapReduce (emr-5.2.0)
>Reporter: Dmitry Dzhus
>
> I'm experiencing a problem with subtractByKey using Spark 2.0.2 with Scala 
> 2.11.x:
> Relevant code:
> {code}
> object Types {
>type ContentId = Int
>type ContentKey = Tuple2[Int, ContentId]
>type InternalContentId = Int
> }
> val inverseItemIDMap: RDD[(InternalContentId, ContentKey)] = 
> itemIDMap.map(_.swap).cache()
> logger.info(s"Built an inverse map of ${inverseItemIDMap.count()} item IDs")
> logger.info(inverseItemIDMap.collect().mkString("I->E ", "\nI->E ", ""))
> val superfluousItems: RDD[(InternalContentId, Int)] = .. .cache()
> logger.info(superfluousItems.collect().mkString("SI ", "\nSI ", ""))
> val filteredInverseItemIDMap: RDD[(InternalContentId, ContentKey)] = 
>   inverseItemIDMap.subtractByKey(superfluousItems).cache() // <<===!!!
> logger.info(s"${filteredInverseItemIDMap.count()} items in the filtered 
> inverse ID mapping")
> logger.info(filteredInverseItemIDMap.collect().mkString("F I->E ", "\nF I->E 
> ", ""))
> {code}
> The operation in question is {{.subtractByKey}}. Both RDDs involved are 
> cached and forced via {{count()}} prior to calling {{subtractByKey}}, so I 
> would expect the result to be unaffected by how exactly superfluousItems is 
> built.
> I added debugging output and filtered the resulting logs by relevant 
> InternalContentId values (829911, 830071). Output:
> {code}
> Built an inverse map of 827354 item IDs
> .
> .
> I->E (829911,(2,1135081))
> I->E (830071,(1,2295102))
> .
> .
> 748190 items in the training set had less than 28 ratings
> SI (829911,3)
> .
> .
> 79164 items in the filtered inverse ID mapping
> F I->E (830071,(2,1135081))
> {code}
> There's no element with key 830071 in {{superfluousItems}} (SI), so it's not 
> removed from the source RDD. However, its value is for some reason replaced 
> with the one from key 829911. How could this be? I cannot reproduce it 
> locally - only when running on a multi-machine cluster. Is this a bug, or am I
> missing something?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18512) FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and S3A

2016-11-25 Thread Genmao Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695800#comment-15695800
 ] 

Genmao Yu commented on SPARK-18512:
---

[~giuseppe.bonaccorso] How can I reproduce this failure? Can you provide a
code snippet?

> FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and 
> S3A
> 
>
> Key: SPARK-18512
> URL: https://issues.apache.org/jira/browse/SPARK-18512
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
> Environment: AWS EMR 5.0.1
> Spark 2.0.1
> S3 EU-West-1 (S3A with read-after-write consistency)
>Reporter: Giuseppe Bonaccorso
>
> After a few hours of streaming processing and saving data in Parquet format,
> I always got this exception:
> {code:java}
> java.io.FileNotFoundException: No such file or directory: 
> s3a://xxx/_temporary/0/task_
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1004)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:745)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:426)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:362)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:334)
>   at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:510)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:488)
> {code}
> I've also tried s3:// and s3n://, but it always happens after 3-5 hours.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17251) "ClassCastException: OuterReference cannot be cast to NamedExpression" for correlated subquery on the RHS of an IN operator

2016-11-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695645#comment-15695645
 ] 

Apache Spark commented on SPARK-17251:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/16012

> "ClassCastException: OuterReference cannot be cast to NamedExpression" for 
> correlated subquery on the RHS of an IN operator
> ---
>
> Key: SPARK-17251
> URL: https://issues.apache.org/jira/browse/SPARK-17251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>
> The following test case produces a ClassCastException in the analyzer:
> {code}
> CREATE TABLE t1(a INTEGER);
> INSERT INTO t1 VALUES(1),(2);
> CREATE TABLE t2(b INTEGER);
> INSERT INTO t2 VALUES(1);
> SELECT a FROM t1 WHERE a NOT IN (SELECT a FROM t2);
> {code}
> Here's the exception:
> {code}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.OuterReference cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.NamedExpression
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$1.apply(basicLogicalOperators.scala:48)
>   at 
> scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80)
>   at scala.collection.immutable.List.exists(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.resolved$lzycompute(basicLogicalOperators.scala:44)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.resolved(basicLogicalOperators.scala:43)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQuery(Analyzer.scala:1091)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1130)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1116)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries(Analyzer.scala:1116)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1148)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1141)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> 

[jira] [Assigned] (SPARK-17251) "ClassCastException: OuterReference cannot be cast to NamedExpression" for correlated subquery on the RHS of an IN operator

2016-11-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17251:


Assignee: Apache Spark

> "ClassCastException: OuterReference cannot be cast to NamedExpression" for 
> correlated subquery on the RHS of an IN operator
> ---
>
> Key: SPARK-17251
> URL: https://issues.apache.org/jira/browse/SPARK-17251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> The following test case produces a ClassCastException in the analyzer:
> {code}
> CREATE TABLE t1(a INTEGER);
> INSERT INTO t1 VALUES(1),(2);
> CREATE TABLE t2(b INTEGER);
> INSERT INTO t2 VALUES(1);
> SELECT a FROM t1 WHERE a NOT IN (SELECT a FROM t2);
> {code}
> Here's the exception:
> {code}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.OuterReference cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.NamedExpression
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$1.apply(basicLogicalOperators.scala:48)
>   at 
> scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80)
>   at scala.collection.immutable.List.exists(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.resolved$lzycompute(basicLogicalOperators.scala:44)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.resolved(basicLogicalOperators.scala:43)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQuery(Analyzer.scala:1091)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1130)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1116)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries(Analyzer.scala:1116)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1148)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1141)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
>   at 

[jira] [Assigned] (SPARK-17251) "ClassCastException: OuterReference cannot be cast to NamedExpression" for correlated subquery on the RHS of an IN operator

2016-11-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17251:


Assignee: (was: Apache Spark)

> "ClassCastException: OuterReference cannot be cast to NamedExpression" for 
> correlated subquery on the RHS of an IN operator
> ---
>
> Key: SPARK-17251
> URL: https://issues.apache.org/jira/browse/SPARK-17251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>
> The following test case produces a ClassCastException in the analyzer:
> {code}
> CREATE TABLE t1(a INTEGER);
> INSERT INTO t1 VALUES(1),(2);
> CREATE TABLE t2(b INTEGER);
> INSERT INTO t2 VALUES(1);
> SELECT a FROM t1 WHERE a NOT IN (SELECT a FROM t2);
> {code}
> Here's the exception:
> {code}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.OuterReference cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.NamedExpression
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$1.apply(basicLogicalOperators.scala:48)
>   at 
> scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80)
>   at scala.collection.immutable.List.exists(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.resolved$lzycompute(basicLogicalOperators.scala:44)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.resolved(basicLogicalOperators.scala:43)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQuery(Analyzer.scala:1091)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1130)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1116)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries(Analyzer.scala:1116)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1148)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1141)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
>   at 
> 

[jira] [Updated] (SPARK-18584) multiple Spark Thrift Servers running in the same machine throws org.apache.hadoop.security.AccessControlException

2016-11-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18584:
--
Target Version/s:   (was: 2.0.2)
   Fix Version/s: (was: 2.0.2)

> multiple Spark Thrift Servers running in the same machine throws 
> org.apache.hadoop.security.AccessControlException
> --
>
> Key: SPARK-18584
> URL: https://issues.apache.org/jira/browse/SPARK-18584
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: hadoop-2.5.0-cdh5.2.1-och4.0.0
> spark2.0.2
>Reporter: tanxinz
>
> In spark2.0.2 , I have two users(etl , dev ) start Spark Thrift Server in the 
> same machine . I connected by beeline etl STS to execute a command,and 
> throwed org.apache.hadoop.security.AccessControlException.I don't know why is 
> dev user perform,not etl.
> ```
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
>  Permission denied: user=dev, access=EXECUTE, 
> inode="/user/hive/warehouse/tb_spark_sts/etl_cycle_id=20161122":etl:supergroup:drwxr-x---,group:etl:rwx,group:oth_dev:rwx,default:user:data_mining:r-x,default:group::rwx,default:group:etl:rwx,default:group:oth_dev:rwx,default:mask::rwx,default:other::---
> at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkAccessAcl(DefaultAuthorizationProvider.java:335)
> at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:231)
> at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkTraverse(DefaultAuthorizationProvider.java:178)
> at 
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:137)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:138)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6250)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3942)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:811)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getFileInfo(AuthorizationProviderProxyClientProtocol.java:502)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:815)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3359) `sbt/sbt unidoc` doesn't work with Java 8

2016-11-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3359:
-
Assignee: Hyukjin Kwon

> `sbt/sbt unidoc` doesn't work with Java 8
> -
>
> Key: SPARK-3359
> URL: https://issues.apache.org/jira/browse/SPARK-3359
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> It seems that Java 8 is stricter on JavaDoc. I got many error messages like
> {code}
> [error] 
> /Users/meng/src/spark-mengxr/core/target/java/org/apache/hadoop/mapred/SparkHadoopMapRedUtil.java:2:
>  error: modifier private not allowed here
> [error] private abstract interface SparkHadoopMapRedUtil {
> [error]  ^
> {code}
> This is minor because we can always use Java 6/7 to generate the doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16151) Make generated params non-final

2016-11-25 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695614#comment-15695614
 ] 

zhengruifeng commented on SPARK-16151:
--

Some params should be made non-final, such as {{setSolver}} [SPARK-18518] and
{{setStepSize}} [https://github.com/apache/spark/pull/15913].
I can work on this.


> Make generated params non-final
> ---
>
> Key: SPARK-16151
> URL: https://issues.apache.org/jira/browse/SPARK-16151
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> Currently, all generated param instances are final. There are several 
> scenarios where we need them to be non-final:
> 1. We don't have a guideline on where the param doc should live. Some docs are
> inherited directly from the generated param traits, while others are
> documented in the setters when we want to make changes. I think it is better
> to document all of them in the param instances, which appear at the top of the
> generated API doc. However, this requires overriding the inherited param instance.
> {code}
> /**
>  * new doc
>  */
> val param: Param = super.param
> {code}
> We can use `@inherit_doc` if we just want to add a few words, e.g., the 
> default value, to the inherited doc.
> 2. We might want to update the embedded doc in the param instance.
> Opened this JIRA for discussion.
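
As a toy, Spark-free illustration of what making the generated params non-final would allow (the Param and HasRegParam definitions below are stand-ins, not the real org.apache.spark.ml.param classes):

{code}
// Stand-in for a generated shared-param trait; the val must be non-final for the
// override in the subclass to compile.
final case class Param[T](name: String, doc: String, default: T)

trait HasRegParam {
  val regParam: Param[Double] =
    Param("regParam", "regularization parameter (>= 0)", 0.0)
}

class MyEstimatorParams extends HasRegParam {
  /** new doc: same param, but the documentation now mentions the default value */
  override val regParam: Param[Double] =
    Param("regParam", "regularization parameter (>= 0); default: 0.0", 0.0)
}

object ParamDocDemo extends App {
  println((new MyEstimatorParams).regParam.doc)
}
{code}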



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-11-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695552#comment-15695552
 ] 

Sean Owen commented on SPARK-18581:
---

Yes, but it need not be invertible, for the reason you give. It looks like it 
handles this in the code. pinvS is a pseudo-inverse of the eigenvalue diagonal 
matrix, which can have zeroes.

Backing up though, I re-read and see you're saying you get a PDF > 1, but, 
that's perfectly normal. PDF does not need to be <= 1.

Are you, however, saying you observe a big numeric inaccuracy in this case?

> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training GaussianMixtureModel, I found some probabilities much larger than
> 1. That led me to the fact that the value returned by
> MultivariateGaussian.pdf can be 10^5, etc.
> After reviewing the code, I found that the problem lies in the computation of the
> determinant of the covariance matrix.
> The computation is simplified by using the pseudo-determinant of a positive
> semi-definite matrix.
> In my case, I have a feature = 0 for all data points.
> As a result, the covariance matrix is not invertible <=> det(covariance matrix) =
> 0 => the pseudo-determinant will be very close to zero.
> Thus, log(pseudo-determinant) will be a large negative number, which finally
> makes logpdf very big, and pdf will be even bigger, > 1.
> As said in the comments of MultivariateGaussian.scala:
> """
> Singular values are considered to be non-zero only if they exceed a tolerance
> based on machine precision.
> """
> But if a singular value is considered to be zero, it means the covariance matrix
> is non-invertible, which contradicts the assumption that it should
> be invertible.
> So we should check whether any singular value is smaller than the tolerance
> before computing the pseudo-determinant.
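
For reference, the mechanism described above can be restated against the log-density that is actually evaluated; this is the generic formula for a Gaussian with a pseudo-inverse and pseudo-determinant, not a quote of MultivariateGaussian.scala:

{code}
\log \mathrm{pdf}(x) = -\tfrac{1}{2}\Big( k \log(2\pi) + \log \operatorname{pdet}(\Sigma) + (x - \mu)^{\top} \Sigma^{+} (x - \mu) \Big)
{code}

Here pdet(Sigma) is the product of the eigenvalues retained by the tolerance filter and Sigma^+ is the pseudo-inverse. If a retained eigenvalue is tiny, log pdet(Sigma) is a large negative number and the density along that direction becomes huge. Note that a density above 1 is not itself invalid: even in one dimension, a Gaussian with sigma = 10^-3 has peak density 1/(sigma * sqrt(2*pi)) ~ 399.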



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-11-25 Thread Hao Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695481#comment-15695481
 ] 

Hao Ren commented on SPARK-18581:
-

[~srowen]
I have updated the description. The problem is that my covariance matrix is
non-invertible, since one of the features is zero for all data points.

> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training GaussianMixtureModel, I found some probabilities much larger than
> 1. That led me to the fact that the value returned by
> MultivariateGaussian.pdf can be 10^5, etc.
> After reviewing the code, I found that the problem lies in the computation of the
> determinant of the covariance matrix.
> The computation is simplified by using the pseudo-determinant of a positive
> semi-definite matrix.
> In my case, I have a feature = 0 for all data points.
> As a result, the covariance matrix is not invertible <=> det(covariance matrix) =
> 0 => the pseudo-determinant will be very close to zero.
> Thus, log(pseudo-determinant) will be a large negative number, which finally
> makes logpdf very big, and pdf will be even bigger, > 1.
> As said in the comments of MultivariateGaussian.scala:
> """
> Singular values are considered to be non-zero only if they exceed a tolerance
> based on machine precision.
> """
> But if a singular value is considered to be zero, it means the covariance matrix
> is non-invertible, which contradicts the assumption that it should
> be invertible.
> So we should check whether any singular value is smaller than the tolerance
> before computing the pseudo-determinant.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-11-25 Thread Hao Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao Ren updated SPARK-18581:

Description: 
When training GaussianMixtureModel, I found some probabilities much larger than
1. That led me to the fact that the value returned by
MultivariateGaussian.pdf can be 10^5, etc.

After reviewing the code, I found that the problem lies in the computation of the
determinant of the covariance matrix.

The computation is simplified by using the pseudo-determinant of a positive
semi-definite matrix.

In my case, I have a feature = 0 for all data points.
As a result, the covariance matrix is not invertible <=> det(covariance matrix) = 0
=> the pseudo-determinant will be very close to zero.
Thus, log(pseudo-determinant) will be a large negative number, which finally
makes logpdf very big, and pdf will be even bigger, > 1.

As said in the comments of MultivariateGaussian.scala:

"""
Singular values are considered to be non-zero only if they exceed a tolerance
based on machine precision.
"""

But if a singular value is considered to be zero, it means the covariance matrix
is non-invertible, which contradicts the assumption that it should be
invertible.

So we should check whether any singular value is smaller than the tolerance before
computing the pseudo-determinant.



  was:
When training GaussianMixtureModel, I found some probabilities much larger than
1. That led me to the fact that the value returned by
MultivariateGaussian.pdf can be 10^5, etc.

After reviewing the code, I found that the problem lies in the computation of the
determinant of the covariance matrix.

The computation is simplified by using the pseudo-determinant of a positive
semi-definite matrix. However, if the eigenvalues are all between 0 and 1,
log(pseudo-determinant) will be a negative number like -50. As a result, the
logpdf could be positive, thus pdf > 1.

The related code is the following:

// In function: MultivariateGaussian.calculateCovarianceConstants

{code}
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

d is the eigenvalue vector here. If many of its elements are between 0 and 1,
then logPseudoDetSigma could be negative.




> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training GaussianMixtureModel, I found some probabilities much larger than
> 1. That led me to the fact that the value returned by
> MultivariateGaussian.pdf can be 10^5, etc.
> After reviewing the code, I found that the problem lies in the computation of the
> determinant of the covariance matrix.
> The computation is simplified by using the pseudo-determinant of a positive
> semi-definite matrix.
> In my case, I have a feature = 0 for all data points.
> As a result, the covariance matrix is not invertible <=> det(covariance matrix) =
> 0 => the pseudo-determinant will be very close to zero.
> Thus, log(pseudo-determinant) will be a large negative number, which finally
> makes logpdf very big, and pdf will be even bigger, > 1.
> As said in the comments of MultivariateGaussian.scala:
> """
> Singular values are considered to be non-zero only if they exceed a tolerance
> based on machine precision.
> """
> But if a singular value is considered to be zero, it means the covariance matrix
> is non-invertible, which contradicts the assumption that it should
> be invertible.
> So we should check whether any singular value is smaller than the tolerance
> before computing the pseudo-determinant.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18318) ML, Graph 2.1 QA: API: New Scala APIs, docs

2016-11-25 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695473#comment-15695473
 ] 

Yanbo Liang commented on SPARK-18318:
-

Finished reviewing all classes that were added/changed after 2.0; opened
SPARK-18587 and SPARK-18481 for two major issues, and addressed other minor
issues in PR #16009. Thanks.

> ML, Graph 2.1 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-18318
> URL: https://issues.apache.org/jira/browse/SPARK-18318
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18587) Remove handleInvalid from QuantileDiscretizer

2016-11-25 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-18587:
---

Assignee: Yanbo Liang

> Remove handleInvalid from QuantileDiscretizer
> -
>
> Key: SPARK-18587
> URL: https://issues.apache.org/jira/browse/SPARK-18587
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Critical
>
> HandleInvalid only matters when {{Bucketizer}} is transforming a dataset that
> contains NaN; however, when the training dataset contains NaN,
> {{QuantileDiscretizer}} will always ignore those values. So we should keep
> {{handleInvalid}} in {{Bucketizer}} and remove it from
> {{QuantileDiscretizer}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18481) ML 2.1 QA: Remove deprecated methods for ML

2016-11-25 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-18481:

Description: 
Remove deprecated methods for ML.

We removed the following public APIs in this JIRA:
org.apache.spark.ml.classification.RandomForestClassificationModel.numTrees
org.apache.spark.ml.feature.ChiSqSelectorModel.setLabelCol
org.apache.spark.ml.regression.LinearRegressionSummary.model
org.apache.spark.ml.regression.RandomForestRegressionModel.numTrees


  was:
Remove deprecated methods for ML.

We removed the following public APIs:
org.apache.spark.ml.classification.RandomForestClassificationModel.numTrees
org.apache.spark.ml.feature.ChiSqSelectorModel.setLabelCol
org.apache.spark.ml.regression.LinearRegressionSummary.model
org.apache.spark.ml.regression.RandomForestRegressionModel.numTrees



> ML 2.1 QA: Remove deprecated methods for ML 
> 
>
> Key: SPARK-18481
> URL: https://issues.apache.org/jira/browse/SPARK-18481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
>
> Remove deprecated methods for ML.
> We removed the following public APIs in this JIRA:
> org.apache.spark.ml.classification.RandomForestClassificationModel.numTrees
> org.apache.spark.ml.feature.ChiSqSelectorModel.setLabelCol
> org.apache.spark.ml.regression.LinearRegressionSummary.model
> org.apache.spark.ml.regression.RandomForestRegressionModel.numTrees



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18481) ML 2.1 QA: Remove deprecated methods for ML

2016-11-25 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-18481:

Description: 
Remove deprecated methods for ML.

We removed the following public APIs:
org.apache.spark.ml.classification.RandomForestClassificationModel.numTrees
org.apache.spark.ml.feature.ChiSqSelectorModel.setLabelCol
org.apache.spark.ml.regression.LinearRegressionSummary.model
org.apache.spark.ml.regression.RandomForestRegressionModel.numTrees


  was:
Remove deprecated methods for ML.

We removed 


> ML 2.1 QA: Remove deprecated methods for ML 
> 
>
> Key: SPARK-18481
> URL: https://issues.apache.org/jira/browse/SPARK-18481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
>
> Remove deprecated methods for ML.
> We removed the following public APIs:
> org.apache.spark.ml.classification.RandomForestClassificationModel.numTrees
> org.apache.spark.ml.feature.ChiSqSelectorModel.setLabelCol
> org.apache.spark.ml.regression.LinearRegressionSummary.model
> org.apache.spark.ml.regression.RandomForestRegressionModel.numTrees



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18481) ML 2.1 QA: Remove deprecated methods for ML

2016-11-25 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-18481:

Priority: Major  (was: Minor)

> ML 2.1 QA: Remove deprecated methods for ML 
> 
>
> Key: SPARK-18481
> URL: https://issues.apache.org/jira/browse/SPARK-18481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Remove deprecated methods for ML.
> We removed 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18481) ML 2.1 QA: Remove deprecated methods for ML

2016-11-25 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-18481:

Priority: Minor  (was: Major)

> ML 2.1 QA: Remove deprecated methods for ML 
> 
>
> Key: SPARK-18481
> URL: https://issues.apache.org/jira/browse/SPARK-18481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
>
> Remove deprecated methods for ML.
> We removed 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18481) ML 2.1 QA: Remove deprecated methods for ML

2016-11-25 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-18481:

Description: 
Remove deprecated methods for ML.

We removed 

  was:Remove deprecated methods for ML.


> ML 2.1 QA: Remove deprecated methods for ML 
> 
>
> Key: SPARK-18481
> URL: https://issues.apache.org/jira/browse/SPARK-18481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
>
> Remove deprecated methods for ML.
> We removed 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18548) OnlineLDAOptimizer reads the same broadcast data after deletion

2016-11-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18548.
---
Resolution: Duplicate

> OnlineLDAOptimizer reads the same broadcast data after deletion
> ---
>
> Key: SPARK-18548
> URL: https://issues.apache.org/jira/browse/SPARK-18548
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: Xiaoye Sun
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In submitMiniBatch(), called by OnlineLDAOptimizer, the broadcast variable
> expElogbeta is deleted before its second use, which forces the
> executor to read the same large broadcast data twice. I suggest moving the
> broadcast data deletion (expElogbetaBc.unpersist()) later.
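
A generic, self-contained sketch of the ordering being suggested (plain Spark code, not the actual submitMiniBatch() implementation):

{code}
import org.apache.spark.sql.SparkSession

object BroadcastUnpersistOrder {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-unpersist-order").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))
    val keys = sc.parallelize(Seq(1, 2, 1, 2))

    // First use of the broadcast value.
    val tagged = keys.map(k => lookup.value.getOrElse(k, "?")).collect()
    // Second use: unpersisting before this point would force the executors to
    // fetch the (potentially large) broadcast data from the driver again.
    val hits = keys.filter(k => lookup.value.contains(k)).count()

    lookup.unpersist() // release the broadcast blocks only after the last use

    println(s"tagged=${tagged.mkString(",")} hits=$hits")
    spark.stop()
  }
}
{code}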



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18586) netty-3.8.0.Final.jar has vulnerability CVE-2014-3488 and CVE-2014-0193

2016-11-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18586:
--
Priority: Major  (was: Critical)

Spark doesn't use netty 3, but it is pulled in as a transitive dependency. We 
can't get rid of it, but, it also isn't even necessarily exposed. 
Do these CVEs even affect Spark? We can try managing the version up to 3.8.3 to 
resolve one, or 3.9.x to resolve both, but this won't change the version of 
Netty that ends up on the classpath if deploying on an existing cluster.

> netty-3.8.0.Final.jar has vulnerability CVE-2014-3488  and CVE-2014-0193
> 
>
> Key: SPARK-18586
> URL: https://issues.apache.org/jira/browse/SPARK-18586
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: meiyoula
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18119) Namenode safemode check is only performed on one namenode, which can stall the startup of the Spark History Server

2016-11-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18119:
--
Assignee: Nicolas Fraison

> Namenode safemode check is only performed on one namenode, which can stall the
> startup of the Spark History Server
> 
>
> Key: SPARK-18119
> URL: https://issues.apache.org/jira/browse/SPARK-18119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.1
> Environment: HDFS cdh 5.5.0 with HA namenode
>Reporter: Nicolas Fraison
>Assignee: Nicolas Fraison
>Priority: Minor
> Fix For: 2.1.0
>
>
> Spark History Server startup is stuck when one of the two HA namenodes is in
> safe mode, displaying this log message: HDFS is still in safe mode. Waiting...
> This happens even if the other namenode is active, because the check only
> queries the first of the two available namenodes in an HA configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18119) Namenode safemode check is only performed on one namenode, which can stall the startup of the Spark History Server

2016-11-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18119.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15648
[https://github.com/apache/spark/pull/15648]

> Namenode safemode check is only performed on one namenode, which can stall the
> startup of the Spark History Server
> 
>
> Key: SPARK-18119
> URL: https://issues.apache.org/jira/browse/SPARK-18119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.1
> Environment: HDFS cdh 5.5.0 with HA namenode
>Reporter: Nicolas Fraison
>Priority: Minor
> Fix For: 2.1.0
>
>
> Spark History Server startup is stuck when one of the two HA namenodes is in
> safe mode, displaying this log message: HDFS is still in safe mode. Waiting...
> This happens even if the other namenode is active, because the check only
> queries the first of the two available namenodes in an HA configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18324) ML, Graph 2.1 QA: Programming guide update and migration guide

2016-11-25 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-18324:
---

Assignee: Yanbo Liang

> ML, Graph 2.1 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-18324
> URL: https://issues.apache.org/jira/browse/SPARK-18324
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Critical
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18554) leader master lost the leadership; when the slave becomes master, the previous app's state displays as waiting

2016-11-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18554.
---
Resolution: Duplicate

[~jerryshao] is anything blocking you from proceeding on that PR?

> leader master lost the leadership; when the slave becomes master, the
> previous app's state displays as waiting
> --
>
> Key: SPARK-18554
> URL: https://issues.apache.org/jira/browse/SPARK-18554
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Web UI
>Affects Versions: 1.6.1
> Environment: java1.8
>Reporter: liujianhui
>Priority: Minor
>
> When the leader master loses leadership and the standby becomes master, the
> state of the app in the web UI is displayed as WAITING; the relevant code is as follows:
> {code}
> case MasterChangeAcknowledged(appId) => {
>   idToApp.get(appId) match {
>     case Some(app) =>
>       logInfo("Application has been re-registered: " + appId)
>       app.state = ApplicationState.WAITING
>     case None =>
>       logWarning("Master change ack from unknown app: " + appId)
>   }
>   if (canCompleteRecovery) { completeRecovery() }
> }
> {code}
> The state of the app should be RUNNING instead of WAITING.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-11-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695359#comment-15695359
 ] 

Sean Owen commented on SPARK-18581:
---

It sounds like there's definitely a problem here, but why is the
pseudo-determinant the issue? It's possible for eigenvalues to be in (0,1), in
which case the sum of logs is negative. Is that a problem per se?

> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training GaussianMixtureModel, I found some probabilities much larger than
> 1. That led me to the fact that the value returned by
> MultivariateGaussian.pdf can be 10^5, etc.
> After reviewing the code, I found that the problem lies in the computation of the
> determinant of the covariance matrix.
> The computation is simplified by using the pseudo-determinant of a positive
> semi-definite matrix. However, if the eigenvalues are all between 0 and 1,
> log(pseudo-determinant) will be a negative number like -50. As a result,
> the logpdf could be positive, thus pdf > 1.
> The related code is the following:
> // In function: MultivariateGaussian.calculateCovarianceConstants
> {code}
> val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
> {code}
> d is the eigenvalue vector here. If many of its elements are between 0 and 1,
> then logPseudoDetSigma could be negative.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18575) Keep same style: adjust the position of driver log links

2016-11-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18575:
--
Assignee: Genmao Yu
Target Version/s:   (was: 2.0.3, 2.1.0)
Priority: Trivial  (was: Minor)

> Keep same style: adjust the position of driver log links
> 
>
> Key: SPARK-18575
> URL: https://issues.apache.org/jira/browse/SPARK-18575
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.2
>Reporter: Genmao Yu
>Assignee: Genmao Yu
>Priority: Trivial
> Fix For: 2.1.0
>
>
> {{NOT A BUG}}: just adjust the position of the driver log link to keep the 
> same style as the other executors' log links.
> !https://cloud.githubusercontent.com/assets/7402327/20590092/f8bddbb8-b25b-11e6-9aaf-3b5b3073df10.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18575) Keep same style: adjust the position of driver log links

2016-11-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18575.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 16001
[https://github.com/apache/spark/pull/16001]

> Keep same style: adjust the position of driver log links
> 
>
> Key: SPARK-18575
> URL: https://issues.apache.org/jira/browse/SPARK-18575
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.2
>Reporter: Genmao Yu
>Priority: Minor
> Fix For: 2.1.0
>
>
> {{NOT A BUG}}: just adjust the position of the driver log link to keep the 
> same style as the other executors' log links.
> !https://cloud.githubusercontent.com/assets/7402327/20590092/f8bddbb8-b25b-11e6-9aaf-3b5b3073df10.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-11-25 Thread Hao Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao Ren updated SPARK-18581:

Description: 
When training a GaussianMixtureModel, I found some probabilities much larger 
than 1. That led me to the fact that the value returned by 
MultivariateGaussian.pdf can be on the order of 10^5.

After reviewing the code, I found that the problem lies in the computation of 
the determinant of the covariance matrix.

The computation is simplified by using the pseudo-determinant of a positive 
semi-definite matrix. However, if all the eigenvalues are between 0 and 1, 
log(pseudo-determinant) will be a negative number, e.g. -50. As a result, the 
logpdf could be positive, and thus pdf > 1.

The relevant code is the following:

{code}
// In function: MultivariateGaussian.calculateCovarianceConstants
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

d is the eigenvalue vector here. If many of its elements are between 0 and 1, 
then logPseudoDetSigma could be negative.



  was:
When training a GaussianMixtureModel, I found some probabilities much larger 
than 1. That led me to the fact that the value returned by 
MultivariateGaussian.pdf can be on the order of 10^5.

After reviewing the code, I found that the problem lies in the computation of 
the determinant of the covariance matrix.

The computation is simplified by using the pseudo-determinant of a positive 
semi-definite matrix. However, if all the eigenvalues are between 0 and 1, 
log(pseudo-determinant) will be a negative number, e.g. -50. As a result, the 
logpdf could be positive, and thus pdf > 1.

The relevant code is the following:

{code}
// In function: MultivariateGaussian.calculateCovarianceConstants()
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

d is the eigenvalue vector here. If many of its elements are between 0 and 1, 
then logPseudoDetSigma could be negative.




> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training a GaussianMixtureModel, I found some probabilities much larger 
> than 1. That led me to the fact that the value returned by 
> MultivariateGaussian.pdf can be on the order of 10^5.
> After reviewing the code, I found that the problem lies in the computation of 
> the determinant of the covariance matrix.
> The computation is simplified by using the pseudo-determinant of a positive 
> semi-definite matrix. However, if all the eigenvalues are between 0 and 1, 
> log(pseudo-determinant) will be a negative number, e.g. -50. As a result, 
> the logpdf could be positive, and thus pdf > 1.
> The relevant code is the following:
> {code}
> // In function: MultivariateGaussian.calculateCovarianceConstants
> val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
> {code}
> d is the eigenvalue vector here. If many of its elements are between 0 and 
> 1, then logPseudoDetSigma could be negative.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-11-25 Thread Hao Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao Ren updated SPARK-18581:

Description: 
When training a GaussianMixtureModel, I found some probabilities much larger 
than 1. That led me to the fact that the value returned by 
MultivariateGaussian.pdf can be on the order of 10^5.

After reviewing the code, I found that the problem lies in the computation of 
the determinant of the covariance matrix.

The computation is simplified by using the pseudo-determinant of a positive 
semi-definite matrix. However, if all the eigenvalues are between 0 and 1, 
log(pseudo-determinant) will be a negative number, e.g. -50. As a result, the 
logpdf could be positive, and thus pdf > 1.

The relevant code is the following:

{code}
// In function: MultivariateGaussian.calculateCovarianceConstants()
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

d is the eigenvalue vector here. If many of its elements are between 0 and 1, 
then logPseudoDetSigma could be negative.



  was:
When training a GaussianMixtureModel, I found some probabilities much larger 
than 1. That led me to the fact that the value returned by 
MultivariateGaussian.pdf can be on the order of 10^5.

After reviewing the code, I found that the problem lies in the computation of 
the determinant of the covariance matrix.

The computation is simplified by using the pseudo-determinant of a positive 
semi-definite matrix. However, if all the eigenvalues are between 0 and 1, 
log(pseudo-determinant) will be a negative number, e.g. -50. As a result, the 
logpdf becomes positive (pdf > 1).

The relevant code is the following:

{code}
// In function: MultivariateGaussian.calculateCovarianceConstants()
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

d is the eigenvalue vector here. If many of its elements are between 0 and 1, 
then logPseudoDetSigma could be negative.




> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training a GaussianMixtureModel, I found some probabilities much larger 
> than 1. That led me to the fact that the value returned by 
> MultivariateGaussian.pdf can be on the order of 10^5.
> After reviewing the code, I found that the problem lies in the computation of 
> the determinant of the covariance matrix.
> The computation is simplified by using the pseudo-determinant of a positive 
> semi-definite matrix. However, if all the eigenvalues are between 0 and 1, 
> log(pseudo-determinant) will be a negative number, e.g. -50. As a result, 
> the logpdf could be positive, and thus pdf > 1.
> The relevant code is the following:
> {code}
> // In function: MultivariateGaussian.calculateCovarianceConstants()
> val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
> {code}
> d is the eigenvalue vector here. If many of its elements are between 0 and 
> 1, then logPseudoDetSigma could be negative.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-11-25 Thread Hao Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao Ren updated SPARK-18581:

Description: 
When training a GaussianMixtureModel, I found some probabilities much larger 
than 1. That led me to the fact that the value returned by 
MultivariateGaussian.pdf can be on the order of 10^5.

After reviewing the code, I found that the problem lies in the computation of 
the determinant of the covariance matrix.

The computation is simplified by using the pseudo-determinant of a positive 
semi-definite matrix. However, if all the eigenvalues are between 0 and 1, 
log(pseudo-determinant) will be a negative number, e.g. -50. As a result, the 
logpdf becomes positive (pdf > 1).

The relevant code is the following:

{code}
// In function: MultivariateGaussian.calculateCovarianceConstants()
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

d is the eigenvalue vector here. If many of its elements are between 0 and 1, 
then logPseudoDetSigma could be negative.



  was:
When training a GaussianMixtureModel, I found some probabilities much larger 
than 1. That led me to the fact that the value returned by 
MultivariateGaussian.pdf can be on the order of 10^5.

After reviewing the code, I found that the problem lies in the computation of 
the determinant of the covariance matrix.

The computation is simplified by using the pseudo-determinant of a positive 
semi-definite matrix. However, if all the eigenvalues are between 0 and 1, 
log(pseudo-determinant) will be a negative number, e.g. -50. As a result, the 
logpdf becomes positive (pdf > 1).

The relevant code is the following:

{code}
// In function: MultivariateGaussian.calculateCovarianceConstants()
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
{code}

d is the eigenvalue vector here. If many of its elements are between 0 and 1, 
then logPseudoDetSigma could be negative.

Maybe we should just use the Breeze 'det' operation on sigma to get the right 
but slow answer instead of a quick, wrong one.


> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training a GaussianMixtureModel, I found some probabilities much larger 
> than 1. That led me to the fact that the value returned by 
> MultivariateGaussian.pdf can be on the order of 10^5.
> After reviewing the code, I found that the problem lies in the computation of 
> the determinant of the covariance matrix.
> The computation is simplified by using the pseudo-determinant of a positive 
> semi-definite matrix. However, if all the eigenvalues are between 0 and 1, 
> log(pseudo-determinant) will be a negative number, e.g. -50. As a result, 
> the logpdf becomes positive (pdf > 1).
> The relevant code is the following:
> {code}
> // In function: MultivariateGaussian.calculateCovarianceConstants()
> val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
> {code}
> d is the eigenvalue vector here. If many of its elements are between 0 and 
> 1, then logPseudoDetSigma could be negative.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18581) MultivariateGaussian does not check if covariance matrix is invertible

2016-11-25 Thread Hao Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao Ren updated SPARK-18581:

Summary: MultivariateGaussian does not check if covariance matrix is 
invertible  (was: MultivariateGaussian returns pdf value larger than 1)

> MultivariateGaussian does not check if covariance matrix is invertible
> --
>
> Key: SPARK-18581
> URL: https://issues.apache.org/jira/browse/SPARK-18581
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Hao Ren
>
> When training a GaussianMixtureModel, I found some probabilities much larger 
> than 1. That led me to the fact that the value returned by 
> MultivariateGaussian.pdf can be on the order of 10^5.
> After reviewing the code, I found that the problem lies in the computation of 
> the determinant of the covariance matrix.
> The computation is simplified by using the pseudo-determinant of a positive 
> semi-definite matrix. However, if all the eigenvalues are between 0 and 1, 
> log(pseudo-determinant) will be a negative number, e.g. -50. As a result, 
> the logpdf becomes positive (pdf > 1).
> The relevant code is the following:
> {code}
> // In function: MultivariateGaussian.calculateCovarianceConstants()
> val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
> {code}
> d is the eigenvalue vector here. If many of its elements are between 0 and 
> 1, then logPseudoDetSigma could be negative.
> Maybe we should just use the Breeze 'det' operation on sigma to get the right 
> but slow answer instead of a quick, wrong one.
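A sketch of that suggestion, assuming Breeze's det: it is exact but O(n^3), and it returns 0 for a singular matrix, so log(det) would be -Infinity and an explicit invertibility check would still be needed:

{code}
import breeze.linalg.{DenseMatrix, det}

// Exact determinant (LU-based) of a small covariance matrix; illustrative values only.
val sigma = DenseMatrix((2.0, 0.3), (0.3, 1.0))
val logDetSigma = math.log(det(sigma))   // log(2.0 * 1.0 - 0.3 * 0.3) = log(1.91)
{code}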



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18587) Remove handleInvalid from QuantileDiscretizer

2016-11-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18587:


Assignee: Apache Spark

> Remove handleInvalid from QuantileDiscretizer
> -
>
> Key: SPARK-18587
> URL: https://issues.apache.org/jira/browse/SPARK-18587
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Critical
>
> {{handleInvalid}} only matters when {{Bucketizer}} transforms a dataset that 
> contains NaN; however, when the training dataset contains NaN, 
> {{QuantileDiscretizer}} always ignores those rows. So we should keep 
> {{handleInvalid}} in {{Bucketizer}} and remove it from 
> {{QuantileDiscretizer}}.
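A hedged sketch of the {{Bucketizer}} side, with made-up column names and splits, assuming {{handleInvalid}} accepts "keep", "skip", and "error":

{code}
import org.apache.spark.ml.feature.Bucketizer

// NaN handling stays on Bucketizer at transform time: "keep" maps NaN to an extra
// bucket, "skip" drops those rows, and "error" fails the transform.
val bucketizer = new Bucketizer()
  .setInputCol("hour")                 // hypothetical input column
  .setOutputCol("hourBucket")          // hypothetical output column
  .setSplits(Array(Double.NegativeInfinity, 6.0, 12.0, 18.0, Double.PositiveInfinity))
  .setHandleInvalid("keep")
{code}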



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18587) Remove handleInvalid from QuantileDiscretizer

2016-11-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15695291#comment-15695291
 ] 

Apache Spark commented on SPARK-18587:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/16011

> Remove handleInvalid from QuantileDiscretizer
> -
>
> Key: SPARK-18587
> URL: https://issues.apache.org/jira/browse/SPARK-18587
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Critical
>
> {{handleInvalid}} only matters when {{Bucketizer}} transforms a dataset that 
> contains NaN; however, when the training dataset contains NaN, 
> {{QuantileDiscretizer}} always ignores those rows. So we should keep 
> {{handleInvalid}} in {{Bucketizer}} and remove it from 
> {{QuantileDiscretizer}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18587) Remove handleInvalid from QuantileDiscretizer

2016-11-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18587:


Assignee: (was: Apache Spark)

> Remove handleInvalid from QuantileDiscretizer
> -
>
> Key: SPARK-18587
> URL: https://issues.apache.org/jira/browse/SPARK-18587
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Critical
>
> {{handleInvalid}} only matters when {{Bucketizer}} transforms a dataset that 
> contains NaN; however, when the training dataset contains NaN, 
> {{QuantileDiscretizer}} always ignores those rows. So we should keep 
> {{handleInvalid}} in {{Bucketizer}} and remove it from 
> {{QuantileDiscretizer}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


