[jira] [Reopened] (SPARK-21101) Error running Hive temporary UDTF on latest Spark 2.2

2017-06-14 Thread Dayou Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayou Zhou reopened SPARK-21101:


> Error running Hive temporary UDTF on latest Spark 2.2
> -
>
> Key: SPARK-21101
> URL: https://issues.apache.org/jira/browse/SPARK-21101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dayou Zhou
>
> I'm using temporary UDTFs on Spark 2.2, e.g.
> CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
> 'hdfs:///path/to/udf.jar'; 
> But when I try to invoke it, I get the following error:
> {noformat}
> 17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Any help appreciated, thanks.
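
For reference, a minimal sketch of what a Hive UDTF registered this way might look like is below. It assumes the Hive 1.2 exec/serde2 classes that ship with Spark are on the classpath; the package, class name, and output column are made-up stand-ins for `com.foo.MyUdtf`, not the reporter's actual code.

{code}
package com.foo

import java.util.Arrays.asList

import org.apache.hadoop.hive.ql.exec.UDFArgumentException
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory, StructObjectInspector}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

// Emits one output row (a single string column named "value") per input row.
class MyUdtf extends GenericUDTF {

  override def initialize(argOIs: Array[ObjectInspector]): StructObjectInspector = {
    if (argOIs.length != 1) {
      throw new UDFArgumentException("MyUdtf takes exactly one argument")
    }
    ObjectInspectorFactory.getStandardStructObjectInspector(
      asList("value"),
      asList[ObjectInspector](PrimitiveObjectInspectorFactory.javaStringObjectInspector))
  }

  override def process(args: Array[AnyRef]): Unit = {
    // forward() emits one output row; the array layout must match the struct above.
    forward(Array[AnyRef](String.valueOf(args(0))))
  }

  override def close(): Unit = ()
}
{code}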






[jira] [Commented] (SPARK-21101) Error running Hive temporary UDTF on latest Spark 2.2

2017-06-14 Thread Dayou Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050031#comment-16050031
 ] 

Dayou Zhou commented on SPARK-21101:


Hi [~maropu],

>> I'll close this because this seems to be a bug.

This sounds bizarre; maybe you meant it wasn't a bug. In any case, I did not 
start by asking a question, I started by reporting an error which is probably a 
bug. What is your justification that it is NOT a bug, and what is your 
justification for closing it as 'Not A Problem' when you don't even seem to 
understand it?

> Error running Hive temporary UDTF on latest Spark 2.2
> -
>
> Key: SPARK-21101
> URL: https://issues.apache.org/jira/browse/SPARK-21101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dayou Zhou
>
> I'm using temporary UDTFs on Spark 2.2, e.g.
> CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
> 'hdfs:///path/to/udf.jar'; 
> But when I try to invoke it, I get the following error:
> {noformat}
> 17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Any help appreciated, thanks.






[jira] [Assigned] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-06-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-18016:
---

Assignee: Aleksander Eskilson

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>   at 
> 
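
To make the failure mode concrete, here is a small sketch that builds an artificially wide projection. The column count and the local master are arbitrary illustrations; whether a given count actually overflows the constant pool depends on the Spark version and the schema shape.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("WideSchemaSketch").master("local[*]").getOrCreate()

// Generated classes such as SpecificUnsafeProjection grow with the number of
// columns; with enough columns some versions hit the 64KB / constant-pool limits.
val numCols = 4000 // arbitrary illustration, not a hard threshold
val wideCols = (0 until numCols).map(i => lit(i).as(s"c$i"))

val wide = spark.range(1).select(wideCols: _*)
wide.collect()
{code}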

[jira] [Commented] (SPARK-20851) Drop spark table failed if a column name is a numeric string

2017-06-14 Thread Chen Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050028#comment-16050028
 ] 

Chen Gong commented on SPARK-20851:
---

[~benyuel] [~maropu] Thanks for your comments. The root cause of this problem 
hasn't been figured out yet; I will keep investigating.

One clue I am considering is that it may be related to the MySQL database that 
stores the metadata of the Spark tables.

> Drop spark table failed if a column name is a numeric string
> 
>
> Key: SPARK-20851
> URL: https://issues.apache.org/jira/browse/SPARK-20851
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: linux redhat
>Reporter: Chen Gong
>
> I tried to read a json file to a spark dataframe
> {noformat}
> df = spark.read.json('path.json')
> df.write.parquet('dataframe', compression='snappy')
> {noformat}
> However, some column names are numeric strings, such as 
> "989238883". Then I created a Spark SQL table using this:
> {noformat}
> create table if not exists `a` using org.apache.spark.sql.parquet options 
> (path 'dataframe');  // It works well
> {noformat}
> But after the table is created, any operation on this table, like select or 
> drop table, raises the same exception below:
> {noformat}
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> array>,url:string,width:bigint>>,audit_id:bigint,author_id:bigint,body:string,brand_id:string,created_at:string,custom_ticket_fields:struct<49244727:string,51588527:string,51591767:string,51950848:string,51950868:string,51950888:string,51950928:string,52359587:string,55276747:string,56958227:string,57080067:string,57080667:string,57107727:string,57112447:string,57113207:string,57411128:string,57424648:string,57442588:string,62382188:string,74862088:string,74871788:string>,event_type:string,group_id:bigint,html_body:string,id:bigint,is_public:string,locale_id:string,organization_id:string,plain_body:string,previous_value:string,priority:string,public:boolean,rel:string,removed_tags:array,requester_id:bigint,satisfaction_probability:string,satisfaction_score:string,sla_policy:string,status:string,tags:array,ticket_form_id:string,type:string,via:string,via_reference_id:bigint>>
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:785)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10$$anonfun$7.apply(HiveClientImpl.scala:365)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10$$anonfun$7.apply(HiveClientImpl.scala:365)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10.apply(HiveClientImpl.scala:365)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$10.apply(HiveClientImpl.scala:361)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:361)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:359)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:230)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:272)
>   at 
> 
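
A self-contained sketch of the reproduction described above; the path, table name, and local master are placeholders, and it needs a Hive-enabled Spark build.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("NumericColumnName")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// A column whose name is a numeric string, as in the report above.
Seq((1, "a"), (2, "b")).toDF("989238883", "name")
  .write.mode("overwrite").parquet("/tmp/numeric_col_test")

spark.sql(
  "create table if not exists a using org.apache.spark.sql.parquet " +
    "options (path '/tmp/numeric_col_test')")

// On affected versions the following statements fail while parsing the
// Hive column-type string, as in the stack trace above.
spark.sql("select * from a").show()
spark.sql("drop table a")
{code}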

[jira] [Resolved] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-06-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18016.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18075
[https://github.com/apache/spark/pull/18075]

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> 

[jira] [Commented] (SPARK-20980) Rename the option `wholeFile` to `multiLine` for JSON and CSV

2017-06-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050023#comment-16050023
 ] 

Apache Spark commented on SPARK-20980:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/18312

> Rename the option `wholeFile` to `multiLine` for JSON and CSV
> -
>
> Key: SPARK-20980
> URL: https://issues.apache.org/jira/browse/SPARK-20980
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> The current option name `wholeFile` is misleading for CSV: it does not mean one 
> record per file, since one file can contain multiple records. Thus, we should 
> rename it; the proposal is `multiLine`.
> To make it consistent, we need to rename the same option for JSON as well and 
> fix that in another JIRA.
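
For anyone arriving from the old option name, a short usage sketch with the renamed option (the file paths are placeholders):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MultiLineRead").master("local[*]").getOrCreate()

// Spark 2.2+: `multiLine` replaces `wholeFile` for both the JSON and CSV readers.
val json = spark.read
  .option("multiLine", "true") // one JSON record may span several lines
  .json("/tmp/records.json")

val csv = spark.read
  .option("header", "true")
  .option("multiLine", "true") // quoted CSV fields may contain newlines
  .csv("/tmp/records.csv")
{code}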






[jira] [Commented] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-14 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050016#comment-16050016
 ] 

Saisai Shao commented on SPARK-21082:
-

It's fine if the storage memory is not enough to cache all the data; Spark can 
still handle this scenario without OOM. Scheduling tasks based on free memory is 
too scenario-specific, from my understanding.

[~tgraves] [~irashid] [~mridulm80] may have more thoughts on it. 

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: DjvuLee
>
> The Spark scheduler does not consider memory usage when dispatching tasks. This 
> can sometimes lead to executor OOM when the RDD is cached, because Spark cannot 
> estimate the memory usage well enough (especially when the RDD type is not 
> flat), so the scheduler may dispatch too many tasks to one executor.
> We can offer a configuration letting the user decide whether the scheduler 
> should consider memory usage.






[jira] [Resolved] (SPARK-20980) Rename the option `wholeFile` to `multiLine` for JSON and CSV

2017-06-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20980.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18202
[https://github.com/apache/spark/pull/18202]

> Rename the option `wholeFile` to `multiLine` for JSON and CSV
> -
>
> Key: SPARK-20980
> URL: https://issues.apache.org/jira/browse/SPARK-20980
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> The current option name `wholeFile` is misleading for CSV: it does not mean one 
> record per file, since one file can contain multiple records. Thus, we should 
> rename it; the proposal is `multiLine`.
> To make it consistent, we need to rename the same option for JSON as well and 
> fix that in another JIRA.






[jira] [Commented] (SPARK-21074) Parquet files are read fully even though only count() is requested

2017-06-14 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050005#comment-16050005
 ] 

Takeshi Yamamuro commented on SPARK-21074:
--

Since this is expected behaviour and I don't think this is a bug, I'll set the 
issue type to "Improvement".

> Parquet files are read fully even though only count() is requested
> --
>
> Key: SPARK-21074
> URL: https://issues.apache.org/jira/browse/SPARK-21074
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Michael Spector
>
> I have the following sample code that creates parquet files:
> {code:java}
> val spark = SparkSession.builder()
>   .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 
> "2")
>   .config("spark.hadoop.parquet.metadata.read.parallelism", "50")
>   .appName("Test Write").getOrCreate()
> val sqc = spark.sqlContext
> import sqc.implicits._
> val random = new scala.util.Random(31L)
> (1465720077 to 1465720077+1000).map(x => Event(x, random.nextString(2)))
>   .toDS()
>   .write
>   .mode(SaveMode.Overwrite)
>   .parquet("s3://my-bucket/test")
> {code}
> Afterwards, I'm trying to read these files with the following code:
> {code:java}
> val spark = SparkSession.builder()
>   .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 
> "2")
>   .config("spark.hadoop.parquet.metadata.read.parallelism", "50")
>   .config("spark.sql.parquet.filterPushdown", "true")
>   .appName("Test Read").getOrCreate()
> spark.sqlContext.read
>   .option("mergeSchema", "false")
>   .parquet("s3://my-bucket/test")
>   .count()
> {code}
> I've enabled the DEBUG log level to see which requests are actually sent through 
> the S3 API, and I've found that in addition to the Parquet "footer" retrieval 
> there are requests that ask for the whole file content.
> For example, this is a full-content request:
> {noformat}
> 17/06/13 05:46:50 DEBUG wire: http-outgoing-1 >> "GET 
> /test/part-0-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet 
> HTTP/1.1[\r][\n]"
> 
> 17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Range: bytes 
> 0-7472093/7472094[\r][\n]"
> 
> 17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Length: 
> 7472094[\r][\n]"
> {noformat}
> And this is an example of a partial request for the footer only:
> {noformat}
> 17/06/13 05:46:50 DEBUG headers: http-outgoing-2 >> GET 
> /test/part-0-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet HTTP/1.1
> 
> 17/06/13 05:46:50 DEBUG headers: http-outgoing-2 >> Range: 
> bytes=7472086-7472094
> ...
> 17/06/13 05:46:50 DEBUG wire: http-outgoing-2 << "Content-Length: 8[\r][\n]"
> 
> {noformat}
> Here's what FileScanRDD prints:
> {noformat}
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-4-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7473020, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-00011-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472503, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-6-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472501, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-7-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7473104, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-3-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472458, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-00012-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472594, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-1-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472984, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-00014-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472720, partition values: [empty row]
> 17/06/13 05:46:53 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-8-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472339, partition values: [empty row]
> 17/06/13 05:46:53 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-00015-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472437, partition values: [empty row]
> 17/06/13 05:46:53 INFO FileScanRDD: Reading File path: 
> 
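
Not a fix for the full-object reads themselves, but one setting worth checking here, assuming the s3a:// connector on Hadoop 2.8+, is the S3A input policy; the property below is the Hadoop S3A one (marked experimental), not a Spark-specific option.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Test Read")
  // Hadoop 2.8+ S3A option: prefer ranged GETs over streaming whole objects,
  // which suits footer/metadata-heavy Parquet access patterns.
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  .getOrCreate()

spark.read
  .option("mergeSchema", "false")
  .parquet("s3a://my-bucket/test")
  .count()
{code}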

[jira] [Updated] (SPARK-21074) Parquet files are read fully even though only count() is requested

2017-06-14 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-21074:
-
Issue Type: Improvement  (was: Bug)

> Parquet files are read fully even though only count() is requested
> --
>
> Key: SPARK-21074
> URL: https://issues.apache.org/jira/browse/SPARK-21074
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Michael Spector
>
> I have the following sample code that creates parquet files:
> {code:java}
> val spark = SparkSession.builder()
>   .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 
> "2")
>   .config("spark.hadoop.parquet.metadata.read.parallelism", "50")
>   .appName("Test Write").getOrCreate()
> val sqc = spark.sqlContext
> import sqc.implicits._
> val random = new scala.util.Random(31L)
> (1465720077 to 1465720077+1000).map(x => Event(x, random.nextString(2)))
>   .toDS()
>   .write
>   .mode(SaveMode.Overwrite)
>   .parquet("s3://my-bucket/test")
> {code}
> Afterwards, I'm trying to read these files with the following code:
> {code:java}
> val spark = SparkSession.builder()
>   .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 
> "2")
>   .config("spark.hadoop.parquet.metadata.read.parallelism", "50")
>   .config("spark.sql.parquet.filterPushdown", "true")
>   .appName("Test Read").getOrCreate()
> spark.sqlContext.read
>   .option("mergeSchema", "false")
>   .parquet("s3://my-bucket/test")
>   .count()
> {code}
> I've enabled the DEBUG log level to see which requests are actually sent through 
> the S3 API, and I've found that in addition to the Parquet "footer" retrieval 
> there are requests that ask for the whole file content.
> For example, this is a full-content request:
> {noformat}
> 17/06/13 05:46:50 DEBUG wire: http-outgoing-1 >> "GET 
> /test/part-0-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet 
> HTTP/1.1[\r][\n]"
> 
> 17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Range: bytes 
> 0-7472093/7472094[\r][\n]"
> 
> 17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Length: 
> 7472094[\r][\n]"
> {noformat}
> And this is an example of a partial request for the footer only:
> {noformat}
> 17/06/13 05:46:50 DEBUG headers: http-outgoing-2 >> GET 
> /test/part-0-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet HTTP/1.1
> 
> 17/06/13 05:46:50 DEBUG headers: http-outgoing-2 >> Range: 
> bytes=7472086-7472094
> ...
> 17/06/13 05:46:50 DEBUG wire: http-outgoing-2 << "Content-Length: 8[\r][\n]"
> 
> {noformat}
> Here's what FileScanRDD prints:
> {noformat}
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-4-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7473020, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-00011-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472503, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-6-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472501, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-7-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7473104, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-3-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472458, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-00012-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472594, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-1-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472984, partition values: [empty row]
> 17/06/13 05:46:52 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-00014-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472720, partition values: [empty row]
> 17/06/13 05:46:53 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-8-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472339, partition values: [empty row]
> 17/06/13 05:46:53 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-00015-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472437, partition values: [empty row]
> 17/06/13 05:46:53 INFO FileScanRDD: Reading File path: 
> s3://my-bucket/test/part-00013-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet,
>  range: 0-7472312, partition values: 

[jira] [Resolved] (SPARK-21092) Wire SQLConf in logical plan and expressions

2017-06-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-21092.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Wire SQLConf in logical plan and expressions
> 
>
> Key: SPARK-21092
> URL: https://issues.apache.org/jira/browse/SPARK-21092
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.3.0
>
>
> It is really painful to not have configs in logical plan and expressions. We 
> had to add all sorts of hacks (e.g. pass SQLConf explicitly in functions). 
> This ticket exposes SQLConf in logical plan, using a thread local variable 
> and a getter closure that's set once there is an active SparkSession.
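
A rough illustration of the thread-local-plus-getter-closure shape described above, with made-up names and a plain Map standing in for SQLConf; this is not the actual Spark implementation.

{code}
// Illustration of the pattern only, not Spark's actual code.
object ActiveConf {
  // Each thread sees the getter installed by whatever "session" is active on it.
  private val getter = new ThreadLocal[() => Map[String, String]] {
    override def initialValue(): () => Map[String, String] = () => Map.empty
  }

  def setGetter(g: () => Map[String, String]): Unit = getter.set(g)

  // Plan/expression code reads the conf through this, with no explicit parameter.
  def get: Map[String, String] = getter.get()()
}

final class FakeSession(settings: Map[String, String]) {
  // Installed once the session becomes active on the current thread.
  def makeActive(): Unit = ActiveConf.setGetter(() => settings)
}

// new FakeSession(Map("spark.sql.shuffle.partitions" -> "10")).makeActive()
// ActiveConf.get.getOrElse("spark.sql.shuffle.partitions", "200")  // => "10"
{code}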






[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2017-06-14 Thread remoteServer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049979#comment-16049979
 ] 

remoteServer commented on SPARK-15905:
--

I faced the same issue. Increasing driver memory helped. You can run "jstat" 
against the driver and check the frequency of full GCs.
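
If the jstack above really shows the console progress bar blocked on stdout, another mitigation besides more driver memory is to turn the bar off via the existing `spark.ui.showConsoleProgress` setting; the sketch below uses the SparkContext API to match the 1.6-era versions affected.

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("no-console-progress")
  .setMaster("local[*]")
  // With the bar disabled, the "refresh progress" (ConsoleProgressBar) thread
  // shown in the jstack above is never started.
  .set("spark.ui.showConsoleProgress", "false")

val sc = new SparkContext(conf)
{code}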

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to the driver not being able to get heartbeats from its executors and 
> the job being stuck. After looking at the lock dependencies amongst the driver 
> threads in the jstack output, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}






[jira] [Closed] (SPARK-21101) Error running Hive temporary UDTF on latest Spark 2.2

2017-06-14 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro closed SPARK-21101.

Resolution: Not A Problem

> Error running Hive temporary UDTF on latest Spark 2.2
> -
>
> Key: SPARK-21101
> URL: https://issues.apache.org/jira/browse/SPARK-21101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dayou Zhou
>
> I'm using temporary UDTFs on Spark 2.2, e.g.
> CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
> 'hdfs:///path/to/udf.jar'; 
> But when I try to invoke it, I get the following error:
> {noformat}
> 17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Any help appreciated, thanks.






[jira] [Comment Edited] (SPARK-21101) Error running Hive temporary UDTF on latest Spark 2.2

2017-06-14 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049977#comment-16049977
 ] 

Takeshi Yamamuro edited comment on SPARK-21101 at 6/15/17 4:16 AM:
---

Since JIRA is not a place for questions, you better ask in spark-user. I'll 
close this because this seems to be a bug. If you find this is a bug, feel free 
to reopen this. Thanks.


was (Author: maropu):
Since JIRA is not a place for questions, you better ask in spark-user.

> Error running Hive temporary UDTF on latest Spark 2.2
> -
>
> Key: SPARK-21101
> URL: https://issues.apache.org/jira/browse/SPARK-21101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dayou Zhou
>
> I'm using temporary UDTFs on Spark 2.2, e.g.
> CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
> 'hdfs:///path/to/udf.jar'; 
> But when I try to invoke it, I get the following error:
> {noformat}
> 17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Any help appreciated, thanks.






[jira] [Commented] (SPARK-21101) Error running Hive temporary UDTF on latest Spark 2.2

2017-06-14 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049977#comment-16049977
 ] 

Takeshi Yamamuro commented on SPARK-21101:
--

Since JIRA is not a place for questions, you better ask in spark-user.

> Error running Hive temporary UDTF on latest Spark 2.2
> -
>
> Key: SPARK-21101
> URL: https://issues.apache.org/jira/browse/SPARK-21101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dayou Zhou
>
> I'm using temporary UDTFs on Spark 2.2, e.g.
> CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
> 'hdfs:///path/to/udf.jar'; 
> But when I try to invoke it, I get the following error:
> {noformat}
> 17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Any help appreciated, thanks.






[jira] [Updated] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-14 Thread DjvuLee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DjvuLee updated SPARK-21082:

Affects Version/s: (was: 2.2.1)
   2.3.0

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: DjvuLee
>
> The Spark scheduler does not consider memory usage when dispatching tasks. This 
> can sometimes lead to executor OOM when the RDD is cached, because Spark cannot 
> estimate the memory usage well enough (especially when the RDD type is not 
> flat), so the scheduler may dispatch too many tasks to one executor.
> We can offer a configuration letting the user decide whether the scheduler 
> should consider memory usage.






[jira] [Commented] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-14 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049965#comment-16049965
 ] 

DjvuLee commented on SPARK-21082:
-

Data locality, task input size, and scheduling order all matter a lot, even when 
all the nodes have the same computation capacity.


Suppose there are two executors with the same computation capacity and four tasks 
with input sizes of 10 GB, 3 GB, 10 GB, and 20 GB.
Under the current scheduling policy there is a chance that one executor will end 
up caching 30 GB (10 GB + 20 GB) while the other caches only 13 GB (3 GB + 10 GB).
If each executor has only 25 GB of storage memory, then not all of the data can be 
cached in memory.

I will give a more detailed description of the proposal if this seems OK.
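
As a rough illustration of the information the proposal would need, the driver can already see per-executor storage memory through `SparkContext.getExecutorMemoryStatus`; the sketch below only prints it and is not a scheduling change.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("executor-memory-status").getOrCreate()
val sc = spark.sparkContext

// Values are (max memory available for caching, remaining memory), per block manager.
sc.getExecutorMemoryStatus.foreach { case (blockManager, (maxMem, remaining)) =>
  val usedMb = (maxMem - remaining) / (1024 * 1024)
  val maxMb  = maxMem / (1024 * 1024)
  println(s"$blockManager: storage used $usedMb MB of $maxMb MB")
}
{code}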

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.1
>Reporter: DjvuLee
>
> The Spark scheduler does not consider memory usage when dispatching tasks. This 
> can sometimes lead to executor OOM when the RDD is cached, because Spark cannot 
> estimate the memory usage well enough (especially when the RDD type is not 
> flat), so the scheduler may dispatch too many tasks to one executor.
> We can offer a configuration letting the user decide whether the scheduler 
> should consider memory usage.






[jira] [Comment Edited] (SPARK-19900) [Standalone] Master registers application again when driver relaunched

2017-06-14 Thread Li Yichao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049942#comment-16049942
 ] 

Li Yichao edited comment on SPARK-19900 at 6/15/17 3:29 AM:


My user name (and JIRA login name) is `lyc`, and my full name is `Li Yichao`. It 
seems JIRA has a bug when the full name is the same as the user name, so I changed 
my full name; now I can be mentioned by `@Li Yichao`.


was (Author: lyc):
My user name (and JIRA login name) is `lyc`, and my full name is `Li Yichao`. It 
seems JIRA has a bug when the full name is the same as the user name, so I changed 
my full name; now I can be mentioned. [~lyc]

> [Standalone] Master registers application again when driver relaunched
> --
>
> Key: SPARK-19900
> URL: https://issues.apache.org/jira/browse/SPARK-19900
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.6.2
> Environment: Centos 6.5, spark standalone
>Reporter: Sergey
>Priority: Critical
>  Labels: Spark, network, standalone, supervise
> Fix For: 2.3.0
>
>
> I've found some problems when the node where the driver is running has an 
> unstable network. A situation is possible in which two identical applications 
> are running on a cluster.
> *Steps to Reproduce:*
> # prepare 3 nodes: one for the spark master and two for the spark workers.
> # submit an application with parameter spark.driver.supervise = true
> # go to the node where driver is running (for example spark-worker-1) and 
> close 7077 port
> {code}
> # iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # wait more than 60 seconds
> # look at the spark master UI
> There are two spark applications and one driver. The new application has 
> WAITING state and the second application has RUNNING state. Driver has 
> RUNNING or RELAUNCHING state (it depends on the resources available, as I 
> understand it) and it is launched on another node (for example spark-worker-2)
> # open the port
> {code}
> # iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # look at the spark UI again
> There are no changes
> In addition, if you look at the processes on the node spark-worker-1
> {code}
> # ps ax | grep spark
> {code}
>  you will see that the old driver is still working!
> *Spark master logs:*
> {code}
> 17/03/10 05:26:27 WARN Master: Removing 
> worker-20170310052240-spark-worker-1-35039 because we got no heartbeat in 60 
> seconds
> 17/03/10 05:26:27 INFO Master: Removing worker 
> worker-20170310052240-spark-worker-1-35039 on spark-worker-1:35039
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 1
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 0
> 17/03/10 05:26:27 INFO Master: Re-launching driver-20170310052347-
> 17/03/10 05:26:27 INFO Master: Launching driver driver-20170310052347- on 
> worker worker-20170310052411-spark-worker-2-40473
> 17/03/10 05:26:35 INFO Master: Registering app TestApplication
> 17/03/10 05:26:35 INFO Master: Registered app TestApplication with ID 
> app-20170310052635-0001
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/1
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/0
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> 

[jira] [Comment Edited] (SPARK-19900) [Standalone] Master registers application again when driver relaunched

2017-06-14 Thread Li Yichao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049942#comment-16049942
 ] 

Li Yichao edited comment on SPARK-19900 at 6/15/17 3:26 AM:


My user name (and JIRA login name) is `lyc`, and my full name is `Li Yichao`. It 
seems JIRA has a bug when the full name is the same as the user name, so I changed 
my full name; now I can be mentioned. [~lyc]


was (Author: lyc):
Hi, what's the meaning of JIRA id? I only know that my user name (and JIRA 
login name) is `lyc`, is that what you want?

> [Standalone] Master registers application again when driver relaunched
> --
>
> Key: SPARK-19900
> URL: https://issues.apache.org/jira/browse/SPARK-19900
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.6.2
> Environment: Centos 6.5, spark standalone
>Reporter: Sergey
>Priority: Critical
>  Labels: Spark, network, standalone, supervise
> Fix For: 2.3.0
>
>
> I've found some problems when the node where the driver is running has an 
> unstable network. A situation is possible in which two identical applications 
> are running on a cluster.
> *Steps to Reproduce:*
> # prepare 3 nodes: one for the spark master and two for the spark workers.
> # submit an application with parameter spark.driver.supervise = true
> # go to the node where driver is running (for example spark-worker-1) and 
> close 7077 port
> {code}
> # iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # wait more than 60 seconds
> # look at the spark master UI
> There are two spark applications and one driver. The new application has 
> WAITING state and the second application has RUNNING state. Driver has 
> RUNNING or RELAUNCHING state (it depends on the resources available, as I 
> understand it) and it is launched on another node (for example spark-worker-2)
> # open the port
> {code}
> # iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # look at the spark UI again
> There are no changes
> In addition, if you look at the processes on the node spark-worker-1
> {code}
> # ps ax | grep spark
> {code}
>  you will see that the old driver is still working!
> *Spark master logs:*
> {code}
> 17/03/10 05:26:27 WARN Master: Removing 
> worker-20170310052240-spark-worker-1-35039 because we got no heartbeat in 60 
> seconds
> 17/03/10 05:26:27 INFO Master: Removing worker 
> worker-20170310052240-spark-worker-1-35039 on spark-worker-1:35039
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 1
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 0
> 17/03/10 05:26:27 INFO Master: Re-launching driver-20170310052347-
> 17/03/10 05:26:27 INFO Master: Launching driver driver-20170310052347- on 
> worker worker-20170310052411-spark-worker-2-40473
> 17/03/10 05:26:35 INFO Master: Registering app TestApplication
> 17/03/10 05:26:35 INFO Master: Registered app TestApplication with ID 
> app-20170310052635-0001
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/1
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/0
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN 

[jira] [Updated] (SPARK-20869) Master should clear failed apps when worker down

2017-06-14 Thread lyc (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lyc updated SPARK-20869:

Description: 
In `Master.removeWorker`, the master clears executor and driver state but does 
not clear app state. App state is cleared only when `UnregisterApplication` is 
received or when `onDisconnect` fires; the first happens when the driver shuts 
down gracefully, the second when `netty`'s `channelInactive` is invoked (i.e. 
when the channel is closed). Neither handles the case where there is a network 
partition between master and worker.

Follow the steps in 
[SPARK-19900|https://issues.apache.org/jira/browse/SPARK-19900], and see the 
[screenshots|https://cloud.githubusercontent.com/assets/2576762/26398697/d50735a4-40ac-11e7-80d8-6e9e1cf0b62f.png]:
 when worker1 is partitioned from the master, the app `app-xxx-000` keeps 
running instead of being marked finished, even though worker1 is down.

cc [~CodingCat]
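
For illustration, a minimal, self-contained sketch of the bookkeeping gap 
(hypothetical types and names, not Spark's actual Master code): if removeWorker 
only clears executor and driver state, an app whose driver lived on the lost 
worker is never marked finished.

{code}
// Hypothetical sketch, not Spark's Master: models why apps leak when only
// executor/driver state is cleared on worker loss.
object MasterStateSketch {
  import scala.collection.mutable

  case class AppInfo(id: String, var finished: Boolean = false)
  case class DriverInfo(id: String, appId: String)
  case class WorkerInfo(id: String, drivers: Seq[DriverInfo])

  val apps = mutable.Map[String, AppInfo]()
  val drivers = mutable.Map[String, DriverInfo]()

  def removeWorker(worker: WorkerInfo): Unit = {
    worker.drivers.foreach { d =>
      drivers.remove(d.id)                      // driver state is cleared today
      // Proposed addition: also finish the app tied to that driver, because
      // UnregisterApplication / onDisconnect never arrive under a network
      // partition between master and worker.
      apps.get(d.appId).foreach(_.finished = true)
    }
  }

  def main(args: Array[String]): Unit = {
    apps("app-xxx-000") = AppInfo("app-xxx-000")
    drivers("driver-1") = DriverInfo("driver-1", "app-xxx-000")
    removeWorker(WorkerInfo("worker1", Seq(drivers("driver-1"))))
    println(apps("app-xxx-000")) // AppInfo(app-xxx-000,true) with the proposed change
  }
}
{code}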


  was:
In `Master.removeWorker`, the master clears executor and driver state but does 
not clear app state. App state is cleared only when `UnregisterApplication` is 
received or when `onDisconnect` fires; the first happens when the driver shuts 
down gracefully, the second when `netty`'s `channelInactive` is invoked (i.e. 
when the channel is closed). Neither handles the case where there is a network 
partition between master and worker.

Follow the steps in 
[SPARK-19900|https://issues.apache.org/jira/browse/SPARK-19900], and see the 
[screenshots|https://cloud.githubusercontent.com/assets/2576762/26398697/d50735a4-40ac-11e7-80d8-6e9e1cf0b62f.png]:
 when worker1 is partitioned from the master, the app `app-xxx-000` keeps 
running instead of being marked finished, even though worker1 is down.

cc [~CodingCat]
@lyc


> Master should clear failed apps when worker down
> 
>
> Key: SPARK-20869
> URL: https://issues.apache.org/jira/browse/SPARK-20869
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: lyc
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> In `Master.removeWorker`, the master clears executor and driver state but 
> does not clear app state. App state is cleared only when 
> `UnregisterApplication` is received or when `onDisconnect` fires; the first 
> happens when the driver shuts down gracefully, the second when `netty`'s 
> `channelInactive` is invoked (i.e. when the channel is closed). Neither 
> handles the case where there is a network partition between master and worker.
> Follow the steps in 
> [SPARK-19900|https://issues.apache.org/jira/browse/SPARK-19900], and see the 
> [screenshots|https://cloud.githubusercontent.com/assets/2576762/26398697/d50735a4-40ac-11e7-80d8-6e9e1cf0b62f.png]:
>  when worker1 is partitioned from the master, the app `app-xxx-000` keeps 
> running instead of being marked finished, even though worker1 is down.
> cc [~CodingCat]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20869) Master should clear failed apps when worker down

2017-06-14 Thread lyc (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lyc updated SPARK-20869:

Description: 
In `Master.removeWorker`, the master clears executor and driver state but does 
not clear app state. App state is cleared only when `UnregisterApplication` is 
received or when `onDisconnect` fires; the first happens when the driver shuts 
down gracefully, the second when `netty`'s `channelInactive` is invoked (i.e. 
when the channel is closed). Neither handles the case where there is a network 
partition between master and worker.

Follow the steps in 
[SPARK-19900|https://issues.apache.org/jira/browse/SPARK-19900], and see the 
[screenshots|https://cloud.githubusercontent.com/assets/2576762/26398697/d50735a4-40ac-11e7-80d8-6e9e1cf0b62f.png]:
 when worker1 is partitioned from the master, the app `app-xxx-000` keeps 
running instead of being marked finished, even though worker1 is down.

cc [~CodingCat]
@lyc

  was:
In `Master.removeWorker`, the master clears executor and driver state but does 
not clear app state. App state is cleared only when `UnregisterApplication` is 
received or when `onDisconnect` fires; the first happens when the driver shuts 
down gracefully, the second when `netty`'s `channelInactive` is invoked (i.e. 
when the channel is closed). Neither handles the case where there is a network 
partition between master and worker.

Follow the steps in 
[SPARK-19900|https://issues.apache.org/jira/browse/SPARK-19900], and see the 
[screenshots|https://cloud.githubusercontent.com/assets/2576762/26398697/d50735a4-40ac-11e7-80d8-6e9e1cf0b62f.png]:
 when worker1 is partitioned from the master, the app `app-xxx-000` keeps 
running instead of being marked finished, even though worker1 is down.

cc [~CodingCat]


> Master should clear failed apps when worker down
> 
>
> Key: SPARK-20869
> URL: https://issues.apache.org/jira/browse/SPARK-20869
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: lyc
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> In `Master.removeWorker`, the master clears executor and driver state but 
> does not clear app state. App state is cleared only when 
> `UnregisterApplication` is received or when `onDisconnect` fires; the first 
> happens when the driver shuts down gracefully, the second when `netty`'s 
> `channelInactive` is invoked (i.e. when the channel is closed). Neither 
> handles the case where there is a network partition between master and worker.
> Follow the steps in 
> [SPARK-19900|https://issues.apache.org/jira/browse/SPARK-19900], and see the 
> [screenshots|https://cloud.githubusercontent.com/assets/2576762/26398697/d50735a4-40ac-11e7-80d8-6e9e1cf0b62f.png]:
>  when worker1 is partitioned from the master, the app `app-xxx-000` keeps 
> running instead of being marked finished, even though worker1 is down.
> cc [~CodingCat]
> @lyc



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049944#comment-16049944
 ] 

Hyukjin Kwon commented on SPARK-21093:
--

I am taking a look; here is gdb with a backtrace (bt):

{code}
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /usr/lib64/R/bin/exec/R...Reading symbols from 
/usr/lib64/R/bin/exec/R...(no debugging symbols found)...done.
(no debugging symbols found)...done.
[New LWP 25284]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/lib64/R/bin/exec/R --slave --no-restore --vanilla 
--file=/home/hyukjinkwon'.
Program terminated with signal 6, Aborted.
#0  0x7fbdffb545f7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install R-core-3.4.0-2.el7.x86_64
(gdb) where
#0  0x7fbdffb545f7 in raise () from /lib64/libc.so.6
#1  0x7fbdffb55ce8 in abort () from /lib64/libc.so.6
#2  0x7fbdffb94327 in __libc_message () from /lib64/libc.so.6
#3  0x7fbdffc2d597 in __fortify_fail () from /lib64/libc.so.6
#4  0x7fbdffc2b750 in __chk_fail () from /lib64/libc.so.6
#5  0x7fbdffc2d507 in __fdelt_warn () from /lib64/libc.so.6
#6  0x7fbdefca5015 in R_SockConnect () from 
/usr/lib64/R/modules//internet.so
#7  0x7fbdefcad81e in sock_open () from /usr/lib64/R/modules//internet.so
#8  0x7fbe026381b6 in do_sockconn () from /usr/lib64/R/lib/libR.so
#9  0x7fbe0268b4d0 in bcEval () from /usr/lib64/R/lib/libR.so
#10 0x7fbe0269b138 in Rf_eval () from /usr/lib64/R/lib/libR.so
#11 0x7fbe0269d1af in R_execClosure () from /usr/lib64/R/lib/libR.so
#12 0x7fbe0269b2f4 in Rf_eval () from /usr/lib64/R/lib/libR.so
#13 0x7fbe0269ef8e in do_set () from /usr/lib64/R/lib/libR.so
#14 0x7fbe0269b529 in Rf_eval () from /usr/lib64/R/lib/libR.so
#15 0x7fbe026a04ce in do_eval () from /usr/lib64/R/lib/libR.so
#16 0x7fbe0268b4d0 in bcEval () from /usr/lib64/R/lib/libR.so
#17 0x7fbe0269b138 in Rf_eval () from /usr/lib64/R/lib/libR.so
#18 0x7fbe0269d1af in R_execClosure () from /usr/lib64/R/lib/libR.so
#19 0x7fbe02694101 in bcEval () from /usr/lib64/R/lib/libR.so
#20 0x7fbe0269b138 in Rf_eval () from /usr/lib64/R/lib/libR.so
#21 0x7fbe0269ba7e in forcePromise () from /usr/lib64/R/lib/libR.so
#22 0x7fbe0269b7b7 in Rf_eval () from /usr/lib64/R/lib/libR.so
#23 0x7fbe026a06d1 in do_withVisible () from /usr/lib64/R/lib/libR.so
#24 0x7fbe026d02e9 in do_internal () from /usr/lib64/R/lib/libR.so
---Type  to continue, or q  to quit---
{code}

Another thing I found: this actually reproduces on my Mac too if the command 
above is executed multiple times (on my Mac it has to be executed roughly 
14~16 times), although the error message looks different. Still, my wild guess 
is that the root cause is the same. 


> Multiple gapply execution occasionally failed in SparkR 
> 
>
> Key: SPARK-21093
> URL: https://issues.apache.org/jira/browse/SPARK-21093
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
> Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3
>Reporter: Hyukjin Kwon
>
> On CentOS 7.2.1511 with R 3.4.0/3.3.0, multiple executions of {{gapply}} 
> occasionally fail as below:
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.3.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 

[jira] [Commented] (SPARK-19900) [Standalone] Master registers application again when driver relaunched

2017-06-14 Thread lyc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049942#comment-16049942
 ] 

lyc commented on SPARK-19900:
-

Hi, what do you mean by JIRA id? I only know that my user name (and JIRA login 
name) is `lyc`; is that what you want?

> [Standalone] Master registers application again when driver relaunched
> --
>
> Key: SPARK-19900
> URL: https://issues.apache.org/jira/browse/SPARK-19900
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.6.2
> Environment: Centos 6.5, spark standalone
>Reporter: Sergey
>Priority: Critical
>  Labels: Spark, network, standalone, supervise
> Fix For: 2.3.0
>
>
> I've found some problems when the node where the driver is running has an 
> unstable network. A situation is possible in which two identical applications 
> are running on a cluster.
> *Steps to Reproduce:*
> # prepare 3 nodes: one for the Spark master and two for the Spark workers.
> # submit an application with the parameter spark.driver.supervise = true
> # go to the node where the driver is running (for example spark-worker-1) and 
> close port 7077
> {code}
> # iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # wait more than 60 seconds
> # look at the Spark master UI
> There are two Spark applications and one driver. The new application is in 
> WAITING state and the second application is in RUNNING state. The driver is in 
> RUNNING or RELAUNCHING state (it depends on the resources available, as I 
> understand it) and it is launched on another node (for example spark-worker-2)
> # open the port
> {code}
> # iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # look at the Spark UI again
> There are no changes
> In addition, if you look at the processes on the node spark-worker-1
> {code}
> # ps ax | grep spark
> {code}
>  you will see that the old driver is still working!
> *Spark master logs:*
> {code}
> 17/03/10 05:26:27 WARN Master: Removing 
> worker-20170310052240-spark-worker-1-35039 because we got no heartbeat in 60 
> seconds
> 17/03/10 05:26:27 INFO Master: Removing worker 
> worker-20170310052240-spark-worker-1-35039 on spark-worker-1:35039
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 1
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 0
> 17/03/10 05:26:27 INFO Master: Re-launching driver-20170310052347-
> 17/03/10 05:26:27 INFO Master: Launching driver driver-20170310052347- on 
> worker worker-20170310052411-spark-worker-2-40473
> 17/03/10 05:26:35 INFO Master: Registering app TestApplication
> 17/03/10 05:26:35 INFO Master: Registered app TestApplication with ID 
> app-20170310052635-0001
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/1
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/0
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 

[jira] [Commented] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-14 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049940#comment-16049940
 ] 

Saisai Shao commented on SPARK-21082:
-

A fast node is effectively an idle node: since a fast node executes tasks more 
efficiently, it has more idle time to accept more tasks. The scheduler may not 
know which node is fast, but it will always schedule tasks onto idle nodes 
(regardless of locality waiting), so as a result fast nodes will execute more 
tasks. 

By "fast node" I do not only mean a much stronger CPU; it may also be faster 
IO. Normally tasks should be distributed relatively evenly, so if you see one 
node with many more tasks than the others, you'd better find out what is 
different about that node from several angles. Changing the scheduler is not 
the first choice after all.
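
To make the discussion concrete, here is a hedged sketch (hypothetical types 
and names, not Spark's scheduler API) of the kind of memory-aware filtering the 
ticket proposes: only offer tasks to executors whose estimated free memory is 
above a configurable threshold.

{code}
// Hypothetical sketch, not Spark's TaskScheduler: filter task offers by the
// executor's estimated free memory before dispatching.
object MemoryAwareDispatchSketch {
  case class ExecutorSnapshot(id: String, maxMemoryMB: Long, usedMemoryMB: Long) {
    def freeMemoryMB: Long = maxMemoryMB - usedMemoryMB
  }

  // Stand-ins for the configuration the reporter proposes.
  val considerMemoryUsage = true
  val minFreeMemoryMB = 512L

  def eligibleExecutors(executors: Seq[ExecutorSnapshot]): Seq[ExecutorSnapshot] =
    if (!considerMemoryUsage) executors
    else executors.filter(_.freeMemoryMB >= minFreeMemoryMB)

  def main(args: Array[String]): Unit = {
    val execs = Seq(
      ExecutorSnapshot("exec-1", maxMemoryMB = 4096, usedMemoryMB = 3900), // skipped
      ExecutorSnapshot("exec-2", maxMemoryMB = 4096, usedMemoryMB = 1024))
    println(eligibleExecutors(execs).map(_.id)) // List(exec-2)
  }
}
{code}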

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.1
>Reporter: DjvuLee
>
>  Spark Scheduler do not consider the memory usage during dispatch tasks, this 
> can lead to Executor OOM if the RDD is cached sometimes, because Spark can 
> not estimate the memory usage well enough(especially when the RDD type is not 
> flatten), scheduler may dispatch so many tasks on one Executor.
> We can offer a configuration for user to decide whether scheduler will 
> consider the memory usage.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-14 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049933#comment-16049933
 ] 

DjvuLee edited comment on SPARK-21082 at 6/15/17 2:47 AM:
--

It is not really a fast node vs. slow node problem.

Even if all the nodes have equal computation power, there are lots of factors 
that affect the data cached by executors, such as the data locality of each 
task's input, the network, the scheduling order, etc.

`it is reasonable to schedule more tasks on to fast node.`
But what actually happens is that more tasks are scheduled to idle executors. 
The scheduler has no notion of fast or slow for each executor; it mostly 
considers locality and idleness.

I agree that it is better not to change the code, but I cannot find any 
configuration that solves the problem.
Is there any good solution to keep the used memory balanced across executors? 


was (Author: djvulee):
It is not really a fast node vs. slow node problem.

Even if all the nodes have equal computation power, there are lots of factors 
that affect the data cached by executors, such as the data locality of each 
task's input, the network, the scheduling order, etc.

`it is reasonable to schedule more tasks on to fast node.`
But what actually happens is that more tasks are scheduled to idle executors. 
The scheduler has no notion of fast or slow for each executor; it mostly 
considers locality and idleness.

I agree that it is better not to change the code, but I cannot find any 
configuration that solves the problem.
Is there any good solution to keep the used memory balanced across the 
executors? 

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.1
>Reporter: DjvuLee
>
> The Spark scheduler does not consider memory usage when dispatching tasks. 
> This can lead to executor OOM when RDDs are cached, because Spark cannot 
> estimate the memory usage well enough (especially when the RDD type is not 
> flat), so the scheduler may dispatch too many tasks to one executor.
> We could offer a configuration for the user to decide whether the scheduler 
> should consider memory usage.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-14 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049933#comment-16049933
 ] 

DjvuLee commented on SPARK-21082:
-

It is not really a fast node vs. slow node problem.

Even if all the nodes have equal computation power, there are lots of factors 
that affect the data cached by executors, such as the data locality of each 
task's input, the network, the scheduling order, etc.

`it is reasonable to schedule more tasks on to fast node.`
But what actually happens is that more tasks are scheduled to idle executors. 
The scheduler has no notion of fast or slow for each executor; it mostly 
considers locality and idleness.

I agree that it is better not to change the code, but I cannot find any 
configuration that solves the problem.
Is there any good solution to keep the used memory balanced across the 
executors? 

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.1
>Reporter: DjvuLee
>
> The Spark scheduler does not consider memory usage when dispatching tasks. 
> This can lead to executor OOM when RDDs are cached, because Spark cannot 
> estimate the memory usage well enough (especially when the RDD type is not 
> flat), so the scheduler may dispatch too many tasks to one executor.
> We could offer a configuration for the user to decide whether the scheduler 
> should consider memory usage.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20912) map function with columns as strings

2017-06-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20912.
--
Resolution: Won't Fix

I am resolving this per the discussion in the PR. I guess we are fine with not 
adding this API for now.
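
For reference, a hedged sketch of what such a string-based variant could look 
like as a user-side helper (an illustration only, not an actual Spark API): it 
simply wraps the existing Column-based {{functions.map}}.

{code}
// Hypothetical helper, not part of org.apache.spark.sql.functions: builds a
// map column from column names by delegating to the Column-based map().
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, map}

object MapFromStrings {
  def mapFromStrings(colName: String, colNames: String*): Column =
    map((colName +: colNames).map(col): _*)
}

// Usage inside a Spark session (cf. the example in the description):
//   val kvs = Seq(("key", "value")).toDF("k", "v")
//   kvs.withColumn("map", MapFromStrings.mapFromStrings("k", "v")).show()
{code}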

> map function with columns as strings
> 
>
> Key: SPARK-20912
> URL: https://issues.apache.org/jira/browse/SPARK-20912
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> There's only {{map}} function that accepts {{Column}} values only. It'd be 
> very helpful to have a variant that accepted {{String}} for columns like 
> {{array}} or {{struct}}.
> {code}
> scala> val kvs = Seq(("key", "value")).toDF("k", "v")
> kvs: org.apache.spark.sql.DataFrame = [k: string, v: string]
> scala> kvs.printSchema
> root
>  |-- k: string (nullable = true)
>  |-- v: string (nullable = true)
> scala> kvs.withColumn("map", map("k", "v")).show
> :26: error: type mismatch;
>  found   : String("k")
>  required: org.apache.spark.sql.Column
>kvs.withColumn("map", map("k", "v")).show
>  ^
> :26: error: type mismatch;
>  found   : String("v")
>  required: org.apache.spark.sql.Column
>kvs.withColumn("map", map("k", "v")).show
>   ^
> // note $ to create Columns per string
> // not very dev-friendly
> scala> kvs.withColumn("map", map($"k", $"v")).show
> +---+-+-+
> |  k|v|  map|
> +---+-+-+
> |key|value|Map(key -> value)|
> +---+-+-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21080) Workaround for HDFS delegation token expiry broken with some Hadoop versions

2017-06-14 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049920#comment-16049920
 ] 

Saisai Shao commented on SPARK-21080:
-

Are you hitting this issue in HDFS HA mode? If yes, then updating to a newer 
version of HDFS should be enough; otherwise, pulling the PR 
(https://github.com/apache/spark/pull/9168) into your Spark code and 
repackaging should work.

> Workaround for HDFS delegation token expiry broken with some Hadoop versions
> 
>
> Key: SPARK-21080
> URL: https://issues.apache.org/jira/browse/SPARK-21080
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0 on Yarn, Hadoop 2.7.3
>Reporter: Lukasz Raszka
>Priority: Minor
>
> We're being hit by SPARK-11182, where the core issue in HDFS has been fixed 
> in more recent versions. It seems that the [workaround introduced by user 
> SaintBacchus|https://github.com/apache/spark/commit/646366b5d2f12e42f8e7287672ba29a8c918a17d]
>  doesn't work in newer versions of Hadoop. This seems to be caused by a move 
> of the property name from {{fs.hdfs.impl}} to 
> {{fs.AbstractFileSystem.hdfs.impl}}, which happened somewhere around 2.7.0 or 
> earlier. Taking this into account should make the workaround work again for 
> the affected Hadoop versions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21103) QueryPlanConstraints should be part of LogicalPlan

2017-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21103:


Assignee: Apache Spark  (was: Reynold Xin)

> QueryPlanConstraints should be part of LogicalPlan
> --
>
> Key: SPARK-21103
> URL: https://issues.apache.org/jira/browse/SPARK-21103
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> QueryPlanConstraints should be part of LogicalPlan, rather than QueryPlan, 
> since the constraint framework is only used for query plan rewriting and not 
> for physical planning. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21103) QueryPlanConstraints should be part of LogicalPlan

2017-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21103:


Assignee: Reynold Xin  (was: Apache Spark)

> QueryPlanConstraints should be part of LogicalPlan
> --
>
> Key: SPARK-21103
> URL: https://issues.apache.org/jira/browse/SPARK-21103
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> QueryPlanConstraints should be part of LogicalPlan, rather than QueryPlan, 
> since the constraint framework is only used for query plan rewriting and not 
> for physical planning. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21103) QueryPlanConstraints should be part of LogicalPlan

2017-06-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049919#comment-16049919
 ] 

Apache Spark commented on SPARK-21103:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/18310

> QueryPlanConstraints should be part of LogicalPlan
> --
>
> Key: SPARK-21103
> URL: https://issues.apache.org/jira/browse/SPARK-21103
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> QueryPlanConstraints should be part of LogicalPlan, rather than QueryPlan, 
> since the constraint framework is only used for query plan rewriting and not 
> for physical planning. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21103) QueryPlanConstraints should be part of LogicalPlan

2017-06-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-21103:
---

 Summary: QueryPlanConstraints should be part of LogicalPlan
 Key: SPARK-21103
 URL: https://issues.apache.org/jira/browse/SPARK-21103
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Reynold Xin
Assignee: Reynold Xin


QueryPlanConstraints should be part of LogicalPlan, rather than QueryPlan, 
since the constraint framework is only used for query plan rewriting and not 
for physical planning. 
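
A heavily simplified structural sketch of the proposed split (hypothetical 
names, not the actual Catalyst classes): the constraints trait is mixed into 
the logical side only, so physical plans never carry it.

{code}
// Hypothetical sketch of the refactor's shape, not Catalyst's real hierarchy.
abstract class QueryPlanSketch              // shared by logical and physical plans

trait QueryPlanConstraintsSketch {
  // placeholder for the inferred-constraint machinery used during rewriting
  def constraints: Set[String] = Set.empty
}

// Constraints live on the logical side only ...
abstract class LogicalPlanSketch extends QueryPlanSketch with QueryPlanConstraintsSketch

// ... while physical plans are left untouched.
abstract class SparkPlanSketch extends QueryPlanSketch
{code}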



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21101) Error running Hive temporary UDTF on latest Spark 2.2

2017-06-14 Thread Dayou Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049915#comment-16049915
 ] 

Dayou Zhou commented on SPARK-21101:


Hi [~maropu], yes, I saw that one, but in my case I'm using the JDBC Thrift 
server to invoke the UDTF, not the Spark shell. So is there a way to pass my 
JAR to the Thrift server?
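
For what it's worth, a hedged sketch of one way a JDBC client can make a jar 
visible to the Thrift server session (an assumption on my part, not a confirmed 
fix for this ticket): issue ADD JAR over the same connection before registering 
the function.

{code}
// Sketch only: assumes the Thrift server listens on the usual default port
// 10000 and that ADD JAR is accepted over this session.
import java.sql.DriverManager

object ThriftServerAddJarSketch {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "hive", "")
    val stmt = conn.createStatement()
    stmt.execute("ADD JAR hdfs:///path/to/udf.jar")
    stmt.execute("CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf'")
    // myudtf can now be referenced in queries issued on this connection.
    conn.close()
  }
}
{code}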

> Error running Hive temporary UDTF on latest Spark 2.2
> -
>
> Key: SPARK-21101
> URL: https://issues.apache.org/jira/browse/SPARK-21101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dayou Zhou
>
> I'm using temporary UDTFs on Spark 2.2, e.g.
> CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
> 'hdfs:///path/to/udf.jar'; 
> But when I try to invoke it, I get the following error:
> {noformat}
> 17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Any help appreciated, thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-14 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049906#comment-16049906
 ] 

Saisai Shao commented on SPARK-21082:
-

Is it due to a fast node / slow node problem? Ideally, if all the nodes have 
equal computation power, then the cached memory usage should be even. According 
to your description it looks more like a fast node / slow node problem: a fast 
node will process and cache more data, and it is reasonable to schedule more 
tasks onto a fast node.

Scheduling tasks based on free memory and OOM concerns is quite scenario 
dependent AFAIK; we may have other ways to tune the cluster instead of changing 
the code, and this scenario is not generic enough to justify changing the 
scheduler. I would suggest doing a careful and generic design if you want to 
improve the scheduler. 

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.1
>Reporter: DjvuLee
>
> The Spark scheduler does not consider memory usage when dispatching tasks. 
> This can lead to executor OOM when RDDs are cached, because Spark cannot 
> estimate the memory usage well enough (especially when the RDD type is not 
> flat), so the scheduler may dispatch too many tasks to one executor.
> We could offer a configuration for the user to decide whether the scheduler 
> should consider memory usage.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21101) Error running Hive temporary UDTF on latest Spark 2.2

2017-06-14 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049892#comment-16049892
 ] 

Takeshi Yamamuro commented on SPARK-21101:
--

See https://www.mail-archive.com/user@spark.apache.org/msg61009.html

> Error running Hive temporary UDTF on latest Spark 2.2
> -
>
> Key: SPARK-21101
> URL: https://issues.apache.org/jira/browse/SPARK-21101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dayou Zhou
>
> I'm using temporary UDTFs on Spark 2.2, e.g.
> CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
> 'hdfs:///path/to/udf.jar'; 
> But when I try to invoke it, I get the following error:
> {noformat}
> 17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Any help appreciated, thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21101) Error running Hive temporary UDTF on latest Spark 2.2

2017-06-14 Thread Dayou Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049891#comment-16049891
 ] 

Dayou Zhou commented on SPARK-21101:


Hi [~maropu]

>>You just don't pass your uber-jar into spark? 
Sorry, I'm not sure what you meant -- could you clarify your question?

>>Or, you mean the query above worked well on previous spark?
I did not try it with earlier versions, but I think the behavior is likely the same.

> Error running Hive temporary UDTF on latest Spark 2.2
> -
>
> Key: SPARK-21101
> URL: https://issues.apache.org/jira/browse/SPARK-21101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dayou Zhou
>
> I'm using temporary UDTFs on Spark 2.2, e.g.
> CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
> 'hdfs:///path/to/udf.jar'; 
> But when I try to invoke it, I get the following error:
> {noformat}
> 17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Any help appreciated, thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21101) Error running Hive temporary UDTF on latest Spark 2.2

2017-06-14 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049883#comment-16049883
 ] 

Takeshi Yamamuro commented on SPARK-21101:
--

You just don't pass your uber-jar into spark? Or, you mean the query above 
worked well on previous spark?

> Error running Hive temporary UDTF on latest Spark 2.2
> -
>
> Key: SPARK-21101
> URL: https://issues.apache.org/jira/browse/SPARK-21101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dayou Zhou
>
> I'm using temporary UDTFs on Spark 2.2, e.g.
> CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
> 'hdfs:///path/to/udf.jar'; 
> But when I try to invoke it, I get the following error:
> {noformat}
> 17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Any help appreciated, thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21102) Refresh command is too aggressive in parsing

2017-06-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-21102:
---

 Summary: Refresh command is too aggressive in parsing
 Key: SPARK-21102
 URL: https://issues.apache.org/jira/browse/SPARK-21102
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Reynold Xin


SQL REFRESH command parsing is way too aggressive:

{code}
| REFRESH TABLE tableIdentifier
#refreshTable
| REFRESH .*?  
#refreshResource
{code}

We should change it so it takes the whole string (without space), or a quoted 
string.
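
A rough illustration of why the catch-all alternative is problematic (plain 
Scala regexes standing in for the ANTLR rules, so only an approximation of the 
real grammar): anything after REFRESH is swallowed as a "resource", including 
input with spaces that was probably not meant as a single path.

{code}
// Approximation of the two grammar alternatives with regexes; not the parser.
object RefreshParsingSketch {
  val refreshTable    = """(?i)REFRESH\s+TABLE\s+([\w.]+)""".r
  val refreshResource = """(?i)REFRESH\s+(.*)""".r

  def classify(sql: String): String = sql match {
    case refreshTable(t)    => s"refreshTable($t)"
    case refreshResource(r) => s"refreshResource($r)"
    case _                  => "no match"
  }

  def main(args: Array[String]): Unit = {
    println(classify("REFRESH TABLE db.t"))              // refreshTable(db.t)
    println(classify("REFRESH /some path/with spaces"))  // whole tail becomes the resource
  }
}
{code}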






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21102) Refresh command is too aggressive in parsing

2017-06-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-21102:

Labels: starter  (was: )

> Refresh command is too aggressive in parsing
> 
>
> Key: SPARK-21102
> URL: https://issues.apache.org/jira/browse/SPARK-21102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>  Labels: starter
>
> SQL REFRESH command parsing is way too aggressive:
> {code}
> | REFRESH TABLE tableIdentifier
> #refreshTable
> | REFRESH .*?  
> #refreshResource
> {code}
> We should change it so it takes the whole string (without space), or a quoted 
> string.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21079) ANALYZE TABLE fails to calculate totalSize for a partitioned table

2017-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21079:


Assignee: (was: Apache Spark)

> ANALYZE TABLE fails to calculate totalSize for a partitioned table
> --
>
> Key: SPARK-21079
> URL: https://issues.apache.org/jira/browse/SPARK-21079
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Maria
>  Labels: easyfix
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> ANALYZE TABLE table COMPUTE STATISTICS invoked for a partitioned table 
> produces totalSize = 0.
> AnalyzeTableCommand fetches the table-level storage URI and calculates the 
> total size of files in the corresponding directory recursively. However, for 
> partitioned tables, each partition has its own storage URI, which may not be 
> a subdirectory of the table-level storage URI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21079) ANALYZE TABLE fails to calculate totalSize for a partitioned table

2017-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21079:


Assignee: Apache Spark

> ANALYZE TABLE fails to calculate totalSize for a partitioned table
> --
>
> Key: SPARK-21079
> URL: https://issues.apache.org/jira/browse/SPARK-21079
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Maria
>Assignee: Apache Spark
>  Labels: easyfix
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> ANALYZE TABLE table COMPUTE STATISTICS invoked for a partitioned table 
> produces totalSize = 0.
> AnalyzeTableCommand fetches the table-level storage URI and calculates the 
> total size of files in the corresponding directory recursively. However, for 
> partitioned tables, each partition has its own storage URI, which may not be 
> a subdirectory of the table-level storage URI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-21079) ANALYZE TABLE fails to calculate totalSize for a partitioned table

2017-06-14 Thread Maria (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maria updated SPARK-21079:
--
Comment: was deleted

(was: [~ZenWzh], here is a PR: https://github.com/apache/spark/pull/18309)

> ANALYZE TABLE fails to calculate totalSize for a partitioned table
> --
>
> Key: SPARK-21079
> URL: https://issues.apache.org/jira/browse/SPARK-21079
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Maria
>  Labels: easyfix
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> ANALYZE TABLE table COMPUTE STATISTICS invoked for a partitioned table 
> produces totalSize = 0.
> AnalyzeTableCommand fetches the table-level storage URI and calculates the 
> total size of files in the corresponding directory recursively. However, for 
> partitioned tables, each partition has its own storage URI, which may not be 
> a subdirectory of the table-level storage URI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21079) ANALYZE TABLE fails to calculate totalSize for a partitioned table

2017-06-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049877#comment-16049877
 ] 

Apache Spark commented on SPARK-21079:
--

User 'mbasmanova' has created a pull request for this issue:
https://github.com/apache/spark/pull/18309

> ANALYZE TABLE fails to calculate totalSize for a partitioned table
> --
>
> Key: SPARK-21079
> URL: https://issues.apache.org/jira/browse/SPARK-21079
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Maria
>  Labels: easyfix
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> ANALYZE TABLE table COMPUTE STATISTICS invoked for a partitioned table 
> produces totalSize = 0.
> AnalyzeTableCommand fetches the table-level storage URI and calculates the 
> total size of files in the corresponding directory recursively. However, for 
> partitioned tables, each partition has its own storage URI, which may not be 
> a subdirectory of the table-level storage URI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21079) ANALYZE TABLE fails to calculate totalSize for a partitioned table

2017-06-14 Thread Maria (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049879#comment-16049879
 ] 

Maria commented on SPARK-21079:
---

[~ZenWzh], here is a PR: https://github.com/apache/spark/pull/18309
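
For illustration, a hedged sketch of the fix direction the description implies 
(a hypothetical helper, not Spark's AnalyzeTableCommand): sum file sizes over 
every partition's own location rather than only the table root.

{code}
// Hypothetical helper: total size across explicit partition locations, which
// may live outside the table's root directory.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object PartitionedTotalSizeSketch {
  def totalSize(locations: Seq[String], hadoopConf: Configuration): Long =
    locations.map { loc =>
      val path = new Path(loc)
      val fs = path.getFileSystem(hadoopConf)
      fs.getContentSummary(path).getLength   // recursive size of this directory
    }.sum

  def main(args: Array[String]): Unit = {
    // Example partition locations; the second one is outside the table root.
    val partitionLocations = Seq(
      "hdfs:///warehouse/db.db/t/ds=2017-06-14",
      "hdfs:///data/external/t_extra_partition")
    println(totalSize(partitionLocations, new Configuration()))
  }
}
{code}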

> ANALYZE TABLE fails to calculate totalSize for a partitioned table
> --
>
> Key: SPARK-21079
> URL: https://issues.apache.org/jira/browse/SPARK-21079
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Maria
>  Labels: easyfix
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> ANALYZE TABLE table COMPUTE STATISTICS invoked for a partitioned table 
> produces totalSize = 0.
> AnalyzeTableCommand fetches the table-level storage URI and calculates the 
> total size of files in the corresponding directory recursively. However, for 
> partitioned tables, each partition has its own storage URI, which may not be 
> a subdirectory of the table-level storage URI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"

2017-06-14 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049862#comment-16049862
 ] 

Dongjoon Hyun commented on SPARK-20954:
---

FYI, Apache Spark STS uses 10000 as the default port number.

> DESCRIBE showing 1 extra row of "| # col_name  | data_type  | comment  |"
> -
>
> Key: SPARK-20954
> URL: https://issues.apache.org/jira/browse/SPARK-20954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Garros Chan
>Assignee: Dongjoon Hyun
> Fix For: 2.2.0
>
>
> I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added 
> to the result. You can see there is this 1 extra row with "| # col_name  | 
> data_type  | comment  |" ; however, select and select count(*) only show 1 
> row.
> I searched online for a long time and did not find any useful information.
> Is this a bug?
> hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline
> Beeline version 1.2.1.spark2 by Apache Hive
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> beeline> !connect jdbc:hive2://localhost:10016
> Connecting to jdbc:hive2://localhost:10016
> Enter username for jdbc:hive2://localhost:10016: hive
> Enter password for jdbc:hive2://localhost:10016: 
> 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016
> 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016
> 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:10016
> Connected to: Spark SQL (version 2.2.1-SNAPSHOT)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:10016> describe garros.hivefloat;
> +-++--+--+
> |  col_name   | data_type  | comment  |
> +-++--+--+
> | # col_name  | data_type  | comment  |
> | c1  | float  | NULL |
> +-++--+--+
> 2 rows selected (0.396 seconds)
> 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat;
> +-+--+
> | c1  |
> +-+--+
> | 123.99800109863281  |
> +-+--+
> 1 row selected (0.319 seconds)
> 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat;
> +---+--+
> | count(1)  |
> +---+--+
> | 1 |
> +---+--+
> 1 row selected (0.783 seconds)
> 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint;
> +---+-+--+--+
> |   col_name|  data_type  
> | comment  |
> +---+-+--+--+
> | # col_name| data_type   
> | comment  |
> | c1| int 
> | NULL |
> |   | 
> |  |
> | # Detailed Table Information  | 
> |  |
> | Database  | garros  
> |  |
> | Table | hiveint 
> |  |
> | Owner | root
> |  |
> | Created   | Thu Feb 09 17:40:36 EST 2017
> |  |
> | Last Access   | Wed Dec 31 19:00:00 EST 1969
> |  |
> | Type  | MANAGED 
> |  |
> | Provider  | hive  

[jira] [Commented] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"

2017-06-14 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049861#comment-16049861
 ] 

Dongjoon Hyun commented on SPARK-20954:
---

I get the same result over beeline on branch-2.2. I think you didn't stop 
your previous STS.
{code}
~/s/spark-master:branch-2.2$ bin/beeline -u jdbc:hive2://localhost:1 -n 
hive -p password -e 'desc table t'
Connecting to jdbc:hive2://localhost:1
Connected to: Spark SQL (version 2.2.1-SNAPSHOT)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+---++--+--+
| col_name  | data_type  | comment  |
+---++--+--+
| a | int| NULL |
+---++--+--+
1 row selected (0.122 seconds)
Beeline version 1.2.1.spark2 by Apache Hive
Closing: 0: jdbc:hive2://localhost:1
~/s/spark-master:branch-2.2$ bin/beeline -u jdbc:hive2://localhost:1 -n 
hive -p password -e 'desc extended t'
Connecting to jdbc:hive2://localhost:1
Connected to: Spark SQL (version 2.2.1-SNAPSHOT)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+---++--+--+
|   col_name| data_type 
 | comment  |
+---++--+--+
| a | int   
 | NULL |
|   |   
 |  |
{code}

> DESCRIBE showing 1 extra row of "| # col_name  | data_type  | comment  |"
> -
>
> Key: SPARK-20954
> URL: https://issues.apache.org/jira/browse/SPARK-20954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Garros Chan
>Assignee: Dongjoon Hyun
> Fix For: 2.2.0
>
>
> I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added 
> to the result. You can see there is this 1 extra row with "| # col_name  | 
> data_type  | comment  |" ; however, select and select count(*) only show 1 
> row.
> I searched online for a long time and did not find any useful information.
> Is this a bug?
> hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline
> Beeline version 1.2.1.spark2 by Apache Hive
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> beeline> !connect jdbc:hive2://localhost:10016
> Connecting to jdbc:hive2://localhost:10016
> Enter username for jdbc:hive2://localhost:10016: hive
> Enter password for jdbc:hive2://localhost:10016: 
> 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016
> 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016
> 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:10016
> Connected to: Spark SQL (version 2.2.1-SNAPSHOT)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:10016> describe garros.hivefloat;
> +-++--+--+
> |  col_name   | data_type  | comment  |
> +-++--+--+
> | # col_name  | data_type  | comment  |
> | c1  | float  | NULL |
> +-++--+--+
> 2 rows selected (0.396 seconds)
> 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat;
> +-+--+
> | c1  |
> +-+--+
> | 123.99800109863281  |
> +-+--+
> 1 row selected (0.319 seconds)
> 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat;
> +---+--+
> | count(1)  |
> +---+--+
> | 1 |
> +---+--+
> 1 row selected (0.783 seconds)
> 0: 

[jira] [Resolved] (SPARK-21096) Pickle error when passing a member variable to Spark executors

2017-06-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21096.
--
Resolution: Not A Problem

I am resolving this. Please reopen this if I misunderstood.

> Pickle error when passing a member variable to Spark executors
> --
>
> Key: SPARK-21096
> URL: https://issues.apache.org/jira/browse/SPARK-21096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Irina Truong
>
> There is a pickle error when submitting a spark job that references a member 
> variable in a lambda, even when the member variable is a simple type that 
> should be serializable.
> Here is a minimal example:
> https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278
> In the gist above, this method will throw an exception:
> {quote}
> def build_fail(self):
> processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
> return processed.collect()
> {quote}
> While this method will run just fine:
> {quote}
> def build_ok(self):
> mult = self.multiplier
> processed = self.rdd.map(lambda row: process_row(row, mult))
> return processed.collect()
> {quote}
> In this example, {{self.multiplier}} is just an int. However, passing it into 
> a lambda throws a pickle error, because it is trying to pickle the whole 
> {{self}}, and that contains {{sc}}.
> If this is the expected behavior, then why should re-assigning 
> {{self.multiplier}} to a variable make a difference?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21096) Pickle error when passing a member variable to Spark executors

2017-06-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049852#comment-16049852
 ] 

Hyukjin Kwon commented on SPARK-21096:
--

This is because you are passing {{self}}, which holds a SparkContext that is not 
serializable. Taking it out of the lambda's closure, e.g.

{code}
def build_fail(self):
-    processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
+    aa = self.multiplier
+    processed = self.rdd.map(lambda row: process_row(row, aa))
     return processed.collect()
{code}

should work.
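
For completeness, here is a minimal, self-contained PySpark sketch of the same 
pattern (the {{Builder}} class, the {{process_row}} body, and the sample data are 
hypothetical stand-ins for the reporter's gist): referencing {{self.multiplier}} 
inside the lambda pulls the whole object, including its SparkContext, into the 
pickled closure, while copying the value into a local variable first keeps the 
closure serializable.

{code}
from pyspark import SparkContext

def process_row(row, multiplier):
    # Pure function with no reference to the driver-side object; safe to ship.
    return row * multiplier

class Builder(object):
    def __init__(self, sc, multiplier):
        self.sc = sc                          # SparkContext must stay on the driver
        self.multiplier = multiplier          # a plain int
        self.rdd = sc.parallelize([1, 2, 3])

    def build_fail(self):
        # The lambda closes over `self`, so pickling it tries to serialize
        # self.sc as well and raises an error.
        return self.rdd.map(lambda row: process_row(row, self.multiplier)).collect()

    def build_ok(self):
        mult = self.multiplier                # copy the int out of `self`
        return self.rdd.map(lambda row: process_row(row, mult)).collect()

if __name__ == "__main__":
    sc = SparkContext("local[2]", "closure-pickle-demo")
    b = Builder(sc, multiplier=10)
    print(b.build_ok())                       # [10, 20, 30]
    # b.build_fail() fails while pickling the closure because of SparkContext
    sc.stop()
{code}

The same local-variable trick works for any attribute you need on the executors, 
as long as the copied value itself is picklable.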

> Pickle error when passing a member variable to Spark executors
> --
>
> Key: SPARK-21096
> URL: https://issues.apache.org/jira/browse/SPARK-21096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Irina Truong
>
> There is a pickle error when submitting a spark job that references a member 
> variable in a lambda, even when the member variable is a simple type that 
> should be serializable.
> Here is a minimal example:
> https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278
> In the gist above, this method will throw an exception:
> {quote}
> def build_fail(self):
> processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
> return processed.collect()
> {quote}
> While this method will run just fine:
> {quote}
> def build_ok(self):
> mult = self.multiplier
> processed = self.rdd.map(lambda row: process_row(row, mult))
> return processed.collect()
> {quote}
> In this example, {{self.multiplier}} is just an int. However, passing it into 
> a lambda throws a pickle error, because it is trying to pickle the whole 
> {{self}}, and that contains {{sc}}.
> If this is the expected behavior, then why should re-assigning 
> {{self.multiplier}} to a variable make a difference?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21101) Error running Hive temporary UDTF on latest Spark 2.2

2017-06-14 Thread Dayou Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayou Zhou updated SPARK-21101:
---
Description: 
I'm using temporary UDTFs on Spark 2.2, e.g.

{noformat}
CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
'hdfs:///path/to/udf.jar'; 

But when I try to invoke it, I get the following error:

17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
query:
org.apache.hive.service.cli.HiveSQLException: 
org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

Any help appreciated, thanks.

  was:
I'm using temporary UDTFs on Spark 2.2, e.g.

CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
'hdfs:///path/to/udf.jar'; 

But when I try to invoke it, I get the following error:

17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
query:
org.apache.hive.service.cli.HiveSQLException: 
org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Any help appreciated, thanks.


> Error running Hive temporary UDTF on latest Spark 2.2
> -
>
> Key: SPARK-21101
> URL: https://issues.apache.org/jira/browse/SPARK-21101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dayou Zhou
>
> I'm using temporary UDTFs on Spark 2.2, e.g.
> {noformat}
> CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
> 'hdfs:///path/to/udf.jar'; 
> But when I try to invoke it, I get the following error:
> 17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
> at 
> 

[jira] [Updated] (SPARK-21101) Error running Hive temporary UDTF on latest Spark 2.2

2017-06-14 Thread Dayou Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dayou Zhou updated SPARK-21101:
---
Description: 
I'm using temporary UDTFs on Spark 2.2, e.g.

CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
'hdfs:///path/to/udf.jar'; 

But when I try to invoke it, I get the following error:

{noformat}
17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
query:
org.apache.hive.service.cli.HiveSQLException: 
org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

Any help appreciated, thanks.

  was:
I'm using temporary UDTFs on Spark 2.2, e.g.

{noformat}
CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
'hdfs:///path/to/udf.jar'; 

But when I try to invoke it, I get the following error:

17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
query:
org.apache.hive.service.cli.HiveSQLException: 
org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

Any help appreciated, thanks.


> Error running Hive temporary UDTF on latest Spark 2.2
> -
>
> Key: SPARK-21101
> URL: https://issues.apache.org/jira/browse/SPARK-21101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dayou Zhou
>
> I'm using temporary UDTFs on Spark 2.2, e.g.
> CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
> 'hdfs:///path/to/udf.jar'; 
> But when I try to invoke it, I get the following error:
> {noformat}
> 17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: 
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
> at 
> 

[jira] [Commented] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"

2017-06-14 Thread Garros Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049844#comment-16049844
 ] 

Garros Chan commented on SPARK-20954:
-

Hi [~dongjoon]

Thanks for your confirmation!
Do you know when RC5 will come out?

Also, I just noticed this behavior.

I am using spark-2.2.1-SNAPSHOT-bin-hadoop2.7.tgz (2017-06-14 09:44, 194M).
I have looked at your test result above using Scala.

I tried the same thing and did not see the extra line in DESC.

{code:java}
scala> sql("desc hiveint").show
++-+---+
|col_name|data_type|comment|
++-+---+
|  c1|  int|   null|
++-+---+
{code}

But when I tried with Beeline, it has that extra line again.

{code:java}
0: jdbc:hive2://localhost:10016> describe hiveint;
+-++--+--+
|  col_name   | data_type  | comment  |
+-++--+--+
| # col_name  | data_type  | comment  |
| c1  | int| NULL |
+-++--+--+
{code}

Any idea why, please?


> DESCRIBE showing 1 extra row of "| # col_name  | data_type  | comment  |"
> -
>
> Key: SPARK-20954
> URL: https://issues.apache.org/jira/browse/SPARK-20954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Garros Chan
>Assignee: Dongjoon Hyun
> Fix For: 2.2.0
>
>
> I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added 
> to the result. You can see there is this 1 extra row with "| # col_name  | 
> data_type  | comment  |" ; however, select and select count(*) only shows 1 
> row.
> I searched online a long time and do not find any useful information.
> Is this a bug?
> hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline
> Beeline version 1.2.1.spark2 by Apache Hive
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> beeline> !connect jdbc:hive2://localhost:10016
> Connecting to jdbc:hive2://localhost:10016
> Enter username for jdbc:hive2://localhost:10016: hive
> Enter password for jdbc:hive2://localhost:10016: 
> 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016
> 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016
> 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:10016
> Connected to: Spark SQL (version 2.2.1-SNAPSHOT)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:10016> describe garros.hivefloat;
> +-++--+--+
> |  col_name   | data_type  | comment  |
> +-++--+--+
> | # col_name  | data_type  | comment  |
> | c1  | float  | NULL |
> +-++--+--+
> 2 rows selected (0.396 seconds)
> 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat;
> +-+--+
> | c1  |
> +-+--+
> | 123.99800109863281  |
> +-+--+
> 1 row selected (0.319 seconds)
> 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat;
> +---+--+
> | count(1)  |
> +---+--+
> | 1 |
> +---+--+
> 1 row selected (0.783 seconds)
> 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint;
> +---+-+--+--+
> |   col_name|  data_type  
> | comment  |
> +---+-+--+--+
> | # col_name| data_type   
> | comment  |
> | c1| int 
> | NULL |
> |   |  

[jira] [Commented] (SPARK-19900) [Standalone] Master registers application again when driver relaunched

2017-06-14 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049842#comment-16049842
 ] 

Wenchen Fan commented on SPARK-19900:
-

liyichao, can you provide your JIRA id? Thanks!

> [Standalone] Master registers application again when driver relaunched
> --
>
> Key: SPARK-19900
> URL: https://issues.apache.org/jira/browse/SPARK-19900
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.6.2
> Environment: Centos 6.5, spark standalone
>Reporter: Sergey
>Priority: Critical
>  Labels: Spark, network, standalone, supervise
> Fix For: 2.3.0
>
>
> I've found some problems when node, where driver is running, has unstable 
> network. A situation is possible when two identical applications are running 
> on a cluster.
> *Steps to Reproduce:*
> # prepare 3 nodes: one for the spark master and two for the spark workers.
> # submit an application with parameter spark.driver.supervise = true
> # go to the node where driver is running (for example spark-worker-1) and 
> close 7077 port
> {code}
> # iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # wait more 60 seconds
> # look at the spark master UI
> There are two spark applications and one driver. The new application has 
> WAITING state and the second application has RUNNING state. Driver has 
> RUNNING or RELAUNCHING state (It depends on the resources available, as I 
> understand it) and it launched on other node (for example spark-worker-2)
> # open the port
> {code}
> # iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # look at the spark UI again
> There are no changes
> In addition, if you look at the processes on the node spark-worker-1
> {code}
> # ps ax | grep spark
> {code}
>  you will see that the old driver is still working!
> *Spark master logs:*
> {code}
> 17/03/10 05:26:27 WARN Master: Removing 
> worker-20170310052240-spark-worker-1-35039 because we got no heartbeat in 60 
> seconds
> 17/03/10 05:26:27 INFO Master: Removing worker 
> worker-20170310052240-spark-worker-1-35039 on spark-worker-1:35039
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 1
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 0
> 17/03/10 05:26:27 INFO Master: Re-launching driver-20170310052347-
> 17/03/10 05:26:27 INFO Master: Launching driver driver-20170310052347- on 
> worker worker-20170310052411-spark-worker-2-40473
> 17/03/10 05:26:35 INFO Master: Registering app TestApplication
> 17/03/10 05:26:35 INFO Master: Registered app TestApplication with ID 
> app-20170310052635-0001
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/1
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/0
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> 

[jira] [Resolved] (SPARK-19900) [Standalone] Master registers application again when driver relaunched

2017-06-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19900.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18084
[https://github.com/apache/spark/pull/18084]

> [Standalone] Master registers application again when driver relaunched
> --
>
> Key: SPARK-19900
> URL: https://issues.apache.org/jira/browse/SPARK-19900
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.6.2
> Environment: Centos 6.5, spark standalone
>Reporter: Sergey
>Priority: Critical
>  Labels: Spark, network, standalone, supervise
> Fix For: 2.3.0
>
>
> I've found some problems when node, where driver is running, has unstable 
> network. A situation is possible when two identical applications are running 
> on a cluster.
> *Steps to Reproduce:*
> # prepare 3 nodes: one for the spark master and two for the spark workers.
> # submit an application with parameter spark.driver.supervise = true
> # go to the node where driver is running (for example spark-worker-1) and 
> close 7077 port
> {code}
> # iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # wait more 60 seconds
> # look at the spark master UI
> There are two spark applications and one driver. The new application has 
> WAITING state and the second application has RUNNING state. Driver has 
> RUNNING or RELAUNCHING state (It depends on the resources available, as I 
> understand it) and it launched on other node (for example spark-worker-2)
> # open the port
> {code}
> # iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # look at the spark UI again
> There are no changes
> In addition, if you look at the processes on the node spark-worker-1
> {code}
> # ps ax | grep spark
> {code}
>  you will see that the old driver is still working!
> *Spark master logs:*
> {code}
> 17/03/10 05:26:27 WARN Master: Removing 
> worker-20170310052240-spark-worker-1-35039 because we got no heartbeat in 60 
> seconds
> 17/03/10 05:26:27 INFO Master: Removing worker 
> worker-20170310052240-spark-worker-1-35039 on spark-worker-1:35039
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 1
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 0
> 17/03/10 05:26:27 INFO Master: Re-launching driver-20170310052347-
> 17/03/10 05:26:27 INFO Master: Launching driver driver-20170310052347- on 
> worker worker-20170310052411-spark-worker-2-40473
> 17/03/10 05:26:35 INFO Master: Registering app TestApplication
> 17/03/10 05:26:35 INFO Master: Registered app TestApplication with ID 
> app-20170310052635-0001
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/1
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/0
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN 

[jira] [Commented] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"

2017-06-14 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049837#comment-16049837
 ] 

Dongjoon Hyun commented on SPARK-20954:
---

Yep, 2.2.0-RC5. It also includes SPARK-12868.

> DESCRIBE showing 1 extra row of "| # col_name  | data_type  | comment  |"
> -
>
> Key: SPARK-20954
> URL: https://issues.apache.org/jira/browse/SPARK-20954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Garros Chan
>Assignee: Dongjoon Hyun
> Fix For: 2.2.0
>
>
> I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added 
> to the result. You can see there is this 1 extra row with "| # col_name  | 
> data_type  | comment  |" ; however, select and select count(*) only shows 1 
> row.
> I searched online a long time and do not find any useful information.
> Is this a bug?
> hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline
> Beeline version 1.2.1.spark2 by Apache Hive
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> beeline> !connect jdbc:hive2://localhost:10016
> Connecting to jdbc:hive2://localhost:10016
> Enter username for jdbc:hive2://localhost:10016: hive
> Enter password for jdbc:hive2://localhost:10016: 
> 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016
> 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016
> 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:10016
> Connected to: Spark SQL (version 2.2.1-SNAPSHOT)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:10016> describe garros.hivefloat;
> +-++--+--+
> |  col_name   | data_type  | comment  |
> +-++--+--+
> | # col_name  | data_type  | comment  |
> | c1  | float  | NULL |
> +-++--+--+
> 2 rows selected (0.396 seconds)
> 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat;
> +-+--+
> | c1  |
> +-+--+
> | 123.99800109863281  |
> +-+--+
> 1 row selected (0.319 seconds)
> 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat;
> +---+--+
> | count(1)  |
> +---+--+
> | 1 |
> +---+--+
> 1 row selected (0.783 seconds)
> 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint;
> +---+-+--+--+
> |   col_name|  data_type  
> | comment  |
> +---+-+--+--+
> | # col_name| data_type   
> | comment  |
> | c1| int 
> | NULL |
> |   | 
> |  |
> | # Detailed Table Information  | 
> |  |
> | Database  | garros  
> |  |
> | Table | hiveint 
> |  |
> | Owner | root
> |  |
> | Created   | Thu Feb 09 17:40:36 EST 2017
> |  |
> | Last Access   | Wed Dec 31 19:00:00 EST 1969
> |  |
> | Type  | MANAGED 
> |  |
> | Provider  | hive  

[jira] [Commented] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"

2017-06-14 Thread Garros Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049833#comment-16049833
 ] 

Garros Chan commented on SPARK-20954:
-

Hi [~dongjoon]

I see. Do you mean spark-2.2.0-rc5? :)

Also, would you be able to find out if rc5 will contain the fix for SPARK-12868?

Thanks!

> DESCRIBE showing 1 extra row of "| # col_name  | data_type  | comment  |"
> -
>
> Key: SPARK-20954
> URL: https://issues.apache.org/jira/browse/SPARK-20954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Garros Chan
>Assignee: Dongjoon Hyun
> Fix For: 2.2.0
>
>
> I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added 
> to the result. You can see there is this 1 extra row with "| # col_name  | 
> data_type  | comment  |" ; however, select and select count(*) only shows 1 
> row.
> I searched online a long time and do not find any useful information.
> Is this a bug?
> hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline
> Beeline version 1.2.1.spark2 by Apache Hive
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> beeline> !connect jdbc:hive2://localhost:10016
> Connecting to jdbc:hive2://localhost:10016
> Enter username for jdbc:hive2://localhost:10016: hive
> Enter password for jdbc:hive2://localhost:10016: 
> 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016
> 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016
> 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:10016
> Connected to: Spark SQL (version 2.2.1-SNAPSHOT)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:10016> describe garros.hivefloat;
> +-++--+--+
> |  col_name   | data_type  | comment  |
> +-++--+--+
> | # col_name  | data_type  | comment  |
> | c1  | float  | NULL |
> +-++--+--+
> 2 rows selected (0.396 seconds)
> 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat;
> +-+--+
> | c1  |
> +-+--+
> | 123.99800109863281  |
> +-+--+
> 1 row selected (0.319 seconds)
> 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat;
> +---+--+
> | count(1)  |
> +---+--+
> | 1 |
> +---+--+
> 1 row selected (0.783 seconds)
> 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint;
> +---+-+--+--+
> |   col_name|  data_type  
> | comment  |
> +---+-+--+--+
> | # col_name| data_type   
> | comment  |
> | c1| int 
> | NULL |
> |   | 
> |  |
> | # Detailed Table Information  | 
> |  |
> | Database  | garros  
> |  |
> | Table | hiveint 
> |  |
> | Owner | root
> |  |
> | Created   | Thu Feb 09 17:40:36 EST 2017
> |  |
> | Last Access   | Wed Dec 31 19:00:00 EST 1969
> |  |
> | Type  | MANAGED   

[jira] [Created] (SPARK-21101) Error running Hive temporary UDTF on latest Spark 2.2

2017-06-14 Thread Dayou Zhou (JIRA)
Dayou Zhou created SPARK-21101:
--

 Summary: Error running Hive temporary UDTF on latest Spark 2.2
 Key: SPARK-21101
 URL: https://issues.apache.org/jira/browse/SPARK-21101
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: Dayou Zhou


I'm using temporary UDTFs on Spark 2.2, e.g.

CREATE TEMPORARY FUNCTION myudtf AS 'com.foo.MyUdtf' USING JAR 
'hdfs:///path/to/udf.jar'; 

But when I try to invoke it, I get the following error:

17/06/14 19:43:50 ERROR SparkExecuteStatementOperation: Error running hive 
query:
org.apache.hive.service.cli.HiveSQLException: 
org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
'com.foo.MyUdtf': java.lang.NullPointerException; line 1 pos 7
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:266)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:184)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Any help appreciated, thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"

2017-06-14 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049829#comment-16049829
 ] 

Dongjoon Hyun commented on SPARK-20954:
---

RC5 is coming very soon with this fix. :)

> DESCRIBE showing 1 extra row of "| # col_name  | data_type  | comment  |"
> -
>
> Key: SPARK-20954
> URL: https://issues.apache.org/jira/browse/SPARK-20954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Garros Chan
>Assignee: Dongjoon Hyun
> Fix For: 2.2.0
>
>
> I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added 
> to the result. You can see there is this 1 extra row with "| # col_name  | 
> data_type  | comment  |" ; however, select and select count(*) only shows 1 
> row.
> I searched online a long time and do not find any useful information.
> Is this a bug?
> hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline
> Beeline version 1.2.1.spark2 by Apache Hive
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> beeline> !connect jdbc:hive2://localhost:10016
> Connecting to jdbc:hive2://localhost:10016
> Enter username for jdbc:hive2://localhost:10016: hive
> Enter password for jdbc:hive2://localhost:10016: 
> 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016
> 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016
> 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:10016
> Connected to: Spark SQL (version 2.2.1-SNAPSHOT)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:10016> describe garros.hivefloat;
> +-++--+--+
> |  col_name   | data_type  | comment  |
> +-++--+--+
> | # col_name  | data_type  | comment  |
> | c1  | float  | NULL |
> +-++--+--+
> 2 rows selected (0.396 seconds)
> 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat;
> +-+--+
> | c1  |
> +-+--+
> | 123.99800109863281  |
> +-+--+
> 1 row selected (0.319 seconds)
> 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat;
> +---+--+
> | count(1)  |
> +---+--+
> | 1 |
> +---+--+
> 1 row selected (0.783 seconds)
> 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint;
> +---+-+--+--+
> |   col_name|  data_type  
> | comment  |
> +---+-+--+--+
> | # col_name| data_type   
> | comment  |
> | c1| int 
> | NULL |
> |   | 
> |  |
> | # Detailed Table Information  | 
> |  |
> | Database  | garros  
> |  |
> | Table | hiveint 
> |  |
> | Owner | root
> |  |
> | Created   | Thu Feb 09 17:40:36 EST 2017
> |  |
> | Last Access   | Wed Dec 31 19:00:00 EST 1969
> |  |
> | Type  | MANAGED 
> |  |
> | Provider  | hive   

[jira] [Commented] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"

2017-06-14 Thread Garros Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049827#comment-16049827
 ] 

Garros Chan commented on SPARK-20954:
-

Hi [~dongjoon]

I see. Would you be able to tell me where I can get the latest build of 2.2.1 
containing this fix?
If not, whom else can I ask, or where can I find this information?

Thanks


> DESCRIBE showing 1 extra row of "| # col_name  | data_type  | comment  |"
> -
>
> Key: SPARK-20954
> URL: https://issues.apache.org/jira/browse/SPARK-20954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Garros Chan
>Assignee: Dongjoon Hyun
> Fix For: 2.2.0
>
>
> I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added 
> to the result. You can see there is this 1 extra row with "| # col_name  | 
> data_type  | comment  |" ; however, select and select count(*) only shows 1 
> row.
> I searched online a long time and do not find any useful information.
> Is this a bug?
> hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline
> Beeline version 1.2.1.spark2 by Apache Hive
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> beeline> !connect jdbc:hive2://localhost:10016
> Connecting to jdbc:hive2://localhost:10016
> Enter username for jdbc:hive2://localhost:10016: hive
> Enter password for jdbc:hive2://localhost:10016: 
> 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016
> 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016
> 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:10016
> Connected to: Spark SQL (version 2.2.1-SNAPSHOT)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:10016> describe garros.hivefloat;
> +-++--+--+
> |  col_name   | data_type  | comment  |
> +-++--+--+
> | # col_name  | data_type  | comment  |
> | c1  | float  | NULL |
> +-++--+--+
> 2 rows selected (0.396 seconds)
> 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat;
> +-+--+
> | c1  |
> +-+--+
> | 123.99800109863281  |
> +-+--+
> 1 row selected (0.319 seconds)
> 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat;
> +---+--+
> | count(1)  |
> +---+--+
> | 1 |
> +---+--+
> 1 row selected (0.783 seconds)
> 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint;
> +---+-+--+--+
> |   col_name|  data_type  
> | comment  |
> +---+-+--+--+
> | # col_name| data_type   
> | comment  |
> | c1| int 
> | NULL |
> |   | 
> |  |
> | # Detailed Table Information  | 
> |  |
> | Database  | garros  
> |  |
> | Table | hiveint 
> |  |
> | Owner | root
> |  |
> | Created   | Thu Feb 09 17:40:36 EST 2017
> |  |
> | Last Access   | Wed Dec 31 19:00:00 EST 1969
> |  |
> | Type 

[jira] [Commented] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"

2017-06-14 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049826#comment-16049826
 ] 

Dongjoon Hyun commented on SPARK-20954:
---

FYI, the following is the result on `branch-2.2`. I'm not sure how the snapshot 
is built.
{code}
~/s/spark-master:branch-2.2$ current_shell
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.1-SNAPSHOT
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sql("show tables").show
++-+---+
|database|tableName|isTemporary|
++-+---+
| default|t|  false|
++-+---+


scala> sql("desc t").show
++-+---+
|col_name|data_type|comment|
++-+---+
|   a|  int|   null|
++-+---+


scala> sql("desc extended t").show
+++---+
|col_name|   data_type|comment|
+++---+
|   a| int|   null|
...
{code}

> DESCRIBE showing 1 extra row of "| # col_name  | data_type  | comment  |"
> -
>
> Key: SPARK-20954
> URL: https://issues.apache.org/jira/browse/SPARK-20954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Garros Chan
>Assignee: Dongjoon Hyun
> Fix For: 2.2.0
>
>
> I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added 
> to the result. You can see there is this 1 extra row with "| # col_name  | 
> data_type  | comment  |" ; however, select and select count(*) only shows 1 
> row.
> I searched online a long time and do not find any useful information.
> Is this a bug?
> hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline
> Beeline version 1.2.1.spark2 by Apache Hive
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> beeline> !connect jdbc:hive2://localhost:10016
> Connecting to jdbc:hive2://localhost:10016
> Enter username for jdbc:hive2://localhost:10016: hive
> Enter password for jdbc:hive2://localhost:10016: 
> 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016
> 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016
> 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:10016
> Connected to: Spark SQL (version 2.2.1-SNAPSHOT)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:10016> describe garros.hivefloat;
> +-++--+--+
> |  col_name   | data_type  | comment  |
> +-++--+--+
> | # col_name  | data_type  | comment  |
> | c1  | float  | NULL |
> +-++--+--+
> 2 rows selected (0.396 seconds)
> 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat;
> +-+--+
> | c1  |
> +-+--+
> | 123.99800109863281  |
> +-+--+
> 1 row selected (0.319 seconds)
> 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat;
> +---+--+
> | count(1)  |
> +---+--+
> | 1 |
> +---+--+
> 1 row selected (0.783 seconds)
> 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint;
> +---+-+--+--+
> |   col_name|  data_type  
> | comment  |
> +---+-+--+--+
> | # col_name| data_type   
> | comment  |
> | c1

[jira] [Commented] (SPARK-18294) Implement commit protocol to support `mapred` package's committer

2017-06-14 Thread Dayou Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049816#comment-16049816
 ] 

Dayou Zhou commented on SPARK-18294:


Hi [~jiangxb1987], does this answer your question? Any help appreciated, thanks.

> Implement commit protocol to support `mapred` package's committer
> -
>
> Key: SPARK-18294
> URL: https://issues.apache.org/jira/browse/SPARK-18294
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Jiang Xingbo
>
> Current `FileCommitProtocol` is based on `mapreduce` package, we should 
> implement a `HadoopMapRedCommitProtocol` that supports the older mapred 
> package's committer.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12868) ADD JAR via sparkSQL JDBC will fail when using a HDFS URL

2017-06-14 Thread Dayou Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049814#comment-16049814
 ] 

Dayou Zhou commented on SPARK-12868:


Hi [~tleftwich] and all,
I was using Spark 2.0, and when I tried to invoke a Hive UDTF, I got the following 
error:

{noformat}
Undefined function: '...'. This function is neither a registered temporary 
function nor a permanent function registered in the database 'default'.;
{noformat}

Then I picked up Spark 2.2 with this fix and the above error went away, which 
is good. However, when I tried to create new UDTFs and invoke them, I again 
ran into the above error. Any idea why? Any help appreciated.
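
For anyone reproducing this from PySpark, the flow being described looks roughly 
like the sketch below (the jar path, UDTF class name, and table are hypothetical 
placeholders, not the actual ones used here); on the affected builds it is the 
final SELECT that fails with the 'Undefined function' error quoted above.

{code}
from pyspark.sql import SparkSession

# Hypothetical names: adjust the jar path, UDTF class, and table to your setup.
spark = (SparkSession.builder
         .appName("hive-udtf-check")
         .enableHiveSupport()          # Hive support is required for Hive UDTFs
         .getOrCreate())

spark.sql("ADD JAR hdfs:///tmp/foo.jar")
spark.sql("CREATE TEMPORARY FUNCTION my_udtf AS 'com.example.MyUdtf'")

# Invoking the freshly registered UDTF is the step that errors out on the
# affected builds.
spark.sql("SELECT my_udtf(c1) FROM some_table").show()
{code}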

> ADD JAR via sparkSQL JDBC will fail when using a HDFS URL
> -
>
> Key: SPARK-12868
> URL: https://issues.apache.org/jira/browse/SPARK-12868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Trystan Leftwich
>Assignee: Weiqing Yang
> Fix For: 2.2.0
>
>
> When trying to add a jar with a HDFS URI, i.E
> {code:sql}
> ADD JAR hdfs:///tmp/foo.jar
> {code}
> Via the spark sql JDBC interface it will fail with:
> {code:sql}
> java.net.MalformedURLException: unknown protocol: hdfs
> at java.net.URL.(URL.java:593)
> at java.net.URL.(URL.java:483)
> at java.net.URL.(URL.java:432)
> at java.net.URI.toURL(URI.java:1089)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:578)
> at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:652)
> at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:89)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:211)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21099) INFO Log Message Using Incorrect Executor Idle Timeout Value

2017-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21099:


Assignee: Apache Spark

> INFO Log Message Using Incorrect Executor Idle Timeout Value
> 
>
> Key: SPARK-21099
> URL: https://issues.apache.org/jira/browse/SPARK-21099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Hazem Mahmoud
>Assignee: Apache Spark
>Priority: Trivial
>
> INFO log message is using the wrong idle timeout 
> (spark.dynamicAllocation.executorIdleTimeout) when printing the message that 
> the executor holding the RDD cache is being removed.
> INFO spark.ExecutorAllocationManager: Removing executor 1 because it has been 
> idle for 30 seconds (new desired total will be 0)
> It should be using spark.dynamicAllocation.cachedExecutorIdleTimeout when the 
> RDD cache timeout is reached. I was able to confirm this by doing the 
> following:
> 1. Update spark-defaults.conf to set the following:
> executorIdleTimeout=30
> cachedExecutorIdleTimeout=20
> 2. Update log4j.properties to set the following:
> shell.log.level=INFO
> 3. Run the following in spark-shell:
> scala> val textFile = sc.textFile("/user/spark/applicationHistory/app_1234")
> scala> textFile.cache().count()
> 4. After 30 secs you will see 2 timeout messages, both of which are 30 secs 
> (whereas one *should* be for 20 secs)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21099) INFO Log Message Using Incorrect Executor Idle Timeout Value

2017-06-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049790#comment-16049790
 ] 

Apache Spark commented on SPARK-21099:
--

User 'ihazem' has created a pull request for this issue:
https://github.com/apache/spark/pull/18308

> INFO Log Message Using Incorrect Executor Idle Timeout Value
> 
>
> Key: SPARK-21099
> URL: https://issues.apache.org/jira/browse/SPARK-21099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Hazem Mahmoud
>Priority: Trivial
>
> INFO log message is using the wrong idle timeout 
> (spark.dynamicAllocation.executorIdleTimeout) when printing the message that 
> the executor holding the RDD cache is being removed.
> INFO spark.ExecutorAllocationManager: Removing executor 1 because it has been 
> idle for 30 seconds (new desired total will be 0)
> It should be using spark.dynamicAllocation.cachedExecutorIdleTimeout when the 
> RDD cache timeout is reached. I was able to confirm this by doing the 
> following:
> 1. Update spark-defaults.conf to set the following:
> executorIdleTimeout=30
> cachedExecutorIdleTimeout=20
> 2. Update log4j.properties to set the following:
> shell.log.level=INFO
> 3. Run the following in spark-shell:
> scala> val textFile = sc.textFile("/user/spark/applicationHistory/app_1234")
> scala> textFile.cache().count()
> 4. After 30 secs you will see 2 timeout messages, both of which are for 30 secs 
> (whereas one *should* be for 20 secs)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21099) INFO Log Message Using Incorrect Executor Idle Timeout Value

2017-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21099:


Assignee: (was: Apache Spark)

> INFO Log Message Using Incorrect Executor Idle Timeout Value
> 
>
> Key: SPARK-21099
> URL: https://issues.apache.org/jira/browse/SPARK-21099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Hazem Mahmoud
>Priority: Trivial
>
> INFO log message is using the wrong idle timeout 
> (spark.dynamicAllocation.executorIdleTimeout) when printing the message that 
> the executor holding the RDD cache is being removed.
> INFO spark.ExecutorAllocationManager: Removing executor 1 because it has been 
> idle for 30 seconds (new desired total will be 0)
> It should be using spark.dynamicAllocation.cachedExecutorIdleTimeout when the 
> RDD cache timeout is reached. I was able to confirm this by doing the 
> following:
> 1. Update spark-defaults.conf to set the following:
> executorIdleTimeout=30
> cachedExecutorIdleTimeout=20
> 2. Update log4j.properties to set the following:
> shell.log.level=INFO
> 3. Run the following in spark-shell:
> scala> val textFile = sc.textFile("/user/spark/applicationHistory/app_1234")
> scala> textFile.cache().count()
> 4. After 30 secs you will see 2 timeout messages, both of which are for 30 secs 
> (whereas one *should* be for 20 secs)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"

2017-06-14 Thread Garros Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049785#comment-16049785
 ] 

Garros Chan commented on SPARK-20954:
-

Hi [~dongjoon]

I downloaded spark-2.2.1-SNAPSHOT-bin-hadoop2.7.tgz (2017-06-14 09:44, 194M) from 
https://people.apache.org/~pwendell/spark-nightly/spark-branch-2.2-bin/latest/

I still see that one extra line in DESCRIBE.
Does this latest tgz contain the fix for this JIRA?

Thanks


> DESCRIBE showing 1 extra row of "| # col_name  | data_type  | comment  |"
> -
>
> Key: SPARK-20954
> URL: https://issues.apache.org/jira/browse/SPARK-20954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Garros Chan
>Assignee: Dongjoon Hyun
> Fix For: 2.2.0
>
>
> I am trying to do DESCRIBE on a table but I am seeing 1 extra row being auto-added 
> to the result. You can see there is this 1 extra row with "| # col_name  | 
> data_type  | comment  |"; however, select and select count(*) only show 1 
> row.
> I searched online for a long time and did not find any useful information.
> Is this a bug?
> hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline
> Beeline version 1.2.1.spark2 by Apache Hive
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> beeline> !connect jdbc:hive2://localhost:10016
> Connecting to jdbc:hive2://localhost:10016
> Enter username for jdbc:hive2://localhost:10016: hive
> Enter password for jdbc:hive2://localhost:10016: 
> 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016
> 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016
> 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:10016
> Connected to: Spark SQL (version 2.2.1-SNAPSHOT)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:10016> describe garros.hivefloat;
> +-++--+--+
> |  col_name   | data_type  | comment  |
> +-++--+--+
> | # col_name  | data_type  | comment  |
> | c1  | float  | NULL |
> +-++--+--+
> 2 rows selected (0.396 seconds)
> 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat;
> +-+--+
> | c1  |
> +-+--+
> | 123.99800109863281  |
> +-+--+
> 1 row selected (0.319 seconds)
> 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat;
> +---+--+
> | count(1)  |
> +---+--+
> | 1 |
> +---+--+
> 1 row selected (0.783 seconds)
> 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint;
> +---+-+--+--+
> |   col_name|  data_type  
> | comment  |
> +---+-+--+--+
> | # col_name| data_type   
> | comment  |
> | c1| int 
> | NULL |
> |   | 
> |  |
> | # Detailed Table Information  | 
> |  |
> | Database  | garros  
> |  |
> | Table | hiveint 
> |  |
> | Owner | root
> |  |
> | Created   | Thu Feb 09 17:40:36 EST 2017
> |  |
> | Last Access  

[jira] [Updated] (SPARK-21084) Improvements to dynamic allocation for notebook use cases

2017-06-14 Thread Frederick Reiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frederick Reiss updated SPARK-21084:

Description: 
One important application of Spark is to support many notebook users with a 
single YARN or Spark Standalone cluster.  We at IBM have seen this requirement 
across multiple deployments of Spark: on-premises and private cloud deployments 
at our clients, as well as on the IBM cloud.  The scenario goes something like 
this: "Every morning at 9am, 500 analysts log into their computers and start 
running Spark notebooks intermittently for the next 8 hours." I'm sure that 
many other members of the community are interested in making similar scenarios 
work.

Dynamic allocation is supposed to support these kinds of use cases by shifting 
cluster resources towards users who are currently executing scalable code.  In 
our own testing, we have encountered a number of issues with using the current 
implementation of dynamic allocation for this purpose:
*Issue #1: Starvation.* A Spark job acquires all available containers, 
preventing other jobs or applications from starting.
*Issue #2: Request latency.* Jobs that would normally finish in less than 30 
seconds take 2-4x longer than normal with dynamic allocation.
*Issue #3: Unfair resource allocation due to cached data.* Applications that 
have cached RDD partitions hold onto executors indefinitely, denying those 
resources to other applications.
*Issue #4: Loss of cached data leads to thrashing.*  Applications repeatedly 
lose partitions of cached RDDs because the underlying executors are removed; 
the applications then need to rerun expensive computations.

This umbrella JIRA covers efforts to address these issues by making 
enhancements to Spark.
Here's a high-level summary of the current planned work:
* [SPARK-21097]: Preserve an executor's cached data when shutting down the 
executor.
* (JIRA TBD): Make Spark give up executors in a controlled fashion when the RM 
indicates it is running low on capacity.
* (JIRA TBD): Reduce the delay for dynamic allocation to spin up new executors.

Note that this overall plan is subject to change, and other members of the 
community should feel free to suggest changes and to help out.

  was:
One important application of Spark is to support many notebook users with a 
single YARN or Spark Standalone cluster.  We at IBM have seen this requirement 
across multiple deployments of Spark: on-premises and private cloud deployments 
at our clients, as well as on the IBM cloud.  The scenario goes something like 
this: "Every morning at 9am, 500 analysts log into their computers and start 
running Spark notebooks intermittently for the next 8 hours." I'm sure that 
many other members of the community are interested in making similar scenarios 
work.

Dynamic allocation is supposed to support these kinds of use cases by shifting 
cluster resources towards users who are currently executing scalable code.  In 
our own testing, we have encountered a number of issues with using the current 
implementation of dynamic allocation for this purpose:
*Issue #1: Starvation.* A Spark job acquires all available containers, 
preventing other jobs or applications from starting.
*Issue #2: Request latency.* Jobs that would normally finish in less than 30 
seconds take 2-4x longer than normal with dynamic allocation.
*Issue #3: Unfair resource allocation due to cached data.* Applications that 
have cached RDD partitions hold onto executors indefinitely, denying those 
resources to other applications.
*Issue #4: Loss of cached data leads to thrashing.*  Applications repeatedly 
lose partitions of cached RDDs because the underlying executors are removed; 
the applications then need to rerun expensive computations.

This umbrella JIRA covers efforts to address these issues by making 
enhancements to Spark.
Here's a high-level summary of the current set of planned enhancements:
* [SPARK-21097]:Preserve an executor's cached data when shutting down the 
executor 

Note that this overall plan is subject to change, and other members of the 
community should feel free to suggest changes and to help out.


> Improvements to dynamic allocation for notebook use cases
> -
>
> Key: SPARK-21084
> URL: https://issues.apache.org/jira/browse/SPARK-21084
> Project: Spark
>  Issue Type: Umbrella
>  Components: Block Manager, Scheduler, Spark Core, YARN
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Frederick Reiss
>
> One important application of Spark is to support many notebook users with a 
> single YARN or Spark Standalone cluster.  We at IBM have seen this 
> requirement across multiple deployments of Spark: on-premises and private 
> cloud deployments at our clients, as well as on the IBM cloud.  The scenario 

[jira] [Assigned] (SPARK-21100) describe should give quartiles similar to Pandas

2017-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21100:


Assignee: (was: Apache Spark)

> describe should give quartiles similar to Pandas
> 
>
> Key: SPARK-21100
> URL: https://issues.apache.org/jira/browse/SPARK-21100
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Andrew Ray
>Priority: Minor
>
> The DataFrame describe method should also include quartiles (25th, 50th, and 
> 75th percentiles) like Pandas.
> Example pandas output:
> {code}
> In [4]: df.describe()
> Out[4]:
>Unnamed: 0   displ year cyl cty hwy
> count  234.00  234.00   234.00  234.00  234.00  234.00
> mean   117.503.471795  2003.505.89   16.858974   23.440171
> std 67.6941651.291959 4.5096461.6115344.2559465.954643
> min  1.001.60  1999.004.009.00   12.00
> 25% 59.252.40  1999.004.00   14.00   18.00
> 50%117.503.30  2003.506.00   17.00   24.00
> 75%175.754.60  2008.008.00   19.00   27.00
> max234.007.00  2008.008.00   35.00   44.00
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21100) describe should give quartiles similar to Pandas

2017-06-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049701#comment-16049701
 ] 

Apache Spark commented on SPARK-21100:
--

User 'aray' has created a pull request for this issue:
https://github.com/apache/spark/pull/18307

> describe should give quartiles similar to Pandas
> 
>
> Key: SPARK-21100
> URL: https://issues.apache.org/jira/browse/SPARK-21100
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Andrew Ray
>Priority: Minor
>
> The DataFrame describe method should also include quartiles (25th, 50th, and 
> 75th percentiles) like Pandas.
> Example pandas output:
> {code}
> In [4]: df.describe()
> Out[4]:
>Unnamed: 0   displ year cyl cty hwy
> count  234.00  234.00   234.00  234.00  234.00  234.00
> mean   117.503.471795  2003.505.89   16.858974   23.440171
> std 67.6941651.291959 4.5096461.6115344.2559465.954643
> min  1.001.60  1999.004.009.00   12.00
> 25% 59.252.40  1999.004.00   14.00   18.00
> 50%117.503.30  2003.506.00   17.00   24.00
> 75%175.754.60  2008.008.00   19.00   27.00
> max234.007.00  2008.008.00   35.00   44.00
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21100) describe should give quartiles similar to Pandas

2017-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21100:


Assignee: Apache Spark

> describe should give quartiles similar to Pandas
> 
>
> Key: SPARK-21100
> URL: https://issues.apache.org/jira/browse/SPARK-21100
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Andrew Ray
>Assignee: Apache Spark
>Priority: Minor
>
> The DataFrame describe method should also include quartiles (25th, 50th, and 
> 75th percentiles) like Pandas.
> Example pandas output:
> {code}
> In [4]: df.describe()
> Out[4]:
>Unnamed: 0   displ year cyl cty hwy
> count  234.00  234.00   234.00  234.00  234.00  234.00
> mean   117.503.471795  2003.505.89   16.858974   23.440171
> std 67.6941651.291959 4.5096461.6115344.2559465.954643
> min  1.001.60  1999.004.009.00   12.00
> 25% 59.252.40  1999.004.00   14.00   18.00
> 50%117.503.30  2003.506.00   17.00   24.00
> 75%175.754.60  2008.008.00   19.00   27.00
> max234.007.00  2008.008.00   35.00   44.00
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21100) describe should give quartiles similar to Pandas

2017-06-14 Thread Andrew Ray (JIRA)
Andrew Ray created SPARK-21100:
--

 Summary: describe should give quartiles similar to Pandas
 Key: SPARK-21100
 URL: https://issues.apache.org/jira/browse/SPARK-21100
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.1
Reporter: Andrew Ray
Priority: Minor


The DataFrame describe method should also include quartiles (25th, 50th, and 
75th percentiles) like Pandas.

Example pandas output:
{code}
In [4]: df.describe()
Out[4]:
   Unnamed: 0   displ year cyl cty hwy
count  234.00  234.00   234.00  234.00  234.00  234.00
mean   117.503.471795  2003.505.89   16.858974   23.440171
std 67.6941651.291959 4.5096461.6115344.2559465.954643
min  1.001.60  1999.004.009.00   12.00
25% 59.252.40  1999.004.00   14.00   18.00
50%117.503.30  2003.506.00   17.00   24.00
75%175.754.60  2008.008.00   19.00   27.00
max234.007.00  2008.008.00   35.00   44.00
{code}
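
Until such a change lands, one possible workaround is to compute the quartiles directly; a 
minimal sketch using DataFrameStatFunctions.approxQuantile, which already exists in Spark 2.x 
(the column name is illustrative):

{code}
// Minimal sketch, not part of this proposal: approximate quartiles for one numeric
// column via approxQuantile. The column name "cty" is illustrative.
val quartiles = df.stat.approxQuantile("cty", Array(0.25, 0.5, 0.75), 0.0)
// quartiles(0) = 25th percentile, quartiles(1) = median, quartiles(2) = 75th percentile
{code}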



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21084) Improvements to dynamic allocation for notebook use cases

2017-06-14 Thread Frederick Reiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frederick Reiss updated SPARK-21084:

Description: 
One important application of Spark is to support many notebook users with a 
single YARN or Spark Standalone cluster.  We at IBM have seen this requirement 
across multiple deployments of Spark: on-premises and private cloud deployments 
at our clients, as well as on the IBM cloud.  The scenario goes something like 
this: "Every morning at 9am, 500 analysts log into their computers and start 
running Spark notebooks intermittently for the next 8 hours." I'm sure that 
many other members of the community are interested in making similar scenarios 
work.

Dynamic allocation is supposed to support these kinds of use cases by shifting 
cluster resources towards users who are currently executing scalable code.  In 
our own testing, we have encountered a number of issues with using the current 
implementation of dynamic allocation for this purpose:
*Issue #1: Starvation.* A Spark job acquires all available containers, 
preventing other jobs or applications from starting.
*Issue #2: Request latency.* Jobs that would normally finish in less than 30 
seconds take 2-4x longer than normal with dynamic allocation.
*Issue #3: Unfair resource allocation due to cached data.* Applications that 
have cached RDD partitions hold onto executors indefinitely, denying those 
resources to other applications.
*Issue #4: Loss of cached data leads to thrashing.*  Applications repeatedly 
lose partitions of cached RDDs because the underlying executors are removed; 
the applications then need to rerun expensive computations.

This umbrella JIRA covers efforts to address these issues by making 
enhancements to Spark.
Here's a high-level summary of the current set of planned enhancements:
* [SPARK-21097]:Preserve an executor's cached data when shutting down the 
executor 

Note that this overall plan is subject to change, and other members of the 
community should feel free to suggest changes and to help out.

  was:
One important application of Spark is to support many notebook users with a 
single YARN or Spark Standalone cluster.  We at IBM have seen this requirement 
across multiple deployments of Spark: on-premises and private cloud deployments 
at our clients, as well as on the IBM cloud.  The scenario goes something like 
this: "Every morning at 9am, 500 analysts log into their computers and start 
running Spark notebooks intermittently for the next 8 hours." I'm sure that 
many other members of the community are interested in making similar scenarios 
work.

Dynamic allocation is supposed to support these kinds of use cases by shifting 
cluster resources towards users who are currently executing scalable code.  In 
our own testing, we have encountered a number of issues with using the current 
implementation of dynamic allocation for this purpose:
*Issue #1: Starvation.* A Spark job acquires all available containers, 
preventing other jobs or applications from starting.
*Issue #2: Request latency.* Jobs that would normally finish in less than 30 
seconds take 2-4x longer than normal with dynamic allocation.
*Issue #3: Unfair resource allocation due to cached data.* Applications that 
have cached RDD partitions hold onto executors indefinitely, denying those 
resources to other applications.
*Issue #4: Loss of cached data leads to thrashing.*  Applications repeatedly 
lose partitions of cached RDDs because the underlying executors are removed; 
the applications then need to rerun expensive computations.

This umbrella JIRA covers efforts to address these issues by making 
enhancements to Spark.



> Improvements to dynamic allocation for notebook use cases
> -
>
> Key: SPARK-21084
> URL: https://issues.apache.org/jira/browse/SPARK-21084
> Project: Spark
>  Issue Type: Umbrella
>  Components: Block Manager, Scheduler, Spark Core, YARN
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Frederick Reiss
>
> One important application of Spark is to support many notebook users with a 
> single YARN or Spark Standalone cluster.  We at IBM have seen this 
> requirement across multiple deployments of Spark: on-premises and private 
> cloud deployments at our clients, as well as on the IBM cloud.  The scenario 
> goes something like this: "Every morning at 9am, 500 analysts log into their 
> computers and start running Spark notebooks intermittently for the next 8 
> hours." I'm sure that many other members of the community are interested in 
> making similar scenarios work.
> 
> Dynamic allocation is supposed to support these kinds of use cases by 
> shifting cluster resources towards users who are currently executing scalable 
> code.  In our own testing, we have encountered a 

[jira] [Resolved] (SPARK-21091) Move constraint code into QueryPlanConstraints

2017-06-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-21091.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Move constraint code into QueryPlanConstraints
> --
>
> Key: SPARK-21091
> URL: https://issues.apache.org/jira/browse/SPARK-21091
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21084) Improvements to dynamic allocation for notebook use cases

2017-06-14 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049669#comment-16049669
 ] 

Frederick Reiss commented on SPARK-21084:
-

[~sowen] thanks for having a look at this JIRA and giving feedback!

I must confess that, when our product groups first brought these issues to my 
attention, my initial response was similar to yours. Each of the issues 
described above can be fixed *in isolation* by reconfiguring Spark and/or the 
resource manager.  The problem is that every such fix makes the other issues 
worse.  We spent a number of weeks playing best practices whack-a-mole before 
resigning ourselves to making some targeted improvements to Spark itself.

I'll update the description of this JIRA in a moment with a high-level 
description of the Spark changes we're currently looking into.

In the meantime, here's a quick summary of what we ran into while attempting to 
devise a workable configuration of dynamic allocation for notebook users:
Issue #1 (starvation): The obvious fix here is preemption. But there is 
currently no way to preempt an executor gently. The only option is to shut down 
the executor and drop its data, which leads to issues #2 and #4.  Worse, 
Spark's scheduling and cache management are opaque to the resource manager, so 
the RM makes arbitrary choices of which executor to shoot.
Another approach is to configure 
{{spark.dynamicAllocation.cachedExecutorIdleTimeout}} so that notebook sessions 
voluntarily give up executors, even when those executors have cached data. But 
this leads to issues #2 and #4.

Issue #2 (request latency): This issue has two root causes: 
a) It takes a noticeable amount of time to start and ramp up new executors.
b) Spark defers the cost of issue #4 (losing cached data) until a job attempts 
to consume the missing data.
For root cause (a), the obvious solution is to reserve a permanent minimum pool 
of executors for each notebook user by setting the 
{{spark.dynamicAllocation.minExecutors}} parameter to a sufficiently high 
value. But tying down containers in this way leaves fewer resources for other 
users, exacerbating issues #1, #3, and #4. The reserved executors are likely to 
be idle most of the time, because notebook users alternate between running 
Spark jobs, running local computation in the notebook kernel, and looking at 
results in the web browser. 
See issue #4 below for what happens when you try to address root cause (b) with 
config changes.

Issue #3 (unfair allocation of CPU): The obvious fix here is to set 
{{spark.dynamicAllocation.cachedExecutorIdleTimeout}} so that notebook sessions 
voluntarily give up executors, even when those executors have cached data. But 
this leads to issues #2 and #4. One can also reduce the value of 
{{spark.dynamicAllocation.maxExecutors}}, but that puts a cap on the degree of 
parallelism that a given user can access, leading to more of issue #2.

Issue #4 (loss of cached data): The obvious fix here is to set 
{{spark.dynamicAllocation.cachedExecutorIdleTimeout}} to infinity. But then any 
notebook user who has called RDD.cache() at some point in the past will tie 
down a large pool of containers indefinitely, leading to issues #1, #2, and #3. 
If you attempt to limit the size of this large pool by reducing 
{{spark.dynamicAllocation.maxExecutors}}, you limit the peak performance that 
the notebook user can get out of Spark, leading to issue #2.
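
(For reference, a minimal sketch of the dynamic allocation knobs discussed above; the values 
are placeholders for illustration only, not recommendations.)

{code}
// Illustrative sketch of the settings being tuned in the experiments above.
// All keys are existing dynamic allocation options; the values are placeholders.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "600s")
{code}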


> Improvements to dynamic allocation for notebook use cases
> -
>
> Key: SPARK-21084
> URL: https://issues.apache.org/jira/browse/SPARK-21084
> Project: Spark
>  Issue Type: Umbrella
>  Components: Block Manager, Scheduler, Spark Core, YARN
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Frederick Reiss
>
> One important application of Spark is to support many notebook users with a 
> single YARN or Spark Standalone cluster.  We at IBM have seen this 
> requirement across multiple deployments of Spark: on-premises and private 
> cloud deployments at our clients, as well as on the IBM cloud.  The scenario 
> goes something like this: "Every morning at 9am, 500 analysts log into their 
> computers and start running Spark notebooks intermittently for the next 8 
> hours." I'm sure that many other members of the community are interested in 
> making similar scenarios work.
> 
> Dynamic allocation is supposed to support these kinds of use cases by 
> shifting cluster resources towards users who are currently executing scalable 
> code.  In our own testing, we have encountered a number of issues with using 
> the current implementation of dynamic allocation for this purpose:
> *Issue #1: Starvation.* A Spark job acquires all available containers, 
> preventing other jobs or applications from starting.
> *Issue #2: Request latency.* Jobs 

[jira] [Updated] (SPARK-21099) INFO Log Message Using Incorrect Executor Idle Timeout Value

2017-06-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21099:
--
Priority: Trivial  (was: Major)

> INFO Log Message Using Incorrect Executor Idle Timeout Value
> 
>
> Key: SPARK-21099
> URL: https://issues.apache.org/jira/browse/SPARK-21099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Hazem Mahmoud
>Priority: Trivial
>
> INFO log message is using the wrong idle timeout 
> (spark.dynamicAllocation.executorIdleTimeout) when printing the message that 
> the executor holding the RDD cache is being removed.
> INFO spark.ExecutorAllocationManager: Removing executor 1 because it has been 
> idle for 30 seconds (new desired total will be 0)
> It should be using spark.dynamicAllocation.cachedExecutorIdleTimeout when the 
> RDD cache timeout is reached. I was able to confirm this by doing the 
> following:
> 1. Update spark-defaults.conf to set the following:
> executorIdleTimeout=30
> cachedExecutorIdleTimeout=20
> 2. Update log4j.properties to set the following:
> shell.log.level=INFO
> 3. Run the following in spark-shell:
> scala> val textFile = sc.textFile("/user/spark/applicationHistory/app_1234")
> scala> textFile.cache().count()
> 4. After 30 secs you will see 2 timeout messages, both of which are for 30 secs 
> (whereas one *should* be for 20 secs)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21099) INFO Log Message Using Incorrect Executor Idle Timeout Value

2017-06-14 Thread Hazem Mahmoud (JIRA)
Hazem Mahmoud created SPARK-21099:
-

 Summary: INFO Log Message Using Incorrect Executor Idle Timeout 
Value
 Key: SPARK-21099
 URL: https://issues.apache.org/jira/browse/SPARK-21099
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0, 1.6.0
Reporter: Hazem Mahmoud


INFO log message is using the wrong idle timeout 
(spark.dynamicAllocation.executorIdleTimeout) when printing the message that 
the executor holding the RDD cache is being removed.

INFO spark.ExecutorAllocationManager: Removing executor 1 because it has been 
idle for 30 seconds (new desired total will be 0)

It should be using spark.dynamicAllocation.cachedExecutorIdleTimeout when the 
RDD cache timeout is reached. I was able to confirm this by doing the following:

1. Update spark-defaults.conf to set the following:
executorIdleTimeout=30
cachedExecutorIdleTimeout=20
2. Update log4j.properties to set the following:
shell.log.level=INFO
3. Run the following in spark-shell:
scala> val textFile = sc.textFile("/user/spark/applicationHistory/app_1234")
scala> textFile.cache().count()
4. After 30 secs you will see 2 timeout messages, both of which are for 30 secs 
(whereas one *should* be for 20 secs)
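
(The abbreviated keys above presumably refer to the full spark.dynamicAllocation.* property 
names; a minimal sketch of the configuration under that assumption, using the values from the 
reproduction steps:)

{code}
// Minimal sketch, assuming the reproduction used the full spark.dynamicAllocation.* keys.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.executorIdleTimeout", "30s")
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "20s")
{code}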



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21029) All StreamingQuery should be stopped when the SparkSession is stopped

2017-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21029:


Assignee: Apache Spark

> All StreamingQuery should be stopped when the SparkSession is stopped
> -
>
> Key: SPARK-21029
> URL: https://issues.apache.org/jira/browse/SPARK-21029
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21029) All StreamingQuery should be stopped when the SparkSession is stopped

2017-06-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049626#comment-16049626
 ] 

Apache Spark commented on SPARK-21029:
--

User 'aray' has created a pull request for this issue:
https://github.com/apache/spark/pull/18306

> All StreamingQuery should be stopped when the SparkSession is stopped
> -
>
> Key: SPARK-21029
> URL: https://issues.apache.org/jira/browse/SPARK-21029
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21029) All StreamingQuery should be stopped when the SparkSession is stopped

2017-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21029:


Assignee: (was: Apache Spark)

> All StreamingQuery should be stopped when the SparkSession is stopped
> -
>
> Key: SPARK-21029
> URL: https://issues.apache.org/jira/browse/SPARK-21029
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20988) Convert logistic regression to new aggregator framework

2017-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20988:


Assignee: (was: Apache Spark)

> Convert logistic regression to new aggregator framework
> ---
>
> Key: SPARK-20988
> URL: https://issues.apache.org/jira/browse/SPARK-20988
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Use the hierarchy from SPARK-19762 for logistic regression optimization



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20988) Convert logistic regression to new aggregator framework

2017-06-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049519#comment-16049519
 ] 

Apache Spark commented on SPARK-20988:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/18305

> Convert logistic regression to new aggregator framework
> ---
>
> Key: SPARK-20988
> URL: https://issues.apache.org/jira/browse/SPARK-20988
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Use the hierarchy from SPARK-19762 for logistic regression optimization



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20988) Convert logistic regression to new aggregator framework

2017-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20988:


Assignee: Apache Spark

> Convert logistic regression to new aggregator framework
> ---
>
> Key: SPARK-20988
> URL: https://issues.apache.org/jira/browse/SPARK-20988
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>Priority: Minor
>
> Use the hierarchy from SPARK-19762 for logistic regression optimization



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21098) Add line separator option to csv read/write

2017-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21098:


Assignee: (was: Apache Spark)

> Add line separator option to csv read/write
> ---
>
> Key: SPARK-21098
> URL: https://issues.apache.org/jira/browse/SPARK-21098
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>Priority: Minor
>
> In order to allow users to work with csv files with non-unix line endings, it 
> would be nice to allow users to pass their line separator (aka newline 
> character) as an option to the spark.read.csv command.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21098) Add line separator option to csv read/write

2017-06-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049509#comment-16049509
 ] 

Apache Spark commented on SPARK-21098:
--

User 'danielvdende' has created a pull request for this issue:
https://github.com/apache/spark/pull/18304

> Add line separator option to csv read/write
> ---
>
> Key: SPARK-21098
> URL: https://issues.apache.org/jira/browse/SPARK-21098
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>Priority: Minor
>
> In order to allow users to work with csv files with non-unix line endings, it 
> would be nice to allow users to pass their line separator (aka newline 
> character) as an option to the spark.read.csv command.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21098) Add line separator option to csv read/write

2017-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21098:


Assignee: Apache Spark

> Add line separator option to csv read/write
> ---
>
> Key: SPARK-21098
> URL: https://issues.apache.org/jira/browse/SPARK-21098
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>Assignee: Apache Spark
>Priority: Minor
>
> In order to allow users to work with csv files with non-unix line endings, it 
> would be nice to allow users to pass their line separator (aka newline 
> character) as an option to the spark.read.csv command.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16669) Partition pruning for metastore relation size estimates for better join selection.

2017-06-14 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang closed SPARK-16669.

Resolution: Duplicate

> Partition pruning for metastore relation size estimates for better join 
> selection.
> --
>
> Key: SPARK-16669
> URL: https://issues.apache.org/jira/browse/SPARK-16669
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Parth Brahmbhatt
>
>  Currently the metastore statistics return the size of the entire table, which 
> results in the join selection strategy not using broadcast joins even when only 
> a single partition from a large table is selected. We should optimize the 
> statistics calculation at the table level to apply partition pruning and only get 
> the size of the partitions that are valid for the query.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21098) Add line separator option to csv read/write

2017-06-14 Thread Daniel van der Ende (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel van der Ende updated SPARK-21098:

Summary: Add line separator option to csv read/write  (was: Add line 
separator option to csv)

> Add line separator option to csv read/write
> ---
>
> Key: SPARK-21098
> URL: https://issues.apache.org/jira/browse/SPARK-21098
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>Priority: Minor
>
> In order to allow users to work with csv files with non-unix line endings, it 
> would be nice to allow users to pass their line separator (aka newline 
> character) as an option to the spark.read.csv command.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21098) Add line separator option to csv

2017-06-14 Thread Daniel van der Ende (JIRA)
Daniel van der Ende created SPARK-21098:
---

 Summary: Add line separator option to csv
 Key: SPARK-21098
 URL: https://issues.apache.org/jira/browse/SPARK-21098
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output, SQL
Affects Versions: 2.2.1
Reporter: Daniel van der Ende
Priority: Minor


In order to allow users to work with csv files with non-unix line endings, it 
would be nice to allow users to pass their line separator (aka newline 
character) as an option to the spark.read.csv command.
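
A sketch of what the proposed usage could look like; the "lineSep" option name and the path 
below are hypothetical illustrations of this proposal, not an existing Spark 2.2 option:

{code}
// Hypothetical usage sketch for this proposal; the "lineSep" option name is illustrative
// and does not exist in Spark 2.2.
val df = spark.read
  .option("header", "true")
  .option("lineSep", "\r\n")   // proposed line separator option
  .csv("/path/to/data.csv")
{code}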



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-06-14 Thread Dominic Ricard (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Ricard updated SPARK-21067:
---
Description: 
After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
would fail, sometimes...

Most of the time, the CTAS would work only once after starting the thrift 
server. After that, dropping the table and re-issuing the same CTAS would fail 
with the following message (sometimes it fails right away, sometimes it works for 
a long period of time):

{noformat}
Error: org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
 to destination 
hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
(state=,code=0)
{noformat}

We have already found the following JIRA 
(https://issues.apache.org/jira/browse/SPARK-11021), which states that 
{{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
ours set to "/tmp/hive-staging/\{user.name\}".

Same issue with INSERT statements:
{noformat}
CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
dricard.test SELECT 1;
Error: org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
 to destination 
hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
(state=,code=0)
{noformat}

This worked fine in 1.6.2, which we currently run in our Production Environment, 
but since 2.0+ we haven't been able to CREATE TABLE consistently on the 
cluster.

SQL to reproduce issue:
{noformat}
DROP SCHEMA IF EXISTS dricard CASCADE; 
CREATE SCHEMA dricard; 
CREATE TABLE dricard.test (col1 int); 
INSERT INTO TABLE dricard.test SELECT 1; 
SELECT * from dricard.test; 
DROP TABLE dricard.test; 
CREATE TABLE dricard.test AS select 1 as `col1`;
SELECT * from dricard.test
{noformat}

Thrift server usually fails at INSERT...

Tried the same procedure in a spark context using spark.sql() and didn't 
encounter the same issue.
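
For reference, a minimal sketch of that spark.sql() comparison, reusing the schema and table 
names from the reproduction SQL above:

{code}
// Minimal sketch of the spark.sql() comparison mentioned above, using the same
// schema/table names as the reproduction SQL.
spark.sql("DROP SCHEMA IF EXISTS dricard CASCADE")
spark.sql("CREATE SCHEMA dricard")
spark.sql("CREATE TABLE dricard.test (col1 int)")
spark.sql("INSERT INTO TABLE dricard.test SELECT 1")
spark.sql("SELECT * FROM dricard.test").show()
{code}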

Full stack Trace:
{noformat}
17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
executing query, currentState RUNNING,
org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
 to desti
nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.Dataset.(Dataset.scala:185)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:231)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:174)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
at java.security.AccessController.doPrivileged(Native Method)
at 

[jira] [Comment Edited] (SPARK-21097) Dynamic allocation will preserve cached data

2017-06-14 Thread Brad (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049485#comment-16049485
 ] 

Brad edited comment on SPARK-21097 at 6/14/17 6:29 PM:
---

Hey [~srowen], thanks for your input. I would definitely like to do some 
benchmarks and show that the recovered data leads to a performance improvement. 
I don't think it will increase complexity too much. If there are any issues 
copying the data, we can just go ahead and kill the executor and fall back to 
the current behavior. I've done some preliminary work on this, and the changes 
to existing code will be minimal and unlikely to damage existing functionality.


was (Author: bradkaiser):
Hey Sean, thanks for your input. I would definitely like to do some benchmarks 
and show that the recovered data leads to a performance improvement. I don't 
think it will increase complexity too much, If there are any issues copying the 
data, we can just go ahead and kill the executor and fall back to the current 
behavior. I've done some preliminary work on this and the changes to existing 
code will be minimal and unlikely to damage existing functionality.

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config like "spark.dynamicAllocation.recoverCachedData". Now when an 
> executor reaches its configured idle timeout, instead of just killing it on 
> the spot, we will stop sending it new tasks, replicate all of its rdd blocks 
> onto other executors, and then kill it. If there is an issue while we 
> replicate the data, like an error, it takes too long, or there isn't enough 
> space, then we will fall back to the original behavior and drop the data and 
> kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases.
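
A sketch of how the proposed opt-in might be expressed; the recoverCachedData key is the 
proposal itself and does not exist in current Spark releases:

{code}
// Sketch of the proposed opt-in. "spark.dynamicAllocation.recoverCachedData" is the key
// suggested in this proposal; it is not an existing Spark configuration option.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.recoverCachedData", "true")  // proposed
{code}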



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data

2017-06-14 Thread Brad (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049485#comment-16049485
 ] 

Brad commented on SPARK-21097:
--

Hey Sean, thanks for your input. I would definitely like to do some benchmarks 
and show that the recovered data leads to a performance improvement. I don't 
think it will increase complexity too much. If there are any issues copying the 
data, we can just go ahead and kill the executor and fall back to the current 
behavior. I've done some preliminary work on this, and the changes to existing 
code will be minimal and unlikely to damage existing functionality.

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config like "spark.dynamicAllocation.recoverCachedData". Now when an 
> executor reaches its configured idle timeout, instead of just killing it on 
> the spot, we will stop sending it new tasks, replicate all of its rdd blocks 
> onto other executors, and then kill it. If there is an issue while we 
> replicate the data, like an error, it takes too long, or there isn't enough 
> space, then we will fall back to the original behavior and drop the data and 
> kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21088) CrossValidator, TrainValidationSplit should preserve all models after fitting: Python

2017-06-14 Thread Ajay Saini (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049447#comment-16049447
 ] 

Ajay Saini commented on SPARK-21088:


I'll work on this one.

> CrossValidator, TrainValidationSplit should preserve all models after 
> fitting: Python
> -
>
> Key: SPARK-21088
> URL: https://issues.apache.org/jira/browse/SPARK-21088
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> See parent JIRA



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data

2017-06-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049442#comment-16049442
 ] 

Sean Owen commented on SPARK-21097:
---

This seems to add a fair bit of complexity when Spark is already designed to 
recover cached data if needed. It's not clear that it's better to spend the 
cycles copying this data around, delaying the removal of the executor, 
introducing new corner cases and semantics, etc. For example: what if some 
copies fail? Do you proceed? What if the target dies? 

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config like "spark.dynamicAllocation.recoverCachedData". Now when an 
> executor reaches its configured idle timeout, instead of just killing it on 
> the spot, we will stop sending it new tasks, replicate all of its rdd blocks 
> onto other executors, and then kill it. If there is an issue while we 
> replicate the data, like an error, it takes too long, or there isn't enough 
> space, then we will fall back to the original behavior and drop the data and 
> kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data

2017-06-14 Thread Brad (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049443#comment-16049443
 ] 

Brad commented on SPARK-21097:
--

I am working on this now and will be posting a more detailed design document 
shortly. I am definitely open to any collaboration or input. 

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our Spark clusters. One difficulty is that if a user has cached data, 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> Spark config like "spark.dynamicAllocation.recoverCachedData". When an 
> executor reaches its configured idle timeout, instead of killing it on the 
> spot, we will stop sending it new tasks, replicate all of its RDD blocks onto 
> other executors, and then kill it. If there is an issue while we replicate the 
> data (an error, it takes too long, or there isn't enough space), we will fall 
> back to the original behavior: drop the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it is completely opt-in, it is 
> unlikely to cause problems for other use cases. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21096) Pickle error when passing a member variable to Spark executors

2017-06-14 Thread Irina Truong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Irina Truong updated SPARK-21096:
-
Description: 
There is a pickle error when submitting a spark job that references a member 
variable in a lambda, even when the member variable is a simple type that 
should be serializable.

Here is a minimal example:

https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278

In the gist above, this method will throw an exception:

{quote}
def build_fail(self):
processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
return processed.collect()
{quote}

While this method will run just fine:

{quote}
def build_ok(self):
mult = self.multiplier
processed = self.rdd.map(lambda row: process_row(row, mult))
return processed.collect()
{quote}

In this example, {{self.multiplier}} is just an int. However, passing it into a 
lambda throws a pickle error, because it is trying to pickle the whole 
{{self}}, and that contains {{sc}}.

If this is the expected behavior, then why should re-assigning 
{{self.multiplier}} to a variable make a difference?
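
A minimal, self-contained sketch of the failure and of the local-variable 
workaround follows; the class name and values below are illustrative stand-ins 
and are not taken from the gist:

{code:python}
from pyspark import SparkContext


def process_row(row, multiplier):
    return row * multiplier


class Builder(object):
    """Illustrative stand-in for the class in the gist."""

    def __init__(self, sc, multiplier):
        self.sc = sc                  # SparkContext is not picklable
        self.multiplier = multiplier  # a plain int
        self.rdd = sc.parallelize(range(10))

    def build_fail(self):
        # The lambda refers to self.multiplier, so the closure captures the
        # whole `self` object, including self.sc, and serialization fails.
        return self.rdd.map(lambda row: process_row(row, self.multiplier)).collect()

    def build_ok(self):
        # Copying the attribute into a local first means the closure captures
        # only the int, so the task serializes fine.
        mult = self.multiplier
        return self.rdd.map(lambda row: process_row(row, mult)).collect()


if __name__ == "__main__":
    sc = SparkContext("local[2]", "pickle-sketch")
    b = Builder(sc, 3)
    print(b.build_ok())   # works
    # b.build_fail()      # raises an error about serializing the SparkContext
    sc.stop()
{code}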

  was:
There is a pickle error when submitting a spark job that references a member 
variable in a lambda, even when the member variable is a simple type that 
should be serializable.

Here is a minimal example:

https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278

In the gist above, this method will throw an exception:

bq. def build_fail(self):
bq. processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
bq. return processed.collect()

While this method will run just fine:

{quote}
def build_ok(self):
mult = self.multiplier
processed = self.rdd.map(lambda row: process_row(row, mult))
return processed.collect()
{quote}

In this example, {{self.multiplier}} is just an int. However, passing it into a 
lambda throws a pickle error, because it is trying to pickle the whole 
{{self}}, and that contains {{sc}}.

If this is the expected behavior, then why should re-assigning 
{{self.multiplier}} to a variable make a difference?


> Pickle error when passing a member variable to Spark executors
> --
>
> Key: SPARK-21096
> URL: https://issues.apache.org/jira/browse/SPARK-21096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Irina Truong
>
> There is a pickle error when submitting a spark job that references a member 
> variable in a lambda, even when the member variable is a simple type that 
> should be serializable.
> Here is a minimal example:
> https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278
> In the gist above, this method will throw an exception:
> {quote}
> def build_fail(self):
> processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
> return processed.collect()
> {quote}
> While this method will run just fine:
> {quote}
> def build_ok(self):
> mult = self.multiplier
> processed = self.rdd.map(lambda row: process_row(row, mult))
> return processed.collect()
> {quote}
> In this example, {{self.multiplier}} is just an int. However, passing it into 
> a lambda throws a pickle error, because it is trying to pickle the whole 
> {{self}}, and that contains {{sc}}.
> If this is the expected behavior, then why should re-assigning 
> {{self.multiplier}} to a variable make a difference?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21096) Pickle error when passing a member variable to Spark executors

2017-06-14 Thread Irina Truong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Irina Truong updated SPARK-21096:
-
Description: 
There is a pickle error when submitting a spark job that references a member 
variable in a lambda, even when the member variable is a simple type that 
should be serializable.

Here is a minimal example:

https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278

In the gist above, this method will throw an exception:

bq. def build_fail(self):
bq. processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
bq. return processed.collect()

While this method will run just fine:

{quote}
def build_ok(self):
mult = self.multiplier
processed = self.rdd.map(lambda row: process_row(row, mult))
return processed.collect()
{quote}

In this example, {{self.multiplier}} is just an int. However, passing it into a 
lambda throws a pickle error, because it is trying to pickle the whole 
{{self}}, and that contains {{sc}}.

If this is the expected behavior, then why should re-assigning 
{{self.multiplier}} to a variable make a difference?

  was:
There is a pickle error when submitting a spark job that references a member 
variable in a lambda, even when the member variable is a simple type that 
should be serializable.

Here is a minimal example:

https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278

In the gist above, this method will throw an exception:

{{
def build_fail(self):
processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
return processed.collect()
}}

While this method will run just fine:

{quote}
def build_ok(self):
mult = self.multiplier
processed = self.rdd.map(lambda row: process_row(row, mult))
return processed.collect()
{quote}

In this example, {{self.multiplier}} is just an int. However, passing it into a 
lambda throws a pickle error, because it is trying to pickle the whole 
{{self}}, and that contains {{sc}}.

If this is the expected behavior, then why should re-assigning 
{{self.multiplier}} to a variable make a difference?


> Pickle error when passing a member variable to Spark executors
> --
>
> Key: SPARK-21096
> URL: https://issues.apache.org/jira/browse/SPARK-21096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Irina Truong
>
> There is a pickle error when submitting a spark job that references a member 
> variable in a lambda, even when the member variable is a simple type that 
> should be serializable.
> Here is a minimal example:
> https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278
> In the gist above, this method will throw an exception:
> bq. def build_fail(self):
> bq. processed = self.rdd.map(lambda row: process_row(row, 
> self.multiplier))
> bq. return processed.collect()
> While this method will run just fine:
> {quote}
> def build_ok(self):
> mult = self.multiplier
> processed = self.rdd.map(lambda row: process_row(row, mult))
> return processed.collect()
> {quote}
> In this example, {{self.multiplier}} is just an int. However, passing it into 
> a lambda throws a pickle error, because it is trying to pickle the whole 
> {{self}}, and that contains {{sc}}.
> If this is the expected behavior, then why should re-assigning 
> {{self.multiplier}} to a variable make a difference?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21096) Pickle error when passing a member variable to Spark executors

2017-06-14 Thread Irina Truong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Irina Truong updated SPARK-21096:
-
Description: 
There is a pickle error when submitting a spark job that references a member 
variable in a lambda, even when the member variable is a simple type that 
should be serializable.

Here is a minimal example:

https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278

In the gist above, this method will throw an exception:

{{
def build_fail(self):
processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
return processed.collect()
}}

While this method will run just fine:

{quote}
def build_ok(self):
mult = self.multiplier
processed = self.rdd.map(lambda row: process_row(row, mult))
return processed.collect()
{quote}

In this example, {{self.multiplier}} is just an int. However, passing it into a 
lambda throws a pickle error, because it is trying to pickle the whole 
{{self}}, and that contains {{sc}}.

If this is the expected behavior, then why should re-assigning 
{{self.multiplier}} to a variable make a difference?

  was:
There is a pickle error when submitting a spark job that references a member 
variable in a lambda, even when the member variable is a simple type that 
should be serializable.

Here is a minimal example:

https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278

In the gist above, this method will throw an exception:

{quote}
def build_fail(self):
processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
return processed.collect()
{quote}

While this method will run just fine:

{quote}
def build_ok(self):
mult = self.multiplier
processed = self.rdd.map(lambda row: process_row(row, mult))
return processed.collect()
{quote}

In this example, {{self.multiplier}} is just an int. However, passing it into a 
lambda throws a pickle error, because it is trying to pickle the whole 
{{self}}, and that contains {{sc}}.

If this is the expected behavior, then why should re-assigning 
{{self.multiplier}} to a variable make a difference?


> Pickle error when passing a member variable to Spark executors
> --
>
> Key: SPARK-21096
> URL: https://issues.apache.org/jira/browse/SPARK-21096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Irina Truong
>
> There is a pickle error when submitting a spark job that references a member 
> variable in a lambda, even when the member variable is a simple type that 
> should be serializable.
> Here is a minimal example:
> https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278
> In the gist above, this method will throw an exception:
> {{
> def build_fail(self):
> processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
> return processed.collect()
> }}
> While this method will run just fine:
> {quote}
> def build_ok(self):
> mult = self.multiplier
> processed = self.rdd.map(lambda row: process_row(row, mult))
> return processed.collect()
> {quote}
> In this example, {{self.multiplier}} is just an int. However, passing it into 
> a lambda throws a pickle error, because it is trying to pickle the whole 
> {{self}}, and that contains {{sc}}.
> If this is the expected behavior, then why should re-assigning 
> {{self.multiplier}} to a variable make a difference?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21096) Pickle error when passing a member variable to Spark executors

2017-06-14 Thread Irina Truong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Irina Truong updated SPARK-21096:
-
Description: 
There is a pickle error when submitting a spark job that references a member 
variable in a lambda, even when the member variable is a simple type that 
should be serializable.

Here is a minimal example:

https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278

In the gist above, this method will throw an exception:

{quote}
def build_fail(self):
processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
return processed.collect()
{quote}

While this method will run just fine:

{quote}
def build_ok(self):
mult = self.multiplier
processed = self.rdd.map(lambda row: process_row(row, mult))
return processed.collect()
{quote}

In this example, {{self.multiplier}} is just an int. However, passing it into a 
lambda throws a pickle error, because it is trying to pickle the whole 
{{self}}, and that contains {{sc}}.

If this is the expected behavior, then why should re-assigning 
{{self.multiplier}} to a variable make a difference?

  was:
There is a pickle error when submitting a spark job that references a member 
variable in a lambda, even when the member variable is a simple type that 
should be serializable.

Here is a minimal example:

https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278

In the gist above, this method will throw an exception:

{quote}def build_fail(self):
processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
return processed.collect(){quote}

While this method will run just fine:

{quote}def build_ok(self):
mult = self.multiplier
processed = self.rdd.map(lambda row: process_row(row, mult))
return processed.collect()
{quote}

In this example, {{self.multiplier}} is just an int. However, passing it into a 
lambda throws a pickle error, because it is trying to pickle the whole 
{{self}}, and that contains {{sc}}.

If this is the expected behavior, then why should re-assigning 
{{self.multiplier}} to a variable make a difference?


> Pickle error when passing a member variable to Spark executors
> --
>
> Key: SPARK-21096
> URL: https://issues.apache.org/jira/browse/SPARK-21096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Irina Truong
>
> There is a pickle error when submitting a spark job that references a member 
> variable in a lambda, even when the member variable is a simple type that 
> should be serializable.
> Here is a minimal example:
> https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278
> In the gist above, this method will throw an exception:
> {quote}
> def build_fail(self):
> processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
> return processed.collect()
> {quote}
> While this method will run just fine:
> {quote}
> def build_ok(self):
> mult = self.multiplier
> processed = self.rdd.map(lambda row: process_row(row, mult))
> return processed.collect()
> {quote}
> In this example, {{self.multiplier}} is just an int. However, passing it into 
> a lambda throws a pickle error, because it is trying to pickle the whole 
> {{self}}, and that contains {{sc}}.
> If this is the expected behavior, then why should re-assigning 
> {{self.multiplier}} to a variable make a difference?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21096) Pickle error when passing a member variable to Spark executors

2017-06-14 Thread Irina Truong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Irina Truong updated SPARK-21096:
-
Description: 
There is a pickle error when submitting a spark job that references a member 
variable in a lambda, even when the member variable is a simple type that 
should be serializable.

Here is a minimal example:

https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278

In the gist above, this method will throw an exception:

{quote}def build_fail(self):
processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
return processed.collect(){quote}

While this method will run just fine:

{quote}def build_ok(self):
mult = self.multiplier
processed = self.rdd.map(lambda row: process_row(row, mult))
return processed.collect()
}}{quote}

In this example, {{self.multiplier}} is just an int. However, passing it into a 
lambda throws a pickle error, because it is trying to pickle the whole 
{{self}}, and that contains {{sc}}.

If this is the expected behavior, then why should re-assigning 
{{self.multiplier}} to a variable make a difference?

  was:
There is a pickle error when submitting a spark job that references a member 
variable in a lambda, even when the member variable is a simple type that 
should be serializable.

Here is a minimal example:

https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278

In the gist above, this method will throw an exception:

{{def build_fail(self):
processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
return processed.collect()
}}

While this method will run just fine:

{{
def build_ok(self):
mult = self.multiplier
processed = self.rdd.map(lambda row: process_row(row, mult))
return processed.collect()
}}

In this example, {{self.multiplier}} is just an int. However, passing it into a 
lambda throws a pickle error, because it is trying to pickle the whole 
{{self}}, and that contains {{sc}}.

If this is the expected behavior, then why should re-assigning 
{{self.multiplier}} to a variable make a difference?


> Pickle error when passing a member variable to Spark executors
> --
>
> Key: SPARK-21096
> URL: https://issues.apache.org/jira/browse/SPARK-21096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Irina Truong
>
> There is a pickle error when submitting a spark job that references a member 
> variable in a lambda, even when the member variable is a simple type that 
> should be serializable.
> Here is a minimal example:
> https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278
> In the gist above, this method will throw an exception:
> {quote}def build_fail(self):
> processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
> return processed.collect(){quote}
> While this method will run just fine:
> {quote}def build_ok(self):
> mult = self.multiplier
> processed = self.rdd.map(lambda row: process_row(row, mult))
> return processed.collect()
> }}{quote}
> In this example, {{self.multiplier}} is just an int. However, passing it into 
> a lambda throws a pickle error, because it is trying to pickle the whole 
> {{self}}, and that contains {{sc}}.
> If this is the expected behavior, then why should re-assigning 
> {{self.multiplier}} to a variable make a difference?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21097) Dynamic allocation will preserve cached data

2017-06-14 Thread Brad (JIRA)
Brad created SPARK-21097:


 Summary: Dynamic allocation will preserve cached data
 Key: SPARK-21097
 URL: https://issues.apache.org/jira/browse/SPARK-21097
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Scheduler, Spark Core
Affects Versions: 2.2.0, 2.3.0
Reporter: Brad


We want to use dynamic allocation to distribute resources among many notebook 
users on our Spark clusters. One difficulty is that if a user has cached data, 
then we are either prevented from de-allocating any of their executors, or we 
are forced to drop their cached data, which can lead to a bad user experience.

We propose adding a feature to preserve cached data by copying it to other 
executors before de-allocation. This behavior would be enabled by a simple 
Spark config like "spark.dynamicAllocation.recoverCachedData". When an executor 
reaches its configured idle timeout, instead of killing it on the spot, we will 
stop sending it new tasks, replicate all of its RDD blocks onto other 
executors, and then kill it. If there is an issue while we replicate the data 
(an error, it takes too long, or there isn't enough space), we will fall back 
to the original behavior: drop the data and kill the executor.

This feature should allow anyone with notebook users to use their cluster 
resources more efficiently. Also, since it is completely opt-in, it is unlikely 
to cause problems for other use cases. 
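
As a sketch of how this would surface to users, the session below turns on the 
proposed flag next to today's dynamic-allocation settings. The name 
"spark.dynamicAllocation.recoverCachedData" is the one proposed in this ticket 
and does not exist in any released Spark; the other configs are existing ones 
and assume a cluster manager with the external shuffle service:

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("notebook-session")
    # Existing dynamic-allocation settings (need the external shuffle service).
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "10min")
    # Proposed, opt-in flag from this ticket; not available in released Spark.
    .config("spark.dynamicAllocation.recoverCachedData", "true")
    .getOrCreate()
)

# With the flag on, an executor that hits its idle timeout would first have its
# cached RDD blocks replicated to surviving executors before being removed; if
# replication errors out, takes too long, or runs out of space, Spark would
# fall back to today's behavior and simply drop the blocks.
df = spark.range(10 ** 6).cache()
df.count()
{code}

Because the flag defaults to off, clusters that never set it keep exactly 
today's drop-and-kill behavior.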




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21096) Pickle error when passing a member variable to Spark executors

2017-06-14 Thread Irina Truong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Irina Truong updated SPARK-21096:
-
Description: 
There is a pickle error when submitting a spark job that references a member 
variable in a lambda, even when the member variable is a simple type that 
should be serializable.

Here is a minimal example:

https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278

In the gist above, this method will throw an exception:

{quote}def build_fail(self):
processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
return processed.collect(){quote}

While this method will run just fine:

{quote}def build_ok(self):
mult = self.multiplier
processed = self.rdd.map(lambda row: process_row(row, mult))
return processed.collect()
{quote}

In this example, {{self.multiplier}} is just an int. However, passing it into a 
lambda throws a pickle error, because it is trying to pickle the whole 
{{self}}, and that contains {{sc}}.

If this is the expected behavior, then why should re-assigning 
{{self.multiplier}} to a variable make a difference?

  was:
There is a pickle error when submitting a spark job that references a member 
variable in a lambda, even when the member variable is a simple type that 
should be serializable.

Here is a minimal example:

https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278

In the gist above, this method will throw an exception:

{quote}def build_fail(self):
processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
return processed.collect(){quote}

While this method will run just fine:

{quote}def build_ok(self):
mult = self.multiplier
processed = self.rdd.map(lambda row: process_row(row, mult))
return processed.collect()
}}{quote}

In this example, {{self.multiplier}} is just an int. However, passing it into a 
lambda throws a pickle error, because it is trying to pickle the whole 
{{self}}, and that contains {{sc}}.

If this is the expected behavior, then why should re-assigning 
{{self.multiplier}} to a variable make a difference?


> Pickle error when passing a member variable to Spark executors
> --
>
> Key: SPARK-21096
> URL: https://issues.apache.org/jira/browse/SPARK-21096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Irina Truong
>
> There is a pickle error when submitting a spark job that references a member 
> variable in a lambda, even when the member variable is a simple type that 
> should be serializable.
> Here is a minimal example:
> https://gist.github.com/j-bennet/8390c6d9a81854696f1a9b42a4ea8278
> In the gist above, this method will throw an exception:
> {quote}def build_fail(self):
> processed = self.rdd.map(lambda row: process_row(row, self.multiplier))
> return processed.collect(){quote}
> While this method will run just fine:
> {quote}def build_ok(self):
> mult = self.multiplier
> processed = self.rdd.map(lambda row: process_row(row, mult))
> return processed.collect()
> {quote}
> In this example, {{self.multiplier}} is just an int. However, passing it into 
> a lambda throws a pickle error, because it is trying to pickle the whole 
> {{self}}, and that contains {{sc}}.
> If this is the expected behavior, then why should re-assigning 
> {{self.multiplier}} to a variable make a difference?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


