[jira] [Assigned] (SPARK-15888) Python UDF over aggregate fails

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15888:


Assignee: Apache Spark  (was: Davies Liu)

> Python UDF over aggregate fails
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>Assignee: Apache Spark
>Priority: Blocker
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html
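The linked notebook is not reproduced in this digest. As a rough sketch of the failing pattern (a Python UDF applied to the output of an aggregation), where the data, column names, and UDF body are illustrative assumptions rather than the notebook's actual code:

{code}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as sum_, udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("a", 1.0), ("a", 2.0), ("b", 3.0)], ["key", "value"])

# A Python UDF evaluated over a column produced by an aggregation.
double_it = udf(lambda x: x * 2.0, DoubleType())

(df.groupBy("key")
   .agg(sum_("value").alias("total"))
   .select(col("key"), double_it(col("total")).alias("doubled"))
   .show())
{code}

According to the report, this kind of query runs on 1.6.1 but fails on the 2.0.0 preview.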






[jira] [Commented] (SPARK-15888) Python UDF over aggregate fails

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331266#comment-15331266
 ] 

Apache Spark commented on SPARK-15888:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/13682

> Python UDF over aggregate fails
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>Assignee: Davies Liu
>Priority: Blocker
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html






[jira] [Assigned] (SPARK-15888) Python UDF over aggregate fails

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15888:


Assignee: Davies Liu  (was: Apache Spark)

> Python UDF over aggregate fails
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>Assignee: Davies Liu
>Priority: Blocker
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html






[jira] [Comment Edited] (SPARK-15815) Hang while enable blacklistExecutor and DynamicExecutorAllocator

2016-06-15 Thread SuYan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331285#comment-15331285
 ] 

SuYan edited comment on SPARK-15815 at 6/15/16 7:17 AM:


I see... Although it can solve the hang problem, for dynamic allocation that 
solution seems a bit crude, because Spark has the ability to get other executors 
and let the task either fail 4 times or finally succeed.


was (Author: suyan):
I see... although it can solve the gang problem, but for Dynamic Allocate, that 
solution seems a little rude, because spark have the ability to got another 
executors to complete 4 times failure or success finally

> Hang while enable blacklistExecutor and DynamicExecutorAllocator 
> -
>
> Key: SPARK-15815
> URL: https://issues.apache.org/jira/browse/SPARK-15815
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.6.1
>Reporter: SuYan
>Priority: Minor
>
> Enable executor blacklisting with a blacklist time larger than 120s, and enable 
> dynamic allocation with minExecutors = 0.
> 1. Assume only 1 task is left running, on Executor A, and all other executors 
> have timed out.
> 2. The task fails, so it will not be scheduled on Executor A again because the 
> blacklist time is in effect.
> 3. ExecutorAllocationManager always requests targetNumExecutors = 1; because we 
> already have Executor A, oldTargetNumExecutors == targetNumExecutors = 1, so it 
> never adds more executors... even after Executor A has timed out. It becomes an 
> endless request for delta = 0 executors.
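For illustration, a minimal PySpark configuration matching the scenario above. This is a sketch only; the 1.6.x per-task blacklist property name and the concrete timeout value are assumptions, not something taken from this report:

{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("blacklist-plus-dynamic-allocation")
        # dynamic allocation with no lower bound on executors
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "0")
        .set("spark.shuffle.service.enabled", "true")
        # assumed 1.6.x per-executor task blacklist timeout, set larger than 120s
        .set("spark.scheduler.executorTaskBlacklistTime", "180000"))

sc = SparkContext(conf=conf)
{code}

With this combination, once the last live executor blacklists the only remaining task, the allocation manager keeps requesting a delta of 0 executors and the job hangs as described.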






[jira] [Commented] (SPARK-15815) Hang while enable blacklistExecutor and DynamicExecutorAllocator

2016-06-15 Thread SuYan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331285#comment-15331285
 ] 

SuYan commented on SPARK-15815:
---

I see... Although it can solve the hang problem, for dynamic allocation that 
solution seems a bit crude, because Spark has the ability to get other executors 
and let the task either fail 4 times or finally succeed.

> Hang while enable blacklistExecutor and DynamicExecutorAllocator 
> -
>
> Key: SPARK-15815
> URL: https://issues.apache.org/jira/browse/SPARK-15815
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.6.1
>Reporter: SuYan
>Priority: Minor
>
> Enable executor blacklisting with a blacklist time larger than 120s, and enable 
> dynamic allocation with minExecutors = 0.
> 1. Assume only 1 task is left running, on Executor A, and all other executors 
> have timed out.
> 2. The task fails, so it will not be scheduled on Executor A again because the 
> blacklist time is in effect.
> 3. ExecutorAllocationManager always requests targetNumExecutors = 1; because we 
> already have Executor A, oldTargetNumExecutors == targetNumExecutors = 1, so it 
> never adds more executors... even after Executor A has timed out. It becomes an 
> endless request for delta = 0 executors.






[jira] [Commented] (SPARK-15922) BlockMatrix to IndexedRowMatrix throws an error

2016-06-15 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331294#comment-15331294
 ] 

Dongjoon Hyun commented on SPARK-15922:
---

Hi, [~mengxr].
Could you review the PR on this bug?

> BlockMatrix to IndexedRowMatrix throws an error
> ---
>
> Key: SPARK-15922
> URL: https://issues.apache.org/jira/browse/SPARK-15922
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Charlie Evans
>
> {code}
> import org.apache.spark.mllib.linalg.distributed._
> import org.apache.spark.mllib.linalg._
> val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, new DenseVector(Array(1,2,3))) :: IndexedRow(2L, new DenseVector(Array(1,2,3))) :: Nil
> val rdd = sc.parallelize(rows)
> val matrix = new IndexedRowMatrix(rdd, 3, 3)
> val bmat = matrix.toBlockMatrix
> val imat = bmat.toIndexedRowMatrix
> imat.rows.collect // this throws an error - Caused by: java.lang.IllegalArgumentException: requirement failed: Vectors must be the same length!
> {code}






[jira] [Updated] (SPARK-15888) Python UDF over aggregate fails

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15888:

Affects Version/s: (was: 2.0.0)

> Python UDF over aggregate fails
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Vladimir Feinberg
>Assignee: Davies Liu
>Priority: Blocker
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html






[jira] [Commented] (SPARK-1004) PySpark on YARN

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331314#comment-15331314
 ] 

Apache Spark commented on SPARK-1004:
-

User 'sryza' has created a pull request for this issue:
https://github.com/apache/spark/pull/30

> PySpark on YARN
> ---
>
> Key: SPARK-1004
> URL: https://issues.apache.org/jira/browse/SPARK-1004
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: Josh Rosen
>Assignee: Sandy Ryza
>Priority: Blocker
> Fix For: 1.0.0
>
>
> This is for tracking progress on supporting YARN in PySpark.
> We might be able to use {{yarn-client}} mode 
> (https://spark.incubator.apache.org/docs/latest/running-on-yarn.html#launch-spark-application-with-yarn-client-mode).






[jira] [Commented] (SPARK-15955) Failed Spark application returns with exitcode equals to zero

2016-06-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331325#comment-15331325
 ] 

Sean Owen commented on SPARK-15955:
---

Which process has status 0, and can you try vs master?

> Failed Spark application returns with exitcode equals to zero
> -
>
> Key: SPARK-15955
> URL: https://issues.apache.org/jira/browse/SPARK-15955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>
> Scenario:
> * Set up cluster with wire-encryption enabled.
> * Set 'spark.authenticate.enableSaslEncryption' = 'false' and 
> 'spark.shuffle.service.enabled' = 'true'.
> * Run the SparkPi application.
> {code}
> client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
> diagnostics: Max number of executor failures (3) reached
> ApplicationMaster host: xx.xx.xx.xxx
> ApplicationMaster RPC port: 0
> queue: default
> start time: 1465941051976
> final status: FAILED
> tracking URL: https://xx.xx.xx.xxx:8090/proxy/application_1465925772890_0016/
> user: hrt_qa
> Exception in thread "main" org.apache.spark.SparkException: Application 
> application_1465925772890_0016 finished with failed status
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1092)
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1139)
> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> INFO ShutdownHookManager: Shutdown hook called{code}
> This Spark application exits with exit code 0. A failed application should not 
> return exit code 0.
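For context, the exit code in question is the one the spark-submit process returns to its caller. A minimal way to observe it (the spark-submit arguments are illustrative, not the reporter's actual command):

{code}
import subprocess

# Illustrative invocation; class name, jar path, and master are assumptions.
result = subprocess.run([
    "spark-submit",
    "--master", "yarn",
    "--class", "org.apache.spark.examples.SparkPi",
    "spark-examples.jar",
    "100",
])

# The report is that this prints 0 even though the application finished with FAILED status.
print("spark-submit exit code:", result.returncode)
{code}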






[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String

2016-06-15 Thread Pete Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331371#comment-15331371
 ] 

Pete Robbins commented on SPARK-15822:
--

The generated code is:

{code}
Top Arrival Carrier Cancellations:
Found 5 WholeStageCodegen subtrees.
== Subtree 1 / 5 ==
*HashAggregate(key=[Origin#16,UniqueCarrier#8], functions=[partial_count(1)], 
output=[Origin#16,UniqueCarrier#8,count#296L])
+- *Project [UniqueCarrier#8, Origin#16]
   +- *Filter (((isnotnull(Origin#16) && isnotnull(UniqueCarrier#8)) && 
isnotnull(Cancelled#21)) && isnotnull(CancellationCode#22)) && NOT 
(Cancelled#21 = 0)) && (CancellationCode#22 = A)) && isnotnull(Dest#17)) && 
(Dest#17 = ORD))
  +- *Scan csv 
[UniqueCarrier#8,Origin#16,Dest#17,Cancelled#21,CancellationCode#22] Format: 
CSV, InputPaths: file:/home/robbins/brandberry/2008.csv, PushedFilters: 
[IsNotNull(Origin), IsNotNull(UniqueCarrier), IsNotNull(Cancelled), 
IsNotNull(CancellationCode), ..., ReadSchema: 
struct

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ final class GeneratedIterator extends 
org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
/* 007 */   private boolean agg_initAgg;
/* 008 */   private boolean agg_bufIsNull;
/* 009 */   private long agg_bufValue;
/* 010 */   private agg_VectorizedHashMap agg_vectorizedHashMap;
/* 011 */   private 
java.util.Iterator 
agg_vectorizedHashMapIter;
/* 012 */   private org.apache.spark.sql.execution.aggregate.HashAggregateExec 
agg_plan;
/* 013 */   private 
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap agg_hashMap;
/* 014 */   private org.apache.spark.sql.execution.UnsafeKVExternalSorter 
agg_sorter;
/* 015 */   private org.apache.spark.unsafe.KVIterator agg_mapIter;
/* 016 */   private org.apache.spark.sql.execution.metric.SQLMetric 
agg_peakMemory;
/* 017 */   private org.apache.spark.sql.execution.metric.SQLMetric 
agg_spillSize;
/* 018 */   private org.apache.spark.sql.execution.metric.SQLMetric 
scan_numOutputRows;
/* 019 */   private scala.collection.Iterator scan_input;
/* 020 */   private org.apache.spark.sql.execution.metric.SQLMetric 
filter_numOutputRows;
/* 021 */   private UnsafeRow filter_result;
/* 022 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder filter_holder;
/* 023 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
filter_rowWriter;
/* 024 */   private UnsafeRow project_result;
/* 025 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder;
/* 026 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
project_rowWriter;
/* 027 */   private UnsafeRow agg_result2;
/* 028 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder;
/* 029 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter;
/* 030 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowJoiner 
agg_unsafeRowJoiner;
/* 031 */   private org.apache.spark.sql.execution.metric.SQLMetric 
wholestagecodegen_numOutputRows;
/* 032 */   private org.apache.spark.sql.execution.metric.SQLMetric 
wholestagecodegen_aggTime;
/* 033 */   private UnsafeRow wholestagecodegen_result;
/* 034 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder 
wholestagecodegen_holder;
/* 035 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
wholestagecodegen_rowWriter;
/* 036 */
/* 037 */   public GeneratedIterator(Object[] references) {
/* 038 */ this.references = references;
/* 039 */   }
/* 040 */
/* 041 */   public void init(int index, scala.collection.Iterator inputs[]) {
/* 042 */ partitionIndex = index;
/* 043 */ agg_initAgg = false;
/* 044 */
/* 045 */ agg_vectorizedHashMap = new agg_VectorizedHashMap();
/* 046 */
/* 047 */ this.agg_plan = 
(org.apache.spark.sql.execution.aggregate.HashAggregateExec) references[0];
/* 048 */
/* 049 */ this.agg_peakMemory = 
(org.apache.spark.sql.execution.metric.SQLMetric) references[1];
/* 050 */ this.agg_spillSize = 
(org.apache.spark.sql.execution.metric.SQLMetric) references[2];
/* 051 */ this.scan_numOutputRows = 
(org.apache.spark.sql.execution.metric.SQLMetric) references[3];
/* 052 */ scan_input = inputs[0];
/* 053 */ this.filter_numOutputRows = 
(org.apache.spark.sql.execution.metric.SQLMetric) references[4];
/* 054 */ filter_result = new UnsafeRow(5);
/* 055 */ this.filter_holder = new 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(filter_result, 
128);
/* 056 */ this.filter_rowWriter = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(filter_holder,
 5);
/* 057 */ project_result = new Unsafe

[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String

2016-06-15 Thread Pete Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331370#comment-15331370
 ] 

Pete Robbins commented on SPARK-15822:
--

Chatting with [~hvanhovell], here is the current state: I can reproduce a segv 
using local[8] on an 8-core machine. It is intermittent, but many, many runs 
with e.g. local[2] produce no issues. The segv info is:

{noformat}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7fe8c118ca58, pid=3558, tid=140633451779840
#
# JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
compressed oops)
# Problematic frame:
# J 7467 C1 org.apache.spark.unsafe.Platform.getByte(Ljava/lang/Object;J)B (9 
bytes) @ 0x7fe8c118ca58 [0x7fe8c118ca20+0x38]
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

---  T H R E A D  ---

Current thread (0x7fe858018800):  JavaThread "Executor task launch 
worker-3" daemon [_thread_in_Java, id=3698, 
stack(0x7fe7c6dfd000,0x7fe7c6efe000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 
0x00a09cf4

Registers:
RAX=0x7fe884ce5828, RBX=0x7fe884ce5828, RCX=0x7fe81e0a5360, 
RDX=0x00a09cf4
RSP=0x7fe7c6efb9e0, RBP=0x7fe7c6efba80, RSI=0x, 
RDI=0x3848
R8 =0x200b94c8, R9 =0xeef66bf0, R10=0x7fe8d87a2f00, 
R11=0x7fe8c118ca20
R12=0x, R13=0x7fe7c6efba28, R14=0x7fe7c6efba98, 
R15=0x7fe858018800
RIP=0x7fe8c118ca58, EFLAGS=0x00010206, CSGSFS=0x0033, 
ERR=0x0004
  TRAPNO=0x000e

Top of Stack: (sp=0x7fe7c6efb9e0)
0x7fe7c6efb9e0:   7fe7c56941e8 
0x7fe7c6efb9f0:   7fe7c6efbab0 7fe8c140c38c
0x7fe7c6efba00:   7fe8c1007d80 eef66bc8
0x7fe7c6efba10:   7fe7c6efba80 7fe8c1007700
0x7fe7c6efba20:   7fe8c1007700 00a09cf4
0x7fe7c6efba30:   0030 
0x7fe7c6efba40:   7fe7c6efba40 7fe81e0a1f9b
0x7fe7c6efba50:   7fe7c6efba98 7fe81e0a5360
0x7fe7c6efba60:    7fe81e0a1fc0
0x7fe7c6efba70:   7fe7c6efba28 7fe7c6efba90
0x7fe7c6efba80:   7fe7c6efbae8 7fe8c1007700
0x7fe7c6efba90:    ee4f4898
0x7fe7c6efbaa0:   004d 7fe7c6efbaa8
0x7fe7c6efbab0:   7fe81e0a42be 7fe7c6efbb18
0x7fe7c6efbac0:   7fe81e0a5360 
0x7fe7c6efbad0:   7fe81e0a4338 7fe7c6efba90
0x7fe7c6efbae0:   7fe7c6efbb10 7fe7c6efbb60
0x7fe7c6efbaf0:   7fe8c1007a40 
0x7fe7c6efbb00:    0003
0x7fe7c6efbb10:   ee4f4898 eef67950
0x7fe7c6efbb20:   7fe7c6efbb20 7fe81e0a43f2
0x7fe7c6efbb30:   7fe7c6efbb78 7fe81e0a5360
0x7fe7c6efbb40:    7fe81e0a4418
0x7fe7c6efbb50:   7fe7c6efbb10 7fe7c6efbb70
0x7fe7c6efbb60:   7fe7c6efbbc0 7fe8c1007a40
0x7fe7c6efbb70:   ee4f4898 eef67950
0x7fe7c6efbb80:   7fe7c6efbb80 7fe7c56844e5
0x7fe7c6efbb90:   7fe7c6efbc28 7fe7c5684950
0x7fe7c6efbba0:    7fe7c5684618
0x7fe7c6efbbb0:   7fe7c6efbb70 7fe7c6efbc18
0x7fe7c6efbbc0:   7fe7c6efbc70 7fe8c10077d0
0x7fe7c6efbbd0:     

Instructions: (pc=0x7fe8c118ca58)
0x7fe8c118ca38:   08 83 c7 08 89 78 08 48 b8 28 58 ce 84 e8 7f 00
0x7fe8c118ca48:   00 81 e7 f8 3f 00 00 83 ff 00 0f 84 16 00 00 00
0x7fe8c118ca58:   0f be 04 16 c1 e0 18 c1 f8 18 48 83 c4 30 5d 85
0x7fe8c118ca68:   05 93 c6 85 17 c3 48 89 44 24 08 48 c7 04 24 ff 

Register to memory mapping:

RAX={method} {0x7fe884ce5828} 'getByte' '(Ljava/lang/Object;J)B' in 
'org/apache/spark/unsafe/Platform'
RBX={method} {0x7fe884ce5828} 'getByte' '(Ljava/lang/Object;J)B' in 
'org/apache/spark/unsafe/Platform'
RCX=0x7fe81e0a5360 is pointing into metadata
RDX=0x00a09cf4 is an unknown value
RSP=0x7fe7c6efb9e0 is pointing into the stack for thread: 0x7fe858018800
RBP=0x7fe7c6efba80 is pointing into the stack for thread: 0x7fe858018800
RSI=0x is an unknown value
RDI=0x3848 is an unknown value
R8 =0x200b94c8 is an unknown value
R9 =0xeef66bf0 is an oop
[B 
 - klass: {type array byte}
 - length: 48
R10=0x7fe8d87a2f00:  in 
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.91-0.b14.el6_7.x86_64/jre/lib/amd64/server/libjvm.so

[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String

2016-06-15 Thread Pete Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331375#comment-15331375
 ] 

Pete Robbins commented on SPARK-15822:
--

and the plan:

{noformat}
== Parsed Logical Plan ==
'Project [unresolvedalias('Origin, None), unresolvedalias('UniqueCarrier, 
None), 'round((('count * 100) / 'total), 2) AS rank#173]
+- Project [Origin#16, UniqueCarrier#8, count#134L, total#97L]
   +- Join Inner, ((Origin#16 = Origin#155) && (UniqueCarrier#8 = 
UniqueCarrier#147))
  :- Aggregate [Origin#16, UniqueCarrier#8], [Origin#16, UniqueCarrier#8, 
count(1) AS count#134L]
  :  +- Filter (NOT (Cancelled#21 = 0) && (CancellationCode#22 = A))
  : +- Filter (Dest#17 = ORD)
  :+- 
Relation[Year#0,Month#1,DayofMonth#2,DayOfWeek#3,DepTime#4,CRSDepTime#5,ArrTime#6,CRSArrTime#7,UniqueCarrier#8,FlightNum#9,TailNum#10,ActualElapsedTime#11,CRSElapsedTime#12,AirTime#13,ArrDelay#14,DepDelay#15,Origin#16,Dest#17,Distance#18,TaxiIn#19,TaxiOut#20,Cancelled#21,CancellationCode#22,Diverted#23,...
 5 more fields] csv
  +- Project [Origin#155, UniqueCarrier#147, count#92L AS total#97L]
 +- Aggregate [Origin#155, UniqueCarrier#147], [Origin#155, 
UniqueCarrier#147, count(1) AS count#92L]
+- Filter (Dest#156 = ORD)
   +- 
Relation[Year#139,Month#140,DayofMonth#141,DayOfWeek#142,DepTime#143,CRSDepTime#144,ArrTime#145,CRSArrTime#146,UniqueCarrier#147,FlightNum#148,TailNum#149,ActualElapsedTime#150,CRSElapsedTime#151,AirTime#152,ArrDelay#153,DepDelay#154,Origin#155,Dest#156,Distance#157,TaxiIn#158,TaxiOut#159,Cancelled#160,CancellationCode#161,Diverted#162,...
 5 more fields] csv

== Analyzed Logical Plan ==
Origin: string, UniqueCarrier: string, rank: double
Project [Origin#16, UniqueCarrier#8, round((cast((count#134L * cast(100 as 
bigint)) as double) / cast(total#97L as double)), 2) AS rank#173]
+- Project [Origin#16, UniqueCarrier#8, count#134L, total#97L]
   +- Join Inner, ((Origin#16 = Origin#155) && (UniqueCarrier#8 = 
UniqueCarrier#147))
  :- Aggregate [Origin#16, UniqueCarrier#8], [Origin#16, UniqueCarrier#8, 
count(1) AS count#134L]
  :  +- Filter (NOT (Cancelled#21 = 0) && (CancellationCode#22 = A))
  : +- Filter (Dest#17 = ORD)
  :+- 
Relation[Year#0,Month#1,DayofMonth#2,DayOfWeek#3,DepTime#4,CRSDepTime#5,ArrTime#6,CRSArrTime#7,UniqueCarrier#8,FlightNum#9,TailNum#10,ActualElapsedTime#11,CRSElapsedTime#12,AirTime#13,ArrDelay#14,DepDelay#15,Origin#16,Dest#17,Distance#18,TaxiIn#19,TaxiOut#20,Cancelled#21,CancellationCode#22,Diverted#23,...
 5 more fields] csv
  +- Project [Origin#155, UniqueCarrier#147, count#92L AS total#97L]
 +- Aggregate [Origin#155, UniqueCarrier#147], [Origin#155, 
UniqueCarrier#147, count(1) AS count#92L]
+- Filter (Dest#156 = ORD)
   +- 
Relation[Year#139,Month#140,DayofMonth#141,DayOfWeek#142,DepTime#143,CRSDepTime#144,ArrTime#145,CRSArrTime#146,UniqueCarrier#147,FlightNum#148,TailNum#149,ActualElapsedTime#150,CRSElapsedTime#151,AirTime#152,ArrDelay#153,DepDelay#154,Origin#155,Dest#156,Distance#157,TaxiIn#158,TaxiOut#159,Cancelled#160,CancellationCode#161,Diverted#162,...
 5 more fields] csv

== Optimized Logical Plan ==
Project [Origin#16, UniqueCarrier#8, round((cast((count#134L * 100) as double) 
/ cast(total#97L as double)), 2) AS rank#173]
+- Join Inner, ((Origin#16 = Origin#155) && (UniqueCarrier#8 = 
UniqueCarrier#147))
   :- Aggregate [Origin#16, UniqueCarrier#8], [Origin#16, UniqueCarrier#8, 
count(1) AS count#134L]
   :  +- Project [UniqueCarrier#8, Origin#16]
   : +- Filter (((isnotnull(Origin#16) && isnotnull(UniqueCarrier#8)) 
&& isnotnull(Cancelled#21)) && isnotnull(CancellationCode#22)) && NOT 
(Cancelled#21 = 0)) && (CancellationCode#22 = A)) && isnotnull(Dest#17)) && 
(Dest#17 = ORD))
   :+- 
Relation[Year#0,Month#1,DayofMonth#2,DayOfWeek#3,DepTime#4,CRSDepTime#5,ArrTime#6,CRSArrTime#7,UniqueCarrier#8,FlightNum#9,TailNum#10,ActualElapsedTime#11,CRSElapsedTime#12,AirTime#13,ArrDelay#14,DepDelay#15,Origin#16,Dest#17,Distance#18,TaxiIn#19,TaxiOut#20,Cancelled#21,CancellationCode#22,Diverted#23,...
 5 more fields] csv
   +- Aggregate [Origin#155, UniqueCarrier#147], [Origin#155, 
UniqueCarrier#147, count(1) AS total#97L]
  +- Project [UniqueCarrier#147, Origin#155]
 +- Filter (((isnotnull(UniqueCarrier#147) && isnotnull(Origin#155)) && 
isnotnull(Dest#156)) && (Dest#156 = ORD))
+- 
Relation[Year#139,Month#140,DayofMonth#141,DayOfWeek#142,DepTime#143,CRSDepTime#144,ArrTime#145,CRSArrTime#146,UniqueCarrier#147,FlightNum#148,TailNum#149,ActualElapsedTime#150,CRSElapsedTime#151,AirTime#152,ArrDelay#153,DepDelay#154,Origin#155,Dest#156,Distance#157,TaxiIn#158,TaxiOut#159,Cancelled#160,CancellationCode#161,Diverted#162,...
 5 more fields] csv

== Physical Plan ==
*Project [Origin#16, UniqueCarrier#8, round((cast((count#

[jira] [Updated] (SPARK-15951) Change Executors Page to use datatables to support sorting columns and searching

2016-06-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15951:
--
 Priority: Minor  (was: Major)
Fix Version/s: (was: 2.1.0)
  Component/s: (was: Spark Core)
   Web UI

> Change Executors Page to use datatables to support sorting columns and 
> searching
> 
>
> Key: SPARK-15951
> URL: https://issues.apache.org/jira/browse/SPARK-15951
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Kishor Patil
>Priority: Minor
>
> Support column sorting and searching on the Executors page using jQuery DataTables 
> and the REST API. Before this change, the executors page was generated as hard-coded 
> HTML and could not support search; sorting was also disabled if any application had 
> more than one attempt. Supporting search and sort (over all applications rather than 
> just the 20 entries shown on the current page) will in any case greatly improve the 
> user experience.






[jira] [Updated] (SPARK-15631) Dataset and encoder bug fixes

2016-06-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15631:
--
Assignee: Wenchen Fan

> Dataset and encoder bug fixes
> -
>
> Key: SPARK-15631
> URL: https://issues.apache.org/jira/browse/SPARK-15631
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> This is an umbrella ticket for various Dataset and encoder bug fixes.






[jira] [Updated] (SPARK-15065) HiveSparkSubmitSuite's "set spark.sql.warehouse.dir" is flaky

2016-06-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15065:
--
Assignee: Pete Robbins

> HiveSparkSubmitSuite's "set spark.sql.warehouse.dir" is flaky
> -
>
> Key: SPARK-15065
> URL: https://issues.apache.org/jira/browse/SPARK-15065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Yin Huai
>Assignee: Pete Robbins
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: log.txt
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/861/testReport/junit/org.apache.spark.sql.hive/HiveSparkSubmitSuite/dir/
> There are several WARN messages like {{16/05/02 00:51:06 WARN Master: Got 
> status update for unknown executor app-20160502005054-/3}}, which are 
> suspicious. 






[jira] [Commented] (SPARK-15908) Add varargs-type dropDuplicates() function in SparkR

2016-06-15 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331439#comment-15331439
 ] 

Dongjoon Hyun commented on SPARK-15908:
---

I'll work on this issue~.

> Add varargs-type dropDuplicates() function in SparkR
> 
>
> Key: SPARK-15908
> URL: https://issues.apache.org/jira/browse/SPARK-15908
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>
> This is for API parity of Scala API. Refer to 
> https://issues.apache.org/jira/browse/SPARK-15807






[jira] [Commented] (SPARK-15518) Rename various scheduler backend for consistency

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331455#comment-15331455
 ] 

Apache Spark commented on SPARK-15518:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13683

> Rename various scheduler backend for consistency
> 
>
> Key: SPARK-15518
> URL: https://issues.apache.org/jira/browse/SPARK-15518
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> Various scheduler backends are not named consistently, making it difficult to 
> understand what they do based on the names. It would be great to rename some 
> of them:
> - LocalScheduler -> LocalSchedulerBackend
> - AppClient -> StandaloneAppClient
> - AppClientListener -> StandaloneAppClientListener
> - SparkDeploySchedulerBackend -> StandaloneSchedulerBackend
> - CoarseMesosSchedulerBackend -> MesosCoarseGrainedSchedulerBackend
> - MesosSchedulerBackend -> MesosFineGrainedSchedulerBackend






[jira] [Commented] (SPARK-15908) Add varargs-type dropDuplicates() function in SparkR

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331488#comment-15331488
 ] 

Apache Spark commented on SPARK-15908:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/13684

> Add varargs-type dropDuplicates() function in SparkR
> 
>
> Key: SPARK-15908
> URL: https://issues.apache.org/jira/browse/SPARK-15908
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>
> This is for API parity of Scala API. Refer to 
> https://issues.apache.org/jira/browse/SPARK-15807






[jira] [Assigned] (SPARK-15908) Add varargs-type dropDuplicates() function in SparkR

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15908:


Assignee: Apache Spark

> Add varargs-type dropDuplicates() function in SparkR
> 
>
> Key: SPARK-15908
> URL: https://issues.apache.org/jira/browse/SPARK-15908
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>Assignee: Apache Spark
>
> This is for API parity of Scala API. Refer to 
> https://issues.apache.org/jira/browse/SPARK-15807






[jira] [Assigned] (SPARK-15908) Add varargs-type dropDuplicates() function in SparkR

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15908:


Assignee: (was: Apache Spark)

> Add varargs-type dropDuplicates() function in SparkR
> 
>
> Key: SPARK-15908
> URL: https://issues.apache.org/jira/browse/SPARK-15908
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>
> This is for API parity of Scala API. Refer to 
> https://issues.apache.org/jira/browse/SPARK-15807






[jira] [Comment Edited] (SPARK-15908) Add varargs-type dropDuplicates() function in SparkR

2016-06-15 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331439#comment-15331439
 ] 

Dongjoon Hyun edited comment on SPARK-15908 at 6/15/16 10:16 AM:
-

I'll work on this issue~.


was (Author: dongjoon):
I'll working on this issue~.

> Add varargs-type dropDuplicates() function in SparkR
> 
>
> Key: SPARK-15908
> URL: https://issues.apache.org/jira/browse/SPARK-15908
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>
> This is for API parity of Scala API. Refer to 
> https://issues.apache.org/jira/browse/SPARK-15807






[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String

2016-06-15 Thread Pete Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331580#comment-15331580
 ] 

Pete Robbins commented on SPARK-15822:
--

I can also recreate this issue on Oracle JDK 1.8:

{noformat}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7f0c65d06aec, pid=7521, tid=0x7f0b69ffd700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build 1.8.0_92-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode linux-amd64 
compressed oops)
# Problematic frame:
# J 7453 C1 org.apache.spark.unsafe.Platform.getByte(Ljava/lang/Object;J)B (9 
bytes) @ 0x7f0c65d06aec [0x7f0c65d06ae0+0xc]
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

---  T H R E A D  ---

Current thread (0x7f0bf4008800):  JavaThread "Executor task launch 
worker-3" daemon [_thread_in_Java, id=7662, 
stack(0x7f0b69efd000,0x7f0b69ffe000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 
0x02868e54

Registers:
RAX=0x7f0c461abb38, RBX=0x7f0c461abb38, RCX=0x7f0c213547c8, 
RDX=0x02868e54
RSP=0x7f0b69ffba40, RBP=0x7f0b69ffbae0, RSI=0x, 
RDI=0x0001008254d8
R8 =0x200bd0a6, R9 =0xd9fa2650, R10=0x7f0c79d39020, 
R11=0x7f0c65d06ae0
R12=0x, R13=0x7f0b69ffba88, R14=0x7f0b69ffbaf8, 
R15=0x7f0bf4008800
RIP=0x7f0c65d06aec, EFLAGS=0x00010202, CSGSFS=0x0033, 
ERR=0x0004
  TRAPNO=0x000e

Top of Stack: (sp=0x7f0b69ffba40)
0x7f0b69ffba40:   7f0b684b4a70 
0x7f0b69ffba50:   7f0b69ffbb10 7f0c65e96d4c
0x7f0b69ffba60:   7f0c65008040 d9fa2628
0x7f0b69ffba70:   7f0b69ffbae0 7f0c650079c0
0x7f0b69ffba80:   7f0c650079c0 02868e54
0x7f0b69ffba90:   0030 
0x7f0b69ffbaa0:   7f0b69ffbaa0 7f0c21351403
0x7f0b69ffbab0:   7f0b69ffbaf8 7f0c213547c8
0x7f0b69ffbac0:    7f0c21351428
0x7f0b69ffbad0:   7f0b69ffba88 7f0b69ffbaf0
0x7f0b69ffbae0:   7f0b69ffbb48 7f0c650079c0
0x7f0b69ffbaf0:    d9f57cf0
0x7f0b69ffbb00:   004c 7f0b69ffbb08
0x7f0b69ffbb10:   7f0c21353726 7f0b69ffbb78
0x7f0b69ffbb20:   7f0c213547c8 
0x7f0b69ffbb30:   7f0c213537a0 7f0b69ffbaf0
0x7f0b69ffbb40:   7f0b69ffbb70 7f0b69ffbbc0
0x7f0b69ffbb50:   7f0c65007d00 
0x7f0b69ffbb60:    0003
0x7f0b69ffbb70:   d9f57cf0 d9fa33b0
0x7f0b69ffbb80:   7f0b69ffbb80 7f0c2135385a
0x7f0b69ffbb90:   7f0b69ffbbd8 7f0c213547c8
0x7f0b69ffbba0:    7f0c21353880
0x7f0b69ffbbb0:   7f0b69ffbb70 7f0b69ffbbd0
0x7f0b69ffbbc0:   7f0b69ffbc20 7f0c65007d00
0x7f0b69ffbbd0:   d9f57cf0 d9fa33b0
0x7f0b69ffbbe0:   7f0b69ffbbe0 7f0b684a24e5
0x7f0b69ffbbf0:   7f0b69ffbc88 7f0b684a2950
0x7f0b69ffbc00:    7f0b684a2618
0x7f0b69ffbc10:   7f0b69ffbbd0 7f0b69ffbc78
0x7f0b69ffbc20:   7f0b69ffbcd0 7f0c65007a90
0x7f0b69ffbc30:     

Instructions: (pc=0x7f0c65d06aec)
0x7f0c65d06acc:   0a 80 11 64 01 f8 12 fe 06 90 0c 64 01 f8 12 fe
0x7f0c65d06adc:   06 90 0c 64 89 84 24 00 c0 fe ff 55 48 83 ec 30
0x7f0c65d06aec:   0f be 04 16 c1 e0 18 c1 f8 18 48 83 c4 30 5d 85
0x7f0c65d06afc:   05 ff f5 28 14 c3 90 90 49 8b 87 a8 02 00 00 49 

Register to memory mapping:

RAX={method} {0x7f0c461abb38} 'getByte' '(Ljava/lang/Object;J)B' in 
'org/apache/spark/unsafe/Platform'
RBX={method} {0x7f0c461abb38} 'getByte' '(Ljava/lang/Object;J)B' in 
'org/apache/spark/unsafe/Platform'
RCX=0x7f0c213547c8 is pointing into metadata
RDX=0x02868e54 is an unknown value
RSP=0x7f0b69ffba40 is pointing into the stack for thread: 0x7f0bf4008800
RBP=0x7f0b69ffbae0 is pointing into the stack for thread: 0x7f0bf4008800
RSI=0x is an unknown value
RDI=0x0001008254d8 is pointing into metadata
R8 =0x200bd0a6 is an unknown value
R9 =0xd9fa2650 is an oop
[B 
 - klass: {type array byte}
 - length: 48
R10=0x7f0c79d39020:  in 
/home/robbins/sdks/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so at 
0x7f0c78d7d000
R11=0x7f0c65d06ae0 is at entry_point+0 in (nmethod*)0x7f0c65d06990
R12=0x is an unknown value
R13=0x7f0b69ffba88 is 

[jira] [Commented] (SPARK-14692) Error While Setting the path for R front end

2016-06-15 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331613#comment-15331613
 ] 

Dongjoon Hyun commented on SPARK-14692:
---

Hi, [~nmolkeri].

It seems that you just needed `Sys.setenv(SPARK_HOME="/Users/yourid/spark")` as 
the first line at that time.

It's a somewhat old issue by now.

If there are no further comments, I think we had better close this.

> Error While Setting the path for R front end
> 
>
> Key: SPARK-14692
> URL: https://issues.apache.org/jira/browse/SPARK-14692
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
> Environment: Mac OSX
>Reporter: Niranjan Molkeri`
>
> Trying to set Environment path for SparkR in RStudio. 
> Getting this bug. 
> > .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> > library(SparkR)
> Error in library(SparkR) : there is no package called ‘SparkR’
> > sc <- sparkR.init(master="local")
> Error: could not find function "sparkR.init"
> In the directory it points to, there is a directory called SparkR. I don't 
> know how to proceed with this.






[jira] [Resolved] (SPARK-15046) When running hive-thriftserver with yarn on a secure cluster the workers fail with java.lang.NumberFormatException

2016-06-15 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-15046.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> When running hive-thriftserver with yarn on a secure cluster the workers fail 
> with java.lang.NumberFormatException
> --
>
> Key: SPARK-15046
> URL: https://issues.apache.org/jira/browse/SPARK-15046
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Trystan Leftwich
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> When running hive-thriftserver with yarn on a secure cluster 
> (spark.yarn.principal and spark.yarn.keytab are set) the workers fail with 
> the following error.
> {code}
> 16/04/30 22:40:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
> java.lang.NumberFormatException: For input string: "86400079ms"
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Long.parseLong(Long.java:441)
>   at java.lang.Long.parseLong(Long.java:483)
>   at 
> scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
>   at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.SparkConf.getLong(SparkConf.scala:380)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.getTimeFromNowToRenewal(SparkHadoopUtil.scala:289)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.org$apache$spark$deploy$yarn$AMDelegationTokenRenewer$$scheduleRenewal$1(AMDelegationTokenRenewer.scala:89)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.scheduleLoginFromKeytab(AMDelegationTokenRenewer.scala:121)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:723)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:721)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:748)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}






[jira] [Updated] (SPARK-15046) When running hive-thriftserver with yarn on a secure cluster the workers fail with java.lang.NumberFormatException

2016-06-15 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-15046:
--
Assignee: Marcelo Vanzin

> When running hive-thriftserver with yarn on a secure cluster the workers fail 
> with java.lang.NumberFormatException
> --
>
> Key: SPARK-15046
> URL: https://issues.apache.org/jira/browse/SPARK-15046
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Trystan Leftwich
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> When running hive-thriftserver with yarn on a secure cluster 
> (spark.yarn.principal and spark.yarn.keytab are set) the workers fail with 
> the following error.
> {code}
> 16/04/30 22:40:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
> java.lang.NumberFormatException: For input string: "86400079ms"
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Long.parseLong(Long.java:441)
>   at java.lang.Long.parseLong(Long.java:483)
>   at 
> scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
>   at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.SparkConf.getLong(SparkConf.scala:380)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.getTimeFromNowToRenewal(SparkHadoopUtil.scala:289)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.org$apache$spark$deploy$yarn$AMDelegationTokenRenewer$$scheduleRenewal$1(AMDelegationTokenRenewer.scala:89)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.scheduleLoginFromKeytab(AMDelegationTokenRenewer.scala:121)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:723)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:721)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:748)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}






[jira] [Created] (SPARK-15963) `TaskKilledException` is not correctly caught in `Executor.TaskRunner`

2016-06-15 Thread Liwei Lin (JIRA)
Liwei Lin created SPARK-15963:
-

 Summary: `TaskKilledException` is not correctly caught in 
`Executor.TaskRunner`
 Key: SPARK-15963
 URL: https://issues.apache.org/jira/browse/SPARK-15963
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0, 2.1.0
Reporter: Liwei Lin


Currently in {{Executor.TaskRunner}}, we:

{code}
try {...}
catch {
  case _: TaskKilledException | _: InterruptedException if task.killed =>
  ...
}
{code}
What we intended was:
- {{TaskKilledException}} OR ({{InterruptedException}} AND {{task.killed}})

But in fact it is:
- ({{TaskKilledException}} OR {{InterruptedException}}) AND {{task.killed}}

As a consequence, we sometimes cannot catch {{TaskKilledException}} and will 
incorrectly report the task status as {{FAILED}} (when it should really be 
{{KILLED}}).






[jira] [Assigned] (SPARK-15963) `TaskKilledException` is not correctly caught in `Executor.TaskRunner`

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15963:


Assignee: Apache Spark

> `TaskKilledException` is not correctly caught in `Executor.TaskRunner`
> --
>
> Key: SPARK-15963
> URL: https://issues.apache.org/jira/browse/SPARK-15963
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Liwei Lin
>Assignee: Apache Spark
>
> Currently in {{Executor.TaskRunner}}, we:
> {code}
> try {...}
> catch {
>   case _: TaskKilledException | _: InterruptedException if task.killed =>
>   ...
> }
> {code}
> What we intended was:
> - {{TaskKilledException}} OR ({{InterruptedException}} AND {{task.killed}})
> But fact is:
> - ({{TaskKilledException}} OR {{InterruptedException}}) AND {{task.killed}}
> As a consequence, sometimes we can not catch {{TaskKilledException}} and will 
> incorrectly report our task status as {{FAILED}} (which should really be 
> {{KILLED}}).






[jira] [Assigned] (SPARK-15963) `TaskKilledException` is not correctly caught in `Executor.TaskRunner`

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15963:


Assignee: (was: Apache Spark)

> `TaskKilledException` is not correctly caught in `Executor.TaskRunner`
> --
>
> Key: SPARK-15963
> URL: https://issues.apache.org/jira/browse/SPARK-15963
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Liwei Lin
>
> Currently in {{Executor.TaskRunner}}, we:
> {code}
> try {...}
> catch {
>   case _: TaskKilledException | _: InterruptedException if task.killed =>
>   ...
> }
> {code}
> What we intended was:
> - {{TaskKilledException}} OR ({{InterruptedException}} AND {{task.killed}})
> But fact is:
> - ({{TaskKilledException}} OR {{InterruptedException}}) AND {{task.killed}}
> As a consequence, sometimes we can not catch {{TaskKilledException}} and will 
> incorrectly report our task status as {{FAILED}} (which should really be 
> {{KILLED}}).






[jira] [Commented] (SPARK-15963) `TaskKilledException` is not correctly caught in `Executor.TaskRunner`

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331837#comment-15331837
 ] 

Apache Spark commented on SPARK-15963:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13685

> `TaskKilledException` is not correctly caught in `Executor.TaskRunner`
> --
>
> Key: SPARK-15963
> URL: https://issues.apache.org/jira/browse/SPARK-15963
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Liwei Lin
>
> Currently in {{Executor.TaskRunner}}, we:
> {code}
> try {...}
> catch {
>   case _: TaskKilledException | _: InterruptedException if task.killed =>
>   ...
> }
> {code}
> What we intended was:
> - {{TaskKilledException}} OR ({{InterruptedException}} AND {{task.killed}})
> But fact is:
> - ({{TaskKilledException}} OR {{InterruptedException}}) AND {{task.killed}}
> As a consequence, sometimes we can not catch {{TaskKilledException}} and will 
> incorrectly report our task status as {{FAILED}} (which should really be 
> {{KILLED}}).






[jira] [Created] (SPARK-15964) Assignment to RDD-typed val fails

2016-06-15 Thread Sanjay Dasgupta (JIRA)
Sanjay Dasgupta created SPARK-15964:
---

 Summary: Assignment to RDD-typed val fails
 Key: SPARK-15964
 URL: https://issues.apache.org/jira/browse/SPARK-15964
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
 Environment: Notebook on Databricks Community-Edition 
Spark-2.0 preview
Google Chrome Browser
Linux Ubuntu 14.04 LTS
Reporter: Sanjay Dasgupta


An unusual assignment error, giving the following error message:

found : org.apache.spark.rdd.RDD[Name]
required : org.apache.spark.rdd.RDD[Name]

This occurs when the assignment is attempted in a cell that is different from 
the cell in which the item on the right-hand side is defined, as in the 
following example:

// CELL-1
import org.apache.spark.sql.Dataset
import org.apache.spark.rdd.RDD

case class Name(number: Int, name: String)
val names = Seq(Name(1, "one"), Name(2, "two"), Name(3, "three"), Name(4, "four"))
val dataset: Dataset[Name] = spark.sparkContext.parallelize(names).toDF.as[Name]

// CELL-2
// Error reported here ...
val dataRdd: RDD[Name] = dataset.rdd

The error is reported in CELL-2






[jira] [Commented] (SPARK-15815) Hang while enable blacklistExecutor and DynamicExecutorAllocator

2016-06-15 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331907#comment-15331907
 ] 

Imran Rashid commented on SPARK-15815:
--

Yes, I agree that SPARK-15865 isn't ideal on its own. SPARK-8426 will add 
better blacklisting, which will help some. And then, as a follow-up after that, 
I intend to add actively killing blacklisted executors. But I actually think 
it won't change things -- we'll still abort the taskset when we first discover 
a task that can't be scheduled, because even with dynamic allocation we'll 
never really know if we're going to get another executor. It's not ideal, but I 
think the first step is to be sure we're preventing the app from hanging.

> Hang while enable blacklistExecutor and DynamicExecutorAllocator 
> -
>
> Key: SPARK-15815
> URL: https://issues.apache.org/jira/browse/SPARK-15815
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.6.1
>Reporter: SuYan
>Priority: Minor
>
> Enable executor blacklisting with a blacklist time larger than 120s, and enable 
> dynamic allocation with minExecutors = 0.
> 1. Assume only 1 task is left, running on Executor A, and all other executors 
> have already been removed on idle timeout.
> 2. The task fails, so it will not be scheduled on Executor A again while the 
> blacklist time is in effect.
> 3. ExecutorAllocationManager keeps requesting targetNumExecutors = 1. Because 
> Executor A still counts as an existing executor, oldTargetNumExecutors == 
> targetNumExecutors == 1, so no more executors are ever added, even if Executor A 
> itself times out. It becomes an endless request for delta = 0 executors.
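
To make the arithmetic in step 3 concrete, here is a much simplified sketch of
the target-vs-existing computation being described (illustrative only; it is
not the actual ExecutorAllocationManager code, and the variable names are made
up):

{code}
// Illustrative sketch only -- not the actual ExecutorAllocationManager code.
val numExistingExecutors = 1   // Executor A is still registered, but blacklisted for the task
val numPendingTasks      = 1   // the one task that failed and cannot be rescheduled on A
val tasksPerExecutor     = 1

val targetNumExecutors = math.ceil(numPendingTasks.toDouble / tasksPerExecutor).toInt  // 1
val delta = targetNumExecutors - numExistingExecutors                                  // 0

// delta == 0, so no additional executor is ever requested and the job hangs,
// even though the only existing executor cannot run the remaining task.
{code}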



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15965) No FileSystem for scheme: s3n or s3a spark-2.0.0 and spark-1.6.1

2016-06-15 Thread thauvin damien (JIRA)
thauvin damien created SPARK-15965:
--

 Summary: No FileSystem for scheme: s3n or s3a  spark-2.0.0 and 
spark-1.6.1
 Key: SPARK-15965
 URL: https://issues.apache.org/jira/browse/SPARK-15965
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.6.1
 Environment: Debian GNU/Linux 8
java version "1.7.0_79"
Reporter: thauvin damien


The Spark programming guide explains that Spark can create distributed datasets 
on Amazon S3.
But since the "Hadoop 2.6" pre-built packages, S3 access doesn't work with s3n or s3a 
in any version of Spark: spark-1.3.1, spark-1.6.1, even spark-2.0.0 with Hadoop 2.7.2.
I understand this is a Hadoop issue (SPARK-7442), but could you add some 
documentation explaining which jars need to be added and where (for a standalone 
installation)?
Are hadoop-aws-x.x.x.jar and aws-java-sdk-x.x.x.jar enough?
Which environment variables need to be set, and which files need to be modified?
Is it $CLASSPATH, or a setting in spark-defaults.conf such as 
spark.driver.extraClassPath and spark.executor.extraClassPath?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15966) Fix markdown for Spark Monitoring

2016-06-15 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-15966:


 Summary: Fix markdown for Spark Monitoring
 Key: SPARK-15966
 URL: https://issues.apache.org/jira/browse/SPARK-15966
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.0.0
Reporter: Dhruve Ashar
Priority: Trivial


The markdown for Spark monitoring needs to be fixed. 
http://spark.apache.org/docs/2.0.0-preview/monitoring.html




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15861) pyspark mapPartitions with none generator functions / functors

2016-06-15 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331975#comment-15331975
 ] 

Bryan Cutler commented on SPARK-15861:
--

If you change your function to this

{noformat}
def to_np(data):
return [np.array(list(data))]
{noformat}

I think you would get what you expect, but this is probably not a good way to 
go about it.  I feel like you should be aggregating your lists into numpy 
arrays instead, but someone else might know better.
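
For comparison, the Scala API makes the per-partition contract explicit:
mapPartitions takes an Iterator-to-Iterator function, so a "one value per
partition" result has to be wrapped in an iterator. A rough Scala analogue of
the behaviour being discussed (illustrative only; the report above is about
PySpark):

{code}
// Illustrative Scala analogue only; the report above concerns PySpark.
import org.apache.spark.{SparkConf, SparkContext}

val sc  = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("mapPartitions-demo"))
val rdd = sc.parallelize(0 until 25, 2)

// mapPartitions: wrap the single per-partition result in an Iterator.
val perPartition = rdd.mapPartitions(iter => Iterator(iter.toArray)).collect()
// perPartition.length == 2, one array per partition

// map: one output element per input element.
val perElement = rdd.map(_ + 1).collect()
// perElement.length == 25

sc.stop()
{code}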

> pyspark mapPartitions with none generator functions / functors
> --
>
> Key: SPARK-15861
> URL: https://issues.apache.org/jira/browse/SPARK-15861
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Greg Bowyer
>Priority: Minor
>
> Hi all, it appears that the method `rdd.mapPartitions` does odd things if it 
> is fed a normal subroutine.
> For instance, let's say we have the following
> {code}
> rows = range(25)
> rows = [rows[i:i+5] for i in range(0, len(rows), 5)]
> rdd = sc.parallelize(rows, 2)
> def to_np(data):
> return np.array(list(data))
> rdd.mapPartitions(to_np).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> rdd.mapPartitions(to_np, preservePartitioning=True).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> {code}
> This basically makes a provided function that uses a plain return act as if the 
> end user had called {code}rdd.map{code}.
> Maybe a check should be added that calls 
> {code}inspect.isgeneratorfunction{code}?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15965) No FileSystem for scheme: s3n or s3a spark-2.0.0 and spark-1.6.1

2016-06-15 Thread thauvin damien (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

thauvin damien updated SPARK-15965:
---
Description: 
The Spark programming guide explains that Spark can create distributed datasets 
on Amazon S3.
But since the "Hadoop 2.6" pre-built packages, S3 access doesn't work with s3n or s3a:

sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", "XXXZZZHHH")
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", 
"xxx")
val lines=sc.textFile("s3a://poc-XXX/access/2016/02/20160201202001_xxx.log.gz")

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class 
org.apache.hadoop.fs.s3a.S3AFileSystem not found

This happens with any version of Spark: spark-1.3.1, spark-1.6.1, even spark-2.0.0 
with Hadoop 2.7.2.
I understand this is a Hadoop issue (SPARK-7442), but could you add some 
documentation explaining which jars need to be added and where (for a standalone 
installation)?
Are hadoop-aws-x.x.x.jar and aws-java-sdk-x.x.x.jar enough?
Which environment variables need to be set, and which files need to be modified?
Is it $CLASSPATH, or a setting in spark-defaults.conf such as 
spark.driver.extraClassPath and spark.executor.extraClassPath?

It still works with the spark-1.6.1 package pre-built with hadoop2.4.

Thanks

  was:
The Spark programming guide explains that Spark can create distributed datasets 
on Amazon S3.
But since the "Hadoop 2.6" pre-built packages, S3 access doesn't work with s3n or s3a 
in any version of Spark: spark-1.3.1, spark-1.6.1, even spark-2.0.0 with Hadoop 2.7.2.
I understand this is a Hadoop issue (SPARK-7442), but could you add some 
documentation explaining which jars need to be added and where (for a standalone 
installation)?
Are hadoop-aws-x.x.x.jar and aws-java-sdk-x.x.x.jar enough?
Which environment variables need to be set, and which files need to be modified?
Is it $CLASSPATH, or a setting in spark-defaults.conf such as 
spark.driver.extraClassPath and spark.executor.extraClassPath?


> No FileSystem for scheme: s3n or s3a  spark-2.0.0 and spark-1.6.1
> -
>
> Key: SPARK-15965
> URL: https://issues.apache.org/jira/browse/SPARK-15965
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.6.1
> Environment: Debian GNU/Linux 8
> java version "1.7.0_79"
>Reporter: thauvin damien
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> The Spark programming guide explains that Spark can create distributed 
> datasets on Amazon S3.
> But since the "Hadoop 2.6" pre-built packages, S3 access doesn't work with s3n 
> or s3a:
> sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", "XXXZZZHHH")
> sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", 
> "xxx")
> val 
> lines=sc.textFile("s3a://poc-XXX/access/2016/02/20160201202001_xxx.log.gz")
> java.lang.RuntimeException: java.lang.ClassNotFoundException: Class 
> org.apache.hadoop.fs.s3a.S3AFileSystem not found
> This happens with any version of Spark: spark-1.3.1, spark-1.6.1, even 
> spark-2.0.0 with Hadoop 2.7.2.
> I understand this is a Hadoop issue (SPARK-7442), but could you add some 
> documentation explaining which jars need to be added and where (for a standalone 
> installation)?
> Are hadoop-aws-x.x.x.jar and aws-java-sdk-x.x.x.jar enough?
> Which environment variables need to be set, and which files need to be modified?
> Is it $CLASSPATH, or a setting in spark-defaults.conf such as 
> spark.driver.extraClassPath and spark.executor.extraClassPath?
> It still works with the spark-1.6.1 package pre-built with hadoop2.4.
> Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15849) FileNotFoundException on _temporary while doing saveAsTable to S3

2016-06-15 Thread Sandeep (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332034#comment-15332034
 ] 

Sandeep commented on SPARK-15849:
-

Thanks Thomas for that comment. I have verified both things with the direct 
committer:
1. The inconsistency issue no longer occurs
2. I see a 2x speedup too
Looking forward to the fix landing directly in Hadoop, so that the knob doesn't 
have to be explicitly set.

> FileNotFoundException on _temporary while doing saveAsTable to S3
> -
>
> Key: SPARK-15849
> URL: https://issues.apache.org/jira/browse/SPARK-15849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
> Environment: AWS EC2 with spark on yarn and s3 storage
>Reporter: Sandeep
>
> When submitting spark jobs to yarn cluster, I occasionally see these error 
> messages while doing saveAsTable. I have tried doing this with 
> spark.speculation=false, and get the same error. These errors are similar to 
> SPARK-2984, but my jobs are writing to S3(s3n) :
> Caused by: java.io.FileNotFoundException: File 
> s3n://xxx/_temporary/0/task_201606080516_0004_m_79 does not exist.
> at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:506)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
> ... 42 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15960) Audit new SQL confs

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15960:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-15426

> Audit new SQL confs 
> 
>
> Key: SPARK-15960
> URL: https://issues.apache.org/jira/browse/SPARK-15960
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.0.0
>
>
> Check the current SQL configuration names for inconsistencies.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15426) Spark 2.0 SQL API audit

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15426.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Spark 2.0 SQL API audit
> ---
>
> Key: SPARK-15426
> URL: https://issues.apache.org/jira/browse/SPARK-15426
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> This is an umbrella ticket to list issues I found with APIs for the 2.0 
> release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15960) Audit new SQL confs

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15960.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Audit new SQL confs 
> 
>
> Key: SPARK-15960
> URL: https://issues.apache.org/jira/browse/SPARK-15960
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.0.0
>
>
> Check the current SQL configuration names for inconsistencies.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15967) Spark UI should show dynamically changed value of storage memory instead of showing one static value all the time

2016-06-15 Thread Umesh K (JIRA)
Umesh K created SPARK-15967:
---

 Summary: Spark UI should show dynamically changed value of storage 
memory instead of showing one static value all the time
 Key: SPARK-15967
 URL: https://issues.apache.org/jira/browse/SPARK-15967
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.6.1, 1.6.0
Reporter: Umesh K
Priority: Minor


As of Spark 1.6.x we have unified memory management, so the split between 
execution and storage memory changes over time: if execution grows, it takes 
memory from storage, and vice versa. However, the Spark UI still shows a single 
static value in the Storage tab, as it did in previous versions. Ideally the 
storage memory value should be refreshed to show its real-time value in the 
Spark UI, so we can actually see the borrowing between execution and storage 
happening.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache

2016-06-15 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332074#comment-15332074
 ] 

Michael Allman commented on SPARK-15968:


I have a patch based on the way this method was implemented in Spark 1.5. I'm 
working on a PR presently.

> HiveMetastoreCatalog does not correctly validate partitioned metastore 
> relation when searching the internal table cache
> ---
>
> Key: SPARK-15968
> URL: https://issues.apache.org/jira/browse/SPARK-15968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>  Labels: hive, metastore
>
> The `getCached` method of `HiveMetastoreCatalog` computes `pathsInMetastore` 
> from the metastore relation's catalog table. This only returns the table base 
> path, which is not correct for partitioned tables. As a result, cache lookups 
> on partitioned tables always miss and these relations are always recomputed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache

2016-06-15 Thread Michael Allman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-15968:
---
Description: The {{getCached}} method of {{HiveMetastoreCatalog}} computes 
{{pathsInMetastore}} from the metastore relation's catalog table. This only 
returns the table base path, which is not correct for partitioned tables. As a 
result, cache lookups on partitioned tables always miss and these relations are 
always recomputed.  (was: The `getCached` method of `HiveMetastoreCatalog` 
computes `pathsInMetastore` from the metastore relation's catalog table. This 
only returns the table base path, which is not correct for partitioned tables. 
As a result, cache lookups on partitioned tables always miss and these 
relations are always recomputed.)

> HiveMetastoreCatalog does not correctly validate partitioned metastore 
> relation when searching the internal table cache
> ---
>
> Key: SPARK-15968
> URL: https://issues.apache.org/jira/browse/SPARK-15968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>  Labels: hive, metastore
>
> The {{getCached}} method of {{HiveMetastoreCatalog}} computes 
> {{pathsInMetastore}} from the metastore relation's catalog table. This only 
> returns the table base path, which is not correct for partitioned tables. As 
> a result, cache lookups on partitioned tables always miss and these relations 
> are always recomputed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache

2016-06-15 Thread Michael Allman (JIRA)
Michael Allman created SPARK-15968:
--

 Summary: HiveMetastoreCatalog does not correctly validate 
partitioned metastore relation when searching the internal table cache
 Key: SPARK-15968
 URL: https://issues.apache.org/jira/browse/SPARK-15968
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Michael Allman


The `getCached` method of `HiveMetastoreCatalog` computes `pathsInMetastore` 
from the metastore relation's catalog table. This only returns the table base 
path, which is not correct for partitioned tables. As a result, cache lookups 
on partitioned tables always miss and these relations are always recomputed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15967) Spark UI should show realtime value of storage memory instead of showing one static value all the time

2016-06-15 Thread Umesh K (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Umesh K updated SPARK-15967:

Summary: Spark UI should show realtime value of storage memory instead of 
showing one static value all the time  (was: Spark UI should show dynamically 
changed value of storage memory instead of showing one static value all the 
time)

> Spark UI should show realtime value of storage memory instead of showing one 
> static value all the time
> --
>
> Key: SPARK-15967
> URL: https://issues.apache.org/jira/browse/SPARK-15967
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Umesh K
>Priority: Minor
>
> As of Spark 1.6.x we have unified memory management, so the split between 
> execution and storage memory changes over time: if execution grows, it takes 
> memory from storage, and vice versa. However, the Spark UI still shows a 
> single static value in the Storage tab, as it did in previous versions. 
> Ideally the storage memory value should be refreshed to show its real-time 
> value in the Spark UI, so we can actually see the borrowing between execution 
> and storage happening.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332123#comment-15332123
 ] 

Apache Spark commented on SPARK-15968:
--

User 'mallman' has created a pull request for this issue:
https://github.com/apache/spark/pull/13686

> HiveMetastoreCatalog does not correctly validate partitioned metastore 
> relation when searching the internal table cache
> ---
>
> Key: SPARK-15968
> URL: https://issues.apache.org/jira/browse/SPARK-15968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>  Labels: hive, metastore
>
> The {{getCached}} method of {{HiveMetastoreCatalog}} computes 
> {{pathsInMetastore}} from the metastore relation's catalog table. This only 
> returns the table base path, which is not correct for partitioned tables. As 
> a result, cache lookups on partitioned tables always miss and these relations 
> are always recomputed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15968:


Assignee: (was: Apache Spark)

> HiveMetastoreCatalog does not correctly validate partitioned metastore 
> relation when searching the internal table cache
> ---
>
> Key: SPARK-15968
> URL: https://issues.apache.org/jira/browse/SPARK-15968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>  Labels: hive, metastore
>
> The {{getCached}} method of {{HiveMetastoreCatalog}} computes 
> {{pathsInMetastore}} from the metastore relation's catalog table. This only 
> returns the table base path, which is not correct for partitioned tables. As 
> a result, cache lookups on partitioned tables always miss and these relations 
> are always recomputed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15968:


Assignee: Apache Spark

> HiveMetastoreCatalog does not correctly validate partitioned metastore 
> relation when searching the internal table cache
> ---
>
> Key: SPARK-15968
> URL: https://issues.apache.org/jira/browse/SPARK-15968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>Assignee: Apache Spark
>  Labels: hive, metastore
>
> The {{getCached}} method of {{HiveMetastoreCatalog}} computes 
> {{pathsInMetastore}} from the metastore relation's catalog table. This only 
> returns the table base path, which is not correct for partitioned tables. As 
> a result, cache lookups on partitioned tables always miss and these relations 
> are always recomputed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15811) UDFs do not work in Spark 2.0-preview built with scala 2.10

2016-06-15 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15811:
---
Priority: Blocker  (was: Critical)

> UDFs do not work in Spark 2.0-preview built with scala 2.10
> ---
>
> Key: SPARK-15811
> URL: https://issues.apache.org/jira/browse/SPARK-15811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>Priority: Blocker
>
> I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
> {code}
> ./dev/change-version-to-2.10.sh
> ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
> -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
> {code}
> and then ran the following code in a pyspark shell
> {code}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import IntegerType, StructField, StructType
> from pyspark.sql.functions import udf
> from pyspark.sql.types import Row
> spark = SparkSession.builder.master('local[4]').appName('2.0 
> DF').getOrCreate()
> add_one = udf(lambda x: x + 1, IntegerType())
> schema = StructType([StructField('a', IntegerType(), False)])
> df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
> df.select(add_one(df.a).alias('incremented')).collect()
> {code}
> This never returns with a result. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15931) SparkR tests failing on R 3.3.0

2016-06-15 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-15931.
---
   Resolution: Fixed
 Assignee: Felix Cheung
Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/13636

> SparkR tests failing on R 3.3.0
> ---
>
> Key: SPARK-15931
> URL: https://issues.apache.org/jira/browse/SPARK-15931
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Felix Cheung
> Fix For: 2.0.0
>
>
> Environment:
> # Spark master Git revision: 
> [f5d38c39255cc75325c6639561bfec1bc051f788|https://github.com/apache/spark/tree/f5d38c39255cc75325c6639561bfec1bc051f788]
> # R version: 3.3.0
> To reproduce this, just build Spark with {{-Psparkr}} and run the tests. 
> Relevant log lines:
> {noformat}
> ...
> Failed 
> -
> 1. Failure: Check masked functions (@test_context.R#44) 
> 
> length(maskedCompletely) not equal to length(namesOfMaskedCompletely).
> 1/1 mismatches
> [1] 3 - 5 == -2
> 2. Failure: Check masked functions (@test_context.R#45) 
> 
> sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely).
> Lengths differ: 3 vs 5
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15811) UDFs do not work in Spark 2.0-preview built with scala 2.10

2016-06-15 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-15811:
--

Assignee: Davies Liu

> UDFs do not work in Spark 2.0-preview built with scala 2.10
> ---
>
> Key: SPARK-15811
> URL: https://issues.apache.org/jira/browse/SPARK-15811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>Assignee: Davies Liu
>Priority: Blocker
>
> I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
> {code}
> ./dev/change-version-to-2.10.sh
> ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
> -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
> {code}
> and then ran the following code in a pyspark shell
> {code}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import IntegerType, StructField, StructType
> from pyspark.sql.functions import udf
> from pyspark.sql.types import Row
> spark = SparkSession.builder.master('local[4]').appName('2.0 
> DF').getOrCreate()
> add_one = udf(lambda x: x + 1, IntegerType())
> schema = StructType([StructField('a', IntegerType(), False)])
> df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
> df.select(add_one(df.a).alias('incremented')).collect()
> {code}
> This never returns with a result. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15953) Renamed ContinuousQuery to StreamingQuery for simplicity

2016-06-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-15953.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

> Renamed ContinuousQuery to StreamingQuery for simplicity
> 
>
> Key: SPARK-15953
> URL: https://issues.apache.org/jira/browse/SPARK-15953
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.0.0
>
>
> Make the API more intuitive by removing the term "Continuous".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-06-15 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332185#comment-15332185
 ] 

Miao Wang commented on SPARK-15784:
---

[~josephkb][~mengxr][~yanboliang] I am trying to add PIC to spark.ml and I have 
some questions regarding model.predict and saveImpl. The basic PIC algorithm 
has the following steps:

Input: a row-normalized affinity matrix W and the number of clusters k
Output: clusters C1, C2, …, Ck

1. Pick an initial vector v_0
2. Repeat:
     set v_{t+1} ← W v_t
     set δ_{t+1} ← |v_{t+1} – v_t|
     increment t
   until |δ_t – δ_{t-1}| ≈ 0
3. Use k-means to cluster the points of v_t and return clusters C1, C2, …, Ck

In the last step, k-means clusters the pseudo-eigenvector `v` generated by PIC. 
Therefore model.predict should use the trained k-means to do the prediction; 
however, producing the vector `v` requires running PIC again on the data to be 
predicted. So there is no trained model for predicting a new data set: 
model.predict effectively trains again via PIC.fit, and PIC.fit and PIC.predict 
would both end up calling the same run method of the MLlib implementation.

Since we have to train on the data anyway, saving the model is not very useful, 
as there is no model to save. In the MLlib implementation, the save function 
saves the assignment results for the current data set, which cannot be used to 
cluster new data. The only use of the saved result is that, if the same data is 
given again, we don't have to train again; but we cannot tell whether the input 
is the same data the saved model was trained on.

Please correct me if I have misunderstood anything. Thanks!

Miao
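
For concreteness, a compact local (non-distributed) sketch of the
power-iteration step described above; this is illustrative only and is not the
spark.mllib/spark.ml implementation (W is assumed to be row-normalized, and the
stopping rule follows the |δ_t – δ_{t-1}| ≈ 0 criterion):

{code}
// Illustrative sketch only -- not the MLlib PowerIterationClustering code.
def powerIteration(w: Array[Array[Double]], v0: Array[Double],
                   tol: Double = 1e-8, maxIter: Int = 100): Array[Double] = {
  var v = v0
  var prevDelta = Double.MaxValue
  var t = 0
  var converged = false
  while (t < maxIter && !converged) {
    // v_{t+1} = W v_t, rescaled so the entries stay bounded
    val next   = w.map(row => row.zip(v).map { case (a, b) => a * b }.sum)
    val scale  = next.map(math.abs).sum
    val scaled = next.map(_ / scale)
    val delta  = scaled.zip(v).map { case (a, b) => math.abs(a - b) }.sum
    converged  = math.abs(delta - prevDelta) < tol  // stop when |delta_t - delta_{t-1}| ~ 0
    prevDelta  = delta
    v = scaled
    t += 1
  }
  v  // the pseudo-eigenvector that k-means then clusters into C1, ..., Ck
}
{code}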




> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15969) FileNotFoundException: Multiple arguments for py-files flag, (also jars) for spark-submit

2016-06-15 Thread Kun Liu (JIRA)
Kun Liu created SPARK-15969:
---

 Summary: FileNotFoundException: Multiple arguments for py-files 
flag, (also jars) for spark-submit
 Key: SPARK-15969
 URL: https://issues.apache.org/jira/browse/SPARK-15969
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 1.6.1, 1.5.0
 Environment: Mac OS X 10.11.5
Reporter: Kun Liu
Priority: Minor


This is my first JIRA issue and I am new to the Spark community, so please correct 
me if I get anything wrong. Thanks.

A java.io.FileNotFoundException occurs when multiple arguments are specified for 
the --py-files (and also --jars) flag.
I searched for a while but only found a similar issue on Windows OS: 
https://issues.apache.org/jira/browse/SPARK-6435
My test environment was Mac OS X with Spark versions 1.5.0 and 1.6.1.

1.1 Observations:
1) Quoting the arguments makes no difference; the result is always the same
2) The first path (before the comma), as long as it is valid, is never a problem, 
whether it is absolute or relative
3) The second and further py-files paths are not a problem if ALL of them:
a. are relative paths under the working directory ($PWD); OR
b. start with an environment variable, e.g. $ENV_VAR/path/to/file; OR
c. are preprocessed by $(echo path/to/*.py | tr ' ' ','), whether 
absolute or relative, as long as they are valid
4) The path of the driver program, assuming it is valid, does not matter, as it is 
a single file

1.2 Experiments:

Assuming main.py calls functions from helper1.py and helper2.py, and all paths 
below are valid.
~/Desktop/testpath: main.py, helper1.py, helper2.py
$SPARK_HOME/testpath: helper1.py, helper2.py

1) Successful output:
a. Multiple python paths are relative paths under the same directory as 
the working directory

cd $SPARK_HOME
bin/spark-submit --py-files testpath/helper1.py,testpath/helper2.py 
~/Desktop/testpath/main.py

cd ~/Desktop
$SPARK_HOME/bin/spark-submit --py-files 
testpath/helper1.py,testpath/helper2.py testpath/main.py

b. Multiple python paths are specified by using environment variable

export TEST_DIR=~/Desktop/testpath
cd ~
$SPARK_HOME/bin/spark-submit --py-files 
$TEST_DIR/helper1.py,$TEST_DIR/helper2.py ~/Desktop/testpath/main.py

cd ~/Documents
$SPARK_HOME/bin/spark-submit --py-files 
$TEST_DIR/helper1.py,$TEST_DIR/helper2.py ~/Desktop/testpath/main.py

c. Multiple paths (absolute or relative) after being preprocessed:

$SPARK_HOME/bin/spark-submit --py-files $(echo 
$SPARK_HOME/testpath/helper*.py | tr ' ' ',') ~/Desktop/testpath/main.py 

cd ~/Desktop
$SPARK_HOME/bin/spark-submit --py-files $(echo testpath/helper*.py | tr 
' ' ',') ~/Desktop/testpath/main.py 


(reference link: 
http://stackoverflow.com/questions/24855368/spark-throws-classnotfoundexception-when-using-jars-option)

2) Failure output: if the second python path is an absolute one; the same 
problem will happen for further paths

cd ~/Documents
$SPARK_HOME/bin/spark-submit --py-files 
~/Desktop/testpath/helper1.py,~/Desktop/testpath/helper2.py 
~/Desktop/testpath/main.py 

py4j.protocol.Py4JJavaError: An error occurred while calling 
None.org.apache.spark.api.java.JavaSparkContext.
: java.io.FileNotFoundException: Added file 
file:/Users/kunliu/Documents/~/Desktop/testpath/helper2.py does not exist.

1.3 Conclusions

I would suggest that the --py-files flag of spark-submit support absolute path 
arguments as well, not just relative paths under the working directory.
If necessary, I would like to submit a pull request and start working on it as 
my first contribution to the Spark community.

1.4 Note
1) I think the same issue happens when multiple jar files delimited by commas 
are passed to the --jars flag for Java applications.
2) I suggest that wildcard path arguments also be supported, as indicated by 
https://issues.apache.org/jira/browse/SPARK-3451




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15964) Assignment to RDD-typed val fails

2016-06-15 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332224#comment-15332224
 ] 

Michael Armbrust commented on SPARK-15964:
--

Thanks for reporting this, but I believe this is actually specific to the 
Databricks environment (i.e. it works in the spark shell). The issue is a Scala 
compiler bug, and as far as we know you have to choose between two behaviors:
 - path-dependent types work (i.e. you can refer to a type from another cell in 
the next cell), or
 - multi-line commands (:paste mode in the spark shell) work with SQL implicits.

Many more workloads in notebooks depend on the latter, while the former is more 
common in the command-line REPL; this is why the behavior differs. I'm hoping 
that Scala 2.11 will give us the best of both worlds if we can fix 
https://issues.scala-lang.org/browse/SI-9799

> Assignment to RDD-typed val fails
> -
>
> Key: SPARK-15964
> URL: https://issues.apache.org/jira/browse/SPARK-15964
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: Notebook on Databricks Community-Edition 
> Spark-2.0 preview
> Google Chrome Browser
> Linux Ubuntu 14.04 LTS
>Reporter: Sanjay Dasgupta
>
> Unusual assignment error, giving the following error message:
> found : org.apache.spark.rdd.RDD[Name]
> required : org.apache.spark.rdd.RDD[Name]
> This occurs when the assignment is attempted in a cell that is different from 
> the cell in which the item on the right-hand-side is defined. As in the 
> following example:
> // CELL-1
> import org.apache.spark.sql.Dataset
> import org.apache.spark.rdd.RDD
> case class Name(number: Int, name: String)
> val names = Seq(Name(1, "one"), Name(2, "two"), Name(3, "three"), Name(4, 
> "four"))
> val dataset: Dataset[Name] = 
> spark.sparkContext.parallelize(names).toDF.as[Name]
> // CELL-2
> // Error reported here ...
> val dataRdd: RDD[Name] = dataset.rdd
> The error is reported in CELL-2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15964) Assignment to RDD-typed val fails

2016-06-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-15964.
--
Resolution: Won't Fix

> Assignment to RDD-typed val fails
> -
>
> Key: SPARK-15964
> URL: https://issues.apache.org/jira/browse/SPARK-15964
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: Notebook on Databricks Community-Edition 
> Spark-2.0 preview
> Google Chrome Browser
> Linux Ubuntu 14.04 LTS
>Reporter: Sanjay Dasgupta
>
> Unusual assignment error, giving the following error message:
> found : org.apache.spark.rdd.RDD[Name]
> required : org.apache.spark.rdd.RDD[Name]
> This occurs when the assignment is attempted in a cell that is different from 
> the cell in which the item on the right-hand-side is defined. As in the 
> following example:
> // CELL-1
> import org.apache.spark.sql.Dataset
> import org.apache.spark.rdd.RDD
> case class Name(number: Int, name: String)
> val names = Seq(Name(1, "one"), Name(2, "two"), Name(3, "three"), Name(4, 
> "four"))
> val dataset: Dataset[Name] = 
> spark.sparkContext.parallelize(names).toDF.as[Name]
> // CELL-2
> // Error reported here ...
> val dataRdd: RDD[Name] = dataset.rdd
> The error is reported in CELL-2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15955) Failed Spark application returns with exitcode equals to zero

2016-06-15 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332276#comment-15332276
 ] 

Thomas Graves commented on SPARK-15955:
---

what master and deploy mode are you using?

> Failed Spark application returns with exitcode equals to zero
> -
>
> Key: SPARK-15955
> URL: https://issues.apache.org/jira/browse/SPARK-15955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>
> Scenario:
> * Set up cluster with wire-encryption enabled.
> * set 'spark.authenticate.enableSaslEncryption' = 'false' and 
> 'spark.shuffle.service.enabled' :'true'
> * run sparkPi application.
> {code}
> client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
> diagnostics: Max number of executor failures (3) reached
> ApplicationMaster host: xx.xx.xx.xxx
> ApplicationMaster RPC port: 0
> queue: default
> start time: 1465941051976
> final status: FAILED
> tracking URL: https://xx.xx.xx.xxx:8090/proxy/application_1465925772890_0016/
> user: hrt_qa
> Exception in thread "main" org.apache.spark.SparkException: Application 
> application_1465925772890_0016 finished with failed status
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1092)
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1139)
> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> INFO ShutdownHookManager: Shutdown hook called{code}
> This Spark application exits with exit code 0. A failed application should not 
> return exit code 0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15811) Python UDFs do not work in Spark 2.0-preview built with scala 2.10

2016-06-15 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15811:
---
Summary: Python UDFs do not work in Spark 2.0-preview built with scala 2.10 
 (was: UDFs do not work in Spark 2.0-preview built with scala 2.10)

> Python UDFs do not work in Spark 2.0-preview built with scala 2.10
> --
>
> Key: SPARK-15811
> URL: https://issues.apache.org/jira/browse/SPARK-15811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>Assignee: Davies Liu
>Priority: Blocker
>
> I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
> {code}
> ./dev/change-version-to-2.10.sh
> ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
> -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
> {code}
> and then ran the following code in a pyspark shell
> {code}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import IntegerType, StructField, StructType
> from pyspark.sql.functions import udf
> from pyspark.sql.types import Row
> spark = SparkSession.builder.master('local[4]').appName('2.0 
> DF').getOrCreate()
> add_one = udf(lambda x: x + 1, IntegerType())
> schema = StructType([StructField('a', IntegerType(), False)])
> df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
> df.select(add_one(df.a).alias('incremented')).collect()
> {code}
> This never returns with a result. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15935) Enable test for sql/streaming.py and fix these tests

2016-06-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-15935.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

> Enable test for sql/streaming.py and fix these tests
> 
>
> Key: SPARK-15935
> URL: https://issues.apache.org/jira/browse/SPARK-15935
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Right now tests  sql/streaming.py are disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15970) WARNing message related to persisting table to Hive Metastore while Spark SQL is running in-memory catalog mode

2016-06-15 Thread Xin Wu (JIRA)
Xin Wu created SPARK-15970:
--

 Summary: WARNing message related to persisting table to Hive 
Metastore while Spark SQL is running in-memory catalog mode
 Key: SPARK-15970
 URL: https://issues.apache.org/jira/browse/SPARK-15970
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xin Wu
Priority: Minor


When we run spark-shell in in-memory catalog mode, creating a datasource table 
that is not compatible with Hive shows a warning message saying it cannot 
persist the table in a Hive-compatible way. However, in-memory catalog mode 
should not involve trying to persist the table in the Hive metastore at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15959) Add the support of hive.metastore.warehouse.dir back

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15959.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Add the support of hive.metastore.warehouse.dir back
> 
>
> Key: SPARK-15959
> URL: https://issues.apache.org/jira/browse/SPARK-15959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>  Labels: release_notes, releasenotes
> Fix For: 2.0.0
>
>
> Right now, we do not load the value of this conf at all 
> (https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSharedState.scala#L35-L41).
>  Let's maintain backward compatibility by loading it when Spark's warehouse 
> conf is not set.
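
A rough sketch of the fallback behaviour being proposed (illustrative only, not
the actual HiveSharedState code; "spark.sql.warehouse.dir" is assumed to be
Spark's own warehouse setting and the default value is made up):

{code}
// Illustrative sketch only -- not the actual HiveSharedState code.
def resolveWarehouseDir(sparkConf: Map[String, String],
                        hiveConf: Map[String, String]): String =
  sparkConf.get("spark.sql.warehouse.dir")                 // prefer Spark's own conf
    .orElse(hiveConf.get("hive.metastore.warehouse.dir"))  // fall back to the Hive conf
    .getOrElse("spark-warehouse")                          // illustrative default
{code}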



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15969) FileNotFoundException: Multiple arguments for py-files flag, (also jars) for spark-submit

2016-06-15 Thread Kun Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kun Liu updated SPARK-15969:

Remaining Estimate: 120h  (was: 168h)
 Original Estimate: 120h  (was: 168h)

> FileNotFoundException: Multiple arguments for py-files flag, (also jars) for 
> spark-submit
> -
>
> Key: SPARK-15969
> URL: https://issues.apache.org/jira/browse/SPARK-15969
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.5.0, 1.6.1
> Environment: Mac OS X 10.11.5
>Reporter: Kun Liu
>Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> This is my first JIRA issue and I am new to the Spark community, so please 
> correct me if I get anything wrong. Thanks.
> A java.io.FileNotFoundException occurs when multiple arguments are specified 
> for the --py-files (and also --jars) flag.
> I searched for a while but only found a similar issue on Windows OS: 
> https://issues.apache.org/jira/browse/SPARK-6435
> My test environment was Mac OS X with Spark versions 1.5.0 and 1.6.1.
> 1.1 Observations:
> 1) Quoting the arguments makes no difference; the result is always the same
> 2) The first path (before the comma), as long as it is valid, is never a 
> problem, whether it is absolute or relative
> 3) The second and further py-files paths are not a problem if ALL of them:
>   a. are relative paths under the working directory 
> ($PWD); OR
>   b. start with an environment variable, e.g. 
> $ENV_VAR/path/to/file; OR
>   c. are preprocessed by $(echo path/to/*.py | tr ' ' ','), whether 
> absolute or relative, as long as they are valid
> 4) The path of the driver program, assuming it is valid, does not matter, as it 
> is a single file
> 1.2 Experiments:
> Assuming main.py calls functions from helper1.py and helper2.py, and all 
> paths below are valid.
> ~/Desktop/testpath: main.py, helper1.py, helper2.py
> $SPARK_HOME/testpath: helper1.py, helper2.py
> 1) Successful output:
>   a. Multiple python paths are relative paths under the same directory as 
> the working directory
>   cd $SPARK_HOME
>   bin/spark-submit --py-files testpath/helper1.py,testpath/helper2.py 
> ~/Desktop/testpath/main.py
>   cd ~/Desktop
>   $SPARK_HOME/bin/spark-submit --py-files 
> testpath/helper1.py,testpath/helper2.py testpath/main.py
>   b. Multiple python paths are specified by using environment variable
>   export TEST_DIR=~/Desktop/testpath
>   cd ~
>   $SPARK_HOME/bin/spark-submit --py-files 
> $TEST_DIR/helper1.py,$TEST_DIR/helper2.py ~/Desktop/testpath/main.py
>   
>   cd ~/Documents
>   $SPARK_HOME/bin/spark-submit --py-files 
> $TEST_DIR/helper1.py,$TEST_DIR/helper2.py ~/Desktop/testpath/main.py
>   c. Multiple paths (absolute or relative) after being preprocessed:
>   $SPARK_HOME/bin/spark-submit --py-files $(echo 
> $SPARK_HOME/testpath/helper*.py | tr ' ' ',') ~/Desktop/testpath/main.py 
>   cd ~/Desktop
>   $SPARK_HOME/bin/spark-submit --py-files $(echo testpath/helper*.py | tr 
> ' ' ',') ~/Desktop/testpath/main.py 
>   (reference link: 
> http://stackoverflow.com/questions/24855368/spark-throws-classnotfoundexception-when-using-jars-option)
> 2) Failure output: if the second python path is an absolute one; the same 
> problem will happen for further paths
>   cd ~/Documents
>   $SPARK_HOME/bin/spark-submit --py-files 
> ~/Desktop/testpath/helper1.py,~/Desktop/testpath/helper2.py 
> ~/Desktop/testpath/main.py 
>   py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
>   : java.io.FileNotFoundException: Added file 
> file:/Users/kunliu/Documents/~/Desktop/testpath/helper2.py does not exist.
> 1.3 Conclusions
> I would suggest that the --py-files flag of spark-submit support absolute path 
> arguments as well, not just relative paths under the working directory.
> If necessary, I would like to submit a pull request and start working on it 
> as my first contribution to the Spark community.
> 1.4 Note
> 1) I think the same issue happens when multiple jar files delimited by commas 
> are passed to the --jars flag for Java applications.
> 2) I suggest that wildcard path arguments also be supported, as indicated 
> by https://issues.apache.org/jira/browse/SPARK-3451



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13850) TimSort Comparison method violates its general contract

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13850.
-
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.6.2

> TimSort Comparison method violates its general contract
> ---
>
> Key: SPARK-13850
> URL: https://issues.apache.org/jira/browse/SPARK-13850
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.0
>Reporter: Sital Kedia
>Assignee: Sameer Agarwal
> Fix For: 1.6.2, 2.0.0
>
>
> While running a query that does a group by on a large dataset, the query 
> fails with the following stack trace. 
> {code}
> Job aborted due to stage failure: Task 4077 in stage 1.3 failed 4 times, most 
> recent failure: Lost task 4077.3 in stage 1.3 (TID 88702, 
> hadoop3030.prn2.facebook.com): java.lang.IllegalArgumentException: Comparison 
> method violates its general contract!
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:794)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:228)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:186)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:318)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:333)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Please note that the same query used to succeed in Spark 1.5 so it seems like 
> a regression in 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15826) PipedRDD to allow configurable char encoding

2016-06-15 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-15826:

Summary: PipedRDD to allow configurable char encoding  (was: PipedRDD to 
allow configurable char encoding (default: UTF-8))

> PipedRDD to allow configurable char encoding
> 
>
> Key: SPARK-15826
> URL: https://issues.apache.org/jira/browse/SPARK-15826
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Tejas Patil
>Priority: Trivial
>
> Encountered an issue wherein the same code works on one cluster but fails on 
> another for the same input. After debugging, realised that PipedRDD picks up 
> the default char encoding from the JVM, which may differ across 
> platforms. Making it use UTF-8 encoding, just like 
> `ScriptTransformation` does.
> Stack trace:
> {noformat}
> Caused by: java.nio.charset.MalformedInputException: Input length = 1
>   at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
>   at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>   at java.io.InputStreamReader.read(InputStreamReader.java:184)
>   at java.io.BufferedReader.fill(BufferedReader.java:161)
>   at java.io.BufferedReader.readLine(BufferedReader.java:324)
>   at java.io.BufferedReader.readLine(BufferedReader.java:389)
>   at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
>   at org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:185)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1612)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
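
A minimal sketch of the general principle behind the fix, in Python rather than 
the Scala change PipedRDD actually received (the command and file name below are 
placeholders): decode a child process's output with an explicit encoding instead 
of whatever the platform defaults to.

{code}
# Illustrative only: read a child process's stdout with an explicit UTF-8
# decoding instead of the platform-default encoding.
import subprocess

proc = subprocess.Popen(["cat", "data.txt"], stdout=subprocess.PIPE)
for raw_line in proc.stdout:
    # Decode explicitly; errors="replace" avoids hard failures on bad bytes.
    line = raw_line.decode("utf-8", errors="replace")
    print(line.rstrip("\n"))
proc.wait()
{code}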



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15970) WARNing message related to persisting table to Hive Metastore while Spark SQL is running in-memory catalog mode

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15970:


Assignee: Apache Spark

> WARNing message related to persisting table to Hive Metastore while Spark SQL 
> is running in-memory catalog mode
> ---
>
> Key: SPARK-15970
> URL: https://issues.apache.org/jira/browse/SPARK-15970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>Assignee: Apache Spark
>Priority: Minor
>
> When we run Spark-shell in in-memory catalog mode, creating a datasource 
> table that is not compatible with Hive will show a warning message saying 
> it cannot persist the table in a Hive-compatible way. However, in-memory 
> catalog mode should not attempt to persist the table to the Hive metastore 
> at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15970) WARNing message related to persisting table to Hive Metastore while Spark SQL is running in-memory catalog mode

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15970:


Assignee: (was: Apache Spark)

> WARNing message related to persisting table to Hive Metastore while Spark SQL 
> is running in-memory catalog mode
> ---
>
> Key: SPARK-15970
> URL: https://issues.apache.org/jira/browse/SPARK-15970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>Priority: Minor
>
> When we run Spark-shell in in-memory catalog mode, creating a datasource 
> table that is not compatible with Hive will show a warning message saying 
> it cannot persist the table in a Hive-compatible way. However, in-memory 
> catalog mode should not attempt to persist the table to the Hive metastore 
> at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15970) WARNing message related to persisting table to Hive Metastore while Spark SQL is running in-memory catalog mode

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332331#comment-15332331
 ] 

Apache Spark commented on SPARK-15970:
--

User 'xwu0226' has created a pull request for this issue:
https://github.com/apache/spark/pull/13687

> WARNing message related to persisting table to Hive Metastore while Spark SQL 
> is running in-memory catalog mode
> ---
>
> Key: SPARK-15970
> URL: https://issues.apache.org/jira/browse/SPARK-15970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>Priority: Minor
>
> When we run Spark-shell in in-memory catalog mode, creating a datasource 
> table that is not compatible with Hive will show a warning message saying 
> it cannot persist the table in a Hive-compatible way. However, in-memory 
> catalog mode should not attempt to persist the table to the Hive metastore 
> at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15826) PipedRDD to allow configurable char encoding

2016-06-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-15826.
--
   Resolution: Fixed
 Assignee: Tejas Patil
Fix Version/s: 2.0.0

> PipedRDD to allow configurable char encoding
> 
>
> Key: SPARK-15826
> URL: https://issues.apache.org/jira/browse/SPARK-15826
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Encountered an issue wherein the same code works on one cluster but fails on 
> another for the same input. After debugging, realised that PipedRDD picks up 
> the default char encoding from the JVM, which may differ across 
> platforms. Making it use UTF-8 encoding, just like 
> `ScriptTransformation` does.
> Stack trace:
> {noformat}
> Caused by: java.nio.charset.MalformedInputException: Input length = 1
>   at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
>   at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>   at java.io.InputStreamReader.read(InputStreamReader.java:184)
>   at java.io.BufferedReader.fill(BufferedReader.java:161)
>   at java.io.BufferedReader.readLine(BufferedReader.java:324)
>   at java.io.BufferedReader.readLine(BufferedReader.java:389)
>   at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
>   at org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:185)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1612)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15971) GroupedData's member incorrectly named

2016-06-15 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-15971:
-

 Summary: GroupedData's member incorrectly named
 Key: SPARK-15971
 URL: https://issues.apache.org/jira/browse/SPARK-15971
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.0.0, 2.1.0
Reporter: Vladimir Feinberg
Priority: Trivial


The [[pyspark.sql.GroupedData]] object refers to the Java object it wraps as 
the member variable [[self._jdf]], which is exactly the same name that 
[[pyspark.sql.DataFrame]] uses when referring to its own Java object.

The acronym is incorrect, standing for "Java DataFrame" instead of what should 
be "Java GroupedData". As such, the name should be changed to [[self._jgd]]; 
in fact, in the [[DataFrame.groupBy]] implementation, the Java object is 
referred to as exactly [[jgd]].
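
A minimal sketch of the suggested rename (illustrative only, not the actual 
pyspark source; the method body is a placeholder):

{code}
# Illustrative sketch: name the wrapped Java object after what it actually is,
# a Java GroupedData, rather than _jdf (which elsewhere means "Java DataFrame").
class GroupedData(object):
    def __init__(self, jgd, sql_ctx):
        self._jgd = jgd        # suggested name: "Java GroupedData"
        self.sql_ctx = sql_ctx

    def count(self):
        # Each method delegates to the wrapped Java GroupedData.
        return self._jgd.count()
{code}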



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15972) GroupedData varargs arguments misnamed

2016-06-15 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-15972:
-

 Summary: GroupedData varargs arguments misnamed
 Key: SPARK-15972
 URL: https://issues.apache.org/jira/browse/SPARK-15972
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.0.0, 2.1.0
Reporter: Vladimir Feinberg
Priority: Trivial


Simple aggregation functions which take column names [[cols]] as varargs 
arguments show up in documentation with the argument [[args]], but their 
documentation refers to [[cols]].

The discrepancy is caused by an annotation, [[df_varargs_api]], which produces 
a temporary function with arguments [[args]] instead of [[cols]], creating the 
confusing documentation.
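
A minimal sketch of how such a discrepancy arises (a hypothetical simplified 
decorator, not pyspark's actual df_varargs_api): documentation tools read the 
wrapper's signature, which exposes *args, while the docstring still talks about 
cols.

{code}
# Hypothetical simplified decorator illustrating the signature mismatch.
def df_varargs_api(f):
    def _api(self, *args):          # generated docs now show "*args"
        return f(self, *args)
    _api.__name__ = f.__name__
    _api.__doc__ = f.__doc__        # docstring still refers to "cols"
    return _api

class GroupedData(object):
    @df_varargs_api
    def mean(self, *cols):
        """Computes the average value for each numeric column in `cols`."""
        return list(cols)           # placeholder body for the sketch
{code}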



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15973) GroupedData.pivot documentation off

2016-06-15 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-15973:
-

 Summary: GroupedData.pivot documentation off
 Key: SPARK-15973
 URL: https://issues.apache.org/jira/browse/SPARK-15973
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.0.0, 2.1.0
Reporter: Vladimir Feinberg
Priority: Trivial


{{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for doctest 
python comments, which messes up formatting in the documentation as well as the 
doctests themselves.

A PR resolving this should probably resolve the other places this happens in 
pyspark.
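
For reference, a small doctest sketch (illustrative, not the pivot docstring 
itself): Python doctest comments must use #, since // is not valid Python and 
breaks both the rendered docs and the doctest run.

{code}
def example():
    """
    >>> values = [1, 2, 3]
    >>> sum(values)  # doctest comments use '#', not '//'
    6
    """

if __name__ == "__main__":
    import doctest
    doctest.testmod()
{code}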



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15971) GroupedData's member incorrectly named

2016-06-15 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-15971:
--
Description: 
The {{pyspark.sql.GroupedData}} object refers to the Java object it wraps as 
the member variable {{self._jdf}}, which is exactly the same name that 
{{pyspark.sql.DataFrame}} uses when referring to its own Java object.

The acronym is incorrect, standing for "Java DataFrame" instead of what should 
be "Java GroupedData". As such, the name should be changed to {{self._jgd}}; 
in fact, in the {{DataFrame.groupBy}} implementation, the Java object is 
referred to as exactly {{jgd}}.

  was:
The [[pyspark.sql.GroupedData]] object refers to the Java object it wraps as 
the member variable [[self._jdf]], which is exactly the same name that 
[[pyspark.sql.DataFrame]] uses when referring to its own Java object.

The acronym is incorrect, standing for "Java DataFrame" instead of what should 
be "Java GroupedData". As such, the name should be changed to [[self._jgd]]; 
in fact, in the [[DataFrame.groupBy]] implementation, the Java object is 
referred to as exactly [[jgd]].


> GroupedData's member incorrectly named
> --
>
> Key: SPARK-15971
> URL: https://issues.apache.org/jira/browse/SPARK-15971
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The {{pyspark.sql.GroupedData}} object refers to the Java object it wraps as 
> the member variable {{self._jdf}}, which is exactly the same name that 
> {{pyspark.sql.DataFrame}} uses when referring to its own Java object.
> The acronym is incorrect, standing for "Java DataFrame" instead of what 
> should be "Java GroupedData". As such, the name should be changed to 
> {{self._jgd}}; in fact, in the {{DataFrame.groupBy}} implementation, the 
> Java object is referred to as exactly {{jgd}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15972) GroupedData varargs arguments misnamed

2016-06-15 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-15972:
--
Description: 
Simple aggregation functions which take column names {{cols}} as varargs 
arguments show up in documentation with the argument {{args}}, but their 
documentation refers to {{cols}}.

The discrepancy is caused by an annotation, {{df_varargs_api}}, which produces 
a temporary function with arguments {{args}} instead of {{cols}}, creating the 
confusing documentation.


  was:
Simple aggregation functions which take column names [[cols]] as varargs 
arguments show up in documentation with the argument [[args]], but their 
documentation refers to [[cols]].

The discrepancy is caused by an annotation, [[df_varargs_api]], which produces 
a temporary function with arguments [[args]] instead of [[cols]], creating the 
confusing documentation.


> GroupedData varargs arguments misnamed
> --
>
> Key: SPARK-15972
> URL: https://issues.apache.org/jira/browse/SPARK-15972
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>
> Simple aggregation functions which take column names {{cols}} as varargs 
> arguments show up in documentation with the argument {{args}}, but their 
> documentation refers to {{cols}}.
> The discrepancy is caused by an annotation, {{df_varargs_api}}, which 
> produces a temporary function with arguments {{args}} instead of {{cols}}, 
> creating the confusing documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15973) GroupedData.pivot documentation off

2016-06-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332364#comment-15332364
 ] 

Sean Owen commented on SPARK-15973:
---

Please group your 3 JIRAs into one. They sound so similar that they should not 
be separate issues. I'll resolve 2 as duplicates shortly.

> GroupedData.pivot documentation off
> ---
>
> Key: SPARK-15973
> URL: https://issues.apache.org/jira/browse/SPARK-15973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> {{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for doctest 
> python comments, which messes up formatting in the documentation as well as 
> the doctests themselves.
> A PR resolving this should probably resolve the other places this happens in 
> pyspark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15965) No FileSystem for scheme: s3n or s3a spark-2.0.0 and spark-1.6.1

2016-06-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332369#comment-15332369
 ] 

Sean Owen commented on SPARK-15965:
---

CC [~steve_l] but I think this is your classpath issue or a Hadoop issue, not 
Spark.

> No FileSystem for scheme: s3n or s3a  spark-2.0.0 and spark-1.6.1
> -
>
> Key: SPARK-15965
> URL: https://issues.apache.org/jira/browse/SPARK-15965
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.6.1
> Environment: Debian GNU/Linux 8
> java version "1.7.0_79"
>Reporter: thauvin damien
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> The Spark programming guide explains that Spark can create distributed 
> datasets on Amazon S3. 
> But since the "pre-built for Hadoop 2.6" package, S3 access doesn't work with 
> s3n or s3a. 
> sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", "XXXZZZHHH")
> sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", 
> "xxx")
> val 
> lines=sc.textFile("s3a://poc-XXX/access/2016/02/20160201202001_xxx.log.gz")
> java.lang.RuntimeException: java.lang.ClassNotFoundException: Class 
> org.apache.hadoop.fs.s3a.S3AFileSystem not found
> This happens with any version of Spark: spark-1.3.1, spark-1.6.1, even 
> spark-2.0.0 with Hadoop 2.7.
> I understand this is a Hadoop issue (SPARK-7442), but could you add some 
> documentation explaining which jars we need to add and where (for a standalone 
> installation)?
> Are "hadoop-aws-x.x.x.jar" and "aws-java-sdk-x.x.x.jar" enough? 
> Which environment variables do we need to set, and which files do we need to 
> modify? Is it "$CLASSPATH", or variables in "spark-defaults.conf" such as 
> "spark.driver.extraClassPath" and "spark.executor.extraClassPath"?
> It still works with spark-1.6.1 pre-built with Hadoop 2.4. 
> Thanks 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15973) Fix GroupedData Documentation

2016-06-15 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-15973:
--
Summary: Fix GroupedData Documentation  (was: GroupedData.pivot 
documentation off)

> Fix GroupedData Documentation
> -
>
> Key: SPARK-15973
> URL: https://issues.apache.org/jira/browse/SPARK-15973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> {{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for doctest 
> python comments, which messes up formatting in the documentation as well as 
> the doctests themselves.
> A PR resolving this should probably resolve the other places this happens in 
> pyspark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15966) Fix markdown for Spark Monitoring

2016-06-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332371#comment-15332371
 ] 

Sean Owen commented on SPARK-15966:
---

Please make a more descriptive JIRA, and/or submit a PR.

> Fix markdown for Spark Monitoring
> -
>
> Key: SPARK-15966
> URL: https://issues.apache.org/jira/browse/SPARK-15966
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: Dhruve Ashar
>Priority: Trivial
>
> The markdown for Spark monitoring needs to be fixed. 
> http://spark.apache.org/docs/2.0.0-preview/monitoring.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15973) Fix GroupedData Documentation

2016-06-15 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg updated SPARK-15973:
--
Description: 
(1)

{{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for doctest 
python comments, which messes up formatting in the documentation as well as the 
doctests themselves.

A PR resolving this should probably resolve the other places this happens in 
pyspark.

(2)

Simple aggregation functions which take column names {{cols}} as varargs 
arguments show up in documentation with the argument {{args}}, but their 
documentation refers to {{cols}}.

The discrepancy is caused by an annotation, {{df_varargs_api}}, which produces 
a temporary function with arguments {{args}} instead of {{cols}}, creating the 
confusing documentation.

(3)

The {{pyspark.sql.GroupedData}} object refers to the Java object it wraps as 
the member variable {{self._jdf}}, which is exactly the same name that 
{{pyspark.sql.DataFrame}} uses when referring to its own Java object.

The acronym is incorrect, standing for "Java DataFrame" instead of what should 
be "Java GroupedData". As such, the name should be changed to {{self._jgd}}; 
in fact, in the {{DataFrame.groupBy}} implementation, the Java object is 
referred to as exactly {{jgd}}.

  was:
{{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for doctest 
python comments, which messes up formatting in the documentation as well as the 
doctests themselves.

A PR resolving this should probably resolve the other places this happens in 
pyspark.


> Fix GroupedData Documentation
> -
>
> Key: SPARK-15973
> URL: https://issues.apache.org/jira/browse/SPARK-15973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> (1)
> {{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for doctest 
> python comments, which messes up formatting in the documentation as well as 
> the doctests themselves.
> A PR resolving this should probably resolve the other places this happens in 
> pyspark.
> (2)
> Simple aggregation functions which take column names {{cols}} as varargs 
> arguments show up in documentation with the argument {{args}}, but their 
> documentation refers to {{cols}}.
> The discrepancy is caused by an annotation, {{df_varargs_api}}, which 
> produces a temporary function with arguments {{args}} instead of {{cols}}, 
> creating the confusing documentation.
> (3)
> The {{pyspark.sql.GroupedData}} object refers to the Java object it wraps 
> as the member variable {{self._jdf}}, which is exactly the same name that 
> {{pyspark.sql.DataFrame}} uses when referring to its own Java object.
> The acronym is incorrect, standing for "Java DataFrame" instead of what 
> should be "Java GroupedData". As such, the name should be changed to 
> {{self._jgd}}; in fact, in the {{DataFrame.groupBy}} implementation, the 
> Java object is referred to as exactly {{jgd}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15972) GroupedData varargs arguments misnamed

2016-06-15 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg resolved SPARK-15972.
---
Resolution: Duplicate

> GroupedData varargs arguments misnamed
> --
>
> Key: SPARK-15972
> URL: https://issues.apache.org/jira/browse/SPARK-15972
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>
> Simple aggregation functions which take column names {{cols}} as varargs 
> arguments show up in documentation with the argument {{args}}, but their 
> documentation refers to {{cols}}.
> The discrepancy is caused by an annotation, {{df_varargs_api}}, which 
> produces a temporary function with arguments {{args}} instead of {{cols}}, 
> creating the confusing documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15972) GroupedData varargs arguments misnamed

2016-06-15 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg closed SPARK-15972.
-

> GroupedData varargs arguments misnamed
> --
>
> Key: SPARK-15972
> URL: https://issues.apache.org/jira/browse/SPARK-15972
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>
> Simple aggregation functions which take column names {{cols}} as varargs 
> arguments show up in documentation with the argument {{args}}, but their 
> documentation refers to {{cols}}.
> The discrepancy is caused by an annotation, {{df_varargs_api}}, which 
> produces a temporary function with arguments {{args}} instead of {{cols}}, 
> creating the confusing documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15971) GroupedData's member incorrectly named

2016-06-15 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg closed SPARK-15971.
-

> GroupedData's member incorrectly named
> --
>
> Key: SPARK-15971
> URL: https://issues.apache.org/jira/browse/SPARK-15971
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The {{pyspark.sql.GroupedData}} object refers to the Java object it wraps 
> as the member variable {{self._jdf}}, which is exactly the same name that 
> {{pyspark.sql.DataFrame}} uses when referring to its own Java object.
> The acronym is incorrect, standing for "Java DataFrame" instead of what 
> should be "Java GroupedData". As such, the name should be changed to 
> {{self._jgd}}; in fact, in the {{DataFrame.groupBy}} implementation, the 
> Java object is referred to as exactly {{jgd}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15971) GroupedData's member incorrectly named

2016-06-15 Thread Vladimir Feinberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Feinberg resolved SPARK-15971.
---
Resolution: Duplicate

> GroupedData's member incorrectly named
> --
>
> Key: SPARK-15971
> URL: https://issues.apache.org/jira/browse/SPARK-15971
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The {{pyspark.sql.GroupedData}} object refers to the Java object it wraps 
> as the member variable {{self._jdf}}, which is exactly the same name that 
> {{pyspark.sql.DataFrame}} uses when referring to its own Java object.
> The acronym is incorrect, standing for "Java DataFrame" instead of what 
> should be "Java GroupedData". As such, the name should be changed to 
> {{self._jgd}}; in fact, in the {{DataFrame.groupBy}} implementation, the 
> Java object is referred to as exactly {{jgd}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15947) Make pipeline components backward compatible with old vector columns in Scala/Java

2016-06-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-15947:
-

Assignee: Xiangrui Meng

> Make pipeline components backward compatible with old vector columns in 
> Scala/Java
> --
>
> Key: SPARK-15947
> URL: https://issues.apache.org/jira/browse/SPARK-15947
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> After SPARK-15945, we should make ALL pipeline components accept old vector 
> columns as input and do the conversion automatically (probably with a warning 
> message), in order to smooth the migration to 2.0. Note that this includes 
> loading old saved models.
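
For context, a rough sketch of the explicit conversion users would otherwise 
have to do by hand (assuming MLUtils.convertVectorColumnsToML, added alongside 
the new ml.linalg types, is available; illustrative only, not the automatic 
mechanism this issue proposes):

{code}
# Hedged sketch: convert old mllib vector columns to new ml vectors explicitly.
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors as OldVectors
from pyspark.mllib.util import MLUtils

spark = SparkSession.builder.master("local[2]").appName("vec-compat").getOrCreate()
df = spark.createDataFrame([(0, OldVectors.dense(1.0, 2.0))], ["id", "features"])

# Pipeline components that expect ml.linalg vectors need this column converted;
# the proposal is to do it automatically, with a warning, inside each component.
converted = MLUtils.convertVectorColumnsToML(df, "features")
converted.printSchema()
{code}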



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15973) Fix GroupedData Documentation

2016-06-15 Thread Vladimir Feinberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332374#comment-15332374
 ] 

Vladimir Feinberg commented on SPARK-15973:
---

Done

> Fix GroupedData Documentation
> -
>
> Key: SPARK-15973
> URL: https://issues.apache.org/jira/browse/SPARK-15973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Vladimir Feinberg
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> (1)
> {{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for doctest 
> python comments, which messes up formatting in the documentation as well as 
> the doctests themselves.
> A PR resolving this should probably resolve the other places this happens in 
> pyspark.
> (2)
> Simple aggregation functions which take column names {{cols}} as varargs 
> arguments show up in documentation with the argument {{args}}, but their 
> documentation refers to {{cols}}.
> The discrepancy is caused by an annotation, {{df_varargs_api}}, which 
> produces a temporary function with arguments {{args}} instead of {{cols}}, 
> creating the confusing documentation.
> (3)
> The {{pyspark.sql.GroupedData}} object refers to the Java object it wraps 
> as the member variable {{self._jdf}}, which is exactly the same name that 
> {{pyspark.sql.DataFrame}} uses when referring to its own Java object.
> The acronym is incorrect, standing for "Java DataFrame" instead of what 
> should be "Java GroupedData". As such, the name should be changed to 
> {{self._jgd}}; in fact, in the {{DataFrame.groupBy}} implementation, the 
> Java object is referred to as exactly {{jgd}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15811) Python UDFs do not work in Spark 2.0-preview built with scala 2.10

2016-06-15 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15811:
---
Description: 
I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following

{code}
./dev/change-version-to-2.10.sh
./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
-Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
{code}

and then ran the following code in a pyspark shell

{code}
from pyspark.sql.types import IntegerType, StructField, StructType
from pyspark.sql.functions import udf
from pyspark.sql.types import Row
spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
add_one = udf(lambda x: x + 1, IntegerType())
schema = StructType([StructField('a', IntegerType(), False)])
df = sqlContext.createDataFrame([Row(a=1),Row(a=2)], schema)
df.select(add_one(df.a).alias('incremented')).collect()
{code}

This never returns with a result. 


  was:
I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following

{code}
./dev/change-version-to-2.10.sh
./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
-Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
{code}

and then ran the following code in a pyspark shell

{code}
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StructField, StructType
from pyspark.sql.functions import udf
from pyspark.sql.types import Row
spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
add_one = udf(lambda x: x + 1, IntegerType())
schema = StructType([StructField('a', IntegerType(), False)])
df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
df.select(add_one(df.a).alias('incremented')).collect()
{code}

This never returns with a result. 



> Python UDFs do not work in Spark 2.0-preview built with scala 2.10
> --
>
> Key: SPARK-15811
> URL: https://issues.apache.org/jira/browse/SPARK-15811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>Assignee: Davies Liu
>Priority: Blocker
>
> I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
> {code}
> ./dev/change-version-to-2.10.sh
> ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
> -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
> {code}
> and then ran the following code in a pyspark shell
> {code}
> from pyspark.sql.types import IntegerType, StructField, StructType
> from pyspark.sql.functions import udf
> from pyspark.sql.types import Row
> spark = SparkSession.builder.master('local[4]').appName('2.0 
> DF').getOrCreate()
> add_one = udf(lambda x: x + 1, IntegerType())
> schema = StructType([StructField('a', IntegerType(), False)])
> df = sqlContext.createDataFrame([Row(a=1),Row(a=2)], schema)
> df.select(add_one(df.a).alias('incremented')).collect()
> {code}
> This never returns with a result. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3847) Enum.hashCode is only consistent within the same JVM

2016-06-15 Thread Brett Stime (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332453#comment-15332453
 ] 

Brett Stime commented on SPARK-3847:


Seems like, rather than a warning about the specifics of enums, the real fix 
(as mentioned in the highest voted answer to the question posted in the 
description--http://stackoverflow.com/a/4885292/93345) is to stop comparing 
hashCodes across distinct JVMs. In the worst case, perhaps the underlying keys 
should be deserialized in the target JVM and have their hashCodes recomputed. 
Seems like it should alternatively work to create an implementation of hashCode 
and equals for the serialized bytes.

> Enum.hashCode is only consistent within the same JVM
> 
>
> Key: SPARK-3847
> URL: https://issues.apache.org/jira/browse/SPARK-3847
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Oracle JDK 7u51 64bit on Ubuntu 12.04
>Reporter: Nathan Bijnens
>  Labels: enum
>
> When using Java enums as keys in some operations, the results can be very 
> unexpected. The issue is that Java's Enum.hashCode returns the 
> memory position, which is different on each JVM. 
> {code}
> messages.filter(_.getHeader.getKind == Kind.EVENT).count
> >> 503650
> val tmp = messages.filter(_.getHeader.getKind == Kind.EVENT)
> tmp.map(_.getHeader.getKind).countByValue
> >> Map(EVENT -> 1389)
> {code}
> Because it's actually a JVM issue, we should either reject enums as keys with 
> an error or implement a workaround.
> A good writeup of the issue can be found here (and a workaround):
> http://dev.bizo.com/2014/02/beware-enums-in-spark.html
> Somewhat more on the hash codes and Enum's:
> https://stackoverflow.com/questions/4885095/what-is-the-reason-behind-enum-hashcode
> And some issues (most of them rejected) at the Oracle Bug Java database:
> - http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8050217
> - http://bugs.java.com/bugdatabase/view_bug.do?bug_id=7190798



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15574) Python meta-algorithms in Scala

2016-06-15 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332475#comment-15332475
 ] 

Xusen Yin commented on SPARK-15574:
---

I just finished a prototype of PythonTransformer in Scala as the transformer 
wrapper for pure Python transformers. It works well if I run it alone from the 
Scala side. But if I chain the PythonTransformer with other 
transformers/estimators in a Pipeline, it fails because transformSchema is 
missing on the Python side. AFAIK, we need to add transformSchema to Python ML 
for pure Python PipelineStages. 
[~josephkb] [~mengxr]
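
A hedged sketch of what a transformSchema hook on a pure-Python transformer 
might look like (hypothetical: pyspark.ml does not expose transformSchema today, 
and the column names below are made up):

{code}
from pyspark.ml import Transformer
from pyspark.sql.types import StructType, StructField, DoubleType


class LengthTransformer(Transformer):
    """Adds a 'length' column computed from a 'text' column (illustration)."""

    def transformSchema(self, schema):
        # Hypothetical schema-propagation hook mirroring the Scala API:
        # validate the input column and declare the output column.
        if "text" not in schema.fieldNames():
            raise ValueError("input column 'text' is missing")
        return StructType(schema.fields + [StructField("length", DoubleType(), True)])

    def _transform(self, dataset):
        from pyspark.sql.functions import length, col
        return dataset.withColumn("length", length(col("text")).cast("double"))
{code}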

> Python meta-algorithms in Scala
> ---
>
> Key: SPARK-15574
> URL: https://issues.apache.org/jira/browse/SPARK-15574
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> This is an experimental idea for implementing Python ML meta-algorithms 
> (CrossValidator, TrainValidationSplit, Pipeline, OneVsRest, etc.) in Scala.  
> This would require a Scala wrapper for algorithms implemented in Python, 
> somewhat analogous to Python UDFs.
> The benefit of this change would be that we could avoid currently awkward 
> conversions between Scala/Python meta-algorithms required for persistence.  
> It would let us have full support for Python persistence and would generally 
> simplify the implementation within MLlib.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12492) SQL page of Spark-sql is always blank

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332509#comment-15332509
 ] 

Apache Spark commented on SPARK-12492:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/13689

> SQL page of Spark-sql is always blank 
> --
>
> Key: SPARK-12492
> URL: https://issues.apache.org/jira/browse/SPARK-12492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Reporter: meiyoula
> Attachments: screenshot-1.png
>
>
> When I run a SQL query in spark-sql, the Execution page of the SQL tab is 
> always blank, but for the JDBCServer it is not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15888) Python UDF over aggregate fails

2016-06-15 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15888.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13682
[https://github.com/apache/spark/pull/13682]

> Python UDF over aggregate fails
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Vladimir Feinberg
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 2.0.0
>
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15767) Decision Tree Regression wrapper in SparkR

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15767:


Assignee: Kai Jiang  (was: Apache Spark)

> Decision Tree Regression wrapper in SparkR
> --
>
> Key: SPARK-15767
> URL: https://issues.apache.org/jira/browse/SPARK-15767
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Kai Jiang
>Assignee: Kai Jiang
>
> Implement a wrapper in SparkR to support decision tree regression. R's naive 
> decision tree regression implementation comes from the rpart package, with 
> signature rpart(formula, dataframe, method="anova"). I propose we implement an 
> API like spark.decisionTreeRegression(dataframe, formula, ...). After having 
> implemented decision tree classification, we could refactor these two into an 
> API more like rpart().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15767) Decision Tree Regression wrapper in SparkR

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332519#comment-15332519
 ] 

Apache Spark commented on SPARK-15767:
--

User 'vectorijk' has created a pull request for this issue:
https://github.com/apache/spark/pull/13690

> Decision Tree Regression wrapper in SparkR
> --
>
> Key: SPARK-15767
> URL: https://issues.apache.org/jira/browse/SPARK-15767
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Kai Jiang
>Assignee: Kai Jiang
>
> Implement a wrapper in SparkR to support decision tree regression. R's naive 
> decision tree regression implementation comes from the rpart package, with 
> signature rpart(formula, dataframe, method="anova"). I propose we implement an 
> API like spark.decisionTreeRegression(dataframe, formula, ...). After having 
> implemented decision tree classification, we could refactor these two into an 
> API more like rpart().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15767) Decision Tree Regression wrapper in SparkR

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15767:


Assignee: Apache Spark  (was: Kai Jiang)

> Decision Tree Regression wrapper in SparkR
> --
>
> Key: SPARK-15767
> URL: https://issues.apache.org/jira/browse/SPARK-15767
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Kai Jiang
>Assignee: Apache Spark
>
> Implement a wrapper in SparkR to support decision tree regression. R's naive 
> decision tree regression implementation comes from the rpart package, with 
> signature rpart(formula, dataframe, method="anova"). I propose we implement an 
> API like spark.decisionTreeRegression(dataframe, formula, ...). After having 
> implemented decision tree classification, we could refactor these two into an 
> API more like rpart().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15457) Eliminate MLlib 2.0 build warnings from deprecations

2016-06-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-15457.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

I'm going to go ahead and close this.  We can create a new JIRA if there are 
more to clean up.

> Eliminate MLlib 2.0 build warnings from deprecations
> 
>
> Key: SPARK-15457
> URL: https://issues.apache.org/jira/browse/SPARK-15457
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Several classes and methods have been deprecated and are creating lots of 
> build warnings in branch-2.0.  This issue is to identify and fix those items:
> * *WithSGD classes: Change to make class not deprecated, object deprecated, 
> and public class constructor deprecated.  Any public use will require a 
> deprecated API.  We need to keep a non-deprecated private API since we cannot 
> eliminate certain uses: Python API, streaming algs, and examples.
> ** Use in PythonMLlibAPI: Change to using private constructors
> ** Streaming algs: No warnings after we un-deprecate the classes
> ** Examples: Deprecate or change ones which use deprecated APIs
> * MulticlassMetrics fields (precision, etc.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15786) joinWith bytecode generation calling ByteBuffer.wrap with InternalRow

2016-06-15 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332527#comment-15332527
 ] 

Sean Zhong commented on SPARK-15786:


Basically, what you described can be shortened to:

{code}
scala> val ds = Seq((1,1) -> (1, 1)).toDS()
res4: org.apache.spark.sql.Dataset[((Int, Int), (Int, Int))] = [_1: struct<_1: 
int, _2: int>, _2: struct<_1: int, _2: int>]

scala> implicit val enc = Encoders.tuple(Encoders.kryo[Option[(Int, Int)]], 
Encoders.kryo[Option[(Int, Int)]])
enc: org.apache.spark.sql.Encoder[(Option[(Int, Int)], Option[(Int, Int)])] = 
class[_1[0]: binary, _2[0]: binary]

scala> ds.as[(Option[(Int, Int)], Option[(Int, Int)])].collect()
{code}

> joinWith bytecode generation calling ByteBuffer.wrap with InternalRow
> -
>
> Key: SPARK-15786
> URL: https://issues.apache.org/jira/browse/SPARK-15786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Richard Marscher
>
> {code}java.lang.RuntimeException: Error while decoding: 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 36, Column 107: No applicable constructor/method found 
> for actual parameters "org.apache.spark.sql.catalyst.InternalRow"; candidates 
> are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", 
> "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, 
> int)"{code}
> I have been trying to use joinWith along with Option data types to approximate 
> the RDD semantics for outer joins with Dataset and get a nicer API for Scala. 
> However, using the Dataset.as[] syntax leads to bytecode generation that tries 
> to pass an InternalRow object into the ByteBuffer.wrap function, which expects 
> a byte[] with or without a couple of int arguments.
> I have a notebook reproducing this against 2.0 preview in Databricks 
> Community Edition: 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/160347920874755/1039589581260901/673639177603143/latest.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15786) joinWith bytecode generation calling ByteBuffer.wrap with InternalRow

2016-06-15 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332532#comment-15332532
 ] 

Sean Zhong commented on SPARK-15786:


The exception stack is:
{code}
scala> res4.as[(Option[(Int, Int)], Option[(Int, Int)])].collect()
16/06/15 13:46:18 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 37, 
Column 109: No applicable constructor/method found for actual parameters 
"org.apache.spark.sql.catalyst.InternalRow"; candidates are: "public static 
java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", "public static 
java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, int)"
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificSafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificSafeProjection extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private MutableRow mutableRow;
/* 009 */   private org.apache.spark.serializer.KryoSerializerInstance 
serializer;
/* 010 */   private org.apache.spark.serializer.KryoSerializerInstance 
serializer1;
/* 011 */
/* 012 */
/* 013 */   public SpecificSafeProjection(Object[] references) {
/* 014 */ this.references = references;
/* 015 */ mutableRow = (MutableRow) references[references.length - 1];
/* 016 */
/* 017 */ if (org.apache.spark.SparkEnv.get() == null) {
/* 018 */   serializer = 
(org.apache.spark.serializer.KryoSerializerInstance) new 
org.apache.spark.serializer.KryoSerializer(new 
org.apache.spark.SparkConf()).newInstance();
/* 019 */ } else {
/* 020 */   serializer = 
(org.apache.spark.serializer.KryoSerializerInstance) new 
org.apache.spark.serializer.KryoSerializer(org.apache.spark.SparkEnv.get().conf()).newInstance();
/* 021 */ }
/* 022 */
/* 023 */
/* 024 */ if (org.apache.spark.SparkEnv.get() == null) {
/* 025 */   serializer1 = 
(org.apache.spark.serializer.KryoSerializerInstance) new 
org.apache.spark.serializer.KryoSerializer(new 
org.apache.spark.SparkConf()).newInstance();
/* 026 */ } else {
/* 027 */   serializer1 = 
(org.apache.spark.serializer.KryoSerializerInstance) new 
org.apache.spark.serializer.KryoSerializer(org.apache.spark.SparkEnv.get().conf()).newInstance();
/* 028 */ }
/* 029 */
/* 030 */   }
/* 031 */
/* 032 */   public java.lang.Object apply(java.lang.Object _i) {
/* 033 */ InternalRow i = (InternalRow) _i;
/* 034 */
/* 035 */ boolean isNull2 = i.isNullAt(0);
/* 036 */ InternalRow value2 = isNull2 ? null : (i.getStruct(0, 2));
/* 037 */ final scala.Option value1 = isNull2 ? null : (scala.Option) 
serializer.deserialize(java.nio.ByteBuffer.wrap(value2), null);
/* 038 */
/* 039 */ boolean isNull4 = i.isNullAt(1);
/* 040 */ InternalRow value4 = isNull4 ? null : (i.getStruct(1, 2));
/* 041 */ final scala.Option value3 = isNull4 ? null : (scala.Option) 
serializer1.deserialize(java.nio.ByteBuffer.wrap(value4), null);
/* 042 */
/* 043 */
/* 044 */ final scala.Tuple2 value = false ? null : new 
scala.Tuple2(value1, value3);
/* 045 */ if (false) {
/* 046 */   mutableRow.setNullAt(0);
/* 047 */ } else {
/* 048 */
/* 049 */   mutableRow.update(0, value);
/* 050 */ }
/* 051 */
/* 052 */ return mutableRow;
/* 053 */   }
/* 054 */ }

org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 37, 
Column 109: No applicable constructor/method found for actual parameters 
"org.apache.spark.sql.catalyst.InternalRow"; candidates are: "public static 
java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", "public static 
java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, int)"
at 
org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174)
at 
org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:7559)
at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:7429)
at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:7333)
at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5663)
at org.codehaus.janino.UnitCompiler.access$13800(UnitCompiler.java:185)
at 
org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:5132)
at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3971)
at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5159)
at 
org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:7533)
at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:7429)
at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:7333)
at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3873)
at org.codehaus.janino.UnitCompiler.access$6900

[jira] [Commented] (SPARK-1051) On Yarn, executors don't doAs as submitting user

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332539#comment-15332539
 ] 

Apache Spark commented on SPARK-1051:
-

User 'sryza' has created a pull request for this issue:
https://github.com/apache/spark/pull/29

> On Yarn, executors don't doAs as submitting user
> 
>
> Key: SPARK-1051
> URL: https://issues.apache.org/jira/browse/SPARK-1051
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 0.9.0
>Reporter: Sandy Pérez González
>Assignee: Sandy Ryza
> Fix For: 0.9.1, 1.0.0
>
>
> This means that they can't read or write files that the yarn user doesn't 
> have permissions to but the submitting user does. I don't think this is a 
> problem when running with Hadoop security, because the executor processes 
> will be run as the submitting user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation

2016-06-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15906:
--
Issue Type: Improvement  (was: New Feature)

> Complementary Naive Bayes Algorithm Implementation
> --
>
> Key: SPARK-15906
> URL: https://issues.apache.org/jira/browse/SPARK-15906
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: MIN-FU YANG
>Priority: Minor
>
> Improve the Naive Bayes algorithm on skewed data according to
> "Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2:
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
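
To make the proposal concrete, a rough Scala sketch of Complement Naive Bayes as I read the Rennie et al. paper (names and structure are mine, not the proposed MLlib implementation): each class's term weights are estimated from the documents of every other class, which reduces the bias toward large classes on skewed data.

{code}
object ComplementNBSketch {
  // docs: (label, termCounts) pairs with labels in [0, numClasses); alpha is Laplace smoothing.
  def complementWeights(
      docs: Seq[(Int, Array[Double])],
      numClasses: Int,
      vocabSize: Int,
      alpha: Double = 1.0): Array[Array[Double]] = {
    Array.tabulate(numClasses) { c =>
      // Term totals over every document NOT belonging to class c (the "complement").
      val totals = new Array[Double](vocabSize)
      for ((label, counts) <- docs if label != c; i <- 0 until vocabSize) totals(i) += counts(i)
      val denom = totals.sum + alpha * vocabSize
      Array.tabulate(vocabSize)(i => math.log((totals(i) + alpha) / denom))
    }
  }

  // Complement NB assigns the class whose complement weights give the smallest score.
  def predict(weights: Array[Array[Double]], termCounts: Array[Double]): Int =
    weights.indices.minBy(c => termCounts.indices.map(i => termCounts(i) * weights(c)(i)).sum)

  def main(args: Array[String]): Unit = {
    // Toy corpus: class 1 has three documents, class 0 only one (skewed sizes).
    val docs = Seq(
      (0, Array(3.0, 0.0, 0.0)),
      (1, Array(0.0, 2.0, 1.0)),
      (1, Array(0.0, 1.0, 2.0)),
      (1, Array(1.0, 2.0, 2.0)))
    val w = complementWeights(docs, numClasses = 2, vocabSize = 3)
    println(predict(w, Array(2.0, 0.0, 0.0)))  // expected: 0
  }
}
{code}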



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation

2016-06-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15906:
--
Priority: Minor  (was: Major)

> Complementary Naive Bayes Algorithm Implementation
> --
>
> Key: SPARK-15906
> URL: https://issues.apache.org/jira/browse/SPARK-15906
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: MIN-FU YANG
>Priority: Minor
>
> Improve the Naive Bayes algorithm on skewed data according to
> "Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2:
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation

2016-06-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332536#comment-15332536
 ] 

Joseph K. Bradley commented on SPARK-15906:
---

Can you provide more info about what the proposal does in this JIRA?  Also, do 
you have more references to indicate this is needed, such as other ML libraries 
with this improvement or other papers showing similar results?

> Complementary Naive Bayes Algorithm Implementation
> --
>
> Key: SPARK-15906
> URL: https://issues.apache.org/jira/browse/SPARK-15906
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: MIN-FU YANG
>
> Improve the Naive Bayes algorithm on skewed data according to
> "Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2:
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12978) Skip unnecessary final group-by when input data already clustered with group-by keys

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12978:

Target Version/s: 2.1.0  (was: 2.0.0)

> Skip unnecessary final group-by when input data already clustered with 
> group-by keys
> 
>
> Key: SPARK-12978
> URL: https://issues.apache.org/jira/browse/SPARK-12978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>
> This ticket targets an optimization that skips the unnecessary final group-by
> operation shown below:
> Without opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
>  output=[col0#159,sum#200,sum#201,count#202L])
>+- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
> InMemoryRelation [col0#159,col1#160,col2#161], true, 1, 
> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
> {code}
> With opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
> [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
> true, 1), ConvertToUnsafe, None
> {code}
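
For illustration, a small hypothetical example (table and column names simply mirror the plans above) of one way an input already clustered by the group-by key can arise; as I understand the proposal, on affected versions the aggregate is still planned as a Partial/Final pair in this situation, which is the redundancy the ticket wants to remove.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, sum}

object GroupByOnClusteredInput {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("skip-final-agg").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1, 1.0), ("a", 2, 2.0), ("b", 3, 3.0))
      .toDF("col0", "col1", "col2")
      .repartition($"col0")   // input is now clustered by the group-by key
      .cache()

    val agg = df.groupBy($"col0").agg(sum($"col1"), avg($"col2"))
    agg.explain()             // compare with the two plans quoted above
    agg.show()

    spark.stop()
  }
}
{code}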



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15715) Altering partition storage information doesn't work in Hive

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15715.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Altering partition storage information doesn't work in Hive
> ---
>
> Key: SPARK-15715
> URL: https://issues.apache.org/jira/browse/SPARK-15715
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> In HiveClientImpl
> {code}
>   private def toHivePartition(
>       p: CatalogTablePartition,
>       ht: HiveTable): HivePartition = {
>     new HivePartition(ht, p.spec.asJava,
>       p.storage.locationUri.map { l => new Path(l) }.orNull)
>   }
> {code}
> Other than the location, we don't even store any of the storage information 
> in the metastore: output format, input format, serde, serde props. The result 
> is that doing something like the following doesn't actually do anything:
> {code}
> ALTER TABLE boxes PARTITION (width=3)
> SET SERDE 'com.sparkbricks.serde.ColumnarSerDe'
> WITH SERDEPROPERTIES ('compress'='true')
> {code}
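
For what it is worth, a rough Scala sketch of what carrying the rest of the storage information might look like, built on the Hive metastore Thrift classes (an illustration only, not the committed fix; the mapping of fields is an assumption):

{code}
import scala.collection.JavaConverters._
import org.apache.hadoop.hive.metastore.api.{SerDeInfo, StorageDescriptor}

object PartitionStorageSketch {
  // Build a StorageDescriptor that carries serde, formats and properties,
  // not just the location, from catalog-level storage fields.
  def toStorageDescriptor(
      location: Option[String],
      inputFormat: Option[String],
      outputFormat: Option[String],
      serde: Option[String],
      serdeProperties: Map[String, String]): StorageDescriptor = {
    val sd = new StorageDescriptor()
    location.foreach(l => sd.setLocation(l))
    inputFormat.foreach(f => sd.setInputFormat(f))
    outputFormat.foreach(f => sd.setOutputFormat(f))
    val serdeInfo = new SerDeInfo()
    serde.foreach(s => serdeInfo.setSerializationLib(s))
    serdeInfo.setParameters(serdeProperties.asJava)
    sd.setSerdeInfo(serdeInfo)
    sd
  }
}
{code}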



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


