[jira] [Assigned] (SPARK-15888) Python UDF over aggregate fails
[ https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15888: Assignee: Apache Spark (was: Davies Liu) > Python UDF over aggregate fails > --- > > Key: SPARK-15888 > URL: https://issues.apache.org/jira/browse/SPARK-15888 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Vladimir Feinberg >Assignee: Apache Spark >Priority: Blocker > > This looks like a regression from 1.6.1. > The following notebook runs without error in a Spark 1.6.1 cluster, but fails > in 2.0.0: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15888) Python UDF over aggregate fails
[ https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331266#comment-15331266 ] Apache Spark commented on SPARK-15888: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/13682 > Python UDF over aggregate fails > --- > > Key: SPARK-15888 > URL: https://issues.apache.org/jira/browse/SPARK-15888 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Vladimir Feinberg >Assignee: Davies Liu >Priority: Blocker > > This looks like a regression from 1.6.1. > The following notebook runs without error in a Spark 1.6.1 cluster, but fails > in 2.0.0: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15888) Python UDF over aggregate fails
[ https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15888: Assignee: Davies Liu (was: Apache Spark) > Python UDF over aggregate fails > --- > > Key: SPARK-15888 > URL: https://issues.apache.org/jira/browse/SPARK-15888 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Vladimir Feinberg >Assignee: Davies Liu >Priority: Blocker > > This looks like a regression from 1.6.1. > The following notebook runs without error in a Spark 1.6.1 cluster, but fails > in 2.0.0: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15815) Hang while enable blacklistExecutor and DynamicExecutorAllocator
[ https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331285#comment-15331285 ] SuYan edited comment on SPARK-15815 at 6/15/16 7:17 AM: I see... although it can solve the hang problem, for Dynamic Allocation that solution seems a little crude, because Spark is able to get other executors so the task can reach its 4 failures or finally succeed. was (Author: suyan): I see... although it can solve the gang problem, but for Dynamic Allocate, that solution seems a little rude, because spark have the ability to got another executors to complete 4 times failure or success finally > Hang while enable blacklistExecutor and DynamicExecutorAllocator > - > > Key: SPARK-15815 > URL: https://issues.apache.org/jira/browse/SPARK-15815 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.6.1 >Reporter: SuYan >Priority: Minor > > Enable BlacklistExecutor with a blacklist time larger than 120s, and enable DynamicAllocation with minExecutors = 0. > 1. Assume only 1 task is left running, on Executor A, and all other executors have timed out. > 2. The task fails, so it will not be scheduled on Executor A again because of the blacklist time. > 3. ExecutorAllocationManager always requests targetNumExecutors = 1; since we already have Executor A, oldTargetNumExecutors == targetNumExecutors == 1, so it will never add more executors... even if Executor A has timed out. It endlessly requests delta = 0 executors.
[jira] [Commented] (SPARK-15815) Hang while enable blacklistExecutor and DynamicExecutorAllocator
[ https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331285#comment-15331285 ] SuYan commented on SPARK-15815: --- I see... although it can solve the gang problem, but for Dynamic Allocate, that solution seems a little rude, because spark have the ability to got another executors to complete 4 times failure or success finally > Hang while enable blacklistExecutor and DynamicExecutorAllocator > - > > Key: SPARK-15815 > URL: https://issues.apache.org/jira/browse/SPARK-15815 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.6.1 >Reporter: SuYan >Priority: Minor > > Enable BlacklistExecutor with a blacklist time larger than 120s, and enable DynamicAllocation with minExecutors = 0. > 1. Assume only 1 task is left running, on Executor A, and all other executors have timed out. > 2. The task fails, so it will not be scheduled on Executor A again because of the blacklist time. > 3. ExecutorAllocationManager always requests targetNumExecutors = 1; since we already have Executor A, oldTargetNumExecutors == targetNumExecutors == 1, so it will never add more executors... even if Executor A has timed out. It endlessly requests delta = 0 executors.
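The feedback loop described in step 3 can be sketched in a few lines. The names below (`executors_to_request`, `pending_tasks`, `current_executors`) are illustrative stand-ins, not the actual fields of Spark's `ExecutorAllocationManager`:

```python
def executors_to_request(pending_tasks: int, current_executors: int) -> int:
    """Sketch of the allocation arithmetic from step 3 of the report."""
    # The manager wants (at least) one executor per outstanding task.
    target_num_executors = max(pending_tasks, 1)
    # Executor A is blacklisted for the failed task but still registered,
    # so it still counts toward the previous target.
    old_target_num_executors = current_executors
    # With one task and one (unusable) executor the delta stays 0 forever,
    # so no new executor is ever requested and the job hangs.
    return target_num_executors - old_target_num_executors

print(executors_to_request(pending_tasks=1, current_executors=1))  # prints 0
```

Because the blacklisted executor still satisfies the target count, the request delta never becomes positive, which matches the "endless request delta=0 executors" behavior in the report.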
[jira] [Commented] (SPARK-15922) BlockMatrix to IndexedRowMatrix throws an error
[ https://issues.apache.org/jira/browse/SPARK-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331294#comment-15331294 ] Dongjoon Hyun commented on SPARK-15922: --- Hi, [~mengxr]. Could you review the PR on this bug? > BlockMatrix to IndexedRowMatrix throws an error > --- > > Key: SPARK-15922 > URL: https://issues.apache.org/jira/browse/SPARK-15922 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Charlie Evans > > {code} > import org.apache.spark.mllib.linalg.distributed._ > import org.apache.spark.mllib.linalg._ > val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, > new DenseVector(Array(1,2,3))):: IndexedRow(2L, new > DenseVector(Array(1,2,3))):: Nil > val rdd = sc.parallelize(rows) > val matrix = new IndexedRowMatrix(rdd, 3, 3) > val bmat = matrix.toBlockMatrix > val imat = bmat.toIndexedRowMatrix > imat.rows.collect // this throws an error - Caused by: > java.lang.IllegalArgumentException: requirement failed: Vectors must be the > same length! > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15888) Python UDF over aggregate fails
[ https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15888: Affects Version/s: (was: 2.0.0) > Python UDF over aggregate fails > --- > > Key: SPARK-15888 > URL: https://issues.apache.org/jira/browse/SPARK-15888 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Vladimir Feinberg >Assignee: Davies Liu >Priority: Blocker > > This looks like a regression from 1.6.1. > The following notebook runs without error in a Spark 1.6.1 cluster, but fails > in 2.0.0: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1004) PySpark on YARN
[ https://issues.apache.org/jira/browse/SPARK-1004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331314#comment-15331314 ] Apache Spark commented on SPARK-1004: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/30 > PySpark on YARN > --- > > Key: SPARK-1004 > URL: https://issues.apache.org/jira/browse/SPARK-1004 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Reporter: Josh Rosen >Assignee: Sandy Ryza >Priority: Blocker > Fix For: 1.0.0 > > > This is for tracking progress on supporting YARN in PySpark. > We might be able to use {{yarn-client}} mode > (https://spark.incubator.apache.org/docs/latest/running-on-yarn.html#launch-spark-application-with-yarn-client-mode). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15955) Failed Spark application returns with exitcode equals to zero
[ https://issues.apache.org/jira/browse/SPARK-15955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331325#comment-15331325 ] Sean Owen commented on SPARK-15955: --- Which process has status 0, and can you try vs master? > Failed Spark application returns with exitcode equals to zero > - > > Key: SPARK-15955 > URL: https://issues.apache.org/jira/browse/SPARK-15955 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 >Reporter: Yesha Vora > > Scenario: > * Set up cluster with wire-encryption enabled. > * set 'spark.authenticate.enableSaslEncryption' = 'false' and > 'spark.shuffle.service.enabled' :'true' > * run sparkPi application. > {code} > client token: Token { kind: YARN_CLIENT_TOKEN, service: } > diagnostics: Max number of executor failures (3) reached > ApplicationMaster host: xx.xx.xx.xxx > ApplicationMaster RPC port: 0 > queue: default > start time: 1465941051976 > final status: FAILED > tracking URL: https://xx.xx.xx.xxx:8090/proxy/application_1465925772890_0016/ > user: hrt_qa > Exception in thread "main" org.apache.spark.SparkException: Application > application_1465925772890_0016 finished with failed status > at org.apache.spark.deploy.yarn.Client.run(Client.scala:1092) > at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1139) > at org.apache.spark.deploy.yarn.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > INFO ShutdownHookManager: Shutdown hook called{code} > This Spark application exits with exit code 0. A failed application should not > return exit code 0.
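The expected contract here is that a JVM terminating via an uncaught exception (as `SparkSubmit` does in the stack trace above) propagates a non-zero status to the calling shell. A minimal illustration of that contract using plain subprocesses, not Spark itself:

```python
import subprocess
import sys

# A child process that dies with an uncaught exception should report a
# non-zero status; a clean exit should report 0. The bug report says
# spark-submit returned 0 in both cases.
failed = subprocess.run(
    [sys.executable, "-c", "raise RuntimeError('boom')"],
    stderr=subprocess.DEVNULL,
)
clean = subprocess.run([sys.executable, "-c", "pass"])
print(failed.returncode, clean.returncode)  # prints "1 0"
```

A launcher or CI script relying on `$?` after `spark-submit` would silently treat the failed application as successful under the reported behavior.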
[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String
[ https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331371#comment-15331371 ] Pete Robbins commented on SPARK-15822: -- The generated code is: {code} Top Arrival Carrier Cancellations: Found 5 WholeStageCodegen subtrees. == Subtree 1 / 5 == *HashAggregate(key=[Origin#16,UniqueCarrier#8], functions=[partial_count(1)], output=[Origin#16,UniqueCarrier#8,count#296L]) +- *Project [UniqueCarrier#8, Origin#16] +- *Filter (((isnotnull(Origin#16) && isnotnull(UniqueCarrier#8)) && isnotnull(Cancelled#21)) && isnotnull(CancellationCode#22)) && NOT (Cancelled#21 = 0)) && (CancellationCode#22 = A)) && isnotnull(Dest#17)) && (Dest#17 = ORD)) +- *Scan csv [UniqueCarrier#8,Origin#16,Dest#17,Cancelled#21,CancellationCode#22] Format: CSV, InputPaths: file:/home/robbins/brandberry/2008.csv, PushedFilters: [IsNotNull(Origin), IsNotNull(UniqueCarrier), IsNotNull(Cancelled), IsNotNull(CancellationCode), ..., ReadSchema: struct Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIterator(references); /* 003 */ } /* 004 */ /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { /* 006 */ private Object[] references; /* 007 */ private boolean agg_initAgg; /* 008 */ private boolean agg_bufIsNull; /* 009 */ private long agg_bufValue; /* 010 */ private agg_VectorizedHashMap agg_vectorizedHashMap; /* 011 */ private java.util.Iterator agg_vectorizedHashMapIter; /* 012 */ private org.apache.spark.sql.execution.aggregate.HashAggregateExec agg_plan; /* 013 */ private org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap agg_hashMap; /* 014 */ private org.apache.spark.sql.execution.UnsafeKVExternalSorter agg_sorter; /* 015 */ private org.apache.spark.unsafe.KVIterator agg_mapIter; /* 016 */ private org.apache.spark.sql.execution.metric.SQLMetric agg_peakMemory; /* 017 */ private 
org.apache.spark.sql.execution.metric.SQLMetric agg_spillSize; /* 018 */ private org.apache.spark.sql.execution.metric.SQLMetric scan_numOutputRows; /* 019 */ private scala.collection.Iterator scan_input; /* 020 */ private org.apache.spark.sql.execution.metric.SQLMetric filter_numOutputRows; /* 021 */ private UnsafeRow filter_result; /* 022 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder filter_holder; /* 023 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter filter_rowWriter; /* 024 */ private UnsafeRow project_result; /* 025 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder; /* 026 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter project_rowWriter; /* 027 */ private UnsafeRow agg_result2; /* 028 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder; /* 029 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter; /* 030 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowJoiner agg_unsafeRowJoiner; /* 031 */ private org.apache.spark.sql.execution.metric.SQLMetric wholestagecodegen_numOutputRows; /* 032 */ private org.apache.spark.sql.execution.metric.SQLMetric wholestagecodegen_aggTime; /* 033 */ private UnsafeRow wholestagecodegen_result; /* 034 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder wholestagecodegen_holder; /* 035 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter wholestagecodegen_rowWriter; /* 036 */ /* 037 */ public GeneratedIterator(Object[] references) { /* 038 */ this.references = references; /* 039 */ } /* 040 */ /* 041 */ public void init(int index, scala.collection.Iterator inputs[]) { /* 042 */ partitionIndex = index; /* 043 */ agg_initAgg = false; /* 044 */ /* 045 */ agg_vectorizedHashMap = new agg_VectorizedHashMap(); /* 046 */ /* 047 */ this.agg_plan = 
(org.apache.spark.sql.execution.aggregate.HashAggregateExec) references[0]; /* 048 */ /* 049 */ this.agg_peakMemory = (org.apache.spark.sql.execution.metric.SQLMetric) references[1]; /* 050 */ this.agg_spillSize = (org.apache.spark.sql.execution.metric.SQLMetric) references[2]; /* 051 */ this.scan_numOutputRows = (org.apache.spark.sql.execution.metric.SQLMetric) references[3]; /* 052 */ scan_input = inputs[0]; /* 053 */ this.filter_numOutputRows = (org.apache.spark.sql.execution.metric.SQLMetric) references[4]; /* 054 */ filter_result = new UnsafeRow(5); /* 055 */ this.filter_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(filter_result, 128); /* 056 */ this.filter_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(filter_holder, 5); /* 057 */ project_result = new Unsafe
[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String
[ https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331370#comment-15331370 ] Pete Robbins commented on SPARK-15822: -- Chatting with [~hvanhovell] here is the current state. I can reproduce a segv using local[8] on an 8 core machine. It is intermittent but many many runs with eg local[2] produce no issues. The segv info is: {noformat} # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x7fe8c118ca58, pid=3558, tid=140633451779840 # # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14) # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 compressed oops) # Problematic frame: # J 7467 C1 org.apache.spark.unsafe.Platform.getByte(Ljava/lang/Object;J)B (9 bytes) @ 0x7fe8c118ca58 [0x7fe8c118ca20+0x38] # # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again # # If you would like to submit a bug report, please visit: # http://bugreport.java.com/bugreport/crash.jsp # --- T H R E A D --- Current thread (0x7fe858018800): JavaThread "Executor task launch worker-3" daemon [_thread_in_Java, id=3698, stack(0x7fe7c6dfd000,0x7fe7c6efe000)] siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x00a09cf4 Registers: RAX=0x7fe884ce5828, RBX=0x7fe884ce5828, RCX=0x7fe81e0a5360, RDX=0x00a09cf4 RSP=0x7fe7c6efb9e0, RBP=0x7fe7c6efba80, RSI=0x, RDI=0x3848 R8 =0x200b94c8, R9 =0xeef66bf0, R10=0x7fe8d87a2f00, R11=0x7fe8c118ca20 R12=0x, R13=0x7fe7c6efba28, R14=0x7fe7c6efba98, R15=0x7fe858018800 RIP=0x7fe8c118ca58, EFLAGS=0x00010206, CSGSFS=0x0033, ERR=0x0004 TRAPNO=0x000e Top of Stack: (sp=0x7fe7c6efb9e0) 0x7fe7c6efb9e0: 7fe7c56941e8 0x7fe7c6efb9f0: 7fe7c6efbab0 7fe8c140c38c 0x7fe7c6efba00: 7fe8c1007d80 eef66bc8 0x7fe7c6efba10: 7fe7c6efba80 7fe8c1007700 0x7fe7c6efba20: 7fe8c1007700 00a09cf4 0x7fe7c6efba30: 0030 0x7fe7c6efba40: 
7fe7c6efba40 7fe81e0a1f9b 0x7fe7c6efba50: 7fe7c6efba98 7fe81e0a5360 0x7fe7c6efba60: 7fe81e0a1fc0 0x7fe7c6efba70: 7fe7c6efba28 7fe7c6efba90 0x7fe7c6efba80: 7fe7c6efbae8 7fe8c1007700 0x7fe7c6efba90: ee4f4898 0x7fe7c6efbaa0: 004d 7fe7c6efbaa8 0x7fe7c6efbab0: 7fe81e0a42be 7fe7c6efbb18 0x7fe7c6efbac0: 7fe81e0a5360 0x7fe7c6efbad0: 7fe81e0a4338 7fe7c6efba90 0x7fe7c6efbae0: 7fe7c6efbb10 7fe7c6efbb60 0x7fe7c6efbaf0: 7fe8c1007a40 0x7fe7c6efbb00: 0003 0x7fe7c6efbb10: ee4f4898 eef67950 0x7fe7c6efbb20: 7fe7c6efbb20 7fe81e0a43f2 0x7fe7c6efbb30: 7fe7c6efbb78 7fe81e0a5360 0x7fe7c6efbb40: 7fe81e0a4418 0x7fe7c6efbb50: 7fe7c6efbb10 7fe7c6efbb70 0x7fe7c6efbb60: 7fe7c6efbbc0 7fe8c1007a40 0x7fe7c6efbb70: ee4f4898 eef67950 0x7fe7c6efbb80: 7fe7c6efbb80 7fe7c56844e5 0x7fe7c6efbb90: 7fe7c6efbc28 7fe7c5684950 0x7fe7c6efbba0: 7fe7c5684618 0x7fe7c6efbbb0: 7fe7c6efbb70 7fe7c6efbc18 0x7fe7c6efbbc0: 7fe7c6efbc70 7fe8c10077d0 0x7fe7c6efbbd0: Instructions: (pc=0x7fe8c118ca58) 0x7fe8c118ca38: 08 83 c7 08 89 78 08 48 b8 28 58 ce 84 e8 7f 00 0x7fe8c118ca48: 00 81 e7 f8 3f 00 00 83 ff 00 0f 84 16 00 00 00 0x7fe8c118ca58: 0f be 04 16 c1 e0 18 c1 f8 18 48 83 c4 30 5d 85 0x7fe8c118ca68: 05 93 c6 85 17 c3 48 89 44 24 08 48 c7 04 24 ff Register to memory mapping: RAX={method} {0x7fe884ce5828} 'getByte' '(Ljava/lang/Object;J)B' in 'org/apache/spark/unsafe/Platform' RBX={method} {0x7fe884ce5828} 'getByte' '(Ljava/lang/Object;J)B' in 'org/apache/spark/unsafe/Platform' RCX=0x7fe81e0a5360 is pointing into metadata RDX=0x00a09cf4 is an unknown value RSP=0x7fe7c6efb9e0 is pointing into the stack for thread: 0x7fe858018800 RBP=0x7fe7c6efba80 is pointing into the stack for thread: 0x7fe858018800 RSI=0x is an unknown value RDI=0x3848 is an unknown value R8 =0x200b94c8 is an unknown value R9 =0xeef66bf0 is an oop [B - klass: {type array byte} - length: 48 R10=0x7fe8d87a2f00: in /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.91-0.b14.el6_7.x86_64/jre/lib/amd64/server/libjvm.so
[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String
[ https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331375#comment-15331375 ] Pete Robbins commented on SPARK-15822: -- and the plan: {noformat} == Parsed Logical Plan == 'Project [unresolvedalias('Origin, None), unresolvedalias('UniqueCarrier, None), 'round((('count * 100) / 'total), 2) AS rank#173] +- Project [Origin#16, UniqueCarrier#8, count#134L, total#97L] +- Join Inner, ((Origin#16 = Origin#155) && (UniqueCarrier#8 = UniqueCarrier#147)) :- Aggregate [Origin#16, UniqueCarrier#8], [Origin#16, UniqueCarrier#8, count(1) AS count#134L] : +- Filter (NOT (Cancelled#21 = 0) && (CancellationCode#22 = A)) : +- Filter (Dest#17 = ORD) :+- Relation[Year#0,Month#1,DayofMonth#2,DayOfWeek#3,DepTime#4,CRSDepTime#5,ArrTime#6,CRSArrTime#7,UniqueCarrier#8,FlightNum#9,TailNum#10,ActualElapsedTime#11,CRSElapsedTime#12,AirTime#13,ArrDelay#14,DepDelay#15,Origin#16,Dest#17,Distance#18,TaxiIn#19,TaxiOut#20,Cancelled#21,CancellationCode#22,Diverted#23,... 5 more fields] csv +- Project [Origin#155, UniqueCarrier#147, count#92L AS total#97L] +- Aggregate [Origin#155, UniqueCarrier#147], [Origin#155, UniqueCarrier#147, count(1) AS count#92L] +- Filter (Dest#156 = ORD) +- Relation[Year#139,Month#140,DayofMonth#141,DayOfWeek#142,DepTime#143,CRSDepTime#144,ArrTime#145,CRSArrTime#146,UniqueCarrier#147,FlightNum#148,TailNum#149,ActualElapsedTime#150,CRSElapsedTime#151,AirTime#152,ArrDelay#153,DepDelay#154,Origin#155,Dest#156,Distance#157,TaxiIn#158,TaxiOut#159,Cancelled#160,CancellationCode#161,Diverted#162,... 
5 more fields] csv == Analyzed Logical Plan == Origin: string, UniqueCarrier: string, rank: double Project [Origin#16, UniqueCarrier#8, round((cast((count#134L * cast(100 as bigint)) as double) / cast(total#97L as double)), 2) AS rank#173] +- Project [Origin#16, UniqueCarrier#8, count#134L, total#97L] +- Join Inner, ((Origin#16 = Origin#155) && (UniqueCarrier#8 = UniqueCarrier#147)) :- Aggregate [Origin#16, UniqueCarrier#8], [Origin#16, UniqueCarrier#8, count(1) AS count#134L] : +- Filter (NOT (Cancelled#21 = 0) && (CancellationCode#22 = A)) : +- Filter (Dest#17 = ORD) :+- Relation[Year#0,Month#1,DayofMonth#2,DayOfWeek#3,DepTime#4,CRSDepTime#5,ArrTime#6,CRSArrTime#7,UniqueCarrier#8,FlightNum#9,TailNum#10,ActualElapsedTime#11,CRSElapsedTime#12,AirTime#13,ArrDelay#14,DepDelay#15,Origin#16,Dest#17,Distance#18,TaxiIn#19,TaxiOut#20,Cancelled#21,CancellationCode#22,Diverted#23,... 5 more fields] csv +- Project [Origin#155, UniqueCarrier#147, count#92L AS total#97L] +- Aggregate [Origin#155, UniqueCarrier#147], [Origin#155, UniqueCarrier#147, count(1) AS count#92L] +- Filter (Dest#156 = ORD) +- Relation[Year#139,Month#140,DayofMonth#141,DayOfWeek#142,DepTime#143,CRSDepTime#144,ArrTime#145,CRSArrTime#146,UniqueCarrier#147,FlightNum#148,TailNum#149,ActualElapsedTime#150,CRSElapsedTime#151,AirTime#152,ArrDelay#153,DepDelay#154,Origin#155,Dest#156,Distance#157,TaxiIn#158,TaxiOut#159,Cancelled#160,CancellationCode#161,Diverted#162,... 
5 more fields] csv == Optimized Logical Plan == Project [Origin#16, UniqueCarrier#8, round((cast((count#134L * 100) as double) / cast(total#97L as double)), 2) AS rank#173] +- Join Inner, ((Origin#16 = Origin#155) && (UniqueCarrier#8 = UniqueCarrier#147)) :- Aggregate [Origin#16, UniqueCarrier#8], [Origin#16, UniqueCarrier#8, count(1) AS count#134L] : +- Project [UniqueCarrier#8, Origin#16] : +- Filter (((isnotnull(Origin#16) && isnotnull(UniqueCarrier#8)) && isnotnull(Cancelled#21)) && isnotnull(CancellationCode#22)) && NOT (Cancelled#21 = 0)) && (CancellationCode#22 = A)) && isnotnull(Dest#17)) && (Dest#17 = ORD)) :+- Relation[Year#0,Month#1,DayofMonth#2,DayOfWeek#3,DepTime#4,CRSDepTime#5,ArrTime#6,CRSArrTime#7,UniqueCarrier#8,FlightNum#9,TailNum#10,ActualElapsedTime#11,CRSElapsedTime#12,AirTime#13,ArrDelay#14,DepDelay#15,Origin#16,Dest#17,Distance#18,TaxiIn#19,TaxiOut#20,Cancelled#21,CancellationCode#22,Diverted#23,... 5 more fields] csv +- Aggregate [Origin#155, UniqueCarrier#147], [Origin#155, UniqueCarrier#147, count(1) AS total#97L] +- Project [UniqueCarrier#147, Origin#155] +- Filter (((isnotnull(UniqueCarrier#147) && isnotnull(Origin#155)) && isnotnull(Dest#156)) && (Dest#156 = ORD)) +- Relation[Year#139,Month#140,DayofMonth#141,DayOfWeek#142,DepTime#143,CRSDepTime#144,ArrTime#145,CRSArrTime#146,UniqueCarrier#147,FlightNum#148,TailNum#149,ActualElapsedTime#150,CRSElapsedTime#151,AirTime#152,ArrDelay#153,DepDelay#154,Origin#155,Dest#156,Distance#157,TaxiIn#158,TaxiOut#159,Cancelled#160,CancellationCode#161,Diverted#162,... 5 more fields] csv == Physical Plan == *Project [Origin#16, UniqueCarrier#8, round((cast((count#
[jira] [Updated] (SPARK-15951) Change Executors Page to use datatables to support sorting columns and searching
[ https://issues.apache.org/jira/browse/SPARK-15951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15951: -- Priority: Minor (was: Major) Fix Version/s: (was: 2.1.0) Component/s: (was: Spark Core) Web UI > Change Executors Page to use datatables to support sorting columns and > searching > > > Key: SPARK-15951 > URL: https://issues.apache.org/jira/browse/SPARK-15951 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Kishor Patil >Priority: Minor > > Support column sorting and searching on the Executors page using jQuery DataTables > and the REST API. Before this change, the executors page was generated as hard-coded > HTML and could not support search; sorting was also disabled if any application > had more than one attempt. Supporting search and sort > (over all applications rather than the 20 entries on the current page) will > greatly improve the user experience.
[jira] [Updated] (SPARK-15631) Dataset and encoder bug fixes
[ https://issues.apache.org/jira/browse/SPARK-15631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15631: -- Assignee: Wenchen Fan > Dataset and encoder bug fixes > - > > Key: SPARK-15631 > URL: https://issues.apache.org/jira/browse/SPARK-15631 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan > Fix For: 2.0.0 > > > This is an umbrella ticket for various Dataset and encoder bug fixes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15065) HiveSparkSubmitSuite's "set spark.sql.warehouse.dir" is flaky
[ https://issues.apache.org/jira/browse/SPARK-15065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15065: -- Assignee: Pete Robbins > HiveSparkSubmitSuite's "set spark.sql.warehouse.dir" is flaky > - > > Key: SPARK-15065 > URL: https://issues.apache.org/jira/browse/SPARK-15065 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Reporter: Yin Huai >Assignee: Pete Robbins >Priority: Critical > Fix For: 2.0.0 > > Attachments: log.txt > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/861/testReport/junit/org.apache.spark.sql.hive/HiveSparkSubmitSuite/dir/ > There are several WARN messages like {{16/05/02 00:51:06 WARN Master: Got > status update for unknown executor app-20160502005054-/3}}, which are > suspicious. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15908) Add varargs-type dropDuplicates() function in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331439#comment-15331439 ] Dongjoon Hyun commented on SPARK-15908: --- I'll working on this issue~. > Add varargs-type dropDuplicates() function in SparkR > > > Key: SPARK-15908 > URL: https://issues.apache.org/jira/browse/SPARK-15908 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > This is for API parity of Scala API. Refer to > https://issues.apache.org/jira/browse/SPARK-15807 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15518) Rename various scheduler backend for consistency
[ https://issues.apache.org/jira/browse/SPARK-15518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331455#comment-15331455 ] Apache Spark commented on SPARK-15518: -- User 'lw-lin' has created a pull request for this issue: https://github.com/apache/spark/pull/13683 > Rename various scheduler backend for consistency > > > Key: SPARK-15518 > URL: https://issues.apache.org/jira/browse/SPARK-15518 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > Various scheduler backends are not named consistently, making it difficult to > understand what they do based on the names. It would be great to rename some > of them: > - LocalScheduler -> LocalSchedulerBackend > - AppClient -> StandaloneAppClient > - AppClientListener -> StandaloneAppClientListener > - SparkDeploySchedulerBackend -> StandaloneSchedulerBackend > - CoarseMesosSchedulerBackend -> MesosCoarseGrainedSchedulerBackend > - MesosSchedulerBackend -> MesosFineGrainedSchedulerBackend -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15908) Add varargs-type dropDuplicates() function in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331488#comment-15331488 ] Apache Spark commented on SPARK-15908: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/13684 > Add varargs-type dropDuplicates() function in SparkR > > > Key: SPARK-15908 > URL: https://issues.apache.org/jira/browse/SPARK-15908 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > This is for API parity of Scala API. Refer to > https://issues.apache.org/jira/browse/SPARK-15807 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15908) Add varargs-type dropDuplicates() function in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15908: Assignee: Apache Spark > Add varargs-type dropDuplicates() function in SparkR > > > Key: SPARK-15908 > URL: https://issues.apache.org/jira/browse/SPARK-15908 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui >Assignee: Apache Spark > > This is for API parity of Scala API. Refer to > https://issues.apache.org/jira/browse/SPARK-15807 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15908) Add varargs-type dropDuplicates() function in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15908: Assignee: (was: Apache Spark) > Add varargs-type dropDuplicates() function in SparkR > > > Key: SPARK-15908 > URL: https://issues.apache.org/jira/browse/SPARK-15908 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > This is for API parity of Scala API. Refer to > https://issues.apache.org/jira/browse/SPARK-15807 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15908) Add varargs-type dropDuplicates() function in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331439#comment-15331439 ] Dongjoon Hyun edited comment on SPARK-15908 at 6/15/16 10:16 AM: - I'll work on this issue~. was (Author: dongjoon): I'll working on this issue~. > Add varargs-type dropDuplicates() function in SparkR > > > Key: SPARK-15908 > URL: https://issues.apache.org/jira/browse/SPARK-15908 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > This is for API parity of Scala API. Refer to > https://issues.apache.org/jira/browse/SPARK-15807 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
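For context, the subset-based dropDuplicates semantics the SparkR API would mirror can be sketched in a few lines of plain Python (the helper name and dict-based rows are illustrative, not Spark API):

```python
def drop_duplicates(rows, subset=None):
    """Keep the first row for each distinct key, like DataFrame.dropDuplicates.

    `subset` names the columns that form the key; None means all columns.
    """
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[col] for col in (subset or sorted(row)))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [
    {"name": "one", "n": 1},
    {"name": "one", "n": 2},
    {"name": "two", "n": 1},
]
print(drop_duplicates(rows, subset=["name"]))  # keeps only the first "one" row
```

The varargs form requested here is just API sugar over the same subset parameter.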
[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String
[ https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331580#comment-15331580 ] Pete Robbins commented on SPARK-15822: -- I can also recreate this issue on Oracle JDK 1.8: {noformat} # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x7f0c65d06aec, pid=7521, tid=0x7f0b69ffd700 # # JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build 1.8.0_92-b14) # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode linux-amd64 compressed oops) # Problematic frame: # J 7453 C1 org.apache.spark.unsafe.Platform.getByte(Ljava/lang/Object;J)B (9 bytes) @ 0x7f0c65d06aec [0x7f0c65d06ae0+0xc] # # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again # # If you would like to submit a bug report, please visit: # http://bugreport.java.com/bugreport/crash.jsp # --- T H R E A D --- Current thread (0x7f0bf4008800): JavaThread "Executor task launch worker-3" daemon [_thread_in_Java, id=7662, stack(0x7f0b69efd000,0x7f0b69ffe000)] siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x02868e54 Registers: RAX=0x7f0c461abb38, RBX=0x7f0c461abb38, RCX=0x7f0c213547c8, RDX=0x02868e54 RSP=0x7f0b69ffba40, RBP=0x7f0b69ffbae0, RSI=0x, RDI=0x0001008254d8 R8 =0x200bd0a6, R9 =0xd9fa2650, R10=0x7f0c79d39020, R11=0x7f0c65d06ae0 R12=0x, R13=0x7f0b69ffba88, R14=0x7f0b69ffbaf8, R15=0x7f0bf4008800 RIP=0x7f0c65d06aec, EFLAGS=0x00010202, CSGSFS=0x0033, ERR=0x0004 TRAPNO=0x000e Top of Stack: (sp=0x7f0b69ffba40) 0x7f0b69ffba40: 7f0b684b4a70 0x7f0b69ffba50: 7f0b69ffbb10 7f0c65e96d4c 0x7f0b69ffba60: 7f0c65008040 d9fa2628 0x7f0b69ffba70: 7f0b69ffbae0 7f0c650079c0 0x7f0b69ffba80: 7f0c650079c0 02868e54 0x7f0b69ffba90: 0030 0x7f0b69ffbaa0: 7f0b69ffbaa0 7f0c21351403 0x7f0b69ffbab0: 7f0b69ffbaf8 7f0c213547c8 0x7f0b69ffbac0: 7f0c21351428 0x7f0b69ffbad0: 7f0b69ffba88 7f0b69ffbaf0 
0x7f0b69ffbae0: 7f0b69ffbb48 7f0c650079c0 0x7f0b69ffbaf0: d9f57cf0 0x7f0b69ffbb00: 004c 7f0b69ffbb08 0x7f0b69ffbb10: 7f0c21353726 7f0b69ffbb78 0x7f0b69ffbb20: 7f0c213547c8 0x7f0b69ffbb30: 7f0c213537a0 7f0b69ffbaf0 0x7f0b69ffbb40: 7f0b69ffbb70 7f0b69ffbbc0 0x7f0b69ffbb50: 7f0c65007d00 0x7f0b69ffbb60: 0003 0x7f0b69ffbb70: d9f57cf0 d9fa33b0 0x7f0b69ffbb80: 7f0b69ffbb80 7f0c2135385a 0x7f0b69ffbb90: 7f0b69ffbbd8 7f0c213547c8 0x7f0b69ffbba0: 7f0c21353880 0x7f0b69ffbbb0: 7f0b69ffbb70 7f0b69ffbbd0 0x7f0b69ffbbc0: 7f0b69ffbc20 7f0c65007d00 0x7f0b69ffbbd0: d9f57cf0 d9fa33b0 0x7f0b69ffbbe0: 7f0b69ffbbe0 7f0b684a24e5 0x7f0b69ffbbf0: 7f0b69ffbc88 7f0b684a2950 0x7f0b69ffbc00: 7f0b684a2618 0x7f0b69ffbc10: 7f0b69ffbbd0 7f0b69ffbc78 0x7f0b69ffbc20: 7f0b69ffbcd0 7f0c65007a90 0x7f0b69ffbc30: Instructions: (pc=0x7f0c65d06aec) 0x7f0c65d06acc: 0a 80 11 64 01 f8 12 fe 06 90 0c 64 01 f8 12 fe 0x7f0c65d06adc: 06 90 0c 64 89 84 24 00 c0 fe ff 55 48 83 ec 30 0x7f0c65d06aec: 0f be 04 16 c1 e0 18 c1 f8 18 48 83 c4 30 5d 85 0x7f0c65d06afc: 05 ff f5 28 14 c3 90 90 49 8b 87 a8 02 00 00 49 Register to memory mapping: RAX={method} {0x7f0c461abb38} 'getByte' '(Ljava/lang/Object;J)B' in 'org/apache/spark/unsafe/Platform' RBX={method} {0x7f0c461abb38} 'getByte' '(Ljava/lang/Object;J)B' in 'org/apache/spark/unsafe/Platform' RCX=0x7f0c213547c8 is pointing into metadata RDX=0x02868e54 is an unknown value RSP=0x7f0b69ffba40 is pointing into the stack for thread: 0x7f0bf4008800 RBP=0x7f0b69ffbae0 is pointing into the stack for thread: 0x7f0bf4008800 RSI=0x is an unknown value RDI=0x0001008254d8 is pointing into metadata R8 =0x200bd0a6 is an unknown value R9 =0xd9fa2650 is an oop [B - klass: {type array byte} - length: 48 R10=0x7f0c79d39020: in /home/robbins/sdks/jdk1.8.0_92/jre/lib/amd64/server/libjvm.so at 0x7f0c78d7d000 R11=0x7f0c65d06ae0 is at entry_point+0 in (nmethod*)0x7f0c65d06990 R12=0x is an unknown value R13=0x7f0b69ffba88 is
[jira] [Commented] (SPARK-14692) Error While Setting the path for R front end
[ https://issues.apache.org/jira/browse/SPARK-14692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331613#comment-15331613 ] Dongjoon Hyun commented on SPARK-14692: --- Hi, [~nmolkeri]. It seems that you just needed `Sys.setenv(SPARK_HOME="/Users/yourid/spark")` on the first line at that time. It's a bit of an old issue. If there are no further comments, I think we had better close this. > Error While Setting the path for R front end > > > Key: SPARK-14692 > URL: https://issues.apache.org/jira/browse/SPARK-14692 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.1 > Environment: Mac OSX >Reporter: Niranjan Molkeri` > > Trying to set Environment path for SparkR in RStudio. > Getting this bug. > > .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) > > library(SparkR) > Error in library(SparkR) : there is no package called ‘SparkR’ > > sc <- sparkR.init(master="local") > Error: could not find function "sparkR.init" > In the directory it points to, there is a directory called SparkR. I > don't know how to proceed with this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15046) When running hive-thriftserver with yarn on a secure cluster the workers fail with java.lang.NumberFormatException
[ https://issues.apache.org/jira/browse/SPARK-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-15046. --- Resolution: Fixed Fix Version/s: 2.0.0 > When running hive-thriftserver with yarn on a secure cluster the workers fail > with java.lang.NumberFormatException > -- > > Key: SPARK-15046 > URL: https://issues.apache.org/jira/browse/SPARK-15046 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Trystan Leftwich >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 2.0.0 > > > When running hive-thriftserver with yarn on a secure cluster > (spark.yarn.principal and spark.yarn.keytab are set) the workers fail with > the following error. > {code} > 16/04/30 22:40:50 ERROR yarn.ApplicationMaster: Uncaught exception: > java.lang.NumberFormatException: For input string: "86400079ms" > at > java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) > at java.lang.Long.parseLong(Long.java:441) > at java.lang.Long.parseLong(Long.java:483) > at > scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276) > at scala.collection.immutable.StringOps.toLong(StringOps.scala:29) > at > org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380) > at > org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380) > at scala.Option.map(Option.scala:146) > at org.apache.spark.SparkConf.getLong(SparkConf.scala:380) > at > org.apache.spark.deploy.SparkHadoopUtil.getTimeFromNowToRenewal(SparkHadoopUtil.scala:289) > at > org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.org$apache$spark$deploy$yarn$AMDelegationTokenRenewer$$scheduleRenewal$1(AMDelegationTokenRenewer.scala:89) > at > org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.scheduleLoginFromKeytab(AMDelegationTokenRenewer.scala:121) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243) > at > 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:243) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:723) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66) > at > org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:721) > at > org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:748) > at > org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15046) When running hive-thriftserver with yarn on a secure cluster the workers fail with java.lang.NumberFormatException
[ https://issues.apache.org/jira/browse/SPARK-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-15046: -- Assignee: Marcelo Vanzin > When running hive-thriftserver with yarn on a secure cluster the workers fail > with java.lang.NumberFormatException > -- > > Key: SPARK-15046 > URL: https://issues.apache.org/jira/browse/SPARK-15046 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Trystan Leftwich >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 2.0.0 > > > When running hive-thriftserver with yarn on a secure cluster > (spark.yarn.principal and spark.yarn.keytab are set) the workers fail with > the following error. > {code} > 16/04/30 22:40:50 ERROR yarn.ApplicationMaster: Uncaught exception: > java.lang.NumberFormatException: For input string: "86400079ms" > at > java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) > at java.lang.Long.parseLong(Long.java:441) > at java.lang.Long.parseLong(Long.java:483) > at > scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276) > at scala.collection.immutable.StringOps.toLong(StringOps.scala:29) > at > org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380) > at > org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380) > at scala.Option.map(Option.scala:146) > at org.apache.spark.SparkConf.getLong(SparkConf.scala:380) > at > org.apache.spark.deploy.SparkHadoopUtil.getTimeFromNowToRenewal(SparkHadoopUtil.scala:289) > at > org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.org$apache$spark$deploy$yarn$AMDelegationTokenRenewer$$scheduleRenewal$1(AMDelegationTokenRenewer.scala:89) > at > org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.scheduleLoginFromKeytab(AMDelegationTokenRenewer.scala:121) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243) > at > 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:243) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:723) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66) > at > org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:721) > at > org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:748) > at > org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
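The failure mode above is a time value carrying a unit suffix ("86400079ms") reaching a plain integer parser. A hedged Python sketch of the difference (the unit table and helper name are illustrative; Spark's own unit-aware readers are the real fix):

```python
# Longer suffixes are listed first so "ms" wins over "s".
UNITS = [("ms", 1), ("s", 1000), ("m", 60_000), ("h", 3_600_000)]

def parse_time_ms(value: str) -> int:
    # Unit-aware parsing, similar in spirit to Spark's getTimeAsMs-style readers.
    for suffix, factor in UNITS:
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)  # bare number: treat as milliseconds

# A plain integer parse, like SparkConf.getLong, chokes on the suffix:
try:
    int("86400079ms")
except ValueError:
    pass  # the Python analogue of the reported NumberFormatException

print(parse_time_ms("86400079ms"))  # 86400079
```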
[jira] [Created] (SPARK-15963) `TaskKilledException` is not correctly caught in `Executor.TaskRunner`
Liwei Lin created SPARK-15963: - Summary: `TaskKilledException` is not correctly caught in `Executor.TaskRunner` Key: SPARK-15963 URL: https://issues.apache.org/jira/browse/SPARK-15963 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0, 2.1.0 Reporter: Liwei Lin Currently in {{Executor.TaskRunner}}, we: {code} try {...} catch { case _: TaskKilledException | _: InterruptedException if task.killed => ... } {code} What we intended was: - {{TaskKilledException}} OR ({{InterruptedException}} AND {{task.killed}}) But fact is: - ({{TaskKilledException}} OR {{InterruptedException}}) AND {{task.killed}} As a consequence, sometimes we can not catch {{TaskKilledException}} and will incorrectly report our task status as {{FAILED}} (which should really be {{KILLED}}). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15963) `TaskKilledException` is not correctly caught in `Executor.TaskRunner`
[ https://issues.apache.org/jira/browse/SPARK-15963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15963: Assignee: Apache Spark > `TaskKilledException` is not correctly caught in `Executor.TaskRunner` > -- > > Key: SPARK-15963 > URL: https://issues.apache.org/jira/browse/SPARK-15963 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0, 2.1.0 >Reporter: Liwei Lin >Assignee: Apache Spark > > Currently in {{Executor.TaskRunner}}, we: > {code} > try {...} > catch { > case _: TaskKilledException | _: InterruptedException if task.killed => > ... > } > {code} > What we intended was: > - {{TaskKilledException}} OR ({{InterruptedException}} AND {{task.killed}}) > But fact is: > - ({{TaskKilledException}} OR {{InterruptedException}}) AND {{task.killed}} > As a consequence, sometimes we can not catch {{TaskKilledException}} and will > incorrectly report our task status as {{FAILED}} (which should really be > {{KILLED}}). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15963) `TaskKilledException` is not correctly caught in `Executor.TaskRunner`
[ https://issues.apache.org/jira/browse/SPARK-15963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15963: Assignee: (was: Apache Spark) > `TaskKilledException` is not correctly caught in `Executor.TaskRunner` > -- > > Key: SPARK-15963 > URL: https://issues.apache.org/jira/browse/SPARK-15963 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0, 2.1.0 >Reporter: Liwei Lin > > Currently in {{Executor.TaskRunner}}, we: > {code} > try {...} > catch { > case _: TaskKilledException | _: InterruptedException if task.killed => > ... > } > {code} > What we intended was: > - {{TaskKilledException}} OR ({{InterruptedException}} AND {{task.killed}}) > But fact is: > - ({{TaskKilledException}} OR {{InterruptedException}}) AND {{task.killed}} > As a consequence, sometimes we can not catch {{TaskKilledException}} and will > incorrectly report our task status as {{FAILED}} (which should really be > {{KILLED}}). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15963) `TaskKilledException` is not correctly caught in `Executor.TaskRunner`
[ https://issues.apache.org/jira/browse/SPARK-15963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331837#comment-15331837 ] Apache Spark commented on SPARK-15963: -- User 'lw-lin' has created a pull request for this issue: https://github.com/apache/spark/pull/13685 > `TaskKilledException` is not correctly caught in `Executor.TaskRunner` > -- > > Key: SPARK-15963 > URL: https://issues.apache.org/jira/browse/SPARK-15963 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0, 2.1.0 >Reporter: Liwei Lin > > Currently in {{Executor.TaskRunner}}, we: > {code} > try {...} > catch { > case _: TaskKilledException | _: InterruptedException if task.killed => > ... > } > {code} > What we intended was: > - {{TaskKilledException}} OR ({{InterruptedException}} AND {{task.killed}}) > But fact is: > - ({{TaskKilledException}} OR {{InterruptedException}}) AND {{task.killed}} > As a consequence, sometimes we can not catch {{TaskKilledException}} and will > incorrectly report our task status as {{FAILED}} (which should really be > {{KILLED}}). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15964) Assignment to RDD-typed val fails
Sanjay Dasgupta created SPARK-15964: --- Summary: Assignment to RDD-typed val fails Key: SPARK-15964 URL: https://issues.apache.org/jira/browse/SPARK-15964 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Environment: Notebook on Databricks Community-Edition Spark-2.0 preview Google Chrome Browser Linux Ubuntu 14.04 LTS Reporter: Sanjay Dasgupta Unusual assignment error, giving the following error message: found : org.apache.spark.rdd.RDD[Name] required : org.apache.spark.rdd.RDD[Name] This occurs when the assignment is attempted in a cell that is different from the cell in which the item on the right-hand-side is defined. As in the following example: // CELL-1 import org.apache.spark.sql.Dataset import org.apache.spark.rdd.RDD case class Name(number: Int, name: String) val names = Seq(Name(1, "one"), Name(2, "two"), Name(3, "three"), Name(4, "four")) val dataset: Dataset[Name] = spark.sparkContext.parallelize(names).toDF.as[Name] // CELL-2 // Error reported here ... val dataRdd: RDD[Name] = dataset.rdd The error is reported in CELL-2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
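The "found RDD[Name] / required RDD[Name]" message suggests each notebook cell compiled its own Name class, so the two types merely print alike. The same phenomenon is easy to reproduce in plain Python by re-running a class definition (a hedged analogue of the notebook behavior, not the Spark mechanism itself):

```python
class Name:
    def __init__(self, number, name):
        self.number = number
        self.name = name

one = Name(1, "one")

# Re-running the "cell" rebinds Name to a brand-new class object:
class Name:
    def __init__(self, number, name):
        self.number = number
        self.name = name

print(isinstance(one, Name))               # False: old instance, new class
print(type(one).__name__, Name.__name__)   # both print "Name", yet the classes differ
```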
[jira] [Commented] (SPARK-15815) Hang while enable blacklistExecutor and DynamicExecutorAllocator
[ https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331907#comment-15331907 ] Imran Rashid commented on SPARK-15815: -- Yes, I agree that SPARK-15865 isn't ideal on its own. SPARK-8426 will add better blacklisting, which will help some. And then as a follow-up after that, I intend to add actively killing blacklisted executors. But I actually think it won't change things -- we'll still abort the taskset when we first discover a task that can't be scheduled, because even with Dynamic Allocation, we'll never really know if we're going to get another executor. It's not ideal, but I think the first step is to be sure we're preventing the app from hanging. > Hang while enable blacklistExecutor and DynamicExecutorAllocator > - > > Key: SPARK-15815 > URL: https://issues.apache.org/jira/browse/SPARK-15815 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.6.1 >Reporter: SuYan >Priority: Minor > > Enable BlacklistExecutor with some time large than 120s and enabled > DynamicAllocate with minExecutors = 0 > 1. Assume there only left 1 task running in Executor A, and other Executor > are all timeout. > 2. the task failed, so task will not scheduled in current Executor A due to > enable blacklistTime. > 3. For ExecutorAllocateManager, it always request targetNumExecutor=1 > executors, due to we already have executor A, so the oldTargetNumExecutor == > targetNumExecutor = 1, so will never add more Executors...even if Executor A > was timeout. it became endless request delta=0 executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15965) No FileSystem for scheme: s3n or s3a spark-2.0.0 and spark-1.6.1
thauvin damien created SPARK-15965: -- Summary: No FileSystem for scheme: s3n or s3a spark-2.0.0 and spark-1.6.1 Key: SPARK-15965 URL: https://issues.apache.org/jira/browse/SPARK-15965 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.6.1 Environment: Debian GNU/Linux 8 java version "1.7.0_79" Reporter: thauvin damien The Spark programming guide explains that Spark can create distributed datasets on Amazon S3. But since the pre-built "Hadoop 2.6" packages, S3 access doesn't work with s3n or s3a. Any version of Spark: spark-1.3.1, spark-1.6.1, even spark-2.0.0 with hadoop.7.2. I understand this is a Hadoop issue (SPARK-7442), but can you add some documentation to explain which jars we need to add and where (for a standalone installation)? Are "hadoop-aws-x.x.x.jar and aws-java-sdk-x.x.x.jar" enough? Which env variables do we need to set, and which files do we need to modify? Is it "$CLASSPATH", or a variable in "spark-defaults.conf" such as "spark.driver.extraClassPath" and "spark.executor.extraClassPath"? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15966) Fix markdown for Spark Monitoring
Dhruve Ashar created SPARK-15966: Summary: Fix markdown for Spark Monitoring Key: SPARK-15966 URL: https://issues.apache.org/jira/browse/SPARK-15966 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.0.0 Reporter: Dhruve Ashar Priority: Trivial The markdown for Spark monitoring needs to be fixed. http://spark.apache.org/docs/2.0.0-preview/monitoring.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15861) pyspark mapPartitions with none generator functions / functors
[ https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331975#comment-15331975 ] Bryan Cutler commented on SPARK-15861: -- If you change your function to this {noformat} def to_np(data): return [np.array(list(data))] {noformat} I think you would get what you expect, but this is probably not a good way to go about it. I feel like you should be aggregating your lists into numpy arrays instead, but someone else might know better. > pyspark mapPartitions with none generator functions / functors > -- > > Key: SPARK-15861 > URL: https://issues.apache.org/jira/browse/SPARK-15861 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 >Reporter: Greg Bowyer >Priority: Minor > > Hi all, it appears that the method `rdd.mapPartitions` does odd things if it > is fed a normal subroutine. > For instance, lets say we have the following > {code} > rows = range(25) > rows = [rows[i:i+5] for i in range(0, len(rows), 5)] > rdd = sc.parallelize(rows, 2) > def to_np(data): > return np.array(list(data)) > rdd.mapPartitions(to_np).collect() > ... > [array([0, 1, 2, 3, 4]), > array([5, 6, 7, 8, 9]), > array([10, 11, 12, 13, 14]), > array([15, 16, 17, 18, 19]), > array([20, 21, 22, 23, 24])] > rdd.mapPartitions(to_np, preservePartitioning=True).collect() > ... > [array([0, 1, 2, 3, 4]), > array([5, 6, 7, 8, 9]), > array([10, 11, 12, 13, 14]), > array([15, 16, 17, 18, 19]), > array([20, 21, 22, 23, 24])] > {code} > This basically makes the provided function that did return act like the end > user called {code}rdd.map{code} > I think that maybe a check should be put in to call > {code}inspect.isgeneratorfunction{code} > ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
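The observed behavior follows from mapPartitions flattening whatever iterable the function returns; a 2-D array iterates row by row, so each row becomes its own output element. A minimal simulation with plain lists (`map_partitions` here is an illustrative stand-in, not the PySpark implementation):

```python
from itertools import chain

def map_partitions(partitions, f):
    # PySpark treats f's return value as an iterable of output elements
    # and flattens the results across partitions, so a non-generator
    # function "works", one yielded element at a time.
    return list(chain.from_iterable(f(iter(p)) for p in partitions))

rows = [list(range(i, i + 5)) for i in range(0, 25, 5)]
partitions = [rows[:3], rows[3:]]  # two partitions of row-lists

# Returning the collected rows directly: each row is emitted separately,
# which looks exactly like rdd.map over the rows.
flat = map_partitions(partitions, lambda data: list(data))

# Wrapping in a one-element list keeps one result per partition,
# matching the [np.array(list(data))] suggestion above.
per_partition = map_partitions(partitions, lambda data: [list(data)])

print(flat)           # the five row-lists, as in the JIRA output
print(per_partition)  # [rows of partition 1, rows of partition 2]
```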
[jira] [Updated] (SPARK-15965) No FileSystem for scheme: s3n or s3a spark-2.0.0 and spark-1.6.1
[ https://issues.apache.org/jira/browse/SPARK-15965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] thauvin damien updated SPARK-15965: --- Description: The Spark programming guide explains that Spark can create distributed datasets on Amazon S3. But since the pre-built "Hadoop 2.6" packages, S3 access doesn't work with s3n or s3a. sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", "XXXZZZHHH") sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", "xxx") val lines=sc.textFile("s3a://poc-XXX/access/2016/02/20160201202001_xxx.log.gz") java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found Any version of Spark: spark-1.3.1, spark-1.6.1, even spark-2.0.0 with hadoop.7.2. I understand this is a Hadoop issue (SPARK-7442), but can you add some documentation to explain which jars we need to add and where (for a standalone installation)? Are "hadoop-aws-x.x.x.jar and aws-java-sdk-x.x.x.jar" enough? Which env variables do we need to set, and which files do we need to modify? Is it "$CLASSPATH", or a variable in "spark-defaults.conf" such as "spark.driver.extraClassPath" and "spark.executor.extraClassPath"? But it still works with spark-1.6.1 pre-built with hadoop2.4. Thanks was: The Spark programming guide explains that Spark can create distributed datasets on Amazon S3. But since the pre-built "Hadoop 2.6" packages, S3 access doesn't work with s3n or s3a. Any version of Spark: spark-1.3.1, spark-1.6.1, even spark-2.0.0 with hadoop.7.2. I understand this is a Hadoop issue (SPARK-7442), but can you add some documentation to explain which jars we need to add and where (for a standalone installation)? Are "hadoop-aws-x.x.x.jar and aws-java-sdk-x.x.x.jar" enough? Which env variables do we need to set, and which files do we need to modify?
Is it "$CLASSPATH", or a variable in "spark-defaults.conf" such as "spark.driver.extraClassPath" and "spark.executor.extraClassPath"? > No FileSystem for scheme: s3n or s3a spark-2.0.0 and spark-1.6.1 > - > > Key: SPARK-15965 > URL: https://issues.apache.org/jira/browse/SPARK-15965 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.1 > Environment: Debian GNU/Linux 8 > java version "1.7.0_79" >Reporter: thauvin damien > Original Estimate: 8h > Remaining Estimate: 8h > > The Spark programming guide explains that Spark can create distributed > datasets on Amazon S3. > But since the pre-built "Hadoop 2.6" packages, S3 access doesn't work with s3n or > s3a. > sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", "XXXZZZHHH") > sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", > "xxx") > val > lines=sc.textFile("s3a://poc-XXX/access/2016/02/20160201202001_xxx.log.gz") > java.lang.RuntimeException: java.lang.ClassNotFoundException: Class > org.apache.hadoop.fs.s3a.S3AFileSystem not found > Any version of Spark: spark-1.3.1, spark-1.6.1, even spark-2.0.0 with > hadoop.7.2. > I understand this is a Hadoop issue (SPARK-7442), but can you add some > documentation to explain which jars we need to add and where (for a standalone > installation)? > Are "hadoop-aws-x.x.x.jar and aws-java-sdk-x.x.x.jar" enough? > Which env variables do we need to set, and which files do we need to modify? > Is it "$CLASSPATH", or a variable in "spark-defaults.conf" such as > "spark.driver.extraClassPath" and "spark.executor.extraClassPath"? > But it still works with spark-1.6.1 pre-built with hadoop2.4 > Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15849) FileNotFoundException on _temporary while doing saveAsTable to S3
[ https://issues.apache.org/jira/browse/SPARK-15849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332034#comment-15332034 ] Sandeep commented on SPARK-15849: - Thanks Thomas for that comment. I have verified both things with the direct committer: 1. The inconsistency issue no longer occurs 2. I see a 2x speedup too Looking forward to the fix directly in Hadoop, so that the knob doesn't have to be explicitly set. > FileNotFoundException on _temporary while doing saveAsTable to S3 > - > > Key: SPARK-15849 > URL: https://issues.apache.org/jira/browse/SPARK-15849 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 > Environment: AWS EC2 with spark on yarn and s3 storage >Reporter: Sandeep > > When submitting spark jobs to yarn cluster, I occasionally see these error > messages while doing saveAsTable. I have tried doing this with > spark.speculation=false, and get the same error. These errors are similar to > SPARK-2984, but my jobs are writing to S3(s3n) : > Caused by: java.io.FileNotFoundException: File > s3n://xxx/_temporary/0/task_201606080516_0004_m_79 does not exist. > at > org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:506) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310) > at > org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46) > at > org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151) > ... 
42 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
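The "knob" discussed above is the output-committer configuration. A hedged sketch of the settings typically combined for S3 output on Spark 1.6.x follows; the key and class names are from memory and should be verified against your Spark/Hadoop distribution before use (DirectParquetOutputCommitter was removed in Spark 2.0):

```shell
# spark-defaults.conf (Spark 1.6.x) -- illustrative, verify names locally
# Speculation must stay off: a direct committer writes straight to S3,
# so duplicate speculative tasks would produce duplicate output
spark.speculation                                             false
# Commit-algorithm v2 skips the second _temporary rename pass (Hadoop 2.7+)
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  2
# Direct Parquet committer that bypasses _temporary entirely
spark.sql.parquet.output.committer.class                      org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter
```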
[jira] [Updated] (SPARK-15960) Audit new SQL confs
[ https://issues.apache.org/jira/browse/SPARK-15960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15960: Issue Type: Sub-task (was: Bug) Parent: SPARK-15426 > Audit new SQL confs > > > Key: SPARK-15960 > URL: https://issues.apache.org/jira/browse/SPARK-15960 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Herman van Hovell >Assignee: Herman van Hovell > Fix For: 2.0.0 > > > Check the current SQL configuration names for inconsistencies. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15426) Spark 2.0 SQL API audit
[ https://issues.apache.org/jira/browse/SPARK-15426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15426. - Resolution: Fixed Fix Version/s: 2.0.0 > Spark 2.0 SQL API audit > --- > > Key: SPARK-15426 > URL: https://issues.apache.org/jira/browse/SPARK-15426 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Blocker > Fix For: 2.0.0 > > > This is an umbrella ticket to list issues I found with APIs for the 2.0 > release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15960) Audit new SQL confs
[ https://issues.apache.org/jira/browse/SPARK-15960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15960. - Resolution: Fixed Fix Version/s: 2.0.0 > Audit new SQL confs > > > Key: SPARK-15960 > URL: https://issues.apache.org/jira/browse/SPARK-15960 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Herman van Hovell >Assignee: Herman van Hovell > Fix For: 2.0.0 > > > Check the current SQL configuration names for inconsistencies. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15967) Spark UI should show dynamically changed value of storage memory instead of showing one static value all the time
Umesh K created SPARK-15967: --- Summary: Spark UI should show dynamically changed value of storage memory instead of showing one static value all the time Key: SPARK-15967 URL: https://issues.apache.org/jira/browse/SPARK-15967 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.6.1, 1.6.0 Reporter: Umesh K Priority: Minor As of Spark 1.6.x we have unified memory management, so execution/storage memory changes over time; for example, if execution grows it will take memory from storage and vice versa. But the Spark UI shows one static value in the storage tab all the time, like it did in previous versions. Ideally the storage memory value should be refreshed to show the real-time value of storage memory in the Spark UI, so we can actually visualize the stealing between execution and storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache
[ https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332074#comment-15332074 ] Michael Allman commented on SPARK-15968: I have a patch based on the way this method was implemented in Spark 1.5. I'm working on a PR presently. > HiveMetastoreCatalog does not correctly validate partitioned metastore > relation when searching the internal table cache > --- > > Key: SPARK-15968 > URL: https://issues.apache.org/jira/browse/SPARK-15968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Michael Allman > Labels: hive, metastore > > The `getCached` method of `HiveMetastoreCatalog` computes `pathsInMetastore` > from the metastore relation's catalog table. This only returns the table base > path, which is not correct for partitioned tables. As a result, cache lookups > on partitioned tables always miss and these relations are always recomputed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache
[ https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Allman updated SPARK-15968: --- Description: The {{getCached}} method of {{HiveMetastoreCatalog}} computes {{pathsInMetastore}} from the metastore relation's catalog table. This only returns the table base path, which is not correct for partitioned tables. As a result, cache lookups on partitioned tables always miss and these relations are always recomputed. (was: The `getCached` method of `HiveMetastoreCatalog` computes `pathsInMetastore` from the metastore relation's catalog table. This only returns the table base path, which is not correct for partitioned tables. As a result, cache lookups on partitioned tables always miss and these relations are always recomputed.) > HiveMetastoreCatalog does not correctly validate partitioned metastore > relation when searching the internal table cache > --- > > Key: SPARK-15968 > URL: https://issues.apache.org/jira/browse/SPARK-15968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Michael Allman > Labels: hive, metastore > > The {{getCached}} method of {{HiveMetastoreCatalog}} computes > {{pathsInMetastore}} from the metastore relation's catalog table. This only > returns the table base path, which is not correct for partitioned tables. As > a result, cache lookups on partitioned tables always miss and these relations > are always recomputed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
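The failure mode can be modeled in a few lines of plain Python (a simplified illustration, not the actual Scala code in {{HiveMetastoreCatalog}}): when the lookup derives only the table base path but the cached relation holds the full set of partition paths, the comparison can never succeed.

```python
def get_cached(cached_paths, paths_in_metastore):
    """Report a cache hit only when the two path sets match exactly."""
    return set(cached_paths) == set(paths_in_metastore)

# The cached logical relation knows every partition directory...
cached = ["/warehouse/tbl/part=1", "/warehouse/tbl/part=2"]

# ...but the buggy lookup computes only the table base path from the
# catalog table, so partitioned tables always miss:
assert get_cached(cached, ["/warehouse/tbl"]) is False

# Comparing against the full partition path set hits as expected:
assert get_cached(cached, ["/warehouse/tbl/part=2", "/warehouse/tbl/part=1"]) is True
```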
[jira] [Created] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache
Michael Allman created SPARK-15968: -- Summary: HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache Key: SPARK-15968 URL: https://issues.apache.org/jira/browse/SPARK-15968 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Michael Allman The `getCached` method of `HiveMetastoreCatalog` computes `pathsInMetastore` from the metastore relation's catalog table. This only returns the table base path, which is not correct for partitioned tables. As a result, cache lookups on partitioned tables always miss and these relations are always recomputed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15967) Spark UI should show realtime value of storage memory instead of showing one static value all the time
[ https://issues.apache.org/jira/browse/SPARK-15967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Umesh K updated SPARK-15967: Summary: Spark UI should show realtime value of storage memory instead of showing one static value all the time (was: Spark UI should show dynamically changed value of storage memory instead of showing one static value all the time) > Spark UI should show realtime value of storage memory instead of showing one > static value all the time > -- > > Key: SPARK-15967 > URL: https://issues.apache.org/jira/browse/SPARK-15967 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.6.0, 1.6.1 >Reporter: Umesh K >Priority: Minor > > As of Spark 1.6.x we have unified memory management and hence > execution/storage memory changes over time for e.g. if execution grows it > will take memory from storage and vice-versa. But if we Spark UI shows one > value all the time in storage tab like it used to in previous version. > Ideally storage memory values should get refreshed to show real time value of > storage memory in the Spark UI so we can actually visualize that stealing > between execution and storage is happening. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache
[ https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332123#comment-15332123 ] Apache Spark commented on SPARK-15968: -- User 'mallman' has created a pull request for this issue: https://github.com/apache/spark/pull/13686 > HiveMetastoreCatalog does not correctly validate partitioned metastore > relation when searching the internal table cache > --- > > Key: SPARK-15968 > URL: https://issues.apache.org/jira/browse/SPARK-15968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Michael Allman > Labels: hive, metastore > > The {{getCached}} method of {{HiveMetastoreCatalog}} computes > {{pathsInMetastore}} from the metastore relation's catalog table. This only > returns the table base path, which is not correct for partitioned tables. As > a result, cache lookups on partitioned tables always miss and these relations > are always recomputed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache
[ https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15968: Assignee: (was: Apache Spark) > HiveMetastoreCatalog does not correctly validate partitioned metastore > relation when searching the internal table cache > --- > > Key: SPARK-15968 > URL: https://issues.apache.org/jira/browse/SPARK-15968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Michael Allman > Labels: hive, metastore > > The {{getCached}} method of {{HiveMetastoreCatalog}} computes > {{pathsInMetastore}} from the metastore relation's catalog table. This only > returns the table base path, which is not correct for partitioned tables. As > a result, cache lookups on partitioned tables always miss and these relations > are always recomputed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15968) HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache
[ https://issues.apache.org/jira/browse/SPARK-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15968: Assignee: Apache Spark > HiveMetastoreCatalog does not correctly validate partitioned metastore > relation when searching the internal table cache > --- > > Key: SPARK-15968 > URL: https://issues.apache.org/jira/browse/SPARK-15968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Michael Allman >Assignee: Apache Spark > Labels: hive, metastore > > The {{getCached}} method of {{HiveMetastoreCatalog}} computes > {{pathsInMetastore}} from the metastore relation's catalog table. This only > returns the table base path, which is not correct for partitioned tables. As > a result, cache lookups on partitioned tables always miss and these relations > are always recomputed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15811) UDFs do not work in Spark 2.0-preview built with scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-15811: --- Priority: Blocker (was: Critical) > UDFs do not work in Spark 2.0-preview built with scala 2.10 > --- > > Key: SPARK-15811 > URL: https://issues.apache.org/jira/browse/SPARK-15811 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Franklyn Dsouza >Priority: Blocker > > I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following > {code} > ./dev/change-version-to-2.10.sh > ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 > -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive > {code} > and then ran the following code in a pyspark shell > {code} > from pyspark.sql import SparkSession > from pyspark.sql.types import IntegerType, StructField, StructType > from pyspark.sql.functions import udf > from pyspark.sql.types import Row > spark = SparkSession.builder.master('local[4]').appName('2.0 > DF').getOrCreate() > add_one = udf(lambda x: x + 1, IntegerType()) > schema = StructType([StructField('a', IntegerType(), False)]) > df = spark.createDataFrame([Row(a=1),Row(a=2)], schema) > df.select(add_one(df.a).alias('incremented')).collect() > {code} > This never returns with a result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15931) SparkR tests failing on R 3.3.0
[ https://issues.apache.org/jira/browse/SPARK-15931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-15931. --- Resolution: Fixed Assignee: Felix Cheung Fix Version/s: 2.0.0 Resolved by https://github.com/apache/spark/pull/13636 > SparkR tests failing on R 3.3.0 > --- > > Key: SPARK-15931 > URL: https://issues.apache.org/jira/browse/SPARK-15931 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Felix Cheung > Fix For: 2.0.0 > > > Environment: > # Spark master Git revision: > [f5d38c39255cc75325c6639561bfec1bc051f788|https://github.com/apache/spark/tree/f5d38c39255cc75325c6639561bfec1bc051f788] > # R version: 3.3.0 > To reproduce this, just build Spark with {{-Psparkr}} and run the tests. > Relevant log lines: > {noformat} > ... > Failed > - > 1. Failure: Check masked functions (@test_context.R#44) > > length(maskedCompletely) not equal to length(namesOfMaskedCompletely). > 1/1 mismatches > [1] 3 - 5 == -2 > 2. Failure: Check masked functions (@test_context.R#45) > > sort(maskedCompletely) not equal to sort(namesOfMaskedCompletely). > Lengths differ: 3 vs 5 > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15811) UDFs do not work in Spark 2.0-preview built with scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-15811: -- Assignee: Davies Liu > UDFs do not work in Spark 2.0-preview built with scala 2.10 > --- > > Key: SPARK-15811 > URL: https://issues.apache.org/jira/browse/SPARK-15811 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Franklyn Dsouza >Assignee: Davies Liu >Priority: Blocker > > I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following > {code} > ./dev/change-version-to-2.10.sh > ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 > -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive > {code} > and then ran the following code in a pyspark shell > {code} > from pyspark.sql import SparkSession > from pyspark.sql.types import IntegerType, StructField, StructType > from pyspark.sql.functions import udf > from pyspark.sql.types import Row > spark = SparkSession.builder.master('local[4]').appName('2.0 > DF').getOrCreate() > add_one = udf(lambda x: x + 1, IntegerType()) > schema = StructType([StructField('a', IntegerType(), False)]) > df = spark.createDataFrame([Row(a=1),Row(a=2)], schema) > df.select(add_one(df.a).alias('incremented')).collect() > {code} > This never returns with a result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15953) Renamed ContinuousQuery to StreamingQuery for simplicity
[ https://issues.apache.org/jira/browse/SPARK-15953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-15953. -- Resolution: Fixed Fix Version/s: 2.0.0 > Renamed ContinuousQuery to StreamingQuery for simplicity > > > Key: SPARK-15953 > URL: https://issues.apache.org/jira/browse/SPARK-15953 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.0.0 > > > Make the API more intuitive by removing the term "Continuous". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332185#comment-15332185 ] Miao Wang commented on SPARK-15784: --- [~josephkb][~mengxr][~yanboliang] I am trying to add PIC to spark.ml and I have some questions regarding model.predict and saveImpl. The basic PIC algorithm has the following steps: Input: a row-normalized affinity matrix W and the number of clusters k. Output: clusters C1, C2, ..., Ck. Pick an initial vector v_0. Repeat: set v_{t+1} <- W v_t; set d_{t+1} <- |v_{t+1} - v_t|; increment t. Stop when |d_t - d_{t-1}| is approximately 0. Use k-means to cluster the points of v_t and return clusters C1, C2, ..., Ck. In the last step, k-means takes the pseudo-eigenvector `v` generated by PIC to do the clustering. Therefore, model.predict should use the trained k-means to do the prediction. However, the vector `v` must be recomputed by running PIC again on the data to be predicted. So there is no trained model for predicting a new data set: model.predict is actually training again via the PIC.fit method. In this case, PIC.fit and PIC.predict actually call the same run method in the MLlib implementation. Since we have to train on the data anyway, saving the model is not useful, as there is no model to be saved. In the MLlib implementation, the save function saves the assignment results for the current data set, which can't be used for clustering new data. The only use of the saved result is that, given the same data, we don't have to train again; however, we don't know whether the given data is the training data from the saved model. Please correct me if I misunderstand anything. Thanks! Miao > Add Power Iteration Clustering to spark.ml > -- > > Key: SPARK-15784 > URL: https://issues.apache.org/jira/browse/SPARK-15784 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xinh Huynh > > Adding this algorithm is required as part of SPARK-4591: Algorithm/model > parity for spark.ml. The review JIRA for clustering is SPARK-14380. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
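The steps listed above can be sketched in plain Python (a toy illustration of the algorithm for k=2, not the MLlib implementation; the affinity matrix, tolerance, and iteration cap are arbitrary choices for the example):

```python
def pic_cluster(affinity, tol=1e-4, max_iter=60):
    """Toy power iteration clustering (k=2) on a small affinity matrix."""
    n = len(affinity)
    # Row-normalize the affinity matrix: W = D^-1 A
    row_sums = [sum(row) for row in affinity]
    W = [[affinity[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
    # Initial vector: degrees normalized to sum to 1 (as in the PIC paper)
    total = sum(row_sums)
    v = [d / total for d in row_sums]
    prev_delta = None
    for _ in range(max_iter):
        v_next = [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]
        delta = sum(abs(a - b) for a, b in zip(v_next, v))
        v = v_next
        if prev_delta is not None and abs(delta - prev_delta) < tol:
            break  # the pseudo-eigenvector has stabilized; stop early
        prev_delta = delta
    # 1-D "k-means" for k=2: split at the largest gap in the sorted values
    order = sorted(range(n), key=lambda i: v[i])
    gaps = [v[order[i + 1]] - v[order[i]] for i in range(n - 1)]
    cut = gaps.index(max(gaps))
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = 0 if rank <= cut else 1
    return labels

# Two tightly connected pairs, {0, 1} and {2, 3}, with weak cross edges
A = [[0.0, 1.0, 0.05, 0.0],
     [1.0, 0.0, 0.0, 0.05],
     [0.05, 0.0, 0.0, 0.5],
     [0.0, 0.05, 0.5, 0.0]]
labels = pic_cluster(A)
assert labels[0] == labels[1] and labels[2] == labels[3]
assert labels[0] != labels[2]
```

Note how the pseudo-eigenvector is a function of the whole input matrix: predicting for new rows requires re-running the iteration, which is exactly the model.predict difficulty raised in the comment.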
[jira] [Created] (SPARK-15969) FileNotFoundException: Multiple arguments for py-files flag, (also jars) for spark-submit
Kun Liu created SPARK-15969: --- Summary: FileNotFoundException: Multiple arguments for py-files flag, (also jars) for spark-submit Key: SPARK-15969 URL: https://issues.apache.org/jira/browse/SPARK-15969 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 1.6.1, 1.5.0 Environment: Mac OS X 10.11.5 Reporter: Kun Liu Priority: Minor First time opening a JIRA issue. Newbie to the Spark community. Correct me if I am wrong. Thanks. An exception, java.io.FileNotFoundException, happened when multiple arguments were specified for the --py-files (also --jars) flag. I searched for a while but only found a similar issue on Windows OS: https://issues.apache.org/jira/browse/SPARK-6435 My experiment environment was Mac OS X with Spark versions 1.5.0 and 1.6.1. 1.1 Observations: 1) Quoting makes no difference for the arguments; the result is always the same 2) The first path before the comma, as long as it is valid, won't be a problem whether it is absolute or relative 3) The second and further py-files paths won't be a problem if ALL of them: a. are relative paths under the same directory as the working directory ($PWD); OR b. are specified using an environment variable at the beginning, e.g. $ENV_VAR/path/to/file; OR c. are preprocessed by $(echo path/to/*.py | tr ' ' ','), no matter absolute or relative, as long as valid 4) The path of the driver program, assuming it is valid, does not matter, as it is a single file 1.2 Experiments: Assuming main.py calls functions from helper1.py and helper2.py, and all paths below are valid. ~/Desktop/testpath: main.py, helper1.py, helper2.py $SPARK_HOME/testpath: helper1.py, helper2.py 1) Successful output: a. 
Multiple python paths are relative paths under the same directory as the working directory cd $SPARK_HOME bin/spark-submit --py-files testpath/helper1.py,testpath/helper2.py ~/Desktop/testpath/main.py cd ~/Desktop $SPARK_HOME/bin/spark-submit --py-files testpath/helper1.py,testpath/helper2.py testpath/main.py b. Multiple python paths are specified using an environment variable export TEST_DIR=~/Desktop/testpath cd ~ $SPARK_HOME/bin/spark-submit --py-files $TEST_DIR/helper1.py,$TEST_DIR/helper2.py ~/Desktop/testpath/main.py cd ~/Documents $SPARK_HOME/bin/spark-submit --py-files $TEST_DIR/helper1.py,$TEST_DIR/helper2.py ~/Desktop/testpath/main.py c. Multiple paths (absolute or relative) after being preprocessed: $SPARK_HOME/bin/spark-submit --py-files $(echo $SPARK_HOME/testpath/helper*.py | tr ' ' ',') ~/Desktop/testpath/main.py cd ~/Desktop $SPARK_HOME/bin/spark-submit --py-files $(echo testpath/helper*.py | tr ' ' ',') ~/Desktop/testpath/main.py (reference link: http://stackoverflow.com/questions/24855368/spark-throws-classnotfoundexception-when-using-jars-option) 2) Failure output: if the second python path is a home-relative ("~/...") one; the same problem will happen for further such paths cd ~/Documents $SPARK_HOME/bin/spark-submit --py-files ~/Desktop/testpath/helper1.py,~/Desktop/testpath/helper2.py ~/Desktop/testpath/main.py py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.io.FileNotFoundException: Added file file:/Users/kunliu/Documents/~/Desktop/testpath/helper2.py does not exist. 1.3 Conclusions I would suggest that the --py-files flag of spark-submit support all path arguments, not just relative paths under the working directory. If necessary, I would like to submit a pull request and start working on it as my first contribution to the Spark community. 1.4 Note 1) I think the same issue will happen when multiple jar files delimited by commas are passed to the --jars flag for Java applications. 
2) I suggest wildcard path arguments should also be supported, as indicated by https://issues.apache.org/jira/browse/SPARK-3451 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
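The failure pattern above is consistent with shell tilde expansion rather than a path-validity check in Spark: the shell expands `~` only at the start of a word, so in `--py-files a.py,~/b.py` the second path reaches spark-submit literally and gets resolved against the working directory. A small standalone illustration (the paths mirror the ones in the report and are otherwise hypothetical):

```python
import os.path

# What spark-submit receives for the second path: the shell never expanded "~"
raw = "~/Desktop/testpath/helper2.py"
assert not os.path.isabs(raw)  # so Spark treats it as a relative path

# Resolving it against the working directory reproduces the path in the error
resolved = os.path.join("/Users/kunliu/Documents", raw)
assert resolved == "/Users/kunliu/Documents/~/Desktop/testpath/helper2.py"

# An application-side fix would be to expand "~" explicitly before resolving
assert os.path.isabs(os.path.expanduser(raw))
```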
[jira] [Commented] (SPARK-15964) Assignment to RDD-typed val fails
[ https://issues.apache.org/jira/browse/SPARK-15964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332224#comment-15332224 ] Michael Armbrust commented on SPARK-15964: -- Thanks for reporting this, but I believe this is actually specific to the databricks environment (i.e. it works in the spark shell). The issue here is that there is a scala compiler bug and as far as we know, you have two choices: - path dependent types work (i.e. you can refer to a type from another cell in the next cell) - multi line (:paste mode in the spark shell) commands work with SQL implicits. Many more workloads in notebooks depend on the latter, while the former is more common in the command line REPL. This is why the behavior differs. I'm hoping the scala 2.11 will give us the best of both worlds if we can fix https://issues.scala-lang.org/browse/SI-9799 > Assignment to RDD-typed val fails > - > > Key: SPARK-15964 > URL: https://issues.apache.org/jira/browse/SPARK-15964 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: Notebook on Databricks Community-Edition > Spark-2.0 preview > Google Chrome Browser > Linux Ubuntu 14.04 LTS >Reporter: Sanjay Dasgupta > > Unusual assignment error, giving the following error message: > found : org.apache.spark.rdd.RDD[Name] > required : org.apache.spark.rdd.RDD[Name] > This occurs when the assignment is attempted in a cell that is different from > the cell in which the item on the right-hand-side is defined. As in the > following example: > // CELL-1 > import org.apache.spark.sql.Dataset > import org.apache.spark.rdd.RDD > case class Name(number: Int, name: String) > val names = Seq(Name(1, "one"), Name(2, "two"), Name(3, "three"), Name(4, > "four")) > val dataset: Dataset[Name] = > spark.sparkContext.parallelize(names).toDF.as[Name] > // CELL-2 > // Error reported here ... 
> val dataRdd: RDD[Name] = dataset.rdd > The error is reported in CELL-2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15964) Assignment to RDD-typed val fails
[ https://issues.apache.org/jira/browse/SPARK-15964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15964. -- Resolution: Won't Fix > Assignment to RDD-typed val fails > - > > Key: SPARK-15964 > URL: https://issues.apache.org/jira/browse/SPARK-15964 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: Notebook on Databricks Community-Edition > Spark-2.0 preview > Google Chrome Browser > Linux Ubuntu 14.04 LTS >Reporter: Sanjay Dasgupta > > Unusual assignment error, giving the following error message: > found : org.apache.spark.rdd.RDD[Name] > required : org.apache.spark.rdd.RDD[Name] > This occurs when the assignment is attempted in a cell that is different from > the cell in which the item on the right-hand-side is defined. As in the > following example: > // CELL-1 > import org.apache.spark.sql.Dataset > import org.apache.spark.rdd.RDD > case class Name(number: Int, name: String) > val names = Seq(Name(1, "one"), Name(2, "two"), Name(3, "three"), Name(4, > "four")) > val dataset: Dataset[Name] = > spark.sparkContext.parallelize(names).toDF.as[Name] > // CELL-2 > // Error reported here ... > val dataRdd: RDD[Name] = dataset.rdd > The error is reported in CELL-2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15955) Failed Spark application returns with exitcode equals to zero
[ https://issues.apache.org/jira/browse/SPARK-15955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332276#comment-15332276 ] Thomas Graves commented on SPARK-15955: --- what master and deploy mode are you using? > Failed Spark application returns with exitcode equals to zero > - > > Key: SPARK-15955 > URL: https://issues.apache.org/jira/browse/SPARK-15955 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 >Reporter: Yesha Vora > > Scenario: > * Set up cluster with wire-encryption enabled. > * set 'spark.authenticate.enableSaslEncryption' = 'false' and > 'spark.shuffle.service.enabled' :'true' > * run sparkPi application. > {code} > client token: Token { kind: YARN_CLIENT_TOKEN, service: } > diagnostics: Max number of executor failures (3) reached > ApplicationMaster host: xx.xx.xx.xxx > ApplicationMaster RPC port: 0 > queue: default > start time: 1465941051976 > final status: FAILED > tracking URL: https://xx.xx.xx.xxx:8090/proxy/application_1465925772890_0016/ > user: hrt_qa > Exception in thread "main" org.apache.spark.SparkException: Application > application_1465925772890_0016 finished with failed status > at org.apache.spark.deploy.yarn.Client.run(Client.scala:1092) > at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1139) > at org.apache.spark.deploy.yarn.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > INFO ShutdownHookManager: Shutdown hook called{code} > This Spark application exits with exitcode = 0. A failed application should not > return exit code 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
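For context on why this matters (a generic illustration, not Spark code): workflow schedulers and wrapper scripts detect failure only through the child process exit status, so a failed application that exits 0 is indistinguishable from a successful one.

```python
import subprocess
import sys

# A child process that fails should propagate a nonzero exit status...
failed = subprocess.run([sys.executable, "-c", "raise SystemExit(3)"])
assert failed.returncode == 3

# ...while a clean run exits 0; callers can only distinguish the two this way
ok = subprocess.run([sys.executable, "-c", "pass"])
assert ok.returncode == 0
```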
[jira] [Updated] (SPARK-15811) Python UDFs do not work in Spark 2.0-preview built with scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-15811: --- Summary: Python UDFs do not work in Spark 2.0-preview built with scala 2.10 (was: UDFs do not work in Spark 2.0-preview built with scala 2.10) > Python UDFs do not work in Spark 2.0-preview built with scala 2.10 > -- > > Key: SPARK-15811 > URL: https://issues.apache.org/jira/browse/SPARK-15811 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Franklyn Dsouza >Assignee: Davies Liu >Priority: Blocker > > I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following > {code} > ./dev/change-version-to-2.10.sh > ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 > -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive > {code} > and then ran the following code in a pyspark shell > {code} > from pyspark.sql import SparkSession > from pyspark.sql.types import IntegerType, StructField, StructType > from pyspark.sql.functions import udf > from pyspark.sql.types import Row > spark = SparkSession.builder.master('local[4]').appName('2.0 > DF').getOrCreate() > add_one = udf(lambda x: x + 1, IntegerType()) > schema = StructType([StructField('a', IntegerType(), False)]) > df = spark.createDataFrame([Row(a=1),Row(a=2)], schema) > df.select(add_one(df.a).alias('incremented')).collect() > {code} > This never returns with a result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15935) Enable test for sql/streaming.py and fix these tests
[ https://issues.apache.org/jira/browse/SPARK-15935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-15935. -- Resolution: Fixed Fix Version/s: 2.0.0 > Enable test for sql/streaming.py and fix these tests > > > Key: SPARK-15935 > URL: https://issues.apache.org/jira/browse/SPARK-15935 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > Right now tests sql/streaming.py are disabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15970) WARNing message related to persisting table to Hive Metastore while Spark SQL is running in-memory catalog mode
Xin Wu created SPARK-15970: -- Summary: WARNing message related to persisting table to Hive Metastore while Spark SQL is running in-memory catalog mode Key: SPARK-15970 URL: https://issues.apache.org/jira/browse/SPARK-15970 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xin Wu Priority: Minor When we run Spark-shell in In-Memory catalog mode, creating a datasource table that is not compatible with Hive will show a warning message saying it cannot persist the table in a Hive-compatible way. However, In-Memory catalog mode should not involve trying to persist the table in the Hive metastore at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15959) Add the support of hive.metastore.warehouse.dir back
[ https://issues.apache.org/jira/browse/SPARK-15959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15959. - Resolution: Fixed Fix Version/s: 2.0.0 > Add the support of hive.metastore.warehouse.dir back > > > Key: SPARK-15959 > URL: https://issues.apache.org/jira/browse/SPARK-15959 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > Labels: release_notes, releasenotes > Fix For: 2.0.0 > > > Right now, we do not load this value at all > (https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSharedState.scala#L35-L41). > Let's maintain backward compatibility by loading it if Spark's warehouse > conf is not set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
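The fallback the issue asks for can be sketched outside Spark. The function below is an illustrative stand-in for the precedence described in the description (Spark's own warehouse conf first, then the Hive conf, then a default), not Spark's actual HiveSharedState code; the default value is also an assumption.

```python
# Illustrative sketch of the SPARK-15959 fallback chain, NOT Spark's
# actual HiveSharedState logic: Spark's own warehouse conf wins, then
# hive.metastore.warehouse.dir, then a built-in default.
DEFAULT_WAREHOUSE = "spark-warehouse"  # hypothetical default value

def resolve_warehouse_dir(spark_conf: dict, hadoop_conf: dict) -> str:
    return (spark_conf.get("spark.sql.warehouse.dir")
            or hadoop_conf.get("hive.metastore.warehouse.dir")
            or DEFAULT_WAREHOUSE)

# Spark's conf takes precedence when both are set:
print(resolve_warehouse_dir(
    {"spark.sql.warehouse.dir": "/data/spark-wh"},
    {"hive.metastore.warehouse.dir": "/user/hive/warehouse"}))
# The Hive conf is honored for backward compatibility when Spark's is unset:
print(resolve_warehouse_dir(
    {}, {"hive.metastore.warehouse.dir": "/user/hive/warehouse"}))
```

The point of the fix is exactly this ordering: existing Hive deployments keep working, but the new Spark-native setting wins once it is configured.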
[jira] [Updated] (SPARK-15969) FileNotFoundException: Multiple arguments for py-files flag, (also jars) for spark-submit
[ https://issues.apache.org/jira/browse/SPARK-15969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kun Liu updated SPARK-15969: Remaining Estimate: 120h (was: 168h) Original Estimate: 120h (was: 168h) > FileNotFoundException: Multiple arguments for py-files flag, (also jars) for > spark-submit > - > > Key: SPARK-15969 > URL: https://issues.apache.org/jira/browse/SPARK-15969 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.5.0, 1.6.1 > Environment: Mac OS X 10.11.5 >Reporter: Kun Liu >Priority: Minor > Original Estimate: 120h > Remaining Estimate: 120h > > First time to open a JIRA issue. Newbie to the Spark community. Correct me if > I was wrong. Thanks. > An exception, java.io.FileNotFoundException, happened when multiple arguments > were specified for the -py-files (also -jars) flag. > I searched for a while but only found a similar issue on Windows OS: > https://issues.apache.org/jira/browse/SPARK-6435 > My experiments environment was Mac OS X and Spark version 1.5.0 and 1.6.1 > 1.1 Observations: > 1) Quotation does not make any difference for the arguments, the result will > always be the same > 2) The first path before comma, as long as valid, won’t be a problem whether > it is an absolute or a relative path > 3) The second and further py-files paths won’t be a problem if ALL of them > are: > a. are relative paths under the same directory as the working directory > ($PWD); OR > b. specified by using environment variable at the beginning, e.g. > $ENV_VAR/path/to/file; OR > c. preprocessed by $(echo path/to/*.py | tr ' ' ‘,’), no matter > absolute or relative paths, as long as valid > 4) The path of the driver program, assuming valid, does not matter, as it is > a single file > 1.2 Experiments: > Assuming main.py calls functions from helper1.py and helper2.py, and all > paths below are valid. 
> ~/Desktop/testpath: main.py, helper1.py, helper2.py > $SPARK_HOME/testpath: helper1.py, helper2.py > 1) Successful output: > a. Multiple python paths are relative paths under the same directory as > the working directory > cd $SPARK_HOME > bin/spark-submit --py-files testpath/helper1.py,testpath/helper2.py > ~/Desktop/testpath/main.py > cd ~/Desktop > $SPARK_HOME/bin/spark-submit --py-files > testpath/helper1.py,testpath/helper2.py testpath/main.py > b. Multiple python paths are specified by using environment variable > export TEST_DIR=~/Desktop/testpath > cd ~ > $SPARK_HOME/bin/spark-submit --py-files > $TEST_DIR/helper1.py,$TEST_DIR/helper2.py ~/Desktop/testpath/main.py > > cd ~/Documents > $SPARK_HOME/bin/spark-submit --py-files > $TEST_DIR/helper1.py,$TEST_DIR/helper2.py ~/Desktop/testpath/main.py > c. Multiple paths (absolute or relative) after being preprocessed: > $SPARK_HOME/bin/spark-submit --py-files $(echo > $SPARK_HOME/testpath/helper*.py | tr ' ' ',') ~/Desktop/testpath/main.py > cd ~/Desktop > $SPARK_HOME/bin/spark-submit --py-files $(echo testpath/helper*.py | tr > ' ' ',') ~/Desktop/testpath/main.py > (reference link: > http://stackoverflow.com/questions/24855368/spark-throws-classnotfoundexception-when-using-jars-option) > 2) Failure output: if the second python path is an absolute one; the same > problem will happen for further paths > cd ~/Documents > $SPARK_HOME/bin/spark-submit --py-files > ~/Desktop/testpath/helper1.py,~/Desktop/testpath/helper2.py > ~/Desktop/testpath/main.py > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext. > : java.io.FileNotFoundException: Added file > file:/Users/kunliu/Documents/~/Desktop/testpath/helper2.py does not exist. > 1.3 Conclusions > I would suggest the py-files flag of spark-submit could support all absolute > paths arguments, not just relative path under the working directory. 
> If necessary, I would like to submit a pull request and start working on it > as my first contribution to the Spark community. > 1.4 Note > 1) I think the same issue will happen when multiple jar files delimited by > comma are passed to the --jars flag for Java applications. > 2) I suggest wildcard path arguments should also be supported, as indicated > by https://issues.apache.org/jira/browse/SPARK-3451 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
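The failing case above is consistent with how the shell expands `~`: a tilde is only expanded at the start of a word, so in `helper1.py,~/Desktop/testpath/helper2.py` the second `~` reaches spark-submit literally and is then resolved relative to the working directory (hence the `.../Documents/~/Desktop/...` path in the error). A minimal sketch of the normalization the reporter proposes — expanding each comma-separated entry before use. The helper name is hypothetical; this is not spark-submit's actual code:

```python
import os.path

def expand_py_files(py_files: str) -> str:
    """Expand a leading ~ in each comma-separated --py-files entry.

    Hypothetical helper illustrating the fix suggested in SPARK-15969;
    spark-submit does not currently do this.
    """
    return ",".join(os.path.expanduser(p) for p in py_files.split(","))

# The second ~ below is exactly what the shell fails to expand:
print(expand_py_files(
    "~/Desktop/testpath/helper1.py,~/Desktop/testpath/helper2.py"))
```

Paths without a tilde pass through unchanged, so the normalization is safe to apply unconditionally.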
[jira] [Resolved] (SPARK-13850) TimSort Comparison method violates its general contract
[ https://issues.apache.org/jira/browse/SPARK-13850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13850. - Resolution: Fixed Fix Version/s: 2.0.0 1.6.2 > TimSort Comparison method violates its general contract > --- > > Key: SPARK-13850 > URL: https://issues.apache.org/jira/browse/SPARK-13850 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.6.0 >Reporter: Sital Kedia >Assignee: Sameer Agarwal > Fix For: 1.6.2, 2.0.0 > > > While running a query which does a group by on a large dataset, the query > fails with following stack trace. > {code} > Job aborted due to stage failure: Task 4077 in stage 1.3 failed 4 times, most > recent failure: Lost task 4077.3 in stage 1.3 (TID 88702, > hadoop3030.prn2.facebook.com): java.lang.IllegalArgumentException: Comparison > method violates its general contract! > at > org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:794) > at > org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525) > at > org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453) > at > org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:228) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:186) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249) > at > org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:318) 
> at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:333) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Please note that the same query used to succeed in Spark 1.5 so it seems like > a regression in 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
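For context, Java's TimSort throws this IllegalArgumentException when the supplied comparator is not a consistent total order. NaN handling is one classic way such a violation arises; the comparator below is a Python illustration of the contract violation in general, not the actual Spark comparator fixed by this issue:

```python
from functools import cmp_to_key

def naive_cmp(a, b):
    # Sign-based comparison: with NaN both tests are False, so the
    # comparator returns 0 and NaN compares "equal" to every value.
    return (a > b) - (a < b)

nan = float("nan")
print(naive_cmp(1.0, nan), naive_cmp(nan, 2.0), naive_cmp(1.0, 2.0))  # 0 0 -1
# 1.0 "equals" NaN, NaN "equals" 2.0, yet 1.0 < 2.0: transitivity is broken.
# Python's TimSort silently produces an arbitrary order here; Java's TimSort
# detects the inconsistency during a merge and throws, as in the trace above.
_ = sorted([2.0, nan, 1.0, nan, 0.5], key=cmp_to_key(naive_cmp))
```

Whether such a bug surfaces depends on input size and ordering, which is why the query only failed on large group-by datasets.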
[jira] [Updated] (SPARK-15826) PipedRDD to allow configurable char encoding
[ https://issues.apache.org/jira/browse/SPARK-15826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated SPARK-15826: Summary: PipedRDD to allow configurable char encoding (was: PipedRDD to allow configurable char encoding (default: UTF-8)) > PipedRDD to allow configurable char encoding > > > Key: SPARK-15826 > URL: https://issues.apache.org/jira/browse/SPARK-15826 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Tejas Patil >Priority: Trivial > > Encountered an issue wherein the code works in some cluster but fails on > another one for the same input. After debugging realised that PipedRDD is > picking default char encoding from the JVM which may be different across > different platforms. Making it use UTF-8 encoding just like > `ScriptTransformation` does. > Stack trace: > {noformat} > Caused by: java.nio.charset.MalformedInputException: Input length = 1 > at java.nio.charset.CoderResult.throwException(CoderResult.java:281) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:161) > at java.io.BufferedReader.readLine(BufferedReader.java:324) > at java.io.BufferedReader.readLine(BufferedReader.java:389) > at > scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67) > at org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:185) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1612) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160) > at > org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868) > at > org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at 
org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
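The root cause — decoding the piped process's output with the JVM's platform-default charset — is easy to mimic outside Spark. In this sketch, ASCII stands in for an arbitrary platform default while the child process emits UTF-8; the MalformedInputException above is the Java counterpart of the failure shown here:

```python
# Bytes as a piped child process might emit them, encoded as UTF-8:
line = "café\n".encode("utf-8")

# Decoding with a platform-dependent charset (ASCII standing in for the
# JVM default) fails, mirroring java.nio.charset.MalformedInputException:
try:
    line.decode("ascii")
except UnicodeDecodeError as err:
    print("decode failed:", err.reason)

# Pinning the charset, as the fix does for PipedRDD, behaves the same
# on every platform:
print(line.decode("utf-8").strip())
```

This is why the same job worked on one cluster and failed on another: the data was identical, only the default charset differed.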
[jira] [Assigned] (SPARK-15970) WARNing message related to persisting table to Hive Metastore while Spark SQL is running in-memory catalog mode
[ https://issues.apache.org/jira/browse/SPARK-15970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15970: Assignee: Apache Spark > WARNing message related to persisting table to Hive Metastore while Spark SQL > is running in-memory catalog mode > --- > > Key: SPARK-15970 > URL: https://issues.apache.org/jira/browse/SPARK-15970 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu >Assignee: Apache Spark >Priority: Minor > > When we run Spark-shell in In-Memory catalog mode, creating a datasource > table that is not compatible with Hive will show a warning message saying > it cannot persist the table in a Hive-compatible way. However, In-Memory > catalog mode should not involve trying to persist the table in the Hive > metastore at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15970) WARNing message related to persisting table to Hive Metastore while Spark SQL is running in-memory catalog mode
[ https://issues.apache.org/jira/browse/SPARK-15970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15970: Assignee: (was: Apache Spark) > WARNing message related to persisting table to Hive Metastore while Spark SQL > is running in-memory catalog mode > --- > > Key: SPARK-15970 > URL: https://issues.apache.org/jira/browse/SPARK-15970 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu >Priority: Minor > > When we run Spark-shell in In-Memory catalog mode, creating a datasource > table that is not compatible with Hive will show a warning message saying > it cannot persist the table in a Hive-compatible way. However, In-Memory > catalog mode should not involve trying to persist the table in the Hive > metastore at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15970) WARNing message related to persisting table to Hive Metastore while Spark SQL is running in-memory catalog mode
[ https://issues.apache.org/jira/browse/SPARK-15970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332331#comment-15332331 ] Apache Spark commented on SPARK-15970: -- User 'xwu0226' has created a pull request for this issue: https://github.com/apache/spark/pull/13687 > WARNing message related to persisting table to Hive Metastore while Spark SQL > is running in-memory catalog mode > --- > > Key: SPARK-15970 > URL: https://issues.apache.org/jira/browse/SPARK-15970 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu >Priority: Minor > > When we run Spark-shell in In-Memory catalog mode, creating a datasource > table that is not compatible with Hive will show a warning message saying > it cannot persist the table in a Hive-compatible way. However, In-Memory > catalog mode should not involve trying to persist the table in the Hive > metastore at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15826) PipedRDD to allow configurable char encoding
[ https://issues.apache.org/jira/browse/SPARK-15826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-15826. -- Resolution: Fixed Assignee: Tejas Patil Fix Version/s: 2.0.0 > PipedRDD to allow configurable char encoding > > > Key: SPARK-15826 > URL: https://issues.apache.org/jira/browse/SPARK-15826 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Trivial > Fix For: 2.0.0 > > > Encountered an issue wherein the code works in some cluster but fails on > another one for the same input. After debugging realised that PipedRDD is > picking default char encoding from the JVM which may be different across > different platforms. Making it use UTF-8 encoding just like > `ScriptTransformation` does. > Stack trace: > {noformat} > Caused by: java.nio.charset.MalformedInputException: Input length = 1 > at java.nio.charset.CoderResult.throwException(CoderResult.java:281) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:161) > at java.io.BufferedReader.readLine(BufferedReader.java:324) > at java.io.BufferedReader.readLine(BufferedReader.java:389) > at > scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67) > at org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:185) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1612) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1160) > at > org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868) > at > org.apache.spark.SparkContext$$anonfun$runJob$6.apply(SparkContext.scala:1868) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at 
org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15971) GroupedData's member incorrectly named
Vladimir Feinberg created SPARK-15971: - Summary: GroupedData's member incorrectly named Key: SPARK-15971 URL: https://issues.apache.org/jira/browse/SPARK-15971 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.0.0, 2.1.0 Reporter: Vladimir Feinberg Priority: Trivial The [[pyspark.sql.GroupedData]] object calls the Java object it wraps around as the member variable [[self._jdf]], which is exactly the same as [[pyspark.sql.DataFrame]], when referring to its object. The acronym is incorrect, standing for "Java DataFrame" instead of what should be "Java GroupedData". As such, the name should be changed to [[self._jgd]] - in fact, in the [[DataFrame.groupBy]] implementation, the Java object is referred to as exactly [[jgd]] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15972) GroupedData varargs arguments misnamed
Vladimir Feinberg created SPARK-15972: - Summary: GroupedData varargs arguments misnamed Key: SPARK-15972 URL: https://issues.apache.org/jira/browse/SPARK-15972 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.0.0, 2.1.0 Reporter: Vladimir Feinberg Priority: Trivial Simple aggregation functions which take column names [[cols]] as varargs arguments show up in documentation with the argument [[args]], but their documentation refers to [[cols]]. The discrepancy is caused by an annotation, [[df_varargs_api]], which produces a temporary function with arguments [[args]] instead of [[cols]], creating the confusing documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
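The pattern behind this bug is a decorator that replaces the documented function with a generic `*args` shim without copying its metadata, so doc tools render the shim's signature instead of the original one. The sketch below is a hypothetical reconstruction of the problem, not PySpark's actual `df_varargs_api`; `functools.wraps` shows one way the confusion can be avoided:

```python
import functools
import inspect

def varargs_api_naive(f):
    """Wrap f in a *args shim WITHOUT copying metadata (the bug)."""
    def _api(self, *args):
        return f(self, list(args))
    return _api

def varargs_api_fixed(f):
    """Same shim, but functools.wraps preserves the name, docstring and,
    via __wrapped__, the original signature seen by doc tools."""
    def _api(self, *cols):
        return f(self, list(cols))
    return functools.wraps(f)(_api)

class GroupedNaive:
    @varargs_api_naive
    def mean(self, cols):
        """Compute the mean of each column in cols."""
        return ("mean", cols)

class GroupedFixed:
    @varargs_api_fixed
    def mean(self, cols):
        """Compute the mean of each column in cols."""
        return ("mean", cols)

print(inspect.signature(GroupedNaive.mean))  # (self, *args) -- misleading docs
print(GroupedNaive.mean.__doc__)             # None -- docstring lost
print(inspect.signature(GroupedFixed.mean))  # (self, cols) via __wrapped__
print(GroupedFixed.mean.__doc__)             # original docstring preserved
```

Both variants behave identically at call time; only the introspected metadata, and therefore the generated documentation, differs.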
[jira] [Created] (SPARK-15973) GroupedData.pivot documentation off
Vladimir Feinberg created SPARK-15973: - Summary: GroupedData.pivot documentation off Key: SPARK-15973 URL: https://issues.apache.org/jira/browse/SPARK-15973 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.0.0, 2.1.0 Reporter: Vladimir Feinberg Priority: Trivial {{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for doctest Python comments, which messes up formatting in the documentation as well as the doctests themselves. A PR resolving this should probably resolve the other places this happens in pyspark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
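The reason `//` breaks a doctest: everything after `>>>` is handed to the Python parser, and `//` is the floor-division operator, not a comment marker, so the "comment" is parsed as code. A small illustration using the doctest parser:

```python
import doctest

# '//' is Python's floor-division operator, so a doctest line such as
# '>>> df.pivot("year")  // some note' is parsed as code and fails.
# Only '#' comments are safe inside doctest examples:
GOOD = '''
>>> total = 2 + 3  # a '#' comment is ignored by the interpreter
>>> total
5
'''

examples = doctest.DocTestParser().get_examples(GOOD)
print(len(examples))   # both lines parsed cleanly as examples
print(7 // 2)          # whereas '//' floor-divides
```

With `//` in place of `#`, the first example would raise at execution time (the trailing text is evaluated as an operand), failing the doctest and garbling the rendered docs.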
[jira] [Updated] (SPARK-15971) GroupedData's member incorrectly named
[ https://issues.apache.org/jira/browse/SPARK-15971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-15971: -- Description: The {{pyspark.sql.GroupedData}} object calls the Java object it wraps around as the member variable {{self._jdf}}, which is exactly the same as {{pyspark.sql.DataFrame}}, when referring its object. The acronym is incorrect, standing for "Java DataFrame" instead of what should be "Java GroupedData". As such, the name should be changed to {{self._jgd}} - in fact, in the {{DataFrame.groupBy}} implementation, the java object is referred to as exactly {{jgd}} was: The [[pyspark.sql.GroupedData]] object calls the Java object it wraps around as the member variable [[self._jdf]], which is exactly the same as [[pyspark.sql.DataFrame]], when referring its object. The acronym is incorrect, standing for "Java DataFrame" instead of what should be "Java GroupedData". As such, the name should be changed to [[self._jgd]] - in fact, in the [[DataFrame.groupBy]] implementation, the java object is referred to as exactly [[jgd]] > GroupedData's member incorrectly named > -- > > Key: SPARK-15971 > URL: https://issues.apache.org/jira/browse/SPARK-15971 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > The {{pyspark.sql.GroupedData}} object calls the Java object it wraps around > as the member variable {{self._jdf}}, which is exactly the same as > {{pyspark.sql.DataFrame}}, when referring its object. > The acronym is incorrect, standing for "Java DataFrame" instead of what > should be "Java GroupedData". 
As such, the name should be changed to > {{self._jgd}} - in fact, in the {{DataFrame.groupBy}} implementation, the > java object is referred to as exactly {{jgd}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15972) GroupedData varargs arguments misnamed
[ https://issues.apache.org/jira/browse/SPARK-15972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-15972: -- Description: Simple aggregation functions which take column names {{cols}} as varargs arguments show up in documentation with the argument {{args}}, but their documentation refers to {{cols}}. The discrepancy is caused by an annotation, {{df_varargs_api}}, which produces a temporary function with arguments {{args}} instead of {{cols}}, creating the confusing documentation. was: Simple aggregation functions which take column names [[cols]] as varargs arguments show up in documentation with the argument [[args]], but their documentation refers to [[cols]]. The discrepancy is caused by an annotation, [[df_varargs_api]], which produces a temporary function with arguments [[args]] instead of [[cols]], creating the confusing documentation. > GroupedData varargs arguments misnamed > -- > > Key: SPARK-15972 > URL: https://issues.apache.org/jira/browse/SPARK-15972 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > > Simple aggregation functions which take column names {{cols}} as varargs > arguments show up in documentation with the argument {{args}}, but their > documentation refers to {{cols}}. > The discrepancy is caused by an annotation, {{df_varargs_api}}, which > produces a temporary function with arguments {{args}} instead of {{cols}}, > creating the confusing documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15973) GroupedData.pivot documentation off
[ https://issues.apache.org/jira/browse/SPARK-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332364#comment-15332364 ] Sean Owen commented on SPARK-15973: --- Please group your 3 JIRAs into one. They sound so similar that they should not be separate issues. I'll resolve 2 as duplicates shortly. > GroupedData.pivot documentation off > --- > > Key: SPARK-15973 > URL: https://issues.apache.org/jira/browse/SPARK-15973 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > {{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for doctest > Python comments, which messes up formatting in the documentation as well as > the doctests themselves. > A PR resolving this should probably resolve the other places this happens in > pyspark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15965) No FileSystem for scheme: s3n or s3a spark-2.0.0 and spark-1.6.1
[ https://issues.apache.org/jira/browse/SPARK-15965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332369#comment-15332369 ] Sean Owen commented on SPARK-15965: --- CC [~steve_l] but I think this is your classpath issue or a Hadoop issue, not Spark. > No FileSystem for scheme: s3n or s3a spark-2.0.0 and spark-1.6.1 > - > > Key: SPARK-15965 > URL: https://issues.apache.org/jira/browse/SPARK-15965 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.1 > Environment: Debian GNU/Linux 8 > java version "1.7.0_79" >Reporter: thauvin damien > Original Estimate: 8h > Remaining Estimate: 8h > > The Spark programming guide explains that Spark can create distributed > datasets on Amazon S3. > But since the pre-built "Hadoop 2.6" the S3 access doesn't work with s3n or > s3a. > sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", "XXXZZZHHH") > sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", > "xxx") > val > lines=sc.textFile("s3a://poc-XXX/access/2016/02/20160201202001_xxx.log.gz") > java.lang.RuntimeException: java.lang.ClassNotFoundException: Class > org.apache.hadoop.fs.s3a.S3AFileSystem not found > Any version of Spark: spark-1.3.1; spark-1.6.1; even spark-2.0.0 with > hadoop.7.2. > I understand this is a Hadoop issue (SPARK-7442) but can you make some > documentation to explain what jar we need to add and where? (for standalone > installation). > Are "hadoop-aws-x.x.x.jar and aws-java-sdk-x.x.x.jar" enough? > What env variable do we need to set and what file do we need to modify? > Is it "$CLASSPATH" or a variable in "spark-defaults.conf" with variable > "spark.driver.extraClassPath" and "spark.executor.extraClassPath"? > But it still works with spark-1.6.1 pre-built with hadoop2.4 > Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15973) Fix GroupedData Documentation
[ https://issues.apache.org/jira/browse/SPARK-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-15973: -- Summary: Fix GroupedData Documentation (was: GroupedData.pivot documentation off) > Fix GroupedData Documentation > - > > Key: SPARK-15973 > URL: https://issues.apache.org/jira/browse/SPARK-15973 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > {{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for doctest > Python comments, which messes up formatting in the documentation as well as > the doctests themselves. > A PR resolving this should probably resolve the other places this happens in > pyspark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15966) Fix markdown for Spark Monitoring
[ https://issues.apache.org/jira/browse/SPARK-15966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332371#comment-15332371 ] Sean Owen commented on SPARK-15966: --- Please make a more descriptive JIRA, and/or submit a PR. > Fix markdown for Spark Monitoring > - > > Key: SPARK-15966 > URL: https://issues.apache.org/jira/browse/SPARK-15966 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.0.0 >Reporter: Dhruve Ashar >Priority: Trivial > > The markdown for Spark monitoring needs to be fixed. > http://spark.apache.org/docs/2.0.0-preview/monitoring.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15973) Fix GroupedData Documentation
[ https://issues.apache.org/jira/browse/SPARK-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-15973: -- Description: (1) {{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for Python doctest comments, which messes up formatting in the documentation as well as the doctests themselves. A PR resolving this should probably resolve the other places this happens in pyspark. (2) Simple aggregation functions which take column names {{cols}} as varargs arguments show up in documentation with the argument {{args}}, but their documentation refers to {{cols}}. The discrepancy is caused by an annotation, {{df_varargs_api}}, which produces a temporary function with arguments {{args}} instead of {{cols}}, creating the confusing documentation. (3) The {{pyspark.sql.GroupedData}} object refers to the Java object it wraps as the member variable {{self._jdf}}, which is exactly the same name {{pyspark.sql.DataFrame}} uses when referring to its wrapped object. The acronym is incorrect, standing for "Java DataFrame" instead of what should be "Java GroupedData". As such, the name should be changed to {{self._jgd}} - in fact, in the {{DataFrame.groupBy}} implementation, the Java object is referred to as exactly {{jgd}} was: {{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for Python doctest comments, which messes up formatting in the documentation as well as the doctests themselves. A PR resolving this should probably resolve the other places this happens in pyspark. 
> Fix GroupedData Documentation > - > > Key: SPARK-15973 > URL: https://issues.apache.org/jira/browse/SPARK-15973 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > (1) > {{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for Python > doctest comments, which messes up formatting in the documentation as well as > the doctests themselves. > A PR resolving this should probably resolve the other places this happens in > pyspark. > (2) > Simple aggregation functions which take column names {{cols}} as varargs > arguments show up in documentation with the argument {{args}}, but their > documentation refers to {{cols}}. > The discrepancy is caused by an annotation, {{df_varargs_api}}, which > produces a temporary function with arguments {{args}} instead of {{cols}}, > creating the confusing documentation. > (3) > The {{pyspark.sql.GroupedData}} object refers to the Java object it wraps > as the member variable {{self._jdf}}, which is exactly the same name > {{pyspark.sql.DataFrame}} uses when referring to its wrapped object. > The acronym is incorrect, standing for "Java DataFrame" instead of what > should be "Java GroupedData". As such, the name should be changed to > {{self._jgd}} - in fact, in the {{DataFrame.groupBy}} implementation, the > Java object is referred to as exactly {{jgd}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
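[Editor's note] To make point (1) concrete, here is a hypothetical standalone doctest, not Spark code. In Python, `//` is the floor-division operator rather than a comment marker, so a `//` "comment" in a doctest changes the expression's meaning and breaks the test; only `#` keeps both the rendered documentation and the doctest working:

```python
import doctest

# Minimal, hypothetical example: doctest comments must use '#', because
# '//' is floor division in Python, not a comment marker.
def add_one(x):
    """Increment x.

    >>> add_one(1)  # doctest comments use '#', like any Python code
    2
    """
    return x + 1

# Run the module's doctests; a '//' in place of '#' above would make the
# example expression evaluate differently (or raise) and the test would fail.
results = doctest.testmod()
```

This is why the fix is a mechanical substitution of `#` for `//` in the affected docstrings.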
[jira] [Resolved] (SPARK-15972) GroupedData varargs arguments misnamed
[ https://issues.apache.org/jira/browse/SPARK-15972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg resolved SPARK-15972. --- Resolution: Duplicate > GroupedData varargs arguments misnamed > -- > > Key: SPARK-15972 > URL: https://issues.apache.org/jira/browse/SPARK-15972 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > > Simple aggregation functions which take column names {{cols}} as varargs > arguments show up in documentation with the argument {{args}}, but their > documentation refers to {{cols}}. > The discrepancy is caused by an annotation, {{df_varargs_api}}, which > produces a temporary function with arguments {{args}} instead of {{cols}}, > creating the confusing documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15972) GroupedData varargs arguments misnamed
[ https://issues.apache.org/jira/browse/SPARK-15972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg closed SPARK-15972. - > GroupedData varargs arguments misnamed > -- > > Key: SPARK-15972 > URL: https://issues.apache.org/jira/browse/SPARK-15972 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > > Simple aggregation functions which take column names {{cols}} as varargs > arguments show up in documentation with the argument {{args}}, but their > documentation refers to {{cols}}. > The discrepancy is caused by an annotation, {{df_varargs_api}}, which > produces a temporary function with arguments {{args}} instead of {{cols}}, > creating the confusing documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
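[Editor's note] The args/cols discrepancy described above is the classic decorator pitfall. A hedged sketch follows with invented names (`df_varargs_api_sketch`, `GroupedDataSketch`), not the actual pyspark implementation: a varargs wrapper can keep both the wrapped function's name/docstring (via `functools.wraps`) and a `cols`-consistent parameter name, so generated documentation matches what the docstring describes.

```python
import functools

# Hypothetical stand-in for a varargs API decorator: without functools.wraps,
# help() would show the wrapper's own name and docstring; naming the varargs
# parameter `cols` keeps the signature consistent with the documentation.
def df_varargs_api_sketch(f):
    @functools.wraps(f)         # preserve f's __name__ and __doc__
    def _api(self, *cols):      # varargs named `cols`, matching the docs
        return f(self, *cols)
    return _api

class GroupedDataSketch:
    @df_varargs_api_sketch
    def mean(self, *cols):
        """Computes the average for each numeric column in ``cols``."""
        return tuple(cols)

g = GroupedDataSketch()
print(GroupedDataSketch.mean.__name__)   # wrapped name survives: "mean"
print(g.mean("a", "b"))
```

The same pattern, applied inside the real decorator, would make the rendered docs show `cols` instead of `args`.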
[jira] [Closed] (SPARK-15971) GroupedData's member incorrectly named
[ https://issues.apache.org/jira/browse/SPARK-15971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg closed SPARK-15971. - > GroupedData's member incorrectly named > -- > > Key: SPARK-15971 > URL: https://issues.apache.org/jira/browse/SPARK-15971 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > The {{pyspark.sql.GroupedData}} object refers to the Java object it wraps > as the member variable {{self._jdf}}, which is exactly the same name > {{pyspark.sql.DataFrame}} uses when referring to its wrapped object. > The acronym is incorrect, standing for "Java DataFrame" instead of what > should be "Java GroupedData". As such, the name should be changed to > {{self._jgd}} - in fact, in the {{DataFrame.groupBy}} implementation, the > Java object is referred to as exactly {{jgd}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15971) GroupedData's member incorrectly named
[ https://issues.apache.org/jira/browse/SPARK-15971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg resolved SPARK-15971. --- Resolution: Duplicate > GroupedData's member incorrectly named > -- > > Key: SPARK-15971 > URL: https://issues.apache.org/jira/browse/SPARK-15971 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > The {{pyspark.sql.GroupedData}} object refers to the Java object it wraps > as the member variable {{self._jdf}}, which is exactly the same name > {{pyspark.sql.DataFrame}} uses when referring to its wrapped object. > The acronym is incorrect, standing for "Java DataFrame" instead of what > should be "Java GroupedData". As such, the name should be changed to > {{self._jgd}} - in fact, in the {{DataFrame.groupBy}} implementation, the > Java object is referred to as exactly {{jgd}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15947) Make pipeline components backward compatible with old vector columns in Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-15947: - Assignee: Xiangrui Meng > Make pipeline components backward compatible with old vector columns in > Scala/Java > -- > > Key: SPARK-15947 > URL: https://issues.apache.org/jira/browse/SPARK-15947 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > After SPARK-15945, we should make ALL pipeline components accept old vector > columns as input and do the conversion automatically (probably with a warning > message), in order to smooth the migration to 2.0. Note that this includes > loading old saved models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15973) Fix GroupedData Documentation
[ https://issues.apache.org/jira/browse/SPARK-15973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332374#comment-15332374 ] Vladimir Feinberg commented on SPARK-15973: --- Done > Fix GroupedData Documentation > - > > Key: SPARK-15973 > URL: https://issues.apache.org/jira/browse/SPARK-15973 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.1.0 >Reporter: Vladimir Feinberg >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > (1) > {{GroupedData.pivot}} documentation uses {{//}} instead of {{#}} for Python > doctest comments, which messes up formatting in the documentation as well as > the doctests themselves. > A PR resolving this should probably resolve the other places this happens in > pyspark. > (2) > Simple aggregation functions which take column names {{cols}} as varargs > arguments show up in documentation with the argument {{args}}, but their > documentation refers to {{cols}}. > The discrepancy is caused by an annotation, {{df_varargs_api}}, which > produces a temporary function with arguments {{args}} instead of {{cols}}, > creating the confusing documentation. > (3) > The {{pyspark.sql.GroupedData}} object refers to the Java object it wraps > as the member variable {{self._jdf}}, which is exactly the same name > {{pyspark.sql.DataFrame}} uses when referring to its wrapped object. > The acronym is incorrect, standing for "Java DataFrame" instead of what > should be "Java GroupedData". As such, the name should be changed to > {{self._jgd}} - in fact, in the {{DataFrame.groupBy}} implementation, the > Java object is referred to as exactly {{jgd}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15811) Python UDFs do not work in Spark 2.0-preview built with scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-15811: --- Description: I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following {code} ./dev/change-version-to-2.10.sh ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive {code} and then ran the following code in a pyspark shell {code} from pyspark.sql.types import IntegerType, StructField, StructType from pyspark.sql.functions import udf from pyspark.sql.types import Row spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate() add_one = udf(lambda x: x + 1, IntegerType()) schema = StructType([StructField('a', IntegerType(), False)]) df = sqlContext.createDataFrame([Row(a=1),Row(a=2)], schema) df.select(add_one(df.a).alias('incremented')).collect() {code} This never returns with a result. was: I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following {code} ./dev/change-version-to-2.10.sh ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive {code} and then ran the following code in a pyspark shell {code} from pyspark.sql import SparkSession from pyspark.sql.types import IntegerType, StructField, StructType from pyspark.sql.functions import udf from pyspark.sql.types import Row spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate() add_one = udf(lambda x: x + 1, IntegerType()) schema = StructType([StructField('a', IntegerType(), False)]) df = spark.createDataFrame([Row(a=1),Row(a=2)], schema) df.select(add_one(df.a).alias('incremented')).collect() {code} This never returns with a result. 
> Python UDFs do not work in Spark 2.0-preview built with scala 2.10 > -- > > Key: SPARK-15811 > URL: https://issues.apache.org/jira/browse/SPARK-15811 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Franklyn Dsouza >Assignee: Davies Liu >Priority: Blocker > > I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following > {code} > ./dev/change-version-to-2.10.sh > ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 > -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive > {code} > and then ran the following code in a pyspark shell > {code} > from pyspark.sql.types import IntegerType, StructField, StructType > from pyspark.sql.functions import udf > from pyspark.sql.types import Row > spark = SparkSession.builder.master('local[4]').appName('2.0 > DF').getOrCreate() > add_one = udf(lambda x: x + 1, IntegerType()) > schema = StructType([StructField('a', IntegerType(), False)]) > df = sqlContext.createDataFrame([Row(a=1),Row(a=2)], schema) > df.select(add_one(df.a).alias('incremented')).collect() > {code} > This never returns with a result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3847) Enum.hashCode is only consistent within the same JVM
[ https://issues.apache.org/jira/browse/SPARK-3847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332453#comment-15332453 ] Brett Stime commented on SPARK-3847: Seems like, rather than a warning about the specifics of enums, the real fix (as mentioned in the highest voted answer to the question posted in the description--http://stackoverflow.com/a/4885292/93345) is to stop comparing hashCodes across distinct JVMs. In the worst case, perhaps the underlying keys should be deserialized in the target JVM and have their hashCodes recomputed. Alternatively, it should also work to create an implementation of hashCode and equals for the serialized bytes. > Enum.hashCode is only consistent within the same JVM > > > Key: SPARK-3847 > URL: https://issues.apache.org/jira/browse/SPARK-3847 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: Oracle JDK 7u51 64bit on Ubuntu 12.04 >Reporter: Nathan Bijnens > Labels: enum > > When using Java Enums as keys in some operations, the results can be very > unexpected. The issue is that Java's Enum.hashCode returns an identity-based > value (effectively a memory position), which is different on each JVM. > {code} > messages.filter(_.getHeader.getKind == Kind.EVENT).count > >> 503650 > val tmp = messages.filter(_.getHeader.getKind == Kind.EVENT) > tmp.map(_.getHeader.getKind).countByValue > >> Map(EVENT -> 1389) > {code} > Because it's actually a JVM issue, we should either reject enums as keys > with an error or implement a workaround. 
> A good writeup of the issue (with a workaround) can be found here: > http://dev.bizo.com/2014/02/beware-enums-in-spark.html > More on hash codes and Enums: > https://stackoverflow.com/questions/4885095/what-is-the-reason-behind-enum-hashcode > And some issues (most of them rejected) in the Oracle Java bug database: > - http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8050217 > - http://bugs.java.com/bugdatabase/view_bug.do?bug_id=7190798 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
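[Editor's note] The workaround the linked writeup recommends is language-agnostic: key on a stable value derived from the enum (its declared name) rather than on the enum instance whose hash is identity-based. The JIRA's examples are Scala/Java; the pattern is sketched here in Python with hypothetical names purely for illustration.

```python
from collections import Counter
from enum import Enum

# Hypothetical enum standing in for the Kind enum from the bug report.
class Kind(Enum):
    EVENT = 1
    METRIC = 2

records = [Kind.EVENT, Kind.EVENT, Kind.METRIC]

# Unsafe in the distributed Spark case: partitioning on the enum value's
# hash, which for Java enums differs between JVMs.
# Safe: derive a stable key (the constant's declared name) before grouping,
# so every machine computes the same partition for the same constant.
counts = Counter(r.name for r in records)
print(counts["EVENT"], counts["METRIC"])
```

Mapping to `name` (or `ordinal`) before any shuffle makes the grouping key deterministic across processes.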
[jira] [Commented] (SPARK-15574) Python meta-algorithms in Scala
[ https://issues.apache.org/jira/browse/SPARK-15574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332475#comment-15332475 ] Xusen Yin commented on SPARK-15574: --- I just finished the prototype of PythonTransformer in Scala, a transformer wrapper for pure Python transformers. It works well if I run it alone from the Scala side, but if I chain the PythonTransformer with other transformers/estimators in a Pipeline, it fails because transformSchema is missing on the Python side. AFAIK, we need to add transformSchema in Python ML for pure Python PipelineStages. [~josephkb] [~mengxr] > Python meta-algorithms in Scala > --- > > Key: SPARK-15574 > URL: https://issues.apache.org/jira/browse/SPARK-15574 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley > > This is an experimental idea for implementing Python ML meta-algorithms > (CrossValidator, TrainValidationSplit, Pipeline, OneVsRest, etc.) in Scala. > This would require a Scala wrapper for algorithms implemented in Python, > somewhat analogous to Python UDFs. > The benefit of this change would be that we could avoid currently awkward > conversions between Scala/Python meta-algorithms required for persistence. > It would let us have full support for Python persistence and would generally > simplify the implementation within MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
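[Editor's note] To make the missing piece concrete: a pipeline validates the chain of stages by propagating schemas before it ever sees data, which is the contract Scala's transformSchema provides. The framework-free sketch below uses invented names (it is not the pyspark.ml API): a schema is modeled as a dict of column name to type name, and a stage declares its output schema up front so that chaining can fail fast.

```python
# Hypothetical, framework-free sketch of the transformSchema contract:
# transform_schema validates and extends the schema without touching rows,
# so a pipeline can type-check a whole chain of stages before fitting.
class AddLengthColumn:
    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def transform_schema(self, schema):
        # Fail fast at pipeline-validation time, before any data is seen.
        if schema.get(self.input_col) != "string":
            raise ValueError(f"column {self.input_col!r} must be string")
        return {**schema, self.output_col: "int"}

    def transform(self, rows):
        return [{**r, self.output_col: len(r[self.input_col])} for r in rows]

stage = AddLengthColumn("text", "text_len")
out_schema = stage.transform_schema({"text": "string"})
print(out_schema)                                   # schema gains text_len
rows = stage.transform([{"text": "spark"}])
print(rows[0]["text_len"])                          # 5
```

A Scala Pipeline wrapping pure-Python stages needs exactly this schema-only hook, which is why chaining fails when only transform exists.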
[jira] [Commented] (SPARK-12492) SQL page of Spark-sql is always blank
[ https://issues.apache.org/jira/browse/SPARK-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332509#comment-15332509 ] Apache Spark commented on SPARK-12492: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/13689 > SQL page of Spark-sql is always blank > -- > > Key: SPARK-12492 > URL: https://issues.apache.org/jira/browse/SPARK-12492 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Reporter: meiyoula > Attachments: screenshot-1.png > > When I run a SQL query in spark-sql, the Execution page of the SQL tab is always > blank, but the JDBCServer's is not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15888) Python UDF over aggregate fails
[ https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-15888. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13682 [https://github.com/apache/spark/pull/13682] > Python UDF over aggregate fails > --- > > Key: SPARK-15888 > URL: https://issues.apache.org/jira/browse/SPARK-15888 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Vladimir Feinberg >Assignee: Davies Liu >Priority: Blocker > Fix For: 2.0.0 > > > This looks like a regression from 1.6.1. > The following notebook runs without error in a Spark 1.6.1 cluster, but fails > in 2.0.0: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15767) Decision Tree Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15767: Assignee: Kai Jiang (was: Apache Spark) > Decision Tree Regression wrapper in SparkR > -- > > Key: SPARK-15767 > URL: https://issues.apache.org/jira/browse/SPARK-15767 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Kai Jiang >Assignee: Kai Jiang > > Implement a wrapper in SparkR to support decision tree regression. R's native > Decision Tree Regression implementation comes from the package rpart, with signature > rpart(formula, dataframe, method="anova"). I propose we implement an API > like spark.decisionTreeRegression(dataframe, formula, ...). After having > implemented decision tree classification, we could refactor these two into an > API more like rpart() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15767) Decision Tree Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332519#comment-15332519 ] Apache Spark commented on SPARK-15767: -- User 'vectorijk' has created a pull request for this issue: https://github.com/apache/spark/pull/13690 > Decision Tree Regression wrapper in SparkR > -- > > Key: SPARK-15767 > URL: https://issues.apache.org/jira/browse/SPARK-15767 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Kai Jiang >Assignee: Kai Jiang > > Implement a wrapper in SparkR to support decision tree regression. R's native > Decision Tree Regression implementation comes from the package rpart, with signature > rpart(formula, dataframe, method="anova"). I propose we implement an API > like spark.decisionTreeRegression(dataframe, formula, ...). After having > implemented decision tree classification, we could refactor these two into an > API more like rpart() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15767) Decision Tree Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15767: Assignee: Apache Spark (was: Kai Jiang) > Decision Tree Regression wrapper in SparkR > -- > > Key: SPARK-15767 > URL: https://issues.apache.org/jira/browse/SPARK-15767 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Kai Jiang >Assignee: Apache Spark > > Implement a wrapper in SparkR to support decision tree regression. R's native > Decision Tree Regression implementation comes from the package rpart, with signature > rpart(formula, dataframe, method="anova"). I propose we implement an API > like spark.decisionTreeRegression(dataframe, formula, ...). After having > implemented decision tree classification, we could refactor these two into an > API more like rpart() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15457) Eliminate MLlib 2.0 build warnings from deprecations
[ https://issues.apache.org/jira/browse/SPARK-15457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-15457. --- Resolution: Fixed Fix Version/s: 2.0.0 I'm going to go ahead and close this. We can create a new JIRA if there are more to clean up. > Eliminate MLlib 2.0 build warnings from deprecations > > > Key: SPARK-15457 > URL: https://issues.apache.org/jira/browse/SPARK-15457 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > Fix For: 2.0.0 > > > Several classes and methods have been deprecated and are creating lots of > build warnings in branch-2.0. This issue is to identify and fix those items: > * *WithSGD classes: Change to make class not deprecated, object deprecated, > and public class constructor deprecated. Any public use will require a > deprecated API. We need to keep a non-deprecated private API since we cannot > eliminate certain uses: Python API, streaming algs, and examples. > ** Use in PythonMLlibAPI: Change to using private constructors > ** Streaming algs: No warnings after we un-deprecate the classes > ** Examples: Deprecate or change ones which use deprecated APIs > * MulticlassMetrics fields (precision, etc.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15786) joinWith bytecode generation calling ByteBuffer.wrap with InternalRow
[ https://issues.apache.org/jira/browse/SPARK-15786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332527#comment-15332527 ] Sean Zhong commented on SPARK-15786: Basically, what you described can be shortened to: {code} scala> val ds = Seq((1,1) -> (1, 1)).toDS() res4: org.apache.spark.sql.Dataset[((Int, Int), (Int, Int))] = [_1: struct<_1: int, _2: int>, _2: struct<_1: int, _2: int>] scala> implicit val enc = Encoders.tuple(Encoders.kryo[Option[(Int, Int)]], Encoders.kryo[Option[(Int, Int)]]) enc: org.apache.spark.sql.Encoder[(Option[(Int, Int)], Option[(Int, Int)])] = class[_1[0]: binary, _2[0]: binary] scala> ds.as[(Option[(Int, Int)], Option[(Int, Int)])].collect() {code} > joinWith bytecode generation calling ByteBuffer.wrap with InternalRow > - > > Key: SPARK-15786 > URL: https://issues.apache.org/jira/browse/SPARK-15786 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Richard Marscher > > {code}java.lang.RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 36, Column 107: No applicable constructor/method found > for actual parameters "org.apache.spark.sql.catalyst.InternalRow"; candidates > are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", > "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, > int)"{code} > I have been trying to use joinWith along with Option data types to get an > approximation of the RDD semantics for outer joins with Dataset to have a > nicer API for Scala. However, using the Dataset.as[] syntax leads to bytecode > generation trying to pass an InternalRow object into the ByteBuffer.wrap > function, which expects byte[] with or without a couple of int qualifiers. 
> I have a notebook reproducing this against 2.0 preview in Databricks > Community Edition: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/160347920874755/1039589581260901/673639177603143/latest.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15786) joinWith bytecode generation calling ByteBuffer.wrap with InternalRow
[ https://issues.apache.org/jira/browse/SPARK-15786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332532#comment-15332532 ] Sean Zhong commented on SPARK-15786: The exception stack is: {code} scala> res4.as[(Option[(Int, Int)], Option[(Int, Int)])].collect() 16/06/15 13:46:18 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 37, Column 109: No applicable constructor/method found for actual parameters "org.apache.spark.sql.catalyst.InternalRow"; candidates are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, int)" /* 001 */ public java.lang.Object generate(Object[] references) { /* 002 */ return new SpecificSafeProjection(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private MutableRow mutableRow; /* 009 */ private org.apache.spark.serializer.KryoSerializerInstance serializer; /* 010 */ private org.apache.spark.serializer.KryoSerializerInstance serializer1; /* 011 */ /* 012 */ /* 013 */ public SpecificSafeProjection(Object[] references) { /* 014 */ this.references = references; /* 015 */ mutableRow = (MutableRow) references[references.length - 1]; /* 016 */ /* 017 */ if (org.apache.spark.SparkEnv.get() == null) { /* 018 */ serializer = (org.apache.spark.serializer.KryoSerializerInstance) new org.apache.spark.serializer.KryoSerializer(new org.apache.spark.SparkConf()).newInstance(); /* 019 */ } else { /* 020 */ serializer = (org.apache.spark.serializer.KryoSerializerInstance) new org.apache.spark.serializer.KryoSerializer(org.apache.spark.SparkEnv.get().conf()).newInstance(); /* 021 */ } /* 022 */ /* 023 */ /* 024 */ if (org.apache.spark.SparkEnv.get() == null) { /* 025 */ serializer1 = 
(org.apache.spark.serializer.KryoSerializerInstance) new org.apache.spark.serializer.KryoSerializer(new org.apache.spark.SparkConf()).newInstance(); /* 026 */ } else { /* 027 */ serializer1 = (org.apache.spark.serializer.KryoSerializerInstance) new org.apache.spark.serializer.KryoSerializer(org.apache.spark.SparkEnv.get().conf()).newInstance(); /* 028 */ } /* 029 */ /* 030 */ } /* 031 */ /* 032 */ public java.lang.Object apply(java.lang.Object _i) { /* 033 */ InternalRow i = (InternalRow) _i; /* 034 */ /* 035 */ boolean isNull2 = i.isNullAt(0); /* 036 */ InternalRow value2 = isNull2 ? null : (i.getStruct(0, 2)); /* 037 */ final scala.Option value1 = isNull2 ? null : (scala.Option) serializer.deserialize(java.nio.ByteBuffer.wrap(value2), null); /* 038 */ /* 039 */ boolean isNull4 = i.isNullAt(1); /* 040 */ InternalRow value4 = isNull4 ? null : (i.getStruct(1, 2)); /* 041 */ final scala.Option value3 = isNull4 ? null : (scala.Option) serializer1.deserialize(java.nio.ByteBuffer.wrap(value4), null); /* 042 */ /* 043 */ /* 044 */ final scala.Tuple2 value = false ? 
null : new scala.Tuple2(value1, value3); /* 045 */ if (false) { /* 046 */ mutableRow.setNullAt(0); /* 047 */ } else { /* 048 */ /* 049 */ mutableRow.update(0, value); /* 050 */ } /* 051 */ /* 052 */ return mutableRow; /* 053 */ } /* 054 */ } org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 37, Column 109: No applicable constructor/method found for actual parameters "org.apache.spark.sql.catalyst.InternalRow"; candidates are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, int)" at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174) at org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:7559) at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:7429) at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:7333) at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:5663) at org.codehaus.janino.UnitCompiler.access$13800(UnitCompiler.java:185) at org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:5132) at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3971) at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:5159) at org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:7533) at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:7429) at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:7333) at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3873) at org.codehaus.janino.UnitCompiler.access$6900
[jira] [Commented] (SPARK-1051) On Yarn, executors don't doAs as submitting user
[ https://issues.apache.org/jira/browse/SPARK-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332539#comment-15332539 ] Apache Spark commented on SPARK-1051: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/29 > On Yarn, executors don't doAs as submitting user > > > Key: SPARK-1051 > URL: https://issues.apache.org/jira/browse/SPARK-1051 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 0.9.0 >Reporter: Sandy Pérez González >Assignee: Sandy Ryza > Fix For: 0.9.1, 1.0.0 > > > This means that they can't write/read from files that the yarn user doesn't > have permissions to but the submitting user does. I don't think this is a > problem when running with Hadoop security, because the executor processes > will be run as the submitting user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation
[ https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15906: -- Issue Type: Improvement (was: New Feature) > Complementary Naive Bayes Algorithm Implementation > -- > > Key: SPARK-15906 > URL: https://issues.apache.org/jira/browse/SPARK-15906 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: MIN-FU YANG >Priority: Minor > > Improve the Naive Bayes algorithm on skewed data according to > "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" chapter 3.2 > http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
[jira] [Updated] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation
[ https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15906: -- Priority: Minor (was: Major) > Complementary Naive Bayes Algorithm Implementation > -- > > Key: SPARK-15906 > URL: https://issues.apache.org/jira/browse/SPARK-15906 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: MIN-FU YANG >Priority: Minor > > Improve the Naive Bayes algorithm on skewed data according to > "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" chapter 3.2 > http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
[jira] [Commented] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation
[ https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332536#comment-15332536 ] Joseph K. Bradley commented on SPARK-15906: --- Can you provide more info about what the proposal does in this JIRA? Also, do you have more references to indicate this is needed, such as other ML libraries with this improvement or other papers showing similar results? > Complementary Naive Bayes Algorithm Implementation > -- > > Key: SPARK-15906 > URL: https://issues.apache.org/jira/browse/SPARK-15906 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: MIN-FU YANG > > Improve the Naive Bayes algorithm on skewed data according to > "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" chapter 3.2 > http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
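For context on what the proposal changes: Complement Naive Bayes (chapter 3.2 of the cited paper) estimates each class's word probabilities from the documents *not* in that class, which stabilizes the estimates when class sizes are skewed. The sketch below is a toy illustration of that complement estimate, with class priors omitted for brevity; it is not MLlib code, and all names in it are invented for the example.

```java
import java.util.Arrays;

public class ComplementNB {
    /**
     * Complement-class estimate (Rennie et al. 2003, section 3.2):
     *   theta(c, w) = (N_{~c,w} + alpha) / (N_{~c} + alpha * |V|)
     * where N_{~c,w} counts word w over every class EXCEPT c.
     * The predicted class maximizes  -sum_w f_w * log theta(c, w),
     * i.e. the class whose *complement* fits the document worst.
     */
    static int predict(int[][] classWordCounts, int[] docWordCounts, double alpha) {
        int vocab = classWordCounts[0].length;

        // Grand totals over all classes, per word and overall.
        long[] totalPerWord = new long[vocab];
        long grandTotal = 0;
        for (int[] row : classWordCounts)
            for (int w = 0; w < vocab; w++) { totalPerWord[w] += row[w]; grandTotal += row[w]; }

        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < classWordCounts.length; c++) {
            long classTotal = Arrays.stream(classWordCounts[c]).asLongStream().sum();
            double score = 0.0;
            for (int w = 0; w < vocab; w++) {
                // Counts from everything outside class c, Laplace-smoothed.
                double theta = (totalPerWord[w] - classWordCounts[c][w] + alpha)
                             / (grandTotal - classTotal + alpha * vocab);
                score -= docWordCounts[w] * Math.log(theta);
            }
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy corpus, vocab {0:"spark", 1:"hadoop", 2:"cluster"}:
        // class 0 is "spark"-heavy, class 1 is "hadoop"-heavy.
        int[][] counts = { {8, 1, 3}, {1, 9, 3} };
        System.out.println(predict(counts, new int[]{2, 0, 1}, 1.0)); // 0
        System.out.println(predict(counts, new int[]{0, 3, 1}, 1.0)); // 1
    }
}
```

Because each class's parameters are fit on the pooled complement data, a tiny class still gets its estimates from a large sample, which is the skew robustness the ticket is after.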
[jira] [Updated] (SPARK-12978) Skip unnecessary final group-by when input data already clustered with group-by keys
[ https://issues.apache.org/jira/browse/SPARK-12978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12978: Target Version/s: 2.1.0 (was: 2.0.0)
> Skip unnecessary final group-by when input data already clustered with group-by keys
>
> Key: SPARK-12978
> URL: https://issues.apache.org/jira/browse/SPARK-12978
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.6.0
> Reporter: Takeshi Yamamuro
>
> This ticket targets the optimization to skip an unnecessary group-by operation below;
> Without opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)], output=[col0#159,sum#200,sum#201,count#202L])
>    +- TungstenExchange hashpartitioning(col0#159,200), None
>       +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
> {code}
> With opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenExchange hashpartitioning(col0#159,200), None
>    +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
> {code}
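The SPARK-12978 plans above show why two aggregate operators exist: a Partial pass builds per-partition (sum, count) states, an exchange clusters them by key, and a Final pass merges them; when the input is already clustered on the grouping key, one Complete pass suffices. A rough, Spark-free simulation of the two phases (the names and structure here are invented for the sketch, not Spark APIs):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoPhaseAgg {
    // Partial aggregation: each "partition" builds per-key {sum, count} states.
    static Map<String, long[]> partialAgg(List<String[]> partition) {
        Map<String, long[]> acc = new HashMap<>();
        for (String[] row : partition) {                     // row = {key, value}
            long[] a = acc.computeIfAbsent(row[0], k -> new long[]{0, 0});
            a[0] += Long.parseLong(row[1]);                  // running sum
            a[1] += 1;                                       // running count (for avg)
        }
        return acc;
    }

    // Final aggregation: merge the partial states after the exchange (shuffle).
    static Map<String, long[]> finalAgg(List<Map<String, long[]>> partials) {
        Map<String, long[]> merged = new HashMap<>();
        for (Map<String, long[]> p : partials) {
            p.forEach((k, v) -> {
                long[] a = merged.computeIfAbsent(k, kk -> new long[]{0, 0});
                a[0] += v[0];
                a[1] += v[1];
            });
        }
        return merged;
    }

    public static void main(String[] args) {
        // Key "a" straddles two partitions, so a final merge is required.
        List<String[]> part1 = List.of(new String[]{"a", "1"}, new String[]{"b", "2"});
        List<String[]> part2 = List.of(new String[]{"a", "3"});
        Map<String, long[]> result = finalAgg(List.of(partialAgg(part1), partialAgg(part2)));
        System.out.println(result.get("a")[0]);          // sum over both partitions: 4

        // If the input were already hash-partitioned on the key (every key
        // confined to one partition), each partial state would already be the
        // final answer -- the optimization collapses Partial+Final into a
        // single Complete-mode aggregate and skips the merge pass entirely.
        System.out.println(partialAgg(part2).get("a")[0]); // 3
    }
}
```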
[jira] [Resolved] (SPARK-15715) Altering partition storage information doesn't work in Hive
[ https://issues.apache.org/jira/browse/SPARK-15715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15715. - Resolution: Fixed Fix Version/s: 2.0.0
> Altering partition storage information doesn't work in Hive
>
> Key: SPARK-15715
> URL: https://issues.apache.org/jira/browse/SPARK-15715
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Andrew Or
> Assignee: Andrew Or
> Fix For: 2.0.0
>
> In HiveClientImpl
> {code}
> private def toHivePartition(
>     p: CatalogTablePartition,
>     ht: HiveTable): HivePartition = {
>   new HivePartition(ht, p.spec.asJava, p.storage.locationUri.map { l => new Path(l) }.orNull)
> }
> {code}
> Other than the location, we don't even store any of the storage information in the metastore: output format, input format, serde, serde props. The result is that doing something like the following doesn't actually do anything:
> {code}
> ALTER TABLE boxes PARTITION (width=3)
> SET SERDE 'com.sparkbricks.serde.ColumnarSerDe'
> WITH SERDEPROPERTIES ('compress'='true')
> {code}