[jira] [Commented] (SPARK-9285) Remove InternalRow's inheritance from Row
[ https://issues.apache.org/jira/browse/SPARK-9285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639456#comment-14639456 ] Apache Spark commented on SPARK-9285: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7626 Remove InternalRow's inheritance from Row - Key: SPARK-9285 URL: https://issues.apache.org/jira/browse/SPARK-9285 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin It is a big change, but it lets us use the type information to prevent accidentally passing internal types to external types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9285) Remove InternalRow's inheritance from Row
[ https://issues.apache.org/jira/browse/SPARK-9285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9285: --- Assignee: Apache Spark (was: Reynold Xin) Remove InternalRow's inheritance from Row - Key: SPARK-9285 URL: https://issues.apache.org/jira/browse/SPARK-9285 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Apache Spark It is a big change, but it lets us use the type information to prevent accidentally passing internal types to external types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9285) Remove InternalRow's inheritance from Row
[ https://issues.apache.org/jira/browse/SPARK-9285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9285: --- Assignee: Reynold Xin (was: Apache Spark) Remove InternalRow's inheritance from Row - Key: SPARK-9285 URL: https://issues.apache.org/jira/browse/SPARK-9285 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin It is a big change, but it lets us use the type information to prevent accidentally passing internal types to external types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
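The benefit of breaking the inheritance, in miniature: once InternalRow no longer extends Row, handing a Catalyst-internal row to a user-facing API becomes a compile-time error instead of a silent runtime bug. A minimal Scala sketch with simplified stand-ins (not Spark's actual class bodies):
{code}
// Simplified stand-ins; only the type relationship matters here.
abstract class Row { def get(i: Int): Any }           // public, user-facing rows
abstract class InternalRow { def get(i: Int): Any }   // no longer "extends Row"

def returnToUser(row: Row): Unit = ()

val internal: InternalRow = null
// returnToUser(internal)  // with the inheritance removed, this line no longer compiles
{code}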
[jira] [Created] (SPARK-9287) Speedup unit test of Date expressions
Davies Liu created SPARK-9287: - Summary: Speedup unit test of Date expressions Key: SPARK-9287 URL: https://issues.apache.org/jira/browse/SPARK-9287 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu It tries hard to cover many corner cases, but slows down unit tests a lot (takes 30 seconds on my MacBook). We could ignore most of them now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9286) Methods in Unevaluable should be final
[ https://issues.apache.org/jira/browse/SPARK-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9286: --- Assignee: Apache Spark (was: Josh Rosen) Methods in Unevaluable should be final -- Key: SPARK-9286 URL: https://issues.apache.org/jira/browse/SPARK-9286 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Apache Spark Priority: Trivial The {{eval()}} and {{genCode()}} methods in SQL's {{Unevaluable}} trait should be marked as {{final}} and we should fix any cases where they are overridden. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
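A minimal sketch of the proposed shape of the fix, using a simplified stand-in for Spark's Expression API (the real signatures take an InternalRow and a codegen context):
{code}
// Stand-in for the real Expression trait.
trait Expression { def eval(input: Any): Any; def genCode(ctx: Any): String }

trait Unevaluable extends Expression {
  // final stops subclasses from accidentally giving these real implementations.
  final override def eval(input: Any): Any =
    throw new UnsupportedOperationException(s"Cannot evaluate expression: $this")
  final override def genCode(ctx: Any): String =
    throw new UnsupportedOperationException(s"Cannot generate code for expression: $this")
}
{code}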
[jira] [Assigned] (SPARK-9284) Remove tests' dependency on the assembly
[ https://issues.apache.org/jira/browse/SPARK-9284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9284: --- Assignee: Apache Spark Remove tests' dependency on the assembly Key: SPARK-9284 URL: https://issues.apache.org/jira/browse/SPARK-9284 Project: Spark Issue Type: Improvement Components: Tests Reporter: Marcelo Vanzin Assignee: Apache Spark Priority: Minor Some tests - in particular tests that have to spawn child processes - currently rely on the generated Spark assembly to run properly. This is sub-optimal for a few reasons: - Users have to use an unnatural "package everything first, then run tests" approach - Sometimes tests are run using old code because the user forgot to rebuild the assembly The latter is particularly annoying in {{YarnClusterSuite}}. If you modify some code outside of the {{yarn/}} module, you have to rebuild the whole assembly before that test picks it up. We should make all tests run without the need to have an assembly around, making sure that they always pick up the latest code compiled by the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9284) Remove tests' dependency on the assembly
[ https://issues.apache.org/jira/browse/SPARK-9284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9284: --- Assignee: (was: Apache Spark) Remove tests' dependency on the assembly Key: SPARK-9284 URL: https://issues.apache.org/jira/browse/SPARK-9284 Project: Spark Issue Type: Improvement Components: Tests Reporter: Marcelo Vanzin Priority: Minor Some tests - in particular tests that have to spawn child processes - currently rely on the generated Spark assembly to run properly. This is sub-optimal for a few reasons: - Users have to use an unnatural "package everything first, then run tests" approach - Sometimes tests are run using old code because the user forgot to rebuild the assembly The latter is particularly annoying in {{YarnClusterSuite}}. If you modify some code outside of the {{yarn/}} module, you have to rebuild the whole assembly before that test picks it up. We should make all tests run without the need to have an assembly around, making sure that they always pick up the latest code compiled by the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9284) Remove tests' dependency on the assembly
[ https://issues.apache.org/jira/browse/SPARK-9284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639671#comment-14639671 ] Apache Spark commented on SPARK-9284: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/7629 Remove tests' dependency on the assembly Key: SPARK-9284 URL: https://issues.apache.org/jira/browse/SPARK-9284 Project: Spark Issue Type: Improvement Components: Tests Reporter: Marcelo Vanzin Priority: Minor Some tests - in particular tests that have to spawn child processes - currently rely on the generated Spark assembly to run properly. This is sub-optimal for a few reasons: - Users have to use an unnatural "package everything first, then run tests" approach - Sometimes tests are run using old code because the user forgot to rebuild the assembly The latter is particularly annoying in {{YarnClusterSuite}}. If you modify some code outside of the {{yarn/}} module, you have to rebuild the whole assembly before that test picks it up. We should make all tests run without the need to have an assembly around, making sure that they always pick up the latest code compiled by the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9290) DateExpressionsSuite is slow to run
Reynold Xin created SPARK-9290: -- Summary: DateExpressionsSuite is slow to run Key: SPARK-9290 URL: https://issues.apache.org/jira/browse/SPARK-9290 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin We are running way too many test cases in here. {code} [info] - DayOfYear (16 seconds, 998 milliseconds) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9286) Methods in Unevaluable should be final
[ https://issues.apache.org/jira/browse/SPARK-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-9286. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7627 [https://github.com/apache/spark/pull/7627] Methods in Unevaluable should be final -- Key: SPARK-9286 URL: https://issues.apache.org/jira/browse/SPARK-9286 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Trivial Fix For: 1.5.0 The {{eval()}} and {{genCode()}} methods in SQL's {{Unevaluable}} trait should be marked as {{final}} and we should fix any cases where they are overridden. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9286) Methods in Unevaluable should be final
[ https://issues.apache.org/jira/browse/SPARK-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639507#comment-14639507 ] Apache Spark commented on SPARK-9286: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/7627 Methods in Unevaluable should be final -- Key: SPARK-9286 URL: https://issues.apache.org/jira/browse/SPARK-9286 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Trivial The {{eval()}} and {{genCode()}} methods in SQL's {{Unevaluable}} trait should be marked as {{final}} and we should fix any cases where they are overridden. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9286) Methods in Unevaluable should be final
[ https://issues.apache.org/jira/browse/SPARK-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9286: --- Assignee: Josh Rosen (was: Apache Spark) Methods in Unevaluable should be final -- Key: SPARK-9286 URL: https://issues.apache.org/jira/browse/SPARK-9286 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Trivial The {{eval()}} and {{genCode()}} methods in SQL's {{Unevaluable}} trait should be marked as {{final}} and we should fix any cases where they are overridden. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9269) Add Set to the matching type in ArrayConverter
[ https://issues.apache.org/jira/browse/SPARK-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639603#comment-14639603 ] Apache Spark commented on SPARK-9269: - User 'alexliu68' has created a pull request for this issue: https://github.com/apache/spark/pull/7628 Add Set to the matching type in ArrayConverter --- Key: SPARK-9269 URL: https://issues.apache.org/jira/browse/SPARK-9269 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Alex Liu When the data is a Scala Set, the following error is thrown. {code} scala.MatchError: Set() (of class scala.collection.immutable.Set$EmptySet$) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:136) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$4.apply(CatalystTypeConverters.scala:187) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$4.apply(CatalystTypeConverters.scala:187) at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:62) at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) {code} We need to add Set to the matching type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9269) Add Set to the matching type in ArrayConverter
[ https://issues.apache.org/jira/browse/SPARK-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9269: --- Assignee: Apache Spark Add Set to the matching type in ArrayConverter --- Key: SPARK-9269 URL: https://issues.apache.org/jira/browse/SPARK-9269 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Alex Liu Assignee: Apache Spark When the data is a Scala Set, the following error is thrown. {code} scala.MatchError: Set() (of class scala.collection.immutable.Set$EmptySet$) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:136) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$4.apply(CatalystTypeConverters.scala:187) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$4.apply(CatalystTypeConverters.scala:187) at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:62) at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) {code} We need to add Set to the matching type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9269) Add Set to the matching type in ArrayConverter
[ https://issues.apache.org/jira/browse/SPARK-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9269: --- Assignee: (was: Apache Spark) Add Set to the matching type in ArrayConverter --- Key: SPARK-9269 URL: https://issues.apache.org/jira/browse/SPARK-9269 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Alex Liu When the data is a Scala Set, the following error is thrown. {code} scala.MatchError: Set() (of class scala.collection.immutable.Set$EmptySet$) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:136) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$4.apply(CatalystTypeConverters.scala:187) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$4.apply(CatalystTypeConverters.scala:187) at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:62) at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) {code} We need to add Set to the matching type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
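The shape of the fix is one extra case in the converter's pattern match. A hedged sketch (the method and its signature are illustrative, not the actual CatalystTypeConverters code):
{code}
// Illustrative converter: Set needs the same treatment Seq already gets.
def toCatalystArray(convertElement: Any => Any)(value: Any): Seq[Any] = value match {
  case seq: Seq[_]   => seq.map(convertElement)
  case set: Set[_]   => set.toSeq.map(convertElement)  // the missing case
  case arr: Array[_] => arr.toSeq.map(convertElement)
  case other         => throw new MatchError(other)    // what users currently hit
}
{code}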
[jira] [Created] (SPARK-9285) Remove InternalRow's inheritance from Row
Reynold Xin created SPARK-9285: -- Summary: Remove InternalRow's inheritance from Row Key: SPARK-9285 URL: https://issues.apache.org/jira/browse/SPARK-9285 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin It is a big change, but it lets us use the type information to prevent accidentally passing internal types to external types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9284) Remove tests' dependency on the assembly
Marcelo Vanzin created SPARK-9284: - Summary: Remove tests' dependency on the assembly Key: SPARK-9284 URL: https://issues.apache.org/jira/browse/SPARK-9284 Project: Spark Issue Type: Improvement Components: Tests Reporter: Marcelo Vanzin Priority: Minor Some tests - in particular tests that have to spawn child processes - currently rely on the generated Spark assembly to run properly. This is sub-optimal for a few reasons: - Users have to use an unnatural "package everything first, then run tests" approach - Sometimes tests are run using old code because the user forgot to rebuild the assembly The latter is particularly annoying in {{YarnClusterSuite}}. If you modify some code outside of the {{yarn/}} module, you have to rebuild the whole assembly before that test picks it up. We should make all tests run without the need to have an assembly around, making sure that they always pick up the latest code compiled by the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9284) Remove tests' dependency on the assembly
[ https://issues.apache.org/jira/browse/SPARK-9284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639448#comment-14639448 ] Marcelo Vanzin commented on SPARK-9284: --- BTW I have a patch that does this; I'm currently running some more test iterations on it. Remove tests' dependency on the assembly Key: SPARK-9284 URL: https://issues.apache.org/jira/browse/SPARK-9284 Project: Spark Issue Type: Improvement Components: Tests Reporter: Marcelo Vanzin Priority: Minor Some tests - in particular tests that have to spawn child processes - currently rely on the generated Spark assembly to run properly. This is sub-optimal for a few reasons: - Users have to use an unnatural "package everything first, then run tests" approach - Sometimes tests are run using old code because the user forgot to rebuild the assembly The latter is particularly annoying in {{YarnClusterSuite}}. If you modify some code outside of the {{yarn/}} module, you have to rebuild the whole assembly before that test picks it up. We should make all tests run without the need to have an assembly around, making sure that they always pick up the latest code compiled by the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9288) Improve test speed
Reynold Xin created SPARK-9288: -- Summary: Improve test speed Key: SPARK-9288 URL: https://issues.apache.org/jira/browse/SPARK-9288 Project: Spark Issue Type: Umbrella Components: Build, Tests Reporter: Reynold Xin This is an umbrella ticket to track test cases that are slow to run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9289) OrcPartitionDiscoverySuite is slow to run
Reynold Xin created SPARK-9289: -- Summary: OrcPartitionDiscoverySuite is slow to run Key: SPARK-9289 URL: https://issues.apache.org/jira/browse/SPARK-9289 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin {code} [info] - read partitioned table - normal case (18 seconds, 557 milliseconds) [info] - read partitioned table - partition key included in orc file (5 seconds, 160 milliseconds) [info] - read partitioned table - with nulls (4 seconds, 69 milliseconds) [info] - read partitioned table - with nulls and partition keys are included in Orc file (3 seconds, 218 milliseconds) {code} Does the unit test really need to run for 18 secs, 5 secs, 4 secs, and 3 secs? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9276) ThriftServer process can't stop if using command yarn application -kill appid
meiyoula created SPARK-9276: --- Summary: ThriftServer process can't stop if using command yarn application -kill appid Key: SPARK-9276 URL: https://issues.apache.org/jira/browse/SPARK-9276 Project: Spark Issue Type: Bug Components: SQL Reporter: meiyoula Reproduction Steps: 1. start the thriftserver 2. use beeline to connect to the thriftserver 3. use the command “yarn application -kill appid” or the YARN web UI to kill the thriftserver's application 4. the ApplicationMaster has stopped, but the driver process is still there Reproduction Condition: there must be a client connected to the thriftserver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5447) Replace reference to SchemaRDD with DataFrame
[ https://issues.apache.org/jira/browse/SPARK-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638645#comment-14638645 ] Apache Spark commented on SPARK-5447: - User 'darroyocazorla' has created a pull request for this issue: https://github.com/apache/spark/pull/7618 Replace reference to SchemaRDD with DataFrame - Key: SPARK-5447 URL: https://issues.apache.org/jira/browse/SPARK-5447 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.3.0 We renamed SchemaRDD -> DataFrame, but internally various code still references SchemaRDD in JavaDoc and comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length
[ https://issues.apache.org/jira/browse/SPARK-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Vykhodtsev updated SPARK-9277: - Attachment: SparseVector test.ipynb SparseVector test.html Attached is the notebook with the scenario and the full message: SparseVector constructor must throw an error when declared number of elements less than array length Key: SPARK-9277 URL: https://issues.apache.org/jira/browse/SPARK-9277 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Andrey Vykhodtsev Priority: Minor Attachments: SparseVector test.html, SparseVector test.ipynb I found that one can create a SparseVector inconsistently, and it will lead to a Java error at runtime, for example when training LogisticRegressionWithSGD. Here is the test case: In [2]: sc.version Out[2]: u'1.3.1' In [13]: from pyspark.mllib.linalg import SparseVector from pyspark.mllib.regression import LabeledPoint from pyspark.mllib.classification import LogisticRegressionWithSGD In [3]: x = SparseVector(2, {1:1, 2:2, 3:3, 4:4, 5:5}) In [10]: l = LabeledPoint(0, x) In [12]: r = sc.parallelize([l]) In [14]: m = LogisticRegressionWithSGD.train(r) Error: Py4JJavaError: An error occurred while calling o86.trainLogisticRegressionModelWithSGD. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 11.0 (TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2 Attached is the notebook with the scenario and the full message -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9212) update Netty version to 4.0.29.Final for Netty Metrics
[ https://issues.apache.org/jira/browse/SPARK-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9212: - Assignee: Zhang, Liye Priority: Trivial (was: Major) update Netty version to 4.0.29.Final for Netty Metrics Key: SPARK-9212 URL: https://issues.apache.org/jira/browse/SPARK-9212 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Zhang, Liye Assignee: Zhang, Liye Priority: Trivial Fix For: 1.5.0 In Netty version 4.0.29.Final, metrics for PooledByteBufAllocator are exposed directly, so there is no need to get the memory data in a hacky way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9278) DataFrameWriter.insertInto inserts incorrect data
Steve Lindemann created SPARK-9278: -- Summary: DataFrameWriter.insertInto inserts incorrect data Key: SPARK-9278 URL: https://issues.apache.org/jira/browse/SPARK-9278 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: Linux, S3, Hive Metastore Reporter: Steve Lindemann After creating a partitioned Hive table (stored as Parquet) via the DataFrameWriter.createTable command, subsequent attempts to insert additional data into new partitions of this table result in inserting incorrect data rows. Reordering the columns in the data to be written seems to avoid this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
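Since insertInto matches columns by position rather than by name, one plausible workaround until this is fixed is exactly the column reordering the reporter mentions: select the columns into the table's declared order before writing. A sketch assuming an existing DataFrame df and a hypothetical table my_table(id, value, dt):
{code}
// Align the DataFrame's column order with the target table, then insert.
val aligned = df.select("id", "value", "dt")  // same order as my_table's schema
aligned.write.insertInto("my_table")
{code}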
[jira] [Created] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length
Andrey Vykhodtsev created SPARK-9277: Summary: SparseVector constructor must throw an error when declared number of elements less than array length Key: SPARK-9277 URL: https://issues.apache.org/jira/browse/SPARK-9277 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Andrey Vykhodtsev Priority: Minor I found that one can create a SparseVector inconsistently, and it will lead to a Java error at runtime, for example when training LogisticRegressionWithSGD. Here is the test case: In [2]: sc.version Out[2]: u'1.3.1' In [13]: from pyspark.mllib.linalg import SparseVector from pyspark.mllib.regression import LabeledPoint from pyspark.mllib.classification import LogisticRegressionWithSGD In [3]: x = SparseVector(2, {1:1, 2:2, 3:3, 4:4, 5:5}) In [10]: l = LabeledPoint(0, x) In [12]: r = sc.parallelize([l]) In [14]: m = LogisticRegressionWithSGD.train(r) Error: Py4JJavaError: An error occurred while calling o86.trainLogisticRegressionModelWithSGD. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 11.0 (TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2 Attached is the notebook with the scenario and the full message -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
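A minimal sketch of the requested guard, written as a standalone Scala class rather than MLlib's actual constructor: validating indices up front turns the late ArrayIndexOutOfBoundsException inside training into an immediate, descriptive error.
{code}
// Standalone illustration; not MLlib's SparseVector.
class CheckedSparseVector(val size: Int, val indices: Array[Int], val values: Array[Double]) {
  require(indices.length == values.length,
    s"indices (${indices.length}) and values (${values.length}) differ in length")
  require(indices.forall(i => i >= 0 && i < size),
    s"every index must be in [0, $size), got: ${indices.mkString(", ")}")
}
{code}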
[jira] [Resolved] (SPARK-4024) Remember user preferences for metrics to show in the UI
[ https://issues.apache.org/jira/browse/SPARK-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-4024. - Resolution: Fixed Fix Version/s: 1.5.0 Resolved in https://github.com/apache/spark/pull/7399 Remember user preferences for metrics to show in the UI --- Key: SPARK-4024 URL: https://issues.apache.org/jira/browse/SPARK-4024 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Kay Ousterhout Priority: Minor Fix For: 1.5.0 We should remember the metrics a user has previously chosen to display for each stage, so that the user doesn't need to reselect interesting metrics each time they open a stage detail page. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9212) update Netty version to 4.0.29.Final for Netty Metrics
[ https://issues.apache.org/jira/browse/SPARK-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9212. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7562 [https://github.com/apache/spark/pull/7562] update Netty version to 4.0.29.Final for Netty Metrics Key: SPARK-9212 URL: https://issues.apache.org/jira/browse/SPARK-9212 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Zhang, Liye Fix For: 1.5.0 In Netty version 4.0.29.Final, metrics for PooledByteBufAllocator are exposed directly, so there is no need to get the memory data in a hacky way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9276) ThriftServer process can't stop if using command yarn application -kill appid
[ https://issues.apache.org/jira/browse/SPARK-9276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638714#comment-14638714 ] Sean Owen commented on SPARK-9276: -- I might misunderstand this, but is that surprising? You used YARN to kill the AM, and the AM stopped. YARN can't kill other processes. ThriftServer process can't stop if using command yarn application -kill appid --- Key: SPARK-9276 URL: https://issues.apache.org/jira/browse/SPARK-9276 Project: Spark Issue Type: Bug Components: SQL Reporter: meiyoula Reproduction Steps: 1. start the thriftserver 2. use beeline to connect to the thriftserver 3. use the command “yarn application -kill appid” or the YARN web UI to kill the thriftserver's application 4. the ApplicationMaster has stopped, but the driver process is still there Reproduction Condition: there must be a client connected to the thriftserver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9249) local variable assigned but may not be used
[ https://issues.apache.org/jira/browse/SPARK-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Ishikawa updated SPARK-9249: --- Description: local variable assigned but may not be used For example: {noformat} R/deserialize.R:105:3: warning: local variable ‘data’ assigned but may not be used data <- readBin(con, raw(), as.integer(dataLen), endian = "big") ^~~~ R/deserialize.R:109:3: warning: local variable ‘data’ assigned but may not be used data <- readBin(con, raw(), as.integer(dataLen), endian = "big") ^~~~ {noformat} was:local variable assigned but may not be used local variable assigned but may not be used --- Key: SPARK-9249 URL: https://issues.apache.org/jira/browse/SPARK-9249 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Priority: Minor local variable assigned but may not be used For example: {noformat} R/deserialize.R:105:3: warning: local variable ‘data’ assigned but may not be used data <- readBin(con, raw(), as.integer(dataLen), endian = "big") ^~~~ R/deserialize.R:109:3: warning: local variable ‘data’ assigned but may not be used data <- readBin(con, raw(), as.integer(dataLen), endian = "big") ^~~~ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8092) OneVsRest doesn't allow flexibility in label/ feature column renaming
[ https://issues.apache.org/jira/browse/SPARK-8092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-8092. -- Resolution: Fixed Fix Version/s: 1.5.0 OneVsRest doesn't allow flexibility in label/ feature column renaming - Key: SPARK-8092 URL: https://issues.apache.org/jira/browse/SPARK-8092 Project: Spark Issue Type: Bug Components: ML Reporter: Ram Sriharsha Assignee: Ram Sriharsha Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9249) local variable assigned but may not be used
[ https://issues.apache.org/jira/browse/SPARK-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639945#comment-14639945 ] Yu Ishikawa edited comment on SPARK-9249 at 7/24/15 5:22 AM: - [~chanchal.spark] Yes. I think we should remove local variables which are not used, such as below. https://github.com/apache/spark/blob/branch-1.4/R/pkg/R/deserialize.R#L104 was (Author: yuu.ishik...@gmail.com): [~chanchal.spark] Yes. I think we should remove local variables which is not used, such as below. https://github.com/apache/spark/blob/branch-1.4/R/pkg/R/deserialize.R#L104 local variable assigned but may not be used --- Key: SPARK-9249 URL: https://issues.apache.org/jira/browse/SPARK-9249 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Priority: Minor local variable assigned but may not be used For example: {noformat} R/deserialize.R:105:3: warning: local variable ‘data’ assigned but may not be used data <- readBin(con, raw(), as.integer(dataLen), endian = "big") ^~~~ R/deserialize.R:109:3: warning: local variable ‘data’ assigned but may not be used data <- readBin(con, raw(), as.integer(dataLen), endian = "big") ^~~~ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7045) Word2Vec: avoid intermediate representation when creating model
[ https://issues.apache.org/jira/browse/SPARK-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7045: - Shepherd: Joseph K. Bradley Target Version/s: 1.5.0 Word2Vec: avoid intermediate representation when creating model --- Key: SPARK-7045 URL: https://issues.apache.org/jira/browse/SPARK-7045 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar Priority: Minor Word2VecModel now stores the word vectors as a single, flat array; Word2Vec does as well. However, when Word2Vec creates the model, it builds an intermediate representation. We should skip that intermediate representation. However, it would be nice to create a public constructor for Word2VecModel which takes that intermediate representation (a Map from String words to their Vectors), since it's a user-friendly representation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7045) Word2Vec: avoid intermediate representation when creating model
[ https://issues.apache.org/jira/browse/SPARK-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7045: - Assignee: Manoj Kumar Word2Vec: avoid intermediate representation when creating model --- Key: SPARK-7045 URL: https://issues.apache.org/jira/browse/SPARK-7045 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar Priority: Minor Word2VecModel now stores the word vectors as a single, flat array; Word2Vec does as well. However, when Word2Vec creates the model, it builds an intermediate representation. We should skip that intermediate representation. However, it would be nice to create a public constructor for Word2VecModel which takes that intermediate representation (a Map from String words to their Vectors), since it's a user-friendly representation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
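A hedged sketch of the proposed constructor: accept the user-friendly Map form and flatten it straight into the single-array layout, with no intermediate model representation (names and layout here are illustrative, not Spark's internals):
{code}
class Word2VecModelSketch(wordVectors: Map[String, Array[Float]]) {
  private val wordList: Array[String] = wordVectors.keys.toArray
  private val wordIndex: Map[String, Int] = wordList.zipWithIndex.toMap
  private val vectorSize: Int = wordVectors.headOption.map(_._2.length).getOrElse(0)
  // Flat layout: the vector for wordList(i) occupies [i * vectorSize, (i + 1) * vectorSize).
  private val flatVectors: Array[Float] = wordList.flatMap(wordVectors)
}
{code}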
[jira] [Assigned] (SPARK-9293) Analysis should detect when set operations are performed on tables with different numbers of columns
[ https://issues.apache.org/jira/browse/SPARK-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9293: --- Assignee: Apache Spark (was: Josh Rosen) Analysis should detect when set operations are performed on tables with different numbers of columns Key: SPARK-9293 URL: https://issues.apache.org/jira/browse/SPARK-9293 Project: Spark Issue Type: Bug Components: SQL Reporter: Josh Rosen Assignee: Apache Spark Our SQL analyzer doesn't always enforce that set operations are only performed on relations with the same number of columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9293) Analysis should detect when set operations are performed on tables with different numbers of columns
[ https://issues.apache.org/jira/browse/SPARK-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9293: --- Assignee: Josh Rosen (was: Apache Spark) Analysis should detect when set operations are performed on tables with different numbers of columns Key: SPARK-9293 URL: https://issues.apache.org/jira/browse/SPARK-9293 Project: Spark Issue Type: Bug Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen Our SQL analyzer doesn't always enforce that set operations are only performed on relations with the same number of columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
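The missing check is a small structural rule over the plan. A simplified sketch of the idea with stand-in types (the real rule would walk the analyzed logical plan):
{code}
// Stand-in for a logical plan node; only output arity matters here.
case class Plan(output: Seq[String])

def checkSetOperation(op: String, left: Plan, right: Plan): Unit =
  if (left.output.length != right.output.length) {
    throw new IllegalArgumentException(
      s"$op can only be performed on tables with the same number of columns, " +
        s"but got ${left.output.length} and ${right.output.length}")
  }
{code}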
[jira] [Updated] (SPARK-8564) Add the Python API for Kinesis
[ https://issues.apache.org/jira/browse/SPARK-8564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-8564: --- Target Version/s: 1.5.0 Add the Python API for Kinesis -- Key: SPARK-8564 URL: https://issues.apache.org/jira/browse/SPARK-8564 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5373) literal in agg grouping expressions leads to incorrect result
[ https://issues.apache.org/jira/browse/SPARK-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639897#comment-14639897 ] Apache Spark commented on SPARK-5373: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/7583 literal in agg grouping expressions leads to incorrect result --- Key: SPARK-5373 URL: https://issues.apache.org/jira/browse/SPARK-5373 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Fei Wang Fix For: 1.3.0 select key, count( * ) from src group by key, 1 will get the wrong answer! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6548) stddev_pop and stddev_samp aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639914#comment-14639914 ] Yin Huai commented on SPARK-6548: - [~JihongMA] Will you have time to implement stddev based on our new aggregate function interface? {{AlgebraicAggregate}} is the abstract class to use and you can take a look at {{org.apache.spark.sql.catalyst.expressions.aggregate.Average}} as an example. Let me know if you have any questions. Thanks! stddev_pop and stddev_samp aggregate functions -- Key: SPARK-6548 URL: https://issues.apache.org/jira/browse/SPARK-6548 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: DataFrame, starter Add it to the list of aggregate functions: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala Also add it to https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala We can either add a Stddev Catalyst expression, or just compute it using existing functions like here: https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
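The second route the ticket mentions, computing the statistic from existing aggregates, rests on the identity stddev_pop(x) = sqrt(E[x^2] - E[x]^2). A sketch with public DataFrame functions, assuming an existing DataFrame df with a numeric column x (the sample variant additionally needs the n/(n-1) correction from a count):
{code}
import org.apache.spark.sql.functions._

// Population standard deviation from aggregates that already exist.
val x = col("x")
val stddevPop = sqrt(avg(x * x) - avg(x) * avg(x))
df.agg(stddevPop.as("stddev_pop_x"))
{code}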
[jira] [Commented] (SPARK-9255) Timestamp handling incorrect for Spark 1.4.1 on Linux
[ https://issues.apache.org/jira/browse/SPARK-9255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639949#comment-14639949 ] Paul Wu commented on SPARK-9255: [~srowen] I don't think it is due to a version difference: the same code runs correctly on the 1.3.0 release on Redhat Linux. This bug was introduced after 1.3.0. Timestamp handling incorrect for Spark 1.4.1 on Linux - Key: SPARK-9255 URL: https://issues.apache.org/jira/browse/SPARK-9255 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Environment: Redhat Linux, Java 8.0 and Spark 1.4.1 release. Reporter: Paul Wu Attachments: timestamp_bug.zip This is a very strange case involving timestamps. I can run the program on Windows using the dev pom.xml (1.4.1) or the 1.4.1 or 1.3.0 release downloaded from Apache without issues, but when I ran it on the Spark 1.4.1 release (either downloaded from Apache or the version built with Scala 2.11) on Redhat Linux, it has the following error (the code I used is after this stack trace): 15/07/22 12:02:50 ERROR Executor 96: Exception in task 0.0 in stage 0.0 (TID 0) java.util.concurrent.ExecutionException: scala.tools.reflect.ToolBoxError: reflective compilation has failed: value is not a member of TimestampType.this.InternalType at org.spark-project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306) at org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293) at org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) at org.spark-project.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135) at org.spark-project.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410) at org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2380) at org.spark-project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) at org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000) at org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:105) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:102) at org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:170) at org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$9.apply(GeneratedAggregate.scala:261) at org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$9.apply(GeneratedAggregate.scala:246) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: scala.tools.reflect.ToolBoxError: reflective compilation has failed: value is not a member of TimestampType.this.InternalType at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.throwIfErrors(ToolBoxFactory.scala:316) at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.wrapInPackageAndCompile(ToolBoxFactory.scala:198) at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.compile(ToolBoxFactory.scala:252) at scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$$anonfun$compile$2.apply(ToolBoxFactory.scala:429) at
[jira] [Commented] (SPARK-9249) local variable assigned but may not be used
[ https://issues.apache.org/jira/browse/SPARK-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639945#comment-14639945 ] Yu Ishikawa commented on SPARK-9249: [~chanchal.spark] Yes. I think we should remove local variables which is not used, such as below. https://github.com/apache/spark/blob/branch-1.4/R/pkg/R/deserialize.R#L104 local variable assigned but may not be used --- Key: SPARK-9249 URL: https://issues.apache.org/jira/browse/SPARK-9249 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Priority: Minor local variable assigned but may not be used For example: {noformat} R/deserialize.R:105:3: warning: local variable ‘data’ assigned but may not be used data <- readBin(con, raw(), as.integer(dataLen), endian = "big") ^~~~ R/deserialize.R:109:3: warning: local variable ‘data’ assigned but may not be used data <- readBin(con, raw(), as.integer(dataLen), endian = "big") ^~~~ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9002) KryoSerializer initialization does not include 'Array[Int]'
[ https://issues.apache.org/jira/browse/SPARK-9002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639940#comment-14639940 ] Randy Kerber commented on SPARK-9002: - That's what I was thinking when I created the issue. Looking at the KryoSerializer object's toRegister method, it looked like the Array[Int] class was just inadvertently missed, so I thought this might be a perfect simple fix for a first contribution. But then as I continued my attempt to transition from the Java serializer to Kryo, I kept hitting new "Class is not registered" errors, one after another. First Array[String]. Then Array[Map.empty], then Array[Seq.empty], then Array[TreeMap], plus several other flavors of empty collection classes, Array[Tuple3], DataFrame, Row, even Array[GenericRowWithSchema]. Like swatting cockroaches -- no end to it. Started to wonder if this was a futile process. I could use some guidance here as to how it makes sense to proceed. Is it worthwhile to add the 15-20 classes I've found so far, knowing there will almost certainly be more? Or drop it, because this route cannot possibly be a complete fix, and/or a more comprehensive solution for Kryo is already in the works? KryoSerializer initialization does not include 'Array[Int]' --- Key: SPARK-9002 URL: https://issues.apache.org/jira/browse/SPARK-9002 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Environment: MacBook Pro, OS X 10.10.4, Spark 1.4.0, master=local[*], IntelliJ IDEA. Reporter: Randy Kerber Priority: Minor Labels: easyfix, newbie Original Estimate: 1h Remaining Estimate: 1h The object KryoSerializer (inside KryoRegistrator.scala) contains a list of classes that are automatically registered with Kryo. That list includes: Array\[Byte], Array\[Long], and Array\[Short]. Array\[Int] is missing from that list. Can't think of any good reason it shouldn't also be included. Note: This is my first time creating an issue or contributing code to an Apache project. Apologies if I'm not following the process correctly. Appreciate any guidance or assistance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
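For the application-side half of this, Spark's public API already lets a job register whatever classes Kryo complains about up front; the class list below is illustrative, since which ones are needed is workload-specific:
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Register the classes the "Class is not registered" errors name.
  .registerKryoClasses(Array(
    classOf[Array[Int]],
    classOf[Array[String]]
  ))
{code}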
[jira] [Created] (SPARK-9302) collect()/head() failed with JSON of some format
Sun Rui created SPARK-9302: -- Summary: collect()/head() failed with JSON of some format Key: SPARK-9302 URL: https://issues.apache.org/jira/browse/SPARK-9302 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.1, 1.4.0 Reporter: Sun Rui Reported in the mailing list by Exie tfind...@prodevelop.com.au: {noformat} A sample record in raw JSON looks like this: {"version": 1,"event": "view","timestamp": 1427846422377,"system": "DCDS","asset": "6404476","assetType": "myType","assetCategory": "myCategory","extras": [{"name": "videoSource","value": "mySource"},{"name": "playerType","value": "Article"},{"name": "duration","value": "202088"}],"trackingId": "155629a0-d802-11e4-13ee-6884e43d6000","ipAddress": "165.69.2.4","title": "myTitle"} head(mydf) Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "jobj" to a data.frame show(mydf) DataFrame[localEventDtTm:timestamp, asset:string, assetCategory:string, assetType:string, event:string, extras:array<struct<name:string,value:string>>, ipAddress:string, memberId:string, system:string, timestamp:bigint, title:string, trackingId:string, version:bigint] {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9249) local variable assigned but may not be used
[ https://issues.apache.org/jira/browse/SPARK-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639955#comment-14639955 ] Yu Ishikawa commented on SPARK-9249: I'm working on this issue. local variable assigned but may not be used --- Key: SPARK-9249 URL: https://issues.apache.org/jira/browse/SPARK-9249 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Priority: Minor local variable assigned but may not be used For example: {noformat} R/deserialize.R:105:3: warning: local variable ‘data’ assigned but may not be used data <- readBin(con, raw(), as.integer(dataLen), endian = "big") ^~~~ R/deserialize.R:109:3: warning: local variable ‘data’ assigned but may not be used data <- readBin(con, raw(), as.integer(dataLen), endian = "big") ^~~~ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9281) Parse literals as decimal in SQL
[ https://issues.apache.org/jira/browse/SPARK-9281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-9281: - Assignee: Davies Liu Parse literals as decimal in SQL Key: SPARK-9281 URL: https://issues.apache.org/jira/browse/SPARK-9281 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Davies Liu Assignee: Davies Liu Right now, we use double to parse all float numbers in SQL. When such a literal is used in an expression together with DecimalType, it turns the decimal into double as well. Also, it loses some precision when using double. It's better to parse float numbers as decimal (we will know exactly what the precision and scale are), and it still works well with double. BTW, this is a breaking change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
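A simplified sketch of the dispatch being proposed, with stand-ins for Catalyst's literal types: an unsuffixed fractional literal becomes an exact decimal, while an explicit suffix still yields a double.
{code}
sealed trait NumericLiteral
case class DecimalLiteral(value: BigDecimal) extends NumericLiteral // exact precision/scale
case class DoubleLiteral(value: Double) extends NumericLiteral

def parseFractionalLiteral(s: String): NumericLiteral =
  if (s.endsWith("D") || s.endsWith("d")) DoubleLiteral(s.dropRight(1).toDouble)
  else DecimalLiteral(BigDecimal(s)) // BigDecimal("1.20") keeps precision 3, scale 2
{code}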
[jira] [Created] (SPARK-9291) Conversion is applied twice on partitioned data sources
Reynold Xin created SPARK-9291: -- Summary: Conversion is applied twice on partitioned data sources Key: SPARK-9291 URL: https://issues.apache.org/jira/browse/SPARK-9291 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker We currently apply the conversion twice: once in DataSourceStrategy (search for toCatalystRDD), and again in HadoopFsRelation.buildScan (search for rowToRowRdd). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8906) Move all internal data source related classes out of sources package
[ https://issues.apache.org/jira/browse/SPARK-8906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8906. Resolution: Fixed Fix Version/s: 1.5.0 Move all internal data source related classes out of sources package Key: SPARK-8906 URL: https://issues.apache.org/jira/browse/SPARK-8906 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Move all of them into execution package for better private visibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9207) Turn on Parquet filter push-down by default
[ https://issues.apache.org/jira/browse/SPARK-9207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9207. Resolution: Fixed Fix Version/s: 1.5.0 Turn on Parquet filter push-down by default --- Key: SPARK-9207 URL: https://issues.apache.org/jira/browse/SPARK-9207 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Critical Fix For: 1.5.0 We turned off Parquet filter push-down by default in Spark 1.4.0 and prior versions because of some Parquet-side bugs in Parquet 1.6.0rc3. Now we've upgraded to 1.7.0, which fixed all those bugs, so we should turn Parquet filter push-down on by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
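For anyone who needs to flip this switch by hand (for example, to fall back to the old behavior while debugging), a minimal sketch of the session-level setting in the 1.x line is below; the table path is illustrative.
{code}
// Explicitly enable (or disable) Parquet filter push-down for a session:
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

// With push-down on, a filter like this can be evaluated by Parquet
// itself, skipping row groups whose statistics rule the predicate out:
val df = sqlContext.read.parquet("/path/to/table").filter("id > 100")
{code}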
[jira] [Updated] (SPARK-7446) Inverse transform for StringIndexer
[ https://issues.apache.org/jira/browse/SPARK-7446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7446: - Shepherd: Joseph K. Bradley Inverse transform for StringIndexer --- Key: SPARK-7446 URL: https://issues.apache.org/jira/browse/SPARK-7446 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: holdenk Priority: Minor It is useful to convert the encoded indices back to their string representation for result inspection. We can add a parameter to StringIndexer/StringIndexerModel for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
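A rough sketch of what the inverse transform amounts to, assuming access to the labels array fitted by StringIndexer (the index column is a position into that array); the names here are illustrative, not the final API:
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

// labels(i) is the original string that StringIndexer mapped to index i.
def inverseTransform(df: DataFrame, labels: Array[String],
                     indexCol: String, outputCol: String): DataFrame = {
  val indexToString = udf { (index: Double) => labels(index.toInt) }
  df.withColumn(outputCol, indexToString(df(indexCol)))
}
{code}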
[jira] [Commented] (SPARK-9294) cleanup comments, code style, naming typo for the new aggregation
[ https://issues.apache.org/jira/browse/SPARK-9294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639830#comment-14639830 ] Apache Spark commented on SPARK-9294: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/7619 cleanup comments, code style, naming typo for the new aggregation - Key: SPARK-9294 URL: https://issues.apache.org/jira/browse/SPARK-9294 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9299) percentile and percentile_approx aggregate functions
Yin Huai created SPARK-9299: --- Summary: percentile and percentile_approx aggregate functions Key: SPARK-9299 URL: https://issues.apache.org/jira/browse/SPARK-9299 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai A short introduction on how to build aggregate functions based on our new interface can be found at https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9296) variance, var_pop, and var_samp aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-9296: Description: A short introduction on how to build aggregate functions based on our new interface can be found at https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921. variance, var_pop, and var_samp aggregate functions --- Key: SPARK-9296 URL: https://issues.apache.org/jira/browse/SPARK-9296 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai A short introduction on how to build aggregate functions based on our new interface can be found at https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9298) corr aggregate functions
Yin Huai created SPARK-9298: --- Summary: corr aggregate functions Key: SPARK-9298 URL: https://issues.apache.org/jira/browse/SPARK-9298 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai A short introduction on how to build aggregate functions based on our new interface can be found at https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9297) covar_pop and covar_samp aggregate functions
Yin Huai created SPARK-9297: --- Summary: covar_pop and covar_samp aggregate functions Key: SPARK-9297 URL: https://issues.apache.org/jira/browse/SPARK-9297 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai A short introduction on how to build aggregate functions based on our new interface can be found at https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9292) Analysis should check that join conditions' data types are booleans
[ https://issues.apache.org/jira/browse/SPARK-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9292: --- Assignee: Josh Rosen (was: Apache Spark) Analysis should check that join conditions' data types are booleans --- Key: SPARK-9292 URL: https://issues.apache.org/jira/browse/SPARK-9292 Project: Spark Issue Type: Bug Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen The following data frame query should fail analysis but instead fails at runtime: {code} val df = Seq((1, 1)).toDF("a", "b") df.join(df, df.col("a")) {code} This should fail with an AnalysisException because the column "a" is not a boolean and thus cannot be used as a join condition. This can be fixed by adding a new analysis rule which checks that the join condition has BooleanType. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
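A pseudocode-level sketch of what such a rule could look like, assuming it sits with the other analysis checks; failAnalysis and the plan traversal are stand-ins for whatever the actual patch uses:
{code}
// Reject joins whose condition is not a boolean expression.
plan.foreach {
  case Join(_, _, _, Some(condition)) if condition.dataType != BooleanType =>
    failAnalysis(s"join condition '$condition' of type " +
      s"${condition.dataType.simpleString} is not a boolean.")
  case _ => // other checks
}
{code}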
[jira] [Resolved] (SPARK-9122) spark.mllib regression should support batch predict
[ https://issues.apache.org/jira/browse/SPARK-9122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-9122. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7614 [https://github.com/apache/spark/pull/7614] spark.mllib regression should support batch predict --- Key: SPARK-9122 URL: https://issues.apache.org/jira/browse/SPARK-9122 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Joseph K. Bradley Assignee: Yanbo Liang Labels: starter Fix For: 1.5.0 Original Estimate: 72h Remaining Estimate: 72h Currently, in spark.mllib, generalized linear regression models like LinearRegressionModel, RidgeRegressionModel and LassoModel support predict() via: LinearRegressionModelBase.predict, which only takes single rows (feature vectors). It should support batch prediction, taking an RDD. (See other classes which do this already such as NaiveBayesModel.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
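The requested API is essentially a map over the input RDD with the existing single-vector predict; a minimal sketch is below (a real implementation would broadcast the model weights rather than capture the whole model in the closure, which is what NaiveBayesModel and friends do):
{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

trait BatchPredictable {
  // Existing single-row API:
  def predict(features: Vector): Double

  // Requested batch API: just a map over the RDD.
  def predict(testData: RDD[Vector]): RDD[Double] =
    testData.map(v => predict(v))
}
{code}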
[jira] [Updated] (SPARK-9222) Make class instantiation variables in DistributedLDAModel [private] clustering
[ https://issues.apache.org/jira/browse/SPARK-9222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9222: - Target Version/s: 1.5.0 Make class instantiation variables in DistributedLDAModel [private] clustering -- Key: SPARK-9222 URL: https://issues.apache.org/jira/browse/SPARK-9222 Project: Spark Issue Type: Test Components: MLlib Reporter: Manoj Kumar Assignee: Manoj Kumar Priority: Minor This would enable testing the various class variables like docConcentration, topicConcentration, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9296) variance, var_pop, and var_samp aggregate functions
Yin Huai created SPARK-9296: --- Summary: variance, var_pop, and var_samp aggregate functions Key: SPARK-9296 URL: https://issues.apache.org/jira/browse/SPARK-9296 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9300) histogram_numeric aggregate function
Yin Huai created SPARK-9300: --- Summary: histogram_numeric aggregate function Key: SPARK-9300 URL: https://issues.apache.org/jira/browse/SPARK-9300 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai A short introduction on how to build aggregate functions based on our new interface can be found at https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7650) Move streaming css and js files to the streaming project
[ https://issues.apache.org/jira/browse/SPARK-7650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-7650. - Resolution: Fixed Fix Version/s: 1.4.0 Move streaming css and js files to the streaming project Key: SPARK-7650 URL: https://issues.apache.org/jira/browse/SPARK-7650 Project: Spark Issue Type: Improvement Components: Streaming, Web UI Reporter: Shixiong Zhu Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9301) collect_set and collect_list aggregate functions
Yin Huai created SPARK-9301: --- Summary: collect_set and collect_list aggregate functions Key: SPARK-9301 URL: https://issues.apache.org/jira/browse/SPARK-9301 Project: Spark Issue Type: Sub-task Reporter: Yin Huai A short introduction on how to build aggregate functions based on our new interface can be found at https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9301) collect_set and collect_list aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-9301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-9301: Target Version/s: 1.5.0 collect_set and collect_list aggregate functions Key: SPARK-9301 URL: https://issues.apache.org/jira/browse/SPARK-9301 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai A short introduction on how to build aggregate functions based on our new interface can be found at https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9292) Analysis should check that join conditions' data types are booleans
[ https://issues.apache.org/jira/browse/SPARK-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9292: --- Assignee: Apache Spark (was: Josh Rosen) Analysis should check that join conditions' data types are booleans --- Key: SPARK-9292 URL: https://issues.apache.org/jira/browse/SPARK-9292 Project: Spark Issue Type: Bug Components: SQL Reporter: Josh Rosen Assignee: Apache Spark The following data frame query should fail analysis but instead fails at runtime: {code} val df = Seq((1, 1)).toDF("a", "b") df.join(df, df.col("a")) {code} This should fail with an AnalysisException because the column "a" is not a boolean and thus cannot be used as a join condition. This can be fixed by adding a new analysis rule which checks that the join condition has BooleanType. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9292) Analysis should check that join conditions' data types are booleans
[ https://issues.apache.org/jira/browse/SPARK-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639756#comment-14639756 ] Apache Spark commented on SPARK-9292: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/7630 Analysis should check that join conditions' data types are booleans --- Key: SPARK-9292 URL: https://issues.apache.org/jira/browse/SPARK-9292 Project: Spark Issue Type: Bug Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen The following data frame query should fail analysis but instead fails at runtime: {code} val df = Seq((1, 1)).toDF("a", "b") df.join(df, df.col("a")) {code} This should fail with an AnalysisException because the column "a" is not a boolean and thus cannot be used as a join condition. This can be fixed by adding a new analysis rule which checks that the join condition has BooleanType. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9295) Analysis should detect sorting on unsupported column types
Josh Rosen created SPARK-9295: - Summary: Analysis should detect sorting on unsupported column types Key: SPARK-9295 URL: https://issues.apache.org/jira/browse/SPARK-9295 Project: Spark Issue Type: Bug Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen The SQL analyzer should report errors for queries that try to sort on columns of unsupported types, such as ArrayType. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
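For context, a sketch of the kind of query that should be rejected at analysis time but currently fails later at runtime; the DataFrame construction is illustrative:
{code}
import org.apache.spark.sql.functions._

// Sorting on an ArrayType column: the analyzer should raise an
// AnalysisException here instead of letting execution fail.
val df = sqlContext.range(0, 10)
  .select(array(col("id"), col("id") + 1).as("arr"))
df.orderBy("arr").collect()
{code}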
[jira] [Created] (SPARK-9292) Analysis should check that join conditions' data types are booleans
Josh Rosen created SPARK-9292: - Summary: Analysis should check that join conditions' data types are booleans Key: SPARK-9292 URL: https://issues.apache.org/jira/browse/SPARK-9292 Project: Spark Issue Type: Bug Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen The following data frame query should fail analysis but instead fails at runtime: {code} val df = Seq((1, 1)).toDF("a", "b") df.join(df, df.col("a")) {code} This should fail with an AnalysisException because the column "a" is not a boolean and thus cannot be used as a join condition. This can be fixed by adding a new analysis rule which checks that the join condition has BooleanType. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9293) Analysis should detect when set operations are performed on tables with different numbers of columns
Josh Rosen created SPARK-9293: - Summary: Analysis should detect when set operations are performed on tables with different numbers of columns Key: SPARK-9293 URL: https://issues.apache.org/jira/browse/SPARK-9293 Project: Spark Issue Type: Bug Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen Our SQL analyzer doesn't always enforce that set operations are only performed on relations with the same number of columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
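The missing check boils down to comparing the output arity of the two sides of the set operation; a pseudocode-level sketch, with illustrative names mirroring the join-condition check above:
{code}
// Reject set operations whose inputs have different column counts.
plan.foreach {
  case Union(left, right) if left.output.length != right.output.length =>
    failAnalysis(s"union has ${left.output.length} columns on one side " +
      s"and ${right.output.length} on the other")
  case _ => // Intersect and Except would get the same check
}
{code}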
[jira] [Assigned] (SPARK-9294) cleanup comments, code style, naming typo for the new aggregation
[ https://issues.apache.org/jira/browse/SPARK-9294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9294: --- Assignee: (was: Apache Spark) cleanup comments, code style, naming typo for the new aggregation - Key: SPARK-9294 URL: https://issues.apache.org/jira/browse/SPARK-9294 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9294) cleanup comments, code style, naming typo for the new aggregation
[ https://issues.apache.org/jira/browse/SPARK-9294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9294: --- Assignee: Apache Spark cleanup comments, code style, naming typo for the new aggregation - Key: SPARK-9294 URL: https://issues.apache.org/jira/browse/SPARK-9294 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan Assignee: Apache Spark Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9294) cleanup comments, code style, naming typo for the new aggregation
Wenchen Fan created SPARK-9294: -- Summary: cleanup comments, code style, naming typo for the new aggregation Key: SPARK-9294 URL: https://issues.apache.org/jira/browse/SPARK-9294 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9293) Analysis should detect when set operations are performed on tables with different numbers of columns
[ https://issues.apache.org/jira/browse/SPARK-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639837#comment-14639837 ] Apache Spark commented on SPARK-9293: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/7631 Analysis should detect when set operations are performed on tables with different numbers of columns Key: SPARK-9293 URL: https://issues.apache.org/jira/browse/SPARK-9293 Project: Spark Issue Type: Bug Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen Our SQL analyzer doesn't always enforce that set operations are only performed on relations with the same number of columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9216) Define KinesisBackedBlockRDDs
[ https://issues.apache.org/jira/browse/SPARK-9216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-9216. -- Resolution: Fixed Fix Version/s: 1.5.0 Define KinesisBackedBlockRDDs - Key: SPARK-9216 URL: https://issues.apache.org/jira/browse/SPARK-9216 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Fix For: 1.5.0 https://docs.google.com/document/d/1k0dl270EnK7uExrsCE7jYw7PYx0YC935uBcxn3p0f58/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6548) Adding stddev to DataFrame functions
[ https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6548: Issue Type: Sub-task (was: Improvement) Parent: SPARK-4366 Adding stddev to DataFrame functions Key: SPARK-6548 URL: https://issues.apache.org/jira/browse/SPARK-6548 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: DataFrame, starter Add it to the list of aggregate functions: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala Also add it to https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala We can either add a Stddev Catalyst expression, or just compute it using existing functions like here: https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4366) Aggregation Improvement
[ https://issues.apache.org/jira/browse/SPARK-4366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639921#comment-14639921 ] Yin Huai commented on SPARK-4366: - Here is a brief introduction to implementing a built-in aggregate function that supports code-gen. For our new aggregate function interface, {{AlgebraicAggregate}} is the abstract class used for all built-in aggregate functions that support code-gen. Functions based on {{AlgebraicAggregate}} use our existing expressions to implement operations like initializing aggregation buffer values, updating the buffer, merging two buffers, and evaluating results. A good example is {{org.apache.spark.sql.catalyst.expressions.aggregate.Average}}. Since all operations of an {{AlgebraicAggregate}} are built on top of our expression system, the developer does not need to do anything special to support code-gen. It will just work out of the box. For those built-in functions that are hard to express with our expressions, {{AggregateFunction2}} is the abstract class to use. For descriptions of aggregate functions, here are some references: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF) https://prestodb.io/docs/current/functions/aggregate.html https://msdn.microsoft.com/en-us/library/ms173454.aspx http://www.postgresql.org/docs/devel/static/functions-aggregate.html Aggregation Improvement --- Key: SPARK-4366 URL: https://issues.apache.org/jira/browse/SPARK-4366 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Critical Attachments: aggregatefunction_v1.pdf This improvement actually includes a couple of sub-tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
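To make the shape concrete, here is a very rough sketch of a count-like function in this style. It is not compiled against a specific Spark revision; the member names follow the {{Average}} example mentioned above and should be checked against the actual interface, and required boilerplate (dataType, children, input types) is omitted.
{code}
// Sketch only: an expression-based ("algebraic") count of non-null inputs.
case class MyCount(child: Expression) extends AlgebraicAggregate {
  private val count = AttributeReference("count", LongType)()

  override val bufferAttributes = count :: Nil

  // The buffer starts at 0.
  override val initialValues = Seq(Literal(0L))

  // Add 1 for each non-null input row.
  override val updateExpressions =
    Seq(If(IsNull(child), count, Add(count, Literal(1L))))

  // Merging partial buffers is also expressed with expressions, which
  // is why code-gen works without extra effort from the developer.
  override val mergeExpressions = Seq(count.left + count.right)

  override val evaluateExpression = count

  // dataType, children, nullable, etc. omitted in this sketch.
}
{code}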
[jira] [Commented] (SPARK-6548) stddev_pop and stddev_samp aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639923#comment-14639923 ] Yin Huai commented on SPARK-6548: - A short introduction on how to build aggregate functions based on our new interface can be found at https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921. stddev_pop and stddev_samp aggregate functions -- Key: SPARK-6548 URL: https://issues.apache.org/jira/browse/SPARK-6548 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: DataFrame, starter Add it to the list of aggregate functions: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala Also add it to https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala We can either add a Stddev Catalyst expression, or just compute it using existing functions like here: https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9271) Concurrency bug triggered by partition predicate push-down
[ https://issues.apache.org/jira/browse/SPARK-9271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-9271: -- Description: SPARK-6910 and [PR #7492|https://github.com/apache/spark/pull/7492] introduced partition predicate push-down. However, it seems that it triggers one or more existing concurrency bug(s) (see [this GitHub comment|https://github.com/apache/spark/pull/7421#issuecomment-122527391] for details), and has been causing random Jenkins build failures. This issue needs further investigation and must be fixed for 1.5.0. Observed test failures possibly related to this issue: - {{org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.partcols1}} - {{org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.auto_sortmerge_join_16}} was:SPARK-6910 and [PR #7492|https://github.com/apache/spark/pull/7492] introduced partition predicate push-down. However, it seems that it triggers one or more existing concurrency bug(s) (see [this GitHub comment|https://github.com/apache/spark/pull/7421#issuecomment-122527391] for details), and has been causing random Jenkins build failures. This issue needs further investigation and must be fixed for 1.5.0. Concurrency bug triggered by partition predicate push-down -- Key: SPARK-9271 URL: https://issues.apache.org/jira/browse/SPARK-9271 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Priority: Blocker SPARK-6910 and [PR #7492|https://github.com/apache/spark/pull/7492] introduced partition predicate push-down. However, it seems that it triggers one or more existing concurrency bug(s) (see [this GitHub comment|https://github.com/apache/spark/pull/7421#issuecomment-122527391] for details), and has been causing random Jenkins build failures. This issue needs further investigation and must be fixed for 1.5.0. Observed test failures possibly related to this issue: - {{org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.partcols1}} - {{org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.auto_sortmerge_join_16}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6548) stddev_pop and stddev_samp aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6548: Summary: stddev_pop and stddev_samp aggregate functions (was: Adding stddev to DataFrame functions) stddev_pop and stddev_samp aggregate functions -- Key: SPARK-6548 URL: https://issues.apache.org/jira/browse/SPARK-6548 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: DataFrame, starter Add it to the list of aggregate functions: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala Also add it to https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala We can either add a Stddev Catalyst expression, or just compute it using existing functions like here: https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9295) Analysis should detect sorting on unsupported column types
[ https://issues.apache.org/jira/browse/SPARK-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9295: --- Assignee: Apache Spark (was: Josh Rosen) Analysis should detect sorting on unsupported column types -- Key: SPARK-9295 URL: https://issues.apache.org/jira/browse/SPARK-9295 Project: Spark Issue Type: Bug Components: SQL Reporter: Josh Rosen Assignee: Apache Spark The SQL analyzer should report errors for queries that try to sort on columns of unsupported types, such as ArrayType. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9295) Analysis should detect sorting on unsupported column types
[ https://issues.apache.org/jira/browse/SPARK-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9295: --- Assignee: Josh Rosen (was: Apache Spark) Analysis should detect sorting on unsupported column types -- Key: SPARK-9295 URL: https://issues.apache.org/jira/browse/SPARK-9295 Project: Spark Issue Type: Bug Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen The SQL analyzer should report errors for queries that try to sort on columns of unsupported types, such as ArrayType. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9295) Analysis should detect sorting on unsupported column types
[ https://issues.apache.org/jira/browse/SPARK-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639912#comment-14639912 ] Apache Spark commented on SPARK-9295: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/7633 Analysis should detect sorting on unsupported column types -- Key: SPARK-9295 URL: https://issues.apache.org/jira/browse/SPARK-9295 Project: Spark Issue Type: Bug Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen The SQL analyzer should report errors for queries that try to sort on columns of unsupported types, such as ArrayType. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4366) Aggregation Improvement
[ https://issues.apache.org/jira/browse/SPARK-4366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639400#comment-14639400 ] Herman van Hovell commented on SPARK-4366: -- What is going to happen to the old Aggregate function code path? Will this still be in 1.5? Or will it be removed? Aggregation Improvement --- Key: SPARK-4366 URL: https://issues.apache.org/jira/browse/SPARK-4366 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Critical Attachments: aggregatefunction_v1.pdf This improvement actually includes a couple of sub-tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4366) Aggregation Improvement
[ https://issues.apache.org/jira/browse/SPARK-4366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639403#comment-14639403 ] Yin Huai commented on SPARK-4366: - It will probably still be in 1.5 (right now, we still have a few cases that need to fall back to the old path; for example, when you have multiple distinct columns). But, by default, we will use the new code path. In 1.6, the old path will be removed. Aggregation Improvement --- Key: SPARK-4366 URL: https://issues.apache.org/jira/browse/SPARK-4366 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Critical Attachments: aggregatefunction_v1.pdf This improvement actually includes a couple of sub-tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8359) Spark SQL Decimal type precision loss on multiplication
[ https://issues.apache.org/jira/browse/SPARK-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639472#comment-14639472 ] Sudhakar Thota commented on SPARK-8359: --- This is not working; it breaks at 2^112 with Spark 1.4.1, but works with the git version, with which I went up to 2^1020. Just an FYI. Spark SQL Decimal type precision loss on multiplication --- Key: SPARK-8359 URL: https://issues.apache.org/jira/browse/SPARK-8359 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.5.0 Reporter: Rene Treffer It looks like the precision of decimal cannot be raised beyond ~2^112 without causing full value truncation. The following code computes the power of two up to a specific point: {code} import org.apache.spark.sql.types.Decimal val one = Decimal(1) val two = Decimal(2) def pow(n : Int) : Decimal = if (n <= 0) { one } else { val a = pow(n - 1) a.changePrecision(n,0) two.changePrecision(n,0) a * two } (109 to 120).foreach(n => println(pow(n).toJavaBigDecimal.unscaledValue.toString)) 649037107316853453566312041152512 1298074214633706907132624082305024 2596148429267413814265248164610048 5192296858534827628530496329220096 1038459371706965525706099265844019 2076918743413931051412198531688038 4153837486827862102824397063376076 8307674973655724205648794126752152 1661534994731144841129758825350430 3323069989462289682259517650700860 6646139978924579364519035301401720 1329227995784915872903807060280344 {code} Beyond ~2^112 the precision is truncated even if the precision was set to n and should thus handle 10^n without problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1744) Document how to pass in preferredNodeLocationData
[ https://issues.apache.org/jira/browse/SPARK-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-1744. --- Resolution: Won't Fix Document how to pass in preferredNodeLocationData - Key: SPARK-1744 URL: https://issues.apache.org/jira/browse/SPARK-1744 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.0.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9286) Methods in Unevaluable should be final
Josh Rosen created SPARK-9286: - Summary: Methods in Unevaluable should be final Key: SPARK-9286 URL: https://issues.apache.org/jira/browse/SPARK-9286 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Trivial The {{eval()}} and {{genCode()}} methods in SQL's {{Unevaluable}} trait should be marked as {{final}} and we should fix any cases where they are overridden. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9279) Spark Master Refuses to Bind WebUI to a Privileged Port
[ https://issues.apache.org/jira/browse/SPARK-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omar Padron updated SPARK-9279: --- Description: When trying to start a spark master server as root... {code} export SPARK_MASTER_PORT=7077 export SPARK_MASTER_WEBUI_PORT=80 spark-class org.apache.spark.deploy.master.Master \ --host $( hostname ) \ --port $SPARK_MASTER_PORT \ --webui-port $SPARK_MASTER_WEBUI_PORT {code} The process terminates with IllegalArgumentException "requirement failed: startPort should be between 1024 and 65535 (inclusive), or 0 for a random free port." But, when SPARK_MASTER_WEBUI_PORT=8080 (or anything > 1024), the process runs fine. I do not understand why the usable ports have been arbitrarily restricted to the non-privileged. Users choosing to run spark as root should be allowed to choose their own ports. Full output from a sample run below: {code} 2015-07-23 14:36:50,892 INFO [main] master.Master (SignalLogger.scala:register(47)) - Registered signal handlers for [TERM, HUP, INT] 2015-07-23 14:36:51,399 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2015-07-23 14:36:51,586 INFO [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing view acls to: root 2015-07-23 14:36:51,587 INFO [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing modify acls to: root 2015-07-23 14:36:51,588 INFO [main] spark.SecurityManager (Logging.scala:logInfo(59)) - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root) 2015-07-23 14:36:52,295 INFO [sparkMaster-akka.actor.default-dispatcher-2] slf4j.Slf4jLogger (Slf4jLogger.scala:applyOrElse(80)) - Slf4jLogger started 2015-07-23 14:36:52,349 INFO [sparkMaster-akka.actor.default-dispatcher-2] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Starting remoting 2015-07-23 14:36:52,489 INFO [sparkMaster-akka.actor.default-dispatcher-2] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting started; listening on addresses :[akka.tcp://sparkMaster@sparkmaster:7077] 2015-07-23 14:36:52,497 INFO [main] util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'sparkMaster' on port 7077. 2015-07-23 14:36:52,717 INFO [sparkMaster-akka.actor.default-dispatcher-4] server.Server (Server.java:doStart(272)) - jetty-8.y.z-SNAPSHOT 2015-07-23 14:36:52,759 INFO [sparkMaster-akka.actor.default-dispatcher-4] server.AbstractConnector (AbstractConnector.java:doStart(338)) - Started SelectChannelConnector@sparkmaster:6066 2015-07-23 14:36:52,759 INFO [sparkMaster-akka.actor.default-dispatcher-4] util.Utils (Logging.scala:logInfo(59)) - Successfully started service on port 6066. 
2015-07-23 14:36:52,760 INFO [sparkMaster-akka.actor.default-dispatcher-4] rest.StandaloneRestServer (Logging.scala:logInfo(59)) - Started REST server for submitting applications on port 6066 2015-07-23 14:36:52,765 INFO [sparkMaster-akka.actor.default-dispatcher-4] master.Master (Logging.scala:logInfo(59)) - Starting Spark master at spark://sparkmaster:7077 2015-07-23 14:36:52,766 INFO [sparkMaster-akka.actor.default-dispatcher-4] master.Master (Logging.scala:logInfo(59)) - Running Spark version 1.4.1 2015-07-23 14:36:52,772 ERROR [sparkMaster-akka.actor.default-dispatcher-4] ui.MasterWebUI (Logging.scala:logError(96)) - Failed to bind MasterWebUI java.lang.IllegalArgumentException: requirement failed: startPort should be between 1024 and 65535 (inclusive), or 0 for a random free port. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1977) at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:238) at org.apache.spark.ui.WebUI.bind(WebUI.scala:117) at org.apache.spark.deploy.master.Master.preStart(Master.scala:144) at akka.actor.Actor$class.aroundPreStart(Actor.scala:470) at org.apache.spark.deploy.master.Master.aroundPreStart(Master.scala:52) at akka.actor.ActorCell.create(ActorCell.scala:580) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code}
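Judging from the exception text above, the restriction is a simple range check in Utils.startServiceOnPort, essentially the following (reconstructed from the error message, not quoted from source):
{code}
// The check that rejects privileged ports (sketch):
require(startPort == 0 || (1024 <= startPort && startPort < 65536),
  "startPort should be between 1024 and 65535 (inclusive), " +
    "or 0 for a random free port.")
{code}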
[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history
[ https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien-Dung LE updated SPARK-9280: Affects Version/s: 1.3.1 New HiveContext object unexpectedly loads configuration settings from history -- Key: SPARK-9280 URL: https://issues.apache.org/jira/browse/SPARK-9280 Project: Spark Issue Type: Bug Affects Versions: 1.3.1 Reporter: Tien-Dung LE In a spark-shell session, stopping a spark context and creating a new spark context and hive context does not clean the Spark SQL configuration. More precisely, the new hive context still keeps the previous configuration settings. Here is some code to show this scenario. {code:title=New hive context should not load the configurations from history} case class Foo ( x: Int = (math.random * 1e3).toInt) val foo = (1 to 100).map(i => Foo()).toDF foo.saveAsParquetFile( "foo" ) sqlContext.setConf( "spark.sql.shuffle.partitions", "10") sc.stop val sparkConf2 = new org.apache.spark.SparkConf() val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") val foo2 = sqlContext2.parquetFile( "foo" ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") // expected 30 but got 10 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history
Tien-Dung LE created SPARK-9280: --- Summary: New HiveContext object unexpectedly loads configuration settings from history Key: SPARK-9280 URL: https://issues.apache.org/jira/browse/SPARK-9280 Project: Spark Issue Type: Bug Reporter: Tien-Dung LE In a spark-shell session, stopping a spark context and creating a new spark context and hive context does not clean the Spark SQL configuration. More precisely, the new hive context still keeps the previous configuration settings. Here is some code to show this scenario. {code:title=New hive context should not load the configurations from history} case class Foo ( x: Int = (math.random * 1e3).toInt) val foo = (1 to 100).map(i => Foo()).toDF foo.saveAsParquetFile( "foo" ) sqlContext.setConf( "spark.sql.shuffle.partitions", "10") sc.stop val sparkConf2 = new org.apache.spark.SparkConf() val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") val foo2 = sqlContext2.parquetFile( "foo" ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") // expected 30 but got 10 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history
[ https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien-Dung LE updated SPARK-9280: Description: In a spark-shell session, stopping a spark context and creating a new spark context and hive context does not clean the Spark SQL configuration. More precisely, the new hive context still keeps the previous configuration settings. It would be great if someone could let us know how to avoid this situation. {code:title=New hive context should not load the configurations from history} case class Foo ( x: Int = (math.random * 1e3).toInt) val foo = (1 to 100).map(i => Foo()).toDF foo.saveAsParquetFile( "foo" ) sqlContext.setConf( "spark.sql.shuffle.partitions", "10") sc.stop val sparkConf2 = new org.apache.spark.SparkConf() val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") // got 20 as expected val foo2 = sqlContext2.parquetFile( "foo" ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") // expected 30 but got 10 {code} was: In a spark-shell session, stopping a spark context and creating a new spark context and hive context does not clean the Spark SQL configuration. More precisely, the new hive context still keeps the previous configuration settings. It would be great if someone could let us know how to avoid this situation. {code:title=New hive context should not load the configurations from history} case class Foo ( x: Int = (math.random * 1e3).toInt) val foo = (1 to 100).map(i => Foo()).toDF foo.saveAsParquetFile( "foo" ) sqlContext.setConf( "spark.sql.shuffle.partitions", "10") sc.stop val sparkConf2 = new org.apache.spark.SparkConf() val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") val foo2 = sqlContext2.parquetFile( "foo" ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") // expected 30 but got 10 {code} New HiveContext object unexpectedly loads configuration settings from history -- Key: SPARK-9280 URL: https://issues.apache.org/jira/browse/SPARK-9280 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Tien-Dung LE In a spark-shell session, stopping a spark context and creating a new spark context and hive context does not clean the Spark SQL configuration. More precisely, the new hive context still keeps the previous configuration settings. It would be great if someone could let us know how to avoid this situation. {code:title=New hive context should not load the configurations from history} case class Foo ( x: Int = (math.random * 1e3).toInt) val foo = (1 to 100).map(i => Foo()).toDF foo.saveAsParquetFile( "foo" ) sqlContext.setConf( "spark.sql.shuffle.partitions", "10") sc.stop val sparkConf2 = new org.apache.spark.SparkConf() val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") // got 20 as expected val foo2 = sqlContext2.parquetFile( "foo" ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") // expected 30 but got 10 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history
[ https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien-Dung LE updated SPARK-9280: Component/s: SQL New HiveContext object unexpectedly loads configuration settings from history -- Key: SPARK-9280 URL: https://issues.apache.org/jira/browse/SPARK-9280 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Tien-Dung LE In a spark-shell session, stopping a spark context and creating a new spark context and hive context does not clean the Spark SQL configuration. More precisely, the new hive context still keeps the previous configuration settings. Here is some code to show this scenario. {code:title=New hive context should not load the configurations from history} case class Foo ( x: Int = (math.random * 1e3).toInt) val foo = (1 to 100).map(i => Foo()).toDF foo.saveAsParquetFile( "foo" ) sqlContext.setConf( "spark.sql.shuffle.partitions", "10") sc.stop val sparkConf2 = new org.apache.spark.SparkConf() val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") val foo2 = sqlContext2.parquetFile( "foo" ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") // expected 30 but got 10 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history
[ https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien-Dung LE updated SPARK-9280: Description: In a spark-shell session, stopping a spark context and creating a new spark context and hive context does not clean the Spark SQL configuration. More precisely, the new hive context still keeps the previous configuration settings. It would be great if someone could let us know how to avoid this situation. {code:title=New hive context should not load the configurations from history} case class Foo ( x: Int = (math.random * 1e3).toInt) val foo = (1 to 100).map(i => Foo()).toDF foo.saveAsParquetFile( "foo" ) sqlContext.setConf( "spark.sql.shuffle.partitions", "10") sc.stop val sparkConf2 = new org.apache.spark.SparkConf() val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") val foo2 = sqlContext2.parquetFile( "foo" ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") // expected 30 but got 10 {code} was: In a spark-shell session, stopping a spark context and creating a new spark context and hive context does not clean the Spark SQL configuration. More precisely, the new hive context still keeps the previous configuration settings. Here is some code to show this scenario. {code:title=New hive context should not load the configurations from history} case class Foo ( x: Int = (math.random * 1e3).toInt) val foo = (1 to 100).map(i => Foo()).toDF foo.saveAsParquetFile( "foo" ) sqlContext.setConf( "spark.sql.shuffle.partitions", "10") sc.stop val sparkConf2 = new org.apache.spark.SparkConf() val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") val foo2 = sqlContext2.parquetFile( "foo" ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") // expected 30 but got 10 {code} New HiveContext object unexpectedly loads configuration settings from history -- Key: SPARK-9280 URL: https://issues.apache.org/jira/browse/SPARK-9280 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Tien-Dung LE In a spark-shell session, stopping a spark context and creating a new spark context and hive context does not clean the Spark SQL configuration. More precisely, the new hive context still keeps the previous configuration settings. It would be great if someone could let us know how to avoid this situation. {code:title=New hive context should not load the configurations from history} case class Foo ( x: Int = (math.random * 1e3).toInt) val foo = (1 to 100).map(i => Foo()).toDF foo.saveAsParquetFile( "foo" ) sqlContext.setConf( "spark.sql.shuffle.partitions", "10") sc.stop val sparkConf2 = new org.apache.spark.SparkConf() val sc2 = new org.apache.spark.SparkContext( sparkConf2 ) val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "20") val foo2 = sqlContext2.parquetFile( "foo" ) sqlContext2.getConf( "spark.sql.shuffle.partitions", "30") // expected 30 but got 10 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9278) DataFrameWriter.insertInto inserts incorrect data
[ https://issues.apache.org/jira/browse/SPARK-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14638843#comment-14638843 ] Steve Lindemann commented on SPARK-9278: Here are the steps to reproduce the issue. First, create a Hive table with the desired schema: {noformat} In [1]: hc = pyspark.sql.HiveContext(sqlContext) In [2]: pdf = pd.DataFrame({'pk': ['a']*5+['b']*5+['c']*5, 'k': ['a', 'e', 'i', 'o', 'u']*3, 'v': range(15)}) In [3]: sdf = hc.createDataFrame(pdf) In [4]: sdf.show() +-+--+--+ |k|pk| v| +-+--+--+ |a| a| 0| |e| a| 1| |i| a| 2| |o| a| 3| |u| a| 4| |a| b| 5| |e| b| 6| |i| b| 7| |o| b| 8| |u| b| 9| |a| c|10| |e| c|11| |i| c|12| |o| c|13| |u| c|14| +-+--+--+ In [5]: sdf.filter('FALSE').write.partitionBy('pk').saveAsTable('foo', format='parquet', path='s3a://eglp-core-temp/tmp/foo') {noformat} A table has been created: {noformat} In [33]: print('\n'.join(r.result for r in hc.sql('SHOW CREATE TABLE foo').collect())) CREATE EXTERNAL TABLE `foo`( `col` array<string> COMMENT 'from deserializer') PARTITIONED BY ( `pk` string COMMENT '') ROW FORMAT DELIMITED STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat' LOCATION 's3a://eglp-core-data/hive/warehouse/foo' TBLPROPERTIES ( 'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"k\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"v\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"pk\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}', 'transient_lastDdlTime'='1437657391', 'spark.sql.sources.schema.numParts'='1', 'spark.sql.sources.provider'='parquet') {noformat} Now, write a new partition of data (note that this is from the same DataFrame from which the table was created): {noformat} sdf.filter(sdf.pk == 'a').write.partitionBy('pk').insertInto('foo') {noformat} Then, select the data: {noformat} In [7]: foo = hc.table('foo') In [8]: foo.show() +-++--+ |k| v|pk| +-++--+ |a|null| 0| |o|null| 3| |i|null| 2| |e|null| 1| |u|null| 4| +-++--+ In [9]: sdf.filter(sdf.pk == 'a').show() +-+--+-+ |k|pk|v| +-+--+-+ |a| a|0| |e| a|1| |i| a|2| |o| a|3| |u| a|4| +-+--+-+ {noformat} So clearly it inserted incorrect data. By reordering the columns, we can insert data properly: {noformat} In [10]: pdf2 = pdf[['k', 'v', 'pk']] In [11]: sdf2 = hc.createDataFrame(pdf2) In [12]: sdf2.filter(sdf2.pk == 'a').write.partitionBy('pk').insertInto('foo') In [13]: hc.refreshTable('foo') In [14]: foo = hc.table('foo') In [15]: foo.show() +-++--+ |k| v|pk| +-++--+ |a|null| 0| |o|null| 3| |i|null| 2| |e|null| 1| |u|null| 4| |o| 3| a| |u| 4| a| |a| 0| a| |e| 1| a| |i| 2| a| +-++--+ {noformat} DataFrameWriter.insertInto inserts incorrect data - Key: SPARK-9278 URL: https://issues.apache.org/jira/browse/SPARK-9278 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: Linux, S3, Hive Metastore Reporter: Steve Lindemann After creating a partitioned Hive table (stored as Parquet) via the DataFrameWriter.createTable command, subsequent attempts to insert additional data into new partitions of this table result in inserting incorrect data rows. Reordering the columns in the data to be written seems to avoid this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
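Until the underlying bug is fixed, the workaround described above amounts to selecting the columns into the table's order before writing. A hypothetical Scala equivalent (df stands in for the DataFrame being written; column names taken from the example):
{code}
// Put columns in the table's order (k, v, pk) before inserting,
// so positional insertion lines up with the table schema.
val reordered = df.select("k", "v", "pk")
reordered.write.partitionBy("pk").insertInto("foo")
{code}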
[jira] [Created] (SPARK-9279) Spark Master Refuses to Bind WebUI to a Privileged Port
Omar Padron created SPARK-9279: -- Summary: Spark Master Refuses to Bind WebUI to a Privileged Port Key: SPARK-9279 URL: https://issues.apache.org/jira/browse/SPARK-9279 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Environment: Ubuntu Trusty running in a docker container Reporter: Omar Padron Priority: Minor When trying to start a spark master server as root...
{code}
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=80

spark-class org.apache.spark.deploy.master.Master \
    --host $( hostname ) \
    --port $SPARK_MASTER_PORT \
    --webui-port $SPARK_MASTER_WEBUI_PORT
{code}
The process terminates with IllegalArgumentException "requirement failed: startPort should be between 1024 and 65535 (inclusive), or 0 for a random free port." But, when SPARK_MASTER_WEBUI_PORT=8080 (or anything > 1024), the process runs fine. I do not understand why the usable ports have been arbitrarily restricted to the non-privileged. Users choosing to run Spark as root should be allowed to choose their own ports. Full output from a sample run below:
{code}
2015-07-23 14:36:50,892 INFO [main] master.Master (SignalLogger.scala:register(47)) - Registered signal handlers for [TERM, HUP, INT]
2015-07-23 14:36:51,399 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-07-23 14:36:51,586 INFO [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing view acls to: root
2015-07-23 14:36:51,587 INFO [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing modify acls to: root
2015-07-23 14:36:51,588 INFO [main] spark.SecurityManager (Logging.scala:logInfo(59)) - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
2015-07-23 14:36:52,295 INFO [sparkMaster-akka.actor.default-dispatcher-2] slf4j.Slf4jLogger (Slf4jLogger.scala:applyOrElse(80)) - Slf4jLogger started
2015-07-23 14:36:52,349 INFO [sparkMaster-akka.actor.default-dispatcher-2] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Starting remoting
2015-07-23 14:36:52,489 INFO [sparkMaster-akka.actor.default-dispatcher-2] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting started; listening on addresses :[akka.tcp://sparkMaster@sparkmaster:7077]
2015-07-23 14:36:52,497 INFO [main] util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'sparkMaster' on port 7077.
2015-07-23 14:36:52,717 INFO [sparkMaster-akka.actor.default-dispatcher-4] server.Server (Server.java:doStart(272)) - jetty-8.y.z-SNAPSHOT
2015-07-23 14:36:52,759 INFO [sparkMaster-akka.actor.default-dispatcher-4] server.AbstractConnector (AbstractConnector.java:doStart(338)) - Started SelectChannelConnector@sparkmaster:6066
2015-07-23 14:36:52,759 INFO [sparkMaster-akka.actor.default-dispatcher-4] util.Utils (Logging.scala:logInfo(59)) - Successfully started service on port 6066.
2015-07-23 14:36:52,760 INFO [sparkMaster-akka.actor.default-dispatcher-4] rest.StandaloneRestServer (Logging.scala:logInfo(59)) - Started REST server for submitting applications on port 6066
2015-07-23 14:36:52,765 INFO [sparkMaster-akka.actor.default-dispatcher-4] master.Master (Logging.scala:logInfo(59)) - Starting Spark master at spark://sparkmaster:7077
2015-07-23 14:36:52,766 INFO [sparkMaster-akka.actor.default-dispatcher-4] master.Master (Logging.scala:logInfo(59)) - Running Spark version 1.4.1
2015-07-23 14:36:52,772 ERROR [sparkMaster-akka.actor.default-dispatcher-4] ui.MasterWebUI (Logging.scala:logError(96)) - Failed to bind MasterWebUI
java.lang.IllegalArgumentException: requirement failed: startPort should be between 1024 and 65535 (inclusive), or 0 for a random free port.
        at scala.Predef$.require(Predef.scala:233)
        at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1977)
        at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:238)
        at org.apache.spark.ui.WebUI.bind(WebUI.scala:117)
        at org.apache.spark.deploy.master.Master.preStart(Master.scala:144)
        at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
        at org.apache.spark.deploy.master.Master.aroundPreStart(Master.scala:52)
        at akka.actor.ActorCell.create(ActorCell.scala:580)
        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
        at
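For context, the failure comes from a range check in Spark's port utilities. The following is a paraphrased sketch based only on the error message and stack trace above (Utils.startServiceOnPort), not the actual source; the function name and exact shape are assumptions:
{code}
// Paraphrased sketch of the check implied by the stack trace; names assumed.
def validateStartPort(startPort: Int): Unit = {
  require(startPort == 0 || (startPort >= 1024 && startPort <= 65535),
    "startPort should be between 1024 and 65535 (inclusive), or 0 for a random free port.")
}
{code}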
[jira] [Resolved] (SPARK-9279) Spark Master Refuses to Bind WebUI to a Privileged Port
[ https://issues.apache.org/jira/browse/SPARK-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9279. -- Resolution: Not A Problem This has nothing to do with Spark. Any Linux-like OS requires root privileges for any process to bind to a port under 1024. Spark Master Refuses to Bind WebUI to a Privileged Port --- Key: SPARK-9279 URL: https://issues.apache.org/jira/browse/SPARK-9279 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Environment: Ubuntu Trusty running in a docker container Reporter: Omar Padron Priority: Minor When trying to start a spark master server as root...
{code}
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=80

spark-class org.apache.spark.deploy.master.Master \
    --host $( hostname ) \
    --port $SPARK_MASTER_PORT \
    --webui-port $SPARK_MASTER_WEBUI_PORT
{code}
The process terminates with IllegalArgumentException "requirement failed: startPort should be between 1024 and 65535 (inclusive), or 0 for a random free port." But, when SPARK_MASTER_WEBUI_PORT=8080 (or anything > 1024), the process runs fine. I do not understand why the usable ports have been arbitrarily restricted to the non-privileged. Users choosing to run Spark as root should be allowed to choose their own ports. Full output from a sample run below:
{code}
2015-07-23 14:36:50,892 INFO [main] master.Master (SignalLogger.scala:register(47)) - Registered signal handlers for [TERM, HUP, INT]
2015-07-23 14:36:51,399 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-07-23 14:36:51,586 INFO [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing view acls to: root
2015-07-23 14:36:51,587 INFO [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing modify acls to: root
2015-07-23 14:36:51,588 INFO [main] spark.SecurityManager (Logging.scala:logInfo(59)) - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
2015-07-23 14:36:52,295 INFO [sparkMaster-akka.actor.default-dispatcher-2] slf4j.Slf4jLogger (Slf4jLogger.scala:applyOrElse(80)) - Slf4jLogger started
2015-07-23 14:36:52,349 INFO [sparkMaster-akka.actor.default-dispatcher-2] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Starting remoting
2015-07-23 14:36:52,489 INFO [sparkMaster-akka.actor.default-dispatcher-2] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting started; listening on addresses :[akka.tcp://sparkMaster@sparkmaster:7077]
2015-07-23 14:36:52,497 INFO [main] util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'sparkMaster' on port 7077.
2015-07-23 14:36:52,717 INFO [sparkMaster-akka.actor.default-dispatcher-4] server.Server (Server.java:doStart(272)) - jetty-8.y.z-SNAPSHOT
2015-07-23 14:36:52,759 INFO [sparkMaster-akka.actor.default-dispatcher-4] server.AbstractConnector (AbstractConnector.java:doStart(338)) - Started SelectChannelConnector@sparkmaster:6066
2015-07-23 14:36:52,759 INFO [sparkMaster-akka.actor.default-dispatcher-4] util.Utils (Logging.scala:logInfo(59)) - Successfully started service on port 6066.
2015-07-23 14:36:52,760 INFO [sparkMaster-akka.actor.default-dispatcher-4] rest.StandaloneRestServer (Logging.scala:logInfo(59)) - Started REST server for submitting applications on port 6066
2015-07-23 14:36:52,765 INFO [sparkMaster-akka.actor.default-dispatcher-4] master.Master (Logging.scala:logInfo(59)) - Starting Spark master at spark://sparkmaster:7077
2015-07-23 14:36:52,766 INFO [sparkMaster-akka.actor.default-dispatcher-4] master.Master (Logging.scala:logInfo(59)) - Running Spark version 1.4.1
2015-07-23 14:36:52,772 ERROR [sparkMaster-akka.actor.default-dispatcher-4] ui.MasterWebUI (Logging.scala:logError(96)) - Failed to bind MasterWebUI
java.lang.IllegalArgumentException: requirement failed: startPort should be between 1024 and 65535 (inclusive), or 0 for a random free port.
        at scala.Predef$.require(Predef.scala:233)
        at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1977)
        at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:238)
        at org.apache.spark.ui.WebUI.bind(WebUI.scala:117)
        at org.apache.spark.deploy.master.Master.preStart(Master.scala:144)
        at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
        at org.apache.spark.deploy.master.Master.aroundPreStart(Master.scala:52)
        at akka.actor.ActorCell.create(ActorCell.scala:580)
        at
[jira] [Updated] (SPARK-9270) spark.app.name is not honored by pyspark
[ https://issues.apache.org/jira/browse/SPARK-9270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated SPARK-9270: - Description: Currently, the app name is hardcoded in pyspark as PySparkShell, and the {{spark.app.name}} property is not honored. SPARK-8650 and SPARK-9180 fixed this issue for spark-sql and spark-shell, but pyspark is not fixed yet. sparkR is different because {{SparkContext}} is not automatically constructed in sparkR, and the app name can be set when initializing {{SparkContext}}. In summary:
||shell||support --conf spark.app.name||
|pyspark|no|
|spark-shell|yes|
|spark-sql|yes|
|sparkR|n/a|
was: Currently, the app name is hardcoded in spark-shell and pyspark as SparkShell and PySparkShell respectively, and the {{spark.app.name}} property is not honored. But being able to set the app name is quite handy for various cluster operations, e.g. filtering jobs by app name on the YARN RM page. SPARK-8650 fixed this issue for spark-sql, but it didn't for spark-shell and pyspark. sparkR is different because {{SparkContext}} is not automatically constructed in sparkR, and the app name can be set when initializing {{SparkContext}}. In summary:
||shell||support --conf spark.app.name||
|spark-shell|no|
|pyspark|no|
|spark-sql|yes|
|sparkR|n/a|
Component/s: (was: Spark Shell) Summary: spark.app.name is not honored by pyspark (was: spark.app.name is not honored by spark-shell and pyspark) spark.app.name is not honored by pyspark Key: SPARK-9270 URL: https://issues.apache.org/jira/browse/SPARK-9270 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.1, 1.5.0 Reporter: Cheolsoo Park Priority: Minor Currently, the app name is hardcoded in pyspark as PySparkShell, and the {{spark.app.name}} property is not honored. SPARK-8650 and SPARK-9180 fixed this issue for spark-sql and spark-shell, but pyspark is not fixed yet. sparkR is different because {{SparkContext}} is not automatically constructed in sparkR, and the app name can be set when initializing {{SparkContext}}. In summary:
||shell||support --conf spark.app.name||
|pyspark|no|
|spark-shell|yes|
|spark-sql|yes|
|sparkR|n/a|
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7254) Extend PIC to handle Graphs directly
[ https://issues.apache.org/jira/browse/SPARK-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7254. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6054 [https://github.com/apache/spark/pull/6054] Extend PIC to handle Graphs directly Key: SPARK-7254 URL: https://issues.apache.org/jira/browse/SPARK-7254 Project: Spark Issue Type: New Feature Components: GraphX, MLlib Reporter: Joseph K. Bradley Fix For: 1.5.0 We should extend the PowerIterationClustering API to handle Graphs. Users can do spectral clustering on graphs using PIC currently, but they must handle the boilerplate of converting the Graph to an RDD for PIC, running PIC, and then matching the results back with their Graph. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
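For reference, here is a minimal Scala sketch of the boilerplate this issue proposes to fold into the API. The helper name picOnGraph is hypothetical; it assumes a Graph whose Double edge attribute is a nonnegative similarity:
{code}
import org.apache.spark.graphx.Graph
import org.apache.spark.mllib.clustering.PowerIterationClustering

// Hypothetical helper: run PIC on a Graph, then join the cluster assignments
// back onto the graph's vertices.
def picOnGraph[VD](graph: Graph[VD, Double], k: Int, maxIterations: Int) = {
  // 1. Flatten the graph into PIC's (srcId, dstId, similarity) input RDD.
  val similarities = graph.edges.map(e => (e.srcId, e.dstId, e.attr))
  // 2. Run PIC on the RDD.
  val model = new PowerIterationClustering()
    .setK(k)
    .setMaxIterations(maxIterations)
    .run(similarities)
  // 3. Match the results back with the original graph's vertices.
  val assignments = model.assignments.map(a => (a.id, a.cluster))
  graph.outerJoinVertices(assignments)((_, attr, cluster) => (attr, cluster))
}
{code}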
[jira] [Updated] (SPARK-7254) Extend PIC to handle Graphs directly
[ https://issues.apache.org/jira/browse/SPARK-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7254: - Assignee: Liang-Chi Hsieh Extend PIC to handle Graphs directly Key: SPARK-7254 URL: https://issues.apache.org/jira/browse/SPARK-7254 Project: Spark Issue Type: New Feature Components: GraphX, MLlib Reporter: Joseph K. Bradley Assignee: Liang-Chi Hsieh Fix For: 1.5.0 We should extend the PowerIterationClustering API to handle Graphs. Users can do spectral clustering on graphs using PIC currently, but they must handle the boilerplate of converting the Graph to an RDD for PIC, running PIC, and then matching the results back with their Graph. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7254) Extend PIC to handle Graphs directly
[ https://issues.apache.org/jira/browse/SPARK-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7254: - Target Version/s: 1.5.0 Extend PIC to handle Graphs directly Key: SPARK-7254 URL: https://issues.apache.org/jira/browse/SPARK-7254 Project: Spark Issue Type: New Feature Components: GraphX, MLlib Reporter: Joseph K. Bradley Assignee: Liang-Chi Hsieh Fix For: 1.5.0 We should extend the PowerIterationClustering API to handle Graphs. Users can do spectral clustering on graphs using PIC currently, but they must handle the boilerplate of converting the Graph to an RDD for PIC, running PIC, and then matching the results back with their Graph. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9243) Update crosstab doc for pairs that have no occurrences
[ https://issues.apache.org/jira/browse/SPARK-9243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-9243: Assignee: Xiangrui Meng Update crosstab doc for pairs that have no occurrences -- Key: SPARK-9243 URL: https://issues.apache.org/jira/browse/SPARK-9243 Project: Spark Issue Type: Improvement Components: Documentation, PySpark, SparkR, SQL Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng The crosstab value for pairs that have no occurrences was changed from null to 0 in SPARK-7982. We should update the doc in Scala, Python, and SparkR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
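A minimal Scala sketch of the behavior the updated doc should describe (assuming a SQLContext named sqlContext, as in spark-shell; row order in the output may differ):
{code}
// Pairs that never co-occur now show 0 instead of null in the crosstab.
val df = sqlContext.createDataFrame(Seq(
  ("a", 1), ("a", 1), ("b", 2)
)).toDF("key", "value")

df.stat.crosstab("key", "value").show()
// +---------+-+-+
// |key_value|1|2|
// +---------+-+-+
// |        a|2|0|   <- ("a", 2) never occurs, so its cell is 0, not null
// |        b|0|1|
// +---------+-+-+
{code}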
[jira] [Created] (SPARK-9272) Persist information of individual partitions when persisting partitioned data source tables to metastore
Cheng Lian created SPARK-9272: - Summary: Persist information of individual partitions when persisting partitioned data source tables to metastore Key: SPARK-9272 URL: https://issues.apache.org/jira/browse/SPARK-9272 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Currently, when a partitioned data source table is persisted to the Hive metastore, we only persist its partition columns; information about individual partitions is not persisted. This forces us to do partition discovery before reading a persisted partitioned table, which hurts performance. To fix this issue, we may persist partition information into the metastore. Specifically, the format should be compatible with Hive to ensure interoperability. One approach to collecting partition values and partition directory paths for dynamically partitioned tables is to use accumulators to gather the expected information during the write job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
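A rough Scala sketch of the accumulator idea. All helper names and the write path are hypothetical; it only illustrates collecting (partitionValue, directory) pairs on the executors and reading them back on the driver:
{code}
import scala.collection.mutable
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical sketch: each write task records the partitions it produced in
// an accumulator; after the job, the driver holds every (value, path) pair
// and can register them with the Hive metastore in one batch.
def writeAndCollectPartitions(sc: SparkContext, rows: RDD[(String, String)]): Unit = {
  // (partitionValue, directory) pairs observed during the write job
  val seenPartitions = sc.accumulableCollection(mutable.HashSet[(String, String)]())

  rows.foreachPartition { iter =>
    iter.foreach { case (partitionValue, payload) =>
      val dir = s"pk=$partitionValue"  // stand-in for the real output path
      // ... write `payload` under `dir` here ...
      seenPartitions += ((partitionValue, dir))
    }
  }

  // Driver side: persist the collected partition info to the metastore.
  seenPartitions.value.foreach { case (v, d) =>
    println(s"would register partition (pk=$v) at location '$d'")
  }
}
{code}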