[jira] [Resolved] (SPARK-2247) Data frame (or Pandas) like API for structured data
[ https://issues.apache.org/jira/browse/SPARK-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2247. Resolution: Duplicate Assignee: Reynold Xin Target Version/s: 1.3.0 Data frame (or Pandas) like API for structured data --- Key: SPARK-2247 URL: https://issues.apache.org/jira/browse/SPARK-2247 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core, SQL Affects Versions: 1.0.0 Reporter: venu k tangirala Assignee: Reynold Xin Labels: features It would be nice to have R- or Python-pandas-like data frames on Spark: 1) to be able to access the RDD data frame from Python with pandas; 2) to be able to access the RDD data frame from R; 3) to be able to access the RDD data frame from Scala's Saddle.
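For context, the round trip being asked for can be sketched with the Python DataFrame API that later shipped in Spark 1.3; the names below (SQLContext, createDataFrame, toPandas) are from that later API, not anything this ticket itself guarantees:
{code}
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "pandas-interop-sketch")
sqlContext = SQLContext(sc)

# pandas -> Spark: distribute a local pandas DataFrame
pdf = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": ["a", "b", "c"]})
sdf = sqlContext.createDataFrame(pdf)

# Spark -> pandas: collect a (small) result back to the driver
result = sdf.filter(sdf.x > 1.0).toPandas()
print(result)
{code}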
[jira] [Updated] (SPARK-5097) Adding data frame APIs to SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5097: --- Description: SchemaRDD, through its DSL, already has many of the functionalities provided by common data frame operations. However, the DSL was originally created for constructing test cases without much end-user usability and API stability consideration. This design doc proposes a set of API changes for Scala and Python to make the SchemaRDD DSL API more usable and stable. was: SchemaRDD, through its DSL, already has many of the functionalities provided by common data frame operations. However, the DSL was originally created for constructing test cases without much end-user usability and API stability consideration. This design doc proposes a set of API changes to make the SchemaRDD DSL API more usable and stable. Adding data frame APIs to SchemaRDD --- Key: SPARK-5097 URL: https://issues.apache.org/jira/browse/SPARK-5097 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf SchemaRDD, through its DSL, already has many of the functionalities provided by common data frame operations. However, the DSL was originally created for constructing test cases without much end-user usability and API stability consideration. This design doc proposes a set of API changes for Scala and Python to make the SchemaRDD DSL API more usable and stable.
[jira] [Commented] (SPARK-2247) Data frame (or Pandas) like API for structured data
[ https://issues.apache.org/jira/browse/SPARK-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265784#comment-14265784 ] Reynold Xin commented on SPARK-2247: OK, I uploaded a design doc at https://issues.apache.org/jira/browse/SPARK-5097 Data frame (or Pandas) like API for structured data --- Key: SPARK-2247 URL: https://issues.apache.org/jira/browse/SPARK-2247 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core, SQL Affects Versions: 1.0.0 Reporter: venu k tangirala Assignee: Reynold Xin Labels: features It would be nice to have R- or Python-pandas-like data frames on Spark: 1) to be able to access the RDD data frame from Python with pandas; 2) to be able to access the RDD data frame from R; 3) to be able to access the RDD data frame from Scala's Saddle.
[jira] [Updated] (SPARK-5097) Adding data frame APIs to SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5097: --- Description: SchemaRDD, through its DSL, already provides common data frame functionalities. However, the DSL was originally created for constructing test cases without much end-user usability and API stability consideration. This design doc proposes a set of API changes for Scala and Python to make the SchemaRDD DSL API more usable and stable. was: SchemaRDD, through its DSL, already has many of the functionalities provided by common data frame operations. However, the DSL was originally created for constructing test cases without much end-user usability and API stability consideration. This design doc proposes a set of API changes for Scala and Python to make the SchemaRDD DSL API more usable and stable. Adding data frame APIs to SchemaRDD --- Key: SPARK-5097 URL: https://issues.apache.org/jira/browse/SPARK-5097 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf SchemaRDD, through its DSL, already provides common data frame functionalities. However, the DSL was originally created for constructing test cases without much end-user usability and API stability consideration. This design doc proposes a set of API changes for Scala and Python to make the SchemaRDD DSL API more usable and stable.
[jira] [Resolved] (SPARK-4843) Squash ExecutorRunnable and ExecutorRunnableUtil hierarchy in yarn module
[ https://issues.apache.org/jira/browse/SPARK-4843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4843. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3696 [https://github.com/apache/spark/pull/3696] Squash ExecutorRunnable and ExecutorRunnableUtil hierarchy in yarn module - Key: SPARK-4843 URL: https://issues.apache.org/jira/browse/SPARK-4843 Project: Spark Issue Type: Improvement Reporter: Kostas Sakellis Assignee: Kostas Sakellis Fix For: 1.3.0 ExecutorRunnableUtil is a parent of ExecutorRunnable because of the yarn-alpha and yarn-stable split. Now that yarn-alpha is gone, we can squash the unnecessary hierarchy.
[jira] [Updated] (SPARK-5098) Number of running tasks becomes negative after tasks are lost
[ https://issues.apache.org/jira/browse/SPARK-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-5098: -- Description: 15/01/06 07:26:58 ERROR TaskSchedulerImpl: Lost executor 6 on spark-worker-002.c.lofty-inn-754.internal: remote Akka client disassociated 15/01/06 07:26:58 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-worker-002.c.lofty-inn-754.internal:32852] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/01/06 07:26:58 WARN TaskSetManager: Lost task 10.2 in stage 0.0 (TID 55, spark-worker-002.c.lofty-inn-754.internal): ExecutorLostFailure (executor 6 lost) 15/01/06 07:26:58 WARN TaskSetManager: Lost task 7.2 in stage 0.0 (TID 52, spark-worker-002.c.lofty-inn-754.internal): ExecutorLostFailure (executor 6 lost) 15/01/06 07:26:58 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 6 15/01/06 07:26:58 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 6 [Stage 0:===(44 + -14) / 40] 15/01/06 07:27:10 ERROR TaskSchedulerImpl: Lost executor 2 on spark-worker-003.c.lofty-inn-754.internal: remote Akka client disassociated 15/01/06 07:27:10 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-worker-003.c.lofty-inn-754.internal:39188] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/01/06 07:27:10 WARN TaskSetManager: Lost task 16.1 in stage 0.0 (TID 60, spark-worker-003.c.lofty-inn-754.internal): ExecutorLostFailure (executor 2 lost) 15/01/06 07:27:10 WARN TaskSetManager: Lost task 12.0 in stage 0.0 (TID 12, spark-worker-003.c.lofty-inn-754.internal): ExecutorLostFailure (executor 2 lost) 15/01/06 07:27:10 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 2 15/01/06 07:27:10 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 2 [Stage 0:==(45 + -29) / 40] was: 15/01/06 07:26:58 ERROR TaskSchedulerImpl: Lost executor 6 on spark-worker-002.c.lofty-inn-754.internal: remote Akka client disassociated 15/01/06 07:26:58 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-worker-002.c.lofty-inn-754.internal:32852] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/01/06 07:26:58 WARN TaskSetManager: Lost task 10.2 in stage 0.0 (TID 55, spark-worker-002.c.lofty-inn-754.internal): ExecutorLostFailure (executor 6 lost) 15/01/06 07:26:58 WARN TaskSetManager: Lost task 7.2 in stage 0.0 (TID 52, spark-worker-002.c.lofty-inn-754.internal): ExecutorLostFailure (executor 6 lost) 15/01/06 07:26:58 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 6 15/01/06 07:26:58 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 6 [Stage 0:=(44 + -14) / 40] 15/01/06 07:27:10 ERROR TaskSchedulerImpl: Lost executor 2 on spark-worker-003.c.lofty-inn-754.internal: remote Akka client disassociated 15/01/06 07:27:10 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-worker-003.c.lofty-inn-754.internal:39188] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 
15/01/06 07:27:10 WARN TaskSetManager: Lost task 16.1 in stage 0.0 (TID 60, spark-worker-003.c.lofty-inn-754.internal): ExecutorLostFailure (executor 2 lost) 15/01/06 07:27:10 WARN TaskSetManager: Lost task 12.0 in stage 0.0 (TID 12, spark-worker-003.c.lofty-inn-754.internal): ExecutorLostFailure (executor 2 lost) 15/01/06 07:27:10 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 2 15/01/06 07:27:10 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 2 [Stage 0:=(45 + -29) / 40] Number of running tasks becomes negative after tasks are lost Key: SPARK-5098 URL: https://issues.apache.org/jira/browse/SPARK-5098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Davies Liu Priority: Critical 15/01/06 07:26:58 ERROR TaskSchedulerImpl: Lost executor 6 on spark-worker-002.c.lofty-inn-754.internal: remote Akka client disassociated 15/01/06 07:26:58 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-worker-002.c.lofty-inn-754.internal:32852] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/01/06
[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on
[ https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265749#comment-14265749 ] Aniket Bhatnagar commented on SPARK-3452: - I would like this to be revisited. The issue I am facing is that while people may not depend on some modules at compile time, they may depend on them at runtime. For example, I am building a spark server that lets users submit spark jobs using convenient REST endpoints. This used to work great, even in yarn-client mode. However, once I migrated to 1.2.0, this breaks because I can no longer add a dependency from my spark server to the spark-yarn module, which is used when submitting jobs to a YARN cluster. Maven build should skip publishing artifacts people shouldn't depend on --- Key: SPARK-3452 URL: https://issues.apache.org/jira/browse/SPARK-3452 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0, 1.1.0 Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Critical Fix For: 1.2.0 I think it's easy to do this by just adding a skip configuration somewhere. We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples.
[jira] [Created] (SPARK-5097) Adding data frame APIs to SchemaRDD
Reynold Xin created SPARK-5097: -- Summary: Adding data frame APIs to SchemaRDD Key: SPARK-5097 URL: https://issues.apache.org/jira/browse/SPARK-5097 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin SchemaRDD, through its DSL, already has many of the functionalities provided by common data frame operations. However, the DSL was originally created for constructing test cases without much end-user usability and API stability consideration. This design doc proposes a set of API changes to make the SchemaRDD DSL API more usable and stable.
[jira] [Updated] (SPARK-5097) Adding data frame APIs to SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5097: --- Attachment: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf Adding data frame APIs to SchemaRDD --- Key: SPARK-5097 URL: https://issues.apache.org/jira/browse/SPARK-5097 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf SchemaRDD, through its DSL, already has many of the functionalities provided by common data frame operations. However, the DSL was originally created for constructing test cases without much end-user usability and API stability consideration. This design doc proposes a set of API changes to make the SchemaRDD DSL API more usable and stable.
[jira] [Created] (SPARK-5098) Number of running tasks becomes negative after tasks are lost
Davies Liu created SPARK-5098: - Summary: Number of running tasks becomes negative after tasks are lost Key: SPARK-5098 URL: https://issues.apache.org/jira/browse/SPARK-5098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Davies Liu Priority: Critical 15/01/06 07:26:58 ERROR TaskSchedulerImpl: Lost executor 6 on spark-worker-002.c.lofty-inn-754.internal: remote Akka client disassociated 15/01/06 07:26:58 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-worker-002.c.lofty-inn-754.internal:32852] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/01/06 07:26:58 WARN TaskSetManager: Lost task 10.2 in stage 0.0 (TID 55, spark-worker-002.c.lofty-inn-754.internal): ExecutorLostFailure (executor 6 lost) 15/01/06 07:26:58 WARN TaskSetManager: Lost task 7.2 in stage 0.0 (TID 52, spark-worker-002.c.lofty-inn-754.internal): ExecutorLostFailure (executor 6 lost) 15/01/06 07:26:58 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 6 15/01/06 07:26:58 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 6 [Stage 0:=(44 + -14) / 40] 15/01/06 07:27:10 ERROR TaskSchedulerImpl: Lost executor 2 on spark-worker-003.c.lofty-inn-754.internal: remote Akka client disassociated 15/01/06 07:27:10 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-worker-003.c.lofty-inn-754.internal:39188] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/01/06 07:27:10 WARN TaskSetManager: Lost task 16.1 in stage 0.0 (TID 60, spark-worker-003.c.lofty-inn-754.internal): ExecutorLostFailure (executor 2 lost) 15/01/06 07:27:10 WARN TaskSetManager: Lost task 12.0 in stage 0.0 (TID 12, spark-worker-003.c.lofty-inn-754.internal): ExecutorLostFailure (executor 2 lost) 15/01/06 07:27:10 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 2 15/01/06 07:27:10 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 2 [Stage 0:=(45 + -29) / 40]
[jira] [Commented] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode
[ https://issues.apache.org/jira/browse/SPARK-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265796#comment-14265796 ] Timothy Chen commented on SPARK-5095: - Instead of configuring the number of executors to launch per slave, I think it's more ideal to configure the amount of cpu/mem per executor. My current thoughts for an implementation are to introduce two more configs: spark.mesos.coarse.executors.max -- the maximum number of executors launched per slave (applies to coarse-grained mode); spark.mesos.coarse.cores.max -- the maximum number of cpus to use per executor. Memory is already configurable through spark.executor.memory. With these, you can choose to launch two executors by specifying two max executors and also capping the max cpus to half the amount. These configurations can also fix SPARK-4940. Support launching multiple mesos executors in coarse grained mesos mode --- Key: SPARK-5095 URL: https://issues.apache.org/jira/browse/SPARK-5095 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Currently in coarse-grained mesos mode, it's expected that we only launch one Mesos executor that launches one JVM process to launch multiple spark executors. However, this becomes a problem when the launched JVM process is larger than an ideal size (30GB is the recommended value from Databricks), which causes the GC problems reported on the mailing list. We should support launching multiple executors when large enough resources are available for Spark to use, and these resources are still under the configured limit. This is also applicable when users want to specify the number of executors to be launched on each node.
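For illustration, here is how the proposed settings might look from an application. Note that the two spark.mesos.coarse.* keys are only *proposed* in the comment above and do not exist in Spark; spark.executor.memory is a real option:
{code}
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.mesos.coarse.executors.max", "2")  # proposed, hypothetical
        .set("spark.mesos.coarse.cores.max", "4")      # proposed, hypothetical
        .set("spark.executor.memory", "8g"))           # existing option
{code}
With two executors per slave, each capped at half the cores, an 8-core slave would run two 4-core JVMs instead of one 8-core JVM, which is the GC-friendly split the comment is after.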
[jira] [Commented] (SPARK-4940) Support more evenly distributing cores for Mesos mode
[ https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265802#comment-14265802 ] Timothy Chen commented on SPARK-4940: - So I assume you're specifying coarse-grained mode, right? And how are the streaming consumers launched? I know that on the scheduler side it is launching spark executors/drivers, and we simply launch one spark executor per slave that runs multiple spark tasks. My assumption was that it is the amount of resources allocated to each slave's executor that is disproportionate. Support more evenly distributing cores for Mesos mode - Key: SPARK-4940 URL: https://issues.apache.org/jira/browse/SPARK-4940 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Currently in coarse-grained mode the spark scheduler simply takes all the resources it can on each node, which can cause uneven distribution based on the resources available on each slave.
[jira] [Commented] (SPARK-4960) Interceptor pattern in receivers
[ https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1426#comment-1426 ] Tathagata Das commented on SPARK-4960: -- Both [~c...@koeninger.org]'s and [~jerryshao]'s points are valid. And I looked at the document; it's quite a good proposal. However, there are still some corner cases that are very confusing. What happens when the user accidentally tries to do something like this:
val input = ssc.socketStream(...)
val intercepted = input.interceptor(...)
and then actually uses `input` for further processing? Since the `input` stream gets deregistered, there will not be any data. To avoid this kind of situation, here is a more limited idea. For the generic interceptor pattern applicable to all receivers, let's assume that the function can be of the form T => Iterator[T]. This eliminates the need for changing data types, and probably addresses corner cases like the one I raised. For M Interceptor pattern in receivers Key: SPARK-4960 URL: https://issues.apache.org/jira/browse/SPARK-4960 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Sometimes it is good to intercept a message received through a receiver and modify / do something with the message before it is stored into Spark. This is often referred to as the interceptor pattern. There should be a general way to specify an interceptor function that gets applied to all receivers.
[jira] [Commented] (SPARK-4960) Interceptor pattern in receivers
[ https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264450#comment-14264450 ] Tathagata Das commented on SPARK-4960: -- [~ted.m] This interceptor pattern discussion should be of interest to you! Interceptor pattern in receivers Key: SPARK-4960 URL: https://issues.apache.org/jira/browse/SPARK-4960 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Sometimes it is good to intercept a message received through a receiver and modify / do something with the message before it is stored into Spark. This is often referred to as the interceptor pattern. There should be a general way to specify an interceptor function that gets applied to all receivers.
[jira] [Comment Edited] (SPARK-4960) Interceptor pattern in receivers
[ https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1426#comment-1426 ] Tathagata Das edited comment on SPARK-4960 at 1/5/15 10:15 AM: --- Both [~c...@koeninger.org]'s and [~jerryshao]'s points are valid. And I looked at the document; it's quite a good proposal. However, there are still some corner cases that are very confusing. What happens when the user accidentally tries to do something like this:
val input = ssc.socketStream(...)
val intercepted = input.interceptor(...)
and then actually uses `input` for further processing? Since the `input` stream gets deregistered, there will not be any data. To avoid this kind of situation, here is a more limited idea. For the generic interceptor pattern applicable to all receivers, let's assume that the function can be of the form T => Iterator[T]. This eliminates the need for changing data types, and probably addresses corner cases like the one I raised. We can leave the general case of T => Iterator[M] for users to implement in their own receivers, in the same way as [~jerryshao] has suggested in his doc. That is quite hacky (type-casting Receiver[T] to Receiver[M]), so it's probably not the best to have that available by default with Spark Streaming. was (Author: tdas): Both [~c...@koeninger.org]'s and [~jerryshao]'s points are valid. And I looked at the document; it's quite a good proposal. However, there are still some corner cases that are very confusing. What happens when the user accidentally tries to do something like this:
val input = ssc.socketStream(...)
val intercepted = input.interceptor(...)
and then actually uses `input` for further processing? Since the `input` stream gets deregistered, there will not be any data. To avoid this kind of situation, here is a more limited idea. For the generic interceptor pattern applicable to all receivers, let's assume that the function can be of the form T => Iterator[T]. This eliminates the need for changing data types, and probably addresses corner cases like the one I raised. For M Interceptor pattern in receivers Key: SPARK-4960 URL: https://issues.apache.org/jira/browse/SPARK-4960 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Sometimes it is good to intercept a message received through a receiver and modify / do something with the message before it is stored into Spark. This is often referred to as the interceptor pattern. There should be a general way to specify an interceptor function that gets applied to all receivers.
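The restricted form TDas proposes (a function of type T => Iterator[T]) is easy to model outside Spark; a minimal Python sketch, independent of the actual receiver API:
{code}
def apply_interceptor(records, intercept):
    """Apply an interceptor of the restricted form T => Iterator[T]:
    each record may be dropped (empty iterator), kept, or expanded,
    but its element type never changes -- which is the point of the
    restriction."""
    for record in records:
        for out in intercept(record):
            yield out

# Example: drop empty lines, split the rest into words.
words = list(apply_interceptor(["a b", "", "c"],
                               lambda s: iter(s.split())))
print(words)  # ['a', 'b', 'c']
{code}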
[jira] [Commented] (SPARK-4908) Spark SQL built for Hive 13 fails under concurrent metadata queries
[ https://issues.apache.org/jira/browse/SPARK-4908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264469#comment-14264469 ] Cheng Lian commented on SPARK-4908: --- Would like to add a comment about the root cause of this issue. When serving a HiveQL query, Spark SQL's {{HiveContext.runHive}} method gets an {{org.apache.hadoop.hive.ql.Driver}} instance via {{CommandProcessorFactory.get}}, which creates and caches {{Driver}} instances. In the case of {{HiveThriftServer2}}, {{HiveContext.runHive}} is called by multiple threads owned by a threaded executor of the Thrift server. However, {{Driver}} is not thread safe, and the cached {{Driver}} instance can be accessed by multiple threads, which causes the problem. PR #3834 fixes this issue by synchronizing {{HiveContext.runHive}}, which is valid. On the other hand, HiveServer2 actually creates a new {{Driver}} instance for every served SQL query when initializing a {{SQLOperation}}. [~dyross] When built against Hive 0.12.0, Spark SQL 1.2.0 also suffers from this issue. The snippet doesn't show this because the Hive 0.12.0 JDBC driver doesn't execute a {{USE db}} statement to switch the current database even if the JDBC connection URL specifies a database name. If you replace the lines in the {{try}} block with:
{code}
val conn = DriverManager.getConnection(url)
val stmt = conn.createStatement()
stmt.execute("use hello;")
stmt.close()
println("Finished: " + i)
{code}
you'll see exactly the same exceptions. Spark SQL built for Hive 13 fails under concurrent metadata queries --- Key: SPARK-4908 URL: https://issues.apache.org/jira/browse/SPARK-4908 Project: Spark Issue Type: Bug Components: SQL Reporter: David Ross Assignee: Cheng Lian Priority: Blocker Fix For: 1.3.0, 1.2.1 We are on trunk: {{1.3.0-SNAPSHOT}}, as of this commit: https://github.com/apache/spark/commit/3d0c37b8118f6057a663f959321a79b8061132b6 We are using Spark built for Hive 13, using this option: {{-Phive-0.13.1}} In single-threaded mode, normal operations look fine. However, under concurrency, with at least 2 concurrent connections, metadata queries fail. For example, {{USE some_db}}, {{SHOW TABLES}}, and the implicit {{USE}} statement when you pass a default schema in the JDBC URL, all fail. {{SELECT}} queries like {{SELECT * FROM some_table}} do not have this issue. Here is some example code:
{code}
object main extends App {
  import java.sql._
  import scala.concurrent._
  import scala.concurrent.duration._
  import scala.concurrent.ExecutionContext.Implicits.global

  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val host = "localhost" // update this
  val url = s"jdbc:hive2://${host}:10511/some_db" // update this
  val future = Future.traverse(1 to 3) { i => Future {
    println("Starting: " + i)
    try {
      val conn = DriverManager.getConnection(url)
    } catch {
      case e: Throwable =>
        e.printStackTrace()
        println("Failed: " + i)
    }
    println("Finishing: " + i)
  }}
  Await.result(future, 2.minutes)
  println("done!")
}
{code}
Here is the output:
{code}
Starting: 1 Starting: 3 Starting: 2 java.sql.SQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation cancelled at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121) at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109) at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231) at org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451) at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:195) at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:270) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at
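The race Cheng Lian describes, and the synchronization fix, can be modeled in a few lines. This is a Python analogue for illustration only; the real fix in PR #3834 synchronizes the Scala method HiveContext.runHive, and FakeDriver below is a made-up stand-in:
{code}
import threading

class FakeDriver:
    """Stand-in for Hive's non-thread-safe Driver (illustrative only)."""
    def __init__(self):
        self.current_db = "default"

    def run(self, sql):
        if sql.upper().startswith("USE "):
            self.current_db = sql.split()[1]
        return self.current_db

_cached_driver = FakeDriver()    # one cached instance shared by all threads
_driver_lock = threading.Lock()

def run_hive(sql):
    # Serializing access prevents concurrent Thrift-server threads from
    # mutating the shared driver's state mid-query.
    with _driver_lock:
        return _cached_driver.run(sql)
{code}
The alternative Cheng Lian points out (what HiveServer2 does) is to construct a fresh driver per query instead of sharing a cached one.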
[jira] [Updated] (SPARK-5068) When the path is not found in HDFS, we can't get the result
[ https://issues.apache.org/jira/browse/SPARK-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jeanlyn updated SPARK-5068: --- Fix Version/s: (was: 1.2.1) When the path is not found in HDFS, we can't get the result --- Key: SPARK-5068 URL: https://issues.apache.org/jira/browse/SPARK-5068 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: jeanlyn When a partition path is found in the metastore but not in HDFS, it causes problems, as follows:
{noformat}
hive> show partitions partition_test;
OK
dt=1
dt=2
dt=3
dt=4
Time taken: 0.168 seconds, Fetched: 4 row(s)
{noformat}
{noformat}
hive> dfs -ls /user/jeanlyn/warehouse/partition_test;
Found 3 items
drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=1
drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=3
drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 17:42 /user/jeanlyn/warehouse/partition_test/dt=4
{noformat}
When I run the SQL
{noformat}
select * from partition_test limit 10
{noformat}
in *hive*, I get no problem, but when I run it in *spark-sql* I get the following error:
{noformat}
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2 at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328) at org.apache.spark.rdd.RDD.collect(RDD.scala:780) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444) at org.apache.spark.sql.hive.testpartition$.main(test.scala:23) at org.apache.spark.sql.hive.testpartition.main(test.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
{noformat}
[jira] [Commented] (SPARK-5066) Cannot get all keys with the same hashcode when reading keys ordered from different streams.
[ https://issues.apache.org/jira/browse/SPARK-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264508#comment-14264508 ] Sean Owen commented on SPARK-5066: -- I'm not clear what this issue is trying to report. This is code from {{ExternalAppendOnlyMap}}, right? The javadoc says:
{code}
 * Fill a buffer with the next set of keys with the same hash code from a given iterator. We
 * read streams one hash code at a time to ensure we don't miss elements when they are merged.
 *
 * Assumes the given iterator is in sorted order of hash code.
{code}
The behavior and code you describe seem correct then. k4 and k5 would be read from the stream for file 2 first, since they have the lowest hashes. Next, k1 would be read from both files. Where are you saying that this breaks down? Cannot get all keys with the same hashcode when reading keys ordered from different streams. --- Key: SPARK-5066 URL: https://issues.apache.org/jira/browse/SPARK-5066 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: DoingDone9 Priority: Critical When spilling is enabled, data ordered by hash code will be spilled to disk. We need to get all keys that have the same hash code from the different tmp files when merging values, but it just reads the keys that have the minimum hash code within a tmp file, so we cannot read all keys. Example: if file1 has [k1, k2, k3] and file2 has [k4, k5, k1], and hashcode of k4 < hashcode of k5 < hashcode of k1 < hashcode of k2 < hashcode of k3, we just read k1 from file1 and k4 from file2. We cannot read all k1 entries. Code:
{code}
private val inputStreams = (Seq(sortedMap) ++ spilledMaps).map(it => it.buffered)

inputStreams.foreach { it =>
  val kcPairs = new ArrayBuffer[(K, C)]
  readNextHashCode(it, kcPairs)
  if (kcPairs.length > 0) {
    mergeHeap.enqueue(new StreamBuffer(it, kcPairs))
  }
}

private def readNextHashCode(it: BufferedIterator[(K, C)], buf: ArrayBuffer[(K, C)]): Unit = {
  if (it.hasNext) {
    var kc = it.next()
    buf += kc
    val minHash = hashKey(kc)
    while (it.hasNext && it.head._1.hashCode() == minHash) {
      kc = it.next()
      buf += kc
    }
  }
}
{code}
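Sean's reading can be checked with a small, self-contained model of the merge. This is not Spark's code, just the same idea with integers standing in for hash codes:
{code}
import heapq
from itertools import groupby

def merge_by_hash(*streams):
    """Each stream yields (hash, key) pairs sorted by hash. Pairs are
    merged across streams and grouped one hash code at a time."""
    merged = heapq.merge(*streams, key=lambda kv: kv[0])
    for h, group in groupby(merged, key=lambda kv: kv[0]):
        yield h, [k for _, k in group]

# file1 = [k1, k2, k3], file2 = [k4, k5, k1], with
# hash(k4) < hash(k5) < hash(k1) < hash(k2) < hash(k3)
file1 = [(3, "k1"), (4, "k2"), (5, "k3")]
file2 = [(1, "k4"), (2, "k5"), (3, "k1")]
for h, keys in merge_by_hash(iter(file1), iter(file2)):
    print(h, keys)
# 1 ['k4']
# 2 ['k5']
# 3 ['k1', 'k1']   <- both k1 entries surface together
# 4 ['k2']
# 5 ['k3']
{code}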
[jira] [Resolved] (SPARK-5082) Minor typo in the Tuning Spark document about Data Serialization
[ https://issues.apache.org/jira/browse/SPARK-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5082. -- Resolution: Not a Problem This was already fixed by https://github.com/apache/spark/commit/4da1039840182e8e8bc836b89cda7b77fe7356d9 Minor typo in the Tuning Spark document about Data Serialization Key: SPARK-5082 URL: https://issues.apache.org/jira/browse/SPARK-5082 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.2.0 Reporter: yangping wu Priority: Trivial Original Estimate: 8h Remaining Estimate: 8h The latest documentation for *Tuning Spark* (http://spark.apache.org/docs/latest/tuning.html) has an erroneous entry in the *Data Serialization* section:
{code}
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Seq(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
{code}
The *registerKryoClasses* method's parameter type should be *Array[Class[_]]*, not *Seq[Class[_]]*. The correct snippet is:
{code}
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
{code}
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264527#comment-14264527 ] Sean Owen commented on SPARK-1529: -- OK, but, why do these files *have* to go on a non-local disk? It sounds like you're saying Spark doesn't work at all on MapR right now, but that can't be the case. They *can* go on a non-local disk, I'm sure. What's the value of that, given that Spark is transporting the files itself? Still, as you say, this proprietary setup works already through the java.io+NFS and HDFS APIs, with no change. If it's just not as fast, is that a problem that Spark should be solving? Just don't do it. Or it's up to the vendor to optimize. Support setting spark.local.dirs to a hadoop FileSystem Key: SPARK-1529 URL: https://issues.apache.org/jira/browse/SPARK-1529 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Cheng Lian In some environments, like with MapR, local volumes are accessed through the Hadoop filesystem interface. We should allow setting spark.local.dir to a Hadoop filesystem location.
[jira] [Commented] (SPARK-5073) spark.storage.memoryMapThreshold has two default values
[ https://issues.apache.org/jira/browse/SPARK-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264535#comment-14264535 ] Sean Owen commented on SPARK-5073: -- The documentation suggests that 8192 is the intended default, and that's consistent with the javadoc that says this should be a limit near the OS page size, which is indeed, I think, 4KB on modern OSes (?). So open a PR to fix the default in {{TransportConf}}? spark.storage.memoryMapThreshold has two default values - Key: SPARK-5073 URL: https://issues.apache.org/jira/browse/SPARK-5073 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Yuan Jianhui Priority: Minor In org.apache.spark.storage.DiskStore:
{code}
val minMemoryMapBytes = blockManager.conf.getLong("spark.storage.memoryMapThreshold", 2 * 4096L)
{code}
In org.apache.spark.network.util.TransportConf:
{code}
public int memoryMapBytes() {
  return conf.getInt("spark.storage.memoryMapThreshold", 2 * 1024 * 1024);
}
{code}
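Until the two defaults are reconciled, an application can sidestep the ambiguity by setting the property explicitly. A sketch; the 8192 value follows the documented default mentioned above:
{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("memory-map-threshold-example")
        # Pin the threshold so DiskStore and TransportConf agree.
        .set("spark.storage.memoryMapThreshold", "8192"))
sc = SparkContext(conf=conf)
{code}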
[jira] [Updated] (SPARK-5009) allCaseVersions function in SqlLexical leads to StackOverflow Exception
[ https://issues.apache.org/jira/browse/SPARK-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shengli updated SPARK-5009: --- Description: Recently I found a bug while adding a new feature to SqlParser, which is: if I define a Keyword that has a long name, like
{code}
protected val SERDEPROPERTIES = Keyword("SERDEPROPERTIES")
{code}
then, since the all-case-versions enumeration is implemented by a recursive function, when the implicit {{asParser}} function is called and stack memory is very small, it leads to a StackOverflowError:
{code}
java.lang.StackOverflowError at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
{code}
was: Recently I found a bug while adding a new feature to SqlParser, which is: if I define a Keyword that has a long name, like {{protected val SERDEPROPERTIES = Keyword("SERDEPROPERTIES")}}, then, since the all-case-versions enumeration is implemented by a recursive function, when the implicit {{asParser}} function is called and stack memory is very small, it leads to a StackOverflowError. allCaseVersions function in SqlLexical leads to StackOverflow Exception - Key: SPARK-5009 URL: https://issues.apache.org/jira/browse/SPARK-5009 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1, 1.2.0 Reporter: shengli Fix For: 1.3.0, 1.2.1 Original Estimate: 96h Remaining Estimate: 96h Recently I found a bug while adding a new feature to SqlParser, which is: if I define a Keyword that has a long name, like {{protected val SERDEPROPERTIES = Keyword("SERDEPROPERTIES")}}, then, since the all-case-versions enumeration is implemented by a recursive function, when the implicit {{asParser}} function is called and stack memory is very small, it leads to a StackOverflowError:
{code}
java.lang.StackOverflowError at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
{code}
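For what it's worth, the case-version enumeration itself does not need recursion. A small Python sketch of an iterative equivalent (not Spark's code) also shows why long keywords are expensive:
{code}
from itertools import product

def all_case_versions(word):
    # Enumerate every upper/lower-case variant of `word` iteratively,
    # so stack depth stays constant regardless of keyword length.
    for letters in product(*[(c.lower(), c.upper()) for c in word]):
        yield "".join(letters)

versions = list(all_case_versions("SERDEPROPERTIES"))
print(len(versions))  # 2 ** 15 = 32768 variants for a 15-letter keyword
{code}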
[jira] [Commented] (SPARK-5073) spark.storage.memoryMapThreshold has two default values
[ https://issues.apache.org/jira/browse/SPARK-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264554#comment-14264554 ] Apache Spark commented on SPARK-5073: - User 'Lewuathe' has created a pull request for this issue: https://github.com/apache/spark/pull/3900 spark.storage.memoryMapThreshold has two default values - Key: SPARK-5073 URL: https://issues.apache.org/jira/browse/SPARK-5073 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Yuan Jianhui Priority: Minor In org.apache.spark.storage.DiskStore:
{code}
val minMemoryMapBytes = blockManager.conf.getLong("spark.storage.memoryMapThreshold", 2 * 4096L)
{code}
In org.apache.spark.network.util.TransportConf:
{code}
public int memoryMapBytes() {
  return conf.getInt("spark.storage.memoryMapThreshold", 2 * 1024 * 1024);
}
{code}
[jira] [Commented] (SPARK-4940) Document or Support more evenly distributing cores for Mesos mode
[ https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264579#comment-14264579 ] Gerard Maas commented on SPARK-4940: From the perspective of evenly allocating Spark Streaming consumers (network-bound), the ideal solution would be to explicitly set the number of hosts. With the current resource allocation policy, we can have e.g. (4),(1),(1) consumers over 3 hosts, instead of the ideal (2),(2),(2). Given that the resource allocation is dynamic at job startup time, this results in variable performance characteristics for the job being submitted. In practice, we have been restarting the job (using Marathon) until we get a favorable resource allocation. Not sure how well the requirement of a fixed number of executors would fit with the node transparency offered by Mesos. I'm just trying to elaborate on the requirements from the Spark Streaming job perspective. Document or Support more evenly distributing cores for Mesos mode - Key: SPARK-4940 URL: https://issues.apache.org/jira/browse/SPARK-4940 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Currently in coarse-grained mode the spark scheduler simply takes all the resources it can on each node, which can cause uneven distribution based on the resources available on each slave.
[jira] [Commented] (SPARK-5073) spark.storage.memoryMapThreshold has two default values
[ https://issues.apache.org/jira/browse/SPARK-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264558#comment-14264558 ] Kai Sasaki commented on SPARK-5073: --- I did not notice the above comment. Sorry, I've just created a PR for this issue. spark.storage.memoryMapThreshold has two default values - Key: SPARK-5073 URL: https://issues.apache.org/jira/browse/SPARK-5073 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Yuan Jianhui Priority: Minor In org.apache.spark.storage.DiskStore:
{code}
val minMemoryMapBytes = blockManager.conf.getLong("spark.storage.memoryMapThreshold", 2 * 4096L)
{code}
In org.apache.spark.network.util.TransportConf:
{code}
public int memoryMapBytes() {
  return conf.getInt("spark.storage.memoryMapThreshold", 2 * 1024 * 1024);
}
{code}
[jira] [Commented] (SPARK-4897) Python 3 support
[ https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264643#comment-14264643 ] Matthew Cornell commented on SPARK-4897: Please!! Python 3 support Key: SPARK-4897 URL: https://issues.apache.org/jira/browse/SPARK-4897 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Josh Rosen Priority: Minor It would be nice to have Python 3 support in PySpark, provided that we can do it in a way that maintains backwards-compatibility with Python 2.6. I started looking into porting this; my WIP work can be found at https://github.com/JoshRosen/spark/compare/python3 I was able to use the [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] tool to handle the basic conversion of things like {{print}} statements, etc., and had to manually fix up a few imports for packages that moved / were renamed, but the major blocker that I hit was {{cloudpickle}}:
{code}
[joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark
Python 3.4.2 (default, Oct 19 2014, 17:52:17)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/Users/joshrosen/Documents/Spark/python/pyspark/shell.py", line 28, in <module>
    import pyspark
  File "/Users/joshrosen/Documents/spark/python/pyspark/__init__.py", line 41, in <module>
    from pyspark.context import SparkContext
  File "/Users/joshrosen/Documents/spark/python/pyspark/context.py", line 26, in <module>
    from pyspark import accumulators
  File "/Users/joshrosen/Documents/spark/python/pyspark/accumulators.py", line 97, in <module>
    from pyspark.cloudpickle import CloudPickler
  File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 120, in <module>
    class CloudPickler(pickle.Pickler):
  File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 122, in CloudPickler
    dispatch = pickle.Pickler.dispatch.copy()
AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch'
{code}
This code looks like it will be difficult to port to Python 3, so this might be a good reason to switch to [Dill|https://github.com/uqfoundation/dill] for Python serialization.
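The {{dispatch}} failure is easy to reproduce outside Spark: in CPython 3, {{pickle.Pickler}} is the C-accelerated {{_pickle.Pickler}}, which has no class-level dispatch table, while the pure-Python implementation still does. A quick check (behavior as of CPython 3.4, per the traceback above):
{code}
import pickle

print(hasattr(pickle.Pickler, "dispatch"))   # False: C implementation
print(hasattr(pickle._Pickler, "dispatch"))  # True: pure-Python fallback
{code}
So cloudpickle's {{pickle.Pickler.dispatch.copy()}} would have to target the pure-Python pickler or be rewritten, which is part of why switching serializers is being floated.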
[jira] [Updated] (SPARK-5085) netty shuffle service causing connection timeouts
[ https://issues.apache.org/jira/browse/SPARK-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Haberman updated SPARK-5085: Description: In Spark 1.2.0, the netty backend is causing our report's cluster to lock up with connection timeouts, ~75% of the way through the job. It happens with both the external shuffle server and the non-external/executor-hosted shuffle server, but if I change the shuffle service from netty to nio, it immediately works. Here's log output from one executor (I turned on trace output for the network package and ShuffleBlockFetcherIterator; all executors in the cluster have basically the same pattern of ~15m of silence then timeouts): {code} // lots of log output, doing fine... 15/01/03 05:33:39 TRACE [shuffle-server-0] protocol.MessageDecoder (MessageDecoder.java:decode(42)) - Received message ChunkFetchRequest: ChunkFetchRequest{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}} 15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:processFetchRequest(107)) - Received req from /10.169.175.179:57056 to fetch block StreamChunkId{streamId=1465867812750, chunkIndex=170} 15/01/03 05:33:39 TRACE [shuffle-server-0] server.OneForOneStreamManager (OneForOneStreamManager.java:getChunk(75)) - Removing stream id 1465867812750 15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(152)) - Sent result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}, buffer=FileSegmentManagedBuffer{file=/mnt1/spark/local/spark-local-20150103040327-c554/28/shuffle_4_1723_0.data, offset=4574685, length=20939}} to client /10.169.175.179:57056 // note 15m of silence here... 
15/01/03 05:48:13 WARN [shuffle-server-7] server.TransportChannelHandler (TransportChannelHandler.java:exceptionCaught(66)) - Exception in connection from /10.33.166.218:42780 java.io.IOException: Connection timed out at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:745) 15/01/03 05:48:13 ERROR [shuffle-server-7] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(154)) - Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812408, chunkIndex=52}, buffer=FileSegmentManagedBuffer{file=/mnt1/spark/local/spark-local-20150103040327-c554/2d/shuffle_4_520_0.data, offset=2214139, length=20607}} to /10.33.166.218:42780; closing connection java.nio.channels.ClosedChannelException 15/01/03 05:48:13 ERROR [shuffle-server-7] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(154)) - Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812408, chunkIndex=53}, buffer=FileSegmentManagedBuffer{file=/mnt1/spark/local/spark-local-20150103040327-c554/10/shuffle_4_524_0.data, offset=2215548, length=23998}} to /10.33.166.218:42780; closing connection java.nio.channels.ClosedChannelException 15/01/03 05:48:13 ERROR [shuffle-server-7] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(154)) - Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812408, chunkIndex=54}, buffer=FileSegmentManagedBuffer{file=/mnt/spark/local/spark-local-20150103040327-4f92/32/shuffle_4_532_0.data, offset=2248230, length=20580}} to /10.33.166.218:42780; closing connection java.nio.channels.ClosedChannelException // lots more of these... {code} Note how, up through 5:33, everything was fine, then after ~15 minutes of silence, at 5:48, the shuffle-server connection times out, and all of that server-7's requests fail. Here is shuffle-server-1 from the same stdout (with 1 last ClosedChannelException
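The reporter's workaround (falling back from netty to nio) is a one-line configuration change. A sketch, assuming the Spark 1.2-era key spark.shuffle.blockTransferService (removed in later releases):
{code}
from pyspark import SparkConf

conf = (SparkConf()
        # Work around the timeouts by reverting to the NIO-based
        # transfer service while the netty issue is investigated.
        .set("spark.shuffle.blockTransferService", "nio"))
{code}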
[jira] [Updated] (SPARK-5089) Vector conversion broken for non-float64 arrays
[ https://issues.apache.org/jira/browse/SPARK-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-5089: -- Description: Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to {{DenseVectors}}. If the data are numpy arrays with dtype {{float64}} this works. If data are numpy arrays with lower precision (e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but due to a small bug in this line this currently doesn't happen (the cast is not in place):
{code:python}
if ar.dtype != np.float64:
    ar.astype(np.float64)
{code}
Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results:
{code:python}
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
5 # should be 10!
{code}
But this works fine:
{code:python}
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
10 # this is correct
{code}
The fix is trivial, I'll submit a PR shortly. was: Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to `DenseVectors`. If the data are numpy arrays with dtype `float64` this works. If data are numpy arrays with lower precision (e.g. `float16` or `float32`), they should be upcast to `float64`, but due to a small bug in this line this currently doesn't happen (the cast is not in place). `` if ar.dtype != np.float64: ar.astype(np.float64) `` Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: ``` from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) 5 # should be 10! ``` But this works fine: ``` data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) 10 # this is correct ``` The fix is trivial, I'll submit a PR shortly. Vector conversion broken for non-float64 arrays --- Key: SPARK-5089 URL: https://issues.apache.org/jira/browse/SPARK-5089 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Jeremy Freeman Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to {{DenseVectors}}. If the data are numpy arrays with dtype {{float64}} this works. If data are numpy arrays with lower precision (e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but due to a small bug in this line this currently doesn't happen (the cast is not in place):
{code:python}
if ar.dtype != np.float64:
    ar.astype(np.float64)
{code}
Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results:
{code:python}
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
5 # should be 10!
{code}
But this works fine:
{code:python}
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
10 # this is correct
{code}
The fix is trivial, I'll submit a PR shortly.
[jira] [Updated] (SPARK-5089) Vector conversion broken for non-float64 arrays
[ https://issues.apache.org/jira/browse/SPARK-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-5089: -- Description: Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to {{DenseVectors}}. If the data are numpy arrays with dtype {{float64}} this works. If data are numpy arrays with lower precision (e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but due to a small bug in this line this currently doesn't happen (casting is not inplace). {code:none} if ar.dtype != np.float64: ar.astype(np.float64) {code} Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: {code:none} from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) 5 # should be 10! {code} But this works fine: {code:none} data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) 10 # this is correct {code} The fix is trivial, I'll submit a PR shortly. was: Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to {{DenseVectors}}. If the data are numpy arrays with dtype {{float64}} this works. If data are numpy arrays with lower precision (e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but due to a small bug in this line this currently doesn't happen (casting is not inplace). {code:python} if ar.dtype != np.float64: ar.astype(np.float64) {code} Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: {code:python} from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) 5 # should be 10! {code} But this works fine: {code:python} data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) 10 # this is correct {code} The fix is trivial, I'll submit a PR shortly. Vector conversion broken for non-float64 arrays --- Key: SPARK-5089 URL: https://issues.apache.org/jira/browse/SPARK-5089 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Jeremy Freeman Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to {{DenseVectors}}. If the data are numpy arrays with dtype {{float64}} this works. If data are numpy arrays with lower precision (e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but due to a small bug in this line this currently doesn't happen (casting is not inplace). {code:none} if ar.dtype != np.float64: ar.astype(np.float64) {code} Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: {code:none} from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) 5 # should be 10! 
{code} But this works fine: {code:none} data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) 10 # this is correct {code} The fix is trivial, I'll submit a PR shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5089) Vector conversion broken for non-float64 arrays
Jeremy Freeman created SPARK-5089: - Summary: Vector conversion broken for non-float64 arrays Key: SPARK-5089 URL: https://issues.apache.org/jira/browse/SPARK-5089 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Jeremy Freeman Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to `DenseVectors`. If the data are numpy arrays with dtype `float64` this works. If data are numpy arrays with lower precision (e.g. `float16` or `float32`), they should be upcast to `float64`, but due to a small bug in this line this currently doesn't happen (casting is not inplace). ``` if ar.dtype != np.float64: ar.astype(np.float64) ``` Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: ``` from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) 5 # should be 10! ``` But this works fine: ``` data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) 10 # this is correct ``` The fix is trivial, I'll submit a PR shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5089) Vector conversion broken for non-float64 arrays
[ https://issues.apache.org/jira/browse/SPARK-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-5089: -- Description: Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to `DenseVectors`. If the data are numpy arrays with dtype `float64` this works. If data are numpy arrays with lower precision (e.g. `float16` or `float32`), they should be upcast to `float64`, but due to a small bug in this line this currently doesn't happen (casting is not inplace). `` if ar.dtype != np.float64: ar.astype(np.float64) `` Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: ``` from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) 5 # should be 10! ``` But this works fine: ``` data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) 10 # this is correct ``` The fix is trivial, I'll submit a PR shortly. was: Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to `DenseVectors`. If the data are numpy arrays with dtype `float64` this works. If data are numpy arrays with lower precision (e.g. `float16` or `float32`), they should be upcast to `float64`, but due to a small bug in this line this currently doesn't happen (casting is not inplace). ``` if ar.dtype != np.float64: ar.astype(np.float64) ``` Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: ``` from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) 5 # should be 10! ``` But this works fine: ``` data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) 10 # this is correct ``` The fix is trivial, I'll submit a PR shortly. Vector conversion broken for non-float64 arrays --- Key: SPARK-5089 URL: https://issues.apache.org/jira/browse/SPARK-5089 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Jeremy Freeman Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to `DenseVectors`. If the data are numpy arrays with dtype `float64` this works. If data are numpy arrays with lower precision (e.g. `float16` or `float32`), they should be upcast to `float64`, but due to a small bug in this line this currently doesn't happen (casting is not inplace). `` if ar.dtype != np.float64: ar.astype(np.float64) `` Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: ``` from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) 5 # should be 10! ``` But this works fine: ``` data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) 10 # this is correct ``` The fix is trivial, I'll submit a PR shortly. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
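For reference, the fix implied by the descriptions above is simply to assign the result of {{astype}}: numpy's {{astype}} returns an upcast copy rather than casting in place, so the unassigned call is a no-op. A minimal sketch of the broken and corrected lines (the surrounding PySpark conversion code is omitted):

{code:python}
import numpy as np

ar = np.random.randn(100, 10).astype('float32')

# buggy: astype returns a new array, so the result is silently discarded
if ar.dtype != np.float64:
    ar.astype(np.float64)
assert ar.dtype == np.float32  # still float32

# fixed: rebind the name to the upcast copy
if ar.dtype != np.float64:
    ar = ar.astype(np.float64)
assert ar.dtype == np.float64
{code}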
[jira] [Commented] (SPARK-4898) Replace cloudpickle with Dill
[ https://issues.apache.org/jira/browse/SPARK-4898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264801#comment-14264801 ] Nicholas Chammas commented on SPARK-4898: - cc [~davies] Replace cloudpickle with Dill - Key: SPARK-4898 URL: https://issues.apache.org/jira/browse/SPARK-4898 Project: Spark Issue Type: Bug Components: PySpark Reporter: Josh Rosen We should consider replacing our modified version of {{cloudpickle}} with [Dill|https://github.com/uqfoundation/dill], since it supports both Python 2 and 3 and might do a better job of handling certain corner-cases. I attempted to do this a few months ago but ran into cases where Dill had issues pickling objects defined in doctests, which broke our test suite: https://github.com/uqfoundation/dill/issues/50. This issue may have been resolved now; I haven't checked. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
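For context on why Dill (or Spark's modified cloudpickle) is needed at all: the standard library {{pickle}} cannot serialize the closures PySpark routinely ships to workers. A minimal illustration, assuming {{dill}} is installed (this is not Spark code):

{code:python}
import pickle
import dill

f = lambda x: x * 2

# the standard library refuses lambdas, since unpickling relies on
# looking the function up by name
try:
    pickle.dumps(f)
except Exception as e:
    print("pickle failed:", e)

# dill serializes the function body itself
g = dill.loads(dill.dumps(f))
print(g(21))  # 42
{code}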
[jira] [Commented] (SPARK-4905) Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream
[ https://issues.apache.org/jira/browse/SPARK-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265228#comment-14265228 ] Hari Shreedharan commented on SPARK-4905: - I can't reproduce this - but once Jenkins is back, I will look at the logs to see if there is any info there. Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream - Key: SPARK-4905 URL: https://issues.apache.org/jira/browse/SPARK-4905 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Josh Rosen Labels: flaky-test It looks like the org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream test might be flaky ([link|https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24647/testReport/junit/org.apache.spark.streaming.flume/FlumeStreamSuite/flume_input_stream/]): {code} Error Message The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). Stacktrace sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). 
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:142) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply$mcV$sp(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(FlumeStreamSuite.scala:46) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at
[jira] [Commented] (SPARK-5085) netty shuffle service causing connection timeouts
[ https://issues.apache.org/jira/browse/SPARK-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265194#comment-14265194 ] Stephen Haberman commented on SPARK-5085: - I think I've found a good clue: {code} [ 2527.610744] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2555.073922] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2606.652438] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2615.427676] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2626.008450] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2742.996355] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2744.434263] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2744.623440] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2745.204023] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2745.470360] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2745.517182] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2745.616516] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2818.547464] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2824.525844] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2831.868035] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2833.644154] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2895.396963] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2896.939451] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2901.464337] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2902.461459] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2904.840728] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2908.156252] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2908.925033] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2933.240541] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2975.218843] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2975.333279] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2980.533872] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2984.017055] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2984.107575] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2991.252054] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2993.965474] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3000.521793] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3057.080236] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3067.674541] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3077.984465] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3139.590085] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3140.145975] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3217.729824] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3249.614154] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3252.775976] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3261.234940] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3302.538848] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3325.811720] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3332.873067] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3340.724759] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3349.646235] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3354.857573] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3361.728122] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3373.623622] xen_netfront: xennet: skb 
rides the rocket: 19 slots [ 3384.29] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3394.701554] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3402.048682] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3408.972487] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3415.697781] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3415.746289] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3428.234060] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3438.317541] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3467.061761] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3592.827300] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3598.320551] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3601.487113] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3636.656200] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3670.347676] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3672.573875] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3720.392902] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3748.385374] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3763.997229] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3776.472560] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3783.343165] xen_netfront: xennet: skb rides the rocket: 19 slots [
[jira] [Commented] (SPARK-4737) Prevent serialization errors from ever crashing the DAG scheduler
[ https://issues.apache.org/jira/browse/SPARK-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265267#comment-14265267 ] Matt Cheah commented on SPARK-4737: --- I will be out of the office with limited access to e-mail from January 05 to January 06. If there are specifically urgent matters requiring my assistance and you have other means of contacting me, please use those other channels. Sorry for the inconvenience. Thanks, -Matt Cheah Prevent serialization errors from ever crashing the DAG scheduler - Key: SPARK-4737 URL: https://issues.apache.org/jira/browse/SPARK-4737 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Patrick Wendell Assignee: Matthew Cheah Priority: Blocker Currently in Spark we assume that when tasks are serialized in the TaskSetManager, the serialization cannot fail. We assume this because upstream in the DAGScheduler we attempt to catch any serialization errors by serializing a single partition. However, in some cases this upstream test is not accurate - i.e. an RDD can have one partition that can serialize cleanly but not others. To do this in the proper way we need to catch and propagate the exception at the time of serialization. The tricky bit is making sure it gets propagated in the right way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
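To make the failure mode in SPARK-4737 concrete: probing one partition can succeed while another still fails to serialize, so the error must be caught where serialization actually happens. A hypothetical Python illustration (not the DAGScheduler code; {{threading.Lock}} stands in for any unpicklable object):

{code:python}
import pickle
import threading

# partition 0 serializes cleanly; partition 1 holds an unpicklable lock
partitions = [[1, 2, 3], [threading.Lock()]]

# probing only the first partition reports success...
pickle.dumps(partitions[0])

# ...so the exception has to be caught at actual serialization time
for i, part in enumerate(partitions):
    try:
        pickle.dumps(part)
    except Exception as e:
        print("partition %d failed to serialize: %s" % (i, e))
{code}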
[jira] [Updated] (SPARK-5055) Minor typos on the downloads page
[ https://issues.apache.org/jira/browse/SPARK-5055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neelesh Srinivas Salian updated SPARK-5055: --- Attachment: Spark_DownloadsPage_FixedTypos.html Here is the HTML page with the typos corrected. Please let me know if this is the right way to do it. Thank you. Minor typos on the downloads page - Key: SPARK-5055 URL: https://issues.apache.org/jira/browse/SPARK-5055 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.2.0 Reporter: Marko Bonaci Priority: Trivial Labels: documentation, newbie Attachments: Spark_DownloadsPage_FixedTypos.html Original Estimate: 1m Remaining Estimate: 1m The _Downloads_ page uses the word "Chose" where the present tense is intended. It should say "Choose". http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-927) PySpark sample() doesn't work if numpy is installed on master but not on workers
[ https://issues.apache.org/jira/browse/SPARK-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265290#comment-14265290 ] Matthew Farrellee commented on SPARK-927: - PR #2313 was subsumed by PR #3351, which resolved SPARK-4477 and this issue; the resolution was to remove the use of numpy altogether. PySpark sample() doesn't work if numpy is installed on master but not on workers Key: SPARK-927 URL: https://issues.apache.org/jira/browse/SPARK-927 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.8.0, 0.9.1, 1.0.2, 1.1.2 Reporter: Josh Rosen Assignee: Matthew Farrellee Priority: Minor PySpark's sample() method crashes with ImportErrors on the workers if numpy is installed on the driver machine but not on the workers. I'm not sure what's the best way to fix this. A general mechanism for automatically shipping libraries from the master to the workers would address this, but that could be complicated to implement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
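The resolution above removed numpy from {{sample()}} entirely; the general pattern for tolerating an optional dependency on the workers looks like the following sketch (not the actual PySpark code):

{code:python}
# import-time fallback: prefer numpy's RNG when available, use the
# stdlib otherwise, so the same code runs whether or not workers have numpy
try:
    import numpy
    def uniform():
        return float(numpy.random.random())
except ImportError:
    import random
    def uniform():
        return random.random()

print(uniform())
{code}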
[jira] [Commented] (SPARK-4943) Parsing error for query with table name having dot
[ https://issues.apache.org/jira/browse/SPARK-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265157#comment-14265157 ] Michael Armbrust commented on SPARK-4943: - I wouldn't say that the notion of {{tableIdentifier: Seq\[String\]}} is unclear. Instead I would say that it is deliberately unspecified in order to be flexible. Some systems have {{clusters}}, some systems have {{databases}}, some systems have {{schema}}, some have {{tables}}. Thus, this API gives us one interface for communicating between the parser and the underlying datastore that makes no assumption about how that datastore is laid out. If we do make this change then yes, I agree that we should also make it in the catalog. In general our handling of this has always been a little clunky, since there is a whole bunch of code that just ignores the database field. One question is which parts of the table identifier Spark SQL should handle and which parts we should pass on to the datasource. A goal here should be to be able to connect to and join data from multiple sources. Here is what I would propose as an addition to the current API, which only lets you register individual tables: - Users can register external catalogs which are responsible for producing {{BaseRelation}}s. - Each external catalog has a user-specified name that is given when registering. - There is a notion of the current catalog, which can be changed with {{USE}}. By default, we pass all the {{tableIdentifiers}} to this default catalog, and it's up to it to determine what each part means. - Users can also specify fully qualified tables when joining multiple data sources. We detect this case when the first {{tableIdentifier}} matches one of the registered catalogs. In this case we strip off the catalog name and pass the remaining {{tableIdentifiers}} to the specified catalog. What do you think? Parsing error for query with table name having dot -- Key: SPARK-4943 URL: https://issues.apache.org/jira/browse/SPARK-4943 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Alex Liu When integrating Spark 1.2.0 with Cassandra SQL, the following query is broken. It worked in Spark 1.1.0. Basically we use a table name containing a dot to include the database name {code} [info] java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but `.'
found [info] [info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT test2.a FROM sql_test.test2 AS test2 [info] ^ [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) [info] at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) [info] at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) [info] at
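Reconstructed from the trace above, the failing call has the following shape as a PySpark-style sketch (the report uses {{CassandraSQLContext}}; treating the stock {{SQLContext}} parser as failing identically is an assumption here):

{code:python}
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`

# in 1.2.0 the dotted database.table qualifier trips the parser before
# any catalog lookup happens:
#   java.lang.RuntimeException: failure: ``UNION'' expected but `.' found
sqlContext.sql(
    "SELECT test1.a FROM sql_test.test1 AS test1 "
    "UNION DISTINCT SELECT test2.a FROM sql_test.test2 AS test2")
{code}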
[jira] [Commented] (SPARK-4943) Parsing error for query with table name having dot
[ https://issues.apache.org/jira/browse/SPARK-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265212#comment-14265212 ] Alex Liu commented on SPARK-4943: - The catalog part of the table identifier should be handled by Spark SQL, which calls the registered catalog Context to connect to the underlying datasources. cluster/database/schema/table should be handled by the datasource (the Cassandra Spark SQL integration can handle cluster-, database- and table-level joins). e.g. {code} SELECT test1.a, test1.b, test2.c FROM cassandra.cluster.database.table1 AS test1 LEFT OUTER JOIN mySql.cluster.database.table2 AS test2 ON test1.a = test2.a {code} so cluster.database.table1 is passed to the cassandra catalog datasource: cassandra is handled by Spark SQL, which calls cassandraContext, which then calls the underlying datasource. cluster.database.table2 is passed to the mySql catalog datasource: mySql is handled by Spark SQL, which calls the mySqlContext, which then calls the underlying datasource. If the USE command is used, then all tableIdentifiers are passed to the datasource. e.g. {code} USE cassandra SELECT test1.a, test1.b, test2.c FROM cluster1.database.table AS test1 LEFT OUTER JOIN cluster2.database.table AS test2 ON test1.a = test2.a {code} cluster1.database.table and cluster2.database.table are passed to the cassandra datasource Parsing error for query with table name having dot -- Key: SPARK-4943 URL: https://issues.apache.org/jira/browse/SPARK-4943 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Alex Liu When integrating Spark 1.2.0 with Cassandra SQL, the following query is broken. It worked in Spark 1.1.0. Basically we use a table name containing a dot to include the database name {code} [info] java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but `.'
found [info] [info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT test2.a FROM sql_test.test2 AS test2 [info] ^ [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) [info] at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) [info] at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83) [info] at org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:53) [info] at
[jira] [Resolved] (SPARK-927) PySpark sample() doesn't work if numpy is installed on master but not on workers
[ https://issues.apache.org/jira/browse/SPARK-927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee resolved SPARK-927. - Resolution: Fixed Fix Version/s: 1.2.0 PySpark sample() doesn't work if numpy is installed on master but not on workers Key: SPARK-927 URL: https://issues.apache.org/jira/browse/SPARK-927 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.8.0, 0.9.1, 1.0.2, 1.1.2 Reporter: Josh Rosen Assignee: Matthew Farrellee Priority: Minor Fix For: 1.2.0 PySpark's sample() method crashes with ImportErrors on the workers if numpy is installed on the driver machine but not on the workers. I'm not sure what's the best way to fix this. A general mechanism for automatically shipping libraries from the master to the workers would address this, but that could be complicated to implement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4905) Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream
[ https://issues.apache.org/jira/browse/SPARK-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265304#comment-14265304 ] Tathagata Das commented on SPARK-4905: -- What is the reason behind such a behavior where the number of records received is same as sent, but all the records are empty? TD Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream - Key: SPARK-4905 URL: https://issues.apache.org/jira/browse/SPARK-4905 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Josh Rosen Labels: flaky-test It looks like the org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream test might be flaky ([link|https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24647/testReport/junit/org.apache.spark.streaming.flume/FlumeStreamSuite/flume_input_stream/]): {code} Error Message The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). Stacktrace sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). 
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:142) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply$mcV$sp(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(FlumeStreamSuite.scala:46) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at
[jira] [Created] (SPARK-5094) Python API for gradient-boosted trees
Xiangrui Meng created SPARK-5094: Summary: Python API for gradient-boosted trees Key: SPARK-5094 URL: https://issues.apache.org/jira/browse/SPARK-5094 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
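No description is attached, but the scope is presumably a PySpark wrapper mirroring the Scala {{GradientBoostedTrees}} API added to MLlib in 1.2. A sketch of what the requested API might look like; the method and parameter names are guesses, not an implemented interface:

{code:python}
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import GradientBoostedTrees  # hypothetical at filing time

data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
])

# train a small boosted ensemble and score one point
model = GradientBoostedTrees.trainClassifier(
    data, categoricalFeaturesInfo={}, numIterations=10)
print(model.predict([1.0, 0.0]))
{code}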
[jira] [Comment Edited] (SPARK-5085) netty shuffle service causing connection timeouts
[ https://issues.apache.org/jira/browse/SPARK-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265194#comment-14265194 ] Stephen Haberman edited comment on SPARK-5085 at 1/5/15 10:14 PM: -- I think I've found a good clue (from the dmesg logs): {code} [ 2527.610744] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2555.073922] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2606.652438] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2615.427676] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2626.008450] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2742.996355] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2744.434263] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2744.623440] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2745.204023] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2745.470360] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2745.517182] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2745.616516] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2818.547464] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2824.525844] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2831.868035] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2833.644154] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2895.396963] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2896.939451] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2901.464337] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2902.461459] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2904.840728] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2908.156252] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2908.925033] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2933.240541] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2975.218843] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2975.333279] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2980.533872] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2984.017055] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2984.107575] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2991.252054] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2993.965474] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3000.521793] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3057.080236] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3067.674541] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3077.984465] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3139.590085] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3140.145975] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3217.729824] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3249.614154] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3252.775976] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3261.234940] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3302.538848] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3325.811720] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3332.873067] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3340.724759] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3349.646235] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3354.857573] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3361.728122] xen_netfront: xennet: skb rides the rocket: 19 
slots [ 3373.623622] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3384.29] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3394.701554] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3402.048682] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3408.972487] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3415.697781] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3415.746289] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3428.234060] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3438.317541] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3467.061761] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3592.827300] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3598.320551] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3601.487113] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3636.656200] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3670.347676] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3672.573875] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3720.392902] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3748.385374] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3763.997229] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3776.472560] xen_netfront: xennet: skb rides the rocket: 19 slots [
[jira] [Commented] (SPARK-4943) Parsing error for query with table name having dot
[ https://issues.apache.org/jira/browse/SPARK-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265224#comment-14265224 ] Alex Liu commented on SPARK-4943: - For each catalog, the configuration settings should start with catalog name. e.g. {noformat} set cassandra.cluster.database.table.ttl = 1000 set cassandra.database.table.ttl =1000 (default cluster) set mysql.cluster.database.table.xxx = 200 {noformat} If there's no catalog in the setting string, use the default catalog. Parsing error for query with table name having dot -- Key: SPARK-4943 URL: https://issues.apache.org/jira/browse/SPARK-4943 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Alex Liu When integrating Spark 1.2.0 with Cassandra SQL, the following query is broken. It was working for Spark 1.1.0 version. Basically we use the table name having dot to include database name {code} [info] java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but `.' found [info] [info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT test2.a FROM sql_test.test2 AS test2 [info] ^ [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) [info] at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) [info] at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) [info] at scala.Option.getOrElse(Option.scala:120) [info] at 
org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83) [info] at org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:53) [info] at org.apache.spark.sql.cassandra.CassandraSQLContext.sql(CassandraSQLContext.scala:56) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply$mcV$sp(CassandraSQLSpec.scala:169) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1647) [info] at
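The prefix convention proposed above amounts to a two-step lookup; a hypothetical helper sketching it (a plain dict stands in for a real conf object, and all names here are illustrative only):

{code:python}
def get_catalog_setting(conf, catalog, key):
    # look up "catalog.key" first, falling back to the bare key,
    # which covers settings written against the default catalog
    return conf.get("%s.%s" % (catalog, key), conf.get(key))

conf = {"cassandra.cluster.database.table.ttl": "1000"}
print(get_catalog_setting(conf, "cassandra", "cluster.database.table.ttl"))  # 1000
{code}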
[jira] [Updated] (SPARK-4737) Prevent serialization errors from ever crashing the DAG scheduler
[ https://issues.apache.org/jira/browse/SPARK-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4737: --- Affects Version/s: 1.0.2 1.1.1 Prevent serialization errors from ever crashing the DAG scheduler - Key: SPARK-4737 URL: https://issues.apache.org/jira/browse/SPARK-4737 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Patrick Wendell Assignee: Matthew Cheah Priority: Blocker Currently in Spark we assume that when tasks are serialized in the TaskSetManager, the serialization cannot fail. We assume this because upstream in the DAGScheduler we attempt to catch any serialization errors by serializing a single partition. However, in some cases this upstream test is not accurate - i.e. an RDD can have one partition that can serialize cleanly but not others. To do this in the proper way we need to catch and propagate the exception at the time of serialization. The tricky bit is making sure it gets propagated in the right way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5093) Make network related timeouts consistent
[ https://issues.apache.org/jira/browse/SPARK-5093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5093. Resolution: Fixed Fix Version/s: 1.3.0 Make network related timeouts consistent Key: SPARK-5093 URL: https://issues.apache.org/jira/browse/SPARK-5093 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.3.0 We have a few instances of spark.network.timeout that are using different default settings (e.g. 45s in block manager and 100s in shuffle). This proposes to make them consistent at 120s. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
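With the timeouts unified, the per-subsystem values fall back to the single property named in the issue; setting it from user code might look like this (whether the suffixed "120s" string form is accepted in this release is an assumption):

{code:python}
from pyspark import SparkConf, SparkContext

# one knob instead of separate block manager / shuffle timeouts
conf = SparkConf().setAppName("timeouts").set("spark.network.timeout", "120s")
sc = SparkContext(conf=conf)
{code}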
[jira] [Commented] (SPARK-4905) Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream
[ https://issues.apache.org/jira/browse/SPARK-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265322#comment-14265322 ] Hari Shreedharan commented on SPARK-4905: - I am not sure. It might have something to do with the encoding/decoding. The events even have the headers, but the body is empty (there is comparison of the headers for each event) Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream - Key: SPARK-4905 URL: https://issues.apache.org/jira/browse/SPARK-4905 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Josh Rosen Labels: flaky-test It looks like the org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream test might be flaky ([link|https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24647/testReport/junit/org.apache.spark.streaming.flume/FlumeStreamSuite/flume_input_stream/]): {code} Error Message The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). Stacktrace sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). 
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:142) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply$mcV$sp(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(FlumeStreamSuite.scala:46) at
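One mundane way to get exactly this symptom (the right record count and intact headers, but empty bodies) is re-reading a buffer whose position was already consumed. A minimal illustration of that general pattern, offered as a guess consistent with the encoding/decoding hypothesis rather than a diagnosis of the Flume suite:

{code:python}
import io

buf = io.BytesIO(b"event body")
print(buf.read())  # b'event body' -- the position is now at the end
print(buf.read())  # b'' -- a second read of the same buffer is empty
buf.seek(0)        # rewinding restores the data
print(buf.read())  # b'event body'
{code}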
[jira] [Commented] (SPARK-4943) Parsing error for query with table name having dot
[ https://issues.apache.org/jira/browse/SPARK-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265335#comment-14265335 ] Michael Armbrust commented on SPARK-4943: - Thanks for your comments Alex. Are you proposing any changes to what I said? Another thing I'm confused about is your comment regarding joins. As of now there is no public API for passing that kind of information down into a datasource. Regarding the configuration. We will pass the datasource a SQLContext and you can do {{.getConf}} using whatever arbitrary string you want. I don't think Spark SQL needs to have any control here. Parsing error for query with table name having dot -- Key: SPARK-4943 URL: https://issues.apache.org/jira/browse/SPARK-4943 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Alex Liu When integrating Spark 1.2.0 with Cassandra SQL, the following query is broken. It was working for Spark 1.1.0 version. Basically we use the table name having dot to include database name {code} [info] java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but `.' found [info] [info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT test2.a FROM sql_test.test2 AS test2 [info] ^ [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) [info] at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) [info] at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) [info] at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83) [info] at org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:53) [info] at org.apache.spark.sql.cassandra.CassandraSQLContext.sql(CassandraSQLContext.scala:56) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply$mcV$sp(CassandraSQLSpec.scala:169) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at
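The {{.getConf}} remark in the comment above can be made concrete. A minimal sketch, assuming Spark 1.2's SQLContext {{setConf}}/{{getConf}} methods; the key name {{spark.cassandra.cluster}} is illustrative only, not a real configuration property:
{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object ConfLookupSketch {
  def main(args: Array[String]): Unit = {
    val sqlContext = new SQLContext(new SparkContext("local", "conf-lookup-sketch"))
    // The user sets any arbitrary string key they like...
    sqlContext.setConf("spark.cassandra.cluster", "cluster1")
    // ...and the datasource, handed the SQLContext, reads it back with a default.
    val cluster = sqlContext.getConf("spark.cassandra.cluster", "default")
    println(cluster) // prints "cluster1"
  }
}
{code}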
[jira] [Commented] (SPARK-5055) Minor typos on the downloads page
[ https://issues.apache.org/jira/browse/SPARK-5055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265331#comment-14265331 ] Sean Owen commented on SPARK-5055: -- Normally the right way to propose a change is to open a pull request against master at github.com/apache/spark. This site is hosted, however, in Apache SVN. You can post a patch against HEAD here for site changes. But, in this case, the change is so trivial that the person who will commit the change will just make the 3 edits and be done. Minor typos on the downloads page - Key: SPARK-5055 URL: https://issues.apache.org/jira/browse/SPARK-5055 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.2.0 Reporter: Marko Bonaci Priority: Trivial Labels: documentation, newbie Attachments: Spark_DownloadsPage_FixedTypos.html Original Estimate: 1m Remaining Estimate: 1m The _Downloads_ page uses the word "Chose" where the present tense "Choose" is intended. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark
[ https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265344#comment-14265344 ] Bert Greevenbosch edited comment on SPARK-2352 at 1/5/15 11:36 PM: --- Hi Nathan, Great to hear of your interest! The pull request is quite active. You can see it and related discussion here: https://github.com/apache/spark/pull/1290 Best regards, Bert was (Author: bgreeven): Hi Nathan, Great to year of your interest! The pull request is quite active. You can see it and related discussion here: https://github.com/apache/spark/pull/1290 Best regards, Bert [MLLIB] Add Artificial Neural Network (ANN) to Spark Key: SPARK-2352 URL: https://issues.apache.org/jira/browse/SPARK-2352 Project: Spark Issue Type: New Feature Components: MLlib Environment: MLLIB code Reporter: Bert Greevenbosch Assignee: Bert Greevenbosch It would be good if the Machine Learning Library contained Artificial Neural Networks (ANNs). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5040) Support expressing unresolved attributes using $attribute name notation in SQL DSL
[ https://issues.apache.org/jira/browse/SPARK-5040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5040. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3862 [https://github.com/apache/spark/pull/3862] Support expressing unresolved attributes using $attribute name notation in SQL DSL Key: SPARK-5040 URL: https://issues.apache.org/jira/browse/SPARK-5040 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.3.0 The SQL DSL uses Scala symbols to represent attributes, e.g. 'attr. Symbols, however, cannot capture attributes with spaces or uncommon characters. Here we suggest supporting {{$"attribute name"}} as an alternative. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
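A minimal sketch of how such a notation can be wired up with a Scala string interpolator; {{UnresolvedAttribute}} is Catalyst's class, but the implicit below is illustrative rather than the exact code merged in the pull request:
{code}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

object InterpolatorSketch {
  implicit class AttributeInterpolator(val sc: StringContext) extends AnyVal {
    // $"first name" builds an unresolved attribute whose name contains a
    // space, which the Symbol-based notation ('attr) cannot express.
    def $(args: Any*): UnresolvedAttribute = UnresolvedAttribute(sc.s(args: _*))
  }
}

// usage:
// import InterpolatorSketch._
// val attr = $"first name"
{code}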
[jira] [Updated] (SPARK-4940) Support more evenly distributing cores for Mesos mode
[ https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Chen updated SPARK-4940: Summary: Support more evenly distributing cores for Mesos mode (was: Document or Support more evenly distributing cores for Mesos mode) Support more evenly distributing cores for Mesos mode - Key: SPARK-4940 URL: https://issues.apache.org/jira/browse/SPARK-4940 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Currently, in coarse-grained mode, the Spark scheduler simply takes all the resources it can on each node, which can cause uneven distribution depending on the resources available on each slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
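A minimal, self-contained sketch of the even-distribution idea, independent of the actual MesosSchedulerBackend code ({{Offer}} below is a stand-in for a Mesos resource offer): instead of draining each offer in turn, take one core per slave per round.
{code}
case class Offer(slaveId: String, cores: Int)

def distributeEvenly(offers: Seq[Offer], coresToAcquire: Int): Map[String, Int] = {
  val granted = scala.collection.mutable.Map(offers.map(_.slaveId -> 0): _*)
  var remaining = coresToAcquire
  var madeProgress = true
  // One core per slave per round, so no slave is exhausted while others sit idle.
  while (remaining > 0 && madeProgress) {
    madeProgress = false
    for (o <- offers if remaining > 0 && granted(o.slaveId) < o.cores) {
      granted(o.slaveId) += 1
      remaining -= 1
      madeProgress = true
    }
  }
  granted.toMap
}
{code}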
[jira] [Updated] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode
[ https://issues.apache.org/jira/browse/SPARK-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Chen updated SPARK-5095: Description: Currently in coarse grained mesos mode, it's expected that we only launch one Mesos executor that launches one JVM process to launch multiple Spark executors. However, this becomes a problem when the launched JVM process is larger than an ideal size (30GB is the recommended value from Databricks), which causes the GC problems reported on the mailing list. We should support launching multiple executors when large enough resources are available for Spark to use, and these resources are still under the configured limit. This is also applicable when users want to specify the number of executors to be launched on each node was: Currently in coarse grained mesos mode, it's expected that we only launch one Mesos executor that launches one JVM process to launch multiple Spark executors. However, this becomes a problem when the launched JVM process is larger than an ideal size (30GB is the recommended value from Databricks), which causes the GC problems reported on the mailing list. We should support launching multiple executors when large enough resources are available for Spark to use, and these resources are still under the configured limit. Support launching multiple mesos executors in coarse grained mesos mode --- Key: SPARK-5095 URL: https://issues.apache.org/jira/browse/SPARK-5095 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Currently in coarse grained mesos mode, it's expected that we only launch one Mesos executor that launches one JVM process to launch multiple Spark executors. However, this becomes a problem when the launched JVM process is larger than an ideal size (30GB is the recommended value from Databricks), which causes the GC problems reported on the mailing list. We should support launching multiple executors when large enough resources are available for Spark to use, and these resources are still under the configured limit. This is also applicable when users want to specify the number of executors to be launched on each node -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark
[ https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265344#comment-14265344 ] Bert Greevenbosch commented on SPARK-2352: -- Hi Nathan, Great to hear of your interest! The pull request is quite active. You can see it and related discussion here: https://github.com/apache/spark/pull/1290 Best regards, Bert [MLLIB] Add Artificial Neural Network (ANN) to Spark Key: SPARK-2352 URL: https://issues.apache.org/jira/browse/SPARK-2352 Project: Spark Issue Type: New Feature Components: MLlib Environment: MLLIB code Reporter: Bert Greevenbosch Assignee: Bert Greevenbosch It would be good if the Machine Learning Library contained Artificial Neural Networks (ANNs). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265331#comment-14265331 ] Sean Owen commented on SPARK-1517: -- Recap: the old URL was building-with-maven.html, the new URL is building-spark.html, to match a rename and content change of the page itself a few months ago. There should be a redirect from the former to the latter. Until the 1.2.0 site was published, there was no building-spark.html page live on the site. So README.md had to link to building-with-maven.html, with the intent that after 1.2.0 this would just redirect to building-spark.html. I'm not sure why, but the redirect isn't working. It redirects to http://spark.apache.org/building-spark.html . It seems like this is some default mechanism, and the redirector that the plugin is supposed to generate isn't present or something. Could somehow be my mistake, but I certainly recall it worked on my local build of the site or else I never would have proposed it. So, yes, one direct hotfix is to change links to the old page into links to the new page. Only one of the two links in README.md was updated. It's easy to fix the other. The README.md that you see on github.com is always going to be from master, but people are going to encounter the page and sometimes expect it to correspond to the latest stable release. (You can always view README.md from the branch you want of course, if you know what you're doing.) Yes, for this reason I agree that it's best to make it mostly pointers to other information, and I think that was already the intent of changes that included the renaming I alluded to above. IIRC there was a desire to not strip down README.md further and leave some minimal, hopefully fairly unchanging, info there. Whether there should be nightly builds of the site is a different question. If you linked to nightly instead of latest I suppose you'd have more of the same problem, no? People finding the github site and perhaps thinking they are seeing the latest stable docs? On the other hand, it would at least be more internally consistent. On the other other hand, would you have to change the links to the stable URLs for release and then back as part of the release process? I had thought just linking to the latest stable release docs was simple and fine. Publish nightly snapshots of documentation, maven artifacts, and binary builds -- Key: SPARK-1517 URL: https://issues.apache.org/jira/browse/SPARK-1517 Project: Spark Issue Type: Improvement Components: Build, Project Infra Reporter: Patrick Wendell Priority: Blocker Should be pretty easy to do with Jenkins. The only thing I can think of that would be tricky is to set up credentials so that jenkins can publish this stuff somewhere on apache infra. Ideally we don't want to have to put a private key on every jenkins box (since they are otherwise pretty stateless). One idea is to encrypt these credentials with a passphrase and post them somewhere publicly visible. Then the jenkins build can download the credentials provided we set a passphrase in an environment variable in jenkins. There may be simpler solutions as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4296: Priority: Blocker (was: Critical) Throw Expression not in GROUP BY when using same expression in group by clause and select clause --- Key: SPARK-4296 URL: https://issues.apache.org/jira/browse/SPARK-4296 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Shixiong Zhu Assignee: Cheng Lian Priority: Blocker Fix For: 1.2.0 When the input data has a complex structure, using the same expression in the group by clause and the select clause will throw "Expression not in GROUP BY".
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Birthday(date: String)
case class Person(name: String, birthday: Birthday)
val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28"))))
people.registerTempTable("people")
val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
year.collect
{code}
Here is the plan of year:
{code:java}
SchemaRDD[3] at RDD at SchemaRDD.scala:105
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3]
 Subquery people
  LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36
{code}
The bug is the equality test for `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`. Maybe Spark SQL needs a mechanism to compare an Alias expression with a non-Alias expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
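A minimal, self-contained sketch (not Catalyst's actual classes) of the comparison mechanism the last sentence asks for: strip {{Alias}} nodes before testing equality, so {{Upper(birthday.date)}} and {{Upper(birthday.date) AS c1}} compare equal.
{code}
sealed trait Expr
case class Attr(name: String) extends Expr
case class Upper(child: Expr) extends Expr
case class Alias(child: Expr, name: String) extends Expr

// Recursively drop Alias wrappers anywhere in the tree.
def stripAliases(e: Expr): Expr = e match {
  case Alias(child, _) => stripAliases(child)
  case Upper(child)    => Upper(stripAliases(child))
  case other           => other
}

def semanticEquals(a: Expr, b: Expr): Boolean = stripAliases(a) == stripAliases(b)

// semanticEquals(Upper(Attr("date")), Alias(Upper(Attr("date")), "c1")) == true
{code}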
[jira] [Commented] (SPARK-4943) Parsing error for query with table name having dot
[ https://issues.apache.org/jira/browse/SPARK-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265351#comment-14265351 ] Alex Liu commented on SPARK-4943: - No changes to your approach. Regarding passing cluster1.database.table1 and cluster2.database.table to datasources: they are set as the tableIdentifier, and the tableIdentifier is passed to the catalog.lookupRelation method, where the datasource can use it. Parsing error for query with table name having dot -- Key: SPARK-4943 URL: https://issues.apache.org/jira/browse/SPARK-4943 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Alex Liu When integrating Spark 1.2.0 with Cassandra SQL, the following query is broken. It was working for Spark 1.1.0 version. Basically we use the table name having dot to include database name {code} [info] java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but `.' found [info] [info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT test2.a FROM sql_test.test2 AS test2 [info] ^ [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) [info] at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) [info] at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83) [info] at 
org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:53) [info] at org.apache.spark.sql.cassandra.CassandraSQLContext.sql(CassandraSQLContext.scala:56) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply$mcV$sp(CassandraSQLSpec.scala:169) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1647) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) [info] at
[jira] [Commented] (SPARK-4960) Interceptor pattern in receivers
[ https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265533#comment-14265533 ] Tathagata Das commented on SPARK-4960: -- The reason I am suggesting the limited solution is that there are use cases where even T => Iterator[T] helps. People have asked me before whether the data received from a source can be filtered even before it has been inserted, to reduce memory usage. Others have also asked if they can do very low latency stuff like pushing received data out to some other store immediately, for greater reliability. This simplified interceptor pattern can solve those. However, yes, it does not solve [~c...@koeninger.org]'s requirement. That should best be solved using a new Receiver and InputDStream. This limited solution can be implemented without another receiver. The interceptor function, if set, can be applied by the BlockGenerator to every record it gets. And since we want everyone to use the BlockGenerator, all receivers will be able to take advantage of this interceptor. Interceptor pattern in receivers Key: SPARK-4960 URL: https://issues.apache.org/jira/browse/SPARK-4960 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Sometimes it is good to intercept a message received through a receiver and modify / do something with the message before it is stored into Spark. This is often referred to as the interceptor pattern. There should be a general way to specify an interceptor function that gets applied to all receivers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
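A minimal sketch of the limited interceptor described above ({{BlockGeneratorSketch}} is a stand-in, not the real BlockGenerator): the {{T => Iterator[T]}} function runs on each record before it is buffered, so both early filtering and early expansion happen before anything is stored in Spark.
{code}
class BlockGeneratorSketch[T](interceptor: T => Iterator[T]) {
  private val buffer = scala.collection.mutable.ArrayBuffer.empty[T]

  def addData(record: T): Unit = {
    // Empty iterator = record dropped before storage; several elements =
    // record expanded before storage.
    interceptor(record).foreach(buffer += _)
  }

  def currentBuffer: Seq[T] = buffer
}

// e.g. dropping blank lines to reduce memory usage:
val gen = new BlockGeneratorSketch[String](s => if (s.trim.nonEmpty) Iterator(s) else Iterator.empty)
{code}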
[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265355#comment-14265355 ] Ryan Williams commented on SPARK-1517: -- Hey [~pwendell], any updates here? The disconnect between the content of the github README and the /latest/ published docs leading up to the 1.2.0 release continues to cast a shadow and new divergence is set to begin as we move further from having just cut a release. As was [recently pointed out on the dev list|http://apache-spark-developers-list.1001551.n3.nabble.com/Starting-with-Spark-tp9908p9925.html], [my|https://github.com/apache/spark/commit/4ceb048b38949dd0a909d2ee6777607341c9c93a#diff-04c6e90faac2675aa89e2176d2eec7d8] and [Reynold's|https://github.com/apache/spark/commit/342b57db66e379c475daf5399baf680ff42b87c2#diff-04c6e90faac2675aa89e2176d2eec7d8] fixes to previously-broken links in the README became broken links when the 1.2.0 docs were cut, as [~srowen] [warned would happen|https://github.com/apache/spark/commit/342b57db66e379c475daf5399baf680ff42b87c2#commitcomment-8250912] (one is fixed [here|https://github.com/apache/spark/pull/3802/files], the other remains broken on the README today). I still believe that the correct fix is to have the README point at docs that are published with each Jenkins build, per this JIRA and [our previous discussion about it|http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-tt9560.html#a9568]. Even better would be to publish nightly docs *and* remove any pretense that the github README is a canonical source of documentation, in favor of just linking to the /latest/ published docs. Let me know if you want me to file that as a sub-task here. Publish nightly snapshots of documentation, maven artifacts, and binary builds -- Key: SPARK-1517 URL: https://issues.apache.org/jira/browse/SPARK-1517 Project: Spark Issue Type: Improvement Components: Build, Project Infra Reporter: Patrick Wendell Priority: Blocker Should be pretty easy to do with Jenkins. The only thing I can think of that would be tricky is to set up credentials so that jenkins can publish this stuff somewhere on apache infra. Ideally we don't want to have to put a private key on every jenkins box (since they are otherwise pretty stateless). One idea is to encrypt these credentials with a passphrase and post them somewhere publicly visible. Then the jenkins build can download the credentials provided we set a passphrase in an environment variable in jenkins. There may be simpler solutions as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode
Timothy Chen created SPARK-5095: --- Summary: Support launching multiple mesos executors in coarse grained mesos mode Key: SPARK-5095 URL: https://issues.apache.org/jira/browse/SPARK-5095 Project: Spark Issue Type: Improvement Reporter: Timothy Chen Currently in coarse grained mesos mode, it's expected that we only launch one Mesos executor that launches one JVM process to launch multiple Spark executors. However, this becomes a problem when the launched JVM process is larger than an ideal size (30GB is the recommended value from Databricks), which causes the GC problems reported on the mailing list. We should support launching multiple executors when large enough resources are available for Spark to use, and these resources are still under the configured limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode
[ https://issues.apache.org/jira/browse/SPARK-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Chen updated SPARK-5095: Component/s: Mesos Support launching multiple mesos executors in coarse grained mesos mode --- Key: SPARK-5095 URL: https://issues.apache.org/jira/browse/SPARK-5095 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Currently in coarse grained mesos mode, it's expected that we only launch one Mesos executor that launches one JVM process to launch multiple Spark executors. However, this becomes a problem when the launched JVM process is larger than an ideal size (30GB is the recommended value from Databricks), which causes the GC problems reported on the mailing list. We should support launching multiple executors when large enough resources are available for Spark to use, and these resources are still under the configured limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-4737) Prevent serialization errors from ever crashing the DAG scheduler
[ https://issues.apache.org/jira/browse/SPARK-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Cheah updated SPARK-4737: -- Comment: was deleted (was: I will be out of the office with limited access to e-mail from January 05 to January 06. If there are specifically urgent matters requiring my assistance and you have other means of contacting me, please use those other channels. Sorry for the inconvenience. Thanks, -Matt Cheah ) Prevent serialization errors from ever crashing the DAG scheduler - Key: SPARK-4737 URL: https://issues.apache.org/jira/browse/SPARK-4737 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Patrick Wendell Assignee: Matthew Cheah Priority: Blocker Currently in Spark we assume that when tasks are serialized in the TaskSetManager that the serialization cannot fail. We assume this because upstream in the DAGScheduler we attempt to catch any serialization errors by serializing a single partition. However, in some cases this upstream test is not accurate - i.e. an RDD can have one partition that can serialize cleanly but not others. To do this in the proper way we need to catch and propagate the exception at the time of serialization. The tricky bit is making sure it gets propagated in the right way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
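A minimal sketch of the direction described: attempt the serialization per task and route any failure to a stage-abort handler rather than letting it crash the scheduler ({{abortStage}} here is a stand-in for the DAGScheduler's real handler).
{code}
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}
import scala.util.control.NonFatal

def trySerializeTask(task: AnyRef)(abortStage: Throwable => Unit): Option[Array[Byte]] = {
  try {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(task) // may fail for only *some* partitions' tasks
    out.close()
    Some(bytes.toByteArray)
  } catch {
    // Propagate at serialization time instead of assuming it cannot fail.
    case e: NotSerializableException => abortStage(e); None
    case NonFatal(e) => abortStage(e); None
  }
}
{code}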
[jira] [Commented] (SPARK-4687) SparkContext#addFile doesn't keep file folder information
[ https://issues.apache.org/jira/browse/SPARK-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265416#comment-14265416 ] Patrick Wendell commented on SPARK-4687: I spent some more time looking at this and talking with [~sandyr] and [~joshrosen]. I think having some limited version of this is fine given that, from what I can tell, this is pretty difficult to implement outside of Spark. I am going to post further comments on the JIRA. SparkContext#addFile doesn't keep file folder information - Key: SPARK-4687 URL: https://issues.apache.org/jira/browse/SPARK-4687 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Jimmy Xiang Files added with SparkContext#addFile are loaded with Utils#fetchFile before a task starts. However, Utils#fetchFile puts all files under the Spark root on the worker node. We should have an option to keep the folder information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
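To make the flattening concrete, a minimal sketch using the public API (the path {{data/subdir/lookup.txt}} is illustrative only): today the worker-side lookup is by bare file name, so the {{subdir}} component is lost.
{code}
import org.apache.spark.{SparkContext, SparkFiles}

val sc = new SparkContext("local", "addfile-sketch")
sc.addFile("data/subdir/lookup.txt")
sc.parallelize(1 to 2).foreach { _ =>
  // Fetched under the Spark root on the worker; folder information is gone.
  println(SparkFiles.get("lookup.txt"))
}
{code}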
[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265419#comment-14265419 ] Ryan Williams commented on SPARK-1517: -- Agreed that the redirect you speak of should exist / be fixed; a separate JIRA should be filed for that. bq. Whether there should be nightly builds of the site is a different question. My understanding is that there has been consensus at a few points that this is a good idea. The main concern you've voiced is the risk that people will land on the github README when looking for stable/release docs, and: # find nightly info directly in the README content (and not understand it to be incorrect (too up-to-date) for their purposes), or # inadvertently follow a link to published nightly docs. re: 1, in my last post I suggested doing away with the pretense that the github README will directly contain Spark documentation, and replacing its current content with links to the relevant published docs, potentially *both* nightly and stable. re: 2, as long as the README's links to nightly and stable docs sites are clearly marked, this should not be a problem. Users already must have a minimal level of understanding of what version of Spark docs they want to look at. Publish nightly snapshots of documentation, maven artifacts, and binary builds -- Key: SPARK-1517 URL: https://issues.apache.org/jira/browse/SPARK-1517 Project: Spark Issue Type: Improvement Components: Build, Project Infra Reporter: Patrick Wendell Priority: Blocker Should be pretty easy to do with Jenkins. The only thing I can think of that would be tricky is to set up credentials so that jenkins can publish this stuff somewhere on apache infra. Ideally we don't want to have to put a private key on every jenkins box (since they are otherwise pretty stateless). One idea is to encrypt these credentials with a passphrase and post them somewhere publicly visible. Then the jenkins build can download the credentials provided we set a passphrase in an environment variable in jenkins. There may be simpler solutions as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5085) netty shuffle service causing connection timeouts
[ https://issues.apache.org/jira/browse/SPARK-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265472#comment-14265472 ] Stephen Haberman commented on SPARK-5085: - Looks like this is probably a known EC2/Ubuntu/Linux/Xen issue: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811 http://www.spinics.net/lists/netdev/msg282340.html I'm trying to run with those flags (tso/sg) off to see if that fixes it. netty shuffle service causing connection timeouts - Key: SPARK-5085 URL: https://issues.apache.org/jira/browse/SPARK-5085 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Environment: EMR, transient cluster of 10 m3.2xlarges, spark 1.2.0 Here's our spark-defaults: {code} spark.master spark://$MASTER_IP:7077 spark.eventLog.enabled true spark.eventLog.dir /mnt/spark/work/history spark.serializer org.apache.spark.serializer.KryoSerializer spark.executor.memory ${EXECUTOR_MEM}m spark.core.connection.ack.wait.timeout 600 spark.storage.blockManagerSlaveTimeoutMs 6 spark.shuffle.consolidateFiles true spark.shuffle.service.enabled false spark.shuffle.blockTransferService nio # works with nio, fails with netty # Use snappy because LZF uses ~100-300k buffer per block spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec spark.shuffle.file.buffer.kb 10 spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError -Xss2m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRati... spark.akka.logLifecycleEvents true spark.akka.timeout 360 spark.akka.askTimeout 120 spark.akka.lookupTimeout 120 spark.akka.frameSize 100 spark.files.userClassPathFirst true spark.shuffle.memoryFraction 0.5 spark.storage.memoryFraction 0.2 {code} Reporter: Stephen Haberman In Spark 1.2.0, the netty backend is causing our report's cluster to lock up with connection timeouts, ~75% of the way through the job. It happens with both the external shuffle server and the non-external/executor-hosted shuffle server, but if I change the shuffle service from netty to nio, it immediately works. Here's log output from one executor (I turned on trace output for the network package and ShuffleBlockFetcherIterator; all executors in the cluster have basically the same pattern of ~15m of silence then timeouts): {code} // lots of log output, doing fine... 15/01/03 05:33:39 TRACE [shuffle-server-0] protocol.MessageDecoder (MessageDecoder.java:decode(42)) - Received message ChunkFetchRequest: ChunkFetchRequest{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}} 15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:processFetchRequest(107)) - Received req from /10.169.175.179:57056 to fetch block StreamChunkId{streamId=1465867812750, chunkIndex=170} 15/01/03 05:33:39 TRACE [shuffle-server-0] server.OneForOneStreamManager (OneForOneStreamManager.java:getChunk(75)) - Removing stream id 1465867812750 15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(152)) - Sent result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}, buffer=FileSegmentManagedBuffer{file=/mnt1/spark/local/spark-local-20150103040327-c554/28/shuffle_4_1723_0.data, offset=4574685, length=20939}} to client /10.169.175.179:57056 // note 15m of silence here... 
15/01/03 05:48:13 WARN [shuffle-server-7] server.TransportChannelHandler (TransportChannelHandler.java:exceptionCaught(66)) - Exception in connection from /10.33.166.218:42780 java.io.IOException: Connection timed out at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at
[jira] [Commented] (SPARK-4960) Interceptor pattern in receivers
[ https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265508#comment-14265508 ] Saisai Shao commented on SPARK-4960: Hi TD, thanks a lot for your suggestions. What concerns me about changing the interceptor pattern to T => Iterator[T] is that it will narrow the usage scenarios of this functionality; at least it cannot meet the requirement [~c...@koeninger.org] mentioned for KafkaInputDStream. Is it necessary to add just the T => Iterator[T] pattern in the receiver rather than moving it into a DStream transformation? For the problem you mentioned, I think maybe we can create an interceptor receiver to ship the data from the actual receiver and do some transformation before storing, so each actual receiver will have a shadow interceptor receiver to intercept the data; this might fix the problem but will increase the implementation complexity. I will re-think about my design and try to figure out a better way if possible. Thanks a lot. Interceptor pattern in receivers Key: SPARK-4960 URL: https://issues.apache.org/jira/browse/SPARK-4960 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Sometimes it is good to intercept a message received through a receiver and modify / do something with the message before it is stored into Spark. This is often referred to as the interceptor pattern. There should be a general way to specify an interceptor function that gets applied to all receivers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5082) Minor typo in the Tuning Spark document about Data Serialization
[ https://issues.apache.org/jira/browse/SPARK-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265549#comment-14265549 ] yangping wu commented on SPARK-5082: Yes, I also found the pull after I created the issue. Minor typo in the Tuning Spark document about Data Serialization Key: SPARK-5082 URL: https://issues.apache.org/jira/browse/SPARK-5082 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.2.0 Reporter: yangping wu Priority: Trivial Original Estimate: 8h Remaining Estimate: 8h The latest documentation for *Tuning Spark* has an erroneous entry (http://spark.apache.org/docs/latest/tuning.html) in the *Data Serialization* section: {code} val conf = new SparkConf().setMaster(...).setAppName(...) conf.registerKryoClasses(Seq(classOf[MyClass1], classOf[MyClass2])) val sc = new SparkContext(conf) {code} The *registerKryoClasses* method's parameter type should be *Array[Class[_]]*, not *Seq[Class[_]]*. The correct code snippet is {code} val conf = new SparkConf().setMaster(...).setAppName(...) conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2])) val sc = new SparkContext(conf) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4960) Interceptor pattern in receivers
[ https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265548#comment-14265548 ] Saisai Shao commented on SPARK-4960: Thanks TD, this sounds reasonable. I will refactor the design doc, and I think we should take SPARK-5042 into consideration. Interceptor pattern in receivers Key: SPARK-4960 URL: https://issues.apache.org/jira/browse/SPARK-4960 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Sometimes it is good to intercept a message received through a receiver and modify / do something with the message before it is stored into Spark. This is often referred to as the interceptor pattern. There should be a general way to specify an interceptor function that gets applied to all receivers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-5082) Minor typo in the Tuning Spark document about Data Serialization
[ https://issues.apache.org/jira/browse/SPARK-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yangping wu updated SPARK-5082: --- Comment: was deleted (was: Yes, I also found the pull after I created the issue.) Minor typo in the Tuning Spark document about Data Serialization Key: SPARK-5082 URL: https://issues.apache.org/jira/browse/SPARK-5082 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.2.0 Reporter: yangping wu Priority: Trivial Original Estimate: 8h Remaining Estimate: 8h The latest documentation for *Tuning Spark* has an erroneous entry (http://spark.apache.org/docs/latest/tuning.html) in the *Data Serialization* section: {code} val conf = new SparkConf().setMaster(...).setAppName(...) conf.registerKryoClasses(Seq(classOf[MyClass1], classOf[MyClass2])) val sc = new SparkContext(conf) {code} The *registerKryoClasses* method's parameter type should be *Array[Class[_]]*, not *Seq[Class[_]]*. The correct code snippet is {code} val conf = new SparkConf().setMaster(...).setAppName(...) conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2])) val sc = new SparkContext(conf) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-4258) NPE with new Parquet Filters
[ https://issues.apache.org/jira/browse/SPARK-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-4258: - Oops this got closed by your documentation fix. Reopening. NPE with new Parquet Filters Key: SPARK-4258 URL: https://issues.apache.org/jira/browse/SPARK-4258 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical Fix For: 1.2.0 {code} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in stage 21.0 (TID 160, ip-10-0-247-144.us-west-2.compute.internal): java.lang.NullPointerException: parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:206) parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:162) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:100) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47) parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:210) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47) parquet.filter2.predicate.Operators$Or.accept(Operators.java:302) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:201) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47) parquet.filter2.predicate.Operators$And.accept(Operators.java:290) parquet.filter2.statisticslevel.StatisticsFilter.canDrop(StatisticsFilter.java:52) parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:46) parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22) parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108) parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28) parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158) parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138) {code} This occurs when reading parquet data encoded with the older version of the library for TPC-DS query 34. Will work on coming up with a smaller reproduction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4960) Interceptor pattern in receivers
[ https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265570#comment-14265570 ] Cody Koeninger commented on SPARK-4960: --- At that point, it sounds like you're talking about an early filter rather than an early flatmap. Should it just be T => Option[T]? Since this ticket no longer solves the problem raised by SPARK-3146, and it seems unlikely that patch will ever get merged, what is the concrete plan for giving users of receiver-based kafka implementations early access to MessageAndMetadata? Interceptor pattern in receivers Key: SPARK-4960 URL: https://issues.apache.org/jira/browse/SPARK-4960 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Sometimes it is good to intercept a message received through a receiver and modify / do something with the message before it is stored into Spark. This is often referred to as the interceptor pattern. There should be a general way to specify an interceptor function that gets applied to all receivers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
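A minimal sketch of the relationship between the two shapes discussed above: an early filter ({{T => Option[T]}}) is the special case of an early flatmap ({{T => Iterator[T]}}) that emits at most one record, so either shape can be offered without losing the other.
{code}
// Adapt a filter-shaped interceptor into the flatmap-shaped one.
def asFlatMap[T](filter: T => Option[T]): T => Iterator[T] =
  (t: T) => filter(t).iterator

// e.g. dropping blank lines before storage:
val dropBlanks: String => Option[String] =
  s => if (s.trim.nonEmpty) Some(s) else None

val interceptor: String => Iterator[String] = asFlatMap(dropBlanks)
{code}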
[jira] [Created] (SPARK-5096) SparkBuild.scala assumes you are at the spark root dir
Michael Armbrust created SPARK-5096: --- Summary: SparkBuild.scala assumes you are at the spark root dir Key: SPARK-5096 URL: https://issues.apache.org/jira/browse/SPARK-5096 Project: Spark Issue Type: Bug Components: Build Reporter: Michael Armbrust Assignee: Michael Armbrust This is bad because it breaks compiling spark as an external project ref and is generally bad SBT practice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5096) SparkBuild.scala assumes you are at the spark root dir
[ https://issues.apache.org/jira/browse/SPARK-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265666#comment-14265666 ] Apache Spark commented on SPARK-5096: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/3905 SparkBuild.scala assumes you are at the spark root dir -- Key: SPARK-5096 URL: https://issues.apache.org/jira/browse/SPARK-5096 Project: Spark Issue Type: Bug Components: Build Reporter: Michael Armbrust Assignee: Michael Armbrust This is bad because it breaks compiling spark as an external project ref and is generally bad SBT practice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
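A minimal sbt 0.13 sketch of the fix direction (illustrative settings, not the actual SparkBuild.scala change): resolve paths against {{baseDirectory.value}} rather than the JVM working directory, so the build still works when Spark is referenced as an external project.
{code}
import sbt._
import Keys._

object ExampleBuild extends Build {
  lazy val core = Project("core", file("core")).settings(
    // Bad: new File("conf") resolves against the process's current directory.
    // Good: anchor the path on the project's own base directory.
    unmanagedResourceDirectories in Compile += baseDirectory.value / "conf"
  )
}
{code}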
[jira] [Commented] (SPARK-5085) netty shuffle service causing connection timeouts
[ https://issues.apache.org/jira/browse/SPARK-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265642#comment-14265642 ] Stephen Haberman commented on SPARK-5085: - I've confirmed that adding {{sudo ethtool -K eth0 tso off}} and {{sudo ethtool -K eth0 sg off}} to our cluster setup script fixed the issue, and the job runs perfectly now. (I had initially tried only tso off; that did not fix it, but tso off + sg off did. I have not tried only sg off; from running ethtool, my naive/quick interpretation was that sg off implies tso off, but don't hold me to that.) Closing this ticket. netty shuffle service causing connection timeouts - Key: SPARK-5085 URL: https://issues.apache.org/jira/browse/SPARK-5085 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Environment: EMR, transient cluster of 10 m3.2xlarges, spark 1.2.0 Here's our spark-defaults: {code} spark.master spark://$MASTER_IP:7077 spark.eventLog.enabled true spark.eventLog.dir /mnt/spark/work/history spark.serializer org.apache.spark.serializer.KryoSerializer spark.executor.memory ${EXECUTOR_MEM}m spark.core.connection.ack.wait.timeout 600 spark.storage.blockManagerSlaveTimeoutMs 6 spark.shuffle.consolidateFiles true spark.shuffle.service.enabled false spark.shuffle.blockTransferService nio # works with nio, fails with netty # Use snappy because LZF uses ~100-300k buffer per block spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec spark.shuffle.file.buffer.kb 10 spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError -Xss2m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRati... spark.akka.logLifecycleEvents true spark.akka.timeout 360 spark.akka.askTimeout 120 spark.akka.lookupTimeout 120 spark.akka.frameSize 100 spark.files.userClassPathFirst true spark.shuffle.memoryFraction 0.5 spark.storage.memoryFraction 0.2 {code} Reporter: Stephen Haberman In Spark 1.2.0, the netty backend is causing our report's cluster to lock up with connection timeouts, ~75% of the way through the job. It happens with both the external shuffle server and the non-external/executor-hosted shuffle server, but if I change the shuffle service from netty to nio, it immediately works. Here's log output from one executor (I turned on trace output for the network package and ShuffleBlockFetcherIterator; all executors in the cluster have basically the same pattern of ~15m of silence then timeouts): {code} // lots of log output, doing fine... 
15/01/03 05:33:39 TRACE [shuffle-server-0] protocol.MessageDecoder (MessageDecoder.java:decode(42)) - Received message ChunkFetchRequest: ChunkFetchRequest{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}} 15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:processFetchRequest(107)) - Received req from /10.169.175.179:57056 to fetch block StreamChunkId{streamId=1465867812750, chunkIndex=170} 15/01/03 05:33:39 TRACE [shuffle-server-0] server.OneForOneStreamManager (OneForOneStreamManager.java:getChunk(75)) - Removing stream id 1465867812750 15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(152)) - Sent result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}, buffer=FileSegmentManagedBuffer{file=/mnt1/spark/local/spark-local-20150103040327-c554/28/shuffle_4_1723_0.data, offset=4574685, length=20939}} to client /10.169.175.179:57056 // note 15m of silence here... 15/01/03 05:48:13 WARN [shuffle-server-7] server.TransportChannelHandler (TransportChannelHandler.java:exceptionCaught(66)) - Exception in connection from /10.33.166.218:42780 java.io.IOException: Connection timed out at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
[jira] [Closed] (SPARK-5085) netty shuffle service causing connection timeouts
[ https://issues.apache.org/jira/browse/SPARK-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Haberman closed SPARK-5085. --- Resolution: Invalid netty shuffle service causing connection timeouts - Key: SPARK-5085 URL: https://issues.apache.org/jira/browse/SPARK-5085 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Environment: EMR, transient cluster of 10 m3.2xlarges, spark 1.2.0 Here's our spark-defaults: {code} spark.master spark://$MASTER_IP:7077 spark.eventLog.enabled true spark.eventLog.dir /mnt/spark/work/history spark.serializer org.apache.spark.serializer.KryoSerializer spark.executor.memory ${EXECUTOR_MEM}m spark.core.connection.ack.wait.timeout 600 spark.storage.blockManagerSlaveTimeoutMs 6 spark.shuffle.consolidateFiles true spark.shuffle.service.enabled false spark.shuffle.blockTransferService nio # works with nio, fails with netty # Use snappy because LZF uses ~100-300k buffer per block spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec spark.shuffle.file.buffer.kb 10 spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError -Xss2m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRati... spark.akka.logLifecycleEvents true spark.akka.timeout 360 spark.akka.askTimeout 120 spark.akka.lookupTimeout 120 spark.akka.frameSize 100 spark.files.userClassPathFirst true spark.shuffle.memoryFraction 0.5 spark.storage.memoryFraction 0.2 {code} Reporter: Stephen Haberman In Spark 1.2.0, the netty backend is causing our report's cluster to lock up with connection timeouts, ~75% of the way through the job. It happens with both the external shuffle server and the non-external/executor-hosted shuffle server, but if I change the shuffle service from netty to nio, it immediately works. Here's log output from one executor (I turned on trace output for the network package and ShuffleBlockFetcherIterator; all executors in the cluster have basically the same pattern of ~15m of silence then timeouts): {code} // lots of log output, doing fine... 15/01/03 05:33:39 TRACE [shuffle-server-0] protocol.MessageDecoder (MessageDecoder.java:decode(42)) - Received message ChunkFetchRequest: ChunkFetchRequest{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}} 15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:processFetchRequest(107)) - Received req from /10.169.175.179:57056 to fetch block StreamChunkId{streamId=1465867812750, chunkIndex=170} 15/01/03 05:33:39 TRACE [shuffle-server-0] server.OneForOneStreamManager (OneForOneStreamManager.java:getChunk(75)) - Removing stream id 1465867812750 15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(152)) - Sent result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}, buffer=FileSegmentManagedBuffer{file=/mnt1/spark/local/spark-local-20150103040327-c554/28/shuffle_4_1723_0.data, offset=4574685, length=20939}} to client /10.169.175.179:57056 // note 15m of silence here... 
15/01/03 05:48:13 WARN [shuffle-server-7] server.TransportChannelHandler (TransportChannelHandler.java:exceptionCaught(66)) - Exception in connection from /10.33.166.218:42780 java.io.IOException: Connection timed out at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:745) 15/01/03 05:48:13 ERROR [shuffle-server-7] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(154)) - Error sending result
[jira] [Commented] (SPARK-4960) Interceptor pattern in receivers
[ https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265589#comment-14265589 ] Tathagata Das commented on SPARK-4960: -- That is a good question. What we could do is provide a new API in KafkaUtils for that. Though SPARK-4964 also tries to add more interfaces to KafkaUtils. I have to brainstorm a little bit about how all this should be organized. Interceptor pattern in receivers Key: SPARK-4960 URL: https://issues.apache.org/jira/browse/SPARK-4960 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Sometimes it is good to intercept a message received through a receiver and modify or otherwise process the message before it is stored into Spark. This is often referred to as the interceptor pattern. There should be a general way to specify an interceptor function that gets applied to all receivers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
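To make the requested feature concrete, here is a minimal sketch of the interceptor pattern itself, written in Python purely for brevity; the real receiver API is JVM-side, and every name below is illustrative rather than a proposed Spark API.
{code}
# Hypothetical illustration: wrap a receiver's store() callback so that a
# user-supplied intercept() runs on every message before it is stored.
def with_interceptor(store, intercept):
    def store_intercepted(message):
        transformed = intercept(message)
        if transformed is not None:  # returning None lets the interceptor drop a message
            store(transformed)
    return store_intercepted

# Usage: every received message passes through the interceptor first.
received = []
store = with_interceptor(received.append, lambda m: m.strip().lower())
store("  Hello ")
store("WORLD")
print(received)  # ['hello', 'world']
{code}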
[jira] [Updated] (SPARK-5090) The improvement of python converter for hbase
[ https://issues.apache.org/jira/browse/SPARK-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gen TANG updated SPARK-5090: Description: The python converter `HBaseResultToStringConverter` provided in HBaseConverter.scala returns only the value of the first column in the result. This limits the utility of the converter, because it returns only one value per row (there may be several versions in HBase) and moreover it loses the other information in the record, such as column:cell and timestamp. Here we would like to propose an improvement to the python converter so that it returns all the records in the result (in a single string) with more complete information. We would also like to make some improvements to hbase_inputformat.py was: The python converter `HBaseResultToStringConverter` provided in HBaseConverter.scala returns only the value of the first column in the result. This limits the utility of the converter, because it returns only one value per row (there may be several versions in HBase) and moreover it loses the other information in the record, such as column:cell and timestamp. Here we would like to propose an improvement to the python converter so that it returns all the records in the result (in a single string) with more complete information. The improvement of python converter for hbase - Key: SPARK-5090 URL: https://issues.apache.org/jira/browse/SPARK-5090 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Gen TANG Labels: hbase, python Fix For: 1.2.1 Original Estimate: 168h Remaining Estimate: 168h The python converter `HBaseResultToStringConverter` provided in HBaseConverter.scala returns only the value of the first column in the result. This limits the utility of the converter, because it returns only one value per row (there may be several versions in HBase) and moreover it loses the other information in the record, such as column:cell and timestamp. Here we would like to propose an improvement to the python converter so that it returns all the records in the result (in a single string) with more complete information. We would also like to make some improvements to hbase_inputformat.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
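To make the proposal concrete, here is a sketch of how hbase_inputformat.py might consume such an enriched record. The delimiter format and field layout below are assumptions invented for illustration only; they are not the actual converter output.
{code}
# Hypothetical enriched record: one string per row of ';'-separated cells,
# each cell carrying "columnFamily:qualifier,timestamp,value".
record = "cf:name,1420416000000,alice;cf:age,1420416000123,42"

cells = []
for part in record.split(";"):
    column, timestamp, value = part.split(",", 2)
    cells.append({"column": column, "timestamp": int(timestamp), "value": value})

print(cells)
# [{'column': 'cf:name', 'timestamp': 1420416000000, 'value': 'alice'},
#  {'column': 'cf:age', 'timestamp': 1420416000123, 'value': '42'}]
{code}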
[jira] [Resolved] (SPARK-5057) Log message in failed askWithReply attempts
[ https://issues.apache.org/jira/browse/SPARK-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5057. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3875 [https://github.com/apache/spark/pull/3875] Log message in failed askWithReply attempts --- Key: SPARK-5057 URL: https://issues.apache.org/jira/browse/SPARK-5057 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: WangTaoTheTonic Priority: Minor Fix For: 1.3.0 As askWithReply is used in many cases, it is better for analysis to print the contents of the message after an attempt fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
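A minimal sketch of the improvement being described, in Python for brevity; the real code lives in Spark's Scala core, and the names here are illustrative, not Spark's actual API.
{code}
import logging

def ask_with_reply(send, message, attempts=3):
    """Retry send(message), logging the message contents on each failure
    so a failed attempt can actually be diagnosed from the logs."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return send(message)
        except Exception as e:  # illustrative catch-all for the sketch
            last_error = e
            # The improvement: log the message itself, not just the error.
            logging.warning("Attempt %d failed to get reply for message %r: %s",
                            attempt, message, e)
    raise RuntimeError(f"All {attempts} attempts failed for {message!r}") from last_error
{code}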
[jira] [Updated] (SPARK-5057) Log message in failed askWithReply attempts
[ https://issues.apache.org/jira/browse/SPARK-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5057: -- Summary: Log message in failed askWithReply attempts (was: Add more details in log when using actor to get infos) Log message in failed askWithReply attempts --- Key: SPARK-5057 URL: https://issues.apache.org/jira/browse/SPARK-5057 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: WangTaoTheTonic Priority: Minor Fix For: 1.3.0 As askWithReply is used in many cases, it is better for analysis to print the contents of the message after an attempt fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5057) Log message in failed askWithReply attempts
[ https://issues.apache.org/jira/browse/SPARK-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5057: -- Assignee: WangTaoTheTonic Log message in failed askWithReply attempts --- Key: SPARK-5057 URL: https://issues.apache.org/jira/browse/SPARK-5057 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Priority: Minor Fix For: 1.3.0 As askWithReply is used in many cases, it is better for analysis to print the contents of the message after an attempt fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4465) runAsSparkUser doesn't affect TaskRunner in Mesos environment at all.
[ https://issues.apache.org/jira/browse/SPARK-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4465: -- Assignee: Jongyoul Lee runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - Key: SPARK-4465 URL: https://issues.apache.org/jira/browse/SPARK-4465 Project: Spark Issue Type: Bug Components: Deploy, Input/Output, Mesos Affects Versions: 1.2.0 Reporter: Jongyoul Lee Assignee: Jongyoul Lee Priority: Critical Fix For: 1.3.0, 1.2.1 runAsSparkUser enables classes that use the Hadoop library to change the active user to the Spark user; however, in the Mesos environment the function is called before the task runs within JNI, so runAsSparkUser doesn't affect anything and the code is meaningless. Fix the appropriate scope of the function and remove the meaningless code. This is a bug because the program runs incorrectly. It is related to SPARK-3223, but setting the framework user was not a perfect solution in my tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4465) runAsSparkUser doesn't affect TaskRunner in Mesos environment at all.
[ https://issues.apache.org/jira/browse/SPARK-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4465. --- Resolution: Fixed Issue resolved by pull request 3741 [https://github.com/apache/spark/pull/3741] runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - Key: SPARK-4465 URL: https://issues.apache.org/jira/browse/SPARK-4465 Project: Spark Issue Type: Bug Components: Deploy, Input/Output, Mesos Affects Versions: 1.2.0 Reporter: Jongyoul Lee Priority: Critical Fix For: 1.3.0, 1.2.1 runAsSparkUser enables classes that use the Hadoop library to change the active user to the Spark user; however, in the Mesos environment the function is called before the task runs within JNI, so runAsSparkUser doesn't affect anything and the code is meaningless. Fix the appropriate scope of the function and remove the meaningless code. This is a bug because the program runs incorrectly. It is related to SPARK-3223, but setting the framework user was not a perfect solution in my tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5089) Vector conversion broken for non-float64 arrays
[ https://issues.apache.org/jira/browse/SPARK-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265025#comment-14265025 ] Apache Spark commented on SPARK-5089: - User 'freeman-lab' has created a pull request for this issue: https://github.com/apache/spark/pull/3902 Vector conversion broken for non-float64 arrays --- Key: SPARK-5089 URL: https://issues.apache.org/jira/browse/SPARK-5089 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Jeremy Freeman Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to {{DenseVectors}}. If the data are numpy arrays with dtype {{float64}} this works. If data are numpy arrays with lower precision (e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but due to a small bug in this line this currently doesn't happen (the cast is not performed in place, so its result is discarded).
{code:none}
if ar.dtype != np.float64:
    ar.astype(np.float64)
{code}
Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results:
{code:none}
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
5 # should be 10!
{code}
But this works fine:
{code:none}
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
10 # this is correct
{code}
The fix is trivial, I'll submit a PR shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
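A sketch of the one-line fix the reporter describes: since numpy's {{astype}} returns a new array rather than casting in place, the result must be assigned back.
{code}
import numpy as np

ar = np.arange(5, dtype=np.float16)
if ar.dtype != np.float64:
    ar = ar.astype(np.float64)  # assign the result; astype does not cast in place
assert ar.dtype == np.float64
{code}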
[jira] [Created] (SPARK-5091) Hooks for PySpark tasks
Davies Liu created SPARK-5091: - Summary: Hooks for PySpark tasks Key: SPARK-5091 URL: https://issues.apache.org/jira/browse/SPARK-5091 Project: Spark Issue Type: New Feature Components: PySpark Reporter: Davies Liu Currently, it's not convenient to add a package on an executor to PYTHONPATH (we do not assume the environments of the driver and executors are identical). It would be nice to have a hook called before/after every task; the user could then manipulate sys.path in a pre-task hook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
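A hypothetical sketch of what such a pre-task hook might look like; the hook registration shown in the comment is invented for illustration, since no such interface exists yet (that is exactly what this issue proposes).
{code}
import sys

# Hypothetical pre-task hook: make an extra package directory importable
# on the executor before the task body runs.
def add_deps_to_path():
    extra = "/opt/spark-python-deps"  # illustrative path, not a Spark convention
    if extra not in sys.path:
        sys.path.insert(0, extra)

# Hypothetical registration API (does not exist; this is the proposal):
# sc.addTaskHook(pre=add_deps_to_path, post=None)
{code}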
[jira] [Created] (SPARK-5092) Selecting from a nested structure with SparkSQL should return a nested structure
Brad Willard created SPARK-5092: --- Summary: Selecting from a nested structure with SparkSQL should return a nested structure Key: SPARK-5092 URL: https://issues.apache.org/jira/browse/SPARK-5092 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Brad Willard Priority: Minor When running a SparkSQL query like this (at least on a json dataset): select rid, meta_data.name from a_table the rows returned lose the nested structure. I receive a row like Row(rid='123', name='delete') instead of Row(rid='123', meta_data=Row(name='data')). I personally think this is confusing, especially when programmatically building and executing queries and then parsing the results to find your data in a new structure. I could understand how that's less desirable in some situations, but you could get around it by supporting 'as': if you wanted to skip the nested structure, simply write select rid, meta_data.name as name from a_table -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4850) GROUP BY can't work if the schema of SchemaRDD contains struct or array type
[ https://issues.apache.org/jira/browse/SPARK-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264862#comment-14264862 ] Cheng Lian commented on SPARK-4850: --- [~debugger87] The query you provided in the description is actually invalid, because no aggregation function is involved. The expanded {{*}} clearly contains expressions that are not in the group by clause (all attributes except for {{a}}). If you try something like:
{code}
val t = sqlContext.sql("select a, count(*) from t group by a")
{code}
or
{code}
val t = sqlContext.sql("select arr, count(*) from t group by arr")
{code}
then everything is fine. GROUP BY can't work if the schema of SchemaRDD contains struct or array type -- Key: SPARK-4850 URL: https://issues.apache.org/jira/browse/SPARK-4850 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.2 Reporter: Chaozhong Yang Assignee: Cheng Lian Labels: group, sql Original Estimate: 96h Remaining Estimate: 96h Code in the Spark shell as follows:
{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val path = "path/to/json"
sqlContext.jsonFile(path).register("Table")
val t = sqlContext.sql("select * from Table group by a")
t.collect
{code}
Let's look into the schema of `Table`:
{code}
root
 |-- a: integer (nullable = true)
 |-- arr: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- createdAt: string (nullable = true)
 |-- f: struct (nullable = true)
 |    |-- __type: string (nullable = true)
 |    |-- className: string (nullable = true)
 |    |-- objectId: string (nullable = true)
 |-- objectId: string (nullable = true)
 |-- s: string (nullable = true)
 |-- updatedAt: string (nullable = true)
{code}
The following exception will be thrown:
{code}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: arr#9, tree:
Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14]
 Subquery TestImport
  LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], MappedRDD[18] at map at JsonRDD.scala:47
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
[jira] [Closed] (SPARK-4850) GROUP BY can't work if the schema of SchemaRDD contains struct or array type
[ https://issues.apache.org/jira/browse/SPARK-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian closed SPARK-4850. - Resolution: Invalid Not a bug. GROUP BY can't work if the schema of SchemaRDD contains struct or array type -- Key: SPARK-4850 URL: https://issues.apache.org/jira/browse/SPARK-4850 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.2 Reporter: Chaozhong Yang Assignee: Cheng Lian Labels: group, sql Original Estimate: 96h Remaining Estimate: 96h Code in the Spark shell as follows:
{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val path = "path/to/json"
sqlContext.jsonFile(path).register("Table")
val t = sqlContext.sql("select * from Table group by a")
t.collect
{code}
Let's look into the schema of `Table`:
{code}
root
 |-- a: integer (nullable = true)
 |-- arr: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- createdAt: string (nullable = true)
 |-- f: struct (nullable = true)
 |    |-- __type: string (nullable = true)
 |    |-- className: string (nullable = true)
 |    |-- objectId: string (nullable = true)
 |-- objectId: string (nullable = true)
 |-- s: string (nullable = true)
 |-- updatedAt: string (nullable = true)
{code}
The following exception will be thrown:
{code}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: arr#9, tree:
Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14]
 Subquery TestImport
  LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], MappedRDD[18] at map at JsonRDD.scala:47
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:17)
at $iwC$$iwC$$iwC.<init>(<console>:22)
at $iwC$$iwC.<init>(<console>:24)
at $iwC.<init>(<console>:26)
at <init>(<console>:28)
at
[jira] [Resolved] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-4296. --- Resolution: Duplicate Fix Version/s: 1.2.0 Target Version/s: 1.2.0 (was: 1.3.0) This issue is a duplicate of SPARK-4322, which has already been fixed in 1.2.0. Throw Expression not in GROUP BY when using same expression in group by clause and select clause --- Key: SPARK-4296 URL: https://issues.apache.org/jira/browse/SPARK-4296 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Shixiong Zhu Assignee: Cheng Lian Priority: Critical Fix For: 1.2.0 When the input data has a complex structure, using the same expression in the group by clause and the select clause will throw "Expression not in GROUP BY".
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Birthday(date: String)
case class Person(name: String, birthday: Birthday)
val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28"))))
people.registerTempTable("people")
val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
year.collect
{code}
Here is the plan of year:
{code:java}
SchemaRDD[3] at RDD at SchemaRDD.scala:105
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3]
 Subquery people
  LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36
{code}
The bug is in the equality test for `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`. Maybe Spark SQL needs a mechanism to compare an Alias expression with a non-Alias expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
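A toy model of the comparison mechanism the reporter suggests, written in Python; the classes below are stand-ins for Catalyst's expression tree, purely for illustration: compare expressions after stripping the alias. (In the real tree the alias can also be nested inside another expression, so an actual fix would need to strip aliases recursively.)
{code}
from dataclasses import dataclass

@dataclass(frozen=True)
class Expr:
    sql: str          # stand-in for a real expression tree

@dataclass(frozen=True)
class Alias:
    child: Expr
    name: str

def strip_alias(e):
    # an aliased expression should compare equal to its un-aliased child
    return e.child if isinstance(e, Alias) else e

def same_expression(a, b):
    return strip_alias(a) == strip_alias(b)

grouped = Expr("Upper(birthday#1.date)")
selected = Alias(Expr("Upper(birthday#1.date)"), "c1#3")
assert same_expression(grouped, selected)  # a naive == on these would fail
{code}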
[jira] [Reopened] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reopened SPARK-4296: --- Sorry, missed David's comment below. Throw Expression not in GROUP BY when using same expression in group by clause and select clause --- Key: SPARK-4296 URL: https://issues.apache.org/jira/browse/SPARK-4296 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Shixiong Zhu Assignee: Cheng Lian Priority: Critical Fix For: 1.2.0 When the input data has a complex structure, using the same expression in the group by clause and the select clause will throw "Expression not in GROUP BY".
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Birthday(date: String)
case class Person(name: String, birthday: Birthday)
val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28"))))
people.registerTempTable("people")
val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
year.collect
{code}
Here is the plan of year:
{code:java}
SchemaRDD[3] at RDD at SchemaRDD.scala:105
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3]
 Subquery people
  LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36
{code}
The bug is in the equality test for `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`. Maybe Spark SQL needs a mechanism to compare an Alias expression with a non-Alias expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4688) Have a single shared network timeout in Spark
[ https://issues.apache.org/jira/browse/SPARK-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4688. Resolution: Fixed Fix Version/s: 1.3.0 Have a single shared network timeout in Spark - Key: SPARK-4688 URL: https://issues.apache.org/jira/browse/SPARK-4688 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Priority: Critical Fix For: 1.3.0 We have several different timeouts, but in most cases users just want to set something that is large enough that they can avoid GC pauses. We should have a single conf spark.network.timeout that is used as the default timeout for all network interactions. This can replace the following timeouts:
{code}
spark.core.connection.ack.wait.timeout
spark.akka.timeout
spark.storage.blockManagerSlaveTimeoutMs (undocumented)
spark.shuffle.io.connectionTimeout (undocumented)
{code}
Of course, for compatibility we should respect the old ones when they are set. This idea was proposed originally by [~rxin] and I'm paraphrasing his suggestion here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
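A sketch of the compatibility fallback described above, in Python pseudocode; the helper name and the default value are illustrative assumptions, not Spark's actual resolution code.
{code}
# Resolve a legacy timeout against the proposed shared default:
# an explicitly-set old key wins; otherwise fall back to spark.network.timeout.
def resolve_timeout(conf, legacy_key, default="120s"):
    if legacy_key in conf:  # old conf still respected when set
        return conf[legacy_key]
    return conf.get("spark.network.timeout", default)

conf = {"spark.network.timeout": "300s"}
print(resolve_timeout(conf, "spark.akka.timeout"))  # 300s (shared default)
conf["spark.core.connection.ack.wait.timeout"] = "600s"
print(resolve_timeout(conf, "spark.core.connection.ack.wait.timeout"))  # 600s (legacy wins)
{code}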
[jira] [Commented] (SPARK-4387) Refactoring python profiling code to make it extensible
[ https://issues.apache.org/jira/browse/SPARK-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264919#comment-14264919 ] Apache Spark commented on SPARK-4387: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3901 Refactoring python profiling code to make it extensible --- Key: SPARK-4387 URL: https://issues.apache.org/jira/browse/SPARK-4387 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.1.0 Reporter: Yandu Oppacher SPARK-3478 introduced python profiling for workers, which is great, but it would be nice to be able to change the profiler and output formats as needed. This is a refactoring of the code to allow that to happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
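A hedged sketch of the kind of extensibility being proposed: a pluggable profiler base class whose subclasses control how profiles are collected and rendered. The class and method names are illustrative, not the final PySpark API.
{code}
import cProfile
import io
import pstats

class Profiler:
    """Illustrative base class: subclass to change profiling or output format."""
    def profile(self, func, *args, **kwargs):
        raise NotImplementedError
    def show(self):
        raise NotImplementedError

class BasicProfiler(Profiler):
    def __init__(self):
        self._profile = cProfile.Profile()
    def profile(self, func, *args, **kwargs):
        # run the task body under cProfile and return its result
        return self._profile.runcall(func, *args, **kwargs)
    def show(self):
        # render the collected stats; a subclass could dump to a file instead
        out = io.StringIO()
        pstats.Stats(self._profile, stream=out).sort_stats("cumulative").print_stats(5)
        print(out.getvalue())

p = BasicProfiler()
p.profile(sum, range(1_000_000))
p.show()
{code}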
[jira] [Commented] (SPARK-4905) Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream
[ https://issues.apache.org/jira/browse/SPARK-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264936#comment-14264936 ] Hari Shreedharan commented on SPARK-4905: - Taking a look now. Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream - Key: SPARK-4905 URL: https://issues.apache.org/jira/browse/SPARK-4905 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Josh Rosen Labels: flaky-test It looks like the org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream test might be flaky ([link|https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24647/testReport/junit/org.apache.spark.streaming.flume/FlumeStreamSuite/flume_input_stream/]): {code} Error Message The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). Stacktrace sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). 
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:142) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply$mcV$sp(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(FlumeStreamSuite.scala:46) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.flume.FlumeStreamSuite.runTest(FlumeStreamSuite.scala:46) at
[jira] [Commented] (SPARK-4688) Have a single shared network timeout in Spark
[ https://issues.apache.org/jira/browse/SPARK-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264942#comment-14264942 ] Varun Saxena commented on SPARK-4688: - Thanks [~rxin] for the commit. Mind assigning the issue to me? Have a single shared network timeout in Spark - Key: SPARK-4688 URL: https://issues.apache.org/jira/browse/SPARK-4688 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Priority: Critical Fix For: 1.3.0 We have several different timeouts, but in most cases users just want to set something that is large enough that they can avoid GC pauses. We should have a single conf spark.network.timeout that is used as the default timeout for all network interactions. This can replace the following timeouts:
{code}
spark.core.connection.ack.wait.timeout
spark.akka.timeout
spark.storage.blockManagerSlaveTimeoutMs (undocumented)
spark.shuffle.io.connectionTimeout (undocumented)
{code}
Of course, for compatibility we should respect the old ones when they are set. This idea was proposed originally by [~rxin] and I'm paraphrasing his suggestion here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4688) Have a single shared network timeout in Spark
[ https://issues.apache.org/jira/browse/SPARK-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264942#comment-14264942 ] Varun Saxena edited comment on SPARK-4688 at 1/5/15 7:20 PM: - Thanks [~rxin] for the commit. Can you assign this issue to me? was (Author: varun_saxena): Thanks [~rxin] for the commit. Mind assigning the issue to me? Have a single shared network timeout in Spark - Key: SPARK-4688 URL: https://issues.apache.org/jira/browse/SPARK-4688 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Priority: Critical Fix For: 1.3.0 We have several different timeouts, but in most cases users just want to set something that is large enough that they can avoid GC pauses. We should have a single conf spark.network.timeout that is used as the default timeout for all network interactions. This can replace the following timeouts:
{code}
spark.core.connection.ack.wait.timeout
spark.akka.timeout
spark.storage.blockManagerSlaveTimeoutMs (undocumented)
spark.shuffle.io.connectionTimeout (undocumented)
{code}
Of course, for compatibility we should respect the old ones when they are set. This idea was proposed originally by [~rxin] and I'm paraphrasing his suggestion here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4757) Yarn-client failed to start due to Wrong FS error in distCacheMgr.addResource
[ https://issues.apache.org/jira/browse/SPARK-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264979#comment-14264979 ] Chris Albright commented on SPARK-4757: --- Is there an ETA on when this might make it into a release? Or can we check out a commit and build locally? I don't see a commit hash for the fix. Yarn-client failed to start due to Wrong FS error in distCacheMgr.addResource - Key: SPARK-4757 URL: https://issues.apache.org/jira/browse/SPARK-4757 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0 Reporter: Jianshi Huang Fix For: 1.2.0, 1.3.0 I got the following error during Spark startup (Yarn-client mode):
14/12/04 19:33:58 INFO Client: Uploading resource file:/x/home/jianshuang/spark/spark-latest/lib/datanucleus-api-jdo-3.2.6.jar - hdfs://stampy/user/jianshuang/.sparkStaging/application_1404410683830_531767/datanucleus-api-jdo-3.2.6.jar
java.lang.IllegalArgumentException: Wrong FS: hdfs://stampy/user/jianshuang/.sparkStaging/application_1404410683830_531767/datanucleus-api-jdo-3.2.6.jar, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:79)
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:506)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67)
at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:257)
at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:242)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:242)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:35)
at org.apache.spark.deploy.yarn.ClientBase$class.createContainerLaunchContext(ClientBase.scala:350)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:35)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:80)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:140)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:335)
at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:986)
at $iwC$$iwC.<init>(<console>:9)
at $iwC.<init>(<console>:18)
at <init>(<console>:20)
at .<init>(<console>:24)
According to Liancheng and Andrew, this hotfix might be the root cause: https://github.com/apache/spark/commit/38cb2c3a36a5c9ead4494cbc3dde008c2f0698ce Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4688) Have a single shared network timeout in Spark
[ https://issues.apache.org/jira/browse/SPARK-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-4688: --- Assignee: Varun Saxena Have a single shared network timeout in Spark - Key: SPARK-4688 URL: https://issues.apache.org/jira/browse/SPARK-4688 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Varun Saxena Priority: Critical Fix For: 1.3.0 We have several different timeouts, but in most cases users just want to set something that is large enough that they can avoid GC pauses. We should have a single conf spark.network.timeout that is used as the default timeout for all network interactions. This can replace the following timeouts:
{code}
spark.core.connection.ack.wait.timeout
spark.akka.timeout
spark.storage.blockManagerSlaveTimeoutMs (undocumented)
spark.shuffle.io.connectionTimeout (undocumented)
{code}
Of course, for compatibility we should respect the old ones when they are set. This idea was proposed originally by [~rxin] and I'm paraphrasing his suggestion here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4688) Have a single shared network timeout in Spark
[ https://issues.apache.org/jira/browse/SPARK-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264987#comment-14264987 ] Reynold Xin commented on SPARK-4688: Done - thanks for doing this. Have a single shared network timeout in Spark - Key: SPARK-4688 URL: https://issues.apache.org/jira/browse/SPARK-4688 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Varun Saxena Priority: Critical Fix For: 1.3.0 We have several different timeouts, but in most cases users just want to set something that is large enough that they can avoid GC pauses. We should have a single conf spark.network.timeout that is used as the default timeout for all network interactions. This can replace the following timeouts:
{code}
spark.core.connection.ack.wait.timeout
spark.akka.timeout
spark.storage.blockManagerSlaveTimeoutMs (undocumented)
spark.shuffle.io.connectionTimeout (undocumented)
{code}
Of course, for compatibility we should respect the old ones when they are set. This idea was proposed originally by [~rxin] and I'm paraphrasing his suggestion here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5090) The improvement of python converter for hbase
Gen TANG created SPARK-5090: --- Summary: The improvement of python converter for hbase Key: SPARK-5090 URL: https://issues.apache.org/jira/browse/SPARK-5090 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Gen TANG Fix For: 1.2.1 The python converter `HBaseResultToStringConverter` provided in HBaseConverter.scala returns only the value of the first column in the result. This limits the utility of the converter, because it returns only one value per row (there may be several versions in HBase) and moreover it loses the other information in the record, such as column:cell and timestamp. Here we would like to propose an improvement to the python converter so that it returns all the records in the result (in a single string) with more complete information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4943) Parsing error for query with table name having dot
[ https://issues.apache.org/jira/browse/SPARK-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265089#comment-14265089 ] Alex Liu commented on SPARK-4943: - The approach of
{code}
case class UnresolvedRelation(tableIdentifier: Seq[String], alias: Option[String])
{code}
is a little unclear: simply reading the code does not tell you what is stored in tableIdentifier. Another approach is storing catalog.cluster.database in databaseName and the table name in tableName, keeping the case class unchanged:
{code}
case class UnresolvedRelation(databaseName: Option[String], tableName: String, alias: Option[String])
{code}
so there are no API changes. If we keep clusterName as a separate parameter, then the API changes to
{code}
case class UnresolvedRelation(clusterName: Option[String], databaseName: Option[String], tableName: String, alias: Option[String])
{code}
and the Catalog API needs to change accordingly. Parsing error for query with table name having dot -- Key: SPARK-4943 URL: https://issues.apache.org/jira/browse/SPARK-4943 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Alex Liu When integrating Spark 1.2.0 with Cassandra SQL, the following query is broken. It worked in Spark 1.1.0. Basically we use a table name containing a dot to include the database name:
{code}
[info] java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but `.' found
[info]
[info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT test2.a FROM sql_test.test2 AS test2
[info] ^
[info] at scala.sys.package$.error(package.scala:27)
[info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
[info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
[info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
[info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
[info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
[info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
[info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info] at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
[info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
[info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
[info] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
[info] at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
[info] at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
[info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31)
[info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
[info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
[info] at scala.Option.getOrElse(Option.scala:120)
[info] at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83)
[info] at org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:53)
[info] at org.apache.spark.sql.cassandra.CassandraSQLContext.sql(CassandraSQLContext.scala:56)
[info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply$mcV$sp(CassandraSQLSpec.scala:169)
[info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168)
[info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168)
[info] at