[jira] [Created] (SPARK-6802) User Defined Aggregate Function Refactoring
cynepia created SPARK-6802: -- Summary: User Defined Aggregate Function Refactoring Key: SPARK-6802 URL: https://issues.apache.org/jira/browse/SPARK-6802 Project: Spark Issue Type: Improvement Components: PySpark, SQL Environment: We use Spark DataFrame, SQL along with json, sql and pandas quite a bit Reporter: cynepia While trying to use custom aggregates in Spark (something which is common in pandas), we realized that custom aggregate functions aren't well supported across various features/functions in Spark beyond what is supported by Hive. There are further discussions on the topic vis-à-vis SPARK-3947, which points to similar improvement tickets opened earlier for refactoring the UDAF area. While refactoring the interface for aggregates, it would make sense to take into consideration the recently added DataFrame, GroupedData, and possibly also sql.dataframe.Column, which looks different from pandas.Series and doesn't currently support any aggregations. We would like to get feedback from the folks who are actively looking at this, and would be happy to participate and contribute if there are any discussions on the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
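For context, a minimal Scala sketch of what built-in DataFrame aggregation supports today versus the custom-aggregate hook this ticket asks for; the column names and the commented-out call are hypothetical, not an existing API:

{code}
import org.apache.spark.sql.DataFrame

// Assumes df has (hypothetical) columns "department" and "salary".
def builtInVsCustom(df: DataFrame): Unit = {
  // Works today: built-in aggregate functions on GroupedData.
  df.groupBy("department").agg(Map("salary" -> "avg"))

  // What the ticket asks for (hypothetical, does not exist): plugging an
  // arbitrary user-defined aggregate into the same API, the way pandas
  // allows df.groupby("department").agg(my_custom_aggregate).
  // df.groupBy("department").agg(myCustomAggregate(df("salary")))
}
{code}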
[jira] [Commented] (SPARK-6028) Provide an alternative RPC implementation based on the network transport module
[ https://issues.apache.org/jira/browse/SPARK-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487329#comment-14487329 ] Andrea Peruffo commented on SPARK-6028: --- The main blocking problem in the implementation appears to be org.apache.spark.deploy.client.AppClient, which needs to be heavily refactored in order to avoid the use of Akka; an abstract remote naming convention also needs to be made available outside the Akka one (maybe hidden in the RpcEnv?). Also, the RpcEnv object is private and there is no way to provide an alternative implementation of the protocol because rpcEnvNames are hard-coded; in this case I would suggest using an assignable implicit argument instead of loading the class through reflection. Provide an alternative RPC implementation based on the network transport module --- Key: SPARK-6028 URL: https://issues.apache.org/jira/browse/SPARK-6028 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin The network transport module implements a low-level RPC interface. We can build a new RPC implementation on top of that to replace Akka's. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6799) Add dataframe examples for SparkR
[ https://issues.apache.org/jira/browse/SPARK-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6799: - Target Version/s: 1.4.0 Add dataframe examples for SparkR - Key: SPARK-6799 URL: https://issues.apache.org/jira/browse/SPARK-6799 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Priority: Critical We should add more data frame usage examples for SparkR . This can be similar to the python examples at https://github.com/apache/spark/blob/1b2aab8d5b9cc2ff702506038bd71aa8debe7ca0/examples/src/main/python/sql.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6772) spark sql error when running code on large number of records
[ https://issues.apache.org/jira/browse/SPARK-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486853#comment-14486853 ] Aditya Parmar commented on SPARK-6772: -- Please find the code below:

{code}
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.api.java.DataType;
import org.apache.spark.sql.api.java.JavaSchemaRDD;
import org.apache.spark.sql.api.java.Row;
import org.apache.spark.sql.api.java.StructField;
import org.apache.spark.sql.api.java.StructType;
import org.apache.spark.sql.hive.api.java.JavaHiveContext;

public class engineshow {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("Engine");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaHiveContext hContext = new JavaHiveContext(sc);

    String sch;
    List<StructField> fields;
    StructType schema;
    JavaRDD<Row> rowRDD;
    JavaRDD<String> input;
    JavaSchemaRDD[] inputs = new JavaSchemaRDD[2];

    sch = "a b c d e f g h i"; // input file schema
    input = sc.textFile("/home/aditya/stocks1.csv");
    fields = new ArrayList<StructField>();
    for (String fieldName : sch.split(" ")) {
      fields.add(DataType.createStructField(fieldName, DataType.StringType, true));
    }
    schema = DataType.createStructType(fields);
    rowRDD = input.map(new Function<String, Row>() {
      public Row call(String record) throws Exception {
        String[] fields = record.split(",");
        Object[] fields_converted = fields;
        return Row.create(fields_converted);
      }
    });
    inputs[0] = hContext.applySchema(rowRDD, schema);
    inputs[0].registerTempTable("comp1");

    sch = "a b";
    fields = new ArrayList<StructField>();
    for (String fieldName : sch.split(" ")) {
      fields.add(DataType.createStructField(fieldName, DataType.StringType, true));
    }
    schema = DataType.createStructType(fields);
    inputs[1] = hContext.sql("select a,b from comp1");
    inputs[1].saveAsTextFile("/home/aditya/outputog");
  }
}
{code}

spark sql error when running code on large number of records Key: SPARK-6772 URL: https://issues.apache.org/jira/browse/SPARK-6772 Project: Spark Issue Type: Test Components: SQL Affects Versions: 1.2.0 Reporter: Aditya Parmar Hi all, I am getting an ArrayIndexOutOfBoundsException when I try to run a simple column-filtering query on a file with 2.5 lakh (250,000) records. It runs fine on a file with 2k records.
{code} 15/04/08 16:54:06 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 3, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 2) on executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 1] 15/04/08 16:54:06 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 4, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.2 in stage 0.0 (TID 4) on executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 2] 15/04/08 16:54:06 INFO TaskSetManager: Starting task 1.3 in stage 0.0 (TID 5, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) 15/04/08 16:54:06 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 3) on executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 3] 15/04/08 16:54:06 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 6, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 5) on executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 4] 15/04/08 16:54:06 ERROR TaskSetManager: Task 1 in stage 0.0 failed 4 times; aborting job 15/04/08 16:54:06 INFO TaskSchedulerImpl: Cancelling stage 0 15/04/08 16:54:06 INFO TaskSchedulerImpl: Stage 0 was cancelled 15/04/08 16:54:06 INFO DAGScheduler: Job 0 failed: saveAsTextFile at JavaSchemaRDD.scala:42, took 1.914477 s Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in
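One plausible cause (an assumption on my part, not confirmed in the report): String.split(",") drops trailing empty strings, so rows in the larger file with missing trailing columns produce fewer fields than the 9-column schema, and Row access then throws ArrayIndexOutOfBoundsException. A minimal Scala sketch of a guard, with the column count assumed from the schema above:

{code}
import org.apache.spark.sql.Row

// Hypothetical sketch, not the reporter's code: always build rows of the
// expected width. split with limit -1 keeps trailing empty strings, and
// padTo/take force the field count to match the schema.
val numColumns = 9  // assumed width of the "a b c d e f g h i" schema

def toRow(record: String): Row = {
  val fields = record.split(",", -1)
  Row.fromSeq(fields.padTo(numColumns, "").take(numColumns))
}
{code}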
[jira] [Assigned] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark
[ https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2352: --- Assignee: Apache Spark (was: Bert Greevenbosch) [MLLIB] Add Artificial Neural Network (ANN) to Spark Key: SPARK-2352 URL: https://issues.apache.org/jira/browse/SPARK-2352 Project: Spark Issue Type: New Feature Components: MLlib Environment: MLLIB code Reporter: Bert Greevenbosch Assignee: Apache Spark It would be good if the Machine Learning Library contained Artificial Neural Networks (ANNs). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-765) Test suite should run Spark example programs
[ https://issues.apache.org/jira/browse/SPARK-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487060#comment-14487060 ] Yu Ishikawa commented on SPARK-765: --- I'm very sorry. I was able to run a test in spark.examples; the problem was that my IntelliJ setting was wrong. So we don't need to add the dependency. Thanks. Test suite should run Spark example programs Key: SPARK-765 URL: https://issues.apache.org/jira/browse/SPARK-765 Project: Spark Issue Type: New Feature Components: Examples Reporter: Josh Rosen The Spark test suite should also run each of the Spark example programs (the PySpark suite should do the same). This should be done through a shell script or other mechanism to simulate the environment setup used by end users that run those scripts. This would prevent problems like SPARK-764 from making it into releases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6806) SparkR examples in programming guide
[ https://issues.apache.org/jira/browse/SPARK-6806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-6806: - Assignee: Davies Liu SparkR examples in programming guide Key: SPARK-6806 URL: https://issues.apache.org/jira/browse/SPARK-6806 Project: Spark Issue Type: New Feature Components: Documentation, SparkR Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker Add R examples for Spark Core and DataFrame programming guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6801) spark-submit is not able to identify Alive Master in case of multiple master
[ https://issues.apache.org/jira/browse/SPARK-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6801. -- Resolution: Duplicate spark-submit is not able to identify Alive Master in case of multiple master Key: SPARK-6801 URL: https://issues.apache.org/jira/browse/SPARK-6801 Project: Spark Issue Type: Bug Affects Versions: 1.3.0 Reporter: pankaj Hi, while submitting an application using the command /bin/spark-submit --class SparkAggregator.java --deploy-mode cluster --supervise --master spark://host1:7077 I get the error "Can only accept driver submissions in ALIVE state. Current state: STANDBY." If I try giving all possible masters in --master, as in /bin/spark-submit --class SparkAggregator.java --deploy-mode cluster --supervise --master spark://host1:port1,host2:port2, it doesn't accept that. Thanks, Pankaj -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3947) [Spark SQL] UDAF Support
[ https://issues.apache.org/jira/browse/SPARK-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487093#comment-14487093 ] cynepia commented on SPARK-3947: Can someone give an update on where we stand on this issue? Also, would this be supported beyond SQL, for DataFrames? [Spark SQL] UDAF Support Key: SPARK-3947 URL: https://issues.apache.org/jira/browse/SPARK-3947 Project: Spark Issue Type: Improvement Components: SQL Reporter: Pei-Lun Lee Assignee: Venkata Ramana G Right now only Hive UDAFs are supported. It would be nice to have UDAFs, similar to UDFs, registered through SQLContext.registerFunction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
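For readers unfamiliar with the existing hook: scalar UDF registration already works roughly as below, and the request is for an analogous path for aggregates. A hedged Scala sketch; the registerAggregate call and the "people" table are purely hypothetical:

{code}
import org.apache.spark.sql.SQLContext

def example(sqlContext: SQLContext): Unit = {
  // Existing: scalar UDFs can be registered and then used from SQL text.
  sqlContext.udf.register("toUpper", (s: String) => s.toUpperCase)
  sqlContext.sql("SELECT toUpper(name) FROM people")  // assumes a registered "people" table

  // Requested (hypothetical API, does not exist): the same style of
  // registration for aggregates.
  // sqlContext.udf.registerAggregate("myAvg", new MyAverageUDAF)
}
{code}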
[jira] [Updated] (SPARK-6811) Building binary R packages for SparkR
[ https://issues.apache.org/jira/browse/SPARK-6811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-6811: - Target Version/s: 1.4.0 Building binary R packages for SparkR - Key: SPARK-6811 URL: https://issues.apache.org/jira/browse/SPARK-6811 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Blocker We should figure out how to distribute binary packages for SparkR as a part of the release process. R packages for Windows might need to be built separately and we could offer a separate download link for Windows users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6804) System.exit(1) on error
[ https://issues.apache.org/jira/browse/SPARK-6804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6804. -- Resolution: Duplicate System.exit(1) on error --- Key: SPARK-6804 URL: https://issues.apache.org/jira/browse/SPARK-6804 Project: Spark Issue Type: Improvement Reporter: Alberto We are developing a web application that is using Spark under the hood. Testing our app, we have found out that when our Spark master is not up and running and we try to connect to it, Spark kills our app. We've been having a look at the code and have noticed that the TaskSchedulerImpl class just kills the JVM, so our web application is obviously also killed. See the following code snippet I am talking about:

{code}
else {
  // No task sets are active but we still got an error. Just exit since this
  // must mean the error is during registration.
  // It might be good to do something smarter here in the future.
  logError("Exiting due to error from cluster scheduler: " + message)
  System.exit(1)
}
{code}

IMHO this code should not invoke System.exit(1). Instead, it should throw an exception so applications will be able to handle the error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
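A minimal sketch of the alternative the reporter suggests, assuming the surrounding scheduler code can let exceptions propagate; this is illustrative only, not the actual Spark patch:

{code}
import org.apache.spark.SparkException

// Illustrative: surface the scheduler error to the caller instead of
// terminating the JVM, so an embedding application (e.g. a web server)
// can catch it and stay alive.
def handleSchedulerError(message: String): Unit = {
  // logError("Error from cluster scheduler: " + message)
  throw new SparkException("Error from cluster scheduler: " + message)
}
{code}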
[jira] [Created] (SPARK-6811) Building binary R packages for SparkR
Shivaram Venkataraman created SPARK-6811: Summary: Building binary R packages for SparkR Key: SPARK-6811 URL: https://issues.apache.org/jira/browse/SPARK-6811 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Blocker We should figure out how to distribute binary packages for SparkR as a part of the release process. R packages for Windows might need to be built separately and we could offer a separate download link for Windows users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6812) filter() on DataFrame does not work as expected
Davies Liu created SPARK-6812: - Summary: filter() on DataFrame does not work as expected Key: SPARK-6812 URL: https://issues.apache.org/jira/browse/SPARK-6812 Project: Spark Issue Type: Bug Components: SparkR Reporter: Davies Liu Priority: Blocker

{code}
filter(df, df$age > 21)
Error in filter(df, df$age > 21) : no method for coercing this S4 class to a vector
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6229) Support SASL encryption in network/common module
[ https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487934#comment-14487934 ] Marcelo Vanzin commented on SPARK-6229: --- Hi Jeff, Those kinds of differences are exactly why I think all this channel setup code should be handled by config options instead of having client code register bootstraps and other things explicitly. I went with the latter approach to keep the changes smaller (and even then they're still pretty intrusive). Aaron doesn't seem to be a fan of that (config-based) approach, though. For the SSL case, I haven't really looked at your code nor am I familiar with how to use SSL in netty. Do you have to set it up as soon as the server starts listening? (In which case you'd need some new `createServer()` method for SSL.) Or can it be set up after the connection happens? (In which case, the TransportServerBootstrap interface I added should suffice - just register the handler when doBootstrap() is called.) Support SASL encryption in network/common module Key: SPARK-6229 URL: https://issues.apache.org/jira/browse/SPARK-6229 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin After SASL support has been added to network/common, supporting encryption should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. Since the latter requires a valid kerberos login to work (and so doesn't really work with executors), encryption would require the use of DIGEST-MD5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3947) [Spark SQL] UDAF Support
[ https://issues.apache.org/jira/browse/SPARK-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487222#comment-14487222 ] cynepia commented on SPARK-3947: Hi Takeshi San, Thanks for the quick response. I would like to know if there are any active discussions on the topic. While we refactor the interface for aggregates, we should keep in mind the DataFrame, GroupedData, and possibly also sql.dataframe.Column, which looks different from pandas.Series and doesn't currently support any aggregations. We would be happy to participate and contribute if there are any discussions on the same. [Spark SQL] UDAF Support Key: SPARK-3947 URL: https://issues.apache.org/jira/browse/SPARK-3947 Project: Spark Issue Type: Improvement Components: SQL Reporter: Pei-Lun Lee Assignee: Venkata Ramana G Right now only Hive UDAFs are supported. It would be nice to have UDAFs, similar to UDFs, registered through SQLContext.registerFunction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6691) Abstract and add a dynamic RateLimiter for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487793#comment-14487793 ] Tathagata Das commented on SPARK-6691: -- Yes, I completely agree with the motivation. There is a need for backpressure. I have a design myself but had not written up a design doc. I think looking through your design will help me get more concrete ideas, and together we can come up with a good design. Thanks :) Abstract and add a dynamic RateLimiter for Spark Streaming -- Key: SPARK-6691 URL: https://issues.apache.org/jira/browse/SPARK-6691 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Saisai Shao Flow control (or rate control) for input data is very important in a streaming system, especially for Spark Streaming to stay stable and up-to-date. An unexpected flood of incoming data, or an ingestion rate beyond the computation power of the cluster, will make the system unstable and increase the delay. With Spark Streaming’s job generation and processing pattern, this delay accumulates and introduces unacceptable exceptions. Currently, in Spark Streaming’s receiver-based input stream, there is a RateLimiter in BlockGenerator which controls the ingestion rate of input data, but the current implementation has several limitations: # The max ingestion rate is set by the user through configuration beforehand; the user may lack the experience to set an appropriate value before the application is running. # This configuration is fixed through the lifetime of the application, which means you need to consider the worst scenario to set a reasonable configuration. # Input streams like DirectKafkaInputStream need to maintain another solution to achieve the same functionality. # The lack of slow-start control makes the whole system easily trapped into large processing and scheduling delays at the very beginning. So here we propose a new dynamic RateLimiter, as well as a new interface for the RateLimiter, to improve the whole system's stability. The targets are: * Dynamically adjust the ingestion rate according to the processing rate of previously finished jobs. * Offer a uniform solution not only for receiver-based input streams, but also for direct streams like DirectKafkaInputStream and new ones. * Slow-start the rate to control network congestion when the job is started. * A pluggable framework to make maintaining extensions easier. Here is the design doc (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing) and working branch (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter). Any comments would be greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6816) Add SparkConf API to configure SparkR
Shivaram Venkataraman created SPARK-6816: Summary: Add SparkConf API to configure SparkR Key: SPARK-6816 URL: https://issues.apache.org/jira/browse/SPARK-6816 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Minor Right now the only way to configure SparkR is to pass in arguments to sparkR.init. The goal is to add an API similar to SparkConf on Scala/Python to make configuration easier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6691) Abstract and add a dynamic RateLimiter for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486980#comment-14486980 ] Tathagata Das commented on SPARK-6691: -- [~jerryshao] This is a very good first attempt! However, from my experience in rate-control matters, it can be quite tricky to balance the stability of the system with high throughput. So any attempt needs to be made carefully, with some basic theoretical analysis of how the system will behave in different scenarios (e.g. when processing load increases with the same data rate or vice versa, or when cluster size decreases due to failures, etc.). I would like to see that sort of analysis in the design doc. Beyond that, I will spend time thinking about the pros and cons of this approach. Nonetheless, thank you for initiating this. Abstract and add a dynamic RateLimiter for Spark Streaming -- Key: SPARK-6691 URL: https://issues.apache.org/jira/browse/SPARK-6691 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Saisai Shao Flow control (or rate control) for input data is very important in a streaming system, especially for Spark Streaming to stay stable and up-to-date. An unexpected flood of incoming data, or an ingestion rate beyond the computation power of the cluster, will make the system unstable and increase the delay. With Spark Streaming’s job generation and processing pattern, this delay accumulates and introduces unacceptable exceptions. Currently, in Spark Streaming’s receiver-based input stream, there is a RateLimiter in BlockGenerator which controls the ingestion rate of input data, but the current implementation has several limitations: # The max ingestion rate is set by the user through configuration beforehand; the user may lack the experience to set an appropriate value before the application is running. # This configuration is fixed through the lifetime of the application, which means you need to consider the worst scenario to set a reasonable configuration. # Input streams like DirectKafkaInputStream need to maintain another solution to achieve the same functionality. # The lack of slow-start control makes the whole system easily trapped into large processing and scheduling delays at the very beginning. So here we propose a new dynamic RateLimiter, as well as a new interface for the RateLimiter, to improve the whole system's stability. The targets are: * Dynamically adjust the ingestion rate according to the processing rate of previously finished jobs. * Offer a uniform solution not only for receiver-based input streams, but also for direct streams like DirectKafkaInputStream and new ones. * Slow-start the rate to control network congestion when the job is started. * A pluggable framework to make maintaining extensions easier. Here is the design doc (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing) and working branch (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter). Any comments would be greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
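A purely illustrative Scala sketch of the kind of pluggable, dynamically adjusted rate limiter the proposal describes; the interface below is my own simplification for discussion, not the one in the linked design doc:

{code}
// Illustrative only: a minimal pluggable rate limiter that adapts its
// maximum ingestion rate to the observed processing rate of finished batches.
trait RateLimiter {
  def currentRate: Double  // records per second currently allowed
  def updateRate(processedRecords: Long, processingTimeMs: Long): Unit
}

class DynamicRateLimiter(initialRate: Double, maxRate: Double) extends RateLimiter {
  private var rate = initialRate  // a small initial value gives slow-start behaviour

  override def currentRate: Double = rate

  override def updateRate(processedRecords: Long, processingTimeMs: Long): Unit = {
    val observed = processedRecords * 1000.0 / math.max(processingTimeMs, 1)
    // Move halfway toward the observed processing capacity, capped by a hard limit.
    rate = math.min(maxRate, (rate + observed) / 2.0)
  }
}
{code}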
[jira] [Created] (SPARK-6819) Support nested types in SparkR DataFrame
Shivaram Venkataraman created SPARK-6819: Summary: Support nested types in SparkR DataFrame Key: SPARK-6819 URL: https://issues.apache.org/jira/browse/SPARK-6819 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Shivaram Venkataraman ArrayType, MapType and StructType: we can infer the correct schema for them, but the serialization cannot handle them well. From Davies in https://sparkr.atlassian.net/browse/SPARKR-230: an ArrayType could be c("ab", "ba"), which we will serialize as character. We do not support deserializing env in R (serialized from Map[] in Scala). So all three do not work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6799) Add dataframe examples for SparkR
Shivaram Venkataraman created SPARK-6799: Summary: Add dataframe examples for SparkR Key: SPARK-6799 URL: https://issues.apache.org/jira/browse/SPARK-6799 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman Priority: Critical We should add more data frame usage examples for SparkR . This can be similar to the python examples at https://github.com/apache/spark/blob/1b2aab8d5b9cc2ff702506038bd71aa8debe7ca0/examples/src/main/python/sql.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2673) Improve Enable to attach Debugger to Executors easily
[ https://issues.apache.org/jira/browse/SPARK-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2673: -- Description: In the current implementation, it is difficult to attach a debugger to each Executor in the cluster. The reasons are as follows. 1) Multiple Executors can run on the same machine, so each executor needs to open an individual debug port. 2) Even if we can open a unique debug port for each Executor running on the same machine, it's a bother to check the debug port of each executor. To solve those problems, I think the following 2 improvements are needed. 1) Enable each executor to open a unique debug port on a machine. 2) Expand the WebUI to show the debug ports open in each executor. was: In the current implementation, it is difficult to attach a debugger to each Executor in the cluster. The reasons are as follows. 1) It's difficult for Executors running on the same machine to open debug ports because we can only pass the same JVM options to all executors. 2) Even if we can open a unique debug port for each Executor running on the same machine, it's a bother to check the debug port of each executor. To solve those problems, I think the following 2 improvements are needed. 1) Enable each executor to open a unique debug port on a machine. 2) Expand the WebUI to show the debug ports open in each executor. Improve Enable to attach Debugger to Executors easily - Key: SPARK-2673 URL: https://issues.apache.org/jira/browse/SPARK-2673 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Kousuke Saruta In the current implementation, it is difficult to attach a debugger to each Executor in the cluster. The reasons are as follows. 1) Multiple Executors can run on the same machine, so each executor needs to open an individual debug port. 2) Even if we can open a unique debug port for each Executor running on the same machine, it's a bother to check the debug port of each executor. To solve those problems, I think the following 2 improvements are needed. 1) Enable each executor to open a unique debug port on a machine. 2) Expand the WebUI to show the debug ports open in each executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
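As a stop-gap under the current behaviour described above, it is worth noting (my assumption about the JDWP agent, not something stated in this issue) that address=0 lets each executor JVM pick a free debug port, which avoids port collisions even though you still have to dig the chosen port out of each executor's stderr:

{code}
import org.apache.spark.SparkConf

// Workaround sketch: let every executor JVM bind its debugger to an
// ephemeral port (JDWP address=0), so multiple executors on one machine
// do not clash. Finding the chosen port still requires reading executor
// logs, which is exactly what improvement 2) above would fix.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=0")
{code}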
[jira] [Created] (SPARK-6803) Support SparkR Streaming
Hao created SPARK-6803: -- Summary: Support SparkR Streaming Key: SPARK-6803 URL: https://issues.apache.org/jira/browse/SPARK-6803 Project: Spark Issue Type: New Feature Components: SparkR, Streaming Reporter: Hao Fix For: 1.4.0 Adds an R API for Spark Streaming. An experimental version is presented in repo [1], which follows the PySpark streaming design. Also, this PR can be further broken down into sub-task issues. [1] https://github.com/hlin09/spark/tree/SparkR-streaming/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6803) [SparkR] Support SparkR Streaming
[ https://issues.apache.org/jira/browse/SPARK-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487872#comment-14487872 ] Tathagata Das commented on SPARK-6803: -- I know of many use cases where Python is the desired language for streaming, especially in the devops domain (devops love Python, as I have heard) for streaming machine data and processing it. I do not have any knowledge about the need for writing streaming applications in R. Nonetheless it is very cool. :D BTW, the main challenge in building the Python API for streaming was making the streaming scheduler in Java call back into Python to run an arbitrary Python function (an RDD-to-RDD transformation). Setting up this callback through Py4j was interesting. I am curious to know how this was solved with R in this prototype. [SparkR] Support SparkR Streaming - Key: SPARK-6803 URL: https://issues.apache.org/jira/browse/SPARK-6803 Project: Spark Issue Type: New Feature Components: SparkR, Streaming Reporter: Hao Fix For: 1.4.0 Adds an R API for Spark Streaming. An experimental version is presented in repo [1], which follows the PySpark streaming design. Also, this PR can be further broken down into sub-task issues. [1] https://github.com/hlin09/spark/tree/SparkR-streaming/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2673) Improve Enable to attach Debugger to Executors easily
[ https://issues.apache.org/jira/browse/SPARK-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2673: -- Component/s: Web UI Improve Enable to attach Debugger to Executors easily - Key: SPARK-2673 URL: https://issues.apache.org/jira/browse/SPARK-2673 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Kousuke Saruta In current implementation, we are difficult to attach debugger to each Executor in the cluster. There are reasons as follows. 1) It's difficult for Executors running on the same machine to open debug port because we can only pass same JVM options to all executors. 2) Even if we can open unique debug port to each Executors running on the same machine, it's a bother to check debug port of each executor. To solve those problem, I think following 2 improvement is needed. 1) Enable executor to open unique debug port on a machine. 2) Expand WebUI to be able to show debug ports opening in each executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6823) Add a model.matrix like capability to DataFrames (modelDataFrame)
Shivaram Venkataraman created SPARK-6823: Summary: Add a model.matrix like capability to DataFrames (modelDataFrame) Key: SPARK-6823 URL: https://issues.apache.org/jira/browse/SPARK-6823 Project: Spark Issue Type: New Feature Components: ML, SparkR Reporter: Shivaram Venkataraman Currently Mllib modeling tools work only with double data. However, data tables in practice often have a set of categorical fields (factors in R), that need to be converted to a set of 0/1 indicator variables (making the data actually used in a modeling algorithm completely numeric). In R, this is handled in modeling functions using the model.matrix function. Similar functionality needs to be available within Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-2673) Improve Enable to attach Debugger to Executors easily
[ https://issues.apache.org/jira/browse/SPARK-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2673: --- Assignee: Apache Spark Improve Enable to attach Debugger to Executors easily - Key: SPARK-2673 URL: https://issues.apache.org/jira/browse/SPARK-2673 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Kousuke Saruta Assignee: Apache Spark In current implementation, we are difficult to attach debugger to each Executor in the cluster. There are reasons as follows. 1) Multi Executors can run on the same machine so each executor should open individual debug ports. 2) Even if we can open unique debug port to each Executors running on the same machine, it's a bother to check debug port of each executor. To solve those problem, I think following 2 improvement is needed. 1) Enable executor to open unique debug port on a machine. 2) Expand WebUI to be able to show debug ports opening in each executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2673) Improve Enable to attach Debugger to Executors easily
[ https://issues.apache.org/jira/browse/SPARK-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486897#comment-14486897 ] Apache Spark commented on SPARK-2673: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/5437 Improve Enable to attach Debugger to Executors easily - Key: SPARK-2673 URL: https://issues.apache.org/jira/browse/SPARK-2673 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Kousuke Saruta In current implementation, we are difficult to attach debugger to each Executor in the cluster. There are reasons as follows. 1) Multi Executors can run on the same machine so each executor should open individual debug ports. 2) Even if we can open unique debug port to each Executors running on the same machine, it's a bother to check debug port of each executor. To solve those problem, I think following 2 improvement is needed. 1) Enable executor to open unique debug port on a machine. 2) Expand WebUI to be able to show debug ports opening in each executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-2673) Improve Enable to attach Debugger to Executors easily
[ https://issues.apache.org/jira/browse/SPARK-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2673: --- Assignee: (was: Apache Spark) Improve Enable to attach Debugger to Executors easily - Key: SPARK-2673 URL: https://issues.apache.org/jira/browse/SPARK-2673 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Kousuke Saruta In current implementation, we are difficult to attach debugger to each Executor in the cluster. There are reasons as follows. 1) Multi Executors can run on the same machine so each executor should open individual debug ports. 2) Even if we can open unique debug port to each Executors running on the same machine, it's a bother to check debug port of each executor. To solve those problem, I think following 2 improvement is needed. 1) Enable executor to open unique debug port on a machine. 2) Expand WebUI to be able to show debug ports opening in each executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6822) lapplyPartition passes empty list to function
Shivaram Venkataraman created SPARK-6822: Summary: lapplyPartition passes empty list to function Key: SPARK-6822 URL: https://issues.apache.org/jira/browse/SPARK-6822 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Shivaram Venkataraman I have an RDD containing two elements, as expected and as shown by a collect. When I call lapplyPartition on it with a function that prints its arguments to stderr, the function gets called three times: the first two with the expected arguments and the third with an empty list as the argument. I was wondering if that's a bug or if there are conditions under which that's possible. I apologize that I don't have a simple test case ready yet. I ran into this potential bug while developing a separate package, plyrmr. If you are willing to install it, the test case is very simple. The RDD that creates this problem is the result of a join, but I couldn't replicate the problem using a plain vanilla join. Example from Antonio on the SparkR JIRA: I don't have time to try any harder to repro this without plyrmr. For the record, this is the example

{code}
library(plyrmr)
plyrmr.options(backend = "spark")
df1 = mtcars[1:4,]
df2 = mtcars[3:6,]
w = as.data.frame(gapply(merge(input(df1), input(df2)), identity))
{code}

gapply is implemented with lapplyPartition in most cases, merge with a join, and as.data.frame with a collect. The join has an arbitrary argument of 4 partitions. If I turn that down to 2L, the problem disappears. I will check in a version with a workaround in place, but a debugging statement will leave a record in stderr whenever the workaround kicks in, so that we can track it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6805) ML Pipeline API in SparkR
Xiangrui Meng created SPARK-6805: Summary: ML Pipeline API in SparkR Key: SPARK-6805 URL: https://issues.apache.org/jira/browse/SPARK-6805 Project: Spark Issue Type: Umbrella Components: ML, SparkR Reporter: Xiangrui Meng Assignee: Xiangrui Meng SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API in SparkR. The implementation should be similar to the pipeline API implementation in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6817) DataFrame UDFs in R
Shivaram Venkataraman created SPARK-6817: Summary: DataFrame UDFs in R Key: SPARK-6817 URL: https://issues.apache.org/jira/browse/SPARK-6817 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Shivaram Venkataraman This depends on some internal interface of Spark SQL, should be done after merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6767) Documentation error in Spark SQL Readme file
[ https://issues.apache.org/jira/browse/SPARK-6767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6767. -- Resolution: Fixed Fix Version/s: 1.4.0 1.3.2 Assignee: Tijo Thomas Documentation error in Spark SQL Readme file Key: SPARK-6767 URL: https://issues.apache.org/jira/browse/SPARK-6767 Project: Spark Issue Type: Bug Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Tijo Thomas Assignee: Tijo Thomas Priority: Trivial Fix For: 1.3.2, 1.4.0 Error in the Spark SQL documentation file. The sample script for the SQL DSL throws the error below: scala> query.where('key > 30).select(avg('key)).collect() <console>:43: error: value > is not a member of Symbol query.where('key > 30).select(avg('key)).collect() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3276) Provide an API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming
[ https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3276: --- Assignee: (was: Apache Spark) Provide an API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming -- Key: SPARK-3276 URL: https://issues.apache.org/jira/browse/SPARK-3276 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.2.0 Reporter: Jack Hu Priority: Minor Currently there is only one API, textFileStream in StreamingContext, to specify a text file DStream, and it always ignores old files. Sometimes the old files are still useful. We need an API to let the user choose whether old files should be ignored or not. The API currently in StreamingContext: def textFileStream(directory: String): DStream[String] = { fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3276) Provide an API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming
[ https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487073#comment-14487073 ] Apache Spark commented on SPARK-3276: - User 'emres' has created a pull request for this issue: https://github.com/apache/spark/pull/5438 Provide an API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming -- Key: SPARK-3276 URL: https://issues.apache.org/jira/browse/SPARK-3276 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.2.0 Reporter: Jack Hu Priority: Minor Currently there is only one API, textFileStream in StreamingContext, to specify a text file DStream, and it always ignores old files. Sometimes the old files are still useful. We need an API to let the user choose whether old files should be ignored or not. The API currently in StreamingContext: def textFileStream(directory: String): DStream[String] = { fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
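For context on the existing API surface: the lower-level fileStream overload already exposes a newFilesOnly flag, but how far back "old" files are picked up is still capped by the hard-coded MIN_REMEMBER_DURATION that this issue wants to make configurable. A hedged Scala sketch of that overload (assuming ssc is an active StreamingContext):

{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

def oldAndNewTextFiles(ssc: StreamingContext, directory: String): DStream[String] =
  ssc.fileStream[LongWritable, Text, TextInputFormat](
    directory,
    (path: Path) => true,   // accept every file name
    newFilesOnly = false    // also consider files that predate the stream's start
  ).map(_._2.toString)
{code}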
[jira] [Created] (SPARK-6804) System.exit(1) on error
Alberto created SPARK-6804: -- Summary: System.exit(1) on error Key: SPARK-6804 URL: https://issues.apache.org/jira/browse/SPARK-6804 Project: Spark Issue Type: Improvement Reporter: Alberto We are developing a web application that is using Spark under the hood. Testing our app, we have found out that when our Spark master is not up and running and we try to connect to it, Spark kills our app. We've been having a look at the code and have noticed that the TaskSchedulerImpl class just kills the JVM, so our web application is obviously also killed. See the following code snippet I am talking about:

{code}
else {
  // No task sets are active but we still got an error. Just exit since this
  // must mean the error is during registration.
  // It might be good to do something smarter here in the future.
  logError("Exiting due to error from cluster scheduler: " + message)
  System.exit(1)
}
{code}

IMHO this code should not invoke System.exit(1). Instead, it should throw an exception so applications will be able to handle the error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6803) [SparkR] Support SparkR Streaming
[ https://issues.apache.org/jira/browse/SPARK-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hao updated SPARK-6803: --- Summary: [SparkR] Support SparkR Streaming (was: Support SparkR Streaming) [SparkR] Support SparkR Streaming - Key: SPARK-6803 URL: https://issues.apache.org/jira/browse/SPARK-6803 Project: Spark Issue Type: New Feature Components: SparkR, Streaming Reporter: Hao Fix For: 1.4.0 Adds R API for Spark Streaming. A experimental version is presented in repo [1]. which follows the PySpark streaming design. Also, this PR can be further broken down into sub task issues. [1] https://github.com/hlin09/spark/tree/SparkR-streaming/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6735) Provide options to make maximum executor failure count ( which kills the application ) relative to a window duration or disable it.
[ https://issues.apache.org/jira/browse/SPARK-6735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487670#comment-14487670 ] Sandy Ryza commented on SPARK-6735: --- Hi [~twinkle], can you submit the PR against the main Spark project. Provide options to make maximum executor failure count ( which kills the application ) relative to a window duration or disable it. --- Key: SPARK-6735 URL: https://issues.apache.org/jira/browse/SPARK-6735 Project: Spark Issue Type: Improvement Components: Spark Submit, YARN Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Twinkle Sachdeva Currently there is a setting (spark.yarn.max.executor.failures) which specifies the maximum number of executor failures, after which the application fails. For long-running applications, the user may require the application not to be killed at all, or may want such a setting relative to a window duration. This improvement is to provide options to make the maximum executor failure count (which kills the application) relative to a window duration, or to disable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6735) Provide options to make maximum executor failure count ( which kills the application ) relative to a window duration or disable it.
[ https://issues.apache.org/jira/browse/SPARK-6735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487670#comment-14487670 ] Sandy Ryza edited comment on SPARK-6735 at 4/9/15 5:06 PM: --- Hi [~twinkle], can you submit the PR against the main Spark project? was (Author: sandyr): Hi [~twinkle], can you submit the PR against the main Spark project. Provide options to make maximum executor failure count ( which kills the application ) relative to a window duration or disable it. --- Key: SPARK-6735 URL: https://issues.apache.org/jira/browse/SPARK-6735 Project: Spark Issue Type: Improvement Components: Spark Submit, YARN Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Twinkle Sachdeva Currently there is a setting (spark.yarn.max.executor.failures) which specifies the maximum number of executor failures, after which the application fails. For long-running applications, the user may require the application not to be killed at all, or may want such a setting relative to a window duration. This improvement is to provide options to make the maximum executor failure count (which kills the application) relative to a window duration, or to disable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6821) Refactor SerDe API in SparkR to be more developer friendly
Shivaram Venkataraman created SPARK-6821: Summary: Refactor SerDe API in SparkR to be more developer friendly Key: SPARK-6821 URL: https://issues.apache.org/jira/browse/SPARK-6821 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Shivaram Venkataraman The existing SerDe API we use in the SparkR JVM backend is limited and not very easy to use. We should refactor it to make it use more of Scala's type system and also allow extensions for user-defined S3 or S4 types in R -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6813) SparkR style guide
Shivaram Venkataraman created SPARK-6813: Summary: SparkR style guide Key: SPARK-6813 URL: https://issues.apache.org/jira/browse/SPARK-6813 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman We should develop a SparkR style guide document based on some of the guidelines we use and some of the best practices in R. Some examples of R style guides are: http://r-pkgs.had.co.nz/r.html#style http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html A related issue is to work on an automatic style-checking tool; https://github.com/jimhester/lintr seems promising. We could have an R style guide based on the one from Google [1], and adjust some of its rules based on the conventions in Spark: 1. Line length: maximum 100 characters 2. No limit on function name length (the API should be similar to other languages) 3. Allow S4 objects/methods -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6677) pyspark.sql nondeterministic issue with row fields
[ https://issues.apache.org/jira/browse/SPARK-6677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487010#comment-14487010 ] Stefano Parmesan commented on SPARK-6677: - Uhm, I don't know what to say. Let's try this: I've created a Docker image that reproduces the issue; it's available here: https://github.com/armisael/SPARK-6677 I tested it on three different machines, and the issue appeared on all of them. Can you give it a try? pyspark.sql nondeterministic issue with row fields -- Key: SPARK-6677 URL: https://issues.apache.org/jira/browse/SPARK-6677 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0 Environment: spark version: spark-1.3.0-bin-hadoop2.4 python version: Python 2.7.6 operating system: MacOS, x86_64 x86_64 x86_64 GNU/Linux Reporter: Stefano Parmesan Assignee: Davies Liu Labels: pyspark, row, sql The following issue happens only when running pyspark in the Python interpreter; it works correctly with spark-submit. Reading two JSON files containing objects with different structures sometimes leads to the definition of wrong Rows, where the fields of one file are used for the other one. I was able to write sample code that reproduces this issue one out of three times; the code snippet is available at the following link, together with some (very simple) data samples: https://gist.github.com/armisael/e08bb4567d0a11efe2db -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6803) Support SparkR Streaming
[ https://issues.apache.org/jira/browse/SPARK-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487800#comment-14487800 ] Shivaram Venkataraman commented on SPARK-6803: -- Thanks [~hlin09]. The PySpark design is a good starting point for the design. However, I'm wondering if there are any R-specific paradigms or operations we should be supporting. Do you know of any existing R packages that do streaming / time-series analysis? [~tdas] [~davies] It'll also be great to know if we have any feedback from streaming users in Python about the API. Support SparkR Streaming Key: SPARK-6803 URL: https://issues.apache.org/jira/browse/SPARK-6803 Project: Spark Issue Type: New Feature Components: SparkR, Streaming Reporter: Hao Fix For: 1.4.0 Adds an R API for Spark Streaming. An experimental version is presented in repo [1], which follows the PySpark streaming design. Also, this PR can be further broken down into sub-task issues. [1] https://github.com/hlin09/spark/tree/SparkR-streaming/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2673) Improve Enable to attach Debugger to Executors easily
[ https://issues.apache.org/jira/browse/SPARK-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2673: -- Summary: Improve Enable to attach Debugger to Executors easily (was: Improve Spark so that we can attach Debugger to Executors easily) Improve Enable to attach Debugger to Executors easily - Key: SPARK-2673 URL: https://issues.apache.org/jira/browse/SPARK-2673 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kousuke Saruta In current implementation, we are difficult to attach debugger to each Executor in the cluster. There are reasons as follows. 1) It's difficult for Executors running on the same machine to open debug port because we can only pass same JVM options to all executors. 2) Even if we can open unique debug port to each Executors running on the same machine, it's a bother to check debug port of each executor. To solve those problem, I think following 2 improvement is needed. 1) Enable executor to open unique debug port on a machine. 2) Expand WebUI to be able to show debug ports opening in each executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6229) Support SASL encryption in network/common module
[ https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487878#comment-14487878 ] Jeffrey Turpin commented on SPARK-6229: --- Hey Marcelo, I have been working on SPARK-6373 and have reviewed your pull request and merged it into my WIP branch https://github.com/turp1twin/spark/tree/ssl-shuffle. I tried to follow the general design pattern I discussed with Aaron Davidson, having a single EncryptionHandler interface with implementations for both SSL and SASL encryption. One issue I faced is that the timing of adding the appropriate encryption handlers differs between SSL and SASL. For SSL, I need to add the SslHandler to the Netty pipeline before the connection is made, while for SASL it looks like you add it during the TransportClient/Server bootstrap process, which occurs after the initial connection. Anyway, I haven't created a pull request yet and am waiting on some more feedback... If you have some time, perhaps you can give me your thoughts. Some commits of interest: https://github.com/apache/spark/commit/ab8743f6ac707060cbae63bdf491723709fe32f3 https://github.com/apache/spark/commit/9527aef89b1bbc80a22337552dd54af936aa1094 Cheers, Jeff Support SASL encryption in network/common module Key: SPARK-6229 URL: https://issues.apache.org/jira/browse/SPARK-6229 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin After SASL support has been added to network/common, supporting encryption should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. Since the latter requires a valid kerberos login to work (and so doesn't really work with executors), encryption would require the use of DIGEST-MD5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
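To make the timing constraint concrete, a purely illustrative Netty sketch (not code from either pull request): TLS has to be in the pipeline when the channel is initialized, before any application bytes flow, whereas a SASL-style bootstrap can negotiate over the already-established connection:

{code}
import io.netty.channel.socket.SocketChannel
import io.netty.handler.ssl.SslHandler
import javax.net.ssl.SSLContext

// Illustrative: install TLS at channel-initialization time. A SASL bootstrap,
// by contrast, runs its handshake after the connection exists.
def installSsl(ch: SocketChannel, sslContext: SSLContext): Unit = {
  val engine = sslContext.createSSLEngine()
  engine.setUseClientMode(false)  // server side in this sketch
  ch.pipeline().addFirst("ssl", new SslHandler(engine))
}
{code}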
[jira] [Created] (SPARK-6810) Performance benchmarks for SparkR
Shivaram Venkataraman created SPARK-6810: Summary: Performance benchmarks for SparkR Key: SPARK-6810 URL: https://issues.apache.org/jira/browse/SPARK-6810 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Critical We should port some performance benchmarks from spark-perf to SparkR for tracking performance regressions / improvements. https://github.com/databricks/spark-perf/tree/master/pyspark-tests has a list of PySpark performance benchmarks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.
Micael Capitão created SPARK-6800: - Summary: Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results. Key: SPARK-6800 URL: https://issues.apache.org/jira/browse/SPARK-6800 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: Windows 8.1, Derby, Spark 1.3.0 CDH5.4.0, Scala 2.10 Reporter: Micael Capitão Having a Derby table with people info (id, name, age) defined like this: {code} val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true" val conn = DriverManager.getConnection(jdbcUrl) val stmt = conn.createStatement() stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)") stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)") stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)") stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)") stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)") stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)") stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)") {code} If I try to read that table from Spark SQL with lower/upper bounds, like this: {code} val people = sqlContext.jdbc(url = jdbcUrl, table = "Person", columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10) people.show() {code} I get this result: {noformat} PERSON_ID NAME AGE 3 Ana Rita Costa 12 5 Miguel Costa 15 6 Anabela Sintra 13 2 Lurdes Pereira 23 4 Armando Pereira 32 1 Armando Carvalho 50 {noformat} Which is wrong, considering the defined upper bound has been ignored (I get a person with age 50!). Digging the code, I've found that in {{JDBCRelation.columnPartition}} the WHERE clauses it generates are the following: {code} (0) age < 4,0 (1) age >= 4 AND age < 8,1 (2) age >= 8 AND age < 12,2 (3) age >= 12 AND age < 16,3 (4) age >= 16 AND age < 20,4 (5) age >= 20 AND age < 24,5 (6) age >= 24 AND age < 28,6 (7) age >= 28 AND age < 32,7 (8) age >= 32 AND age < 36,8 (9) age >= 36,9 {code} The last condition ignores the upper bound and the other ones may result in repeated rows being read. Using the JdbcRDD (and converting it to a DataFrame) I would have something like this: {code} val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl), "SELECT * FROM Person WHERE age >= ? and age <= ?", 0, 40, 10, rs => (rs.getInt(1), rs.getString(2), rs.getInt(3))) val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE") people.show() {code} Resulting in: {noformat} PERSON_ID NAME AGE 3 Ana Rita Costa 12 5 Miguel Costa 15 6 Anabela Sintra 13 2 Lurdes Pereira 23 4 Armando Pereira 32 {noformat} Which is correct! Confirming the WHERE clauses generated by the JdbcRDD in the {{getPartitions}} I've found it generates the following: {code} (0) age >= 0 AND age <= 3 (1) age >= 4 AND age <= 7 (2) age >= 8 AND age <= 11 (3) age >= 12 AND age <= 15 (4) age >= 16 AND age <= 19 (5) age >= 20 AND age <= 23 (6) age >= 24 AND age <= 27 (7) age >= 28 AND age <= 31 (8) age >= 32 AND age <= 35 (9) age >= 36 AND age <= 40 {code} This is the behaviour I was expecting from the Spark SQL version. The Spark SQL version is buggy, as far as I can tell. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
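For comparison, a bound-respecting clause generator (the behaviour the reporter expected) could be sketched as below. This is a hypothetical helper for illustration only, not Spark's actual JDBCRelation.columnPartition code:
{code}
// Hypothetical helper: one WHERE clause per partition, clamped to [lower, upper].
def boundedClauses(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
  val stride = (upper - lower) / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lower + i * stride
    val hi = if (i == numPartitions - 1) upper else lo + stride - 1
    s"$column >= $lo AND $column <= $hi"
  }
}
// boundedClauses("age", 0, 40, 10) gives "age >= 0 AND age <= 3", ..., "age >= 36 AND age <= 40"
{code}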
[jira] [Commented] (SPARK-6229) Support SASL encryption in network/common module
[ https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488104#comment-14488104 ] Jeffrey Turpin commented on SPARK-6229: --- Hey Marcelo, So what I have done is to overload the TransportContext constructor, adding a constructor that takes an instance of the TransportEncryptionHandler interface: {code:title=TransportContext.java|linenumbers=false|language=java} public TransportContext( TransportConf conf, RpcHandler appRpcHandler, TransportEncryptionHandler encryptionHandler) { this.conf = conf; this.appRpcHandler = appRpcHandler; this.decoder = new MessageDecoder(); if (encryptionHandler != null) { this.encryptionHandler = encryptionHandler; } else { this.encryptionHandler = new NoEncryptionHandler(); } this.encoder = (this.encryptionHandler.isEnabled() ? new SslMessageEncoder() : new MessageEncoder()); } {code} This way the existing method signatures for createServer and createClientFactory don't change. To facilitate this I also added a constructor to the TransportClientFactory class and modified the constructor for the TransportServer class, to also take a TransportEncryptionHandler instance. In the TransportClientFactory case I need to add the Netty SslHandler before the connection occurs, which can be done by calling the _addToPipeline_ method of the TransportEncryptionHandler interface: {code:title=TransportClientFactory.java|linenumbers=false|language=java} private void initHandler( final Bootstrap bootstrap, final AtomicReference<TransportClient> clientRef, final AtomicReference<Channel> channelRef) { bootstrap.handler(new ChannelInitializer<SocketChannel>() { @Override protected void initChannel(SocketChannel ch) throws Exception { TransportChannelHandler clientHandler = context.initializePipeline(ch); encryptionHandler.addToPipeline(ch.pipeline(), true); clientRef.set(clientHandler.getClient()); channelRef.set(ch); } }); } {code} This _initHandler_ method is called just before the connection is made. In addition the TransportEncryptionHandler interface has an _onConnect_ method to allow post-connect initialization to occur, which in the SSL case is to allow the handshake process to complete, which is a blocking operation. This could possibly be done in a custom TransportClientBootstrap implementation, but the method signature of _doBootstrap_ would have to change to allow for this. As for the TransportServer, the Netty SslHandler must be added to the pipeline before the server binds to a port and starts listening for connections. Again, in this case, this could be done in a TransportServerBootstrap implementation, but the method signature of _doBootstrap_ would have to change (or we would need to add another method) to allow for this... Thoughts? Jeff Support SASL encryption in network/common module Key: SPARK-6229 URL: https://issues.apache.org/jira/browse/SPARK-6229 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin After SASL support has been added to network/common, supporting encryption should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. Since the latter requires a valid kerberos login to work (and so doesn't really work with executors), encryption would require the use of DIGEST-MD5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486794#comment-14486794 ] yangping wu commented on SPARK-6770: Ok, Thank you very much for your reply. I will try to use a pure Spark Streaming program and plain Scala JDBC to write the data to MySQL. DirectKafkaInputDStream has not been initialized when recovery from checkpoint -- Key: SPARK-6770 URL: https://issues.apache.org/jira/browse/SPARK-6770 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: yangping wu I am reading data from Kafka using the createDirectStream method and saving the received log to MySQL; the code snippet is as follows {code} def functionToCreateContext(): StreamingContext = { val sparkConf = new SparkConf() val sc = new SparkContext(sparkConf) val ssc = new StreamingContext(sc, Seconds(10)) ssc.checkpoint("/tmp/kafka/channel/offset") // set checkpoint directory ssc } val struct = StructType(StructField("log", StringType) :: Nil) // Get StreamingContext from checkpoint data or create a new one val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset", functionToCreateContext) val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics) val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext) SDB.foreachRDD(rdd => { val result = rdd.map(item => { println(item) val result = item._2 match { case e: String => Row.apply(e) case _ => Row.apply() } result }) println(result.count()) val df = sqlContext.createDataFrame(result, struct) df.insertIntoJDBC(url, "test", overwrite = false) }) ssc.start() ssc.awaitTermination() ssc.stop() {code} But when I recover the program from the checkpoint, I encounter an exception: {code} Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not been initialized at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266) at org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218) at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89) at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67) at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512) at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57) at logstatstreaming.UserChannelTodb.main(UserChannelTodb.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597)
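The pure Spark Streaming plus plain JDBC approach mentioned in the reply might look roughly like the sketch below, continuing from the SDB stream in the snippet above. It is only illustrative: the connection URL, table, and column are placeholders, and it sidesteps the SQLContext/DataFrame usage that complicates checkpoint recovery here.
{code}
// Illustrative sketch: write each micro-batch to MySQL with plain JDBC,
// one connection per partition, without SQLContext/DataFrames.
import java.sql.DriverManager

SDB.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db?user=u&password=p") // placeholder URL
    val stmt = conn.prepareStatement("INSERT INTO test (log) VALUES (?)")                 // placeholder table/column
    try {
      records.foreach { case (_, log) =>
        stmt.setString(1, log)
        stmt.executeUpdate()
      }
    } finally {
      stmt.close()
      conn.close()
    }
  }
}
{code}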
[jira] [Resolved] (SPARK-6751) Spark History Server support multiple application attempts
[ https://issues.apache.org/jira/browse/SPARK-6751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-6751. -- Resolution: Duplicate Dup of SPARK-4705 Spark History Server support multiple application attempts -- Key: SPARK-6751 URL: https://issues.apache.org/jira/browse/SPARK-6751 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.3.0 Reporter: Thomas Graves Spark on Yarn supports running multiple application attempts (configurable number) in case the first (or second..) attempts fail. The Spark History server only supports one history file though. Under the default configs it keeps the first attempts history file. You can set the undocumented config spark.eventLog.overwrite to allow the follow on attempts to overwrite the first attempts history file. Note that in spark 1.2 not having the overwrite config set causes any following attempts to actually fail to run, in spark 1.3 they run and you just see a warning at the end of the attempts. It would be really nice to have an option that keeps all the attempts history files. This way a user can go back and look at each one individually. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5960: --- Assignee: Apache Spark (was: Chris Fregly) Allow AWS credentials to be passed to KinesisUtils.createStream() - Key: SPARK-5960 URL: https://issues.apache.org/jira/browse/SPARK-5960 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly Assignee: Apache Spark While IAM roles are preferable, we're seeing a lot of cases where we need to pass AWS credentials when creating the KinesisReceiver. Notes: * Make sure we don't log the credentials anywhere * Maintain compatibility with existing KinesisReceiver-based code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
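For illustration, a call with explicit credentials might look something like the sketch below. The two credential parameters are an assumption about how such an overload could be shaped, not the signature that was eventually merged; the credentials are read from the environment rather than hard-coded so they do not end up in logs.
{code}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

// Hypothetical overload: same arguments as today's createStream, plus explicit
// AWS credentials for cases where an IAM role is not available.
def buildStream(ssc: StreamingContext) =
  KinesisUtils.createStream(
    ssc,
    "myKinesisStream",
    "https://kinesis.us-east-1.amazonaws.com",
    Seconds(2),                    // Kinesis checkpoint interval
    InitialPositionInStream.LATEST,
    StorageLevel.MEMORY_AND_DISK_2,
    sys.env("AWS_ACCESS_KEY_ID"),  // hypothetical extra parameter
    sys.env("AWS_SECRET_KEY"))     // hypothetical extra parameter
{code}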
[jira] [Created] (SPARK-6820) Convert NAs to null type in SparkR DataFrames
Shivaram Venkataraman created SPARK-6820: Summary: Convert NAs to null type in SparkR DataFrames Key: SPARK-6820 URL: https://issues.apache.org/jira/browse/SPARK-6820 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Shivaram Venkataraman While converting RDD or local R DataFrame to a SparkR DataFrame we need to handle missing values or NAs. We should convert NAs to SparkSQL's null type to handle the conversion correctly -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6806) SparkR examples in programming guide
Davies Liu created SPARK-6806: - Summary: SparkR examples in programming guide Key: SPARK-6806 URL: https://issues.apache.org/jira/browse/SPARK-6806 Project: Spark Issue Type: New Feature Components: Documentation, SparkR Reporter: Davies Liu Priority: Blocker Add R examples for Spark Core and DataFrame programming guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6807) Merge recent changes in SparkR-pkg into Spark
Davies Liu created SPARK-6807: - Summary: Merge recent changes in SparkR-pkg into Spark Key: SPARK-6807 URL: https://issues.apache.org/jira/browse/SPARK-6807 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker A few new features landed in SparkR-pkg while the merge was in progress; we should pull them all in. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6636) Use public DNS hostname everywhere in spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-6636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488245#comment-14488245 ] Nicholas Chammas commented on SPARK-6636: - [~aasted] - Can you elaborate on this? I haven't used private-network-only security groups before. Why wouldn't the IP address work if that kind of security group is used? Just curious since, naively speaking, the public IP and public DNS name should always be interchangeable. Use public DNS hostname everywhere in spark_ec2.py -- Key: SPARK-6636 URL: https://issues.apache.org/jira/browse/SPARK-6636 Project: Spark Issue Type: Bug Components: EC2 Reporter: Matt Aasted Assignee: Matt Aasted Priority: Minor Fix For: 1.3.2, 1.4.0 The spark_ec2.py script uses public_dns_name everywhere in the script except for testing ssh availability, which is done using the public ip address of the instances. This breaks the script for users who are deploying the cluster with a private-network-only security group. The fix is to use public_dns_name in the remaining place. I am submitting a pull-request alongside this bug report. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3947) [Spark SQL] UDAF Support
[ https://issues.apache.org/jira/browse/SPARK-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487278#comment-14487278 ] Takeshi Yamamuro commented on SPARK-3947: - Sorry, but I'm not sure what the issue is there. SPARK-4233 just simplifies and bug-fixes the interface of Aggregate. If you'd like to discuss the topic, ISTM you need to open a new JIRA ticket for it. [Spark SQL] UDAF Support Key: SPARK-3947 URL: https://issues.apache.org/jira/browse/SPARK-3947 Project: Spark Issue Type: Improvement Components: SQL Reporter: Pei-Lun Lee Assignee: Venkata Ramana G Right now only Hive UDAFs are supported. It would be nice to have UDAF support similar to UDFs through SQLContext.registerFunction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
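To make the request concrete, a registration API in the spirit of SQLContext.registerFunction might look something like the sketch below. Both the trait shape and the registration call are purely hypothetical names invented for illustration; they are not an existing or proposed Spark interface.
{code}
// Purely hypothetical sketch of a UDAF abstraction; none of these names exist
// in Spark, they only illustrate the kind of interface being asked for.
trait SimpleUDAF[IN, BUF, OUT] {
  def zero: BUF                      // initial aggregation buffer
  def reduce(buf: BUF, in: IN): BUF  // fold one input value into the buffer
  def merge(b1: BUF, b2: BUF): BUF   // combine partial buffers across partitions
  def finish(buf: BUF): OUT          // produce the final result
}

// Imagined registration call, mirroring sqlContext.registerFunction for UDFs:
// sqlContext.registerAggregateFunction("myAvg", new SimpleUDAF[Double, (Double, Long), Double] {
//   def zero = (0.0, 0L)
//   def reduce(b: (Double, Long), x: Double) = (b._1 + x, b._2 + 1)
//   def merge(a: (Double, Long), b: (Double, Long)) = (a._1 + b._1, a._2 + b._2)
//   def finish(b: (Double, Long)) = if (b._2 == 0) 0.0 else b._1 / b._2
// })
{code}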
[jira] [Commented] (SPARK-2451) Enable to load config file for Akka
[ https://issues.apache.org/jira/browse/SPARK-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487386#comment-14487386 ] Andrea Peruffo commented on SPARK-2451: --- As per: https://issues.apache.org/jira/browse/SPARK-4669 Enable to load config file for Akka --- Key: SPARK-2451 URL: https://issues.apache.org/jira/browse/SPARK-2451 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kousuke Saruta Priority: Minor In current implementation, we cannot let Akka to load config file. Sometimes we want to use custom config file for Akka. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6804) System.exit(1) on error
[ https://issues.apache.org/jira/browse/SPARK-6804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alberto updated SPARK-6804: --- Description: We are developing a web application that is using Spark under the hood. Testing our app we have found out that when our spark master is not up and running and we try to connect with it, Spark is killing our app. We've been having a look at the code and we have noticed that the TaskSchedulerImpl class is just killing the JVM and our web application is obviously also killed. See the following code snippet I am talking about: {code} else { // No task sets are active but we still got an error. Just exit since this // must mean the error is during registration. // It might be good to do something smarter here in the future. logError("Exiting due to error from cluster scheduler: " + message) System.exit(1) } {code} IMHO this guy should not invoke System.exit(1). Instead, it should throw an exception so the applications will be able to handle the error. was: We are developing a web application that is using Spark under the hood. Testing our app we have found out that when our spark master is not up and running and we try to connect with it, Spark is killing our app. We've been having a look at the code and we have noticed that the TaskSchedulerImpl class is just killing the JVM and our web application is obviously also killed. See the following code snippet I am talking about: else { // No task sets are active but we still got an error. Just exit since this // must mean the error is during registration. // It might be good to do something smarter here in the future. logError("Exiting due to error from cluster scheduler: " + message) System.exit(1) } IMHO this guy should not invoke System.exit(1). Instead, it should throw an exception so the applications will be able to handle the error. System.exit(1) on error --- Key: SPARK-6804 URL: https://issues.apache.org/jira/browse/SPARK-6804 Project: Spark Issue Type: Improvement Reporter: Alberto We are developing a web application that is using Spark under the hood. Testing our app we have found out that when our spark master is not up and running and we try to connect with it, Spark is killing our app. We've been having a look at the code and we have noticed that the TaskSchedulerImpl class is just killing the JVM and our web application is obviously also killed. See the following code snippet I am talking about: {code} else { // No task sets are active but we still got an error. Just exit since this // must mean the error is during registration. // It might be good to do something smarter here in the future. logError("Exiting due to error from cluster scheduler: " + message) System.exit(1) } {code} IMHO this guy should not invoke System.exit(1). Instead, it should throw an exception so the applications will be able to handle the error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
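An illustrative sketch of what the reporter's suggestion would amount to in that branch (this mirrors the snippet above and is not the change that was actually committed):
{code}
else {
  // No task sets are active but we still got an error. This must mean the
  // error happened during registration. Throwing instead of System.exit(1)
  // lets an embedding application (e.g. a web app driving Spark) handle it.
  logError("Exiting due to error from cluster scheduler: " + message)
  throw new SparkException("Error from cluster scheduler: " + message)
}
{code}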
[jira] [Assigned] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark
[ https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2352: --- Assignee: Bert Greevenbosch (was: Apache Spark) [MLLIB] Add Artificial Neural Network (ANN) to Spark Key: SPARK-2352 URL: https://issues.apache.org/jira/browse/SPARK-2352 Project: Spark Issue Type: New Feature Components: MLlib Environment: MLLIB code Reporter: Bert Greevenbosch Assignee: Bert Greevenbosch It would be good if the Machine Learning Library contained Artificial Neural Networks (ANNs). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3937) Unsafe memory access inside of Snappy library
[ https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486810#comment-14486810 ] Guoqiang Li commented on SPARK-3937: The bug seems to be caused by {{spark.storage.memoryFraction 0.2}}. {{spark.storage.memoryFraction 0.4}} won't appear the bug. These may be related with the size of the RDD. Unsafe memory access inside of Snappy library - Key: SPARK-3937 URL: https://issues.apache.org/jira/browse/SPARK-3937 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Patrick Wendell This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't have much information about this other than the stack trace. However, it was concerning enough I figured I should post it. {code} java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code org.xerial.snappy.SnappyNative.rawUncompress(Native Method) org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444) org.xerial.snappy.Snappy.uncompress(Snappy.java:480) org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355) org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159) org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142) java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712) java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742) java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) scala.collection.AbstractIterator.to(Iterator.scala:1157) scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) scala.collection.AbstractIterator.toArray(Iterator.scala:1157) 
org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (SPARK-2205) Unnecessary exchange operators in a join on multiple tables with the same join key.
[ https://issues.apache.org/jira/browse/SPARK-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-2205: Priority: Critical (was: Minor) Unnecessary exchange operators in a join on multiple tables with the same join key. --- Key: SPARK-2205 URL: https://issues.apache.org/jira/browse/SPARK-2205 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Critical {code} hql("select * from src x join src y on (x.key=y.key) join src z on (y.key=z.key)") SchemaRDD[1] at RDD at SchemaRDD.scala:100 == Query Plan == Project [key#4:0,value#5:1,key#6:2,value#7:3,key#8:4,value#9:5] HashJoin [key#6], [key#8], BuildRight Exchange (HashPartitioning [key#6], 200) HashJoin [key#4], [key#6], BuildRight Exchange (HashPartitioning [key#4], 200) HiveTableScan [key#4,value#5], (MetastoreRelation default, src, Some(x)), None Exchange (HashPartitioning [key#6], 200) HiveTableScan [key#6,value#7], (MetastoreRelation default, src, Some(y)), None Exchange (HashPartitioning [key#8], 200) HiveTableScan [key#8,value#9], (MetastoreRelation default, src, Some(z)), None {code} However, this is fine... {code} hql("select * from src x join src y on (x.key=y.key) join src z on (x.key=z.key)") res5: org.apache.spark.sql.SchemaRDD = SchemaRDD[5] at RDD at SchemaRDD.scala:100 == Query Plan == Project [key#26:0,value#27:1,key#28:2,value#29:3,key#30:4,value#31:5] HashJoin [key#26], [key#30], BuildRight HashJoin [key#26], [key#28], BuildRight Exchange (HashPartitioning [key#26], 200) HiveTableScan [key#26,value#27], (MetastoreRelation default, src, Some(x)), None Exchange (HashPartitioning [key#28], 200) HiveTableScan [key#28,value#29], (MetastoreRelation default, src, Some(y)), None Exchange (HashPartitioning [key#30], 200) HiveTableScan [key#30,value#31], (MetastoreRelation default, src, Some(z)), None {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6816) Add SparkConf API to configure SparkR
[ https://issues.apache.org/jira/browse/SPARK-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488237#comment-14488237 ] Shivaram Venkataraman commented on SPARK-6816: -- Comments from SparkR JIRA Shivaram Venkataraman added a comment - 14/Feb/15 10:32 AM I looked at this recently and I think the existing arguments to `sparkR.init` pretty much cover all the options that are exposed in SparkConf. We could split things out of the function arguments into a separate SparkConf object (something like PySpark https://github.com/apache/spark/blob/master/python/pyspark/conf.py) but the setter-methods don't translate very well to the style we use in SparkR. For example it would be something like setAppName(setMaster(conf, "local"), "SparkR") instead of conf.setMaster().setAppName() The other thing brought up by this JIRA is that we should parse arguments passed to spark-submit or set in spark-defaults.conf. I think this should automatically happen with SPARKR-178 Sun Rui Zongheng Yang Any thoughts on this? concretevitamin Zongheng Yang added a comment - 15/Feb/15 12:07 PM I'm +1 on not using the builder pattern in R. What about using a named list or an environment to simulate a SparkConf? For example, users can write something like: {code} conf <- list(spark.master = "local[2]", spark.executor.memory = "12g") conf $spark.master [1] "local[2]" $spark.executor.memory [1] "12g" {code} and pass the named list to `sparkR.init()`. shivaram Shivaram Venkataraman added a comment - 15/Feb/15 5:50 PM I think the named list might be okay (one thing is that we will have nested named lists for things like executorEnv). However I am not sure if named lists are better than just passing named arguments to `sparkR.init`. I guess the better way to ask my question is what functionality do we want to provide to the users – Right now users can pretty much set anything they want in the SparkConf using sparkR.init One functionality that is missing is printing the conf and, say, inspecting what config variables are set. We could add a getConf(sc) which returns a named list to provide this feature. Is there any other functionality we need? concretevitamin Zongheng Yang added a comment - 21/Feb/15 3:22 PM IMO using a named list provides more flexibility: it's ordinary data that users can operate/transform on. Using only parameter-passing in the constructor locks users into operating on code instead of data. It'd also be easier to just return the saved named list if we're going to implement getConf()? Some relevant discussions: https://aphyr.com/posts/321-builders-vs-option-maps shivaram Shivaram Venkataraman added a comment - 22/Feb/15 4:33 PM Hmm okay - named lists are not quite the same as option maps though. To move forward it'll be good to see how the new API functions we want on the R side should look. Let's keep this discussion open but I'm going to change the priority / description (we are already able to read in spark-defaults.conf now that SPARKR-178 has been merged). Add SparkConf API to configure SparkR - Key: SPARK-6816 URL: https://issues.apache.org/jira/browse/SPARK-6816 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Minor Right now the only way to configure SparkR is to pass in arguments to sparkR.init. 
The goal is to add an API similar to SparkConf on Scala/Python to make configuration easier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486782#comment-14486782 ] Saisai Shao commented on SPARK-6770: From my understanding, the basic scenario of your code is trying to put the Kafka data into a database using JDBC, and you want to leverage SparkSQL for an easy implementation. I think if you want to use the checkpoint file to recover from driver failure, it would be better to write a pure Spark Streaming program; Spark Streaming's checkpointing mechanism only guarantees that streaming-related metadata is written and recovered. The more third-party tools you use, the less the current mechanism can recover. DirectKafkaInputDStream has not been initialized when recovery from checkpoint -- Key: SPARK-6770 URL: https://issues.apache.org/jira/browse/SPARK-6770 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: yangping wu I am reading data from Kafka using the createDirectStream method and saving the received log to MySQL; the code snippet is as follows {code} def functionToCreateContext(): StreamingContext = { val sparkConf = new SparkConf() val sc = new SparkContext(sparkConf) val ssc = new StreamingContext(sc, Seconds(10)) ssc.checkpoint("/tmp/kafka/channel/offset") // set checkpoint directory ssc } val struct = StructType(StructField("log", StringType) :: Nil) // Get StreamingContext from checkpoint data or create a new one val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset", functionToCreateContext) val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics) val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext) SDB.foreachRDD(rdd => { val result = rdd.map(item => { println(item) val result = item._2 match { case e: String => Row.apply(e) case _ => Row.apply() } result }) println(result.count()) val df = sqlContext.createDataFrame(result, struct) df.insertIntoJDBC(url, "test", overwrite = false) }) ssc.start() ssc.awaitTermination() ssc.stop() {code} But when I recover the program from the checkpoint, I encounter an exception: {code} Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not been initialized at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266) at org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at 
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218) at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89) at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67) at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512) at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57) at
[jira] [Commented] (SPARK-2949) SparkContext does not fate-share with ActorSystem
[ https://issues.apache.org/jira/browse/SPARK-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487467#comment-14487467 ] Andrea Peruffo commented on SPARK-2949: --- Which version of Spark? ActorSystem has a method, registerOnTermination, that can trigger callbacks on proper shutdown. By the way, I've seen issues with Netty shutdown under Akka; the only way I've found to live with it is to poll until the OS port becomes free. SparkContext does not fate-share with ActorSystem - Key: SPARK-2949 URL: https://issues.apache.org/jira/browse/SPARK-2949 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Aaron Davidson It appears that an uncaught fatal error in Spark's Driver ActorSystem does not cause the SparkContext to terminate. We observed an issue in production that caused a PermGen error, but it just kept throwing this error: {code} 14/08/09 15:07:24 ERROR ActorSystemImpl: Uncaught fatal error from thread [spark-akka.actor.default-dispatcher-26] shutting down ActorSystem [spark] java.lang.OutOfMemoryError: PermGen space {code} We should probably do something similar to what we did in the DAGScheduler and ensure that we call SparkContext#stop() if the entire ActorSystem dies with a fatal error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
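For illustration, wiring a fate-sharing callback through registerOnTermination might look like this minimal sketch, assuming both the driver ActorSystem and the SparkContext are in hand; this only shows the mechanism the comment refers to, not the fix that was adopted.
{code}
import akka.actor.ActorSystem
import org.apache.spark.SparkContext

// Minimal sketch: if the driver ActorSystem shuts down (including after an
// uncaught fatal error), stop the SparkContext as well so they fate-share.
def fateShare(actorSystem: ActorSystem, sc: SparkContext): Unit = {
  actorSystem.registerOnTermination {
    sc.stop()
  }
}
{code}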
[jira] [Resolved] (SPARK-6343) Make doc more explicit regarding network connectivity requirements
[ https://issues.apache.org/jira/browse/SPARK-6343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6343. -- Resolution: Fixed Fix Version/s: 1.4.0 1.3.2 Issue resolved by pull request 5382 [https://github.com/apache/spark/pull/5382] Make doc more explicit regarding network connectivity requirements -- Key: SPARK-6343 URL: https://issues.apache.org/jira/browse/SPARK-6343 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Peter Parente Priority: Minor Fix For: 1.3.2, 1.4.0 As a new user of Spark, I read through the official documentation before attempting to stand-up my own cluster and write my own driver application. But only after attempting to run my app remotely against my cluster did I realize that full network connectivity (layer 3) is necessary between my driver program and worker nodes (i.e., my driver was *listening* for connections from my workers). I returned to the documentation to see how I had missed this requirement. On a second read-through, I saw that the doc hints at it in a few places (e.g., [driver config|http://spark.apache.org/docs/1.2.0/configuration.html#networking], [submitting applications suggestion|http://spark.apache.org/docs/1.2.0/submitting-applications.html], [cluster overview|http://spark.apache.org/docs/1.2.0/cluster-overview.html]) but never outright says it. I think it would help would-be users better understand how Spark works to state the network connectivity requirements right up-front in the overview section of the doc. I suggest revising the diagram and accompanying text found on the [overview page|http://spark.apache.org/docs/1.2.0/cluster-overview.html]: !http://spark.apache.org/docs/1.2.0/img/cluster-overview.png! so that it depicts at least the directionality of the network connections initiated (perhaps like so): !http://i.imgur.com/2dqGbCr.png! and states that the driver must listen for and accept connections from other Spark components on a variety of ports. Please treat my diagram and text as strawmen: I expect more experienced Spark users and developers will have better ideas on how to convey these requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6808) Checkpointing after zipPartitions results in NODE_LOCAL execution
Xinghao Pan created SPARK-6808: -- Summary: Checkpointing after zipPartitions results in NODE_LOCAL execution Key: SPARK-6808 URL: https://issues.apache.org/jira/browse/SPARK-6808 Project: Spark Issue Type: Bug Components: GraphX, Spark Core Affects Versions: 1.3.0, 1.2.1 Environment: EC2 Ubuntu r3.8xlarge machines Reporter: Xinghao Pan I'm encountering a weird issue where a simple iterative zipPartitions is PROCESS_LOCAL before checkpointing, but turns NODE_LOCAL for all iterations after checkpointing. Here's an example snippet of code: var R : RDD[(Long,Int)] = sc.parallelize((0 until numPartitions), numPartitions) .mapPartitions(_ => new Array[(Long,Int)](1000).map(i => (0L,0)).toSeq.iterator).cache() sc.setCheckpointDir(checkpointDir) var iteration = 0 while (iteration < 50){ R = R.zipPartitions(R)((x,y) => x).cache() if ((iteration+1) % checkpointIter == 0) R.checkpoint() R.foreachPartition(_ => {}) iteration += 1 } I've also tried to unpersist the old RDDs and increased spark.locality.wait, but neither helps. Strangely, adding a simple identity map R = R.map(x => x).cache() after the zipPartitions appears to partially mitigate the issue. The problem was originally triggered when I attempted to checkpoint after doing joinVertices in GraphX, but the above example shows that the issue is in Spark core too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3276) Provide a API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming
[ https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3276: --- Assignee: Apache Spark Provide a API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming -- Key: SPARK-3276 URL: https://issues.apache.org/jira/browse/SPARK-3276 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.2.0 Reporter: Jack Hu Assignee: Apache Spark Priority: Minor Currently there is only one API, textFileStream in StreamingContext, to specify a text file dstream, and it always ignores old files. Sometimes the old files are still useful. We need an API to let the user choose whether the old files should be ignored or not. The API currently in StreamingContext: def textFileStream(directory: String): DStream[String] = { fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
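For reference, StreamingContext already exposes a lower-level fileStream overload with a newFilesOnly flag; what this ticket asks for, in addition, is control over how far back "old" files are remembered (MIN_REMEMBER_DURATION). A sketch of using the existing overload, assuming a directory of text files:
{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.StreamingContext

// Sketch: the existing fileStream overload lets the caller opt in to
// pre-existing files (newFilesOnly = false), but how far back they are
// considered is still capped by the hard-coded remember duration that this
// ticket wants to make configurable.
def oldAndNewTextFiles(ssc: StreamingContext, directory: String) =
  ssc.fileStream[LongWritable, Text, TextInputFormat](
      directory,
      (path: Path) => !path.getName.startsWith("."),  // skip hidden/temp files
      newFilesOnly = false)
    .map(_._2.toString)
{code}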
[jira] [Commented] (SPARK-6803) [SparkR] Support SparkR Streaming
[ https://issues.apache.org/jira/browse/SPARK-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488315#comment-14488315 ] Davies Liu commented on SPARK-6803: --- After a quick look over the prototype, the callback server sits in a different process than the driver, because R does not support multithreading. This approach will have some limitations, for example, accessing shared variables in callback functions. Also, we should have a way to collect the logging from the callback server; it's needed when you run the streaming job as a daemon process, with dstream.pprint(). This prototype is pretty cool, it shows that it's doable to have a Streaming API in R, even with some limitations. But the question is how many users want to do streaming jobs in R? It will take a lot of effort to make it production ready. Even with the Python API, there's lots of work to do, for example, supporting checkpointing and recovery with HDFS. [SparkR] Support SparkR Streaming - Key: SPARK-6803 URL: https://issues.apache.org/jira/browse/SPARK-6803 Project: Spark Issue Type: New Feature Components: SparkR, Streaming Reporter: Hao Fix For: 1.4.0 Adds an R API for Spark Streaming. An experimental version is presented in repo [1], which follows the PySpark streaming design. Also, this PR can be further broken down into sub-task issues. [1] https://github.com/hlin09/spark/tree/SparkR-streaming/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6343) Make doc more explicit regarding network connectivity requirements
[ https://issues.apache.org/jira/browse/SPARK-6343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6343: - Assignee: Peter Parente Make doc more explicit regarding network connectivity requirements -- Key: SPARK-6343 URL: https://issues.apache.org/jira/browse/SPARK-6343 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Peter Parente Assignee: Peter Parente Priority: Minor Fix For: 1.3.2, 1.4.0 As a new user of Spark, I read through the official documentation before attempting to stand-up my own cluster and write my own driver application. But only after attempting to run my app remotely against my cluster did I realize that full network connectivity (layer 3) is necessary between my driver program and worker nodes (i.e., my driver was *listening* for connections from my workers). I returned to the documentation to see how I had missed this requirement. On a second read-through, I saw that the doc hints at it in a few places (e.g., [driver config|http://spark.apache.org/docs/1.2.0/configuration.html#networking], [submitting applications suggestion|http://spark.apache.org/docs/1.2.0/submitting-applications.html], [cluster overview|http://spark.apache.org/docs/1.2.0/cluster-overview.html]) but never outright says it. I think it would help would-be users better understand how Spark works to state the network connectivity requirements right up-front in the overview section of the doc. I suggest revising the diagram and accompanying text found on the [overview page|http://spark.apache.org/docs/1.2.0/cluster-overview.html]: !http://spark.apache.org/docs/1.2.0/img/cluster-overview.png! so that it depicts at least the directionality of the network connections initiated (perhaps like so): !http://i.imgur.com/2dqGbCr.png! and states that the driver must listen for and accept connections from other Spark components on a variety of ports. Please treat my diagram and text as strawmen: I expect more experienced Spark users and developers will have better ideas on how to convey these requirements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6691) Abstract and add a dynamic RateLimiter for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487323#comment-14487323 ] Saisai Shao commented on SPARK-6691: Hi [~tdas], thanks a lot for your comments. I think this proposal not only adds a new dynamic RateLimiter, but also refactors the previous interface to offer a uniform solution for receiver-based input streams as well as direct input streams. It also remains backward compatible; the default logic stays the same as in the previous code. So I think it makes sense to do such a refactor. Regarding the dynamic RateLimiter, I think your suggestion is quite meaningful: we need a good design to balance stability and throughput. Currently my design is simple and straightforward, and we still need to polish it. But from my point of view, stability is more important than throughput for in-production use, since production workloads seldom saturate the network bandwidth of each receiver, while instability is quite critical. Currently Spark Streaming is vulnerable to processing delay, and this delay accumulates and is hard to recover from once we meet an ingestion burst, which is quite normal in a production environment, especially for online services. So a dynamic RateLimiter could solve this problem well; from this point of view it is quite meaningful. My design of the dynamic RateLimiter may not be very sophisticated; I bring it here just to show a possible solution to these issues, so we can improve on it. I will continue to do benchmark and research work on different scenarios. Thanks a lot for your suggestions. Abstract and add a dynamic RateLimiter for Spark Streaming -- Key: SPARK-6691 URL: https://issues.apache.org/jira/browse/SPARK-6691 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Saisai Shao Flow control (or rate control) for input data is very important in a streaming system, especially for Spark Streaming to stay stable and up-to-date. An unexpected flood of incoming data, or an ingestion rate beyond the computation power of the cluster, will make the system unstable and increase the delay. With Spark Streaming's job generation and processing pattern, this delay accumulates and introduces unacceptable exceptions. Currently, in Spark Streaming's receiver-based input stream there is a RateLimiter in BlockGenerator which controls the ingestion rate of input data, but the current implementation has several limitations: # The max ingestion rate is set by the user through configuration beforehand; the user may lack the experience to choose an appropriate value before the application is running. # This configuration is fixed for the lifetime of the application, which means you need to consider the worst-case scenario to set a reasonable value. # Input streams like DirectKafkaInputStream need to maintain another solution to achieve the same functionality. # The lack of slow-start control makes the whole system easily trapped in large processing and scheduling delays at the very beginning. So here we propose a new dynamic RateLimiter, as well as a new interface for the RateLimiter, to improve the whole system's stability. The targets are: * Dynamically adjust the ingestion rate according to the processing rate of previously finished jobs. * Offer a uniform solution not only for receiver-based input streams, but also for direct streams like DirectKafkaInputStream and new ones. 
* A slow-start rate to control network congestion when the job is started. * A pluggable framework to make maintaining extensions easier. A sketch of the core feedback idea follows this description. Here is the design doc (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing) and working branch (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter). Any comment would be greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
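The core feedback idea, adjusting the allowed ingestion rate from the measured processing rate of recently finished batches with a slow start, can be sketched as below. This is only an illustrative sketch of the idea, not the interface or implementation proposed in the design doc; all names are hypothetical.
{code}
// Hypothetical sketch of a dynamic rate limiter: slow-start up to a cap, then
// track the observed processing rate of finished batches.
class DynamicRateLimiter(initialRate: Double, maxRate: Double) {
  @volatile private var currentRate = initialRate

  // Called after each finished batch with how many records it processed and
  // how long that took; the next batch is throttled to roughly that rate.
  def onBatchCompleted(numRecords: Long, processingTimeMs: Long): Unit = {
    if (processingTimeMs > 0) {
      val processedPerSec = numRecords * 1000.0 / processingTimeMs
      // Slow start: never more than double the previous rate in one step.
      currentRate = math.min(maxRate, math.min(processedPerSec, currentRate * 2))
    }
  }

  def rate: Double = currentRate
}
{code}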
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488225#comment-14488225 ] Marko Bonaci commented on SPARK-6646: - Wait a minute, don't postpone this one just yet. Hardest problems often give the biggest yields. Other players in the space, spurred (and a bit frightened) by your announcement, already started acting. Nobody wants to be left behind, so strategies are being worked on: http://app.go.cloudera.com/e/es.aspx?s=1465054361e=177939 bq. Cloudera Wearables ^tm^ Spark 2.0: Rearchitecting Spark for Mobile Platforms Key: SPARK-6646 URL: https://issues.apache.org/jira/browse/SPARK-6646 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Attachments: Spark on Mobile - Design Doc - v1.pdf Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark’s project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users. Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark’s mobile support, and shares results from several early prototypes. Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487177#comment-14487177 ] Apache Spark commented on SPARK-5960: - User 'mce' has created a pull request for this issue: https://github.com/apache/spark/pull/5439 Allow AWS credentials to be passed to KinesisUtils.createStream() - Key: SPARK-5960 URL: https://issues.apache.org/jira/browse/SPARK-5960 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly Assignee: Chris Fregly While IAM roles are preferable, we're seeing a lot of cases where we need to pass AWS credentials when creating the KinesisReceiver. Notes: * Make sure we don't log the credentials anywhere * Maintain compatibility with existing KinesisReceiver-based code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5960: --- Assignee: Chris Fregly (was: Apache Spark) Allow AWS credentials to be passed to KinesisUtils.createStream() - Key: SPARK-5960 URL: https://issues.apache.org/jira/browse/SPARK-5960 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Chris Fregly Assignee: Chris Fregly While IAM roles are preferable, we're seeing a lot of cases where we need to pass AWS credentials when creating the KinesisReceiver. Notes: * Make sure we don't log the credentials anywhere * Maintain compatibility with existing KinesisReceiver-based code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6814) Support sorting for any data type in SparkR
Shivaram Venkataraman created SPARK-6814: Summary: Support sorting for any data type in SparkR Key: SPARK-6814 URL: https://issues.apache.org/jira/browse/SPARK-6814 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman Priority: Critical I get various "return status == 0 is false" and "unimplemented type" errors when trying to get data out of any RDD with top() or collect(). The errors are not consistent. I think Spark is installed properly because some operations do work. I apologize if I'm missing something easy or not providing the right diagnostic info; I'm new to SparkR, and this seems to be the only resource for SparkR issues. Some logs:
{code}
Browse[1]> top(estep.rdd, 1L)
Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) :
  unimplemented type 'list' in 'orderVector1'
Calls: do.call ... Reduce -> Anonymous -> func -> FUN -> FUN -> order
Execution halted
15/02/13 19:11:57 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14)
org.apache.spark.SparkException: R computation failed with
 Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) :
  unimplemented type 'list' in 'orderVector1'
Calls: do.call ... Reduce -> Anonymous -> func -> FUN -> FUN -> order
Execution halted
	at edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:69)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
	at org.apache.spark.scheduler.Task.run(Task.scala:54)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
15/02/13 19:11:57 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 14, localhost): org.apache.spark.SparkException: R computation failed with
 Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) :
  unimplemented type 'list' in 'orderVector1'
Calls: do.call ... Reduce -> Anonymous -> func -> FUN -> FUN -> order
Execution halted
    edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:69)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.
[ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micael Capitão updated SPARK-6800: -- Description: Having a Derby table with people info (id, name, age) defined like this:
{code}
val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true"
val conn = DriverManager.getConnection(jdbcUrl)
val stmt = conn.createStatement()
stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)")
stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)")
{code}
If I try to read that table from Spark SQL with lower/upper bounds, like this:
{code}
val people = sqlContext.jdbc(url = jdbcUrl, table = "Person",
  columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10)
people.show()
{code}
I get this result:
{noformat}
PERSON_ID  NAME              AGE
3          Ana Rita Costa    12
5          Miguel Costa      15
6          Anabela Sintra    13
2          Lurdes Pereira    23
4          Armando Pereira   32
1          Armando Carvalho  50
{noformat}
Which is wrong, considering the defined upper bound has been ignored (I get a person with age 50!). Digging into the code, I've found that the WHERE clauses {{JDBCRelation.columnPartition}} generates are the following:
{code}
(0) age < 4,0
(1) age >= 4 AND age < 8,1
(2) age >= 8 AND age < 12,2
(3) age >= 12 AND age < 16,3
(4) age >= 16 AND age < 20,4
(5) age >= 20 AND age < 24,5
(6) age >= 24 AND age < 28,6
(7) age >= 28 AND age < 32,7
(8) age >= 32 AND age < 36,8
(9) age >= 36,9
{code}
The last condition ignores the upper bound, and the other ones may result in repeated rows being read. Using the JdbcRDD (and converting it to a DataFrame) I would have something like this:
{code}
val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl),
  "SELECT * FROM Person WHERE age >= ? AND age <= ?", 0, 40, 10,
  rs => (rs.getInt(1), rs.getString(2), rs.getInt(3)))
val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE")
people.show()
{code}
Resulting in:
{noformat}
PERSON_ID  NAME             AGE
3          Ana Rita Costa   12
5          Miguel Costa     15
6          Anabela Sintra   13
2          Lurdes Pereira   23
4          Armando Pereira  32
{noformat}
Which is correct! Confirming the WHERE clauses generated by the JdbcRDD in {{getPartitions}}, I've found it generates the following:
{code}
(0) age >= 0 AND age <= 3
(1) age >= 4 AND age <= 7
(2) age >= 8 AND age <= 11
(3) age >= 12 AND age <= 15
(4) age >= 16 AND age <= 19
(5) age >= 20 AND age <= 23
(6) age >= 24 AND age <= 27
(7) age >= 28 AND age <= 31
(8) age >= 32 AND age <= 35
(9) age >= 36 AND age <= 40
{code}
This is the behaviour I was expecting from the Spark SQL version. Is the Spark SQL version buggy, or is this some weird expected behaviour?
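For readers comparing the two strategies, here is a minimal sketch of stride-based predicate generation that, like JdbcRDD's {{getPartitions}}, respects both bounds and assigns each row to exactly one partition. This is illustrative only and is not the {{JDBCRelation.columnPartition}} implementation; the method name and the closed-interval choice are assumptions.
{code}
// Illustrative only, not JDBCRelation.columnPartition. Generates closed-interval
// predicates so rows outside [lowerBound, upperBound] are excluded and no row
// falls into two partitions (for an integer-valued column).
def partitionPredicates(column: String,
                        lowerBound: Long,
                        upperBound: Long,
                        numPartitions: Int): Seq[String] = {
  val stride = (upperBound - lowerBound) / numPartitions
  (0 until numPartitions).map { i =>
    val start = lowerBound + i * stride
    val end = if (i == numPartitions - 1) upperBound else start + stride - 1
    s"$column >= $start AND $column <= $end"
  }
}

// partitionPredicates("age", 0, 40, 10) yields
//   age >= 0 AND age <= 3, age >= 4 AND age <= 7, ..., age >= 36 AND age <= 40
{code}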
[jira] [Updated] (SPARK-6772) spark sql error when running code on large number of records
[ https://issues.apache.org/jira/browse/SPARK-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aditya Parmar updated SPARK-6772: - Description: Hi all, I am getting an ArrayIndexOutOfBoundsException when I try to run a simple column-filtering query on a file with 2.5 lakh (250,000) records. It runs fine on a file with 2k records.
{code}
15/04/09 12:19:01 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1): java.lang.ArrayIndexOutOfBoundsException: 1
	at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142)
	at org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37)
	at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
	at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1060)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
15/04/09 12:19:01 INFO TaskSetManager: Starting task 1.1 in stage 0.0 (TID 2, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1349 bytes)
15/04/09 12:19:01 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0): java.lang.ArrayIndexOutOfBoundsException
15/04/09 12:19:01 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 3, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1349 bytes)
15/04/09 12:19:01 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 2) on executor : java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 1]
15/04/09 12:19:01 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 4, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1349 bytes)
15/04/09 12:19:01 INFO TaskSetManager: Lost task 1.2 in stage 0.0 (TID 4) on executor : java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 2]
15/04/09 12:19:01 INFO TaskSetManager: Starting task 1.3 in stage 0.0 (TID 5, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1349 bytes)
15/04/09 12:19:01 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 3) on executor : java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 3]
15/04/09 12:19:01 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 6, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1349 bytes)
15/04/09 12:19:02 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 5) on executor : java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 4]
15/04/09 12:19:02 ERROR TaskSetManager: Task 1 in stage 0.0 failed 4 times; aborting job
15/04/09 12:19:02 INFO TaskSchedulerImpl: Cancelling stage 0
15/04/09 12:19:02 INFO TaskSchedulerImpl: Stage 0 was cancelled
15/04/09 12:19:02 INFO DAGScheduler: Job 0 failed: saveAsTextFile at JavaSchemaRDD.scala:42, took 1.958621 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 5, ): java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
{code}
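The failing frame ({{GenericRow.apply}} via {{BoundReference.eval}}) is typical of a malformed input line yielding fewer fields than the schema expects. Below is a hypothetical, minimal sketch of a defensive row-mapping step; the file name, delimiter, and schema are assumptions for illustration and are not taken from the reporter's job.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical schema standing in for the reporter's file layout.
val schema = StructType(Seq(
  StructField("col1", StringType),
  StructField("col2", StringType)))

val sc = new SparkContext(new SparkConf().setAppName("filter-columns-sketch"))

// Keep empty trailing fields (split limit -1) and drop rows that are too short,
// so a projection never indexes past the end of a Row.
val rows = sc.textFile("input.txt")
  .map(_.split(",", -1))
  .filter(_.length >= schema.length)
  .map(fields => Row(fields(0), fields(1)))
{code}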
[jira] [Assigned] (SPARK-6796) Add the batch list to StreamingPage
[ https://issues.apache.org/jira/browse/SPARK-6796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-6796: Assignee: Shixiong Zhu Add the batch list to StreamingPage --- Key: SPARK-6796 URL: https://issues.apache.org/jira/browse/SPARK-6796 Project: Spark Issue Type: Sub-task Components: Streaming, Web UI Reporter: Shixiong Zhu Assignee: Shixiong Zhu Show the list of active and completed batches on the StreamingPage, as proposed in Task 1 of https://docs.google.com/document/d/1-ZjvQ_2thWEQkTxRMHrVdnEI57XTi3wZEBUoqrrDg5c/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487103#comment-14487103 ] Sean Owen commented on SPARK-6770: -- That may be so, but it's not obvious that you simply can't use Spark SQL with Streaming recovery. For example, the final error makes it sound like it very nearly works. Perhaps you just need to use a different constructor to specify the SQLConf? Maybe this value should be serialized with some object? It might be something that is hard to make work now, but I wonder if there is an easy fix to make the SQL objects recoverable. DirectKafkaInputDStream has not been initialized when recovery from checkpoint -- Key: SPARK-6770 URL: https://issues.apache.org/jira/browse/SPARK-6770 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: yangping wu I read data from Kafka using the createDirectStream method and save the received log to MySQL; the code snippets are as follows:
{code}
def functionToCreateContext(): StreamingContext = {
  val sparkConf = new SparkConf()
  val sc = new SparkContext(sparkConf)
  val ssc = new StreamingContext(sc, Seconds(10))
  ssc.checkpoint("/tmp/kafka/channel/offset")  // set checkpoint directory
  ssc
}

val struct = StructType(StructField("log", StringType) :: Nil)
// Get StreamingContext from checkpoint data or create a new one
val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset", functionToCreateContext)
val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
SDB.foreachRDD(rdd => {
  val result = rdd.map(item => {
    println(item)
    val result = item._2 match {
      case e: String => Row.apply(e)
      case _ => Row.apply()
    }
    result
  })
  println(result.count())
  val df = sqlContext.createDataFrame(result, struct)
  df.insertIntoJDBC(url, "test", overwrite = false)
})
ssc.start()
ssc.awaitTermination()
ssc.stop()
{code}
But when I recover the program from the checkpoint, I encounter an exception:
{code}
Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not been initialized
	at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266)
	at org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
	at scala.Option.orElse(Option.scala:257)
	at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
	at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
	at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
	at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
	at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
	at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223)
	at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
	at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218)
	at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89)
	at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67)
	at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512)
	at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57)
	at logstatstreaming.UserChannelTodb.main(UserChannelTodb.scala)
	at
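On the question of making the SQL objects recoverable: a common pattern, shown below as a sketch under the assumption that the captured SQLContext and the DStream built outside the factory are the problem (this is not the resolution recorded on this ticket), is to create every DStream inside the function passed to getOrCreate and to obtain the SQLContext lazily inside foreachRDD, so nothing SQL-related ends up in the checkpoint:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Lazily created, one SQLContext per JVM; it is never captured in checkpointed
// state, so recovery only has to restore the DStream graph.
object SQLContextSingleton {
  @transient private var instance: SQLContext = _
  def getInstance(sc: SparkContext): SQLContext = synchronized {
    if (instance == null) instance = new SQLContext(sc)
    instance
  }
}

def createContext(checkpointDir: String): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf(), Seconds(10))
  ssc.checkpoint(checkpointDir)
  // The Kafka direct stream and all output operations must be set up here,
  // inside the factory, so that they are recreated on checkpoint recovery:
  //   val stream = KafkaUtils.createDirectStream[...](ssc, kafkaParams, topics)
  //   stream.foreachRDD { rdd =>
  //     val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  //     // build the DataFrame and write it to MySQL here
  //   }
  ssc
}

val checkpointDir = "/tmp/kafka/channel/offset"
val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext(checkpointDir))
ssc.start()
ssc.awaitTermination()
{code}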
[jira] [Assigned] (SPARK-6807) Merge recent changes in SparkR-pkg into Spark
[ https://issues.apache.org/jira/browse/SPARK-6807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6807: --- Assignee: Davies Liu (was: Apache Spark) Merge recent changes in SparkR-pkg into Spark - Key: SPARK-6807 URL: https://issues.apache.org/jira/browse/SPARK-6807 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker A few new features landed in SparkR-pkg while the merge was in progress; we should pull them all in. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6807) Merge recent changes in SparkR-pkg into Spark
[ https://issues.apache.org/jira/browse/SPARK-6807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487687#comment-14487687 ] Apache Spark commented on SPARK-6807: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/5436 Merge recent changes in SparkR-pkg into Spark - Key: SPARK-6807 URL: https://issues.apache.org/jira/browse/SPARK-6807 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker A few new features landed in SparkR-pkg while the merge was in progress; we should pull them all in. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6807) Merge recent changes in SparkR-pkg into Spark
[ https://issues.apache.org/jira/browse/SPARK-6807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6807: --- Assignee: Apache Spark (was: Davies Liu) Merge recent changes in SparkR-pkg into Spark - Key: SPARK-6807 URL: https://issues.apache.org/jira/browse/SPARK-6807 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Davies Liu Assignee: Apache Spark Priority: Blocker A few new features landed in SparkR-pkg while the merge was in progress; we should pull them all in. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6812) filter() on DataFrame does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-6812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14488128#comment-14488128 ] Shivaram Venkataraman commented on SPARK-6812: -- Hmm, don't we have a unit test for this? I'm wondering if this is because of the generics not resolving correctly. filter() on DataFrame does not work as expected --- Key: SPARK-6812 URL: https://issues.apache.org/jira/browse/SPARK-6812 Project: Spark Issue Type: Bug Components: SparkR Reporter: Davies Liu Priority: Blocker
{code}
filter(df, df$age > 21)
Error in filter(df, df$age > 21) : no method for coercing this S4 class to a vector
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3947) [Spark SQL] UDAF Support
[ https://issues.apache.org/jira/browse/SPARK-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487175#comment-14487175 ] Takeshi Yamamuro commented on SPARK-3947: - See SPARK-4233; we are refactoring the interfaces of Aggregate before it supports UDAFs. https://issues.apache.org/jira/browse/SPARK-4233 [Spark SQL] UDAF Support Key: SPARK-3947 URL: https://issues.apache.org/jira/browse/SPARK-3947 Project: Spark Issue Type: Improvement Components: SQL Reporter: Pei-Lun Lee Assignee: Venkata Ramana G Right now only Hive UDAFs are supported. It would be nice to have UDAFs, similar to UDFs registered through SQLContext.registerFunction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
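To make the request concrete, here is a purely illustrative sketch of what a non-Hive UDAF interface and registration could look like. The trait name, its methods, and registerAggregateFunction are hypothetical; they are not an existing SQLContext API at the time of this discussion.
{code}
// Purely illustrative sketch of a possible user-defined aggregate contract.
// Names are hypothetical and not part of SQLContext.
trait UserDefinedAggregate[IN, BUF, OUT] {
  def zero: BUF                       // initial buffer value
  def reduce(buf: BUF, in: IN): BUF   // fold one input row into the buffer
  def merge(b1: BUF, b2: BUF): BUF    // combine partial buffers across partitions
  def finish(buf: BUF): OUT           // produce the final aggregate value
}

// Example: a geometric-mean aggregate over Double columns.
object GeometricMean extends UserDefinedAggregate[Double, (Double, Long), Double] {
  def zero = (1.0, 0L)
  def reduce(buf: (Double, Long), in: Double) = (buf._1 * in, buf._2 + 1)
  def merge(b1: (Double, Long), b2: (Double, Long)) = (b1._1 * b2._1, b1._2 + b2._2)
  def finish(buf: (Double, Long)) = math.pow(buf._1, 1.0 / buf._2)
}

// Hypothetical registration, mirroring sqlContext.registerFunction for UDFs:
// sqlContext.registerAggregateFunction("geo_mean", GeometricMean)
// sqlContext.sql("SELECT dept, geo_mean(salary) FROM employees GROUP BY dept")
{code}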