[jira] [Created] (SPARK-5056) Implementing Clara k-medoids clustering algorithm for large datasets
Tomislav Milinovic created SPARK-5056: - Summary: Implementing Clara k-medoids clustering algorithm for large datasets Key: SPARK-5056 URL: https://issues.apache.org/jira/browse/SPARK-5056 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Tomislav Milinovic Priority: Minor There is a specific k-medoids clustering algorithm for large datasets. The algorithm is called Clara in R, and is fully described in chapter 3 of Finding Groups in Data: An Introduction to Cluster Analysis by Kaufman, L. and Rousseeuw, P.J. (1990). The algorithm considers sub-datasets of fixed size (sampsize) such that the time and storage requirements become linear in n rather than quadratic. Each sub-dataset is partitioned into k clusters using the same algorithm as in Partitioning Around Medoids (PAM). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
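For illustration, a minimal driver-side sketch of the CLARA sampling scheme in Scala, assuming Euclidean dissimilarity and substituting PAM's greedy BUILD phase for the full algorithm (real CLARA also runs PAM's swap phase and carries the best medoids found so far into each new sample):

{code}
import scala.util.Random

object ClaraSketch {
  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Average dissimilarity of every point to its nearest medoid.
  def cost(data: Seq[Point], medoids: Seq[Point]): Double =
    data.map(p => medoids.map(dist(p, _)).min).sum / data.size

  // Greedy BUILD phase of PAM: repeatedly add the point that lowers cost most.
  def pamBuild(sample: Seq[Point], k: Int): Seq[Point] =
    (1 to k).foldLeft(Seq.empty[Point]) { (medoids, _) =>
      medoids :+ sample.minBy(c => cost(sample, medoids :+ c))
    }

  // CLARA: cluster several fixed-size samples, and keep the medoids that
  // score best against the FULL dataset, so the expensive quadratic work
  // happens only on samples of size sampsize.
  def clara(data: Seq[Point], k: Int, sampsize: Int, numSamples: Int,
            rng: Random = new Random(42)): Seq[Point] =
    (1 to numSamples)
      .map(_ => pamBuild(rng.shuffle(data).take(sampsize), k))
      .minBy(cost(data, _))
}
{code}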
[jira] [Commented] (SPARK-2165) spark on yarn: add support for setting maxAppAttempts in the ApplicationSubmissionContext
[ https://issues.apache.org/jira/browse/SPARK-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14262881#comment-14262881 ] Apache Spark commented on SPARK-2165: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/3878 spark on yarn: add support for setting maxAppAttempts in the ApplicationSubmissionContext - Key: SPARK-2165 URL: https://issues.apache.org/jira/browse/SPARK-2165 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.0.0 Reporter: Thomas Graves Hadoop 2.x adds support for allowing the application to specify the maximum number of application attempts. We should add support for it by setting it in the ApplicationSubmissionContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
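For context, a sketch of the Hadoop 2.x client API in question; the Spark change would presumably plumb a user-facing setting through to the setMaxAppAttempts call, and everything else here is just scaffolding:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.client.api.YarnClient

object MaxAttemptsSketch {
  def main(args: Array[String]): Unit = {
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new Configuration())
    yarnClient.start()

    val appContext = yarnClient.createApplication().getApplicationSubmissionContext
    // Hadoop 2.x: cap how many times the RM will retry the ApplicationMaster.
    appContext.setMaxAppAttempts(3)
    // ... set the app name, resources, and ContainerLaunchContext, then:
    // yarnClient.submitApplication(appContext)
  }
}
{code}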
[jira] [Commented] (SPARK-5057) Add more details in log when using actor to get infos
[ https://issues.apache.org/jira/browse/SPARK-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14262827#comment-14262827 ] Apache Spark commented on SPARK-5057: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/3875 Add more details in log when using actor to get infos - Key: SPARK-5057 URL: https://issues.apache.org/jira/browse/SPARK-5057 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: WangTaoTheTonic Priority: Minor Since this pattern is used in many places, it would help analysis to log the contents of the message after an attempt fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5058) Typos and broken URL
[ https://issues.apache.org/jira/browse/SPARK-5058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14262843#comment-14262843 ] Apache Spark commented on SPARK-5058: - User 'sigmoidanalytics' has created a pull request for this issue: https://github.com/apache/spark/pull/3877 Typos and broken URL Key: SPARK-5058 URL: https://issues.apache.org/jira/browse/SPARK-5058 Project: Spark Issue Type: Documentation Components: Streaming Affects Versions: 1.2.0 Reporter: AkhlD Priority: Minor Fix For: 1.2.1 The Spark Streaming + Kafka Integration Guide has a broken Examples link. Also, the word "project" is spelled as "projrect". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5058) Typos and broken URL
[ https://issues.apache.org/jira/browse/SPARK-5058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14262842#comment-14262842 ] AkhlD commented on SPARK-5058: -- Created a PR https://github.com/apache/spark/pull/3877 Typos and broken URL Key: SPARK-5058 URL: https://issues.apache.org/jira/browse/SPARK-5058 Project: Spark Issue Type: Documentation Components: Streaming Affects Versions: 1.2.0 Reporter: AkhlD Priority: Minor Fix For: 1.2.1 The Spark Streaming + Kafka Integration Guide has a broken Examples link. Also, the word "project" is spelled as "projrect". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5057) Add more details in log when using actor to get infos
WangTaoTheTonic created SPARK-5057: -- Summary: Add more details in log when using actor to get infos Key: SPARK-5057 URL: https://issues.apache.org/jira/browse/SPARK-5057 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: WangTaoTheTonic Priority: Minor Since this pattern is used in many places, it would help analysis to log the contents of the message after an attempt fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5055) Minor typos on the downloads page
Marko Bonaci created SPARK-5055: --- Summary: Minor typos on the downloads page Key: SPARK-5055 URL: https://issues.apache.org/jira/browse/SPARK-5055 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.2.0 Reporter: Marko Bonaci Priority: Trivial The _Downloads_ page uses the word "Chose" where the present tense is intended. It should say "Choose". http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5058) Typos and broken URL
AkhlD created SPARK-5058: Summary: Typos and broken URL Key: SPARK-5058 URL: https://issues.apache.org/jira/browse/SPARK-5058 Project: Spark Issue Type: Documentation Components: Streaming Affects Versions: 1.2.0 Reporter: AkhlD Priority: Minor Fix For: 1.2.1 The Spark Streaming + Kafka Integration Guide has a broken Examples link. Also, the word "project" is spelled as "projrect". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4943) Parsing error for query with table name having dot
[ https://issues.apache.org/jira/browse/SPARK-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263230#comment-14263230 ] Alex Liu commented on SPARK-4943: - Should we also change the signatures of the Catalog methods to use {code}tableIdentifier: Seq[String]{code} instead of {code}db: Option[String], tableName: String{code}?

{code}
def tableExists(db: Option[String], tableName: String): Boolean

def lookupRelation(
    databaseName: Option[String],
    tableName: String,
    alias: Option[String] = None): LogicalPlan

def registerTable(databaseName: Option[String], tableName: String, plan: LogicalPlan): Unit

def unregisterTable(databaseName: Option[String], tableName: String): Unit

def unregisterAllTables(): Unit

protected def processDatabaseAndTableName(
    databaseName: Option[String],
    tableName: String): (Option[String], String)
{code}

Parsing error for query with table name having dot -- Key: SPARK-4943 URL: https://issues.apache.org/jira/browse/SPARK-4943 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Alex Liu When integrating Spark 1.2.0 with Cassandra SQL, the following query is broken. It was working in Spark 1.1.0. Basically we use a table name containing a dot to include the database name:

{code}
[info] java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but `.' found
[info]
[info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT test2.a FROM sql_test.test2 AS test2
[info] ^
[info] at scala.sys.package$.error(package.scala:27)
[info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
[info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
[info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
[info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
[info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
[info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
[info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info] at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
[info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
[info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
[info] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
[info] at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
[info] at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
[info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31)
[info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
[info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
[info] at scala.Option.getOrElse(Option.scala:120)
[info] at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83)
[info] at org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:53)
[info] at org.apache.spark.sql.cassandra.CassandraSQLContext.sql(CassandraSQLContext.scala:56)
[info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply$mcV$sp(CassandraSQLSpec.scala:169)
[info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168)
[info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168)
[info] at
{code}
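A hypothetical sketch of what the proposed signature change could look like, with a single multi-part identifier replacing the (db, tableName) pairs; the names beyond what the comment above quotes are assumptions:

{code}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

trait Catalog {
  // e.g. Seq("sql_test", "test1") for sql_test.test1
  def tableExists(tableIdentifier: Seq[String]): Boolean
  def lookupRelation(
      tableIdentifier: Seq[String],
      alias: Option[String] = None): LogicalPlan
  def registerTable(tableIdentifier: Seq[String], plan: LogicalPlan): Unit
  def unregisterTable(tableIdentifier: Seq[String]): Unit
  def unregisterAllTables(): Unit
}
{code}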
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263187#comment-14263187 ] Nicholas Chammas commented on SPARK-3821: - I need to brush up on my statistics, but I think the difference between base AMI and Packer AMI is not statistically significant. The benchmark just tested time from instance launch to SSH availability. Nothing was installed or done with the instances after SSH became available. (i.e. I wasn't creating Spark clusters.) I still have to post updated benchmarks for full cluster launches. Is there anything else you wanted to see before reviewing this proposal in more detail? Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263206#comment-14263206 ] Nicholas Chammas commented on SPARK-3821: - I have Packer configured to run {{create_image.sh}}, as well as other scripts I added (e.g. to install Python 2.7), to generate the AMIs I am using. So testing Packer-generated AMIs against manually-generated ones (by running {{create_image.sh}} by hand) should show little difference. Packer is just tooling to automate the application of existing scripts like {{create_image.sh}} towards creating AMIs and other image types like GCE images and Docker images. The goal is to make it easy to generate and update Spark AMIs (and eventually Docker images too) in an automated fashion. Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263193#comment-14263193 ] Shivaram Venkataraman commented on SPARK-3821: -- Yeah you are right that the times are pretty close for Packer and the base AMI. I was just curious if I was missing something. I don't think there is much else I had in mind -- having the full cluster launch times for the existing AMI vs. Packer would be good, and it would also be good to see how Packer compares to images created using [create_image.sh|https://github.com/mesos/spark-ec2/blob/v4/create_image.sh] Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263181#comment-14263181 ] Shivaram Venkataraman commented on SPARK-3821: -- [~nchammas] Thanks for the benchmark. One thing I am curious about is why the Packer AMI is faster than launching just the base Amazon AMI. Is this because we spend some time installing things on the base AMI that we avoid with Packer? Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5061) SQLContext: overload createParquetFile
[ https://issues.apache.org/jira/browse/SPARK-5061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263278#comment-14263278 ] Apache Spark commented on SPARK-5061: - User 'alexbaretta' has created a pull request for this issue: https://github.com/apache/spark/pull/3882 SQLContext: overload createParquetFile -- Key: SPARK-5061 URL: https://issues.apache.org/jira/browse/SPARK-5061 Project: Spark Issue Type: New Feature Reporter: Alex Baretta Overload createParquetFile to support an explicit schema in the form of a StructType object as follows: def createParquetFile(schema: StructType, path: String, allowExisting: Boolean, conf: org.apache.hadoop.conf.Configuration): SchemaRDD -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5061) SQLContext: overload createParquetFile
Alex Baretta created SPARK-5061: --- Summary: SQLContext: overload createParquetFile Key: SPARK-5061 URL: https://issues.apache.org/jira/browse/SPARK-5061 Project: Spark Issue Type: New Feature Reporter: Alex Baretta Overload createParquetFile to support an explicit schema in the form of a StructType object as follows: def createParquetFile(schema: StructType, path: String, allowExisting: Boolean, conf: org.apache.hadoop.conf.Configuration): SchemaRDD -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
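A hedged usage sketch of the proposed overload; it would not compile against stock Spark 1.2, since this overload is exactly what the issue asks for, but the StructType value itself uses the existing public data-type API:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._

object CreateParquetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch"))
    val sqlContext = new SQLContext(sc)

    // Explicit schema instead of a Product type parameter, as the
    // existing createParquetFile[A <: Product] requires.
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true)))

    // The overload proposed in this issue:
    val empty: SchemaRDD = sqlContext.createParquetFile(
      schema, "/tmp/people.parquet", allowExisting = true,
      conf = new Configuration())
    empty.registerTempTable("people")
  }
}
{code}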
[jira] [Created] (SPARK-5059) list of user's objects in Spark REPL
Tomas Hudik created SPARK-5059: -- Summary: list of user's objects in Spark REPL Key: SPARK-5059 URL: https://issues.apache.org/jira/browse/SPARK-5059 Project: Spark Issue Type: New Feature Components: Spark Shell Reporter: Tomas Hudik Priority: Minor Often a user does not remember all the objects he has created in the Spark REPL (shell). It would be helpful to have a command that lists all such objects. E.g. R uses *ls()* to list all objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
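The closest thing available today is the Scala REPL's power mode, hedged here because it relies on REPL internals rather than a supported command, assuming a 2.10/2.11 REPL whose IMain is bound as intp in power mode and exposes definedTerms:

{code}
scala> :power

scala> intp.definedTerms   // term names bound so far in this session
{code}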
[jira] [Commented] (SPARK-4905) Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream
[ https://issues.apache.org/jira/browse/SPARK-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263328#comment-14263328 ] Tathagata Das commented on SPARK-4905: -- [~hshreedharan] Can you take a look at this please!! I have been seeing this once in a while. I had seen this when the test sent one message at a time, and to increase the chances of success, I modified the test to send a whole bunch at a time, repeatedly, until either all got through or nothing got through. But it still seems to be failing. I have no idea why empty strings are being sent when I am trying to send 1, 2, 3, etc. Please take a look. Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream - Key: SPARK-4905 URL: https://issues.apache.org/jira/browse/SPARK-4905 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Josh Rosen Labels: flaky-test It looks like the org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream test might be flaky ([link|https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24647/testReport/junit/org.apache.spark.streaming.flume/FlumeStreamSuite/flume_input_stream/]):

{code}
Error Message

The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100).

Stacktrace

sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100).
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
at org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:142)
at org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74)
at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply$mcV$sp(FlumeStreamSuite.scala:62)
at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62)
at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62)
at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
{code}
[jira] [Updated] (SPARK-5036) Better support sending partial messages in Pregel API
[ https://issues.apache.org/jira/browse/SPARK-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sjk updated SPARK-5036: --- Description:

Better support sending partial messages in Pregel API

1. The requirement

In many iterative graph algorithms, only a part of the vertexes (we call them ActiveVertexes) need to send messages to their neighbours in each iteration. In many cases, ActiveVertexes are the vertexes whose attributes changed between the previous and current iteration. To implement this requirement, we can use the Pregel API + a flag (e.g., `isAttrChanged: Boolean`) in each vertex's attribute. However, after `aggregateMessage` or `mapReduceTriplets` of each iteration, we need to reset this flag to its initial value in every vertex, which needs a heavy `joinVertices`. We found a more efficient way to meet this requirement and want to discuss it here.

Look at a simple example. In the i-th iteration, the previous attribute of each vertex is `Attr` and the newly computed attribute is `NewAttr`:

| VID | Attr | NewAttr | Neighbours |
|:----|:-----|:--------|:-----------|
| 1   | 4    | 5       | 2, 3       |
| 2   | 3    | 2       | 1, 4       |
| 3   | 2    | 2       | 1, 4       |
| 4   | 3    | 4       | 1, 2, 3    |

Our requirement is:

1. Set each vertex's `Attr` to `NewAttr` in the i-th iteration.
2. For each vertex whose `Attr != NewAttr`, send messages to its neighbours in the next iteration's `aggregateMessage`.

We found it hard to implement this requirement efficiently with the current Pregel API: we not only need to perform `pregel()` to compute `NewAttr` (2) but also need to perform `outJoin()` to satisfy (1). A simple idea is to keep an `isAttrChanged: Boolean` (solution 1) or a `flag: Int` (solution 2) in each vertex's attribute.

2. Two solutions

2.1 Solution 1: label and reset `isAttrChanged: Boolean` in the vertex attribute (see attached s1.jpeg)

1. Init the messages with `aggregateMessage`, which returns a message RDD.
2. `innerJoin`: compute the messages on the receiving vertices; return a new VertexRDD holding the values computed by the custom logic function `vprog`, and set `isAttrChanged = true`.
3. `outerJoinVertices`: apply the changed vertices to the whole graph; the graph is now new.
4. `aggregateMessage`: returns a message RDD.
5. `joinVertices`: reset `isAttrChanged` to false in every vertex attribute:

```
// here reset isAttrChanged to false
g = updateG.joinVertices(updateG.vertices) {
  (vid, oriVertex, updateGVertex) => updateGVertex.reset()
}
```

We have to reset the vertex attribute's variable to false here: if we don't reset `isAttrChanged`, the vertex will send a message in the next iteration regardless.

**Result:**
* Edges: 890,041,895
* Vertexes: 181,640,208
* Iterations: 150
* Total cost: 8.4h
* cannot run to convergence (zero messages)

2.2 Solution 2: color the vertexes (see attached s2.jpeg)

The iteration process:

1. `innerJoin`: `vprog` is used as a partially applied function, e.g. `vprog(curIter, _: VertexId, _: VD, _: A)` with `i = i + 1; val curIter = i`. In `vprog`, the user can read `curIter` and assign it to `flag`.
2. `outerJoinVertices`: `graph = graph.outerJoinVertices(changedVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }.cache()`
3. `aggregateMessages`: `sendMsg` is a partially applied function, e.g. `sendMsg(curIter, _: EdgeContext[VD, ED, A])`. **In `sendMsg`, compare `curIter` with `flag` to determine whether to send the message.**

Results on raw data with 181,640,208 vertexes and 890,041,895 edges (the per-iteration figures are averages in seconds, consistent with the totals):

|            | iteration average cost | 150-iteration cost | 420-iteration cost |
|:-----------|:-----------------------|:-------------------|:-------------------|
| solution 1 | 188s                   | 7.8h               | cannot finish      |
| solution 2 | 24s                    | 1.2h               | 3.1h               |
| compare    | ~7x                    | 6.5x               | finished in 3.1h   |

## The end

I think the second solution (Pregel + a flag) is better. It can really support iterative graph algorithms in which only part of the vertexes send messages to their neighbours in each iteration. We shall use it in a production environment.

PR: https://github.com/apache/spark/pull/3866

EOF

was: Better support sending partial messages in Pregel API 1. the reqirement In many iterative graph algorithms, only a part of the vertexes (we call them ActiveVertexes) need to send messages to their neighbours in each iteration. In many cases, ActiveVertexes are the vertexes that their attributes do not change between the previous and current iteration. To implement this requirement, we can use Pregel API + a flag (e.g., `bool isAttrChanged`) in each vertex's attribute. However, after `aggregateMessage` or `mapReduceTriplets` of each iteration, we need to reset this flag to the init value in every vertex, which needs a heavy `joinVertices`. We find a more efficient way to meet this requirement and want to discuss
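A minimal sketch of solution 1's flag check inside sendMsg, with hypothetical attribute and message types (the surrounding loop of innerJoin, outerJoinVertices, and joinVertices is as described in the issue above):

{code}
import org.apache.spark.graphx.EdgeContext

case class Attr(value: Double, isAttrChanged: Boolean)

object PartialMsgSketch {
  // Only edges whose source vertex changed in the previous superstep
  // produce a message; inactive vertices stay silent.
  def sendMsg(ctx: EdgeContext[Attr, Double, Double]): Unit =
    if (ctx.srcAttr.isAttrChanged)
      ctx.sendToDst(ctx.srcAttr.value * ctx.attr)
}
{code}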
[jira] [Updated] (SPARK-5063) Raise more helpful errors when RDD actions or transformations are called inside of transformations
[ https://issues.apache.org/jira/browse/SPARK-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5063: -- Summary: Raise more helpful errors when RDD actions or transformations are called inside of transformations (was: Raise more helpful errors when SparkContext methods are called inside of transformations) Raise more helpful errors when RDD actions or transformations are called inside of transformations -- Key: SPARK-5063 URL: https://issues.apache.org/jira/browse/SPARK-5063 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow: - https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534 - https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399 - https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674 (those are just a sample of the ones that I've answered personally; there are many others). I think that we should add some logic to attempt to detect these sorts of errors: we can use a DynamicVariable to check whether we're inside a task and throw more useful errors when the RDD constructor is called from inside a task or when the SparkContext job submission methods are called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
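A minimal sketch of the DynamicVariable idea, with hypothetical names rather than Spark's actual implementation: executors would run task bodies through runTask, and driver-only entry points would call checkNotInsideTask first:

{code}
import scala.util.DynamicVariable

object TaskGuard {
  private val insideTask = new DynamicVariable[Boolean](false)

  // Executors would wrap each task body in this.
  def runTask[T](body: => T): T = insideTask.withValue(true)(body)

  // RDD constructors and SparkContext job-submission methods would call this.
  def checkNotInsideTask(op: String): Unit =
    if (insideTask.value)
      throw new IllegalStateException(
        s"$op is not allowed inside a task; RDDs can only be created and " +
        "acted on from the driver, not from within other transformations")
}
{code}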
[jira] [Comment Edited] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263287#comment-14263287 ] Kannan Rajah edited comment on SPARK-1529 at 1/2/15 10:53 PM: -- [~lian cheng] [~pwendell] I want to work on this JIRA. It's been a while since there has been any update. So can you please share what the current status is? Has there been a consensus on replacing the file API with a HDFS kind of interface and plugging in the right implementation? I will be looking at the code base in the mean time. was (Author: rkannan82): [~lian cheng] [~pwendell]] I want to work on this JIRA. It's been a while since there has been any update. So can you please share what the current status is? Has there been a consensus on replacing the file API with a HDFS kind of interface and plugging in the right implementation? I will be looking at the code base in the mean time. Support setting spark.local.dirs to a hadoop FileSystem Key: SPARK-1529 URL: https://issues.apache.org/jira/browse/SPARK-1529 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Cheng Lian In some environments, like with MapR, local volumes are accessed through the Hadoop filesystem interface. We should allow setting spark.local.dir to a Hadoop filesystem location. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263287#comment-14263287 ] Kannan Rajah commented on SPARK-1529: - [~lian cheng] [~pwendell]] I want to work on this JIRA. It's been a while since there has been any update. So can you please share what the current status is? Has there been a consensus on replacing the file API with a HDFS kind of interface and plugging in the right implementation? I will be looking at the code base in the mean time. Support setting spark.local.dirs to a hadoop FileSystem Key: SPARK-1529 URL: https://issues.apache.org/jira/browse/SPARK-1529 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Cheng Lian In some environments, like with MapR, local volumes are accessed through the Hadoop filesystem interface. We should allow setting spark.local.dir to a Hadoop filesystem location. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5062) Pregel use aggregateMessage instead of mapReduceTriplets function
[ https://issues.apache.org/jira/browse/SPARK-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263369#comment-14263369 ] Apache Spark commented on SPARK-5062: - User 'shijinkui' has created a pull request for this issue: https://github.com/apache/spark/pull/3883 Pregel use aggregateMessage instead of mapReduceTriplets function - Key: SPARK-5062 URL: https://issues.apache.org/jira/browse/SPARK-5062 Project: Spark Issue Type: Wish Components: GraphX Reporter: sjk Attachments: graphx_aggreate_msg.jpg Since Spark 1.2 introduced aggregateMessages in place of mapReduceTriplets, and it does indeed improve performance, it is time to replace mapReduceTriplets with aggregateMessages in Pregel. We can discuss it here. I have drawn a diagram of aggregateMessages to show why it improves performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5062) Pregel use aggregateMessage instead of mapReduceTriplets function
[ https://issues.apache.org/jira/browse/SPARK-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sjk updated SPARK-5062: --- Attachment: graphx_aggreate_msg.jpg Pregel use aggregateMessage instead of mapReduceTriplets function - Key: SPARK-5062 URL: https://issues.apache.org/jira/browse/SPARK-5062 Project: Spark Issue Type: Wish Components: GraphX Reporter: sjk Attachments: graphx_aggreate_msg.jpg Since Spark 1.2 introduced aggregateMessages in place of mapReduceTriplets, and it does indeed improve performance, it is time to replace mapReduceTriplets with aggregateMessages in Pregel. We can discuss it here. I have drawn a diagram of aggregateMessages to show why it improves performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5062) Pregel use aggregateMessage instead of mapReduceTriplets function
sjk created SPARK-5062: -- Summary: Pregel use aggregateMessage instead of mapReduceTriplets function Key: SPARK-5062 URL: https://issues.apache.org/jira/browse/SPARK-5062 Project: Spark Issue Type: Wish Components: GraphX Reporter: sjk Since Spark 1.2 introduced aggregateMessages in place of mapReduceTriplets, and it does indeed improve performance, it is time to replace mapReduceTriplets with aggregateMessages in Pregel. We can discuss it here. I have drawn a diagram of aggregateMessages to show why it improves performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
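For reference, the Spark 1.2 API being proposed here, in a simple standalone use (computing each vertex's out-degree); TripletFields.None tells GraphX the send function reads no vertex or edge attributes, so nothing needs to be shipped:

{code}
import org.apache.spark.graphx._

object AggregateMessagesExample {
  def outDegrees[VD, ED](graph: Graph[VD, ED]): VertexRDD[Int] =
    graph.aggregateMessages[Int](
      ctx => ctx.sendToSrc(1),  // one message per out-edge
      _ + _,                    // summed at each vertex
      TripletFields.None)
}
{code}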
[jira] [Commented] (SPARK-5063) Raise more helpful errors when RDD actions or transformations are called inside of transformations
[ https://issues.apache.org/jira/browse/SPARK-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263415#comment-14263415 ] Apache Spark commented on SPARK-5063: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/3884 Raise more helpful errors when RDD actions or transformations are called inside of transformations -- Key: SPARK-5063 URL: https://issues.apache.org/jira/browse/SPARK-5063 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow: - https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534 - https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399 - https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674 (those are just a sample of the ones that I've answered personally; there are many others). I think we can detect these errors by adding logic to {{RDD}} to check whether {{sc}} is null (e.g. turn {{sc}} into a getter function); we can use this to add a better error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
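A sketch of the getter idea from the updated description, on a hypothetical class rather than Spark's RDD: because the context field is @transient, an RDD dragged into a task by closure capture deserializes with it null, and the getter can fail fast with a useful message instead of a later NPE:

{code}
import org.apache.spark.SparkContext

abstract class GuardedRDD(@transient private val _sc: SparkContext)
    extends Serializable {

  // Turn the plain field into a checked getter.
  def sc: SparkContext = {
    if (_sc == null)
      throw new IllegalStateException(
        "This RDD has no SparkContext; it was probably referenced from " +
        "inside another RDD's transformation, which Spark does not support")
    _sc
  }
}
{code}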
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263458#comment-14263458 ] Chip Senkbeil commented on SPARK-4923: -- FYI, this is a blocker for us as well: https://github.com/ibm-et/spark-kernel Maven build should keep publishing spark-repl - Key: SPARK-4923 URL: https://issues.apache.org/jira/browse/SPARK-4923 Project: Spark Issue Type: Bug Components: Build, Spark Shell Affects Versions: 1.2.0 Reporter: Peng Cheng Priority: Critical Labels: shell Attachments: SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch Original Estimate: 1h Remaining Estimate: 1h Spark-repl installation and deployment has been discontinued (see SPARK-3452). But it's in the dependency list of a few projects that extend its initialization process. Please remove the 'skip' setting in spark-repl and make it an 'official' API to encourage more platforms to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org