[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on
[ https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265896#comment-14265896 ] Sean Owen commented on SPARK-3452: -- [~aniket] I think that's a little different. You may find the Spark YARN API you want now in spark-network-yarn. Maven build should skip publishing artifacts people shouldn't depend on --- Key: SPARK-3452 URL: https://issues.apache.org/jira/browse/SPARK-3452 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0, 1.1.0 Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Critical Fix For: 1.2.0 I think it's easy to do this by just adding a skip configuration somewhere. We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4585) Spark dynamic executor allocation shouldn't use maxExecutors as initial number
[ https://issues.apache.org/jira/browse/SPARK-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265915#comment-14265915 ] Lianhui Wang commented on SPARK-4585: - Yes, I think the initial number of executors can be estimated. In most cases, I think it should be the number of tasks in the first stages that run. Spark dynamic executor allocation shouldn't use maxExecutors as initial number -- Key: SPARK-4585 URL: https://issues.apache.org/jira/browse/SPARK-4585 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.1.0 Reporter: Chengxiang Li With SPARK-3174, one can configure a minimum and maximum number of executors for a Spark application on Yarn. However, the application always starts with the maximum. It seems more reasonable, at least for Hive on Spark, to start from the minimum and scale up as needed up to the maximum. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
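[Editorial note] For context, a minimal sketch of how an application could configure the allocation bounds discussed above. The spark.dynamicAllocation.minExecutors/maxExecutors keys exist as of SPARK-3174; spark.dynamicAllocation.initialExecutors is the proposed starting-size knob, and its name is an assumption here, not a setting in the release under discussion.
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("DynamicAllocationSketch")
  .set("spark.dynamicAllocation.enabled", "true")
  // Required for dynamic allocation on YARN so executors can be removed
  // without losing their shuffle output.
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "50")
  // Proposed (assumed name): start small and scale up, rather than
  // starting at maxExecutors.
  .set("spark.dynamicAllocation.initialExecutors", "2")

val sc = new SparkContext(conf)
{code}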
[jira] [Commented] (SPARK-5101) Add common ML math functions
[ https://issues.apache.org/jira/browse/SPARK-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265916#comment-14265916 ] Sean Owen commented on SPARK-5101: -- (Ah, very good point about overflow!) Add common ML math functions Key: SPARK-5101 URL: https://issues.apache.org/jira/browse/SPARK-5101 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: DB Tsai Priority: Minor We can add common ML math functions to MLlib. It may be a little tricky to implement those functions in a numerically stable way. For example, {code} math.log(1 + math.exp(x)) {code} should be implemented as {code} if (x > 0) { x + math.log1p(math.exp(-x)) } else { math.log1p(math.exp(x)) } {code} It becomes hard to maintain if we have multiple copies of the correct implementation in the codebase. A good place for those functions could be `mllib.util.MathFunctions`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
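[Editorial note] To make the overflow point concrete, here is the computation proposed above as a self-contained sketch. The function name log1pExp and the example values are illustrative; the JIRA only suggests a home in mllib.util.MathFunctions.
{code}
import scala.math

// Numerically stable log(1 + exp(x)).
// For large positive x, math.exp(x) overflows to Infinity, so we factor x
// out first; for x <= 0, exp(x) <= 1 and log1p is accurate directly.
def log1pExp(x: Double): Double = {
  if (x > 0) {
    x + math.log1p(math.exp(-x))
  } else {
    math.log1p(math.exp(x))
  }
}

// Naive form overflows: math.log(1 + math.exp(1000.0)) == Double.PositiveInfinity
// Stable form stays finite: log1pExp(1000.0) == 1000.0
{code}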
[jira] [Updated] (SPARK-5100) Spark Thrift server monitor page
[ https://issues.apache.org/jira/browse/SPARK-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated SPARK-5100: --- Attachment: prototype-screenshot.png Spark Thrift server monitor page Key: SPARK-5100 URL: https://issues.apache.org/jira/browse/SPARK-5100 Project: Spark Issue Type: New Feature Components: SQL, Web UI Reporter: Yi Tian Priority: Critical Attachments: Spark Thrift-server monitor page.pdf, prototype-screenshot.png In the latest Spark release, there is a Spark Streaming tab on the driver web UI, which shows information about the running streaming application. It would be helpful to provide a similar monitor page for the Thrift server, because both streaming and Thrift server are long-running applications, and their details do not show up on the stage or job pages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5100) Spark Thrift server monitor page
[ https://issues.apache.org/jira/browse/SPARK-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated SPARK-5100: --- Attachment: (was: Spark Thrift-server monitor page.pdf) Spark Thrift server monitor page Key: SPARK-5100 URL: https://issues.apache.org/jira/browse/SPARK-5100 Project: Spark Issue Type: New Feature Components: SQL, Web UI Reporter: Yi Tian Priority: Critical Attachments: Spark Thrift-server monitor page.pdf, prototype-screenshot.png In the latest Spark release, there is a Spark Streaming tab on the driver web UI, which shows information about the running streaming application. It would be helpful to provide a similar monitor page for the Thrift server, because both streaming and Thrift server are long-running applications, and their details do not show up on the stage or job pages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5100) Spark Thrift server monitor page
[ https://issues.apache.org/jira/browse/SPARK-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated SPARK-5100: --- Attachment: Spark Thrift-server monitor page.pdf design doc Spark Thrift server monitor page Key: SPARK-5100 URL: https://issues.apache.org/jira/browse/SPARK-5100 Project: Spark Issue Type: New Feature Components: SQL, Web UI Reporter: Yi Tian Priority: Critical Attachments: Spark Thrift-server monitor page.pdf, prototype-screenshot.png In the latest Spark release, there is a Spark Streaming tab on the driver web UI, which shows information about the running streaming application. It would be helpful to provide a similar monitor page for the Thrift server, because both streaming and Thrift server are long-running applications, and their details do not show up on the stage or job pages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5101) Add common ML math functions
[ https://issues.apache.org/jira/browse/SPARK-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5101: - Priority: Minor (was: Major) Add common ML math functions Key: SPARK-5101 URL: https://issues.apache.org/jira/browse/SPARK-5101 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Priority: Minor We can add common ML math functions to MLlib. It may be a little tricky to implement those functions in a numerically stable way. For example, {code} math.log(1 + math.exp(x)) {code} should be implemented as {code} if (x > 0) { x + math.log1p(math.exp(-x)) } else { math.log1p(math.exp(x)) } {code} It becomes hard to maintain if we have multiple copies of the correct implementation in the codebase. A good place for those functions could be `mllib.util.MathFunctions`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-4850) GROUP BY can't work if the schema of SchemaRDD contains struct or array type
[ https://issues.apache.org/jira/browse/SPARK-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chaozhong Yang updated SPARK-4850: -- Comment: was deleted (was: https://issues.apache.org/jira/secure/ViewProfile.jspa?name=lian+cheng ) GROUP BY can't work if the schema of SchemaRDD contains struct or array type -- Key: SPARK-4850 URL: https://issues.apache.org/jira/browse/SPARK-4850 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.2 Reporter: Chaozhong Yang Assignee: Cheng Lian Labels: group, sql Original Estimate: 96h Remaining Estimate: 96h Code in Spark Shell as follows: {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) val path = "path/to/json" sqlContext.jsonFile(path).register("Table") val t = sqlContext.sql("select * from Table group by a") t.collect {code} Let's look into the schema of `Table` {code} root |-- a: integer (nullable = true) |-- arr: array (nullable = true) ||-- element: integer (containsNull = false) |-- createdAt: string (nullable = true) |-- f: struct (nullable = true) ||-- __type: string (nullable = true) ||-- className: string (nullable = true) ||-- objectId: string (nullable = true) |-- objectId: string (nullable = true) |-- s: string (nullable = true) |-- updatedAt: string (nullable = true) {code} An exception will be thrown: {code} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: arr#9, tree: Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14] Subquery TestImport LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], MappedRDD[18] at map at JsonRDD.scala:47 at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125) at scala.Option.foreach(Option.scala:236) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411) at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444) at $iwC$$iwC$$iwC$$iwC.init(console:17) at $iwC$$iwC$$iwC.init(console:22) at
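[Editorial note] For readers hitting this analyzer error: a sketch of a query shape that passes the CheckAggregation rule quoted above, in which every selected column either appears in the GROUP BY clause or sits inside an aggregate (column names taken from the schema in the report). This is a workaround sketch under those assumptions, not the fix tracked by this issue.
{code}
// Hypothetical workaround: select only grouped columns and aggregates.
val grouped = sqlContext.sql(
  "SELECT a, COUNT(*) AS cnt, MAX(createdAt) AS latestCreatedAt " +
  "FROM Table GROUP BY a")
grouped.collect()
{code}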
[jira] [Commented] (SPARK-4850) GROUP BY can't work if the schema of SchemaRDD contains struct or array type
[ https://issues.apache.org/jira/browse/SPARK-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265849#comment-14265849 ] Chaozhong Yang commented on SPARK-4850: --- https://issues.apache.org/jira/secure/ViewProfile.jspa?name=lian+cheng GROUP BY can't work if the schema of SchemaRDD contains struct or array type -- Key: SPARK-4850 URL: https://issues.apache.org/jira/browse/SPARK-4850 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.2 Reporter: Chaozhong Yang Assignee: Cheng Lian Labels: group, sql Original Estimate: 96h Remaining Estimate: 96h Code in Spark Shell as follows: {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) val path = "path/to/json" sqlContext.jsonFile(path).register("Table") val t = sqlContext.sql("select * from Table group by a") t.collect {code} Let's look into the schema of `Table` {code} root |-- a: integer (nullable = true) |-- arr: array (nullable = true) ||-- element: integer (containsNull = false) |-- createdAt: string (nullable = true) |-- f: struct (nullable = true) ||-- __type: string (nullable = true) ||-- className: string (nullable = true) ||-- objectId: string (nullable = true) |-- objectId: string (nullable = true) |-- s: string (nullable = true) |-- updatedAt: string (nullable = true) {code} An exception will be thrown: {code} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: arr#9, tree: Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14] Subquery TestImport LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], MappedRDD[18] at map at JsonRDD.scala:47 at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125) at scala.Option.foreach(Option.scala:236) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411) at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444) at $iwC$$iwC$$iwC$$iwC.init(console:17) at $iwC$$iwC$$iwC.init(console:22)
[jira] [Created] (SPARK-5101) Add common ML math functions
Xiangrui Meng created SPARK-5101: Summary: Add common ML math functions Key: SPARK-5101 URL: https://issues.apache.org/jira/browse/SPARK-5101 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng We can add common ML math functions to MLlib. It may be a little tricky to implement those functions in a numerically stable way. For example, {code} math.log(1 + math.exp(x)) {code} should be implemented as {code} if (x > 0) { x + math.log1p(math.exp(-x)) } else { math.log1p(math.exp(x)) } {code} It becomes hard to maintain if we have multiple copies of the correct implementation in the codebase. A good place for those functions could be `mllib.util.MathFunctions`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5101) Add common ML math functions
[ https://issues.apache.org/jira/browse/SPARK-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5101: - Assignee: DB Tsai Add common ML math functions Key: SPARK-5101 URL: https://issues.apache.org/jira/browse/SPARK-5101 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: DB Tsai Priority: Minor We can add common ML math functions to MLlib. It may be a little tricky to implement those functions in a numerically stable way. For example, {code} math.log(1 + math.exp(x)) {code} should be implemented as {code} if (x > 0) { x + math.log1p(math.exp(-x)) } else { math.log1p(math.exp(x)) } {code} It becomes hard to maintain if we have multiple copies of the correct implementation in the codebase. A good place for those functions could be `mllib.util.MathFunctions`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4905) Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream
[ https://issues.apache.org/jira/browse/SPARK-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265845#comment-14265845 ] Tathagata Das commented on SPARK-4905: -- Any insights yet? Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream - Key: SPARK-4905 URL: https://issues.apache.org/jira/browse/SPARK-4905 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Josh Rosen Labels: flaky-test It looks like the org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream test might be flaky ([link|https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24647/testReport/junit/org.apache.spark.streaming.flume/FlumeStreamSuite/flume_input_stream/]): {code} Error Message The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). Stacktrace sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). 
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:142) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply$mcV$sp(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(FlumeStreamSuite.scala:46) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.flume.FlumeStreamSuite.runTest(FlumeStreamSuite.scala:46) at
[jira] [Commented] (SPARK-4850) GROUP BY can't work if the schema of SchemaRDD contains struct or array type
[ https://issues.apache.org/jira/browse/SPARK-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265848#comment-14265848 ] Chaozhong Yang commented on SPARK-4850: --- Got it, thanks! GROUP BY can't work if the schema of SchemaRDD contains struct or array type -- Key: SPARK-4850 URL: https://issues.apache.org/jira/browse/SPARK-4850 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.2 Reporter: Chaozhong Yang Assignee: Cheng Lian Labels: group, sql Original Estimate: 96h Remaining Estimate: 96h Code in Spark Shell as follows: {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) val path = "path/to/json" sqlContext.jsonFile(path).register("Table") val t = sqlContext.sql("select * from Table group by a") t.collect {code} Let's look into the schema of `Table` {code} root |-- a: integer (nullable = true) |-- arr: array (nullable = true) ||-- element: integer (containsNull = false) |-- createdAt: string (nullable = true) |-- f: struct (nullable = true) ||-- __type: string (nullable = true) ||-- className: string (nullable = true) ||-- objectId: string (nullable = true) |-- objectId: string (nullable = true) |-- s: string (nullable = true) |-- updatedAt: string (nullable = true) {code} An exception will be thrown: {code} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: arr#9, tree: Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14] Subquery TestImport LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], MappedRDD[18] at map at JsonRDD.scala:47 at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125) at scala.Option.foreach(Option.scala:236) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411) at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444) at $iwC$$iwC$$iwC$$iwC.init(console:17) at $iwC$$iwC$$iwC.init(console:22) at $iwC$$iwC.init(console:24) at
[jira] [Comment Edited] (SPARK-4905) Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream
[ https://issues.apache.org/jira/browse/SPARK-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265304#comment-14265304 ] Tathagata Das edited comment on SPARK-4905 at 1/6/15 8:31 AM: -- What is the reason behind such a behavior where the number of records received is same as sent, but all the records are empty? was (Author: tdas): What is the reason behind such a behavior where the number of records received is same as sent, but all the records are empty? TD Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream - Key: SPARK-4905 URL: https://issues.apache.org/jira/browse/SPARK-4905 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Josh Rosen Labels: flaky-test It looks like the org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream test might be flaky ([link|https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24647/testReport/junit/org.apache.spark.streaming.flume/FlumeStreamSuite/flume_input_stream/]): {code} Error Message The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). Stacktrace sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). 
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:142) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply$mcV$sp(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at
[jira] [Commented] (SPARK-4999) No need to put WAL-backed block into block manager by default
[ https://issues.apache.org/jira/browse/SPARK-4999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265811#comment-14265811 ] Apache Spark commented on SPARK-4999: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/3906 No need to put WAL-backed block into block manager by default - Key: SPARK-4999 URL: https://issues.apache.org/jira/browse/SPARK-4999 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Reporter: Saisai Shao Currently a WAL-backed block is read out from HDFS and put into the BlockManager with storage level MEMORY_ONLY_SER by default. Since the WAL-backed block is already fault-tolerant, there is no need to put it into the BlockManager again by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5099) Simplify logistic loss function and fix deviance loss function
Liang-Chi Hsieh created SPARK-5099: -- Summary: Simplify logistic loss function and fix deviance loss function Key: SPARK-5099 URL: https://issues.apache.org/jira/browse/SPARK-5099 Project: Spark Issue Type: Bug Reporter: Liang-Chi Hsieh Priority: Minor This is a minor PR: in the LogisticGradient we can simply negate the margin instead of subtracting it. Mathematically the two are equal, but the modified equation is the common form of the logistic loss function and so is more readable. It also computes a more accurate value, as some quick tests show. Besides, there is a bug in computing the loss in LogLoss; this PR fixes it too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
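[Editorial note] To spell out the equivalence the PR relies on: log(1 + e^m) - m = log((1 + e^m)/e^m) = log(1 + e^-m), so negating the margin yields the textbook logistic loss directly. A small illustrative sketch follows; the function names are hypothetical, not the PR's code.
{code}
import scala.math

// Two mathematically equal forms of the logistic loss for a given margin.
def lossSubtractingMargin(margin: Double): Double =
  math.log1p(math.exp(margin)) - margin // log(1 + e^m) - m

def lossNegatedMargin(margin: Double): Double =
  math.log1p(math.exp(-margin))         // log(1 + e^-m), the textbook form

// Both evaluate to ~0.3133 at margin = 1.0. A fully robust version would
// also branch on the sign of the exponent, as in the SPARK-5101 discussion,
// to avoid overflow for large |margin|.
{code}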
[jira] [Commented] (SPARK-5099) Simplify logistic loss function and fix deviance loss function
[ https://issues.apache.org/jira/browse/SPARK-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265828#comment-14265828 ] Apache Spark commented on SPARK-5099: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/3899 Simplify logistic loss function and fix deviance loss function -- Key: SPARK-5099 URL: https://issues.apache.org/jira/browse/SPARK-5099 Project: Spark Issue Type: Bug Reporter: Liang-Chi Hsieh Priority: Minor This is a minor PR: in the LogisticGradient we can simply negate the margin instead of subtracting it. Mathematically the two are equal, but the modified equation is the common form of the logistic loss function and so is more readable. It also computes a more accurate value, as some quick tests show. Besides, there is a bug in computing the loss in LogLoss; this PR fixes it too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1600) flaky recovery with file input stream test in streaming.CheckpointSuite
[ https://issues.apache.org/jira/browse/SPARK-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-1600. -- Resolution: Fixed Fix Version/s: 1.3.0 flaky recovery with file input stream test in streaming.CheckpointSuite - Key: SPARK-1600 URL: https://issues.apache.org/jira/browse/SPARK-1600 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.2.0 Reporter: Nan Zhu Labels: flaky-test Fix For: 1.3.0 The test case "recovery with file input stream.recovery with file input stream" sometimes fails when Jenkins is very busy, even with an unrelated change. I have hit it 3 times and have also seen it in other places; the latest example is https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14397/ where the modification is only in YARN-related files. I once reported it on the dev mailing list: http://apache-spark-developers-list.1001551.n3.nabble.com/a-weird-test-case-in-Streaming-td6116.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1600) flaky recovery with file input stream test in streaming.CheckpointSuite
[ https://issues.apache.org/jira/browse/SPARK-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1600: - Affects Version/s: (was: 1.3.0) flaky recovery with file input stream test in streaming.CheckpointSuite - Key: SPARK-1600 URL: https://issues.apache.org/jira/browse/SPARK-1600 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.2.0 Reporter: Nan Zhu Labels: flaky-test Fix For: 1.3.0 The test case "recovery with file input stream.recovery with file input stream" sometimes fails when Jenkins is very busy, even with an unrelated change. I have hit it 3 times and have also seen it in other places; the latest example is https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14397/ where the modification is only in YARN-related files. I once reported it on the dev mailing list: http://apache-spark-developers-list.1001551.n3.nabble.com/a-weird-test-case-in-Streaming-td6116.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5102) CompressedMapStatus needs to be registered with Kryo
Daniel Darabos created SPARK-5102: - Summary: CompressedMapStatus needs to be registered with Kryo Key: SPARK-5102 URL: https://issues.apache.org/jira/browse/SPARK-5102 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Daniel Darabos Priority: Minor After upgrading from Spark 1.1.0 to 1.2.0 I got this exception: {code} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.scheduler.CompressedMapStatus Note: To register this class use: kryo.register(org.apache.spark.scheduler.CompressedMapStatus.class); at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442) at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79) at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565) at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:165) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} I had to register {{org.apache.spark.scheduler.CompressedMapStatus}} with Kryo. I think this should be done in {{spark/serializer/KryoSerializer.scala}}, unless instances of this class are not expected to be sent over the wire. (Maybe I'm doing something wrong?) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
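[Editorial note] Until Spark registers this class internally, one workaround is to register it by name from application code, as the exception message suggests. This is a sketch under that assumption; Class.forName avoids a compile-time dependency on Spark's private scheduler package, and SparkConf.registerKryoClasses exists as of Spark 1.2.
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  // Register Spark's private CompressedMapStatus by name, without
  // compiling against the private class itself.
  .registerKryoClasses(Array(
    Class.forName("org.apache.spark.scheduler.CompressedMapStatus")))
{code}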
[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on
[ https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266156#comment-14266156 ] Aniket Bhatnagar commented on SPARK-3452: - OK, I'll test this out by adding a dependency on spark-network-yarn and see how it goes. Fingers crossed! Maven build should skip publishing artifacts people shouldn't depend on --- Key: SPARK-3452 URL: https://issues.apache.org/jira/browse/SPARK-3452 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0, 1.1.0 Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Critical Fix For: 1.2.0 I think it's easy to do this by just adding a skip configuration somewhere. We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5099) Simplify logistic loss function
[ https://issues.apache.org/jira/browse/SPARK-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-5099: --- Description: This is a minor PR: in the LogisticGradient we can simply negate the margin instead of subtracting it. Mathematically the two are equal, but the modified equation is the common form of the logistic loss function and so is more readable. It also computes a more accurate value, as some quick tests show. was: This is a minor PR: in the LogisticGradient we can simply negate the margin instead of subtracting it. Mathematically the two are equal, but the modified equation is the common form of the logistic loss function and so is more readable. It also computes a more accurate value, as some quick tests show. Besides, there is a bug in computing the loss in LogLoss; this PR fixes it too. Simplify logistic loss function --- Key: SPARK-5099 URL: https://issues.apache.org/jira/browse/SPARK-5099 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Priority: Minor This is a minor PR: in the LogisticGradient we can simply negate the margin instead of subtracting it. Mathematically the two are equal, but the modified equation is the common form of the logistic loss function and so is more readable. It also computes a more accurate value, as some quick tests show. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5099) Simplify logistic loss function
[ https://issues.apache.org/jira/browse/SPARK-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-5099: --- Issue Type: Improvement (was: Bug) Simplify logistic loss function --- Key: SPARK-5099 URL: https://issues.apache.org/jira/browse/SPARK-5099 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Priority: Minor This is a minor PR: in the LogisticGradient we can simply negate the margin instead of subtracting it. Mathematically the two are equal, but the modified equation is the common form of the logistic loss function and so is more readable. It also computes a more accurate value, as some quick tests show. Besides, there is a bug in computing the loss in LogLoss; this PR fixes it too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4366) Aggregation Optimization
[ https://issues.apache.org/jira/browse/SPARK-4366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266124#comment-14266124 ] Cheng Hao commented on SPARK-4366: -- [~marmbrus] I've uploaded a draft design doc for the UDAF interface; let me know if you have any concerns or find anything confusing. Aggregation Optimization Key: SPARK-4366 URL: https://issues.apache.org/jira/browse/SPARK-4366 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Attachments: aggregatefunction_v1.pdf This improvement actually includes a couple of sub-tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4366) Aggregation Optimization
[ https://issues.apache.org/jira/browse/SPARK-4366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-4366: - Attachment: aggregatefunction_v1.pdf Draft Design Doc. Aggregation Optimization Key: SPARK-4366 URL: https://issues.apache.org/jira/browse/SPARK-4366 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Attachments: aggregatefunction_v1.pdf This improvement actually includes a couple of sub-tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5103) Add Functionality to Pass Config Options to KeyConverter and ValueConverter in PySpark
Brett Meyer created SPARK-5103: -- Summary: Add Functionality to Pass Config Options to KeyConverter and ValueConverter in PySpark Key: SPARK-5103 URL: https://issues.apache.org/jira/browse/SPARK-5103 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 1.2.0 Reporter: Brett Meyer Priority: Minor Currently, when using the provided PySpark loaders with a KeyConverter or ValueConverter class, there is no way to pass additional information to the converter classes. I would like to add functionality to pass in options, either through configuration that can be set on the SparkContext or through parameters passed to the KeyConverter and ValueConverter classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5110) Spark-on-Yarn does not work on windows platform
Zhan Zhang created SPARK-5110: - Summary: Spark-on-Yarn does not work on windows platform Key: SPARK-5110 URL: https://issues.apache.org/jira/browse/SPARK-5110 Project: Spark Issue Type: Bug Reporter: Zhan Zhang Some of the scripts launching the AM and executors in spark-on-yarn do not work on the Windows platform. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5112) Expose SizeEstimator as a developer API
[ https://issues.apache.org/jira/browse/SPARK-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266646#comment-14266646 ] Apache Spark commented on SPARK-5112: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/3913 Expose SizeEstimator as a developer API --- Key: SPARK-5112 URL: https://issues.apache.org/jira/browse/SPARK-5112 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza Assignee: Sandy Ryza The best way to size the amount of memory consumption your dataset will require is to create an RDD, put it into cache, and look at the SparkContext logs on your driver program. The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD. -the Tuning Spark page This is a pain. It would be much nicer to expose simple functionality for understanding the memory footprint of a Java object. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark
Patrick Wendell created SPARK-5113: -- Summary: Audit and document use of hostnames and IP addresses in Spark Key: SPARK-5113 URL: https://issues.apache.org/jira/browse/SPARK-5113 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Priority: Critical Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. In some cases, that hostname is used as the bind interface also (e.g. I think this happens in the connection manager and possibly akka). In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5075) Memory Leak when repartitioning SchemaRDD or running queries in general
[ https://issues.apache.org/jira/browse/SPARK-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Willard updated SPARK-5075: Summary: Memory Leak when repartitioning SchemaRDD or running queries in general (was: Memory Leak when repartitioning SchemaRDD from JSON) Memory Leak when repartitioning SchemaRDD or running queries in general --- Key: SPARK-5075 URL: https://issues.apache.org/jira/browse/SPARK-5075 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.2.0 Environment: spark-ec2 launched 10 node cluster of type c3.8xlarge Reporter: Brad Willard Labels: ec2, json, parquet, pyspark, repartition, s3 I'm trying to repartition a JSON dataset for better CPU utilization and save it in parquet format for better performance. The JSON dataset is about 200 GB. from pyspark.sql import SQLContext sql_context = SQLContext(sc) rdd = sql_context.jsonFile('s3c://some_path') rdd = rdd.repartition(256) rdd.saveAsParquetFile('hdfs://some_path') In Ganglia, when the dataset first loads, it's about 200 GB in memory, which is expected. However, once it attempts the repartition, it balloons to over 2.5x that in memory, which is never returned, making any subsequent operations fail with memory errors. https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5110) Spark-on-Yarn does not work on windows platform
[ https://issues.apache.org/jira/browse/SPARK-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266599#comment-14266599 ] Sean Owen commented on SPARK-5110: -- [~zhanzhang] are you intending to add any detail to these JIRAs? This looks like a duplicate of at least one of: https://issues.apache.org/jira/browse/SPARK-5034 https://issues.apache.org/jira/browse/SPARK-1825 https://issues.apache.org/jira/browse/SPARK-2221 Spark-on-Yarn does not work on windows platform --- Key: SPARK-5110 URL: https://issues.apache.org/jira/browse/SPARK-5110 Project: Spark Issue Type: Bug Reporter: Zhan Zhang Some of the scripts launching the AM and executors in spark-on-yarn do not work on the Windows platform. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5111) HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5
Zhan Zhang created SPARK-5111: - Summary: HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5 Key: SPARK-5111 URL: https://issues.apache.org/jira/browse/SPARK-5111 Project: Spark Issue Type: Bug Reporter: Zhan Zhang This fails due to a java.lang.NoSuchFieldError: SASL_PROPS error. We need to backport some Hive 0.14 fixes into Spark, since there is no effort yet to upgrade Spark's Hive support to 0.14. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5075) Memory Leak when repartitioning SchemaRDD or running queries in general
[ https://issues.apache.org/jira/browse/SPARK-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Willard updated SPARK-5075: Description: I'm trying to repartition a JSON dataset for better CPU utilization and save it in parquet format for better performance. The JSON dataset is about 200 GB. from pyspark.sql import SQLContext sql_context = SQLContext(sc) rdd = sql_context.jsonFile('s3c://some_path') rdd = rdd.repartition(256) rdd.saveAsParquetFile('hdfs://some_path') In Ganglia, when the dataset first loads, it's about 200 GB in memory, which is expected. However, once it attempts the repartition, it balloons to over 2.5x that in memory, which is never returned, making any subsequent operations fail with memory errors. https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png I'm also seeing a similar memory leak behavior when running repeated queries on a dataset. rdd = sql_context.parquetFile('hdfs://some_path') rdd.registerTempTable('events') sql_context.sql("anything") sql_context.sql("anything") sql_context.sql("anything") sql_context.sql("anything") will result in a memory usage pattern of: http://cl.ly/image/180y2D3d1A0X It seems like intermediate results are not being garbage collected or something. Eventually I have to kill my session to keep running queries. was: I'm trying to repartition a JSON dataset for better CPU utilization and save it in parquet format for better performance. The JSON dataset is about 200 GB. from pyspark.sql import SQLContext sql_context = SQLContext(sc) rdd = sql_context.jsonFile('s3c://some_path') rdd = rdd.repartition(256) rdd.saveAsParquetFile('hdfs://some_path') In Ganglia, when the dataset first loads, it's about 200 GB in memory, which is expected. However, once it attempts the repartition, it balloons to over 2.5x that in memory, which is never returned, making any subsequent operations fail with memory errors. https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png Memory Leak when repartitioning SchemaRDD or running queries in general --- Key: SPARK-5075 URL: https://issues.apache.org/jira/browse/SPARK-5075 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.2.0 Environment: spark-ec2 launched 10 node cluster of type c3.8xlarge Reporter: Brad Willard Labels: ec2, json, parquet, pyspark, repartition, s3 I'm trying to repartition a JSON dataset for better CPU utilization and save it in parquet format for better performance. The JSON dataset is about 200 GB. from pyspark.sql import SQLContext sql_context = SQLContext(sc) rdd = sql_context.jsonFile('s3c://some_path') rdd = rdd.repartition(256) rdd.saveAsParquetFile('hdfs://some_path') In Ganglia, when the dataset first loads, it's about 200 GB in memory, which is expected. However, once it attempts the repartition, it balloons to over 2.5x that in memory, which is never returned, making any subsequent operations fail with memory errors. https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png I'm also seeing a similar memory leak behavior when running repeated queries on a dataset. rdd = sql_context.parquetFile('hdfs://some_path') rdd.registerTempTable('events') sql_context.sql("anything") sql_context.sql("anything") sql_context.sql("anything") sql_context.sql("anything") will result in a memory usage pattern of: http://cl.ly/image/180y2D3d1A0X It seems like intermediate results are not being garbage collected or something. 
Eventually I have to kill my session to keep running queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5107) A trick log info for the start of Receiver
[ https://issues.apache.org/jira/browse/SPARK-5107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266597#comment-14266597 ] Apache Spark commented on SPARK-5107: - User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/3912 Misleading log info for the start of Receiver -- Key: SPARK-5107 URL: https://issues.apache.org/jira/browse/SPARK-5107 Project: Spark Issue Type: Improvement Components: Streaming Reporter: uncleGen Priority: Trivial A Receiver registers itself whenever it begins to start, but it logs the same message each time. In particular, it also registers itself in preStart(), so the log reads as if the receiver has started twice: !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/3.JPG! We could log the information more clearly, e.g. by including the number of start attempts. Of course, this affects neither performance nor usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5112) Expose SizeEstimator as a developer API
Sandy Ryza created SPARK-5112: - Summary: Expose SizeEstimator as a developer API Key: SPARK-5112 URL: https://issues.apache.org/jira/browse/SPARK-5112 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza Assignee: Sandy Ryza "The best way to size the amount of memory consumption your dataset will require is to create an RDD, put it into cache, and look at the SparkContext logs on your driver program. The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD." -- the Tuning Spark page This is a pain. It would be much nicer to simply expose functionality for understanding the memory footprint of a Java object. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
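[Editor's note] To make the proposal concrete, here is a minimal sketch of how a caller might use such an API, assuming {{org.apache.spark.util.SizeEstimator.estimate}} were exposed publicly (the method exists internally today; treating it as user-callable is the assumption here):
{code}
import org.apache.spark.util.SizeEstimator

// Estimate the in-memory footprint of a single record without having to
// cache an entire RDD and read the driver logs.
val record = Array.fill(100)(scala.util.Random.nextDouble())
val bytesPerRecord = SizeEstimator.estimate(record)

// Back-of-the-envelope sizing for a dataset of 10 million such records.
val estimatedMb = bytesPerRecord * 10000000L / (1L << 20)
println(s"~$bytesPerRecord bytes/record, ~$estimatedMb MB for 10M records")
{code}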
[jira] [Updated] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark
[ https://issues.apache.org/jira/browse/SPARK-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5113: --- Description: Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. That hostname is also the one used for the akka system identifier (akka supports only supplying a single name which it uses both as the bind interface and as the actor identifier). In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. The best outcome would be to have three configs that can be set on each machine:
{code}
SPARK_LOCAL_IP          # IP address we bind to for all services
SPARK_INTERNAL_HOSTNAME # Hostname we advertise to remote processes within the cluster
SPARK_EXTERNAL_HOSTNAME # Hostname we advertise to processes outside the cluster (e.g. the UI)
{code}
It's not clear how easily we can support that scheme while providing backwards compatibility. The last one (SPARK_EXTERNAL_HOSTNAME) is easy - it's just an alias for what is now SPARK_PUBLIC_DNS.

was: Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. That hostname is also the one used for the akka system identifier (akka supports only supplying a single name which it uses both as the bind interface and as the actor identifier). In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces.

Audit and document use of hostnames and IP addresses in Spark - Key: SPARK-5113 URL: https://issues.apache.org/jira/browse/SPARK-5113 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Priority: Critical Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. That hostname is also the one used for the akka system identifier (akka supports only supplying a single name which it uses both as the bind interface and as the actor identifier). In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. The best outcome would be to have three configs that can be set on each machine:
{code}
SPARK_LOCAL_IP          # IP address we bind to for all services
SPARK_INTERNAL_HOSTNAME # Hostname we advertise to remote processes within the cluster
SPARK_EXTERNAL_HOSTNAME # Hostname we advertise to processes outside the cluster (e.g. the UI)
{code}
It's not clear how easily we can support that scheme while providing backwards compatibility. The last one (SPARK_EXTERNAL_HOSTNAME) is easy - it's just an alias for what is now SPARK_PUBLIC_DNS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark
[ https://issues.apache.org/jira/browse/SPARK-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5113: --- Description: Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. was: Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. In some cases, that hostname is used as the bind interface also (e.g. I think this happens in the connection manager and possibly akka). In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. Audit and document use of hostnames and IP addresses in Spark - Key: SPARK-5113 URL: https://issues.apache.org/jira/browse/SPARK-5113 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Priority: Critical Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark
[ https://issues.apache.org/jira/browse/SPARK-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5113: --- Description: Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. That hostname is also the one used for the akka system identifier (akka supports only supplying a single name which it uses both as the bind interface and as the actor identifier). In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. was: Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. That hostname is also the one used for the akka system identifier. In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. Audit and document use of hostnames and IP addresses in Spark - Key: SPARK-5113 URL: https://issues.apache.org/jira/browse/SPARK-5113 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Priority: Critical Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. 
That hostname is also the one used for the akka system identifier (akka supports only supplying a single name which it uses both as the bind interface and as the actor identifier). In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
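[Editor's note] For readers unfamiliar with the initialization behavior described above, the following is a rough sketch of the find-a-non-loopback-interface-then-reverse-resolve logic, written against the standard {{java.net}} API; it illustrates the described behavior and is not Spark's actual code:
{code}
import java.net.{InetAddress, NetworkInterface}
import scala.collection.JavaConverters._

// Scan interfaces for the first address that is not a loopback address...
val candidate = NetworkInterface.getNetworkInterfaces.asScala
  .flatMap(_.getInetAddresses.asScala)
  .find(addr => !addr.isLoopbackAddress)

// ...then reverse-resolve it to the hostname that gets advertised.
val advertisedHost = candidate
  .map(_.getCanonicalHostName)
  .getOrElse(InetAddress.getLocalHost.getHostName)
{code}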
[jira] [Updated] (SPARK-5075) Memory Leak when repartitioning SchemaRDD or running queries in general
[ https://issues.apache.org/jira/browse/SPARK-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Willard updated SPARK-5075: Labels: ec2 json memory-leak memory_leak parquet pyspark repartition s3 (was: ec2 json parquet pyspark repartition s3) Memory Leak when repartitioning SchemaRDD or running queries in general --- Key: SPARK-5075 URL: https://issues.apache.org/jira/browse/SPARK-5075 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.2.0 Environment: spark-ec2 launched 10 node cluster of type c3.8xlarge Reporter: Brad Willard Labels: ec2, json, memory-leak, memory_leak, parquet, pyspark, repartition, s3 I'm trying to repartition a JSON dataset for better CPU utilization and save it in Parquet format for better performance. The JSON dataset is about 200GB.
{code}
from pyspark.sql import SQLContext
sql_context = SQLContext(sc)
rdd = sql_context.jsonFile('s3c://some_path')
rdd = rdd.repartition(256)
rdd.saveAsParquetFile('hdfs://some_path')
{code}
In Ganglia, when the dataset first loads it's about 200G in memory, which is expected. However, once it attempts the repartition, memory balloons to over 2.5x that and is never returned, making any subsequent operations fail with memory errors. https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png
I'm also seeing similar memory-leak behavior when running repeated queries on a dataset:
{code}
rdd = sql_context.parquetFile('hdfs://some_path')
rdd.registerTempTable('events')
sql_context.sql("anything")
sql_context.sql("anything")
sql_context.sql("anything")
sql_context.sql("anything")
{code}
This results in a memory usage pattern of: http://cl.ly/image/180y2D3d1A0X
It seems like intermediate results are not being garbage collected. Eventually I have to kill my session to keep running queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4159) Maven build doesn't run JUnit test suites
[ https://issues.apache.org/jira/browse/SPARK-4159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4159: -- Target Version/s: 1.1.1, 1.0.3, 1.2.1 Fix Version/s: 1.3.0 Assignee: Sean Owen Labels: backport-needed (was: ) Maven build doesn't run JUnit test suites - Key: SPARK-4159 URL: https://issues.apache.org/jira/browse/SPARK-4159 Project: Spark Issue Type: Bug Components: Build Reporter: Patrick Wendell Assignee: Sean Owen Priority: Critical Labels: backport-needed Fix For: 1.3.0 It turns out our Maven build isn't running any Java test suites, and likely never has. After some fishing I believe the following is the issue. We use scalatest [1] in our Maven build which, by default, can't automatically detect JUnit tests. Scalatest will allow you to enumerate a list of suites via JUnitClasses, but I can't find a way for it to auto-detect all JUnit tests. It turns out this works in SBT because of our use of the junit-interface [2], which does this for you. An okay fix for this might be to simply enable the normal (surefire) Maven tests in addition to our scalatest in the Maven build. The only thing to watch out for is that they don't overlap in some way. We'd also have to copy over environment variables, memory settings, etc. to that plugin. [1] http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin [2] https://github.com/sbt/junit-interface -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark
[ https://issues.apache.org/jira/browse/SPARK-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5113: --- Description: Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. That hostname is also the one used for the akka system identifier. In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. was: Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. Audit and document use of hostnames and IP addresses in Spark - Key: SPARK-5113 URL: https://issues.apache.org/jira/browse/SPARK-5113 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Priority: Critical Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. That hostname is also the one used for the akka system identifier. In some cases, that hostname is used as the bind hostname also (e.g. 
I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5108) Need to make jackson dependency version consistent with hadoop-2.6.0.
[ https://issues.apache.org/jira/browse/SPARK-5108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266729#comment-14266729 ] Apache Spark commented on SPARK-5108: - User 'zhzhan' has created a pull request for this issue: https://github.com/apache/spark/pull/3914 Need to make jackson dependency version consistent with hadoop-2.6.0. - Key: SPARK-5108 URL: https://issues.apache.org/jira/browse/SPARK-5108 Project: Spark Issue Type: Bug Components: Build Reporter: Zhan Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5018) Make MultivariateGaussian public
[ https://issues.apache.org/jira/browse/SPARK-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266907#comment-14266907 ] Travis Galoppo commented on SPARK-5018: --- Please assign this ticket to me. Make MultivariateGaussian public Key: SPARK-5018 URL: https://issues.apache.org/jira/browse/SPARK-5018 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Critical MultivariateGaussian is currently private[ml], but it would be a useful public class. This JIRA will require defining a good public API for distributions. This JIRA will be needed for finalizing the GaussianMixtureModel API, which should expose MultivariateGaussian instances instead of the means and covariances. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
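[Editor's note] As a sketch of what a public distribution class might minimally carry, here is an illustrative multivariate Gaussian built on Breeze; the class name matches the JIRA, but the API shape shown is an assumption, not Spark's actual class:
{code}
import breeze.linalg.{DenseMatrix, DenseVector, det, inv}

// Illustrative only: a mean vector, a covariance matrix, and a density.
class MultivariateGaussian(mu: DenseVector[Double], sigma: DenseMatrix[Double]) {
  private val sigmaInv = inv(sigma)
  private val norm = 1.0 / math.sqrt(math.pow(2 * math.Pi, mu.length) * det(sigma))

  // pdf(x) = exp(-(x - mu)^T Sigma^-1 (x - mu) / 2) / sqrt((2 pi)^k |Sigma|)
  def pdf(x: DenseVector[Double]): Double = {
    val d = x - mu
    norm * math.exp(-0.5 * (d.t * (sigmaInv * d)))
  }
}
{code}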
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266909#comment-14266909 ] Travis Galoppo commented on SPARK-5019: --- This really can't be completed until MultivariateGaussian is made public (SPARK-5018). Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Blocker The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
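[Editor's note] Concretely, the change requested here amounts to replacing parallel mean/covariance arrays with an array of distribution objects (reusing a MultivariateGaussian class like the sketch above); a hypothetical before/after of the model's public fields, with illustrative names only:
{code}
// Before (illustrative): parallel arrays that callers must zip together.
// class GaussianMixtureModel(weights: Array[Double],
//                            means: Array[Vector], sigmas: Array[Matrix])

// After (illustrative): one self-describing object per mixture component.
class GaussianMixtureModel(
    val weights: Array[Double],
    val gaussians: Array[MultivariateGaussian])
{code}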
[jira] [Updated] (SPARK-5018) Make MultivariateGaussian public
[ https://issues.apache.org/jira/browse/SPARK-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5018: - Assignee: Travis Galoppo Make MultivariateGaussian public Key: SPARK-5018 URL: https://issues.apache.org/jira/browse/SPARK-5018 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Travis Galoppo Priority: Critical MultivariateGaussian is currently private[ml], but it would be a useful public class. This JIRA will require defining a good public API for distributions. This JIRA will be needed for finalizing the GaussianMixtureModel API, which should expose MultivariateGaussian instances instead of the means and covariances. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5114) Should Evaluator be a PipelineStage
[ https://issues.apache.org/jira/browse/SPARK-5114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5114: - Component/s: ML Description: Pipelines can currently contain Estimators and Transformers. Question for debate: Should Pipelines be able to contain Evaluators? Pros: * Evaluators take input datasets with a particular schema, which should perhaps be checked before running a Pipeline. Cons: * Evaluators do not transform datasets. They produce a scalar (or a few values), which makes it hard to say how they fit into a Pipeline or a PipelineModel. Target Version/s: 1.3.0 Affects Version/s: 1.2.0 Summary: Should Evaluator be a PipelineStage (was: Should ) Should Evaluator be a PipelineStage --- Key: SPARK-5114 URL: https://issues.apache.org/jira/browse/SPARK-5114 Project: Spark Issue Type: Question Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Pipelines can currently contain Estimators and Transformers. Question for debate: Should Pipelines be able to contain Evaluators? Pros: * Evaluators take input datasets with a particular schema, which should perhaps be checked before running a Pipeline. Cons: * Evaluators do not transform datasets. They produce a scalar (or a few values), which makes it hard to say how they fit into a Pipeline or a PipelineModel. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
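[Editor's note] To make the debate concrete, here is a simplified sketch of the stage shapes involved; these are simplified stand-ins written against the 1.2-era {{SchemaRDD}} type, not Spark's actual traits:
{code}
import org.apache.spark.sql.SchemaRDD

// Simplified stand-ins for the ML pipeline abstractions under discussion.
abstract class PipelineStage                 // schema checks could live here
abstract class Transformer extends PipelineStage {
  def transform(data: SchemaRDD): SchemaRDD  // dataset in, dataset out
}
abstract class Estimator extends PipelineStage {
  def fit(data: SchemaRDD): Transformer      // dataset in, Transformer out
}
// The sticking point: an Evaluator maps a dataset to a scalar, so it has no
// natural place in a chain of dataset-to-dataset stages.
abstract class Evaluator {
  def evaluate(data: SchemaRDD): Double
}
{code}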
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266986#comment-14266986 ] Kai Sasaki commented on SPARK-5019: --- I'm sorry for submitting a premature PR. Is it OK to ask someone to assign tickets I want to take to me from next time? I don't seem to have the rights to assign issues to myself. I want to check SPARK-5018 and review it. Sorry for disturbing you, [~tgaloppo] Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Blocker The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266997#comment-14266997 ] Joseph K. Bradley commented on SPARK-5019: -- No problem; thanks for your understanding. If you'd like to work on an item, I'd post a comment on the JIRA saying that you want to work on it, asking an admin to assign it to you. Even if an admin does not see it immediately, anyone else who wants to work on the JIRA will see your comment. Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Blocker The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5114) Should
Joseph K. Bradley created SPARK-5114: Summary: Should Key: SPARK-5114 URL: https://issues.apache.org/jira/browse/SPARK-5114 Project: Spark Issue Type: Question Reporter: Joseph K. Bradley -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5115) Intellij fails to find hadoop classes in Spark yarn modules
Ryan Williams created SPARK-5115: - Summary: Intellij fails to find hadoop classes in Spark yarn modules Key: SPARK-5115 URL: https://issues.apache.org/jira/browse/SPARK-5115 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Ryan Williams Intellij's parsing of Spark's POMs works like a charm for the most part; however, it fails to resolve the hadoop and yarn dependencies in the Spark {{yarn}} and {{network/yarn}} modules. Imports and later references to imported classes show up as errors, e.g. !http://f.cl.ly/items/0g3w3s0t45402z30011l/Screen%20Shot%202015-01-06%20at%206.42.52%20PM.png! Opening the module settings, we see that IntelliJ is looking for version {{1.0.4}} of [each yarn JAR that the Spark YARN module depends on|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/yarn/pom.xml#L41-L56], and failing to find them: !http://f.cl.ly/items/2d320l2h2o2N1m0t2X3b/yarn.png! This, in turn, is due to the parent POM [defaulting {{hadoop.version}} to {{1.0.4}}|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122]. AFAIK, having the default-hadoop-version be {{1.0.4}} is not that important and may just be an accident of history; people typically select a Maven profile when building Spark that matches the version of Hadoop that they intend to run with. This suggests one possible fix: bump the default Hadoop version to >= 2. I've tried this locally and it resolves Intellij's difficulties with the yarn and network/yarn modules; [PR #3917|https://github.com/apache/spark/pull/3917] does this. Another fix would be to declare a {{hadoop.version}} property in {{yarn/pom.xml}} and add that to the YARN dependencies in that file; [PR #3918|https://github.com/apache/spark/pull/3918] does that. It is more obvious to me in the former case that the existing rules that govern what {{hadoop.version}} the YARN dependencies inherit will still apply. For the latter, or potentially other ways to configure IntelliJ / Spark's POMs, someone with more Maven/IntelliJ fu may need to chime in. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5115) Intellij fails to find hadoop classes in Spark yarn modules
[ https://issues.apache.org/jira/browse/SPARK-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267035#comment-14267035 ] Apache Spark commented on SPARK-5115: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/3917 Intellij fails to find hadoop classes in Spark yarn modules - Key: SPARK-5115 URL: https://issues.apache.org/jira/browse/SPARK-5115 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Ryan Williams Intellij's parsing of Spark's POMs works like a charm for the most part; however, it fails to resolve the hadoop and yarn dependencies in the Spark {{yarn}} and {{network/yarn}} modules. Imports and later references to imported classes show up as errors, e.g. !http://f.cl.ly/items/0g3w3s0t45402z30011l/Screen%20Shot%202015-01-06%20at%206.42.52%20PM.png! Opening the module settings, we see that IntelliJ is looking for version {{1.0.4}} of [each yarn JAR that the Spark YARN module depends on|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/yarn/pom.xml#L41-L56], and failing to find them: !http://f.cl.ly/items/2d320l2h2o2N1m0t2X3b/yarn.png! This, in turn, is due to the parent POM [defaulting {{hadoop.version}} to {{1.0.4}}|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122]. AFAIK, having the default-hadoop-version be {{1.0.4}} is not that important and may just be an accident of history; people typically select a Maven profile when building Spark that matches the version of Hadoop that they intend to run with. This suggests one possible fix: bump the default Hadoop version to >= 2. I've tried this locally and it resolves Intellij's difficulties with the yarn and network/yarn modules; [PR #3917|https://github.com/apache/spark/pull/3917] does this. Another fix would be to declare a {{hadoop.version}} property in {{yarn/pom.xml}} and add that to the YARN dependencies in that file; [PR #3918|https://github.com/apache/spark/pull/3918] does that. It is more obvious to me in the former case that the existing rules that govern what {{hadoop.version}} the YARN dependencies inherit will still apply. For the latter, or potentially other ways to configure IntelliJ / Spark's POMs, someone with more Maven/IntelliJ fu may need to chime in. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5115) Intellij fails to find hadoop classes in Spark yarn modules
[ https://issues.apache.org/jira/browse/SPARK-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267036#comment-14267036 ] Apache Spark commented on SPARK-5115: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/3918 Intellij fails to find hadoop classes in Spark yarn modules - Key: SPARK-5115 URL: https://issues.apache.org/jira/browse/SPARK-5115 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Ryan Williams Intellij's parsing of Spark's POMs works like a charm for the most part; however, it fails to resolve the hadoop and yarn dependencies in the Spark {{yarn}} and {{network/yarn}} modules. Imports and later references to imported classes show up as errors, e.g. !http://f.cl.ly/items/0g3w3s0t45402z30011l/Screen%20Shot%202015-01-06%20at%206.42.52%20PM.png! Opening the module settings, we see that IntelliJ is looking for version {{1.0.4}} of [each yarn JAR that the Spark YARN module depends on|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/yarn/pom.xml#L41-L56], and failing to find them: !http://f.cl.ly/items/2d320l2h2o2N1m0t2X3b/yarn.png! This, in turn, is due to the parent POM [defaulting {{hadoop.version}} to {{1.0.4}}|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122]. AFAIK, having the default-hadoop-version be {{1.0.4}} is not that important and may just be an accident of history; people typically select a Maven profile when building Spark that matches the version of Hadoop that they intend to run with. This suggests one possible fix: bump the default Hadoop version to >= 2. I've tried this locally and it resolves Intellij's difficulties with the yarn and network/yarn modules; [PR #3917|https://github.com/apache/spark/pull/3917] does this. Another fix would be to declare a {{hadoop.version}} property in {{yarn/pom.xml}} and add that to the YARN dependencies in that file; [PR #3918|https://github.com/apache/spark/pull/3918] does that. It is more obvious to me in the former case that the existing rules that govern what {{hadoop.version}} the YARN dependencies inherit will still apply. For the latter, or potentially other ways to configure IntelliJ / Spark's POMs, someone with more Maven/IntelliJ fu may need to chime in. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267061#comment-14267061 ] Travis Galoppo edited comment on SPARK-5019 at 1/7/15 12:24 AM: No problem, [~lewuathe] ... I have just started work on SPARK-5018. If you would like to revisit this ticket once that one is complete, that would be great! was (Author: tgaloppo): No problem, @lewuathe ... I have just started work on SPARK-5018. If you would like to revisit this ticket once that one is complete, that would be great! Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Blocker The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5115) Intellij fails to find hadoop classes in Spark yarn modules
[ https://issues.apache.org/jira/browse/SPARK-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams updated SPARK-5115: - Description: Intellij's parsing of Spark's POMs works like a charm for the most part; however, it fails to resolve the hadoop and yarn dependencies in the Spark {{yarn}} and {{network/yarn}} modules. Imports and later references to imported classes show up as errors, e.g. !http://f.cl.ly/items/0g3w3s0t45402z30011l/Screen%20Shot%202015-01-06%20at%206.42.52%20PM.png! Opening the module settings, we see that IntelliJ is looking for version {{1.0.4}} of [each yarn JAR that the Spark YARN module depends on|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/yarn/pom.xml#L41-L56], and failing to find them: !http://f.cl.ly/items/2d320l2h2o2N1m0t2X3b/yarn.png! This, in turn, is due to the parent POM [defaulting {{hadoop.version}} to {{1.0.4}}|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122]. AFAIK, having the default-hadoop-version be {{1.0.4}} is not that important and may just be an accident of history; people typically select a Maven profile when building Spark that matches the version of Hadoop that they intend to run with. This suggests one possible fix: bump the default Hadoop version to >= 2. I've tried this locally and it resolves Intellij's difficulties with the yarn and network/yarn modules; [PR #3917|https://github.com/apache/spark/pull/3917] does this. Another fix would be to declare a {{hadoop.version}} property in {{yarn/pom.xml}} and add that to the YARN dependencies in that file; [PR #3918|https://github.com/apache/spark/pull/3918] does that. It is more obvious to me in the former case that the existing rules that govern what {{hadoop.version}} the YARN dependencies inherit will still apply. For the latter, or potentially other ways to configure IntelliJ / Spark's POMs to address this issue, someone with more Maven/IntelliJ fu may need to chime in. was: Intellij's parsing of Spark's POMs works like a charm for the most part; however, it fails to resolve the hadoop and yarn dependencies in the Spark {{yarn}} and {{network/yarn}} modules. Imports and later references to imported classes show up as errors, e.g. !http://f.cl.ly/items/0g3w3s0t45402z30011l/Screen%20Shot%202015-01-06%20at%206.42.52%20PM.png! Opening the module settings, we see that IntelliJ is looking for version {{1.0.4}} of [each yarn JAR that the Spark YARN module depends on|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/yarn/pom.xml#L41-L56], and failing to find them: !http://f.cl.ly/items/2d320l2h2o2N1m0t2X3b/yarn.png! This, in turn, is due to the parent POM [defaulting {{hadoop.version}} to {{1.0.4}}|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122]. AFAIK, having the default-hadoop-version be {{1.0.4}} is not that important and may just be an accident of history; people typically select a Maven profile when building Spark that matches the version of Hadoop that they intend to run with. This suggests one possible fix: bump the default Hadoop version to >= 2. I've tried this locally and it resolves Intellij's difficulties with the yarn and network/yarn modules; [PR #3917|https://github.com/apache/spark/pull/3917] does this. Another fix would be to declare a {{hadoop.version}} property in {{yarn/pom.xml}} and add that to the YARN dependencies in that file; [PR #3918|https://github.com/apache/spark/pull/3918] does that.
It is more obvious to me in the former case that the existing rules that govern what {{hadoop.version}} the YARN dependencies inherit will still apply. For the latter, or potentially other ways to configure IntelliJ / Spark's POMs, someone with more Maven/IntelliJ fu may need to chime in. Intellij fails to find hadoop classes in Spark yarn modules - Key: SPARK-5115 URL: https://issues.apache.org/jira/browse/SPARK-5115 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Ryan Williams Intellij's parsing of Spark's POMs works like a charm for the most part; however, it fails to resolve the hadoop and yarn dependencies in the Spark {{yarn}} and {{network/yarn}} modules. Imports and later references to imported classes show up as errors, e.g. !http://f.cl.ly/items/0g3w3s0t45402z30011l/Screen%20Shot%202015-01-06%20at%206.42.52%20PM.png! Opening the module settings, we see that IntelliJ is looking for version {{1.0.4}} of [each yarn JAR that the Spark YARN module depends on|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/yarn/pom.xml#L41-L56], and failing to find them: !http://f.cl.ly/items/2d320l2h2o2N1m0t2X3b/yarn.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5115) Intellij fails to find hadoop classes in Spark yarn modules
[ https://issues.apache.org/jira/browse/SPARK-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267047#comment-14267047 ] Ryan Williams commented on SPARK-5115: -- FTR, the IntelliJ problem I'm referring to is simply its current inability to resolve imports (see the first image in the OP) and the resulting red errors / loss of various code-inspection functionality. This is not an issue of compilations failing within IntelliJ, which is what your comments about profile-setting would be relevant to. Intellij fails to find hadoop classes in Spark yarn modules - Key: SPARK-5115 URL: https://issues.apache.org/jira/browse/SPARK-5115 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Ryan Williams Intellij's parsing of Spark's POMs works like a charm for the most part; however, it fails to resolve the hadoop and yarn dependencies in the Spark {{yarn}} and {{network/yarn}} modules. Imports and later references to imported classes show up as errors, e.g. !http://f.cl.ly/items/0g3w3s0t45402z30011l/Screen%20Shot%202015-01-06%20at%206.42.52%20PM.png! Opening the module settings, we see that IntelliJ is looking for version {{1.0.4}} of [each yarn JAR that the Spark YARN module depends on|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/yarn/pom.xml#L41-L56], and failing to find them: !http://f.cl.ly/items/2d320l2h2o2N1m0t2X3b/yarn.png! This, in turn, is due to the parent POM [defaulting {{hadoop.version}} to {{1.0.4}}|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122]. AFAIK, having the default-hadoop-version be {{1.0.4}} is not that important and may just be an accident of history; people typically select a Maven profile when building Spark that matches the version of Hadoop that they intend to run with. This suggests one possible fix: bump the default Hadoop version to >= 2. I've tried this locally and it resolves Intellij's difficulties with the yarn and network/yarn modules; [PR #3917|https://github.com/apache/spark/pull/3917] does this. Another fix would be to declare a {{hadoop.version}} property in {{yarn/pom.xml}} and add that to the YARN dependencies in that file; [PR #3918|https://github.com/apache/spark/pull/3918] does that. It is more obvious to me in the former case that the existing rules that govern what {{hadoop.version}} the YARN dependencies inherit will still apply. For the latter, or potentially other ways to configure IntelliJ / Spark's POMs to address this issue, someone with more Maven/IntelliJ fu may need to chime in. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5115) Intellij fails to find hadoop classes in Spark yarn modules
[ https://issues.apache.org/jira/browse/SPARK-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267074#comment-14267074 ] Sean Owen commented on SPARK-5115: -- I just deleted my IntelliJ project config for Spark ({{.idea/}}, all {{.iml}}) and reimported the Maven build from master, choosing all defaults. The build is fine for me* and {{yarn/}} is not even a module, since the {{yarn}} profile, which turns on this module, is not on by default. So I think you have somehow activated the YARN-related module, but it takes another step or two to do that in the build -- activating the {{yarn}} and {{hadoop-2.4}} profiles, for example, is what I do. If I turn on these profiles, reimport the Maven project, and rebuild in IntelliJ, {{yarn}} becomes a module and it builds OK for me. I hope that resolves the compile error you see and gets rid of the red. This is why I'm saying I don't see that there's a basic developer sanity problem to fix. The build seems to do what it's supposed to when put into IntelliJ. To me, separately, the idea of updating the Hadoop default to something more modern (Hadoop 2.4? YARN-enabled?) sounds fine on its own, not because it solves a problem but just because it feels like a more sensible default in 2015. * I find I have to press the 'generate sources' button in IJ before the first build or else Make won't find the generated sources in the flume-sink module, but I think that's not related here ** Hm, I see some crazy-looking compiler errors from the Catalyst DSL package the first time I compile, that then go away, but I also think that's something unrelated or to do with code generation Intellij fails to find hadoop classes in Spark yarn modules - Key: SPARK-5115 URL: https://issues.apache.org/jira/browse/SPARK-5115 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Ryan Williams Intellij's parsing of Spark's POMs works like a charm for the most part; however, it fails to resolve the hadoop and yarn dependencies in the Spark {{yarn}} and {{network/yarn}} modules. Imports and later references to imported classes show up as errors, e.g. !http://f.cl.ly/items/0g3w3s0t45402z30011l/Screen%20Shot%202015-01-06%20at%206.42.52%20PM.png! Opening the module settings, we see that IntelliJ is looking for version {{1.0.4}} of [each yarn JAR that the Spark YARN module depends on|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/yarn/pom.xml#L41-L56], and failing to find them: !http://f.cl.ly/items/2d320l2h2o2N1m0t2X3b/yarn.png! This, in turn, is due to the parent POM [defaulting {{hadoop.version}} to {{1.0.4}}|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122]. AFAIK, having the default-hadoop-version be {{1.0.4}} is not that important and may just be an accident of history; people typically select a Maven profile when building Spark that matches the version of Hadoop that they intend to run with. This suggests one possible fix: bump the default Hadoop version to >= 2. I've tried this locally and it resolves Intellij's difficulties with the yarn and network/yarn modules; [PR #3917|https://github.com/apache/spark/pull/3917] does this. Another fix would be to declare a {{hadoop.version}} property in {{yarn/pom.xml}} and add that to the YARN dependencies in that file; [PR #3918|https://github.com/apache/spark/pull/3918] does that.
It is more obvious to me in the former case that the existing rules that govern what {{hadoop.version}} the YARN dependencies inherit will still apply. For the latter, or potentially other ways to configure IntelliJ / Spark's POMs to address this issue, someone with more Maven/IntelliJ fu may need to chime in. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5108) Need to make jackson dependency version consistent with hadoop-2.6.0.
[ https://issues.apache.org/jira/browse/SPARK-5108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-5108: -- Summary: Need to make jackson dependency version consistent with hadoop-2.6.0. (was: Need to add more jackson dependency for hadoop-2.6.0 support.) Need to make jackson dependency version consistent with hadoop-2.6.0. - Key: SPARK-5108 URL: https://issues.apache.org/jira/browse/SPARK-5108 Project: Spark Issue Type: Bug Components: Build Reporter: Zhan Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266740#comment-14266740 ] Joseph K. Bradley commented on SPARK-5019: -- [~lewuathe] I would recommend getting this JIRA assigned to you before submitting a PR, to make sure no one else is working on it. In particular, I believe [~tgaloppo] was planning on handling this JIRA after his current PR [https://github.com/apache/spark/pull/3871]. Can you please coordinate with him on how to divide up the JIRAs? Thanks! Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Blocker The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5110) Spark-on-Yarn does not work on Windows platform
[ https://issues.apache.org/jira/browse/SPARK-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266744#comment-14266744 ] Zhan Zhang commented on SPARK-5110: --- You are right. I will mark this as a duplicate. Spark-on-Yarn does not work on Windows platform --- Key: SPARK-5110 URL: https://issues.apache.org/jira/browse/SPARK-5110 Project: Spark Issue Type: Bug Reporter: Zhan Zhang Some of the scripts launching the AM and executors in spark-on-yarn do not work on the Windows platform. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5110) Spark-on-Yarn does not work on Windows platform
[ https://issues.apache.org/jira/browse/SPARK-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang closed SPARK-5110. - Resolution: Duplicate Spark-on-Yarn does not work on Windows platform --- Key: SPARK-5110 URL: https://issues.apache.org/jira/browse/SPARK-5110 Project: Spark Issue Type: Bug Reporter: Zhan Zhang Some of the scripts launching the AM and executors in spark-on-yarn do not work on the Windows platform. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4924) Factor out code to launch Spark applications into a separate library
[ https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266897#comment-14266897 ] Apache Spark commented on SPARK-4924: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/3916 Factor out code to launch Spark applications into a separate library Key: SPARK-4924 URL: https://issues.apache.org/jira/browse/SPARK-4924 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Marcelo Vanzin Attachments: spark-launcher.txt One of the questions we run into rather commonly is "how do I start a Spark application from my Java/Scala program?". There currently isn't a good answer to that:
- Instantiating SparkContext has limitations (e.g., you can only have one active context at the moment, plus you lose the ability to submit apps in cluster mode)
- Calling SparkSubmit directly is doable, but you lose a lot of the logic handled by the shell scripts
- Calling the shell script directly is doable, but sort of ugly from an API point of view.
I think it would be nice to have a small library that handles that for users, along the lines of the sketch below. On top of that, this library could be used by Spark itself to replace a lot of the code in the current shell scripts, which have a lot of duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
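[Editor's note] A minimal sketch of what such a library could look like, assuming it simply wraps the existing {{spark-submit}} script in a builder; the class and method names here are hypothetical, not an actual Spark API:
{code}
import scala.collection.mutable.ArrayBuffer

// Hypothetical launcher: builds a spark-submit command line and starts it
// as a child process. A real library would also reproduce the env-var and
// classpath handling currently duplicated across the shell scripts.
class SparkAppLauncher(sparkHome: String) {
  private val opts = ArrayBuffer[String]()
  def setMaster(master: String): this.type = { opts += "--master" += master; this }
  def setMainClass(cls: String): this.type = { opts += "--class" += cls; this }
  def launch(appResource: String, appArgs: String*): Process = {
    val cmd = Seq(s"$sparkHome/bin/spark-submit") ++ opts ++ Seq(appResource) ++ appArgs
    new ProcessBuilder(cmd: _*).start()
  }
}

// Usage sketch:
// val app = new SparkAppLauncher("/opt/spark")
//   .setMaster("yarn-cluster")
//   .setMainClass("com.example.Main")
//   .launch("/path/to/app.jar", "arg1")
{code}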
[jira] [Resolved] (SPARK-5050) Add unit test for sqdist
[ https://issues.apache.org/jira/browse/SPARK-5050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5050. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3869 [https://github.com/apache/spark/pull/3869] Add unit test for sqdist Key: SPARK-5050 URL: https://issues.apache.org/jira/browse/SPARK-5050 Project: Spark Issue Type: Test Reporter: Liang-Chi Hsieh Priority: Minor Fix For: 1.3.0 Related to #3643. Following the suggestion there, add a unit test for sqdist in VectorsSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
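[Editor's note] For context, a sketch of the kind of check such a test would make, assuming {{Vectors.sqdist}} is the method under test: compare it against the naive squared Euclidean distance on mixed dense/sparse inputs.
{code}
import org.apache.spark.mllib.linalg.Vectors

val v1 = Vectors.dense(1.0, 2.0, 3.0)
val v2 = Vectors.sparse(3, Array(0, 2), Array(4.0, -1.0))

// Naive reference: sum of squared component-wise differences.
val expected = v1.toArray.zip(v2.toArray).map { case (a, b) => (a - b) * (a - b) }.sum
assert(math.abs(Vectors.sqdist(v1, v2) - expected) < 1e-9)
{code}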
[jira] [Updated] (SPARK-5050) Add unit test for sqdist
[ https://issues.apache.org/jira/browse/SPARK-5050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5050: - Assignee: Liang-Chi Hsieh Add unit test for sqdist Key: SPARK-5050 URL: https://issues.apache.org/jira/browse/SPARK-5050 Project: Spark Issue Type: Test Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Minor Fix For: 1.3.0 Related to #3643. Following the previous suggestion, add a unit test for sqdist in VectorsSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266589#comment-14266589 ] Apache Spark commented on SPARK-4296: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3910 Throw Expression not in GROUP BY when using same expression in group by clause and select clause --- Key: SPARK-4296 URL: https://issues.apache.org/jira/browse/SPARK-4296 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Shixiong Zhu Assignee: Cheng Lian Priority: Blocker Fix For: 1.2.0 When the input data has a complex structure, using the same expression in the group by clause and the select clause will throw "Expression not in GROUP BY".
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Birthday(date: String)
case class Person(name: String, birthday: Birthday)
val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28"))))
people.registerTempTable("people")
val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
year.collect
{code}
Here is the plan of year:
{code:java}
SchemaRDD[3] at RDD at SchemaRDD.scala:105
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3]
 Subquery people
  LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36
{code}
The bug is the equality test for `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`. Maybe Spark SQL needs a mechanism to compare Alias expressions and non-Alias expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
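One possible direction, sketched as an assumption rather than the actual fix: compare grouping and select expressions only after stripping top-level aliases, so that Upper(birthday#1.date AS date#9) matches Upper(birthday#1.date). The Catalyst types used (Expression, Alias, transform) are real; the helper itself is illustrative:
{code}
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression}

// Replace every Alias node with its child so aliased and unaliased
// forms of the same expression compare equal.
def stripAliases(expr: Expression): Expression =
  expr.transform { case Alias(child, _) => child }

def sameIgnoringAliases(a: Expression, b: Expression): Boolean =
  stripAliases(a) == stripAliases(b)
{code}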
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266598#comment-14266598 ] Apache Spark commented on SPARK-5019: - User 'Lewuathe' has created a pull request for this issue: https://github.com/apache/spark/pull/3911 Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Blocker The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5101) Add common ML math functions
[ https://issues.apache.org/jira/browse/SPARK-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266755#comment-14266755 ] Apache Spark commented on SPARK-5101: - User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/3915 Add common ML math functions Key: SPARK-5101 URL: https://issues.apache.org/jira/browse/SPARK-5101 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: DB Tsai Priority: Minor We can add common ML math functions to MLlib. It may be a little tricky to implement those functions in a numerically stable way. For example,
{code}
math.log(1 + math.exp(x))
{code}
should be implemented as
{code}
if (x > 0) {
  x + math.log1p(math.exp(-x))
} else {
  math.log1p(math.exp(x))
}
{code}
It becomes hard to maintain if we have multiple copies of the correct implementation in the codebase. A good place for those functions could be `mllib.util.MathFunctions`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
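Putting the two branches above together, a minimal self-contained sketch of the stable helper (the log1pExp name is illustrative):
{code}
// Numerically stable log(1 + exp(x)).
// For large positive x, math.exp(x) overflows to Infinity, so factor
// x out first; math.log1p stays accurate when its argument is tiny.
def log1pExp(x: Double): Double =
  if (x > 0) x + math.log1p(math.exp(-x))
  else math.log1p(math.exp(x))

// e.g. log1pExp(1000.0) returns 1000.0, while the naive form returns Infinity.
{code}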
[jira] [Resolved] (SPARK-5017) GaussianMixtureEM should use SVD for Gaussian initialization
[ https://issues.apache.org/jira/browse/SPARK-5017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5017. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3871 [https://github.com/apache/spark/pull/3871] GaussianMixtureEM should use SVD for Gaussian initialization Key: SPARK-5017 URL: https://issues.apache.org/jira/browse/SPARK-5017 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Travis Galoppo Fix For: 1.3.0 GaussianMixtureEM effectively does 2 matrix decompositions in Gaussian initialization (pinv and det). Instead, it should do SVD and use that result to compute the inverse and det. This will also prevent failure when the matrix is singular. Note: Breeze pinv fails when the matrix is singular: [https://github.com/scalanlp/breeze/issues/304] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
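As a hedged sketch of the idea (using Breeze directly; the function name and tolerance are illustrative assumptions, not the actual patch): a single SVD yields both a pseudo-inverse and a pseudo-determinant, and it tolerates singular covariance matrices:
{code}
import breeze.linalg.{diag, svd, DenseMatrix}

def pinvAndDet(m: DenseMatrix[Double], tol: Double = 1e-9): (DenseMatrix[Double], Double) = {
  val svd.SVD(u, s, vt) = svd(m)
  // Invert only singular values above the tolerance; zero out the rest.
  val sInv = s.map(v => if (v > tol) 1.0 / v else 0.0)
  val pinv = vt.t * diag(sInv) * u.t
  // Pseudo-determinant: product of the non-negligible singular values.
  val det = s.toArray.filter(_ > tol).product
  (pinv, det)
}
{code}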
[jira] [Commented] (SPARK-5116) Add extractor for SparseVector and DenseVector in MLlib
[ https://issues.apache.org/jira/browse/SPARK-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267176#comment-14267176 ] Apache Spark commented on SPARK-5116: - User 'coderxiang' has created a pull request for this issue: https://github.com/apache/spark/pull/3919 Add extractor for SparseVector and DenseVector in MLlib Key: SPARK-5116 URL: https://issues.apache.org/jira/browse/SPARK-5116 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Shuo Xiang Priority: Minor Add extractor for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we need to use: vec match { case dv: DenseVector => val values = dv.values ... case sv: SparseVector => val indices = sv.indices val values = sv.values val size = sv.size ... } with extractor it is: vec match { case DenseVector(values) => ... case SparseVector(size, indices, values) => ... } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5116) Add extractor for SparseVector and DenseVector in MLlib
[ https://issues.apache.org/jira/browse/SPARK-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-5116: -- Description: Add extractor for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we need to use:
{code:title=RankingMetrics.scala|borderStyle=solid}
vec match {
  case dv: DenseVector =>
    val values = dv.values
    ...
  case sv: SparseVector =>
    val indices = sv.indices
    val values = sv.values
    val size = sv.size
    ...
}
{code}
with extractor it is:
{code:title=RankingMetrics.scala|borderStyle=solid}
vec match {
  case DenseVector(values) => ...
  case SparseVector(size, indices, values) => ...
}
{code}
was: Add extractor for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we need to use: vec match { case dv: DenseVector => val values = dv.values ... case sv: SparseVector => val indices = sv.indices val values = sv.values val size = sv.size ... } with extractor it is: vec match { case DenseVector(values) => ... case SparseVector(size, indices, values) => ... } Add extractor for SparseVector and DenseVector in MLlib Key: SPARK-5116 URL: https://issues.apache.org/jira/browse/SPARK-5116 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Shuo Xiang Priority: Minor Add extractor for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we need to use:
{code:title=RankingMetrics.scala|borderStyle=solid}
vec match {
  case dv: DenseVector =>
    val values = dv.values
    ...
  case sv: SparseVector =>
    val indices = sv.indices
    val values = sv.values
    val size = sv.size
    ...
}
{code}
with extractor it is:
{code:title=RankingMetrics.scala|borderStyle=solid}
vec match {
  case DenseVector(values) => ...
  case SparseVector(size, indices, values) => ...
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5116) Add extractor for SparseVector and DenseVector in MLlib
[ https://issues.apache.org/jira/browse/SPARK-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-5116: -- Description: Add extractor for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we need to use:
{code:title=A.scala|borderStyle=solid}
vec match {
  case dv: DenseVector =>
    val values = dv.values
    ...
  case sv: SparseVector =>
    val indices = sv.indices
    val values = sv.values
    val size = sv.size
    ...
}
{code}
with extractor it is:
{code:title=B.scala|borderStyle=solid}
vec match {
  case DenseVector(values) => ...
  case SparseVector(size, indices, values) => ...
}
{code}
was: Add extractor for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we need to use:
{code:title=RankingMetrics.scala|borderStyle=solid}
vec match {
  case dv: DenseVector =>
    val values = dv.values
    ...
  case sv: SparseVector =>
    val indices = sv.indices
    val values = sv.values
    val size = sv.size
    ...
}
{code}
with extractor it is:
{code:title=RankingMetrics.scala|borderStyle=solid}
vec match {
  case DenseVector(values) => ...
  case SparseVector(size, indices, values) => ...
}
{code}
Add extractor for SparseVector and DenseVector in MLlib Key: SPARK-5116 URL: https://issues.apache.org/jira/browse/SPARK-5116 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Shuo Xiang Priority: Minor Add extractor for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we need to use:
{code:title=A.scala|borderStyle=solid}
vec match {
  case dv: DenseVector =>
    val values = dv.values
    ...
  case sv: SparseVector =>
    val indices = sv.indices
    val values = sv.values
    val size = sv.size
    ...
}
{code}
with extractor it is:
{code:title=B.scala|borderStyle=solid}
vec match {
  case DenseVector(values) => ...
  case SparseVector(size, indices, values) => ...
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5118) Create table test stored as parquet as select ... report error
guowei created SPARK-5118: - Summary: Create table test stored as parquet as select ... report error Key: SPARK-5118 URL: https://issues.apache.org/jira/browse/SPARK-5118 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: guowei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5121) Stored as parquet doesn't support the CTAS
XiaoJing wang created SPARK-5121: Summary: Stored as parquet doesn't support the CTAS Key: SPARK-5121 URL: https://issues.apache.org/jira/browse/SPARK-5121 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Environment: hive-0.13.1 Reporter: XiaoJing wang Fix For: 1.2.0 In CTAS, stored as parquet is an unsupported Hive feature -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5120) Output the thread name in log4j.properties
[ https://issues.apache.org/jira/browse/SPARK-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] WangTaoTheTonic updated SPARK-5120: --- Issue Type: Improvement (was: Bug) Output the thread name in log4j.properties -- Key: SPARK-5120 URL: https://issues.apache.org/jira/browse/SPARK-5120 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Priority: Minor In most cases the thread name is very useful for analyzing a running job; it is better to log it in log4j.properties. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5120) Output the thread name in log4j.properties
WangTaoTheTonic created SPARK-5120: -- Summary: Output the thread name in log4j.properties Key: SPARK-5120 URL: https://issues.apache.org/jira/browse/SPARK-5120 Project: Spark Issue Type: Bug Components: Deploy Reporter: WangTaoTheTonic Priority: Minor In most cases the thread name is very useful for analyzing a running job; it is better to log it in log4j.properties. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5090) The improvement of python converter for hbase
[ https://issues.apache.org/jira/browse/SPARK-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267184#comment-14267184 ] Apache Spark commented on SPARK-5090: - User 'GenTang' has created a pull request for this issue: https://github.com/apache/spark/pull/3920 The improvement of python converter for hbase - Key: SPARK-5090 URL: https://issues.apache.org/jira/browse/SPARK-5090 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Gen TANG Labels: hbase, python Fix For: 1.2.1 Original Estimate: 168h Remaining Estimate: 168h The python converter `HBaseResultToStringConverter` provided in HBaseConverter.scala returns only the value of the first column in the result. This limits the utility of the converter, because it returns only one value per row (there may be several versions in HBase) and it loses the rest of the record's information, such as column:cell and timestamp. Here we propose an improvement to the python converter so that it returns all the records in the result (in a single string) with more complete information. We would also like to make some improvements to hbase_inputformat.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5118) Create table test stored as parquet as select ... report error
[ https://issues.apache.org/jira/browse/SPARK-5118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267224#comment-14267224 ] Apache Spark commented on SPARK-5118: - User 'guowei2' has created a pull request for this issue: https://github.com/apache/spark/pull/3921 Create table test stored as parquet as select ... report error Key: SPARK-5118 URL: https://issues.apache.org/jira/browse/SPARK-5118 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: guowei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5120) Output the thread name in log4j.properties
[ https://issues.apache.org/jira/browse/SPARK-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267256#comment-14267256 ] Apache Spark commented on SPARK-5120: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/3922 Output the thread name in log4j.properties -- Key: SPARK-5120 URL: https://issues.apache.org/jira/browse/SPARK-5120 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Priority: Minor In most cases the thread name is very useful for analyzing a running job; it is better to log it in log4j.properties. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
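For reference, a hedged example of the change: the %t conversion character in log4j's PatternLayout prints the thread name. The surrounding pattern below is illustrative; the actual default in Spark's conf/log4j.properties may differ in its other conversion specifiers:
{code}
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
# [%t] inserts the thread name into each log line.
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p [%t] %c{1}: %m%n
{code}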
[jira] [Created] (SPARK-5116) Add extractor for SparseVector and DenseVector in MLlib
Shuo Xiang created SPARK-5116: - Summary: Add extractor for SparseVector and DenseVector in MLlib Key: SPARK-5116 URL: https://issues.apache.org/jira/browse/SPARK-5116 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Shuo Xiang Priority: Minor Add extractor for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we need to use: vec match { case dv: DenseVector => val values = dv.values ... case sv: SparseVector => val indices = sv.indices val values = sv.values val size = sv.size ... } with extractor it is: vec match { case DenseVector(values) => ... case SparseVector(size, indices, values) => ... } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
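For illustration, extractors like these are ordinarily implemented with unapply. The sketch below uses standalone objects with hypothetical names; the actual patch would presumably put the unapply methods on the DenseVector and SparseVector companion objects instead:
{code}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}

// Hypothetical extractor objects: unapply exposes the fields that the
// type-test-and-accessor pattern above had to pull out by hand.
object DenseVectorMatch {
  def unapply(dv: DenseVector): Option[Array[Double]] = Some(dv.values)
}

object SparseVectorMatch {
  def unapply(sv: SparseVector): Option[(Int, Array[Int], Array[Double])] =
    Some((sv.size, sv.indices, sv.values))
}

// Usage:
// vec match {
//   case DenseVectorMatch(values) => ...
//   case SparseVectorMatch(size, indices, values) => ...
// }
{code}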
[jira] [Created] (SPARK-5117) Hive Generic UDFs don't cast correctly
Michael Armbrust created SPARK-5117: --- Summary: Hive Generic UDFs don't cast correctly Key: SPARK-5117 URL: https://issues.apache.org/jira/browse/SPARK-5117 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Michael Armbrust Assignee: Cheng Hao Priority: Blocker Here's a test case that is failing in master:
{code}
createQueryTest("generic udf casting", "SELECT LPAD(test, 5, 0) FROM src LIMIT 1")
{code}
This appears to be a regression from Spark 1.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5118) Create table test stored as parquet as select ... report error
[ https://issues.apache.org/jira/browse/SPARK-5118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guowei updated SPARK-5118: -- Description: Caused by: java.lang.RuntimeException: Unhandled clauses: TOK_TBLPARQUETFILE Create table test stored as parquet as select ... report error Key: SPARK-5118 URL: https://issues.apache.org/jira/browse/SPARK-5118 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: guowei Caused by: java.lang.RuntimeException: Unhandled clauses: TOK_TBLPARQUETFILE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
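A hedged reproduction sketch of the statement form named in the summary, issued through HiveContext; the table name and select body are placeholders for the elided parts:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Fails with: java.lang.RuntimeException: Unhandled clauses: TOK_TBLPARQUETFILE
hiveContext.sql("CREATE TABLE test STORED AS PARQUET AS SELECT key, value FROM src")
{code}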
[jira] [Commented] (SPARK-5104) Distributed Representations of Sentences and Documents
[ https://issues.apache.org/jira/browse/SPARK-5104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267226#comment-14267226 ] Guoqiang Li commented on SPARK-5104: Dimension reduction in text classification. It performs better than the LDA algorithm. The algorithm has been implemented in [gensim|https://github.com/piskvorky/gensim/pull/231] Distributed Representations of Sentences and Documents -- Key: SPARK-5104 URL: https://issues.apache.org/jira/browse/SPARK-5104 Project: Spark Issue Type: Wish Components: ML, MLlib Reporter: Guoqiang Li The paper [Distributed Representations of Sentences and Documents|http://arxiv.org/abs/1405.4053] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5018) Make MultivariateGaussian public
[ https://issues.apache.org/jira/browse/SPARK-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267259#comment-14267259 ] Apache Spark commented on SPARK-5018: - User 'tgaloppo' has created a pull request for this issue: https://github.com/apache/spark/pull/3923 Make MultivariateGaussian public Key: SPARK-5018 URL: https://issues.apache.org/jira/browse/SPARK-5018 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Travis Galoppo Priority: Critical MultivariateGaussian is currently private[ml], but it would be a useful public class. This JIRA will require defining a good public API for distributions. This JIRA will be needed for finalizing the GaussianMixtureModel API, which should expose MultivariateGaussian instances instead of the means and covariances. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688
[ https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267168#comment-14267168 ] Jongyoul Lee commented on SPARK-3619: - Ok, I'll handle it. Upgrade to Mesos 0.21 to work around MESOS-1688 --- Key: SPARK-3619 URL: https://issues.apache.org/jira/browse/SPARK-3619 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Matei Zaharia Assignee: Timothy Chen The Mesos 0.21 release has a fix for https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5088) Use spark-class for running executors directly
[ https://issues.apache.org/jira/browse/SPARK-5088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jongyoul Lee updated SPARK-5088: Issue Type: Task (was: Bug) Use spark-class for running executors directly -- Key: SPARK-5088 URL: https://issues.apache.org/jira/browse/SPARK-5088 Project: Spark Issue Type: Task Components: Deploy, Mesos Affects Versions: 1.2.0 Reporter: Jongyoul Lee Priority: Minor
- sbin/spark-executor is only used for running executors in a Mesos environment.
- spark-executor internally calls spark-class without any specific parameters.
- PYTHONPATH setup is moved into spark-class.
- Remove a redundant file to simplify maintenance.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5119) java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model
Vivek Kulkarni created SPARK-5119: - Summary: java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model Key: SPARK-5119 URL: https://issues.apache.org/jira/browse/SPARK-5119 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 1.2.0, 1.1.0 Environment: Linux ubuntu 14.04 Reporter: Vivek Kulkarni First I checked whether a bug with a similar trace had been raised before. I found https://www.mail-archive.com/user@spark.apache.org/msg13708.html but the suggestion to upgrade to the latest code base (I cloned from the master branch) does not fix this issue. Issue: try to train a decision tree classifier on some data. After training, when it begins to collect, it crashes:
15/01/06 22:28:15 INFO BlockManagerMaster: Updated info of block rdd_52_1
15/01/06 22:28:15 ERROR Executor: Exception in task 1.0 in stage 31.0 (TID 1895)
java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.spark.mllib.tree.impurity.GiniAggregator.update(Gini.scala:93)
at org.apache.spark.mllib.tree.impl.DTStatsAggregator.update(DTStatsAggregator.scala:100)
at org.apache.spark.mllib.tree.DecisionTree$.orderedBinSeqOp(DecisionTree.scala:419)
at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$nodeBinSeqOp$1(DecisionTree.scala:511)
at org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:536)
at org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:533)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:533)
at org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628)
at org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
Minimal code:
data = MLUtils.loadLibSVMFile(sc, '/scratch1/vivek/datasets/private/a1a').cache()
model = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={}, maxDepth=5, maxBins=100)
Just download the data from: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
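One plausible cause, inferred from the -1 index in the trace rather than confirmed in the report: the a1a dataset uses -1/+1 labels, while trainClassifier with numClasses=2 expects labels in {0, 1}, so a -1 label would index the Gini aggregator out of bounds. A hedged Scala sketch of remapping the labels before training:
{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

val raw = MLUtils.loadLibSVMFile(sc, "/scratch1/vivek/datasets/private/a1a")
// Map -1/+1 libsvm labels to the 0/1 range that trainClassifier expects.
val data = raw.map(lp => LabeledPoint(if (lp.label > 0) 1.0 else 0.0, lp.features)).cache()
val model = DecisionTree.trainClassifier(data, numClasses = 2,
  categoricalFeaturesInfo = Map.empty[Int, Int], impurity = "gini",
  maxDepth = 5, maxBins = 100)
{code}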
[jira] [Created] (SPARK-5104) Distributed Representations of Sentences and Documents
Guoqiang Li created SPARK-5104: -- Summary: Distributed Representations of Sentences and Documents Key: SPARK-5104 URL: https://issues.apache.org/jira/browse/SPARK-5104 Project: Spark Issue Type: Wish Components: ML, MLlib Reporter: Guoqiang Li The Paper [Distributed Representations of Sentences and Documents|http://arxiv.org/abs/1405.4053] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5122) Remove Shark from spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-5122: Summary: Remove Shark from spark-ec2 (was: Remove Shark from spark-ec2 modules) Remove Shark from spark-ec2 --- Key: SPARK-5122 URL: https://issues.apache.org/jira/browse/SPARK-5122 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5122) Remove Shark from spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-5122: Description: Since Shark has been replaced by Spark SQL, we don't need it in {{spark-ec2}} anymore. (?) Remove Shark from spark-ec2 --- Key: SPARK-5122 URL: https://issues.apache.org/jira/browse/SPARK-5122 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor Since Shark has been replaced by Spark SQL, we don't need it in {{spark-ec2}} anymore. (?) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5122) Remove Shark from spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267281#comment-14267281 ] Nicholas Chammas commented on SPARK-5122: - cc [~shivaram] - Is it appropriate to just remove the Shark module from {{spark-ec2}}? Remove Shark from spark-ec2 --- Key: SPARK-5122 URL: https://issues.apache.org/jira/browse/SPARK-5122 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5123) Expose only one version of the data type APIs (i.e. remove the Java-specific API)
[ https://issues.apache.org/jira/browse/SPARK-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267301#comment-14267301 ] Apache Spark commented on SPARK-5123: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/3925 Expose only one version of the data type APIs (i.e. remove the Java-specific API) - Key: SPARK-5123 URL: https://issues.apache.org/jira/browse/SPARK-5123 Project: Spark Issue Type: Bug Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Having two versions of the data type APIs (one for Java, one for Scala) requires downstream libraries to also have two versions of the APIs if the library wants to support both Java and Scala. I took a look at the Scala version of the data type APIs - it can actually work out pretty well for Java out of the box. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5009) allCaseVersions function in SqlLexical leads to StackOverflow Exception
[ https://issues.apache.org/jira/browse/SPARK-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267282#comment-14267282 ] Apache Spark commented on SPARK-5009: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3924 allCaseVersions function in SqlLexical leads to StackOverflow Exception - Key: SPARK-5009 URL: https://issues.apache.org/jira/browse/SPARK-5009 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1, 1.2.0 Reporter: shengli Fix For: 1.3.0, 1.2.1 Original Estimate: 96h Remaining Estimate: 96h Recently I found a bug when adding a new feature in SqlParser: if I define a Keyword that has a long name, like ```protected val SERDEPROPERTIES = Keyword("SERDEPROPERTIES")```, then, since the all-case version is implemented by a recursive function, when the ```implicit asParser``` function is called and the stack memory is small, it leads to a StackOverflowError.
java.lang.StackOverflowError
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
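For illustration, the enumeration can be written iteratively so that stack depth does not grow with the keyword's length. This sketch is an assumption about the shape of a fix, not the actual patch; note the result still grows exponentially in the number of letters:
{code}
// Stack-safe enumeration of all case variants of a keyword.
def allCaseVersions(s: String): Seq[String] =
  s.foldLeft(Seq("")) { (variants, c) =>
    if (c.isLetter) variants.flatMap(p => Seq(p + c.toLower, p + c.toUpper))
    else variants.map(_ + c)
  }

// allCaseVersions("as") == Seq("as", "aS", "As", "AS")
{code}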
[jira] [Updated] (SPARK-5099) Simplify logistic loss function
[ https://issues.apache.org/jira/browse/SPARK-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5099: - Assignee: Liang-Chi Hsieh Simplify logistic loss function --- Key: SPARK-5099 URL: https://issues.apache.org/jira/browse/SPARK-5099 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Minor Fix For: 1.3.0 This is a minor PR: in LogisticGradient we can simply negate the margin instead of subtracting it. Mathematically the two are equal, but the modified equation is the common form of the logistic loss function and so is more readable. As some quick tests show, it also computes a more accurate value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5099) Simplify logistic loss function
[ https://issues.apache.org/jira/browse/SPARK-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5099. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3899 [https://github.com/apache/spark/pull/3899] Simplify logistic loss function --- Key: SPARK-5099 URL: https://issues.apache.org/jira/browse/SPARK-5099 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Priority: Minor Fix For: 1.3.0 This is a minor PR: in LogisticGradient we can simply negate the margin instead of subtracting it. Mathematically the two are equal, but the modified equation is the common form of the logistic loss function and so is more readable. As some quick tests show, it also computes a more accurate value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
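The identity behind the change, sketched with illustrative values (margin plays the role of the margin variable in LogisticGradient):
{code}
// log(1 + e^m) - m  ==  log((1 + e^m) / e^m)  ==  log(1 + e^(-m))
val margin = 3.7
val subtracted = math.log1p(math.exp(margin)) - margin
val negated = math.log1p(math.exp(-margin))
assert(math.abs(subtracted - negated) < 1e-12)
{code}
The negated form also avoids the cancellation in the subtraction when margin is large, which is consistent with the accuracy improvement reported above.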
[jira] [Commented] (SPARK-5122) Remove Shark from spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267296#comment-14267296 ] Shivaram Venkataraman commented on SPARK-5122: -- Yes, I think removing Shark should be fine. We can also get rid of the Spark-to-Shark version map in spark_ec2.py Remove Shark from spark-ec2 --- Key: SPARK-5122 URL: https://issues.apache.org/jira/browse/SPARK-5122 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor Since Shark has been replaced by Spark SQL, we don't need it in {{spark-ec2}} anymore. (?) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5124) Standardize internal RPC interface
Reynold Xin created SPARK-5124: -- Summary: Standardize internal RPC interface Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
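As a discussion aid, a hedged sketch of what a minimal standardized interface might look like; every name here (RpcEnv, RpcEndpoint, RpcEndpointRef) is an illustrative assumption, not a committed design:
{code}
// An endpoint handles incoming messages without exposing Akka types.
trait RpcEndpoint {
  def receive: PartialFunction[Any, Unit]
}

// A reference to a (possibly remote) endpoint.
trait RpcEndpointRef {
  // Fire-and-forget send.
  def send(message: Any): Unit
}

// The environment wires names to endpoints; an Akka-backed implementation
// would adapt these calls onto actors, and tests could supply an in-process one.
trait RpcEnv {
  def setupEndpoint(name: String, endpoint: RpcEndpoint): RpcEndpointRef
}
{code}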
[jira] [Commented] (SPARK-5009) allCaseVersions function in SqlLexical leads to StackOverflow Exception
[ https://issues.apache.org/jira/browse/SPARK-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267317#comment-14267317 ] Apache Spark commented on SPARK-5009: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3926 allCaseVersions function in SqlLexical leads to StackOverflow Exception - Key: SPARK-5009 URL: https://issues.apache.org/jira/browse/SPARK-5009 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1, 1.2.0 Reporter: shengli Fix For: 1.3.0, 1.2.1 Original Estimate: 96h Remaining Estimate: 96h Recently I found a bug when adding a new feature in SqlParser: if I define a Keyword that has a long name, like ```protected val SERDEPROPERTIES = Keyword("SERDEPROPERTIES")```, then, since the all-case version is implemented by a recursive function, when the ```implicit asParser``` function is called and the stack memory is small, it leads to a StackOverflowError.
java.lang.StackOverflowError
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5121) Stored as parquet doesn't support the CTAS
[ https://issues.apache.org/jira/browse/SPARK-5121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiaoJing wang closed SPARK-5121. Resolution: Fixed Stored as parquet doesn't support the CTAS -- Key: SPARK-5121 URL: https://issues.apache.org/jira/browse/SPARK-5121 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Environment: hive-0.13.1 Reporter: XiaoJing wang Fix For: 1.2.0 Original Estimate: 4h Remaining Estimate: 4h In CTAS, stored as parquet is an unsupported Hive feature -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4948) Use pssh instead of bash-isms and remove unnecessary operations
[ https://issues.apache.org/jira/browse/SPARK-4948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-4948. - Resolution: Fixed Resolved by: https://github.com/mesos/spark-ec2/pull/86 Use pssh instead of bash-isms and remove unnecessary operations --- Key: SPARK-4948 URL: https://issues.apache.org/jira/browse/SPARK-4948 Project: Spark Issue Type: Sub-task Components: EC2 Affects Versions: 1.2.0 Reporter: Nicholas Chammas Priority: Minor Fix For: 1.3.0 Remove unnecessarily high sleep times in {{setup.sh}}, as well as unnecessary SSH calls to pre-approve keys. Replace bash-isms like {{while ... command ... wait}} with {{pssh}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4948) Use pssh instead of bash-isms and remove unnecessary operations
[ https://issues.apache.org/jira/browse/SPARK-4948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-4948: Target Version/s: 1.3.0 Use pssh instead of bash-isms and remove unnecessary operations --- Key: SPARK-4948 URL: https://issues.apache.org/jira/browse/SPARK-4948 Project: Spark Issue Type: Sub-task Components: EC2 Affects Versions: 1.2.0 Reporter: Nicholas Chammas Priority: Minor Fix For: 1.3.0 Remove unnecessarily high sleep times in {{setup.sh}}, as well as unnecessary SSH calls to pre-approve keys. Replace bash-isms like {{while ... command ... wait}} with {{pssh}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4948) Use pssh instead of bash-isms and remove unnecessary operations
[ https://issues.apache.org/jira/browse/SPARK-4948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267331#comment-14267331 ] Nicholas Chammas commented on SPARK-4948: - [~shivaram] Could you assign this issue to me please? Use pssh instead of bash-isms and remove unnecessary operations --- Key: SPARK-4948 URL: https://issues.apache.org/jira/browse/SPARK-4948 Project: Spark Issue Type: Sub-task Components: EC2 Affects Versions: 1.2.0 Reporter: Nicholas Chammas Priority: Minor Fix For: 1.3.0 Remove unnecessarily high sleep times in {{setup.sh}}, as well as unnecessary SSH calls to pre-approve keys. Replace bash-isms like {{while ... command ... wait}} with {{pssh}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org