[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on
[ https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265896#comment-14265896 ] Sean Owen commented on SPARK-3452: -- [~aniket] I think that's a little different. You may find the Spark YARN API you want now in spark-network-yarn. Maven build should skip publishing artifacts people shouldn't depend on --- Key: SPARK-3452 URL: https://issues.apache.org/jira/browse/SPARK-3452 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0, 1.1.0 Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Critical Fix For: 1.2.0 I think it's easy to do this by just adding a skip configuration somewhere. We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4585) Spark dynamic executor allocation shouldn't use maxExecutors as initial number
[ https://issues.apache.org/jira/browse/SPARK-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265915#comment-14265915 ] Lianhui Wang commented on SPARK-4585: - Yes, I think the initial number of executors can be estimated. In most cases, I think it should be the number of tasks in the first stages that run. Spark dynamic executor allocation shouldn't use maxExecutors as initial number -- Key: SPARK-4585 URL: https://issues.apache.org/jira/browse/SPARK-4585 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.1.0 Reporter: Chengxiang Li With SPARK-3174, one can configure a minimum and maximum number of executors for a Spark application on Yarn. However, the application always starts with the maximum. It seems more reasonable, at least for Hive on Spark, to start from the minimum and scale up as needed up to the maximum. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
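[Editorial note] For context, a minimal sketch of how an application could configure the allocation bounds discussed above. The spark.dynamicAllocation.minExecutors/maxExecutors keys exist as of SPARK-3174; spark.dynamicAllocation.initialExecutors is the proposed starting-size knob, and its name is an assumption here, not a setting in the release under discussion.
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("DynamicAllocationSketch")
  .set("spark.dynamicAllocation.enabled", "true")
  // Required for dynamic allocation on YARN so executors can be removed
  // without losing their shuffle output.
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "50")
  // Proposed (assumed name): start small and scale up, rather than
  // starting at maxExecutors.
  .set("spark.dynamicAllocation.initialExecutors", "2")

val sc = new SparkContext(conf)
{code}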
[jira] [Commented] (SPARK-5101) Add common ML math functions
[ https://issues.apache.org/jira/browse/SPARK-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265916#comment-14265916 ] Sean Owen commented on SPARK-5101: -- (Ah, very good point about overflow!) Add common ML math functions Key: SPARK-5101 URL: https://issues.apache.org/jira/browse/SPARK-5101 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: DB Tsai Priority: Minor We can add common ML math functions to MLlib. It may be a little tricky to implement those functions in a numerically stable way. For example, {code} math.log(1 + math.exp(x)) {code} should be implemented as {code} if (x > 0) { x + math.log1p(math.exp(-x)) } else { math.log1p(math.exp(x)) } {code} It becomes hard to maintain if we have multiple copies of the correct implementation in the codebase. A good place for those functions could be `mllib.util.MathFunctions`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
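[Editorial note] To make the overflow point concrete, here is the computation proposed above as a self-contained sketch. The function name log1pExp and the example values are illustrative; the JIRA only suggests a home in mllib.util.MathFunctions.
{code}
import scala.math

// Numerically stable log(1 + exp(x)).
// For large positive x, math.exp(x) overflows to Infinity, so we factor x
// out first; for x <= 0, exp(x) <= 1 and log1p is accurate directly.
def log1pExp(x: Double): Double = {
  if (x > 0) {
    x + math.log1p(math.exp(-x))
  } else {
    math.log1p(math.exp(x))
  }
}

// Naive form overflows: math.log(1 + math.exp(1000.0)) == Double.PositiveInfinity
// Stable form stays finite: log1pExp(1000.0) == 1000.0
{code}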
[jira] [Updated] (SPARK-5100) Spark Thrift server monitor page
[ https://issues.apache.org/jira/browse/SPARK-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated SPARK-5100: --- Attachment: prototype-screenshot.png Spark Thrift server monitor page Key: SPARK-5100 URL: https://issues.apache.org/jira/browse/SPARK-5100 Project: Spark Issue Type: New Feature Components: SQL, Web UI Reporter: Yi Tian Priority: Critical Attachments: Spark Thrift-server monitor page.pdf, prototype-screenshot.png In the latest Spark release, there is a Spark Streaming tab on the driver web UI, which shows information about the running streaming application. It would be helpful to provide a similar monitor page for the Thrift server, because both streaming and Thrift server are long-running applications, and their details do not show up on the stage or job pages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5100) Spark Thrift server monitor page
[ https://issues.apache.org/jira/browse/SPARK-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated SPARK-5100: --- Attachment: (was: Spark Thrift-server monitor page.pdf) Spark Thrift server monitor page Key: SPARK-5100 URL: https://issues.apache.org/jira/browse/SPARK-5100 Project: Spark Issue Type: New Feature Components: SQL, Web UI Reporter: Yi Tian Priority: Critical Attachments: Spark Thrift-server monitor page.pdf, prototype-screenshot.png In the latest Spark release, there is a Spark Streaming tab on the driver web UI, which shows information about the running streaming application. It would be helpful to provide a similar monitor page for the Thrift server, because both streaming and Thrift server are long-running applications, and their details do not show up on the stage or job pages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5100) Spark Thrift server monitor page
[ https://issues.apache.org/jira/browse/SPARK-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated SPARK-5100: --- Attachment: Spark Thrift-server monitor page.pdf design doc Spark Thrift server monitor page Key: SPARK-5100 URL: https://issues.apache.org/jira/browse/SPARK-5100 Project: Spark Issue Type: New Feature Components: SQL, Web UI Reporter: Yi Tian Priority: Critical Attachments: Spark Thrift-server monitor page.pdf, prototype-screenshot.png In the latest Spark release, there is a Spark Streaming tab on the driver web UI, which shows information about the running streaming application. It would be helpful to provide a similar monitor page for the Thrift server, because both streaming and Thrift server are long-running applications, and their details do not show up on the stage or job pages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5101) Add common ML math functions
[ https://issues.apache.org/jira/browse/SPARK-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5101: - Priority: Minor (was: Major) Add common ML math functions Key: SPARK-5101 URL: https://issues.apache.org/jira/browse/SPARK-5101 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Priority: Minor We can add common ML math functions to MLlib. It may be a little tricky to implement those functions in a numerically stable way. For example, {code} math.log(1 + math.exp(x)) {code} should be implemented as {code} if (x > 0) { x + math.log1p(math.exp(-x)) } else { math.log1p(math.exp(x)) } {code} It becomes hard to maintain if we have multiple copies of the correct implementation in the codebase. A good place for those functions could be `mllib.util.MathFunctions`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-4850) GROUP BY can't work if the schema of SchemaRDD contains struct or array type
[ https://issues.apache.org/jira/browse/SPARK-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chaozhong Yang updated SPARK-4850: -- Comment: was deleted (was: https://issues.apache.org/jira/secure/ViewProfile.jspa?name=lian+cheng ) GROUP BY can't work if the schema of SchemaRDD contains struct or array type -- Key: SPARK-4850 URL: https://issues.apache.org/jira/browse/SPARK-4850 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.2 Reporter: Chaozhong Yang Assignee: Cheng Lian Labels: group, sql Original Estimate: 96h Remaining Estimate: 96h Code in Spark Shell as follows: {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) val path = "path/to/json" sqlContext.jsonFile(path).register("Table") val t = sqlContext.sql("select * from Table group by a") t.collect {code} Let's look into the schema of `Table` {code} root |-- a: integer (nullable = true) |-- arr: array (nullable = true) ||-- element: integer (containsNull = false) |-- createdAt: string (nullable = true) |-- f: struct (nullable = true) ||-- __type: string (nullable = true) ||-- className: string (nullable = true) ||-- objectId: string (nullable = true) |-- objectId: string (nullable = true) |-- s: string (nullable = true) |-- updatedAt: string (nullable = true) {code} An exception will be thrown: {code} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: arr#9, tree: Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14] Subquery TestImport LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], MappedRDD[18] at map at JsonRDD.scala:47 at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125) at scala.Option.foreach(Option.scala:236) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411) at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444) at $iwC$$iwC$$iwC$$iwC.init(console:17) at $iwC$$iwC$$iwC.init(console:22) at
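[Editorial note] For readers hitting this analyzer error: a sketch of a query shape that passes the CheckAggregation rule quoted above, in which every selected column either appears in the GROUP BY clause or sits inside an aggregate (column names taken from the schema in the report). This is a workaround sketch under those assumptions, not the fix tracked by this issue.
{code}
// Hypothetical workaround: select only grouped columns and aggregates.
val grouped = sqlContext.sql(
  "SELECT a, COUNT(*) AS cnt, MAX(createdAt) AS latestCreatedAt " +
  "FROM Table GROUP BY a")
grouped.collect()
{code}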
[jira] [Commented] (SPARK-4850) GROUP BY can't work if the schema of SchemaRDD contains struct or array type
[ https://issues.apache.org/jira/browse/SPARK-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265849#comment-14265849 ] Chaozhong Yang commented on SPARK-4850: --- https://issues.apache.org/jira/secure/ViewProfile.jspa?name=lian+cheng GROUP BY can't work if the schema of SchemaRDD contains struct or array type -- Key: SPARK-4850 URL: https://issues.apache.org/jira/browse/SPARK-4850 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.2 Reporter: Chaozhong Yang Assignee: Cheng Lian Labels: group, sql Original Estimate: 96h Remaining Estimate: 96h Code in Spark Shell as follows: {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) val path = "path/to/json" sqlContext.jsonFile(path).register("Table") val t = sqlContext.sql("select * from Table group by a") t.collect {code} Let's look into the schema of `Table` {code} root |-- a: integer (nullable = true) |-- arr: array (nullable = true) ||-- element: integer (containsNull = false) |-- createdAt: string (nullable = true) |-- f: struct (nullable = true) ||-- __type: string (nullable = true) ||-- className: string (nullable = true) ||-- objectId: string (nullable = true) |-- objectId: string (nullable = true) |-- s: string (nullable = true) |-- updatedAt: string (nullable = true) {code} An exception will be thrown: {code} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: arr#9, tree: Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14] Subquery TestImport LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], MappedRDD[18] at map at JsonRDD.scala:47 at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125) at scala.Option.foreach(Option.scala:236) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411) at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444) at $iwC$$iwC$$iwC$$iwC.init(console:17) at $iwC$$iwC$$iwC.init(console:22)
[jira] [Created] (SPARK-5101) Add common ML math functions
Xiangrui Meng created SPARK-5101: Summary: Add common ML math functions Key: SPARK-5101 URL: https://issues.apache.org/jira/browse/SPARK-5101 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng We can add common ML math functions to MLlib. It may be a little tricky to implement those functions in a numerically stable way. For example, {code} math.log(1 + math.exp(x)) {code} should be implemented as {code} if (x > 0) { x + math.log1p(math.exp(-x)) } else { math.log1p(math.exp(x)) } {code} It becomes hard to maintain if we have multiple copies of the correct implementation in the codebase. A good place for those functions could be `mllib.util.MathFunctions`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5101) Add common ML math functions
[ https://issues.apache.org/jira/browse/SPARK-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5101: - Assignee: DB Tsai Add common ML math functions Key: SPARK-5101 URL: https://issues.apache.org/jira/browse/SPARK-5101 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: DB Tsai Priority: Minor We can add common ML math functions to MLlib. It may be a little tricky to implement those functions in a numerically stable way. For example, {code} math.log(1 + math.exp(x)) {code} should be implemented as {code} if (x > 0) { x + math.log1p(math.exp(-x)) } else { math.log1p(math.exp(x)) } {code} It becomes hard to maintain if we have multiple copies of the correct implementation in the codebase. A good place for those functions could be `mllib.util.MathFunctions`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4905) Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream
[ https://issues.apache.org/jira/browse/SPARK-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265845#comment-14265845 ] Tathagata Das commented on SPARK-4905: -- Any insights yet? Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream - Key: SPARK-4905 URL: https://issues.apache.org/jira/browse/SPARK-4905 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Josh Rosen Labels: flaky-test It looks like the org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream test might be flaky ([link|https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24647/testReport/junit/org.apache.spark.streaming.flume/FlumeStreamSuite/flume_input_stream/]): {code} Error Message The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). Stacktrace sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). 
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:142) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply$mcV$sp(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(FlumeStreamSuite.scala:46) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.flume.FlumeStreamSuite.runTest(FlumeStreamSuite.scala:46) at
[jira] [Commented] (SPARK-4850) GROUP BY can't work if the schema of SchemaRDD contains struct or array type
[ https://issues.apache.org/jira/browse/SPARK-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265848#comment-14265848 ] Chaozhong Yang commented on SPARK-4850: --- Got it, thanks! GROUP BY can't work if the schema of SchemaRDD contains struct or array type -- Key: SPARK-4850 URL: https://issues.apache.org/jira/browse/SPARK-4850 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.2 Reporter: Chaozhong Yang Assignee: Cheng Lian Labels: group, sql Original Estimate: 96h Remaining Estimate: 96h Code in Spark Shell as follows: {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) val path = "path/to/json" sqlContext.jsonFile(path).register("Table") val t = sqlContext.sql("select * from Table group by a") t.collect {code} Let's look into the schema of `Table` {code} root |-- a: integer (nullable = true) |-- arr: array (nullable = true) ||-- element: integer (containsNull = false) |-- createdAt: string (nullable = true) |-- f: struct (nullable = true) ||-- __type: string (nullable = true) ||-- className: string (nullable = true) ||-- objectId: string (nullable = true) |-- objectId: string (nullable = true) |-- s: string (nullable = true) |-- updatedAt: string (nullable = true) {code} An exception will be thrown: {code} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: arr#9, tree: Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14] Subquery TestImport LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], MappedRDD[18] at map at JsonRDD.scala:47 at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125) at scala.Option.foreach(Option.scala:236) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411) at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444) at $iwC$$iwC$$iwC$$iwC.init(console:17) at $iwC$$iwC$$iwC.init(console:22) at $iwC$$iwC.init(console:24) at
[jira] [Comment Edited] (SPARK-4905) Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream
[ https://issues.apache.org/jira/browse/SPARK-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265304#comment-14265304 ] Tathagata Das edited comment on SPARK-4905 at 1/6/15 8:31 AM: -- What is the reason behind such a behavior where the number of records received is same as sent, but all the records are empty? was (Author: tdas): What is the reason behind such a behavior where the number of records received is same as sent, but all the records are empty? TD Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream - Key: SPARK-4905 URL: https://issues.apache.org/jira/browse/SPARK-4905 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Josh Rosen Labels: flaky-test It looks like the org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream test might be flaky ([link|https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24647/testReport/junit/org.apache.spark.streaming.flume/FlumeStreamSuite/flume_input_stream/]): {code} Error Message The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). Stacktrace sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). 
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:142) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply$mcV$sp(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at
[jira] [Commented] (SPARK-4999) No need to put WAL-backed block into block manager by default
[ https://issues.apache.org/jira/browse/SPARK-4999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265811#comment-14265811 ] Apache Spark commented on SPARK-4999: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/3906 No need to put WAL-backed block into block manager by default - Key: SPARK-4999 URL: https://issues.apache.org/jira/browse/SPARK-4999 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Reporter: Saisai Shao Currently a WAL-backed block is read out from HDFS and put into the BlockManager with storage level MEMORY_ONLY_SER by default. Since the WAL-backed block is already fault-tolerant, there is no need to put it into the BlockManager again by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5099) Simplify logistic loss function and fix deviance loss function
Liang-Chi Hsieh created SPARK-5099: -- Summary: Simplify logistic loss function and fix deviance loss function Key: SPARK-5099 URL: https://issues.apache.org/jira/browse/SPARK-5099 Project: Spark Issue Type: Bug Reporter: Liang-Chi Hsieh Priority: Minor This is a minor PR: in the LogisticGradient we can simply negate the margin instead of subtracting it. Mathematically the two are equal, but the modified equation is the common form of the logistic loss function and so is more readable. It also computes a more accurate value, as some quick tests show. Besides, there is a bug in computing the loss in LogLoss; this PR fixes it too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
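[Editorial note] To spell out the equivalence the PR relies on: log(1 + e^m) - m = log((1 + e^m)/e^m) = log(1 + e^-m), so negating the margin yields the textbook logistic loss directly. A small illustrative sketch follows; the function names are hypothetical, not the PR's code.
{code}
import scala.math

// Two mathematically equal forms of the logistic loss for a given margin.
def lossSubtractingMargin(margin: Double): Double =
  math.log1p(math.exp(margin)) - margin // log(1 + e^m) - m

def lossNegatedMargin(margin: Double): Double =
  math.log1p(math.exp(-margin))         // log(1 + e^-m), the textbook form

// Both evaluate to ~0.3133 at margin = 1.0. A fully robust version would
// also branch on the sign of the exponent, as in the SPARK-5101 discussion,
// to avoid overflow for large |margin|.
{code}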
[jira] [Commented] (SPARK-5099) Simplify logistic loss function and fix deviance loss function
[ https://issues.apache.org/jira/browse/SPARK-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265828#comment-14265828 ] Apache Spark commented on SPARK-5099: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/3899 Simplify logistic loss function and fix deviance loss function -- Key: SPARK-5099 URL: https://issues.apache.org/jira/browse/SPARK-5099 Project: Spark Issue Type: Bug Reporter: Liang-Chi Hsieh Priority: Minor This is a minor PR: in the LogisticGradient we can simply negate the margin instead of subtracting it. Mathematically the two are equal, but the modified equation is the common form of the logistic loss function and so is more readable. It also computes a more accurate value, as some quick tests show. Besides, there is a bug in computing the loss in LogLoss; this PR fixes it too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1600) flaky recovery with file input stream test in streaming.CheckpointSuite
[ https://issues.apache.org/jira/browse/SPARK-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-1600. -- Resolution: Fixed Fix Version/s: 1.3.0 flaky recovery with file input stream test in streaming.CheckpointSuite - Key: SPARK-1600 URL: https://issues.apache.org/jira/browse/SPARK-1600 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.2.0 Reporter: Nan Zhu Labels: flaky-test Fix For: 1.3.0 The test case "recovery with file input stream.recovery with file input stream" sometimes fails when Jenkins is very busy, even with an unrelated change. I have hit it 3 times and have also seen it in other places; the latest example is https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14397/ where the modification is only in YARN-related files. I once reported it on the dev mailing list: http://apache-spark-developers-list.1001551.n3.nabble.com/a-weird-test-case-in-Streaming-td6116.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1600) flaky recovery with file input stream test in streaming.CheckpointSuite
[ https://issues.apache.org/jira/browse/SPARK-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1600: - Affects Version/s: (was: 1.3.0) flaky recovery with file input stream test in streaming.CheckpointSuite - Key: SPARK-1600 URL: https://issues.apache.org/jira/browse/SPARK-1600 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.2.0 Reporter: Nan Zhu Labels: flaky-test Fix For: 1.3.0 The test case "recovery with file input stream.recovery with file input stream" sometimes fails when Jenkins is very busy, even with an unrelated change. I have hit it 3 times and have also seen it in other places; the latest example is https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14397/ where the modification is only in YARN-related files. I once reported it on the dev mailing list: http://apache-spark-developers-list.1001551.n3.nabble.com/a-weird-test-case-in-Streaming-td6116.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5102) CompressedMapStatus needs to be registered with Kryo
Daniel Darabos created SPARK-5102: - Summary: CompressedMapStatus needs to be registered with Kryo Key: SPARK-5102 URL: https://issues.apache.org/jira/browse/SPARK-5102 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Daniel Darabos Priority: Minor After upgrading from Spark 1.1.0 to 1.2.0 I got this exception: {code} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Class is not registered: org.apache.spark.scheduler.CompressedMapStatus Note: To register this class use: kryo.register(org.apache.spark.scheduler.CompressedMapStatus.class); at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442) at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79) at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565) at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:165) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} I had to register {{org.apache.spark.scheduler.CompressedMapStatus}} with Kryo. I think this should be done in {{spark/serializer/KryoSerializer.scala}}, unless instances of this class are not expected to be sent over the wire. (Maybe I'm doing something wrong?) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
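[Editorial note] Until Spark registers this class internally, one workaround is to register it by name from application code, as the exception message suggests. This is a sketch under that assumption; Class.forName avoids a compile-time dependency on Spark's private scheduler package, and SparkConf.registerKryoClasses exists as of Spark 1.2.
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  // Register Spark's private CompressedMapStatus by name, without
  // compiling against the private class itself.
  .registerKryoClasses(Array(
    Class.forName("org.apache.spark.scheduler.CompressedMapStatus")))
{code}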
[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on
[ https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266156#comment-14266156 ] Aniket Bhatnagar commented on SPARK-3452: - OK, I'll test this out by adding a dependency on spark-network-yarn and see how it goes. Fingers crossed! Maven build should skip publishing artifacts people shouldn't depend on --- Key: SPARK-3452 URL: https://issues.apache.org/jira/browse/SPARK-3452 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0, 1.1.0 Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Critical Fix For: 1.2.0 I think it's easy to do this by just adding a skip configuration somewhere. We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5099) Simplify logistic loss function
[ https://issues.apache.org/jira/browse/SPARK-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-5099: --- Description: This is a minor PR: in the LogisticGradient we can simply negate the margin instead of subtracting it. Mathematically the two are equal, but the modified equation is the common form of the logistic loss function and so is more readable. It also computes a more accurate value, as some quick tests show. was: This is a minor PR: in the LogisticGradient we can simply negate the margin instead of subtracting it. Mathematically the two are equal, but the modified equation is the common form of the logistic loss function and so is more readable. It also computes a more accurate value, as some quick tests show. Besides, there is a bug in computing the loss in LogLoss; this PR fixes it too. Simplify logistic loss function --- Key: SPARK-5099 URL: https://issues.apache.org/jira/browse/SPARK-5099 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Priority: Minor This is a minor PR: in the LogisticGradient we can simply negate the margin instead of subtracting it. Mathematically the two are equal, but the modified equation is the common form of the logistic loss function and so is more readable. It also computes a more accurate value, as some quick tests show. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5099) Simplify logistic loss function
[ https://issues.apache.org/jira/browse/SPARK-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-5099: --- Issue Type: Improvement (was: Bug) Simplify logistic loss function --- Key: SPARK-5099 URL: https://issues.apache.org/jira/browse/SPARK-5099 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Priority: Minor This is a minor PR: in the LogisticGradient we can simply negate the margin instead of subtracting it. Mathematically the two are equal, but the modified equation is the common form of the logistic loss function and so is more readable. It also computes a more accurate value, as some quick tests show. Besides, there is a bug in computing the loss in LogLoss; this PR fixes it too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4366) Aggregation Optimization
[ https://issues.apache.org/jira/browse/SPARK-4366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266124#comment-14266124 ] Cheng Hao commented on SPARK-4366: -- [~marmbrus] I've uploaded a draft design doc for the UDAF interface; let me know if you have any concerns or find anything confusing. Aggregation Optimization Key: SPARK-4366 URL: https://issues.apache.org/jira/browse/SPARK-4366 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Attachments: aggregatefunction_v1.pdf This improvement actually includes a couple of sub-tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4366) Aggregation Optimization
[ https://issues.apache.org/jira/browse/SPARK-4366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-4366: - Attachment: aggregatefunction_v1.pdf Draft Design Doc. Aggregation Optimization Key: SPARK-4366 URL: https://issues.apache.org/jira/browse/SPARK-4366 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Attachments: aggregatefunction_v1.pdf This improvement actually includes a couple of sub-tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5103) Add Functionality to Pass Config Options to KeyConverter and ValueConverter in PySpark
Brett Meyer created SPARK-5103: -- Summary: Add Functionality to Pass Config Options to KeyConverter and ValueConverter in PySpark Key: SPARK-5103 URL: https://issues.apache.org/jira/browse/SPARK-5103 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 1.2.0 Reporter: Brett Meyer Priority: Minor Currently, when using the provided PySpark loaders with a KeyConverter or ValueConverter class, there is no way to pass additional information to the converter classes. I would like to add functionality to pass in options, either through configuration that can be set on the SparkContext or through parameters passed to the KeyConverter and ValueConverter classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5110) Spark-on-Yarn does not work on windows platform
Zhan Zhang created SPARK-5110: - Summary: Spark-on-Yarn does not work on windows platform Key: SPARK-5110 URL: https://issues.apache.org/jira/browse/SPARK-5110 Project: Spark Issue Type: Bug Reporter: Zhan Zhang Some of the scripts launching the AM and executors in spark-on-yarn do not work on the Windows platform. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5112) Expose SizeEstimator as a developer API
[ https://issues.apache.org/jira/browse/SPARK-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266646#comment-14266646 ] Apache Spark commented on SPARK-5112: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/3913 Expose SizeEstimator as a developer API --- Key: SPARK-5112 URL: https://issues.apache.org/jira/browse/SPARK-5112 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza Assignee: Sandy Ryza The best way to size the amount of memory consumption your dataset will require is to create an RDD, put it into cache, and look at the SparkContext logs on your driver program. The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD. -the Tuning Spark page This is a pain. It would be much nicer to expose simple functionality for understanding the memory footprint of a Java object. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark
Patrick Wendell created SPARK-5113: -- Summary: Audit and document use of hostnames and IP addresses in Spark Key: SPARK-5113 URL: https://issues.apache.org/jira/browse/SPARK-5113 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Priority: Critical Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. In some cases, that hostname is used as the bind interface also (e.g. I think this happens in the connection manager and possibly akka). In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5075) Memory Leak when repartitioning SchemaRDD or running queries in general
[ https://issues.apache.org/jira/browse/SPARK-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Willard updated SPARK-5075: Summary: Memory Leak when repartitioning SchemaRDD or running queries in general (was: Memory Leak when repartitioning SchemaRDD from JSON) Memory Leak when repartitioning SchemaRDD or running queries in general --- Key: SPARK-5075 URL: https://issues.apache.org/jira/browse/SPARK-5075 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.2.0 Environment: spark-ec2 launched 10 node cluster of type c3.8xlarge Reporter: Brad Willard Labels: ec2, json, parquet, pyspark, repartition, s3 I'm trying to repartition a JSON dataset for better CPU utilization and save it in parquet format for better performance. The JSON dataset is about 200 GB. from pyspark.sql import SQLContext sql_context = SQLContext(sc) rdd = sql_context.jsonFile('s3c://some_path') rdd = rdd.repartition(256) rdd.saveAsParquetFile('hdfs://some_path') In Ganglia, when the dataset first loads, it's about 200 GB in memory, which is expected. However, once it attempts the repartition, it balloons to over 2.5x that in memory, which is never returned, making any subsequent operations fail with memory errors. https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5110) Spark-on-Yarn does not work on windows platform
[ https://issues.apache.org/jira/browse/SPARK-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266599#comment-14266599 ] Sean Owen commented on SPARK-5110: -- [~zhanzhang] are you intending to add any detail to these JIRAs? This looks like a duplicate of at least one of: https://issues.apache.org/jira/browse/SPARK-5034 https://issues.apache.org/jira/browse/SPARK-1825 https://issues.apache.org/jira/browse/SPARK-2221 Spark-on-Yarn does not work on windows platform --- Key: SPARK-5110 URL: https://issues.apache.org/jira/browse/SPARK-5110 Project: Spark Issue Type: Bug Reporter: Zhan Zhang Some of the scripts launching the AM and executors in spark-on-yarn do not work on the Windows platform. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5111) HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5
Zhan Zhang created SPARK-5111: - Summary: HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5 Key: SPARK-5111 URL: https://issues.apache.org/jira/browse/SPARK-5111 Project: Spark Issue Type: Bug Reporter: Zhan Zhang This fails due to a java.lang.NoSuchFieldError: SASL_PROPS error. We need to backport some Hive 0.14 fixes into Spark, since there is no effort yet to upgrade Spark's Hive support to 0.14. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5075) Memory Leak when repartitioning SchemaRDD or running queries in general
[ https://issues.apache.org/jira/browse/SPARK-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Willard updated SPARK-5075: Description: I'm trying to repartition a JSON dataset for better CPU utilization and save it in parquet format for better performance. The JSON dataset is about 200 GB. from pyspark.sql import SQLContext sql_context = SQLContext(sc) rdd = sql_context.jsonFile('s3c://some_path') rdd = rdd.repartition(256) rdd.saveAsParquetFile('hdfs://some_path') In Ganglia, when the dataset first loads, it's about 200 GB in memory, which is expected. However, once it attempts the repartition, it balloons to over 2.5x that in memory, which is never returned, making any subsequent operations fail with memory errors. https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png I'm also seeing a similar memory leak behavior when running repeated queries on a dataset. rdd = sql_context.parquetFile('hdfs://some_path') rdd.registerTempTable('events') sql_context.sql("anything") sql_context.sql("anything") sql_context.sql("anything") sql_context.sql("anything") will result in a memory usage pattern of: http://cl.ly/image/180y2D3d1A0X It seems like intermediate results are not being garbage collected or something. Eventually I have to kill my session to keep running queries. was: I'm trying to repartition a JSON dataset for better CPU utilization and save it in parquet format for better performance. The JSON dataset is about 200 GB. from pyspark.sql import SQLContext sql_context = SQLContext(sc) rdd = sql_context.jsonFile('s3c://some_path') rdd = rdd.repartition(256) rdd.saveAsParquetFile('hdfs://some_path') In Ganglia, when the dataset first loads, it's about 200 GB in memory, which is expected. However, once it attempts the repartition, it balloons to over 2.5x that in memory, which is never returned, making any subsequent operations fail with memory errors. https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png Memory Leak when repartitioning SchemaRDD or running queries in general --- Key: SPARK-5075 URL: https://issues.apache.org/jira/browse/SPARK-5075 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.2.0 Environment: spark-ec2 launched 10 node cluster of type c3.8xlarge Reporter: Brad Willard Labels: ec2, json, parquet, pyspark, repartition, s3 I'm trying to repartition a JSON dataset for better CPU utilization and save it in parquet format for better performance. The JSON dataset is about 200 GB. from pyspark.sql import SQLContext sql_context = SQLContext(sc) rdd = sql_context.jsonFile('s3c://some_path') rdd = rdd.repartition(256) rdd.saveAsParquetFile('hdfs://some_path') In Ganglia, when the dataset first loads, it's about 200 GB in memory, which is expected. However, once it attempts the repartition, it balloons to over 2.5x that in memory, which is never returned, making any subsequent operations fail with memory errors. https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png I'm also seeing a similar memory leak behavior when running repeated queries on a dataset. rdd = sql_context.parquetFile('hdfs://some_path') rdd.registerTempTable('events') sql_context.sql("anything") sql_context.sql("anything") sql_context.sql("anything") sql_context.sql("anything") will result in a memory usage pattern of: http://cl.ly/image/180y2D3d1A0X It seems like intermediate results are not being garbage collected or something. 
Eventually I have to kill my session to keep running queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5107) A trick log info for the start of Receiver
[ https://issues.apache.org/jira/browse/SPARK-5107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266597#comment-14266597 ] Apache Spark commented on SPARK-5107: - User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/3912 Misleading log info for the start of Receiver -- Key: SPARK-5107 URL: https://issues.apache.org/jira/browse/SPARK-5107 Project: Spark Issue Type: Improvement Components: Streaming Reporter: uncleGen Priority: Trivial A Receiver registers itself whenever it begins to start, but it logs the same message each time. In particular, it also registers itself in preStart(), so the log reads as if the receiver has started twice: !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/3.JPG! We could log the information more clearly, e.g. by including the number of start attempts. Of course, this affects neither performance nor usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5112) Expose SizeEstimator as a developer API
Sandy Ryza created SPARK-5112: - Summary: Expose SizeEstimator as a developer API Key: SPARK-5112 URL: https://issues.apache.org/jira/browse/SPARK-5112 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sandy Ryza Assignee: Sandy Ryza "The best way to size the amount of memory consumption your dataset will require is to create an RDD, put it into cache, and look at the SparkContext logs on your driver program. The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD." -- the Tuning Spark page This is a pain. It would be much nicer to simply expose functionality for understanding the memory footprint of a Java object. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
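[Editor's note] To make the proposal concrete, here is a minimal sketch of how a caller might use such an API, assuming {{org.apache.spark.util.SizeEstimator.estimate}} were exposed publicly (the method exists internally today; treating it as user-callable is the assumption here):
{code}
import org.apache.spark.util.SizeEstimator

// Estimate the in-memory footprint of a single record without having to
// cache an entire RDD and read the driver logs.
val record = Array.fill(100)(scala.util.Random.nextDouble())
val bytesPerRecord = SizeEstimator.estimate(record)

// Back-of-the-envelope sizing for a dataset of 10 million such records.
val estimatedMb = bytesPerRecord * 10000000L / (1L << 20)
println(s"~$bytesPerRecord bytes/record, ~$estimatedMb MB for 10M records")
{code}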
[jira] [Updated] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark
[ https://issues.apache.org/jira/browse/SPARK-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5113: --- Description: Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. That hostname is also the one used for the akka system identifier (akka supports only supplying a single name which it uses both as the bind interface and as the actor identifier). In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. The best outcome would be to have three configs that can be set on each machine:
{code}
SPARK_LOCAL_IP          # IP address we bind to for all services
SPARK_INTERNAL_HOSTNAME # Hostname we advertise to remote processes within the cluster
SPARK_EXTERNAL_HOSTNAME # Hostname we advertise to processes outside the cluster (e.g. the UI)
{code}
It's not clear how easily we can support that scheme while providing backwards compatibility. The last one (SPARK_EXTERNAL_HOSTNAME) is easy - it's just an alias for what is now SPARK_PUBLIC_DNS.

was: Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. That hostname is also the one used for the akka system identifier (akka supports only supplying a single name which it uses both as the bind interface and as the actor identifier). In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces.

Audit and document use of hostnames and IP addresses in Spark - Key: SPARK-5113 URL: https://issues.apache.org/jira/browse/SPARK-5113 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Priority: Critical Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. That hostname is also the one used for the akka system identifier (akka supports only supplying a single name which it uses both as the bind interface and as the actor identifier). In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. The best outcome would be to have three configs that can be set on each machine:
{code}
SPARK_LOCAL_IP          # IP address we bind to for all services
SPARK_INTERNAL_HOSTNAME # Hostname we advertise to remote processes within the cluster
SPARK_EXTERNAL_HOSTNAME # Hostname we advertise to processes outside the cluster (e.g. the UI)
{code}
It's not clear how easily we can support that scheme while providing backwards compatibility. The last one (SPARK_EXTERNAL_HOSTNAME) is easy - it's just an alias for what is now SPARK_PUBLIC_DNS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark
[ https://issues.apache.org/jira/browse/SPARK-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5113: --- Description: Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. was: Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. In some cases, that hostname is used as the bind interface also (e.g. I think this happens in the connection manager and possibly akka). In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. Audit and document use of hostnames and IP addresses in Spark - Key: SPARK-5113 URL: https://issues.apache.org/jira/browse/SPARK-5113 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Priority: Critical Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark
[ https://issues.apache.org/jira/browse/SPARK-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5113: --- Description: Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. That hostname is also the one used for the akka system identifier (akka supports only supplying a single name which it uses both as the bind interface and as the actor identifier). In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. was: Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. That hostname is also the one used for the akka system identifier. In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. Audit and document use of hostnames and IP addresses in Spark - Key: SPARK-5113 URL: https://issues.apache.org/jira/browse/SPARK-5113 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Priority: Critical Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. 
That hostname is also the one used for the akka system identifier (akka supports only supplying a single name which it uses both as the bind interface and as the actor identifier). In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
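[Editor's note] For readers unfamiliar with the initialization behavior described above, the following is a rough sketch of the find-a-non-loopback-interface-then-reverse-resolve logic, written against the standard {{java.net}} API; it illustrates the described behavior and is not Spark's actual code:
{code}
import java.net.{InetAddress, NetworkInterface}
import scala.collection.JavaConverters._

// Scan interfaces for the first address that is not a loopback address...
val candidate = NetworkInterface.getNetworkInterfaces.asScala
  .flatMap(_.getInetAddresses.asScala)
  .find(addr => !addr.isLoopbackAddress)

// ...then reverse-resolve it to the hostname that gets advertised.
val advertisedHost = candidate
  .map(_.getCanonicalHostName)
  .getOrElse(InetAddress.getLocalHost.getHostName)
{code}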
[jira] [Updated] (SPARK-5075) Memory Leak when repartitioning SchemaRDD or running queries in general
[ https://issues.apache.org/jira/browse/SPARK-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Willard updated SPARK-5075: Labels: ec2 json memory-leak memory_leak parquet pyspark repartition s3 (was: ec2 json parquet pyspark repartition s3) Memory Leak when repartitioning SchemaRDD or running queries in general --- Key: SPARK-5075 URL: https://issues.apache.org/jira/browse/SPARK-5075 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.2.0 Environment: spark-ec2 launched 10 node cluster of type c3.8xlarge Reporter: Brad Willard Labels: ec2, json, memory-leak, memory_leak, parquet, pyspark, repartition, s3 I'm trying to repartition a JSON dataset for better CPU utilization and save it in Parquet format for better performance. The JSON dataset is about 200GB.
{code}
from pyspark.sql import SQLContext
sql_context = SQLContext(sc)
rdd = sql_context.jsonFile('s3c://some_path')
rdd = rdd.repartition(256)
rdd.saveAsParquetFile('hdfs://some_path')
{code}
In Ganglia, when the dataset first loads it's about 200G in memory, which is expected. However, once it attempts the repartition, memory balloons to over 2.5x that and is never returned, making any subsequent operations fail with memory errors. https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png
I'm also seeing similar memory-leak behavior when running repeated queries on a dataset:
{code}
rdd = sql_context.parquetFile('hdfs://some_path')
rdd.registerTempTable('events')
sql_context.sql("anything")
sql_context.sql("anything")
sql_context.sql("anything")
sql_context.sql("anything")
{code}
This results in a memory usage pattern of: http://cl.ly/image/180y2D3d1A0X
It seems like intermediate results are not being garbage collected. Eventually I have to kill my session to keep running queries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4159) Maven build doesn't run JUnit test suites
[ https://issues.apache.org/jira/browse/SPARK-4159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4159: -- Target Version/s: 1.1.1, 1.0.3, 1.2.1 Fix Version/s: 1.3.0 Assignee: Sean Owen Labels: backport-needed (was: ) Maven build doesn't run JUnit test suites - Key: SPARK-4159 URL: https://issues.apache.org/jira/browse/SPARK-4159 Project: Spark Issue Type: Bug Components: Build Reporter: Patrick Wendell Assignee: Sean Owen Priority: Critical Labels: backport-needed Fix For: 1.3.0 It turns out our Maven build isn't running any Java test suites, and likely never has. After some fishing I believe the following is the issue. We use scalatest [1] in our Maven build which, by default, can't automatically detect JUnit tests. Scalatest will allow you to enumerate a list of suites via JUnitClasses, but I can't find a way for it to auto-detect all JUnit tests. It turns out this works in SBT because of our use of the junit-interface [2], which does this for you. An okay fix for this might be to simply enable the normal (surefire) Maven tests in addition to our scalatest in the Maven build. The only thing to watch out for is that they don't overlap in some way. We'd also have to copy over environment variables, memory settings, etc. to that plugin. [1] http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin [2] https://github.com/sbt/junit-interface -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark
[ https://issues.apache.org/jira/browse/SPARK-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5113: --- Description: Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. That hostname is also the one used for the akka system identifier. In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. was: Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. In some cases, that hostname is used as the bind hostname also (e.g. I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. Audit and document use of hostnames and IP addresses in Spark - Key: SPARK-5113 URL: https://issues.apache.org/jira/browse/SPARK-5113 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Priority: Critical Spark has multiple network components that start servers and advertise their network addresses to other processes. We should go through each of these components and make sure they have consistent and/or documented behavior wrt (a) what interface(s) they bind to and (b) what hostname they use to advertise themselves to other processes. We should document this clearly and explain to people what to do in different cases (e.g. EC2, dockerized containers, etc). When Spark initializes, it will search for a network interface until it finds one that is not a loopback address. Then it will do a reverse DNS lookup for a hostname associated with that interface. Then the network components will use that hostname to advertise the component to other processes. That hostname is also the one used for the akka system identifier. In some cases, that hostname is used as the bind hostname also (e.g. 
I think this happens in the connection manager and possibly akka) - which will likely internally result in a re-resolution of this to an IP address. In other cases (the web UI and netty shuffle) we seem to bind to all interfaces. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5108) Need to make jackson dependency version consistent with hadoop-2.6.0.
[ https://issues.apache.org/jira/browse/SPARK-5108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266729#comment-14266729 ] Apache Spark commented on SPARK-5108: - User 'zhzhan' has created a pull request for this issue: https://github.com/apache/spark/pull/3914 Need to make jackson dependency version consistent with hadoop-2.6.0. - Key: SPARK-5108 URL: https://issues.apache.org/jira/browse/SPARK-5108 Project: Spark Issue Type: Bug Components: Build Reporter: Zhan Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5018) Make MultivariateGaussian public
[ https://issues.apache.org/jira/browse/SPARK-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266907#comment-14266907 ] Travis Galoppo commented on SPARK-5018: --- Please assign this ticket to me. Make MultivariateGaussian public Key: SPARK-5018 URL: https://issues.apache.org/jira/browse/SPARK-5018 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Critical MultivariateGaussian is currently private[ml], but it would be a useful public class. This JIRA will require defining a good public API for distributions. This JIRA will be needed for finalizing the GaussianMixtureModel API, which should expose MultivariateGaussian instances instead of the means and covariances. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
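[Editor's note] As a sketch of what a public distribution class might minimally carry, here is an illustrative multivariate Gaussian built on Breeze; the class name matches the JIRA, but the API shape shown is an assumption, not Spark's actual class:
{code}
import breeze.linalg.{DenseMatrix, DenseVector, det, inv}

// Illustrative only: a mean vector, a covariance matrix, and a density.
class MultivariateGaussian(mu: DenseVector[Double], sigma: DenseMatrix[Double]) {
  private val sigmaInv = inv(sigma)
  private val norm = 1.0 / math.sqrt(math.pow(2 * math.Pi, mu.length) * det(sigma))

  // pdf(x) = exp(-(x - mu)^T Sigma^-1 (x - mu) / 2) / sqrt((2 pi)^k |Sigma|)
  def pdf(x: DenseVector[Double]): Double = {
    val d = x - mu
    norm * math.exp(-0.5 * (d.t * (sigmaInv * d)))
  }
}
{code}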
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266909#comment-14266909 ] Travis Galoppo commented on SPARK-5019: --- This really can't be completed until MultivariateGaussian is made public (SPARK-5018). Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Blocker The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
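[Editor's note] Concretely, the change requested here amounts to replacing parallel mean/covariance arrays with an array of distribution objects (reusing a MultivariateGaussian class like the sketch above); a hypothetical before/after of the model's public fields, with illustrative names only:
{code}
// Before (illustrative): parallel arrays that callers must zip together.
// class GaussianMixtureModel(weights: Array[Double],
//                            means: Array[Vector], sigmas: Array[Matrix])

// After (illustrative): one self-describing object per mixture component.
class GaussianMixtureModel(
    val weights: Array[Double],
    val gaussians: Array[MultivariateGaussian])
{code}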
[jira] [Updated] (SPARK-5018) Make MultivariateGaussian public
[ https://issues.apache.org/jira/browse/SPARK-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5018: - Assignee: Travis Galoppo Make MultivariateGaussian public Key: SPARK-5018 URL: https://issues.apache.org/jira/browse/SPARK-5018 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Travis Galoppo Priority: Critical MultivariateGaussian is currently private[ml], but it would be a useful public class. This JIRA will require defining a good public API for distributions. This JIRA will be needed for finalizing the GaussianMixtureModel API, which should expose MultivariateGaussian instances instead of the means and covariances. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5114) Should Evaluator be a PipelineStage
[ https://issues.apache.org/jira/browse/SPARK-5114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5114: - Component/s: ML Description: Pipelines can currently contain Estimators and Transformers. Question for debate: Should Pipelines be able to contain Evaluators? Pros: * Evaluators take input datasets with a particular schema, which should perhaps be checked before running a Pipeline. Cons: * Evaluators do not transform datasets. They produce a scalar (or a few values), which makes it hard to say how they fit into a Pipeline or a PipelineModel. Target Version/s: 1.3.0 Affects Version/s: 1.2.0 Summary: Should Evaluator be a PipelineStage (was: Should ) Should Evaluator be a PipelineStage --- Key: SPARK-5114 URL: https://issues.apache.org/jira/browse/SPARK-5114 Project: Spark Issue Type: Question Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Pipelines can currently contain Estimators and Transformers. Question for debate: Should Pipelines be able to contain Evaluators? Pros: * Evaluators take input datasets with a particular schema, which should perhaps be checked before running a Pipeline. Cons: * Evaluators do not transform datasets. They produce a scalar (or a few values), which makes it hard to say how they fit into a Pipeline or a PipelineModel. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
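[Editor's note] To make the debate concrete, here is a simplified sketch of the stage shapes involved; these are simplified stand-ins written against the 1.2-era {{SchemaRDD}} type, not Spark's actual traits:
{code}
import org.apache.spark.sql.SchemaRDD

// Simplified stand-ins for the ML pipeline abstractions under discussion.
abstract class PipelineStage                 // schema checks could live here
abstract class Transformer extends PipelineStage {
  def transform(data: SchemaRDD): SchemaRDD  // dataset in, dataset out
}
abstract class Estimator extends PipelineStage {
  def fit(data: SchemaRDD): Transformer      // dataset in, Transformer out
}
// The sticking point: an Evaluator maps a dataset to a scalar, so it has no
// natural place in a chain of dataset-to-dataset stages.
abstract class Evaluator {
  def evaluate(data: SchemaRDD): Double
}
{code}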
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266986#comment-14266986 ] Kai Sasaki commented on SPARK-5019: --- I'm sorry for submitting a premature PR. Is it OK to ask someone to assign tickets I want to take to me from next time? I don't seem to have the rights to assign issues to myself. I want to check SPARK-5018 and review it. Sorry for disturbing you, [~tgaloppo] Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Blocker The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266997#comment-14266997 ] Joseph K. Bradley commented on SPARK-5019: -- No problem; thanks for your understanding. If you'd like to work on an item, I'd post a comment on the JIRA saying that you want to work on it, asking an admin to assign it to you. Even if an admin does not see it immediately, anyone else who wants to work on the JIRA will see your comment. Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Blocker The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5114) Should
Joseph K. Bradley created SPARK-5114: Summary: Should Key: SPARK-5114 URL: https://issues.apache.org/jira/browse/SPARK-5114 Project: Spark Issue Type: Question Reporter: Joseph K. Bradley -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5115) Intellij fails to find hadoop classes in Spark yarn modules
Ryan Williams created SPARK-5115: - Summary: Intellij fails to find hadoop classes in Spark yarn modules Key: SPARK-5115 URL: https://issues.apache.org/jira/browse/SPARK-5115 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Ryan Williams Intellij's parsing of Spark's POMs works like a charm for the most part; however, it fails to resolve the hadoop and yarn dependencies in the Spark {{yarn}} and {{network/yarn}} modules. Imports and later references to imported classes show up as errors, e.g. !http://f.cl.ly/items/0g3w3s0t45402z30011l/Screen%20Shot%202015-01-06%20at%206.42.52%20PM.png! Opening the module settings, we see that IntelliJ is looking for version {{1.0.4}} of [each yarn JAR that the Spark YARN module depends on|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/yarn/pom.xml#L41-L56], and failing to find them: !http://f.cl.ly/items/2d320l2h2o2N1m0t2X3b/yarn.png! This, in turn, is due to the parent POM [defaulting {{hadoop.version}} to {{1.0.4}}|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122]. AFAIK, having the default-hadoop-version be {{1.0.4}} is not that important and may just be an accident of history; people typically select a Maven profile when building Spark that matches the version of Hadoop that they intend to run with. This suggests one possible fix: bump the default Hadoop version to >= 2. I've tried this locally and it resolves Intellij's difficulties with the yarn and network/yarn modules; [PR #3917|https://github.com/apache/spark/pull/3917] does this. Another fix would be to declare a {{hadoop.version}} property in {{yarn/pom.xml}} and add that to the YARN dependencies in that file; [PR #3918|https://github.com/apache/spark/pull/3918] does that. It is more obvious to me in the former case that the existing rules that govern what {{hadoop.version}} the YARN dependencies inherit will still apply. For the latter, or potentially other ways to configure IntelliJ / Spark's POMs, someone with more Maven/IntelliJ fu may need to chime in. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5115) Intellij fails to find hadoop classes in Spark yarn modules
[ https://issues.apache.org/jira/browse/SPARK-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267035#comment-14267035 ] Apache Spark commented on SPARK-5115: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/3917 Intellij fails to find hadoop classes in Spark yarn modules - Key: SPARK-5115 URL: https://issues.apache.org/jira/browse/SPARK-5115 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Ryan Williams Intellij's parsing of Spark's POMs works like a charm for the most part; however, it fails to resolve the hadoop and yarn dependencies in the Spark {{yarn}} and {{network/yarn}} modules. Imports and later references to imported classes show up as errors, e.g. !http://f.cl.ly/items/0g3w3s0t45402z30011l/Screen%20Shot%202015-01-06%20at%206.42.52%20PM.png! Opening the module settings, we see that IntelliJ is looking for version {{1.0.4}} of [each yarn JAR that the Spark YARN module depends on|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/yarn/pom.xml#L41-L56], and failing to find them: !http://f.cl.ly/items/2d320l2h2o2N1m0t2X3b/yarn.png! This, in turn, is due to the parent POM [defaulting {{hadoop.version}} to {{1.0.4}}|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122]. AFAIK, having the default-hadoop-version be {{1.0.4}} is not that important and may just be an accident of history; people typically select a Maven profile when building Spark that matches the version of Hadoop that they intend to run with. This suggests one possible fix: bump the default Hadoop version to >= 2. I've tried this locally and it resolves Intellij's difficulties with the yarn and network/yarn modules; [PR #3917|https://github.com/apache/spark/pull/3917] does this. Another fix would be to declare a {{hadoop.version}} property in {{yarn/pom.xml}} and add that to the YARN dependencies in that file; [PR #3918|https://github.com/apache/spark/pull/3918] does that. It is more obvious to me in the former case that the existing rules that govern what {{hadoop.version}} the YARN dependencies inherit will still apply. For the latter, or potentially other ways to configure IntelliJ / Spark's POMs, someone with more Maven/IntelliJ fu may need to chime in. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5115) Intellij fails to find hadoop classes in Spark yarn modules
[ https://issues.apache.org/jira/browse/SPARK-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267036#comment-14267036 ] Apache Spark commented on SPARK-5115: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/3918 Intellij fails to find hadoop classes in Spark yarn modules - Key: SPARK-5115 URL: https://issues.apache.org/jira/browse/SPARK-5115 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Ryan Williams Intellij's parsing of Spark's POMs works like a charm for the most part; however, it fails to resolve the hadoop and yarn dependencies in the Spark {{yarn}} and {{network/yarn}} modules. Imports and later references to imported classes show up as errors, e.g. !http://f.cl.ly/items/0g3w3s0t45402z30011l/Screen%20Shot%202015-01-06%20at%206.42.52%20PM.png! Opening the module settings, we see that IntelliJ is looking for version {{1.0.4}} of [each yarn JAR that the Spark YARN module depends on|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/yarn/pom.xml#L41-L56], and failing to find them: !http://f.cl.ly/items/2d320l2h2o2N1m0t2X3b/yarn.png! This, in turn, is due to the parent POM [defaulting {{hadoop.version}} to {{1.0.4}}|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122]. AFAIK, having the default-hadoop-version be {{1.0.4}} is not that important and may just be an accident of history; people typically select a Maven profile when building Spark that matches the version of Hadoop that they intend to run with. This suggests one possible fix: bump the default Hadoop version to >= 2. I've tried this locally and it resolves Intellij's difficulties with the yarn and network/yarn modules; [PR #3917|https://github.com/apache/spark/pull/3917] does this. Another fix would be to declare a {{hadoop.version}} property in {{yarn/pom.xml}} and add that to the YARN dependencies in that file; [PR #3918|https://github.com/apache/spark/pull/3918] does that. It is more obvious to me in the former case that the existing rules that govern what {{hadoop.version}} the YARN dependencies inherit will still apply. For the latter, or potentially other ways to configure IntelliJ / Spark's POMs, someone with more Maven/IntelliJ fu may need to chime in. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267061#comment-14267061 ] Travis Galoppo edited comment on SPARK-5019 at 1/7/15 12:24 AM: No problem, [~lewuathe] ... I have just started work on SPARK-5018. If you would like to revisit this ticket once that one is complete, that would be great! was (Author: tgaloppo): No problem, @lewuathe ... I have just started work on SPARK-5018. If you would like to revisit this ticket once that one is complete, that would be great! Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Blocker The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5115) Intellij fails to find hadoop classes in Spark yarn modules
[ https://issues.apache.org/jira/browse/SPARK-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams updated SPARK-5115: - Description: Intellij's parsing of Spark's POMs works like a charm for the most part; however, it fails to resolve the hadoop and yarn dependencies in the Spark {{yarn}} and {{network/yarn}} modules. Imports and later references to imported classes show up as errors, e.g. !http://f.cl.ly/items/0g3w3s0t45402z30011l/Screen%20Shot%202015-01-06%20at%206.42.52%20PM.png! Opening the module settings, we see that IntelliJ is looking for version {{1.0.4}} of [each yarn JAR that the Spark YARN module depends on|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/yarn/pom.xml#L41-L56], and failing to find them: !http://f.cl.ly/items/2d320l2h2o2N1m0t2X3b/yarn.png! This, in turn, is due to the parent POM [defaulting {{hadoop.version}} to {{1.0.4}}|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122]. AFAIK, having the default-hadoop-version be {{1.0.4}} is not that important and may just be an accident of history; people typically select a Maven profile when building Spark that matches the version of Hadoop that they intend to run with. This suggests one possible fix: bump the default Hadoop version to >= 2. I've tried this locally and it resolves Intellij's difficulties with the yarn and network/yarn modules; [PR #3917|https://github.com/apache/spark/pull/3917] does this. Another fix would be to declare a {{hadoop.version}} property in {{yarn/pom.xml}} and add that to the YARN dependencies in that file; [PR #3918|https://github.com/apache/spark/pull/3918] does that. It is more obvious to me in the former case that the existing rules that govern what {{hadoop.version}} the YARN dependencies inherit will still apply. For the latter, or potentially other ways to configure IntelliJ / Spark's POMs to address this issue, someone with more Maven/IntelliJ fu may need to chime in. was: Intellij's parsing of Spark's POMs works like a charm for the most part; however, it fails to resolve the hadoop and yarn dependencies in the Spark {{yarn}} and {{network/yarn}} modules. Imports and later references to imported classes show up as errors, e.g. !http://f.cl.ly/items/0g3w3s0t45402z30011l/Screen%20Shot%202015-01-06%20at%206.42.52%20PM.png! Opening the module settings, we see that IntelliJ is looking for version {{1.0.4}} of [each yarn JAR that the Spark YARN module depends on|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/yarn/pom.xml#L41-L56], and failing to find them: !http://f.cl.ly/items/2d320l2h2o2N1m0t2X3b/yarn.png! This, in turn, is due to the parent POM [defaulting {{hadoop.version}} to {{1.0.4}}|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122]. AFAIK, having the default-hadoop-version be {{1.0.4}} is not that important and may just be an accident of history; people typically select a Maven profile when building Spark that matches the version of Hadoop that they intend to run with. This suggests one possible fix: bump the default Hadoop version to >= 2. I've tried this locally and it resolves Intellij's difficulties with the yarn and network/yarn modules; [PR #3917|https://github.com/apache/spark/pull/3917] does this. Another fix would be to declare a {{hadoop.version}} property in {{yarn/pom.xml}} and add that to the YARN dependencies in that file; [PR #3918|https://github.com/apache/spark/pull/3918] does that.
It is more obvious to me in the former case that the existing rules that govern what {{hadoop.version}} the YARN dependencies inherit will still apply. For the latter, or potentially other ways to configure IntelliJ / Spark's POMs, someone with more Maven/IntelliJ fu may need to chime in. Intellij fails to find hadoop classes in Spark yarn modules - Key: SPARK-5115 URL: https://issues.apache.org/jira/browse/SPARK-5115 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Ryan Williams Intellij's parsing of Spark's POMs works like a charm for the most part; however, it fails to resolve the hadoop and yarn dependencies in the Spark {{yarn}} and {{network/yarn}} modules. Imports and later references to imported classes show up as errors, e.g. !http://f.cl.ly/items/0g3w3s0t45402z30011l/Screen%20Shot%202015-01-06%20at%206.42.52%20PM.png! Opening the module settings, we see that IntelliJ is looking for version {{1.0.4}} of [each yarn JAR that the Spark YARN module depends on|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/yarn/pom.xml#L41-L56], and failing to find them: !http://f.cl.ly/items/2d320l2h2o2N1m0t2X3b/yarn.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5115) Intellij fails to find hadoop classes in Spark yarn modules
[ https://issues.apache.org/jira/browse/SPARK-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267047#comment-14267047 ] Ryan Williams commented on SPARK-5115: -- FTR, the IntelliJ problem I'm referring to is simply its current inability to resolve imports (see the first image in the OP) and the resulting red errors / loss of various code-inspection functionality. This is not an issue of compilations failing within IntelliJ, which is what your comments about profile-setting would be relevant to. Intellij fails to find hadoop classes in Spark yarn modules - Key: SPARK-5115 URL: https://issues.apache.org/jira/browse/SPARK-5115 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Ryan Williams Intellij's parsing of Spark's POMs works like a charm for the most part; however, it fails to resolve the hadoop and yarn dependencies in the Spark {{yarn}} and {{network/yarn}} modules. Imports and later references to imported classes show up as errors, e.g. !http://f.cl.ly/items/0g3w3s0t45402z30011l/Screen%20Shot%202015-01-06%20at%206.42.52%20PM.png! Opening the module settings, we see that IntelliJ is looking for version {{1.0.4}} of [each yarn JAR that the Spark YARN module depends on|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/yarn/pom.xml#L41-L56], and failing to find them: !http://f.cl.ly/items/2d320l2h2o2N1m0t2X3b/yarn.png! This, in turn, is due to the parent POM [defaulting {{hadoop.version}} to {{1.0.4}}|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122]. AFAIK, having the default-hadoop-version be {{1.0.4}} is not that important and may just be an accident of history; people typically select a Maven profile when building Spark that matches the version of Hadoop that they intend to run with. This suggests one possible fix: bump the default Hadoop version to >= 2. I've tried this locally and it resolves Intellij's difficulties with the yarn and network/yarn modules; [PR #3917|https://github.com/apache/spark/pull/3917] does this. Another fix would be to declare a {{hadoop.version}} property in {{yarn/pom.xml}} and add that to the YARN dependencies in that file; [PR #3918|https://github.com/apache/spark/pull/3918] does that. It is more obvious to me in the former case that the existing rules that govern what {{hadoop.version}} the YARN dependencies inherit will still apply. For the latter, or potentially other ways to configure IntelliJ / Spark's POMs to address this issue, someone with more Maven/IntelliJ fu may need to chime in. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5115) Intellij fails to find hadoop classes in Spark yarn modules
[ https://issues.apache.org/jira/browse/SPARK-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267074#comment-14267074 ] Sean Owen commented on SPARK-5115: -- I just deleted my IntelliJ project config for Spark ({{.idea/}}, all {{.iml}}) and reimported the Maven build from master, choosing all defaults. The build is fine for me* and {{yarn/}} is not even a module, since the {{yarn}} profile, which turns on this module, is not on by default. So I think you have somehow activated the YARN-related module, but it takes another step or two to do that in the build -- activating the {{yarn}} and {{hadoop-2.4}} profiles, for example, is what I do. If I turn on these profiles, reimport the Maven project, and rebuild in IntelliJ, {{yarn}} becomes a module and it builds OK for me. I hope that resolves the compile error you see and gets rid of the red. This is why I'm saying I don't see that there's a basic developer sanity problem to fix. The build seems to do what it's supposed to when put into IntelliJ. To me, separately, the idea of updating the Hadoop default to something more modern (Hadoop 2.4? YARN-enabled?) sounds fine on its own, not because it solves a problem but just because it feels like a more sensible default in 2015. * I find I have to press the 'generate sources' button in IJ before the first build or else Make won't find the generated sources in the flume-sink module, but I think that's not related here ** Hm, I see some crazy-looking compiler errors from the Catalyst DSL package the first time I compile, that then go away, but I also think that's something unrelated or to do with code generation Intellij fails to find hadoop classes in Spark yarn modules - Key: SPARK-5115 URL: https://issues.apache.org/jira/browse/SPARK-5115 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Ryan Williams Intellij's parsing of Spark's POMs works like a charm for the most part; however, it fails to resolve the hadoop and yarn dependencies in the Spark {{yarn}} and {{network/yarn}} modules. Imports and later references to imported classes show up as errors, e.g. !http://f.cl.ly/items/0g3w3s0t45402z30011l/Screen%20Shot%202015-01-06%20at%206.42.52%20PM.png! Opening the module settings, we see that IntelliJ is looking for version {{1.0.4}} of [each yarn JAR that the Spark YARN module depends on|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/yarn/pom.xml#L41-L56], and failing to find them: !http://f.cl.ly/items/2d320l2h2o2N1m0t2X3b/yarn.png! This, in turn, is due to the parent POM [defaulting {{hadoop.version}} to {{1.0.4}}|https://github.com/apache/spark/blob/bb38ebb1abd26b57525d7d29703fd449e40cd6de/pom.xml#L122]. AFAIK, having the default-hadoop-version be {{1.0.4}} is not that important and may just be an accident of history; people typically select a Maven profile when building Spark that matches the version of Hadoop that they intend to run with. This suggests one possible fix: bump the default Hadoop version to >= 2. I've tried this locally and it resolves Intellij's difficulties with the yarn and network/yarn modules; [PR #3917|https://github.com/apache/spark/pull/3917] does this. Another fix would be to declare a {{hadoop.version}} property in {{yarn/pom.xml}} and add that to the YARN dependencies in that file; [PR #3918|https://github.com/apache/spark/pull/3918] does that.
It is more obvious to me in the former case that the existing rules that govern what {{hadoop.version}} the YARN dependencies inherit will still apply. For the latter, or potentially other ways to configure IntelliJ / Spark's POMs to address this issue, someone with more Maven/IntelliJ fu may need to chime in. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5108) Need to make jackson dependency version consistent with hadoop-2.6.0.
[ https://issues.apache.org/jira/browse/SPARK-5108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-5108: -- Summary: Need to make jackson dependency version consistent with hadoop-2.6.0. (was: Need to add more jackson dependency for hadoop-2.6.0 support.) Need to make jackson dependency version consistent with hadoop-2.6.0. - Key: SPARK-5108 URL: https://issues.apache.org/jira/browse/SPARK-5108 Project: Spark Issue Type: Bug Components: Build Reporter: Zhan Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266740#comment-14266740 ] Joseph K. Bradley commented on SPARK-5019: -- [~lewuathe] I would recommend getting this JIRA assigned to you before submitting a PR, to make sure no one else is working on it. In particular, I believe [~tgaloppo] was planning on handling this JIRA after his current PR [https://github.com/apache/spark/pull/3871]. Can you please coordinate with him on how to divide up the JIRAs? Thanks! Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Blocker The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5110) Spark-on-Yarn does not work on Windows platform
[ https://issues.apache.org/jira/browse/SPARK-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266744#comment-14266744 ] Zhan Zhang commented on SPARK-5110: --- You are right. I will mark this as a duplicate. Spark-on-Yarn does not work on Windows platform --- Key: SPARK-5110 URL: https://issues.apache.org/jira/browse/SPARK-5110 Project: Spark Issue Type: Bug Reporter: Zhan Zhang Some of the scripts launching the AM and executors in spark-on-yarn do not work on the Windows platform. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5110) Spark-on-Yarn does not work on Windows platform
[ https://issues.apache.org/jira/browse/SPARK-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang closed SPARK-5110. - Resolution: Duplicate Spark-on-Yarn does not work on Windows platform --- Key: SPARK-5110 URL: https://issues.apache.org/jira/browse/SPARK-5110 Project: Spark Issue Type: Bug Reporter: Zhan Zhang Some of the scripts launching the AM and executors in spark-on-yarn do not work on the Windows platform. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4924) Factor out code to launch Spark applications into a separate library
[ https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266897#comment-14266897 ] Apache Spark commented on SPARK-4924: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/3916 Factor out code to launch Spark applications into a separate library Key: SPARK-4924 URL: https://issues.apache.org/jira/browse/SPARK-4924 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Marcelo Vanzin Attachments: spark-launcher.txt One of the questions we run into rather commonly is "how do I start a Spark application from my Java/Scala program?". There currently isn't a good answer to that:
- Instantiating SparkContext has limitations (e.g., you can only have one active context at the moment, plus you lose the ability to submit apps in cluster mode)
- Calling SparkSubmit directly is doable, but you lose a lot of the logic handled by the shell scripts
- Calling the shell script directly is doable, but sort of ugly from an API point of view.
I think it would be nice to have a small library that handles that for users, along the lines of the sketch below. On top of that, this library could be used by Spark itself to replace a lot of the code in the current shell scripts, which have a lot of duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
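[Editor's note] A minimal sketch of what such a library could look like, assuming it simply wraps the existing {{spark-submit}} script in a builder; the class and method names here are hypothetical, not an actual Spark API:
{code}
import scala.collection.mutable.ArrayBuffer

// Hypothetical launcher: builds a spark-submit command line and starts it
// as a child process. A real library would also reproduce the env-var and
// classpath handling currently duplicated across the shell scripts.
class SparkAppLauncher(sparkHome: String) {
  private val opts = ArrayBuffer[String]()
  def setMaster(master: String): this.type = { opts += "--master" += master; this }
  def setMainClass(cls: String): this.type = { opts += "--class" += cls; this }
  def launch(appResource: String, appArgs: String*): Process = {
    val cmd = Seq(s"$sparkHome/bin/spark-submit") ++ opts ++ Seq(appResource) ++ appArgs
    new ProcessBuilder(cmd: _*).start()
  }
}

// Usage sketch:
// val app = new SparkAppLauncher("/opt/spark")
//   .setMaster("yarn-cluster")
//   .setMainClass("com.example.Main")
//   .launch("/path/to/app.jar", "arg1")
{code}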
[jira] [Resolved] (SPARK-5050) Add unit test for sqdist
[ https://issues.apache.org/jira/browse/SPARK-5050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5050. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3869 [https://github.com/apache/spark/pull/3869] Add unit test for sqdist Key: SPARK-5050 URL: https://issues.apache.org/jira/browse/SPARK-5050 Project: Spark Issue Type: Test Reporter: Liang-Chi Hsieh Priority: Minor Fix For: 1.3.0 Related to #3643. Following the suggestion there, add a unit test for sqdist in VectorsSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
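[Editor's note] For context, a sketch of the kind of check such a test would make, assuming {{Vectors.sqdist}} is the method under test: compare it against the naive squared Euclidean distance on mixed dense/sparse inputs.
{code}
import org.apache.spark.mllib.linalg.Vectors

val v1 = Vectors.dense(1.0, 2.0, 3.0)
val v2 = Vectors.sparse(3, Array(0, 2), Array(4.0, -1.0))

// Naive reference: sum of squared component-wise differences.
val expected = v1.toArray.zip(v2.toArray).map { case (a, b) => (a - b) * (a - b) }.sum
assert(math.abs(Vectors.sqdist(v1, v2) - expected) < 1e-9)
{code}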
[jira] [Updated] (SPARK-5050) Add unit test for sqdist
[ https://issues.apache.org/jira/browse/SPARK-5050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5050: - Assignee: Liang-Chi Hsieh Add unit test for sqdist Key: SPARK-5050 URL: https://issues.apache.org/jira/browse/SPARK-5050 Project: Spark Issue Type: Test Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Minor Fix For: 1.3.0 Related to #3643. Following the previous suggestion, add a unit test for sqdist in VectorsSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266589#comment-14266589 ] Apache Spark commented on SPARK-4296: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3910 Throw Expression not in GROUP BY when using same expression in group by clause and select clause --- Key: SPARK-4296 URL: https://issues.apache.org/jira/browse/SPARK-4296 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Shixiong Zhu Assignee: Cheng Lian Priority: Blocker Fix For: 1.2.0 When the input data has a complex structure, using the same expression in the group by clause and the select clause will throw "Expression not in GROUP BY".
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Birthday(date: String)
case class Person(name: String, birthday: Birthday)
val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28"))))
people.registerTempTable("people")
val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
year.collect
{code}
Here is the plan of year:
{code:java}
SchemaRDD[3] at RDD at SchemaRDD.scala:105
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3]
 Subquery people
  LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36
{code}
The bug is the equality test for `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`. Maybe Spark SQL needs a mechanism to compare Alias expressions and non-Alias expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
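One possible direction, sketched as an assumption rather than the actual fix: compare grouping and select expressions only after stripping top-level aliases, so that Upper(birthday#1.date AS date#9) matches Upper(birthday#1.date). The Catalyst types used (Expression, Alias, transform) are real; the helper itself is illustrative:
{code}
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression}

// Replace every Alias node with its child so aliased and unaliased
// forms of the same expression compare equal.
def stripAliases(expr: Expression): Expression =
  expr.transform { case Alias(child, _) => child }

def sameIgnoringAliases(a: Expression, b: Expression): Boolean =
  stripAliases(a) == stripAliases(b)
{code}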
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266598#comment-14266598 ] Apache Spark commented on SPARK-5019: - User 'Lewuathe' has created a pull request for this issue: https://github.com/apache/spark/pull/3911 Update GMM API to use MultivariateGaussian -- Key: SPARK-5019 URL: https://issues.apache.org/jira/browse/SPARK-5019 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Blocker The GaussianMixtureModel API should expose MultivariateGaussian instances instead of the means and covariances. This should be fixed as soon as possible to stabilize the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5101) Add common ML math functions
[ https://issues.apache.org/jira/browse/SPARK-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266755#comment-14266755 ] Apache Spark commented on SPARK-5101: - User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/3915 Add common ML math functions Key: SPARK-5101 URL: https://issues.apache.org/jira/browse/SPARK-5101 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: DB Tsai Priority: Minor We can add common ML math functions to MLlib. It may be a little tricky to implement those functions in a numerically stable way. For example,
{code}
math.log(1 + math.exp(x))
{code}
should be implemented as
{code}
if (x > 0) {
  x + math.log1p(math.exp(-x))
} else {
  math.log1p(math.exp(x))
}
{code}
It becomes hard to maintain if we have multiple copies of the correct implementation in the codebase. A good place for those functions could be `mllib.util.MathFunctions`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
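Putting the two branches above together, a minimal self-contained sketch of the stable helper (the log1pExp name is illustrative):
{code}
// Numerically stable log(1 + exp(x)).
// For large positive x, math.exp(x) overflows to Infinity, so factor
// x out first; math.log1p stays accurate when its argument is tiny.
def log1pExp(x: Double): Double =
  if (x > 0) x + math.log1p(math.exp(-x))
  else math.log1p(math.exp(x))

// e.g. log1pExp(1000.0) returns 1000.0, while the naive form returns Infinity.
{code}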
[jira] [Resolved] (SPARK-5017) GaussianMixtureEM should use SVD for Gaussian initialization
[ https://issues.apache.org/jira/browse/SPARK-5017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5017. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3871 [https://github.com/apache/spark/pull/3871] GaussianMixtureEM should use SVD for Gaussian initialization Key: SPARK-5017 URL: https://issues.apache.org/jira/browse/SPARK-5017 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Travis Galoppo Fix For: 1.3.0 GaussianMixtureEM effectively does 2 matrix decompositions in Gaussian initialization (pinv and det). Instead, it should do SVD and use that result to compute the inverse and det. This will also prevent failure when the matrix is singular. Note: Breeze pinv fails when the matrix is singular: [https://github.com/scalanlp/breeze/issues/304] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
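As a hedged sketch of the idea (using Breeze directly; the function name and tolerance are illustrative assumptions, not the actual patch): a single SVD yields both a pseudo-inverse and a pseudo-determinant, and it tolerates singular covariance matrices:
{code}
import breeze.linalg.{diag, svd, DenseMatrix}

def pinvAndDet(m: DenseMatrix[Double], tol: Double = 1e-9): (DenseMatrix[Double], Double) = {
  val svd.SVD(u, s, vt) = svd(m)
  // Invert only singular values above the tolerance; zero out the rest.
  val sInv = s.map(v => if (v > tol) 1.0 / v else 0.0)
  val pinv = vt.t * diag(sInv) * u.t
  // Pseudo-determinant: product of the non-negligible singular values.
  val det = s.toArray.filter(_ > tol).product
  (pinv, det)
}
{code}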
[jira] [Commented] (SPARK-5116) Add extractor for SparseVector and DenseVector in MLlib
[ https://issues.apache.org/jira/browse/SPARK-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267176#comment-14267176 ] Apache Spark commented on SPARK-5116: - User 'coderxiang' has created a pull request for this issue: https://github.com/apache/spark/pull/3919 Add extractor for SparseVector and DenseVector in MLlib Key: SPARK-5116 URL: https://issues.apache.org/jira/browse/SPARK-5116 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Shuo Xiang Priority: Minor Add extractor for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we need to use: vec match { case dv: DenseVector => val values = dv.values ... case sv: SparseVector => val indices = sv.indices val values = sv.values val size = sv.size ... } with extractor it is: vec match { case DenseVector(values) => ... case SparseVector(size, indices, values) => ... } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5116) Add extractor for SparseVector and DenseVector in MLlib
[ https://issues.apache.org/jira/browse/SPARK-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-5116: -- Description: Add extractor for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we need to use:
{code:title=RankingMetrics.scala|borderStyle=solid}
vec match {
  case dv: DenseVector =>
    val values = dv.values
    ...
  case sv: SparseVector =>
    val indices = sv.indices
    val values = sv.values
    val size = sv.size
    ...
}
{code}
with extractor it is:
{code:title=RankingMetrics.scala|borderStyle=solid}
vec match {
  case DenseVector(values) => ...
  case SparseVector(size, indices, values) => ...
}
{code}
was: Add extractor for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we need to use: vec match { case dv: DenseVector => val values = dv.values ... case sv: SparseVector => val indices = sv.indices val values = sv.values val size = sv.size ... } with extractor it is: vec match { case DenseVector(values) => ... case SparseVector(size, indices, values) => ... } Add extractor for SparseVector and DenseVector in MLlib Key: SPARK-5116 URL: https://issues.apache.org/jira/browse/SPARK-5116 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Shuo Xiang Priority: Minor Add extractor for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we need to use:
{code:title=RankingMetrics.scala|borderStyle=solid}
vec match {
  case dv: DenseVector =>
    val values = dv.values
    ...
  case sv: SparseVector =>
    val indices = sv.indices
    val values = sv.values
    val size = sv.size
    ...
}
{code}
with extractor it is:
{code:title=RankingMetrics.scala|borderStyle=solid}
vec match {
  case DenseVector(values) => ...
  case SparseVector(size, indices, values) => ...
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5116) Add extractor for SparseVector and DenseVector in MLlib
[ https://issues.apache.org/jira/browse/SPARK-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-5116: -- Description: Add extractor for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we need to use:
{code:title=A.scala|borderStyle=solid}
vec match {
  case dv: DenseVector =>
    val values = dv.values
    ...
  case sv: SparseVector =>
    val indices = sv.indices
    val values = sv.values
    val size = sv.size
    ...
}
{code}
with extractor it is:
{code:title=B.scala|borderStyle=solid}
vec match {
  case DenseVector(values) => ...
  case SparseVector(size, indices, values) => ...
}
{code}
was: Add extractor for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we need to use:
{code:title=RankingMetrics.scala|borderStyle=solid}
vec match {
  case dv: DenseVector =>
    val values = dv.values
    ...
  case sv: SparseVector =>
    val indices = sv.indices
    val values = sv.values
    val size = sv.size
    ...
}
{code}
with extractor it is:
{code:title=RankingMetrics.scala|borderStyle=solid}
vec match {
  case DenseVector(values) => ...
  case SparseVector(size, indices, values) => ...
}
{code}
Add extractor for SparseVector and DenseVector in MLlib Key: SPARK-5116 URL: https://issues.apache.org/jira/browse/SPARK-5116 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Shuo Xiang Priority: Minor Add extractor for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we need to use:
{code:title=A.scala|borderStyle=solid}
vec match {
  case dv: DenseVector =>
    val values = dv.values
    ...
  case sv: SparseVector =>
    val indices = sv.indices
    val values = sv.values
    val size = sv.size
    ...
}
{code}
with extractor it is:
{code:title=B.scala|borderStyle=solid}
vec match {
  case DenseVector(values) => ...
  case SparseVector(size, indices, values) => ...
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5118) Create table test stored as parquet as select ... report error
guowei created SPARK-5118: - Summary: Create table test stored as parquet as select ... report error Key: SPARK-5118 URL: https://issues.apache.org/jira/browse/SPARK-5118 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: guowei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5121) Stored as parquet doesn't support the CTAS
XiaoJing wang created SPARK-5121: Summary: Stored as parquet doesn't support the CTAS Key: SPARK-5121 URL: https://issues.apache.org/jira/browse/SPARK-5121 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Environment: hive-0.13.1 Reporter: XiaoJing wang Fix For: 1.2.0 In CTAS, stored as parquet is an unsupported Hive feature -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5120) Output the thread name in log4j.properties
[ https://issues.apache.org/jira/browse/SPARK-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] WangTaoTheTonic updated SPARK-5120: --- Issue Type: Improvement (was: Bug) Output the thread name in log4j.properties -- Key: SPARK-5120 URL: https://issues.apache.org/jira/browse/SPARK-5120 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Priority: Minor In most cases the thread name is very useful for analyzing a running job; it is better to log it in log4j.properties. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5120) Output the thread name in log4j.properties
WangTaoTheTonic created SPARK-5120: -- Summary: Output the thread name in log4j.properties Key: SPARK-5120 URL: https://issues.apache.org/jira/browse/SPARK-5120 Project: Spark Issue Type: Bug Components: Deploy Reporter: WangTaoTheTonic Priority: Minor In most cases the thread name is very useful for analyzing a running job; it is better to log it in log4j.properties. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5090) The improvement of python converter for hbase
[ https://issues.apache.org/jira/browse/SPARK-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267184#comment-14267184 ] Apache Spark commented on SPARK-5090: - User 'GenTang' has created a pull request for this issue: https://github.com/apache/spark/pull/3920 The improvement of python converter for hbase - Key: SPARK-5090 URL: https://issues.apache.org/jira/browse/SPARK-5090 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Gen TANG Labels: hbase, python Fix For: 1.2.1 Original Estimate: 168h Remaining Estimate: 168h The python converter `HBaseResultToStringConverter` provided in HBaseConverter.scala returns only the value of the first column in the result. This limits the utility of the converter, because it returns only one value per row (there may be several versions in HBase) and it loses the rest of the record's information, such as column:cell and timestamp. Here we propose an improvement to the python converter so that it returns all the records in the result (in a single string) with more complete information. We would also like to make some improvements to hbase_inputformat.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5118) Create table test stored as parquet as select ... report error
[ https://issues.apache.org/jira/browse/SPARK-5118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267224#comment-14267224 ] Apache Spark commented on SPARK-5118: - User 'guowei2' has created a pull request for this issue: https://github.com/apache/spark/pull/3921 Create table test stored as parquet as select ... report error Key: SPARK-5118 URL: https://issues.apache.org/jira/browse/SPARK-5118 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: guowei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5120) Output the thread name in log4j.properties
[ https://issues.apache.org/jira/browse/SPARK-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267256#comment-14267256 ] Apache Spark commented on SPARK-5120: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/3922 Output the thread name in log4j.properties -- Key: SPARK-5120 URL: https://issues.apache.org/jira/browse/SPARK-5120 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Priority: Minor In most cases the thread name is very useful for analyzing a running job; it is better to log it in log4j.properties. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
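For reference, a hedged example of the change: the %t conversion character in log4j's PatternLayout prints the thread name. The surrounding pattern below is illustrative; the actual default in Spark's conf/log4j.properties may differ in its other conversion specifiers:
{code}
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
# [%t] inserts the thread name into each log line.
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p [%t] %c{1}: %m%n
{code}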
[jira] [Created] (SPARK-5116) Add extractor for SparseVector and DenseVector in MLlib
Shuo Xiang created SPARK-5116: - Summary: Add extractor for SparseVector and DenseVector in MLlib Key: SPARK-5116 URL: https://issues.apache.org/jira/browse/SPARK-5116 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Shuo Xiang Priority: Minor Add extractor for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we need to use: vec match { case dv: DenseVector => val values = dv.values ... case sv: SparseVector => val indices = sv.indices val values = sv.values val size = sv.size ... } with extractor it is: vec match { case DenseVector(values) => ... case SparseVector(size, indices, values) => ... } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
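For illustration, extractors like these are ordinarily implemented with unapply. The sketch below uses standalone objects with hypothetical names; the actual patch would presumably put the unapply methods on the DenseVector and SparseVector companion objects instead:
{code}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}

// Hypothetical extractor objects: unapply exposes the fields that the
// type-test-and-accessor pattern above had to pull out by hand.
object DenseVectorMatch {
  def unapply(dv: DenseVector): Option[Array[Double]] = Some(dv.values)
}

object SparseVectorMatch {
  def unapply(sv: SparseVector): Option[(Int, Array[Int], Array[Double])] =
    Some((sv.size, sv.indices, sv.values))
}

// Usage:
// vec match {
//   case DenseVectorMatch(values) => ...
//   case SparseVectorMatch(size, indices, values) => ...
// }
{code}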
[jira] [Created] (SPARK-5117) Hive Generic UDFs don't cast correctly
Michael Armbrust created SPARK-5117: --- Summary: Hive Generic UDFs don't cast correctly Key: SPARK-5117 URL: https://issues.apache.org/jira/browse/SPARK-5117 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Michael Armbrust Assignee: Cheng Hao Priority: Blocker Here's a test case that is failing in master:
{code}
createQueryTest("generic udf casting", "SELECT LPAD(test, 5, 0) FROM src LIMIT 1")
{code}
This appears to be a regression from Spark 1.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5118) Create table test stored as parquet as select ... report error
[ https://issues.apache.org/jira/browse/SPARK-5118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guowei updated SPARK-5118: -- Description: Caused by: java.lang.RuntimeException: Unhandled clauses: TOK_TBLPARQUETFILE Create table test stored as parquet as select ... report error Key: SPARK-5118 URL: https://issues.apache.org/jira/browse/SPARK-5118 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: guowei Caused by: java.lang.RuntimeException: Unhandled clauses: TOK_TBLPARQUETFILE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
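A hedged reproduction sketch of the statement form named in the summary, issued through HiveContext; the table name and select body are placeholders for the elided parts:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Fails with: java.lang.RuntimeException: Unhandled clauses: TOK_TBLPARQUETFILE
hiveContext.sql("CREATE TABLE test STORED AS PARQUET AS SELECT key, value FROM src")
{code}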
[jira] [Commented] (SPARK-5104) Distributed Representations of Sentences and Documents
[ https://issues.apache.org/jira/browse/SPARK-5104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267226#comment-14267226 ] Guoqiang Li commented on SPARK-5104: Dimension reduction in text classification. It performs better than the LDA algorithm. The algorithm has been implemented in [gensim|https://github.com/piskvorky/gensim/pull/231] Distributed Representations of Sentences and Documents -- Key: SPARK-5104 URL: https://issues.apache.org/jira/browse/SPARK-5104 Project: Spark Issue Type: Wish Components: ML, MLlib Reporter: Guoqiang Li The paper [Distributed Representations of Sentences and Documents|http://arxiv.org/abs/1405.4053] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5018) Make MultivariateGaussian public
[ https://issues.apache.org/jira/browse/SPARK-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267259#comment-14267259 ] Apache Spark commented on SPARK-5018: - User 'tgaloppo' has created a pull request for this issue: https://github.com/apache/spark/pull/3923 Make MultivariateGaussian public Key: SPARK-5018 URL: https://issues.apache.org/jira/browse/SPARK-5018 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Travis Galoppo Priority: Critical MultivariateGaussian is currently private[ml], but it would be a useful public class. This JIRA will require defining a good public API for distributions. This JIRA will be needed for finalizing the GaussianMixtureModel API, which should expose MultivariateGaussian instances instead of the means and covariances. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688
[ https://issues.apache.org/jira/browse/SPARK-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267168#comment-14267168 ] Jongyoul Lee commented on SPARK-3619: - Ok, I'll handle it. Upgrade to Mesos 0.21 to work around MESOS-1688 --- Key: SPARK-3619 URL: https://issues.apache.org/jira/browse/SPARK-3619 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Matei Zaharia Assignee: Timothy Chen The Mesos 0.21 release has a fix for https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5088) Use spark-class for running executors directly
[ https://issues.apache.org/jira/browse/SPARK-5088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jongyoul Lee updated SPARK-5088: Issue Type: Task (was: Bug) Use spark-class for running executors directly -- Key: SPARK-5088 URL: https://issues.apache.org/jira/browse/SPARK-5088 Project: Spark Issue Type: Task Components: Deploy, Mesos Affects Versions: 1.2.0 Reporter: Jongyoul Lee Priority: Minor
- sbin/spark-executor is only used for running executors in a Mesos environment.
- spark-executor internally calls spark-class without any specific parameters.
- PYTHONPATH setup is moved into spark-class.
- Remove a redundant file to simplify maintenance.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5119) java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model
Vivek Kulkarni created SPARK-5119: - Summary: java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model Key: SPARK-5119 URL: https://issues.apache.org/jira/browse/SPARK-5119 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 1.2.0, 1.1.0 Environment: Linux ubuntu 14.04 Reporter: Vivek Kulkarni First I checked whether a bug with a similar trace had been raised before. I found https://www.mail-archive.com/user@spark.apache.org/msg13708.html but the suggestion to upgrade to the latest code base (I cloned from the master branch) does not fix this issue. Issue: try to train a decision tree classifier on some data. After training, when it begins to collect, it crashes:
15/01/06 22:28:15 INFO BlockManagerMaster: Updated info of block rdd_52_1
15/01/06 22:28:15 ERROR Executor: Exception in task 1.0 in stage 31.0 (TID 1895)
java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.spark.mllib.tree.impurity.GiniAggregator.update(Gini.scala:93)
at org.apache.spark.mllib.tree.impl.DTStatsAggregator.update(DTStatsAggregator.scala:100)
at org.apache.spark.mllib.tree.DecisionTree$.orderedBinSeqOp(DecisionTree.scala:419)
at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$nodeBinSeqOp$1(DecisionTree.scala:511)
at org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:536)
at org.apache.spark.mllib.tree.DecisionTree$$anonfun$org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1$1.apply(DecisionTree.scala:533)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:533)
at org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628)
at org.apache.spark.mllib.tree.DecisionTree$$anonfun$6$$anonfun$apply$8.apply(DecisionTree.scala:628)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
Minimal code:
data = MLUtils.loadLibSVMFile(sc, '/scratch1/vivek/datasets/private/a1a').cache()
model = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={}, maxDepth=5, maxBins=100)
Just download the data from: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a1a -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
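One plausible cause, inferred from the -1 index in the trace rather than confirmed in the report: the a1a dataset uses -1/+1 labels, while trainClassifier with numClasses=2 expects labels in {0, 1}, so a -1 label would index the Gini aggregator out of bounds. A hedged Scala sketch of remapping the labels before training:
{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

val raw = MLUtils.loadLibSVMFile(sc, "/scratch1/vivek/datasets/private/a1a")
// Map -1/+1 libsvm labels to the 0/1 range that trainClassifier expects.
val data = raw.map(lp => LabeledPoint(if (lp.label > 0) 1.0 else 0.0, lp.features)).cache()
val model = DecisionTree.trainClassifier(data, numClasses = 2,
  categoricalFeaturesInfo = Map.empty[Int, Int], impurity = "gini",
  maxDepth = 5, maxBins = 100)
{code}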
[jira] [Created] (SPARK-5104) Distributed Representations of Sentences and Documents
Guoqiang Li created SPARK-5104: -- Summary: Distributed Representations of Sentences and Documents Key: SPARK-5104 URL: https://issues.apache.org/jira/browse/SPARK-5104 Project: Spark Issue Type: Wish Components: ML, MLlib Reporter: Guoqiang Li The Paper [Distributed Representations of Sentences and Documents|http://arxiv.org/abs/1405.4053] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5122) Remove Shark from spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-5122: Summary: Remove Shark from spark-ec2 (was: Remove Shark from spark-ec2 modules) Remove Shark from spark-ec2 --- Key: SPARK-5122 URL: https://issues.apache.org/jira/browse/SPARK-5122 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5122) Remove Shark from spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-5122: Description: Since Shark has been replaced by Spark SQL, we don't need it in {{spark-ec2}} anymore. (?) Remove Shark from spark-ec2 --- Key: SPARK-5122 URL: https://issues.apache.org/jira/browse/SPARK-5122 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor Since Shark has been replaced by Spark SQL, we don't need it in {{spark-ec2}} anymore. (?) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5122) Remove Shark from spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267281#comment-14267281 ] Nicholas Chammas commented on SPARK-5122: - cc [~shivaram] - Is it appropriate to just remove the Shark module from {{spark-ec2}}? Remove Shark from spark-ec2 --- Key: SPARK-5122 URL: https://issues.apache.org/jira/browse/SPARK-5122 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5123) Expose only one version of the data type APIs (i.e. remove the Java-specific API)
[ https://issues.apache.org/jira/browse/SPARK-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267301#comment-14267301 ] Apache Spark commented on SPARK-5123: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/3925 Expose only one version of the data type APIs (i.e. remove the Java-specific API) - Key: SPARK-5123 URL: https://issues.apache.org/jira/browse/SPARK-5123 Project: Spark Issue Type: Bug Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Having two versions of the data type APIs (one for Java, one for Scala) requires downstream libraries to also have two versions of the APIs if the library wants to support both Java and Scala. I took a look at the Scala version of the data type APIs - it can actually work out pretty well for Java out of the box. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5009) allCaseVersions function in SqlLexical leads to StackOverflow Exception
[ https://issues.apache.org/jira/browse/SPARK-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267282#comment-14267282 ] Apache Spark commented on SPARK-5009: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3924 allCaseVersions function in SqlLexical leads to StackOverflow Exception - Key: SPARK-5009 URL: https://issues.apache.org/jira/browse/SPARK-5009 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1, 1.2.0 Reporter: shengli Fix For: 1.3.0, 1.2.1 Original Estimate: 96h Remaining Estimate: 96h Recently I found a bug when adding a new feature in SqlParser: if I define a Keyword that has a long name, like ```protected val SERDEPROPERTIES = Keyword("SERDEPROPERTIES")```, then, since the all-case version is implemented by a recursive function, when the ```implicit asParser``` function is called and the stack memory is small, it leads to a StackOverflowError.
java.lang.StackOverflowError
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
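For illustration, the enumeration can be written iteratively so that stack depth does not grow with the keyword's length. This sketch is an assumption about the shape of a fix, not the actual patch; note the result still grows exponentially in the number of letters:
{code}
// Stack-safe enumeration of all case variants of a keyword.
def allCaseVersions(s: String): Seq[String] =
  s.foldLeft(Seq("")) { (variants, c) =>
    if (c.isLetter) variants.flatMap(p => Seq(p + c.toLower, p + c.toUpper))
    else variants.map(_ + c)
  }

// allCaseVersions("as") == Seq("as", "aS", "As", "AS")
{code}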
[jira] [Updated] (SPARK-5099) Simplify logistic loss function
[ https://issues.apache.org/jira/browse/SPARK-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5099: - Assignee: Liang-Chi Hsieh Simplify logistic loss function --- Key: SPARK-5099 URL: https://issues.apache.org/jira/browse/SPARK-5099 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Minor Fix For: 1.3.0 This is a minor PR: in LogisticGradient we can simply negate the margin instead of subtracting it. Mathematically the two are equal, but the modified equation is the common form of the logistic loss function and so is more readable. As some quick tests show, it also computes a more accurate value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5099) Simplify logistic loss function
[ https://issues.apache.org/jira/browse/SPARK-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5099. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3899 [https://github.com/apache/spark/pull/3899] Simplify logistic loss function --- Key: SPARK-5099 URL: https://issues.apache.org/jira/browse/SPARK-5099 Project: Spark Issue Type: Improvement Reporter: Liang-Chi Hsieh Priority: Minor Fix For: 1.3.0 This is a minor PR: in LogisticGradient we can simply negate the margin instead of subtracting it. Mathematically the two are equal, but the modified equation is the common form of the logistic loss function and so is more readable. As some quick tests show, it also computes a more accurate value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
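The identity behind the change, sketched with illustrative values (margin plays the role of the margin variable in LogisticGradient):
{code}
// log(1 + e^m) - m  ==  log((1 + e^m) / e^m)  ==  log(1 + e^(-m))
val margin = 3.7
val subtracted = math.log1p(math.exp(margin)) - margin
val negated = math.log1p(math.exp(-margin))
assert(math.abs(subtracted - negated) < 1e-12)
{code}
The negated form also avoids the cancellation in the subtraction when margin is large, which is consistent with the accuracy improvement reported above.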
[jira] [Commented] (SPARK-5122) Remove Shark from spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267296#comment-14267296 ] Shivaram Venkataraman commented on SPARK-5122: -- Yes, I think removing Shark should be fine. We can also get rid of the Spark-to-Shark version map in spark_ec2.py Remove Shark from spark-ec2 --- Key: SPARK-5122 URL: https://issues.apache.org/jira/browse/SPARK-5122 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor Since Shark has been replaced by Spark SQL, we don't need it in {{spark-ec2}} anymore. (?) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5124) Standardize internal RPC interface
Reynold Xin created SPARK-5124: -- Summary: Standardize internal RPC interface Key: SPARK-5124 URL: https://issues.apache.org/jira/browse/SPARK-5124 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu In Spark we use Akka as the RPC layer. It would be great if we can standardize the internal RPC interface to facilitate testing. This will also provide the foundation to try other RPC implementations in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
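As a discussion aid, a hedged sketch of what a minimal standardized interface might look like; every name here (RpcEnv, RpcEndpoint, RpcEndpointRef) is an illustrative assumption, not a committed design:
{code}
// An endpoint handles incoming messages without exposing Akka types.
trait RpcEndpoint {
  def receive: PartialFunction[Any, Unit]
}

// A reference to a (possibly remote) endpoint.
trait RpcEndpointRef {
  // Fire-and-forget send.
  def send(message: Any): Unit
}

// The environment wires names to endpoints; an Akka-backed implementation
// would adapt these calls onto actors, and tests could supply an in-process one.
trait RpcEnv {
  def setupEndpoint(name: String, endpoint: RpcEndpoint): RpcEndpointRef
}
{code}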
[jira] [Commented] (SPARK-5009) allCaseVersions function in SqlLexical leads to StackOverflow Exception
[ https://issues.apache.org/jira/browse/SPARK-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267317#comment-14267317 ] Apache Spark commented on SPARK-5009: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3926 allCaseVersions function in SqlLexical leads to StackOverflow Exception - Key: SPARK-5009 URL: https://issues.apache.org/jira/browse/SPARK-5009 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1, 1.2.0 Reporter: shengli Fix For: 1.3.0, 1.2.1 Original Estimate: 96h Remaining Estimate: 96h Recently I found a bug when adding a new feature in SqlParser: if I define a Keyword that has a long name, like ```protected val SERDEPROPERTIES = Keyword("SERDEPROPERTIES")```, then, since the all-case version is implemented by a recursive function, when the ```implicit asParser``` function is called and the stack memory is small, it leads to a StackOverflowError.
java.lang.StackOverflowError
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5121) Stored as parquet doesn't support the CTAS
[ https://issues.apache.org/jira/browse/SPARK-5121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiaoJing wang closed SPARK-5121. Resolution: Fixed Stored as parquet doesn't support the CTAS -- Key: SPARK-5121 URL: https://issues.apache.org/jira/browse/SPARK-5121 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Environment: hive-0.13.1 Reporter: XiaoJing wang Fix For: 1.2.0 Original Estimate: 4h Remaining Estimate: 4h In CTAS, stored as parquet is an unsupported Hive feature -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4948) Use pssh instead of bash-isms and remove unnecessary operations
[ https://issues.apache.org/jira/browse/SPARK-4948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-4948. - Resolution: Fixed Resolved by: https://github.com/mesos/spark-ec2/pull/86 Use pssh instead of bash-isms and remove unnecessary operations --- Key: SPARK-4948 URL: https://issues.apache.org/jira/browse/SPARK-4948 Project: Spark Issue Type: Sub-task Components: EC2 Affects Versions: 1.2.0 Reporter: Nicholas Chammas Priority: Minor Fix For: 1.3.0 Remove unnecessarily high sleep times in {{setup.sh}}, as well as unnecessary SSH calls to pre-approve keys. Replace bash-isms like {{while ... command ... wait}} with {{pssh}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4948) Use pssh instead of bash-isms and remove unnecessary operations
[ https://issues.apache.org/jira/browse/SPARK-4948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-4948: Target Version/s: 1.3.0 Use pssh instead of bash-isms and remove unnecessary operations --- Key: SPARK-4948 URL: https://issues.apache.org/jira/browse/SPARK-4948 Project: Spark Issue Type: Sub-task Components: EC2 Affects Versions: 1.2.0 Reporter: Nicholas Chammas Priority: Minor Fix For: 1.3.0 Remove unnecessarily high sleep times in {{setup.sh}}, as well as unnecessary SSH calls to pre-approve keys. Replace bash-isms like {{while ... command ... wait}} with {{pssh}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4948) Use pssh instead of bash-isms and remove unnecessary operations
[ https://issues.apache.org/jira/browse/SPARK-4948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267331#comment-14267331 ] Nicholas Chammas commented on SPARK-4948: - [~shivaram] Could you assign this issue to me please? Use pssh instead of bash-isms and remove unnecessary operations --- Key: SPARK-4948 URL: https://issues.apache.org/jira/browse/SPARK-4948 Project: Spark Issue Type: Sub-task Components: EC2 Affects Versions: 1.2.0 Reporter: Nicholas Chammas Priority: Minor Fix For: 1.3.0 Remove unnecessarily high sleep times in {{setup.sh}}, as well as unnecessary SSH calls to pre-approve keys. Replace bash-isms like {{while ... command ... wait}} with {{pssh}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org