[jira] [Commented] (SPARK-9285) Remove InternalRow's inheritance from Row

2015-07-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639456#comment-14639456
 ] 

Apache Spark commented on SPARK-9285:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7626

 Remove InternalRow's inheritance from Row
 -

 Key: SPARK-9285
 URL: https://issues.apache.org/jira/browse/SPARK-9285
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 It is a big change, but it lets us use the type information to prevent 
 accidentally passing internal types where external types are expected.
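
As a hypothetical illustration of the motivation (toy traits, not the actual Spark classes): once InternalRow no longer extends Row, handing an internal row to code that expects an external Row becomes a compile-time error rather than a silent mix-up.

{code}
// Hypothetical sketch only; not the real org.apache.spark.sql classes.
trait Row { def get(i: Int): Any }
trait InternalRow { def get(i: Int): Any }  // no longer "extends Row"

object Demo {
  def externalOnly(r: Row): Any = r.get(0)

  val internal = new InternalRow { def get(i: Int): Any = 1 }
  // externalOnly(internal)  // now a type error instead of an accidental mix-up
}
{code}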



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9285) Remove InternalRow's inheritance from Row

2015-07-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9285:
---

Assignee: Apache Spark  (was: Reynold Xin)

 Remove InternalRow's inheritance from Row
 -

 Key: SPARK-9285
 URL: https://issues.apache.org/jira/browse/SPARK-9285
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 It is a big change, but it lets us use the type information to prevent 
 accidentally passing internal types where external types are expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9285) Remove InternalRow's inheritance from Row

2015-07-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9285:
---

Assignee: Reynold Xin  (was: Apache Spark)

 Remove InternalRow's inheritance from Row
 -

 Key: SPARK-9285
 URL: https://issues.apache.org/jira/browse/SPARK-9285
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 It is a big change, but it lets us use the type information to prevent 
 accidentally passing internal types where external types are expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9287) Speedup unit test of Date expressions

2015-07-23 Thread Davies Liu (JIRA)
Davies Liu created SPARK-9287:
-

 Summary: Speedup unit test of Date expressions
 Key: SPARK-9287
 URL: https://issues.apache.org/jira/browse/SPARK-9287
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu


It tries hard to cover many corner cases, but that slows the unit tests down a lot 
(it takes about 30 seconds on my MacBook).

We could ignore most of them for now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9286) Methods in Unevaluable should be final

2015-07-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9286:
---

Assignee: Apache Spark  (was: Josh Rosen)

 Methods in Unevaluable should be final
 --

 Key: SPARK-9286
 URL: https://issues.apache.org/jira/browse/SPARK-9286
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Josh Rosen
Assignee: Apache Spark
Priority: Trivial

 The {{eval()}} and {{genCode()}} methods in SQL's {{Unevaluable}} trait 
 should be marked as {{final}} and we should fix any cases where they are 
 overridden.
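
A minimal sketch of the idea, using a toy Expression trait rather than the real Catalyst one: once the trait's methods are final, an accidental override no longer compiles.

{code}
// Simplified illustration; the real trait lives in
// org.apache.spark.sql.catalyst.expressions.
trait Expression {
  def eval(input: Any): Any
}

trait Unevaluable extends Expression {
  final override def eval(input: Any): Any =
    throw new UnsupportedOperationException(s"Cannot evaluate expression: $this")
}

// case class Bad() extends Unevaluable {
//   override def eval(input: Any): Any = 42  // error: cannot override final member
// }
case class Placeholder() extends Unevaluable  // fine: inherits the final eval
{code}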



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9284) Remove tests' dependency on the assembly

2015-07-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9284:
---

Assignee: Apache Spark

 Remove tests' dependency on the assembly
 

 Key: SPARK-9284
 URL: https://issues.apache.org/jira/browse/SPARK-9284
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Reporter: Marcelo Vanzin
Assignee: Apache Spark
Priority: Minor

 Some tests - in particular tests that have to spawn child processes - 
 currently rely on the generated Spark assembly to run properly.
 This is sub-optimal for a few reasons:
 - Users have to use an unnatural "package everything first, then run tests" 
 approach
 - Sometimes tests are run using old code because the user forgot to rebuild 
 the assembly
 The latter is particularly annoying in {{YarnClusterSuite}}. If you modify 
 some code outside of the {{yarn/}} module, you have to rebuild the whole 
 assembly before that test picks it up.
 We should make all tests run without the need to have an assembly around, 
 making sure that they always pick up the latest code compiled by the user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9284) Remove tests' dependency on the assembly

2015-07-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9284:
---

Assignee: (was: Apache Spark)

 Remove tests' dependency on the assembly
 

 Key: SPARK-9284
 URL: https://issues.apache.org/jira/browse/SPARK-9284
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Reporter: Marcelo Vanzin
Priority: Minor

 Some tests - in particular tests that have to spawn child processes - 
 currently rely on the generated Spark assembly to run properly.
 This is sub-optimal for a few reasons:
 - Users have to use an unnatural "package everything first, then run tests" 
 approach
 - Sometimes tests are run using old code because the user forgot to rebuild 
 the assembly
 The latter is particularly annoying in {{YarnClusterSuite}}. If you modify 
 some code outside of the {{yarn/}} module, you have to rebuild the whole 
 assembly before that test picks it up.
 We should make all tests run without the need to have an assembly around, 
 making sure that they always pick up the latest code compiled by the user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9284) Remove tests' dependency on the assembly

2015-07-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639671#comment-14639671
 ] 

Apache Spark commented on SPARK-9284:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7629

 Remove tests' dependency on the assembly
 

 Key: SPARK-9284
 URL: https://issues.apache.org/jira/browse/SPARK-9284
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Reporter: Marcelo Vanzin
Priority: Minor

 Some tests - in particular tests that have to spawn child processes - 
 currently rely on the generated Spark assembly to run properly.
 This is sub-optimal for a few reasons:
 - Users have to use an unnatural "package everything first, then run tests" 
 approach
 - Sometimes tests are run using old code because the user forgot to rebuild 
 the assembly
 The latter is particularly annoying in {{YarnClusterSuite}}. If you modify 
 some code outside of the {{yarn/}} module, you have to rebuild the whole 
 assembly before that test picks it up.
 We should make all tests run without the need to have an assembly around, 
 making sure that they always pick up the latest code compiled by the user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9290) DateExpressionsSuite is slow to run

2015-07-23 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-9290:
--

 Summary: DateExpressionsSuite is slow to run
 Key: SPARK-9290
 URL: https://issues.apache.org/jira/browse/SPARK-9290
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


We are running way too many test cases in here.

{code}
[info] - DayOfYear (16 seconds, 998 milliseconds)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9286) Methods in Unevaluable should be final

2015-07-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-9286.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7627
[https://github.com/apache/spark/pull/7627]

 Methods in Unevaluable should be final
 --

 Key: SPARK-9286
 URL: https://issues.apache.org/jira/browse/SPARK-9286
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Trivial
 Fix For: 1.5.0


 The {{eval()}} and {{genCode()}} methods in SQL's {{Unevaluable}} trait 
 should be marked as {{final}} and we should fix any cases where they are 
 overridden.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9286) Methods in Unevaluable should be final

2015-07-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639507#comment-14639507
 ] 

Apache Spark commented on SPARK-9286:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7627

 Methods in Unevaluable should be final
 --

 Key: SPARK-9286
 URL: https://issues.apache.org/jira/browse/SPARK-9286
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Trivial

 The {{eval()}} and {{genCode()}} methods in SQL's {{Unevaluable}} trait 
 should be marked as {{final}} and we should fix any cases where they are 
 overridden.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9286) Methods in Unevaluable should be final

2015-07-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9286:
---

Assignee: Josh Rosen  (was: Apache Spark)

 Methods in Unevaluable should be final
 --

 Key: SPARK-9286
 URL: https://issues.apache.org/jira/browse/SPARK-9286
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Trivial

 The {{eval()}} and {{genCode()}} methods in SQL's {{Unevaluable}} trait 
 should be marked as {{final}} and we should fix any cases where they are 
 overridden.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9269) Add Set to the matching type in ArrayConverter

2015-07-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639603#comment-14639603
 ] 

Apache Spark commented on SPARK-9269:
-

User 'alexliu68' has created a pull request for this issue:
https://github.com/apache/spark/pull/7628

 Add Set to the matching  type in ArrayConverter
 ---

 Key: SPARK-9269
 URL: https://issues.apache.org/jira/browse/SPARK-9269
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Alex Liu

 When the data is a Scala Set, the following error is thrown.
 {code}
 scala.MatchError: Set() (of class scala.collection.immutable.Set$EmptySet$)
   at 
 org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:136)
   at 
 org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$4.apply(CatalystTypeConverters.scala:187)
   at 
 org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$4.apply(CatalystTypeConverters.scala:187)
   at 
 org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:62)
   at 
 org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at 
 org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
   at 
 org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 {code}
 We need to add Set to the matching type.
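
As a hedged sketch of the requested change (illustrative only, not the actual CatalystTypeConverters code), the converter's collection match could gain a Set case that converts through a Seq:

{code}
// Illustrative converter standing in for the real CatalystTypeConverters logic.
object ToyConverter {
  def convert(a: Any): Any = a match {
    case s: Seq[_]    => s.map(convert)
    case s: Set[_]    => s.toSeq.map(convert)  // the missing case this issue asks for
    case m: Map[_, _] => m.map { case (k, v) => convert(k) -> convert(v) }
    case other        => other
  }
}
{code}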



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9269) Add Set to the matching type in ArrayConverter

2015-07-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9269:
---

Assignee: Apache Spark

 Add Set to the matching  type in ArrayConverter
 ---

 Key: SPARK-9269
 URL: https://issues.apache.org/jira/browse/SPARK-9269
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Alex Liu
Assignee: Apache Spark

 When the data is a Scala Set, the following error is thrown.
 {code}
 scala.MatchError: Set() (of class scala.collection.immutable.Set$EmptySet$)
   at 
 org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:136)
   at 
 org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$4.apply(CatalystTypeConverters.scala:187)
   at 
 org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$4.apply(CatalystTypeConverters.scala:187)
   at 
 org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:62)
   at 
 org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at 
 org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
   at 
 org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 {code}
 We need to add Set to the matching type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9269) Add Set to the matching type in ArrayConverter

2015-07-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9269:
---

Assignee: (was: Apache Spark)

 Add Set to the matching  type in ArrayConverter
 ---

 Key: SPARK-9269
 URL: https://issues.apache.org/jira/browse/SPARK-9269
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Alex Liu

 When the data is a Scala Set, the following error is thrown.
 {code}
 scala.MatchError: Set() (of class scala.collection.immutable.Set$EmptySet$)
   at 
 org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:136)
   at 
 org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$4.apply(CatalystTypeConverters.scala:187)
   at 
 org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$4.apply(CatalystTypeConverters.scala:187)
   at 
 org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:62)
   at 
 org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at 
 org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
   at 
 org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 {code}
 We need to add Set to the matching type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9285) Remove InternalRow's inheritance from Row

2015-07-23 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-9285:
--

 Summary: Remove InternalRow's inheritance from Row
 Key: SPARK-9285
 URL: https://issues.apache.org/jira/browse/SPARK-9285
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


It is a big change, but it lets us use the type information to prevent 
accidentally passing internal types where external types are expected.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9284) Remove tests' dependency on the assembly

2015-07-23 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-9284:
-

 Summary: Remove tests' dependency on the assembly
 Key: SPARK-9284
 URL: https://issues.apache.org/jira/browse/SPARK-9284
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Reporter: Marcelo Vanzin
Priority: Minor


Some tests - in particular tests that have to spawn child processes - currently 
rely on the generated Spark assembly to run properly.

This is sub-optimal for a few reasons:
- Users have to use an unnatural "package everything first, then run tests" 
approach
- Sometimes tests are run using old code because the user forgot to rebuild the 
assembly

The latter is particularly annoying in {{YarnClusterSuite}}. If you modify some 
code outside of the {{yarn/}} module, you have to rebuild the whole assembly 
before that test picks it up.

We should make all tests run without the need to have an assembly around, 
making sure that they always pick up the latest code compiled by the user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9284) Remove tests' dependency on the assembly

2015-07-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639448#comment-14639448
 ] 

Marcelo Vanzin commented on SPARK-9284:
---

BTW, I have a patch that does this; I'm currently running some more test 
iterations on it.

 Remove tests' dependency on the assembly
 

 Key: SPARK-9284
 URL: https://issues.apache.org/jira/browse/SPARK-9284
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Reporter: Marcelo Vanzin
Priority: Minor

 Some tests - in particular tests that have to spawn child processes - 
 currently rely on the generated Spark assembly to run properly.
 This is sub-optimal for a few reasons:
 - Users have to use an unnatural "package everything first, then run tests" 
 approach
 - Sometimes tests are run using old code because the user forgot to rebuild 
 the assembly
 The latter is particularly annoying in {{YarnClusterSuite}}. If you modify 
 some code outside of the {{yarn/}} module, you have to rebuild the whole 
 assembly before that test picks it up.
 We should make all tests run without the need to have an assembly around, 
 making sure that they always pick up the latest code compiled by the user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9288) Improve test speed

2015-07-23 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-9288:
--

 Summary: Improve test speed
 Key: SPARK-9288
 URL: https://issues.apache.org/jira/browse/SPARK-9288
 Project: Spark
  Issue Type: Umbrella
  Components: Build, Tests
Reporter: Reynold Xin


This is an umbrella ticket to track test cases that are slow to run.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9289) OrcPartitionDiscoverySuite is slow to run

2015-07-23 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-9289:
--

 Summary: OrcPartitionDiscoverySuite is slow to run
 Key: SPARK-9289
 URL: https://issues.apache.org/jira/browse/SPARK-9289
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


{code}
[info] - read partitioned table - normal case (18 seconds, 557 milliseconds)
[info] - read partitioned table - partition key included in orc file (5 
seconds, 160 milliseconds)
[info] - read partitioned table - with nulls (4 seconds, 69 milliseconds)
[info] - read partitioned table - with nulls and partition keys are included in 
Orc file (3 seconds, 218 milliseconds)
{code}

Does the unit test really need to run for 18 secs, 5 secs, 4 secs, and 3 secs?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9276) ThriftServer process can't stop if using command yarn application -kill appid

2015-07-23 Thread meiyoula (JIRA)
meiyoula created SPARK-9276:
---

 Summary: ThriftServer process can't stop if using command yarn 
application -kill appid
 Key: SPARK-9276
 URL: https://issues.apache.org/jira/browse/SPARK-9276
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: meiyoula



Reproduction steps:
1. Start the thrift server.
2. Use beeline to connect to the thrift server.
3. Use the command “yarn application -kill appid”, or the YARN web UI, to kill the 
thrift server's application.
4. The ApplicationMaster stops, but the driver process stays around.

Reproduction condition: there must be a client connected to the thrift server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5447) Replace reference to SchemaRDD with DataFrame

2015-07-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638645#comment-14638645
 ] 

Apache Spark commented on SPARK-5447:
-

User 'darroyocazorla' has created a pull request for this issue:
https://github.com/apache/spark/pull/7618

 Replace reference to SchemaRDD with DataFrame
 -

 Key: SPARK-5447
 URL: https://issues.apache.org/jira/browse/SPARK-5447
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.3.0


 We renamed SchemaRDD to DataFrame, but various internal code still 
 references SchemaRDD in Javadoc and comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length

2015-07-23 Thread Andrey Vykhodtsev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Vykhodtsev updated SPARK-9277:
-
Attachment: SparseVector test.ipynb
SparseVector test.html


Attached is the notebook with the scenario and the full message:

 SparseVector constructor must throw an error when declared number of elements 
 less than array length
 

 Key: SPARK-9277
 URL: https://issues.apache.org/jira/browse/SPARK-9277
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1
Reporter: Andrey Vykhodtsev
Priority: Minor
 Attachments: SparseVector test.html, SparseVector test.ipynb


 I found that one can create a SparseVector inconsistently, which leads to 
 a Java error at runtime, for example when training LogisticRegressionWithSGD.
 Here is the test case:
 In [2]:
 sc.version
 Out[2]:
 u'1.3.1'
 In [13]:
 from pyspark.mllib.linalg import SparseVector
 from pyspark.mllib.regression import LabeledPoint
 from pyspark.mllib.classification import LogisticRegressionWithSGD
 In [3]:
 x =  SparseVector(2, {1:1, 2:2, 3:3, 4:4, 5:5})
 In [10]:
 l = LabeledPoint(0, x)
 In [12]:
 r = sc.parallelize([l])
 In [14]:
 m = LogisticRegressionWithSGD.train(r)
 Error:
 Py4JJavaError: An error occurred while calling 
 o86.trainLogisticRegressionModelWithSGD.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 
 11.0 (TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2
 Attached is the notebook with the scenario and the full message



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9212) update Netty version to 4.0.29.Final for Netty Metrics

2015-07-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9212:
-
Assignee: Zhang, Liye
Priority: Trivial  (was: Major)

 update Netty version to 4.0.29.Final for Netty Metrics
 

 Key: SPARK-9212
 URL: https://issues.apache.org/jira/browse/SPARK-9212
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle, Spark Core
Reporter: Zhang, Liye
Assignee: Zhang, Liye
Priority: Trivial
 Fix For: 1.5.0


 In Netty version 4.0.29.Final, metrics for PooledByteBufAllocator are exposed 
 directly, so there is no need to get the memory data in a hacky way. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9278) DataFrameWriter.insertInto inserts incorrect data

2015-07-23 Thread Steve Lindemann (JIRA)
Steve Lindemann created SPARK-9278:
--

 Summary: DataFrameWriter.insertInto inserts incorrect data
 Key: SPARK-9278
 URL: https://issues.apache.org/jira/browse/SPARK-9278
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
 Environment: Linux, S3, Hive Metastore
Reporter: Steve Lindemann


After creating a partitioned Hive table (stored as Parquet) via the 
DataFrameWriter.createTable command, subsequent attempts to insert additional 
data into new partitions of this table result in inserting incorrect data rows. 
Reordering the columns in the data to be written seems to avoid this issue.
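
A hedged sketch of that workaround (the helper and table names are assumptions for the example): since insertInto matches columns by position, reorder the DataFrame's columns to the table's declared order before writing.

{code}
// Sketch only; assumes a SQLContext and an existing Hive table.
import org.apache.spark.sql.{DataFrame, SQLContext}

object InsertHelper {
  def insertInTableOrder(sqlContext: SQLContext, df: DataFrame, table: String): Unit = {
    val tableCols = sqlContext.table(table).columns   // the table's declared column order
    df.select(tableCols.head, tableCols.tail: _*)     // reorder by name to match
      .write
      .insertInto(table)
  }
}
{code}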



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length

2015-07-23 Thread Andrey Vykhodtsev (JIRA)
Andrey Vykhodtsev created SPARK-9277:


 Summary: SparseVector constructor must throw an error when 
declared number of elements less than array length
 Key: SPARK-9277
 URL: https://issues.apache.org/jira/browse/SPARK-9277
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.1
Reporter: Andrey Vykhodtsev
Priority: Minor


I found that one can create a SparseVector inconsistently, which leads to a 
Java error at runtime, for example when training LogisticRegressionWithSGD.

Here is the test case:


In [2]:
sc.version
Out[2]:
u'1.3.1'
In [13]:
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD
In [3]:
x =  SparseVector(2, {1:1, 2:2, 3:3, 4:4, 5:5})
In [10]:
l = LabeledPoint(0, x)
In [12]:
r = sc.parallelize([l])
In [14]:
m = LogisticRegressionWithSGD.train(r)

Error:


Py4JJavaError: An error occurred while calling 
o86.trainLogisticRegressionModelWithSGD.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in 
stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 11.0 
(TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2


Attached is the notebook with the scenario and the full message
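
As a hedged sketch of the kind of check being requested (a toy class, not the MLlib implementation), the constructor could validate the indices against the declared size so the mismatch fails immediately instead of surfacing later as an ArrayIndexOutOfBoundsException:

{code}
// Toy illustration of the requested validation; not the real MLlib SparseVector.
class CheckedSparseVector(val size: Int, val indices: Array[Int], val values: Array[Double]) {
  require(indices.length == values.length,
    s"indices (${indices.length}) and values (${values.length}) must have the same length")
  require(indices.length <= size,
    s"cannot store ${indices.length} non-zero entries in a vector of declared size $size")
  require(indices.forall(i => i >= 0 && i < size),
    s"all indices must lie in [0, $size)")
}
{code}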



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4024) Remember user preferences for metrics to show in the UI

2015-07-23 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-4024.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Resolved in https://github.com/apache/spark/pull/7399

 Remember user preferences for metrics to show in the UI
 ---

 Key: SPARK-4024
 URL: https://issues.apache.org/jira/browse/SPARK-4024
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Kay Ousterhout
Priority: Minor
 Fix For: 1.5.0


 We should remember the metrics a user has previously chosen to display for 
 each stage, so that the user doesn't need to reselect the interesting metrics each 
 time they open a stage detail page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9212) update Netty version to 4.0.29.Final for Netty Metrics

2015-07-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9212.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7562
[https://github.com/apache/spark/pull/7562]

 update Netty version to 4.0.29.Final for Netty Metrics
 

 Key: SPARK-9212
 URL: https://issues.apache.org/jira/browse/SPARK-9212
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle, Spark Core
Reporter: Zhang, Liye
 Fix For: 1.5.0


 In Netty version 4.0.29.Final, metrics for PooledByteBufAllocator are exposed 
 directly, so there is no need to get the memory data in a hacky way. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9276) ThriftServer process can't stop if using command yarn application -kill appid

2015-07-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638714#comment-14638714
 ] 

Sean Owen commented on SPARK-9276:
--

I might misunderstand this, but is that surprising? You used YARN to kill the 
AM, and the AM stopped. YARN can't kill other processes.

 ThriftServer process can't stop if using command yarn application -kill 
 appid
 ---

 Key: SPARK-9276
 URL: https://issues.apache.org/jira/browse/SPARK-9276
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: meiyoula

 Reproduction steps:
 1. Start the thrift server.
 2. Use beeline to connect to the thrift server.
 3. Use the command “yarn application -kill appid”, or the YARN web UI, to kill the 
 thrift server's application.
 4. The ApplicationMaster stops, but the driver process stays around.
 Reproduction condition: there must be a client connected to the thrift server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9249) local variable assigned but may not be used

2015-07-23 Thread Yu Ishikawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Ishikawa updated SPARK-9249:
---
Description: 
local variable assigned but may not be used

For example:

{noformat}
R/deserialize.R:105:3: warning: local variable ‘data’ assigned but may not be 
used
  data <- readBin(con, raw(), as.integer(dataLen), endian = "big")
  ^~~~
R/deserialize.R:109:3: warning: local variable ‘data’ assigned but may not be 
used
  data <- readBin(con, raw(), as.integer(dataLen), endian = "big")
  ^~~~
{noformat}

  was:local variable assigned but may not be used


 local variable assigned but may not be used
 ---

 Key: SPARK-9249
 URL: https://issues.apache.org/jira/browse/SPARK-9249
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Yu Ishikawa
Priority: Minor

 local variable assigned but may not be used
 For example:
 {noformat}
 R/deserialize.R:105:3: warning: local variable ‘data’ assigned but may not be 
 used
   data <- readBin(con, raw(), as.integer(dataLen), endian = "big")
   ^~~~
 R/deserialize.R:109:3: warning: local variable ‘data’ assigned but may not be 
 used
   data <- readBin(con, raw(), as.integer(dataLen), endian = "big")
   ^~~~
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8092) OneVsRest doesn't allow flexibility in label/ feature column renaming

2015-07-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-8092.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

 OneVsRest doesn't allow flexibility in label/ feature column renaming
 -

 Key: SPARK-8092
 URL: https://issues.apache.org/jira/browse/SPARK-8092
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: Ram Sriharsha
Assignee: Ram Sriharsha
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9249) local variable assigned but may not be used

2015-07-23 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639945#comment-14639945
 ] 

Yu Ishikawa edited comment on SPARK-9249 at 7/24/15 5:22 AM:
-

[~chanchal.spark] Yes. I think we should remove local variables which are not 
used, such as below.
https://github.com/apache/spark/blob/branch-1.4/R/pkg/R/deserialize.R#L104


was (Author: yuu.ishik...@gmail.com):
[~chanchal.spark] Yes. I think we should remove local variables which is not 
used, such as below.
https://github.com/apache/spark/blob/branch-1.4/R/pkg/R/deserialize.R#L104

 local variable assigned but may not be used
 ---

 Key: SPARK-9249
 URL: https://issues.apache.org/jira/browse/SPARK-9249
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Yu Ishikawa
Priority: Minor

 local variable assigned but may not be used
 For example:
 {noformat}
 R/deserialize.R:105:3: warning: local variable ‘data’ assigned but may not be 
 used
   data <- readBin(con, raw(), as.integer(dataLen), endian = "big")
   ^~~~
 R/deserialize.R:109:3: warning: local variable ‘data’ assigned but may not be 
 used
   data <- readBin(con, raw(), as.integer(dataLen), endian = "big")
   ^~~~
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7045) Word2Vec: avoid intermediate representation when creating model

2015-07-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7045:
-
Shepherd: Joseph K. Bradley
Target Version/s: 1.5.0

 Word2Vec: avoid intermediate representation when creating model
 ---

 Key: SPARK-7045
 URL: https://issues.apache.org/jira/browse/SPARK-7045
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar
Priority: Minor

 Word2VecModel now stores the word vectors as a single, flat array; Word2Vec 
 does as well.  However, when Word2Vec creates the model, it builds an 
 intermediate representation.  We should skip that intermediate representation.
 However, it would be nice to create a public constructor for Word2VecModel 
 which takes that intermediate representation (a Map from String words to 
 their Vectors), since it's a user-friendly representation.
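
To make the layout concrete, here is a hedged toy sketch (not the actual Word2VecModel code; names are made up) of how a flat array plus a word-to-index map could be built directly from the Map representation:

{code}
// Toy illustration of the flat layout described above.
object FlatWordVectors {
  def flatten(wordVectors: Map[String, Array[Float]],
              vectorSize: Int): (Map[String, Int], Array[Float]) = {
    val wordIndex = wordVectors.keys.zipWithIndex.toMap
    val flat = new Array[Float](wordVectors.size * vectorSize)
    for ((word, vec) <- wordVectors) {
      // copy each word's vector into its slot of the single flat array
      System.arraycopy(vec, 0, flat, wordIndex(word) * vectorSize, vectorSize)
    }
    (wordIndex, flat)
  }
}
{code}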



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7045) Word2Vec: avoid intermediate representation when creating model

2015-07-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7045:
-
Assignee: Manoj Kumar

 Word2Vec: avoid intermediate representation when creating model
 ---

 Key: SPARK-7045
 URL: https://issues.apache.org/jira/browse/SPARK-7045
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar
Priority: Minor

 Word2VecModel now stores the word vectors as a single, flat array; Word2Vec 
 does as well.  However, when Word2Vec creates the model, it builds an 
 intermediate representation.  We should skip that intermediate representation.
 However, it would be nice to create a public constructor for Word2VecModel 
 which takes that intermediate representation (a Map from String words to 
 their Vectors), since it's a user-friendly representation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9293) Analysis should detect when set operations are performed on tables with different numbers of columns

2015-07-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9293:
---

Assignee: Apache Spark  (was: Josh Rosen)

 Analysis should detect when set operations are performed on tables with 
 different numbers of columns
 

 Key: SPARK-9293
 URL: https://issues.apache.org/jira/browse/SPARK-9293
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Apache Spark

 Our SQL analyzer doesn't always enforce that set operations are only 
 performed on relations with the same number of columns.
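
A hedged example of the kind of query that should be rejected at analysis time (column names are made up for the illustration):

{code}
// Sketch; assumes an existing SQLContext. unionAll here combines relations with
// different numbers of columns, which the analyzer should flag up front.
import org.apache.spark.sql.SQLContext

object UnionDemo {
  def run(sqlContext: SQLContext): Unit = {
    import sqlContext.implicits._
    val twoCols   = Seq((1, "a"), (2, "b")).toDF("id", "name")
    val threeCols = Seq((3, "c", true)).toDF("id", "name", "flag")
    // Ideally this fails during analysis with a clear "different number of
    // columns" error instead of producing confusing behaviour later.
    twoCols.unionAll(threeCols).explain()
  }
}
{code}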



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9293) Analysis should detect when set operations are performed on tables with different numbers of columns

2015-07-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9293:
---

Assignee: Josh Rosen  (was: Apache Spark)

 Analysis should detect when set operations are performed on tables with 
 different numbers of columns
 

 Key: SPARK-9293
 URL: https://issues.apache.org/jira/browse/SPARK-9293
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen

 Our SQL analyzer doesn't always enforce that set operations are only 
 performed on relations with the same number of columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8564) Add the Python API for Kinesis

2015-07-23 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-8564:
---
Target Version/s: 1.5.0

 Add the Python API for Kinesis
 --

 Key: SPARK-8564
 URL: https://issues.apache.org/jira/browse/SPARK-8564
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Shixiong Zhu





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5373) literal in agg grouping expressions leads to incorrect result

2015-07-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639897#comment-14639897
 ] 

Apache Spark commented on SPARK-5373:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/7583

  literal in agg grouping expressions leads to incorrect result
 ---

 Key: SPARK-5373
 URL: https://issues.apache.org/jira/browse/SPARK-5373
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Fei Wang
 Fix For: 1.3.0


 select key, count( * ) from src group by key, 1 will get the wrong answer!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6548) stddev_pop and stddev_samp aggregate functions

2015-07-23 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639914#comment-14639914
 ] 

Yin Huai commented on SPARK-6548:
-

[~JihongMA] Will you have time to implement stddev based on our new aggregate 
function interface? {{AlgebraicAggregate}} is the abstract class to use and you 
can take a look at 
{{org.apache.spark.sql.catalyst.expressions.aggregate.Average}} as an example. 
Let me know if you have any questions. Thanks!

 stddev_pop and stddev_samp aggregate functions
 --

 Key: SPARK-6548
 URL: https://issues.apache.org/jira/browse/SPARK-6548
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: DataFrame, starter

 Add it to the list of aggregate functions:
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
 Also add it to 
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
 We can either add a Stddev Catalyst expression, or just compute it using 
 existing functions like here: 
 https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776
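
As a hedged sketch of the "compute it using existing functions" route (the column name is an assumption for the example), population standard deviation can already be expressed with the existing aggregates:

{code}
// Sketch only; stddev_pop(x) = sqrt(avg(x * x) - avg(x) * avg(x)).
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, sqrt}

object StddevSketch {
  def stddevPop(df: DataFrame, c: String): DataFrame = {
    val x = col(c)
    df.agg(sqrt(avg(x * x) - avg(x) * avg(x)).as("stddev_pop"))
  }
}
{code}

This formulation is simple but can lose precision for large values; a dedicated Catalyst expression could use a numerically stable one-pass algorithm instead.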



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9255) Timestamp handling incorrect for Spark 1.4.1 on Linux

2015-07-23 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639949#comment-14639949
 ] 

Paul Wu commented on SPARK-9255:


[~srowen] I don't think it is due to a version difference: the same code runs 
correctly on the 1.3.0 release on Red Hat Linux. This bug was introduced after 1.3.0.  

 Timestamp handling incorrect for Spark 1.4.1 on Linux
 -

 Key: SPARK-9255
 URL: https://issues.apache.org/jira/browse/SPARK-9255
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
 Environment: Redhat Linux, Java 8.0 and Spark 1.4.1 release.
Reporter: Paul Wu
 Attachments: timestamp_bug.zip


 This is a very strange case involving timestamps. I can run the program on 
 Windows using the dev pom.xml (1.4.1), or the 1.4.1 or 1.3.0 release downloaded from 
 Apache, without issues; but when I ran it with the Spark 1.4.1 release (either 
 downloaded from Apache or the version built with Scala 2.11) on Red Hat Linux, 
 it produced the following error (the code I used is after this stack trace):
 15/07/22 12:02:50  ERROR Executor 96: Exception in task 0.0 in stage 0.0 (TID 
 0)
 java.util.concurrent.ExecutionException: scala.tools.reflect.ToolBoxError: 
 reflective compilation has failed:
 value  is not a member of TimestampType.this.InternalType
 at 
 org.spark-project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
 at 
 org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
 at 
 org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
 at 
 org.spark-project.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
 at 
 org.spark-project.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
 at 
 org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
 at 
 org.spark-project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
 at 
 org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
 at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
 at 
 org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
 at 
 org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:105)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:102)
 at 
 org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:170)
 at 
 org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$9.apply(GeneratedAggregate.scala:261)
 at 
 org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$9.apply(GeneratedAggregate.scala:246)
 at 
 org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
 at 
 org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: scala.tools.reflect.ToolBoxError: reflective compilation has 
 failed:
 value  is not a member of TimestampType.this.InternalType
 at 
 scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.throwIfErrors(ToolBoxFactory.scala:316)
 at 
 scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.wrapInPackageAndCompile(ToolBoxFactory.scala:198)
 at 
 scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$ToolBoxGlobal.compile(ToolBoxFactory.scala:252)
 at 
 scala.tools.reflect.ToolBoxFactory$ToolBoxImpl$$anonfun$compile$2.apply(ToolBoxFactory.scala:429)
 at 
 

[jira] [Commented] (SPARK-9249) local variable assigned but may not be used

2015-07-23 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639945#comment-14639945
 ] 

Yu Ishikawa commented on SPARK-9249:


[~chanchal.spark] Yes. I think we should remove local variables which are not 
used, such as below.
https://github.com/apache/spark/blob/branch-1.4/R/pkg/R/deserialize.R#L104

 local variable assigned but may not be used
 ---

 Key: SPARK-9249
 URL: https://issues.apache.org/jira/browse/SPARK-9249
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Yu Ishikawa
Priority: Minor

 local variable assigned but may not be used
 For example:
 {noformat}
 R/deserialize.R:105:3: warning: local variable ‘data’ assigned but may not be 
 used
   data <- readBin(con, raw(), as.integer(dataLen), endian = "big")
   ^~~~
 R/deserialize.R:109:3: warning: local variable ‘data’ assigned but may not be 
 used
   data <- readBin(con, raw(), as.integer(dataLen), endian = "big")
   ^~~~
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9002) KryoSerializer initialization does not include 'Array[Int]'

2015-07-23 Thread Randy Kerber (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639940#comment-14639940
 ] 

Randy Kerber commented on SPARK-9002:
-

That's what I was thinking when I created the issue.  Looking at the KryoSerializer 
object's toRegister method, it looked like the Array[Int] class was just 
inadvertently missed, so I thought this might be a perfect simple fix for a 
first contribution.

But then as I continued my attempt to transition from the Java serializer to Kryo, 
I kept hitting new "Class is not registered" errors, one after another.  First 
Array[String].  Then Array[Map.empty], then Array[Seq.empty], then 
Array[TreeMap], plus several other flavors of empty collection classes, 
Array[Tuple3], DataFrame, Row, even Array[GenericRowWithSchema].  Like 
swatting cockroaches -- no end to it.  Started to wonder if this was a futile 
process.

I could use some guidance here as to how it makes sense to proceed.  Is it 
worthwhile to add the 15-20 classes I've found so far, knowing there will 
almost certainly be more?  Or drop it, because this route cannot possibly be a 
complete fix, and/or a more comprehensive solution for Kryo is already in the 
works?

 KryoSerializer initialization does not include 'Array[Int]'
 ---

 Key: SPARK-9002
 URL: https://issues.apache.org/jira/browse/SPARK-9002
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
 Environment: MacBook Pro, OS X 10.10.4, Spark 1.4.0, master=local[*], 
 IntelliJ IDEA.
Reporter: Randy Kerber
Priority: Minor
  Labels: easyfix, newbie
   Original Estimate: 1h
  Remaining Estimate: 1h

 The object KryoSerializer (inside KryoRegistrator.scala) contains a list of 
 classes that are automatically registered with Kryo.  That list includes:
 Array\[Byte], Array\[Long], and Array\[Short].  Array\[Int] is missing from 
 that list.  Can't think of any good reason it shouldn't also be included.
 Note: This is first time creating an issue or contributing code to an apache 
 project. Apologies if I'm not following the process correct. Appreciate any 
 guidance or assistance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9302) collect()/head() failed with JSON of some format

2015-07-23 Thread Sun Rui (JIRA)
Sun Rui created SPARK-9302:
--

 Summary: collect()/head() failed with JSON of some format
 Key: SPARK-9302
 URL: https://issues.apache.org/jira/browse/SPARK-9302
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.4.1, 1.4.0
Reporter: Sun Rui


Reported in the mailing list by Exie tfind...@prodevelop.com.au:
{noformat}
A sample record in raw JSON looks like this:
{"version": 1, "event": "view", "timestamp": 1427846422377, "system":
"DCDS", "asset": "6404476", "assetType": "myType", "assetCategory":
"myCategory", "extras": [{"name": "videoSource", "value": "mySource"}, {"name":
"playerType", "value": "Article"}, {"name": "duration", "value":
"202088"}], "trackingId": "155629a0-d802-11e4-13ee-6884e43d6000", "ipAddress":
"165.69.2.4", "title": "myTitle"}

> head(mydf)
Error in as.data.frame.default(x[[i]], optional = TRUE) : 
  cannot coerce class jobj to a data.frame

> show(mydf)
DataFrame[localEventDtTm:timestamp, asset:string, assetCategory:string, 
assetType:string, event:string, 
extras:array<struct<name:string,value:string>>, ipAddress:string, 
memberId:string, system:string, timestamp:bigint, title:string, 
trackingId:string, version:bigint]

{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9249) local variable assigned but may not be used

2015-07-23 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639955#comment-14639955
 ] 

Yu Ishikawa commented on SPARK-9249:


I'm working on this issue.

 local variable assigned but may not be used
 ---

 Key: SPARK-9249
 URL: https://issues.apache.org/jira/browse/SPARK-9249
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Yu Ishikawa
Priority: Minor

 local variable assigned but may not be used
 For example:
 {noformat}
 R/deserialize.R:105:3: warning: local variable ‘data’ assigned but may not be 
 used
   data <- readBin(con, raw(), as.integer(dataLen), endian = "big")
   ^~~~
 R/deserialize.R:109:3: warning: local variable ‘data’ assigned but may not be 
 used
   data <- readBin(con, raw(), as.integer(dataLen), endian = "big")
   ^~~~
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9281) Parse literals as decimal in SQL

2015-07-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-9281:
-

Assignee: Davies Liu

 Parse literals as decimal in SQL
 

 Key: SPARK-9281
 URL: https://issues.apache.org/jira/browse/SPARK-9281
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu

 Right now, we parse all floating-point literals in SQL as double. When such a 
 literal is used in an expression together with DecimalType, it will turn the 
 decimal into double as well.
 Also, using double will lose some precision.
 It's better to parse these literals as decimal (we then know exactly what the 
 precision and scale are), and it still works well with double.
 BTW, this is a breaking change.
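
As a small illustration of the precision point above (plain Scala, not part of 
the proposal itself):
{code}
// Double literals are binary floating point, so simple decimal arithmetic drifts:
val asDouble = 0.1 + 0.2                               // 0.30000000000000004
// Parsing the same literals as decimals keeps the exact value and a known scale:
val asDecimal = BigDecimal("0.1") + BigDecimal("0.2")  // 0.3
{code}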



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9291) Conversion is applied twice on partitioned data sources

2015-07-23 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-9291:
--

 Summary: Conversion is applied twice on partitioned data sources
 Key: SPARK-9291
 URL: https://issues.apache.org/jira/browse/SPARK-9291
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker


We currently apply conversion twice: once in DataSourceStrategy (search for 
toCatalystRDD), and another in HadoopFsRelation.buildScan (search for 
rowToRowRdd).




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8906) Move all internal data source related classes out of sources package

2015-07-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8906.

   Resolution: Fixed
Fix Version/s: 1.5.0

 Move all internal data source related classes out of sources package
 

 Key: SPARK-8906
 URL: https://issues.apache.org/jira/browse/SPARK-8906
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0


 Move all of them into execution package for better private visibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9207) Turn on Parquet filter push-down by default

2015-07-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9207.

   Resolution: Fixed
Fix Version/s: 1.5.0

 Turn on Parquet filter push-down by default
 ---

 Key: SPARK-9207
 URL: https://issues.apache.org/jira/browse/SPARK-9207
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Critical
 Fix For: 1.5.0


 We turned off Parquet filter push-down by default in Spark 1.4.0 and prior 
 versions because of some Parquet side bugs in Parquet 1.6.0rc3. Now we've 
 upgraded to 1.7.0, which fixed all those bugs. Should turn on Parquet filter 
 push-down by default now.
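
Until the default flips, the setting can be enabled per session; a sketch 
assuming the existing configuration key:
{code}
// Enable Parquet filter push-down explicitly (off by default in Spark <= 1.4).
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
{code}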



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7446) Inverse transform for StringIndexer

2015-07-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7446:
-
Shepherd: Joseph K. Bradley

 Inverse transform for StringIndexer
 ---

 Key: SPARK-7446
 URL: https://issues.apache.org/jira/browse/SPARK-7446
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: holdenk
Priority: Minor

 It is useful to convert the encoded indices back to their string 
 representation for result inspection. We can add a parameter to 
 StringIndexer/StringIndexModel for this.
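
A rough sketch of the inverse mapping itself, independent of whatever parameter 
ends up on StringIndexer/StringIndexerModel (the label array below is 
hypothetical):
{code}
// StringIndexer assigns indices by label frequency, so inverting only needs
// the fitted label array: index i maps back to labels(i).
val labels = Array("us", "uk", "de")              // hypothetical fitted labels
val indexToLabel: Double => String = idx => labels(idx.toInt)
assert(indexToLabel(1.0) == "uk")
{code}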



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9294) cleanup comments, code style, naming typo for the new aggregation

2015-07-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639830#comment-14639830
 ] 

Apache Spark commented on SPARK-9294:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/7619

 cleanup comments, code style, naming typo for the new aggregation
 -

 Key: SPARK-9294
 URL: https://issues.apache.org/jira/browse/SPARK-9294
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Wenchen Fan
Priority: Trivial





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9299) percentile and percentile_approx aggregate functions

2015-07-23 Thread Yin Huai (JIRA)
Yin Huai created SPARK-9299:
---

 Summary: percentile and percentile_approx aggregate functions
 Key: SPARK-9299
 URL: https://issues.apache.org/jira/browse/SPARK-9299
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai


A short introduction on how to build aggregate functions based on our new 
interface can be found at 
https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9296) variance, var_pop, and var_samp aggregate functions

2015-07-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9296:

Description: A short introduction on how to build aggregate functions based 
on our new interface can be found at 
https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.

 variance, var_pop, and var_samp aggregate functions
 ---

 Key: SPARK-9296
 URL: https://issues.apache.org/jira/browse/SPARK-9296
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai

 A short introduction on how to build aggregate functions based on our new 
 interface can be found at 
 https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9298) corr aggregate functions

2015-07-23 Thread Yin Huai (JIRA)
Yin Huai created SPARK-9298:
---

 Summary: corr aggregate functions
 Key: SPARK-9298
 URL: https://issues.apache.org/jira/browse/SPARK-9298
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai


A short introduction on how to build aggregate functions based on our new 
interface can be found at 
https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9297) covar_pop and covar_samp aggregate functions

2015-07-23 Thread Yin Huai (JIRA)
Yin Huai created SPARK-9297:
---

 Summary: covar_pop and covar_samp aggregate functions
 Key: SPARK-9297
 URL: https://issues.apache.org/jira/browse/SPARK-9297
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai


A short introduction on how to build aggregate functions based on our new 
interface can be found at 
https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9292) Analysis should check that join conditions' data types are booleans

2015-07-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9292:
---

Assignee: Josh Rosen  (was: Apache Spark)

 Analysis should check that join conditions' data types are booleans
 ---

 Key: SPARK-9292
 URL: https://issues.apache.org/jira/browse/SPARK-9292
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen

 The following data frame query should fail analysis but instead fails at 
 runtime:
 {code}
 val df = Seq((1, 1)).toDF("a", "b")
 df.join(df, df.col("a"))
 {code}
 This should fail with an AnalysisException because the column A is not a 
 boolean and thus cannot be used as a join condition.
 This can be fixed by adding a new analysis rule which checks that the join 
 condition has BooleanType.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9122) spark.mllib regression should support batch predict

2015-07-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-9122.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7614
[https://github.com/apache/spark/pull/7614]

 spark.mllib regression should support batch predict
 ---

 Key: SPARK-9122
 URL: https://issues.apache.org/jira/browse/SPARK-9122
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang
  Labels: starter
 Fix For: 1.5.0

   Original Estimate: 72h
  Remaining Estimate: 72h

 Currently, in spark.mllib, generalized linear regression models like 
 LinearRegressionModel, RidgeRegressionModel and LassoModel support predict() 
 via: LinearRegressionModelBase.predict, which only takes single rows (feature 
 vectors).
 It should support batch prediction, taking an RDD.  (See other classes which 
 do this already such as NaiveBayesModel.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9222) Make class instantiation variables in DistributedLDAModel [private] clustering

2015-07-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9222:
-
Target Version/s: 1.5.0

 Make class instantiation variables in DistributedLDAModel [private] clustering
 --

 Key: SPARK-9222
 URL: https://issues.apache.org/jira/browse/SPARK-9222
 Project: Spark
  Issue Type: Test
  Components: MLlib
Reporter: Manoj Kumar
Assignee: Manoj Kumar
Priority: Minor

 This would enable testing the various class variables like docConcentration, 
 topicConcentration, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9296) variance, var_pop, and var_samp aggregate functions

2015-07-23 Thread Yin Huai (JIRA)
Yin Huai created SPARK-9296:
---

 Summary: variance, var_pop, and var_samp aggregate functions
 Key: SPARK-9296
 URL: https://issues.apache.org/jira/browse/SPARK-9296
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9300) histogram_numeric aggregate function

2015-07-23 Thread Yin Huai (JIRA)
Yin Huai created SPARK-9300:
---

 Summary: histogram_numeric aggregate function
 Key: SPARK-9300
 URL: https://issues.apache.org/jira/browse/SPARK-9300
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai


A short introduction on how to build aggregate functions based on our new 
interface can be found at 
https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7650) Move streaming css and js files to the streaming project

2015-07-23 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-7650.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

 Move streaming css and js files to the streaming project
 

 Key: SPARK-7650
 URL: https://issues.apache.org/jira/browse/SPARK-7650
 Project: Spark
  Issue Type: Improvement
  Components: Streaming, Web UI
Reporter: Shixiong Zhu
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9301) collect_set and collect_list aggregate functions

2015-07-23 Thread Yin Huai (JIRA)
Yin Huai created SPARK-9301:
---

 Summary: collect_set and collect_list aggregate functions
 Key: SPARK-9301
 URL: https://issues.apache.org/jira/browse/SPARK-9301
 Project: Spark
  Issue Type: Sub-task
Reporter: Yin Huai


A short introduction on how to build aggregate functions based on our new 
interface can be found at 
https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9301) collect_set and collect_list aggregate functions

2015-07-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9301:

Target Version/s: 1.5.0

 collect_set and collect_list aggregate functions
 

 Key: SPARK-9301
 URL: https://issues.apache.org/jira/browse/SPARK-9301
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai

 A short introduction on how to build aggregate functions based on our new 
 interface can be found at 
 https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9292) Analysis should check that join conditions' data types are booleans

2015-07-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9292:
---

Assignee: Apache Spark  (was: Josh Rosen)

 Analysis should check that join conditions' data types are booleans
 ---

 Key: SPARK-9292
 URL: https://issues.apache.org/jira/browse/SPARK-9292
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Apache Spark

 The following data frame query should fail analysis but instead fails at 
 runtime:
 {code}
 val df = Seq((1, 1)).toDF("a", "b")
 df.join(df, df.col("a"))
 {code}
 This should fail with an AnalysisException because the column A is not a 
 boolean and thus cannot be used as a join condition.
 This can be fixed by adding a new analysis rule which checks that the join 
 condition has BooleanType.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9292) Analysis should check that join conditions' data types are booleans

2015-07-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639756#comment-14639756
 ] 

Apache Spark commented on SPARK-9292:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7630

 Analysis should check that join conditions' data types are booleans
 ---

 Key: SPARK-9292
 URL: https://issues.apache.org/jira/browse/SPARK-9292
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen

 The following data frame query should fail analysis but instead fails at 
 runtime:
 {code}
 val df = Seq((1, 1)).toDF("a", "b")
 df.join(df, df.col("a"))
 {code}
 This should fail with an AnalysisException because the column A is not a 
 boolean and thus cannot be used as a join condition.
 This can be fixed by adding a new analysis rule which checks that the join 
 condition has BooleanType.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9295) Analysis should detect sorting on unsupported column types

2015-07-23 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-9295:
-

 Summary: Analysis should detect sorting on unsupported column types
 Key: SPARK-9295
 URL: https://issues.apache.org/jira/browse/SPARK-9295
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen


The SQL analyzer should report errors for queries that try to sort on columns 
of unsupported types, such as ArrayType.
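
A hypothetical repro, assuming a spark-shell session with sqlContext.implicits._ 
in scope (column names are illustrative):
{code}
// Sorting on an array column, which has no defined ordering. Today this
// surfaces as a runtime failure; analysis should reject it up front.
val df = Seq((Seq(1, 2, 3), "x"), (Seq(4, 5), "y")).toDF("arr", "name")
df.orderBy("arr").collect()
{code}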



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9292) Analysis should check that join conditions' data types are booleans

2015-07-23 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-9292:
-

 Summary: Analysis should check that join conditions' data types 
are booleans
 Key: SPARK-9292
 URL: https://issues.apache.org/jira/browse/SPARK-9292
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen


The following data frame query should fail analysis but instead fails at 
runtime:

{code}
val df = Seq((1, 1)).toDF("a", "b")
df.join(df, df.col("a"))
{code}

This should fail with an AnalysisException because the column A is not a 
boolean and thus cannot be used as a join condition.

This can be fixed by adding a new analysis rule which checks that the join 
condition has BooleanType.
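
For contrast, a condition that analysis should accept is a boolean expression 
over the columns (using the df from the snippet above):
{code}
// A valid join condition evaluates to BooleanType.
df.join(df, df.col("a") === df.col("b"))
{code}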



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9293) Analysis should detect when set operations are performed on tables with different numbers of columns

2015-07-23 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-9293:
-

 Summary: Analysis should detect when set operations are performed 
on tables with different numbers of columns
 Key: SPARK-9293
 URL: https://issues.apache.org/jira/browse/SPARK-9293
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen


Our SQL analyzer doesn't always enforce that set operations are only performed 
on relations with the same number of columns.
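
A hypothetical repro, assuming sqlContext.implicits._ in scope:
{code}
// UNION of relations with different numbers of columns; analysis should
// reject this instead of letting it fail (or silently misbehave) later.
val left  = Seq((1, "a"), (2, "b")).toDF("id", "name")   // two columns
val right = Seq(Tuple1(3), Tuple1(4)).toDF("id")         // one column
left.unionAll(right)
{code}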



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9294) cleanup comments, code style, naming typo for the new aggregation

2015-07-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9294:
---

Assignee: (was: Apache Spark)

 cleanup comments, code style, naming typo for the new aggregation
 -

 Key: SPARK-9294
 URL: https://issues.apache.org/jira/browse/SPARK-9294
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Wenchen Fan
Priority: Trivial





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9294) cleanup comments, code style, naming typo for the new aggregation

2015-07-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9294:
---

Assignee: Apache Spark

 cleanup comments, code style, naming typo for the new aggregation
 -

 Key: SPARK-9294
 URL: https://issues.apache.org/jira/browse/SPARK-9294
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Wenchen Fan
Assignee: Apache Spark
Priority: Trivial





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9294) cleanup comments, code style, naming typo for the new aggregation

2015-07-23 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-9294:
--

 Summary: cleanup comments, code style, naming typo for the new 
aggregation
 Key: SPARK-9294
 URL: https://issues.apache.org/jira/browse/SPARK-9294
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Wenchen Fan
Priority: Trivial






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9293) Analysis should detect when set operations are performed on tables with different numbers of columns

2015-07-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639837#comment-14639837
 ] 

Apache Spark commented on SPARK-9293:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7631

 Analysis should detect when set operations are performed on tables with 
 different numbers of columns
 

 Key: SPARK-9293
 URL: https://issues.apache.org/jira/browse/SPARK-9293
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen

 Our SQL analyzer doesn't always enforce that set operations are only 
 performed on relations with the same number of columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9216) Define KinesisBackedBlockRDDs

2015-07-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-9216.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

 Define KinesisBackedBlockRDDs
 -

 Key: SPARK-9216
 URL: https://issues.apache.org/jira/browse/SPARK-9216
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
 Fix For: 1.5.0


 https://docs.google.com/document/d/1k0dl270EnK7uExrsCE7jYw7PYx0YC935uBcxn3p0f58/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6548) Adding stddev to DataFrame functions

2015-07-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6548:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-4366

 Adding stddev to DataFrame functions
 

 Key: SPARK-6548
 URL: https://issues.apache.org/jira/browse/SPARK-6548
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: DataFrame, starter

 Add it to the list of aggregate functions:
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
 Also add it to 
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
 We can either add a Stddev Catalyst expression, or just compute it using 
 existing functions like here: 
 https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776
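
As a sketch of the "compute it using existing functions" option (population 
variant; the DataFrame df and column name x are hypothetical):
{code}
import org.apache.spark.sql.functions._

// Population standard deviation from existing aggregates: sqrt(E[x^2] - E[x]^2).
val stddevPop = sqrt(avg(col("x") * col("x")) - avg(col("x")) * avg(col("x")))
df.agg(stddevPop.as("stddev_pop_x"))
{code}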



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4366) Aggregation Improvement

2015-07-23 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639921#comment-14639921
 ] 

Yin Huai commented on SPARK-4366:
-

Here is a brief instruction on how to implement a built-in aggregate function 
that supports code-gen. 

For our new aggregate function interface, {{AlgebraicAggregate}} is the 
abstract class used for all built-in aggregate functions that support code-gen. 
Functions based on {{AlgebraicAggregate}} use our existing expressions to 
implement operations like initializing aggregation buffer values, updating the 
buffer, merging two buffers, and evaluating results. A good example is 
{{org.apache.spark.sql.catalyst.expressions.aggregate.Average}}. Since all 
operations of an {{AlgebraicAggregate}} are built on top of our expression 
system, the developer does not need to do anything special to support code-gen; 
it just works out of the box. For those built-in functions that are hard to 
express with our expressions, {{AggregateFunction2}} is the abstract class 
to use. 

For descriptions of aggregate functions, here are some references:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)
https://prestodb.io/docs/current/functions/aggregate.html
https://msdn.microsoft.com/en-us/library/ms173454.aspx
http://www.postgresql.org/docs/devel/static/functions-aggregate.html
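
To make the shape of such a function concrete without depending on Catalyst 
internals, here is a plain-Scala sketch of the four pieces involved (names and 
types are illustrative, not the actual interface):
{code}
// Illustrative only: an aggregate is fully described by how it initializes,
// updates, merges, and evaluates its buffer (here, a sum/count pair for avg).
case class AvgBuffer(sum: Double, count: Long)

object AverageSketch {
  val initial: AvgBuffer = AvgBuffer(0.0, 0L)                 // initial values
  def update(b: AvgBuffer, x: Double): AvgBuffer =            // update step
    AvgBuffer(b.sum + x, b.count + 1)
  def merge(a: AvgBuffer, b: AvgBuffer): AvgBuffer =          // merge partials
    AvgBuffer(a.sum + b.sum, a.count + b.count)
  def evaluate(b: AvgBuffer): Option[Double] =                // final result
    if (b.count == 0L) None else Some(b.sum / b.count)
}
{code}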


 Aggregation Improvement
 ---

 Key: SPARK-4366
 URL: https://issues.apache.org/jira/browse/SPARK-4366
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Critical
 Attachments: aggregatefunction_v1.pdf


 This improvement actually includes couple of sub tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6548) stddev_pop and stddev_samp aggregate functions

2015-07-23 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639923#comment-14639923
 ] 

Yin Huai commented on SPARK-6548:
-

A short introduction on how to build aggregate functions based on our new 
interface can be found at 
https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.

 stddev_pop and stddev_samp aggregate functions
 --

 Key: SPARK-6548
 URL: https://issues.apache.org/jira/browse/SPARK-6548
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: DataFrame, starter

 Add it to the list of aggregate functions:
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
 Also add it to 
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
 We can either add a Stddev Catalyst expression, or just compute it using 
 existing functions like here: 
 https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9271) Concurrency bug triggered by partition predicate push-down

2015-07-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-9271:
--
Description: 
SPARK-6910 and [PR #7492|https://github.com/apache/spark/pull/7492] introduced 
partition predicate push-down. However, it seems that it triggers one or more 
existing concurrency bug(s) (see [this GitHub 
comment|https://github.com/apache/spark/pull/7421#issuecomment-122527391] for 
details), and has been causing random Jenkins build failures. This issue needs 
further investigation and must be fixed for 1.5.0.

Observed test failures possibly related to this issue:

- {{org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.partcols1}}
- 
{{org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.auto_sortmerge_join_16}}

  was:SPARK-6910 and [PR #7492|https://github.com/apache/spark/pull/7492] 
introduced partition predicate push-down. However, it seems that it triggers 
one or more existing concurrency bug(s) (see [this GitHub 
comment|https://github.com/apache/spark/pull/7421#issuecomment-122527391] for 
details), and has been causing random Jenkins build failures. This issue needs 
further investigation and must be fixed for 1.5.0.


 Concurrency bug triggered by partition predicate push-down
 --

 Key: SPARK-9271
 URL: https://issues.apache.org/jira/browse/SPARK-9271
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Priority: Blocker

 SPARK-6910 and [PR #7492|https://github.com/apache/spark/pull/7492] 
 introduced partition predicate push-down. However, it seems that it triggers 
 one or more existing concurrency bug(s) (see [this GitHub 
 comment|https://github.com/apache/spark/pull/7421#issuecomment-122527391] for 
 details), and has been causing random Jenkins build failures. This issue needs 
 further investigation and must be fixed for 1.5.0.
 Observed test failures possibly related to this issue:
 - {{org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.partcols1}}
 - 
 {{org.apache.spark.sql.hive.execution.HiveCompatibilitySuite.auto_sortmerge_join_16}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6548) stddev_pop and stddev_samp aggregate functions

2015-07-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6548:

Summary: stddev_pop and stddev_samp aggregate functions  (was: Adding 
stddev to DataFrame functions)

 stddev_pop and stddev_samp aggregate functions
 --

 Key: SPARK-6548
 URL: https://issues.apache.org/jira/browse/SPARK-6548
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: DataFrame, starter

 Add it to the list of aggregate functions:
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
 Also add it to 
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
 We can either add a Stddev Catalyst expression, or just compute it using 
 existing functions like here: 
 https://github.com/apache/spark/commit/5bbcd1304cfebba31ec6857a80d3825a40d02e83#diff-c3d0394b2fc08fb2842ff0362a5ac6c9R776



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9295) Analysis should detect sorting on unsupported column types

2015-07-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9295:
---

Assignee: Apache Spark  (was: Josh Rosen)

 Analysis should detect sorting on unsupported column types
 --

 Key: SPARK-9295
 URL: https://issues.apache.org/jira/browse/SPARK-9295
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Apache Spark

 The SQL analyzer should report errors for queries that try to sort on columns 
 of unsupported types, such as ArrayType.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9295) Analysis should detect sorting on unsupported column types

2015-07-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9295:
---

Assignee: Josh Rosen  (was: Apache Spark)

 Analysis should detect sorting on unsupported column types
 --

 Key: SPARK-9295
 URL: https://issues.apache.org/jira/browse/SPARK-9295
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen

 The SQL analyzer should report errors for queries that try to sort on columns 
 of unsupported types, such as ArrayType.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9295) Analysis should detect sorting on unsupported column types

2015-07-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639912#comment-14639912
 ] 

Apache Spark commented on SPARK-9295:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7633

 Analysis should detect sorting on unsupported column types
 --

 Key: SPARK-9295
 URL: https://issues.apache.org/jira/browse/SPARK-9295
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen

 The SQL analyzer should report errors for queries that try to sort on columns 
 of unsupported types, such as ArrayType.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4366) Aggregation Improvement

2015-07-23 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639400#comment-14639400
 ] 

Herman van Hovell commented on SPARK-4366:
--

What is going to happen to the old Aggregate function code path? Will this 
still be in 1.5? Or will it be removed?

 Aggregation Improvement
 ---

 Key: SPARK-4366
 URL: https://issues.apache.org/jira/browse/SPARK-4366
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Critical
 Attachments: aggregatefunction_v1.pdf


 This improvement actually includes couple of sub tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4366) Aggregation Improvement

2015-07-23 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639403#comment-14639403
 ] 

Yin Huai commented on SPARK-4366:
-

It will probably still be in 1.5 (right now, we still have a few cases that 
need to fall back to the old path, for example when you have multiple distinct 
columns). But, by default, we will use the new code path. In 1.6, the old path 
will be removed.

 Aggregation Improvement
 ---

 Key: SPARK-4366
 URL: https://issues.apache.org/jira/browse/SPARK-4366
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Critical
 Attachments: aggregatefunction_v1.pdf


 This improvement actually includes couple of sub tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8359) Spark SQL Decimal type precision loss on multiplication

2015-07-23 Thread Sudhakar Thota (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639472#comment-14639472
 ] 

Sudhakar Thota commented on SPARK-8359:
---

This is not working and breaks at 2^112 with Spark 1.4.1, but it works with the 
git version, with which I went up to 2^1020.
Just an FYI.

 Spark SQL Decimal type precision loss on multiplication
 ---

 Key: SPARK-8359
 URL: https://issues.apache.org/jira/browse/SPARK-8359
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.5.0
Reporter: Rene Treffer

 It looks like the precision of decimal can not be raised beyond ~2^112 
 without causing full value truncation.
 The following code computes the power of two up to a specific point
 {code}
 import org.apache.spark.sql.types.Decimal
 val one = Decimal(1)
 val two = Decimal(2)
 def pow(n : Int) :  Decimal = if (n <= 0) { one } else { 
   val a = pow(n - 1)
   a.changePrecision(n,0)
   two.changePrecision(n,0)
   a * two
 }
 (109 to 120).foreach(n => 
 println(pow(n).toJavaBigDecimal.unscaledValue.toString))
 649037107316853453566312041152512
 1298074214633706907132624082305024
 2596148429267413814265248164610048
 5192296858534827628530496329220096
 1038459371706965525706099265844019
 2076918743413931051412198531688038
 4153837486827862102824397063376076
 8307674973655724205648794126752152
 1661534994731144841129758825350430
 3323069989462289682259517650700860
 6646139978924579364519035301401720
 1329227995784915872903807060280344
 {code}
 Beyond ~2^112 the value is truncated even though the precision was set to n 
 and should thus handle 10^n without problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1744) Document how to pass in preferredNodeLocationData

2015-07-23 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-1744.
---
Resolution: Won't Fix

 Document how to pass in preferredNodeLocationData
 -

 Key: SPARK-1744
 URL: https://issues.apache.org/jira/browse/SPARK-1744
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.0
Reporter: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9286) Methods in Unevaluable should be final

2015-07-23 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-9286:
-

 Summary: Methods in Unevaluable should be final
 Key: SPARK-9286
 URL: https://issues.apache.org/jira/browse/SPARK-9286
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Trivial


The {{eval()}} and {{genCode()}} methods in SQL's {{Unevaluable}} trait should 
be marked as {{final}} and we should fix any cases where they are overridden.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9279) Spark Master Refuses to Bind WebUI to a Privileged Port

2015-07-23 Thread Omar Padron (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omar Padron updated SPARK-9279:
---
Description: 
When trying to start a spark master server as root...

{code}
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=80

spark-class org.apache.spark.deploy.master.Master \
--host $( hostname ) \
--port $SPARK_MASTER_PORT \
--webui-port $SPARK_MASTER_WEBUI_PORT
{code}

The process terminates with IllegalArgumentException requirement failed: 
startPort should be between 1024 and 65535 (inclusive), or 0 for a random free 
port.

But, when SPARK_MASTER_WEBUI_PORT=8080 (or anything >= 1024), the process runs 
fine.

I do not understand why the usable ports have been arbitrarily restricted to 
the non-privileged range. Users choosing to run Spark as root should be allowed 
to choose their own ports.

Full output from a sample run below:
{code}
2015-07-23 14:36:50,892 INFO  [main] master.Master 
(SignalLogger.scala:register(47)) - Registered signal handlers for [TERM, HUP, 
INT]
2015-07-23 14:36:51,399 WARN  [main] util.NativeCodeLoader 
(NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for 
your platform... using builtin-java classes where applicable
2015-07-23 14:36:51,586 INFO  [main] spark.SecurityManager 
(Logging.scala:logInfo(59)) - Changing view acls to: root
2015-07-23 14:36:51,587 INFO  [main] spark.SecurityManager 
(Logging.scala:logInfo(59)) - Changing modify acls to: root
2015-07-23 14:36:51,588 INFO  [main] spark.SecurityManager 
(Logging.scala:logInfo(59)) - SecurityManager: authentication disabled; ui acls 
disabled; users with view permissions: Set(root); users with modify 
permissions: Set(root)
2015-07-23 14:36:52,295 INFO  [sparkMaster-akka.actor.default-dispatcher-2] 
slf4j.Slf4jLogger (Slf4jLogger.scala:applyOrElse(80)) - Slf4jLogger started
2015-07-23 14:36:52,349 INFO  [sparkMaster-akka.actor.default-dispatcher-2] 
Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Starting remoting
2015-07-23 14:36:52,489 INFO  [sparkMaster-akka.actor.default-dispatcher-2] 
Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting started; listening on 
addresses :[akka.tcp://sparkMaster@sparkmaster:7077]
2015-07-23 14:36:52,497 INFO  [main] util.Utils (Logging.scala:logInfo(59)) - 
Successfully started service 'sparkMaster' on port 7077.
2015-07-23 14:36:52,717 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
server.Server (Server.java:doStart(272)) - jetty-8.y.z-SNAPSHOT
2015-07-23 14:36:52,759 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
server.AbstractConnector (AbstractConnector.java:doStart(338)) - Started 
SelectChannelConnector@sparkmaster:6066
2015-07-23 14:36:52,759 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
util.Utils (Logging.scala:logInfo(59)) - Successfully started service on port 
6066.
2015-07-23 14:36:52,760 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
rest.StandaloneRestServer (Logging.scala:logInfo(59)) - Started REST server for 
submitting applications on port 6066
2015-07-23 14:36:52,765 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
master.Master (Logging.scala:logInfo(59)) - Starting Spark master at 
spark://sparkmaster:7077
2015-07-23 14:36:52,766 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
master.Master (Logging.scala:logInfo(59)) - Running Spark version 1.4.1
2015-07-23 14:36:52,772 ERROR [sparkMaster-akka.actor.default-dispatcher-4] 
ui.MasterWebUI (Logging.scala:logError(96)) - Failed to bind MasterWebUI
java.lang.IllegalArgumentException: requirement failed: startPort should be 
between 1024 and 65535 (inclusive), or 0 for a random free port.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1977)
at 
org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:238)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:117)
at org.apache.spark.deploy.master.Master.preStart(Master.scala:144)
at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
at org.apache.spark.deploy.master.Master.aroundPreStart(Master.scala:52)
at akka.actor.ActorCell.create(ActorCell.scala:580)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history

2015-07-23 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-9280:

Affects Version/s: 1.3.1

 New HiveContext object unexpectedly loads configuration settings from history 
 --

 Key: SPARK-9280
 URL: https://issues.apache.org/jira/browse/SPARK-9280
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.3.1
Reporter: Tien-Dung LE

 In a spark-shell session, stopping a Spark context and creating a new Spark 
 context and Hive context does not clean the Spark SQL configuration. More 
 precisely, the new Hive context still keeps the previous configuration 
 settings. Here is some code that shows this scenario.
 {code:title=New hive context should not load the configurations from history}
 case class Foo ( x: Int = (math.random * 1e3).toInt)
 val foo = (1 to 100).map(i => Foo()).toDF
 foo.saveAsParquetFile( "foo" )
 sqlContext.setConf( "spark.sql.shuffle.partitions", "10")
 sc.stop
 val sparkConf2 = new org.apache.spark.SparkConf()
 val sc2 = new org.apache.spark.SparkContext( sparkConf2 )
 val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )
 sqlContext2.getConf( "spark.sql.shuffle.partitions", "20")
 val foo2 = sqlContext2.parquetFile( "foo" )
 sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
 // expected 30 but got 10
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history

2015-07-23 Thread Tien-Dung LE (JIRA)
Tien-Dung LE created SPARK-9280:
---

 Summary: New HiveContext object unexpectedly loads configuration 
settings from history 
 Key: SPARK-9280
 URL: https://issues.apache.org/jira/browse/SPARK-9280
 Project: Spark
  Issue Type: Bug
Reporter: Tien-Dung LE


In a spark-shell session, stopping a Spark context and creating a new Spark 
context and Hive context does not clean the Spark SQL configuration. More 
precisely, the new Hive context still keeps the previous configuration 
settings. Here is some code that shows this scenario.

{code:title=New hive context should not load the configurations from history}
case class Foo ( x: Int = (math.random * 1e3).toInt)
val foo = (1 to 100).map(i => Foo()).toDF
foo.saveAsParquetFile( "foo" )
sqlContext.setConf( "spark.sql.shuffle.partitions", "10")

sc.stop

val sparkConf2 = new org.apache.spark.SparkConf()
val sc2 = new org.apache.spark.SparkContext( sparkConf2 )
val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )

sqlContext2.getConf( "spark.sql.shuffle.partitions", "20")
val foo2 = sqlContext2.parquetFile( "foo" )
sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
// expected 30 but got 10
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history

2015-07-23 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-9280:

Description: 
In a spark-shell session, stopping a Spark context and creating a new Spark 
context and Hive context does not clean the Spark SQL configuration. More 
precisely, the new Hive context still keeps the previous configuration 
settings. It would be great if someone could let us know how to avoid this 
situation.

{code:title=New hive context should not load the configurations from history}
case class Foo ( x: Int = (math.random * 1e3).toInt)
val foo = (1 to 100).map(i => Foo()).toDF
foo.saveAsParquetFile( "foo" )
sqlContext.setConf( "spark.sql.shuffle.partitions", "10")

sc.stop

val sparkConf2 = new org.apache.spark.SparkConf()
val sc2 = new org.apache.spark.SparkContext( sparkConf2 )
val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )

sqlContext2.getConf( "spark.sql.shuffle.partitions", "20")
// got 20 as expected
val foo2 = sqlContext2.parquetFile( "foo" )
sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
// expected 30 but got 10
{code}

  was:
In a spark-shell session, stopping a Spark context and creating a new Spark 
context and Hive context does not clean the Spark SQL configuration. More 
precisely, the new Hive context still keeps the previous configuration 
settings. It would be great if someone could let us know how to avoid this 
situation.

{code:title=New hive context should not load the configurations from history}
case class Foo ( x: Int = (math.random * 1e3).toInt)
val foo = (1 to 100).map(i => Foo()).toDF
foo.saveAsParquetFile( "foo" )
sqlContext.setConf( "spark.sql.shuffle.partitions", "10")

sc.stop

val sparkConf2 = new org.apache.spark.SparkConf()
val sc2 = new org.apache.spark.SparkContext( sparkConf2 )
val sqlContext2 = new org.apache.spark.sql.hive.HiveContext( sc2 )

sqlContext2.getConf( "spark.sql.shuffle.partitions", "20")
val foo2 = sqlContext2.parquetFile( "foo" )
sqlContext2.getConf( "spark.sql.shuffle.partitions", "30")
// expected 30 but got 10
{code}


 New HiveContext object unexpectedly loads configuration settings from history 
 --

 Key: SPARK-9280
 URL: https://issues.apache.org/jira/browse/SPARK-9280
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Tien-Dung LE

 In a spark-shell session, stopping a Spark context and creating a new Spark 
 context and Hive context does not clean the Spark SQL configuration. More 
 precisely, the new Hive context still keeps the previous configuration 
 settings. It would be great if someone could let us know how to avoid this 
 situation.
 {code:title=New hive context should not load the configurations from history}
 case class Foo(x: Int = (math.random * 1e3).toInt)
 val foo = (1 to 100).map(i => Foo()).toDF
 foo.saveAsParquetFile("foo")
 sqlContext.setConf("spark.sql.shuffle.partitions", "10")
 sc.stop
 val sparkConf2 = new org.apache.spark.SparkConf()
 val sc2 = new org.apache.spark.SparkContext(sparkConf2)
 val sqlContext2 = new org.apache.spark.sql.hive.HiveContext(sc2)
 sqlContext2.getConf("spark.sql.shuffle.partitions", "20")
 // got 20 as expected
 val foo2 = sqlContext2.parquetFile("foo")
 sqlContext2.getConf("spark.sql.shuffle.partitions", "30")
 // expected 30 but got 10
 {code}
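
One possible workaround, not verified against this report: re-set the property 
explicitly on the freshly created Hive context so the stale value cannot leak 
through. A minimal spark-shell sketch, assuming the same session as above:

{code}
// Pin the value on the new HiveContext instead of relying on getConf's default argument.
sqlContext2.setConf("spark.sql.shuffle.partitions", "30")
sqlContext2.getConf("spark.sql.shuffle.partitions", "20") // now returns 30, the value just set
{code}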



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history

2015-07-23 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-9280:

Component/s: SQL

 New HiveContext object unexpectedly loads configuration settings from history 
 --

 Key: SPARK-9280
 URL: https://issues.apache.org/jira/browse/SPARK-9280
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Tien-Dung LE

 In a spark-shell session, stopping a Spark context and creating a new Spark 
 context and Hive context does not clean the Spark SQL configuration. More 
 precisely, the new Hive context still keeps the previous configuration 
 settings. Here is code that shows this scenario.
 {code:title=New hive context should not load the configurations from history}
 case class Foo(x: Int = (math.random * 1e3).toInt)
 val foo = (1 to 100).map(i => Foo()).toDF
 foo.saveAsParquetFile("foo")
 sqlContext.setConf("spark.sql.shuffle.partitions", "10")
 sc.stop
 val sparkConf2 = new org.apache.spark.SparkConf()
 val sc2 = new org.apache.spark.SparkContext(sparkConf2)
 val sqlContext2 = new org.apache.spark.sql.hive.HiveContext(sc2)
 sqlContext2.getConf("spark.sql.shuffle.partitions", "20")
 val foo2 = sqlContext2.parquetFile("foo")
 sqlContext2.getConf("spark.sql.shuffle.partitions", "30")
 // expected 30 but got 10
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9280) New HiveContext object unexpectedly loads configuration settings from history

2015-07-23 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-9280:

Description: 
In a spark-shell session, stopping a Spark context and creating a new Spark 
context and Hive context does not clean the Spark SQL configuration. More 
precisely, the new Hive context still keeps the previous configuration 
settings. It would be great if someone could let us know how to avoid this 
situation.

{code:title=New hive context should not load the configurations from history}
case class Foo(x: Int = (math.random * 1e3).toInt)
val foo = (1 to 100).map(i => Foo()).toDF
foo.saveAsParquetFile("foo")
sqlContext.setConf("spark.sql.shuffle.partitions", "10")

sc.stop

val sparkConf2 = new org.apache.spark.SparkConf()
val sc2 = new org.apache.spark.SparkContext(sparkConf2)
val sqlContext2 = new org.apache.spark.sql.hive.HiveContext(sc2)

sqlContext2.getConf("spark.sql.shuffle.partitions", "20")
val foo2 = sqlContext2.parquetFile("foo")
sqlContext2.getConf("spark.sql.shuffle.partitions", "30")
// expected 30 but got 10
{code}

  was:
In a spark-shell session, stopping a Spark context and creating a new Spark 
context and Hive context does not clean the Spark SQL configuration. More 
precisely, the new Hive context still keeps the previous configuration 
settings. Here is code that shows this scenario.

{code:title=New hive context should not load the configurations from history}
case class Foo(x: Int = (math.random * 1e3).toInt)
val foo = (1 to 100).map(i => Foo()).toDF
foo.saveAsParquetFile("foo")
sqlContext.setConf("spark.sql.shuffle.partitions", "10")

sc.stop

val sparkConf2 = new org.apache.spark.SparkConf()
val sc2 = new org.apache.spark.SparkContext(sparkConf2)
val sqlContext2 = new org.apache.spark.sql.hive.HiveContext(sc2)

sqlContext2.getConf("spark.sql.shuffle.partitions", "20")
val foo2 = sqlContext2.parquetFile("foo")
sqlContext2.getConf("spark.sql.shuffle.partitions", "30")
// expected 30 but got 10
{code}


 New HiveContext object unexpectedly loads configuration settings from history 
 --

 Key: SPARK-9280
 URL: https://issues.apache.org/jira/browse/SPARK-9280
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Tien-Dung LE

 In a spark-shell session, stopping a Spark context and creating a new Spark 
 context and Hive context does not clean the Spark SQL configuration. More 
 precisely, the new Hive context still keeps the previous configuration 
 settings. It would be great if someone could let us know how to avoid this 
 situation.
 {code:title=New hive context should not load the configurations from history}
 case class Foo(x: Int = (math.random * 1e3).toInt)
 val foo = (1 to 100).map(i => Foo()).toDF
 foo.saveAsParquetFile("foo")
 sqlContext.setConf("spark.sql.shuffle.partitions", "10")
 sc.stop
 val sparkConf2 = new org.apache.spark.SparkConf()
 val sc2 = new org.apache.spark.SparkContext(sparkConf2)
 val sqlContext2 = new org.apache.spark.sql.hive.HiveContext(sc2)
 sqlContext2.getConf("spark.sql.shuffle.partitions", "20")
 val foo2 = sqlContext2.parquetFile("foo")
 sqlContext2.getConf("spark.sql.shuffle.partitions", "30")
 // expected 30 but got 10
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9278) DataFrameWriter.insertInto inserts incorrect data

2015-07-23 Thread Steve Lindemann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14638843#comment-14638843
 ] 

Steve Lindemann commented on SPARK-9278:


Here are the steps to reproduce the issue. First, create a Hive table with the 
desired schema:

{noformat}
In [1]: hc = pyspark.sql.HiveContext(sqlContext)
In [2]: pdf = pd.DataFrame({'pk': ['a']*5+['b']*5+['c']*5, 'k': ['a', 'e', 'i', 
'o', 'u']*3, 'v': range(15)})
In [3]: sdf = hc.createDataFrame(pdf)
In [4]: sdf.show()
+-+--+--+
|k|pk| v|
+-+--+--+
|a| a| 0|
|e| a| 1|
|i| a| 2|
|o| a| 3|
|u| a| 4|
|a| b| 5|
|e| b| 6|
|i| b| 7|
|o| b| 8|
|u| b| 9|
|a| c|10|
|e| c|11|
|i| c|12|
|o| c|13|
|u| c|14|
+-+--+--+
In [5]: sdf.filter('FALSE').write.partitionBy('pk').saveAsTable('foo', 
format='parquet', path='s3a://eglp-core-temp/tmp/foo')
{noformat}

A table has been created:

{noformat}
In [33]: print('\n'.join(r.result for r in hc.sql('SHOW CREATE TABLE 
foo').collect()))
CREATE EXTERNAL TABLE `foo`(
  `col` array<string> COMMENT 'from deserializer')
PARTITIONED BY (
  `pk` string COMMENT '')
ROW FORMAT DELIMITED
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION
  's3a://eglp-core-data/hive/warehouse/foo'
TBLPROPERTIES (
  
  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"k\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"v\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"pk\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
  'transient_lastDdlTime'='1437657391',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.provider'='parquet')
{noformat}

Now, write a new partition of data (note that this is from the same DataFrame 
from which the table was created):

{noformat}
sdf.filter(sdf.pk == 'a').write.partitionBy('pk').insertInto('foo')
{noformat}

Then, select the data:

{noformat}
In [7]: foo = hc.table('foo')
In [8]: foo.show()
+-++--+
|k|   v|pk|
+-++--+
|a|null| 0|
|o|null| 3|
|i|null| 2|
|e|null| 1|
|u|null| 4|
+-++--+
In [9]: sdf.filter(sdf.pk == 'a').show()
+-+--+-+
|k|pk|v|
+-+--+-+
|a| a|0|
|e| a|1|
|i| a|2|
|o| a|3|
|u| a|4|
+-+--+-+
{noformat}

So clearly it inserted incorrect data. By reordering the columns, we can insert 
data properly:

{noformat}
In [10]: pdf2 = pdf[['k', 'v', 'pk']]
In [11]: sdf2 = hc.createDataFrame(pdf2)
In [12]: sdf2.filter(sdf2.pk == 'a').write.partitionBy('pk').insertInto('foo')
In [13]: hc.refreshTable('foo')
In [14]: foo = hc.table('foo')
In [15]: foo.show()
+-++--+
|k|   v|pk|
+-++--+
|a|null| 0|
|o|null| 3|
|i|null| 2|
|e|null| 1|
|u|null| 4|
|o|   3| a|
|u|   4| a|
|a|   0| a|
|e|   1| a|
|i|   2| a|
+-++--+
{noformat}
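
For context, DataFrameWriter.insertInto matches DataFrame columns to table columns by 
position rather than by name, which is consistent with the behaviour shown above. A 
minimal Scala sketch of the reordering workaround, assuming an equivalent DataFrame 
{{sdf}} with columns (k, pk, v) and the table layout (k, v, pk):

{code}
// Project the columns into the table's physical order before inserting;
// insertInto pairs columns with the table by position, not by name.
sdf.select("k", "v", "pk").write.insertInto("foo")
{code}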

 DataFrameWriter.insertInto inserts incorrect data
 -

 Key: SPARK-9278
 URL: https://issues.apache.org/jira/browse/SPARK-9278
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
 Environment: Linux, S3, Hive Metastore
Reporter: Steve Lindemann

 After creating a partitioned Hive table (stored as Parquet) via the 
 DataFrameWriter.createTable command, subsequent attempts to insert additional 
 data into new partitions of this table result in inserting incorrect data 
 rows. Reordering the columns in the data to be written seems to avoid this 
 issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9279) Spark Master Refuses to Bind WebUI to a Privileged Port

2015-07-23 Thread Omar Padron (JIRA)
Omar Padron created SPARK-9279:
--

 Summary: Spark Master Refuses to Bind WebUI to a Privileged Port
 Key: SPARK-9279
 URL: https://issues.apache.org/jira/browse/SPARK-9279
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
 Environment: Ubuntu Trusty running in a docker container
Reporter: Omar Padron
Priority: Minor


When trying to start a spark master server as root...

{code}
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=80

spark-class org.apache.spark.deploy.master.Master \
--host $( hostname ) \
--port $SPARK_MASTER_PORT \
--webui-port $SPARK_MASTER_WEBUI_PORT
{code}

The process terminates with IllegalArgumentException requirement failed: 
startPort should be between 1024 and 65535 (inclusive), or 0 for a random free 
port.

But, when SPARK_MASTER_WEBUI_PORT=8080 (or anything > 1024), the process runs 
fine.

I do not understand why the usable ports have been arbitrarily restricted to 
the non-privileged.  Users choosing to run spark as root should be allowed to 
choose their own ports.

Full output from a sample run below:
{code}
2015-07-23 14:36:50,892 INFO  [main] master.Master 
(SignalLogger.scala:register(47)) - Registered signal handlers for [TERM, HUP, 
INT]
2015-07-23 14:36:51,399 WARN  [main] util.NativeCodeLoader 
(NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for 
your platform... using builtin-java classes where applicable
2015-07-23 14:36:51,586 INFO  [main] spark.SecurityManager 
(Logging.scala:logInfo(59)) - Changing view acls to: root
2015-07-23 14:36:51,587 INFO  [main] spark.SecurityManager 
(Logging.scala:logInfo(59)) - Changing modify acls to: root
2015-07-23 14:36:51,588 INFO  [main] spark.SecurityManager 
(Logging.scala:logInfo(59)) - SecurityManager: authentication disabled; ui acls 
disabled; users with view permissions: Set(root); users with modify 
permissions: Set(root)
2015-07-23 14:36:52,295 INFO  [sparkMaster-akka.actor.default-dispatcher-2] 
slf4j.Slf4jLogger (Slf4jLogger.scala:applyOrElse(80)) - Slf4jLogger started
2015-07-23 14:36:52,349 INFO  [sparkMaster-akka.actor.default-dispatcher-2] 
Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Starting remoting
2015-07-23 14:36:52,489 INFO  [sparkMaster-akka.actor.default-dispatcher-2] 
Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting started; listening on 
addresses :[akka.tcp://sparkMaster@sparkmaster:7077]
2015-07-23 14:36:52,497 INFO  [main] util.Utils (Logging.scala:logInfo(59)) - 
Successfully started service 'sparkMaster' on port 7077.
2015-07-23 14:36:52,717 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
server.Server (Server.java:doStart(272)) - jetty-8.y.z-SNAPSHOT
2015-07-23 14:36:52,759 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
server.AbstractConnector (AbstractConnector.java:doStart(338)) - Started 
SelectChannelConnector@sparkmaster:6066
2015-07-23 14:36:52,759 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
util.Utils (Logging.scala:logInfo(59)) - Successfully started service on port 
6066.
2015-07-23 14:36:52,760 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
rest.StandaloneRestServer (Logging.scala:logInfo(59)) - Started REST server for 
submitting applications on port 6066
2015-07-23 14:36:52,765 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
master.Master (Logging.scala:logInfo(59)) - Starting Spark master at 
spark://sparkmaster:7077
2015-07-23 14:36:52,766 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
master.Master (Logging.scala:logInfo(59)) - Running Spark version 1.4.1
2015-07-23 14:36:52,772 ERROR [sparkMaster-akka.actor.default-dispatcher-4] 
ui.MasterWebUI (Logging.scala:logError(96)) - Failed to bind MasterWebUI
java.lang.IllegalArgumentException: requirement failed: startPort should be 
between 1024 and 65535 (inclusive), or 0 for a random free port.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1977)
at 
org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:238)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:117)
at org.apache.spark.deploy.master.Master.preStart(Master.scala:144)
at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
at org.apache.spark.deploy.master.Master.aroundPreStart(Master.scala:52)
at akka.actor.ActorCell.create(ActorCell.scala:580)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at 

[jira] [Resolved] (SPARK-9279) Spark Master Refuses to Bind WebUI to a Privileged Port

2015-07-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9279.
--
Resolution: Not A Problem

This has nothing to do with Spark. Any Linux-like OS requires root privileges 
for any process to bind to a port under 1024.
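
For reference, the stack trace above points at a requirement check in 
org.apache.spark.util.Utils.startServiceOnPort. A rough sketch of that check (not the 
exact Spark source) is:

{code}
// Sketch of the kind of check implied by the error message and stack trace above;
// the real code in Utils.startServiceOnPort may differ in detail.
def checkStartPort(startPort: Int): Unit = {
  require(startPort == 0 || (1024 <= startPort && startPort < 65536),
    "startPort should be between 1024 and 65535 (inclusive), or 0 for a random free port.")
}
{code}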

 Spark Master Refuses to Bind WebUI to a Privileged Port
 ---

 Key: SPARK-9279
 URL: https://issues.apache.org/jira/browse/SPARK-9279
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
 Environment: Ubuntu Trusty running in a docker container
Reporter: Omar Padron
Priority: Minor

 When trying to start a spark master server as root...
 {code}
 export SPARK_MASTER_PORT=7077
 export SPARK_MASTER_WEBUI_PORT=80
 spark-class org.apache.spark.deploy.master.Master \
 --host $( hostname ) \
 --port $SPARK_MASTER_PORT \
 --webui-port $SPARK_MASTER_WEBUI_PORT
 {code}
 The process terminates with IllegalArgumentException requirement failed: 
 startPort should be between 1024 and 65535 (inclusive), or 0 for a random 
 free port.
 But, when SPARK_MASTER_WEBUI_PORT=8080 (or anything > 1024), the process runs 
 fine.
 I do not understand why the usable ports have been arbitrarily restricted to 
 the non-privileged.  Users choosing to run spark as root should be allowed to 
 choose their own ports.
 Full output from a sample run below:
 {code}
 2015-07-23 14:36:50,892 INFO  [main] master.Master 
 (SignalLogger.scala:register(47)) - Registered signal handlers for [TERM, 
 HUP, INT]
 2015-07-23 14:36:51,399 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 2015-07-23 14:36:51,586 INFO  [main] spark.SecurityManager 
 (Logging.scala:logInfo(59)) - Changing view acls to: root
 2015-07-23 14:36:51,587 INFO  [main] spark.SecurityManager 
 (Logging.scala:logInfo(59)) - Changing modify acls to: root
 2015-07-23 14:36:51,588 INFO  [main] spark.SecurityManager 
 (Logging.scala:logInfo(59)) - SecurityManager: authentication disabled; ui 
 acls disabled; users with view permissions: Set(root); users with modify 
 permissions: Set(root)
 2015-07-23 14:36:52,295 INFO  [sparkMaster-akka.actor.default-dispatcher-2] 
 slf4j.Slf4jLogger (Slf4jLogger.scala:applyOrElse(80)) - Slf4jLogger started
 2015-07-23 14:36:52,349 INFO  [sparkMaster-akka.actor.default-dispatcher-2] 
 Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Starting remoting
 2015-07-23 14:36:52,489 INFO  [sparkMaster-akka.actor.default-dispatcher-2] 
 Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting started; listening 
 on addresses :[akka.tcp://sparkMaster@sparkmaster:7077]
 2015-07-23 14:36:52,497 INFO  [main] util.Utils (Logging.scala:logInfo(59)) - 
 Successfully started service 'sparkMaster' on port 7077.
 2015-07-23 14:36:52,717 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
 server.Server (Server.java:doStart(272)) - jetty-8.y.z-SNAPSHOT
 2015-07-23 14:36:52,759 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
 server.AbstractConnector (AbstractConnector.java:doStart(338)) - Started 
 SelectChannelConnector@sparkmaster:6066
 2015-07-23 14:36:52,759 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
 util.Utils (Logging.scala:logInfo(59)) - Successfully started service on port 
 6066.
 2015-07-23 14:36:52,760 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
 rest.StandaloneRestServer (Logging.scala:logInfo(59)) - Started REST server 
 for submitting applications on port 6066
 2015-07-23 14:36:52,765 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
 master.Master (Logging.scala:logInfo(59)) - Starting Spark master at 
 spark://sparkmaster:7077
 2015-07-23 14:36:52,766 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
 master.Master (Logging.scala:logInfo(59)) - Running Spark version 1.4.1
 2015-07-23 14:36:52,772 ERROR [sparkMaster-akka.actor.default-dispatcher-4] 
 ui.MasterWebUI (Logging.scala:logError(96)) - Failed to bind MasterWebUI
 java.lang.IllegalArgumentException: requirement failed: startPort should be 
 between 1024 and 65535 (inclusive), or 0 for a random free port.
 at scala.Predef$.require(Predef.scala:233)
 at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1977)
 at 
 org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:238)
 at org.apache.spark.ui.WebUI.bind(WebUI.scala:117)
 at org.apache.spark.deploy.master.Master.preStart(Master.scala:144)
 at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
 at 
 org.apache.spark.deploy.master.Master.aroundPreStart(Master.scala:52)
 at akka.actor.ActorCell.create(ActorCell.scala:580)
 at 

[jira] [Updated] (SPARK-9270) spark.app.name is not honored by pyspark

2015-07-23 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated SPARK-9270:
-
Description: 
Currently, the app name is hardcoded in pyspark as PySparkShell, and the 
{{spark.app.name}} property is not honored.

SPARK-8650 and SPARK-9180 fixed this issue for spark-sql and spark-shell, but 
pyspark is not fixed yet. sparkR is different because {{SparkContext}} is not 
automatically constructed in sparkR, and the app name can be set when 
initializing {{SparkContext}}.

In summary-
||shell||support --conf spark.app.name||
|pyspark|no|
|spark-shell|yes|
|spark-sql|yes|
|sparkR|n/a| 

  was:
Currently, the app name is hardcoded in spark-shell and pyspark as SparkShell 
and PySparkShell respectively, and the {{spark.app.name}} property is not 
honored.

But being able to set the app name is quite handy for various cluster 
operations, e.g. filtering jobs whose app name is X on the YARN RM page.

SPARK-8650 fixed this issue for spark-sql, but it didn't for spark-shell and 
pyspark. sparkR is different because {{SparkContext}} is not automatically 
constructed in sparkR, and the app name can be set when initializing 
{{SparkContext}}.

In summary-
||shell||support --conf spark.app.name||
|spark-shell|no|
|pyspark|no|
|spark-sql|yes|
|sparkR|n/a| 

Component/s: (was: Spark Shell)
Summary: spark.app.name is not honored by pyspark  (was: spark.app.name 
is not honored by spark-shell and pyspark)

 spark.app.name is not honored by pyspark
 

 Key: SPARK-9270
 URL: https://issues.apache.org/jira/browse/SPARK-9270
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.1, 1.5.0
Reporter: Cheolsoo Park
Priority: Minor

 Currently, the app name is hardcoded in pyspark as PySparkShell, and the 
 {{spark.app.name}} property is not honored.
 SPARK-8650 and SPARK-9180 fixed this issue for spark-sql and spark-shell, but 
 pyspark is not fixed yet. sparkR is different because {{SparkContext}} is not 
 automatically constructed in sparkR, and the app name can be set when 
 initializing {{SparkContext}}.
 In summary-
 ||shell||support --conf spark.app.name||
 |pyspark|no|
 |spark-shell|yes|
 |spark-sql|yes|
 |sparkR|n/a| 
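
As noted above, sparkR avoids the problem because the context is constructed by hand 
and the app name can be passed in directly. A Scala analogue of that explicit 
construction (names here are illustrative, not a proposed fix for pyspark):

{code}
// Building the context explicitly lets the caller choose the application name.
val conf = new org.apache.spark.SparkConf().setAppName("MyApp")
val sc = new org.apache.spark.SparkContext(conf)
{code}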



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7254) Extend PIC to handle Graphs directly

2015-07-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7254.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6054
[https://github.com/apache/spark/pull/6054]

 Extend PIC to handle Graphs directly
 

 Key: SPARK-7254
 URL: https://issues.apache.org/jira/browse/SPARK-7254
 Project: Spark
  Issue Type: New Feature
  Components: GraphX, MLlib
Reporter: Joseph K. Bradley
 Fix For: 1.5.0


 We should extend the PowerIterationClustering API to handle Graphs.  Users 
 can do spectral clustering on graphs using PIC currently, but they must 
 handle the boilerplate of converting the Graph to an RDD for PIC, running 
 PIC, and then matching the results back with their Graph.
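
A rough Scala sketch of the boilerplate referred to above, assuming a GraphX graph 
whose Double edge attributes already hold pairwise similarities (the glue code is 
illustrative; Graph and PowerIterationClustering.run are existing APIs):

{code}
import org.apache.spark.graphx.Graph
import org.apache.spark.mllib.clustering.PowerIterationClustering

// Flatten the graph into the (srcId, dstId, similarity) triples PIC expects and run PIC;
// the returned model's assignments can then be joined back with the graph's vertices.
def clusterGraph(graph: Graph[_, Double], k: Int) = {
  val similarities = graph.edges.map(e => (e.srcId, e.dstId, e.attr))
  new PowerIterationClustering().setK(k).run(similarities)
}
{code}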



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7254) Extend PIC to handle Graphs directly

2015-07-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7254:
-
Assignee: Liang-Chi Hsieh

 Extend PIC to handle Graphs directly
 

 Key: SPARK-7254
 URL: https://issues.apache.org/jira/browse/SPARK-7254
 Project: Spark
  Issue Type: New Feature
  Components: GraphX, MLlib
Reporter: Joseph K. Bradley
Assignee: Liang-Chi Hsieh
 Fix For: 1.5.0


 We should extend the PowerIterationClustering API to handle Graphs.  Users 
 can do spectral clustering on graphs using PIC currently, but they must 
 handle the boilerplate of converting the Graph to an RDD for PIC, running 
 PIC, and then matching the results back with their Graph.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7254) Extend PIC to handle Graphs directly

2015-07-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7254:
-
Target Version/s: 1.5.0

 Extend PIC to handle Graphs directly
 

 Key: SPARK-7254
 URL: https://issues.apache.org/jira/browse/SPARK-7254
 Project: Spark
  Issue Type: New Feature
  Components: GraphX, MLlib
Reporter: Joseph K. Bradley
Assignee: Liang-Chi Hsieh
 Fix For: 1.5.0


 We should extend the PowerIterationClustering API to handle Graphs.  Users 
 can do spectral clustering on graphs using PIC currently, but they must 
 handle the boilerplate of converting the Graph to an RDD for PIC, running 
 PIC, and then matching the results back with their Graph.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9243) Update crosstab doc for pairs that have no occurrences

2015-07-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-9243:


Assignee: Xiangrui Meng

 Update crosstab doc for pairs that have no occurrences
 --

 Key: SPARK-9243
 URL: https://issues.apache.org/jira/browse/SPARK-9243
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark, SparkR, SQL
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 The crosstab value for pairs that have no occurrences was changed from null 
 to 0 in SPARK-7982. We should update the doc in Scala, Python, and SparkR.
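
For illustration, a small Scala sketch of the behaviour the doc should describe (the 
DataFrame {{df}} and its column names are made up; df.stat.crosstab is the existing API):

{code}
// If a pair such as ("c", "z") never co-occurs in df, its cell in the
// contingency table is now 0 rather than null.
val counts = df.stat.crosstab("key", "value")
counts.show()
{code}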



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9272) Persist information of individual partitions when persisting partitioned data source tables to metastore

2015-07-23 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-9272:
-

 Summary: Persist information of individual partitions when 
persisting partitioned data source tables to metastore
 Key: SPARK-9272
 URL: https://issues.apache.org/jira/browse/SPARK-9272
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian


Currently, when a partitioned data source table is persisted to the Hive metastore, 
we only persist its partition columns. Information about individual partitions 
is not persisted. This forces us to do partition discovery before reading a 
persisted partitioned table, which hurts performance.

To fix this issue, we may persist partition information into the metastore. 
Specifically, the format should be compatible with Hive to ensure 
interoperability.

One approach to collecting partition values and partition directory paths 
for dynamically partitioned tables is to use accumulators to gather the expected 
information during the write job, as in the sketch below.
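
A minimal sketch of that accumulator idea using the Spark 1.x API; the tuple shape and 
names are illustrative only, not an existing Spark interface:

{code}
import scala.collection.mutable
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Each write task reports the partition values and output directory it produced;
// the driver then reads the accumulator and registers those entries in the metastore.
def collectWrittenPartitions(
    sc: SparkContext,
    writtenDirs: RDD[(Map[String, String], String)]): Seq[(Map[String, String], String)] = {
  val partitions =
    sc.accumulableCollection(mutable.ArrayBuffer.empty[(Map[String, String], String)])
  writtenDirs.foreach(partitions += _)
  partitions.value  // driver side: persist these (partition values, directory) pairs
}
{code}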



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


