[jira] [Commented] (SPARK-7229) SpecificMutableRow should take integer type as internal representation for DateType

2015-04-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518924#comment-14518924
 ] 

Apache Spark commented on SPARK-7229:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/5772

 SpecificMutableRow should take integer type as internal representation for 
 DateType
 ---

 Key: SPARK-7229
 URL: https://issues.apache.org/jira/browse/SPARK-7229
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao

 {code}
   test("test DATE types in cache") {
     val rows = TestSQLContext.jdbc(urlWithUserAndPass, "TEST.TIMETYPES").collect()
     TestSQLContext.jdbc(urlWithUserAndPass, "TEST.TIMETYPES")
       .cache().registerTempTable("mycached_date")
     val cachedRows = sql("select * from mycached_date").collect()
     assert(rows(0).getAs[java.sql.Date](1) === java.sql.Date.valueOf("1996-01-01"))
     assert(cachedRows(0).getAs[java.sql.Date](1) === java.sql.Date.valueOf("1996-01-01"))
   }
 {code}
 java.lang.ClassCastException: 
 org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
 org.apache.spark.sql.catalyst.expressions.MutableInt
   at 
 org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.getInt(SpecificMutableRow.scala:252)
   at 
 org.apache.spark.sql.columnar.IntColumnStats.gatherStats(ColumnStats.scala:208)
   at 
 org.apache.spark.sql.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:56)
   at 
 org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:87)
   at 
 org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78)
   at 
 org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:87)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:148)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:124)
   at 
 org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:277)
   at 
 org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
   at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7222) Added mathematical derivation in comment to LinearRegression with ElasticNet

2015-04-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7222:
---

Assignee: Apache Spark

 Added mathematical derivation in comment to LinearRegression with ElasticNet
 

 Key: SPARK-7222
 URL: https://issues.apache.org/jira/browse/SPARK-7222
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: DB Tsai
Assignee: Apache Spark

 Added a detailed mathematical derivation of how scaling and 
 LeastSquaresAggregator work. Also refactored the code. TODO: add a test that 
 fails when the correction terms are not correctly computed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7159) Support multiclass logistic regression in spark.ml

2015-04-29 Thread Selim Namsi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518952#comment-14518952
 ] 

Selim Namsi commented on SPARK-7159:


I'll work on it.

 Support multiclass logistic regression in spark.ml
 --

 Key: SPARK-7159
 URL: https://issues.apache.org/jira/browse/SPARK-7159
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley

 This should be implemented by checking the input DataFrame's label column for 
 feature metadata specifying the number of classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7222) Added mathematical derivation in comment to LinearRegression with ElasticNet

2015-04-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518769#comment-14518769
 ] 

Apache Spark commented on SPARK-7222:
-

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/5767

 Added mathematical derivation in comment to LinearRegression with ElasticNet
 

 Key: SPARK-7222
 URL: https://issues.apache.org/jira/browse/SPARK-7222
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: DB Tsai

 Added a detailed mathematical derivation of how scaling and 
 LeastSquaresAggregator work. Also refactored the code. TODO: add a test that 
 fails when the correction terms are not correctly computed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6824) Fill the docs for DataFrame API in SparkR

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6824:
-
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-7228

 Fill the docs for DataFrame API in SparkR
 -

 Key: SPARK-6824
 URL: https://issues.apache.org/jira/browse/SPARK-6824
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Blocker

 Some of the DataFrame functions in SparkR do not have complete roxygen docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7225) CombineLimits in Optimizer do not works

2015-04-29 Thread Zhongshuai Pei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongshuai Pei updated SPARK-7225:
--
Summary: CombineLimits in Optimizer do not works  (was: CombineLimits do 
not works)

 CombineLimits in Optimizer do not works
 ---

 Key: SPARK-7225
 URL: https://issues.apache.org/jira/browse/SPARK-7225
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Zhongshuai Pei

 The optimized logical plan of {{select key from (select key from src limit 
 100) t2 limit 10}} looks like this: 
 {quote}
 == Optimized Logical Plan ==
 Limit 10
  Limit 100
   Project [key#3]
    MetastoreRelation default, src, None
 {quote}
 It did not combine the limits.
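
(For reference, here is a self-contained toy sketch, in plain Scala, of what combining the limits is expected to do; the classes below are illustrative and are not the Catalyst rule or its plan nodes. Adjacent Limit nodes should collapse into a single Limit with the smaller bound.)

{code}
// Toy plan tree, not Catalyst classes.
sealed trait Node
case class Limit(n: Int, child: Node) extends Node
case class Leaf(name: String)         extends Node

object CombineLimitsSketch {
  // Collapse adjacent Limit nodes, keeping the smaller of the two bounds.
  def combineLimits(plan: Node): Node = plan match {
    case Limit(outer, Limit(inner, child)) =>
      combineLimits(Limit(math.min(outer, inner), child))
    case Limit(n, child) => Limit(n, combineLimits(child))
    case leaf            => leaf
  }
  // combineLimits(Limit(10, Limit(100, Leaf("src"))))  ==>  Limit(10, Leaf("src"))
}
{code}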



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6815) Support accumulators in R

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6815:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Support accumulators in R
 -

 Key: SPARK-6815
 URL: https://issues.apache.org/jira/browse/SPARK-6815
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

 SparkR doesn't support accumulators right now.  It might be good to add 
 support for this to get feature parity with PySpark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7229) SpecificMutableRow should take integer type as internal representation for DateType

2015-04-29 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-7229:


 Summary: SpecificMutableRow should take integer type as internal 
representation for DateType
 Key: SPARK-7229
 URL: https://issues.apache.org/jira/browse/SPARK-7229
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao


{code}
  test("test DATE types in cache") {
    val rows = TestSQLContext.jdbc(urlWithUserAndPass, "TEST.TIMETYPES").collect()
    TestSQLContext.jdbc(urlWithUserAndPass, "TEST.TIMETYPES")
      .cache().registerTempTable("mycached_date")
    val cachedRows = sql("select * from mycached_date").collect()
    assert(rows(0).getAs[java.sql.Date](1) === java.sql.Date.valueOf("1996-01-01"))
    assert(cachedRows(0).getAs[java.sql.Date](1) === java.sql.Date.valueOf("1996-01-01"))
  }
{code}
java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableInt
at 
org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.getInt(SpecificMutableRow.scala:252)
at 
org.apache.spark.sql.columnar.IntColumnStats.gatherStats(ColumnStats.scala:208)
at 
org.apache.spark.sql.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:56)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:148)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:124)
at 
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:277)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
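
(For reference, a minimal self-contained sketch of the idea in the title, assuming, as the summary states, that dates are internally stored as an Int of days since the epoch. The names below only mirror the Catalyst ones; this is not the actual Spark code or patch.)

{code}
// Illustrative sketch only: a specialized mutable row should pick an int-backed
// slot for DateType so that getInt() works on cached date columns.
sealed trait MutableValue
final class MutableInt extends MutableValue { var value: Int = 0 }
final class MutableAny extends MutableValue { var value: Any = null }

object DateSlotSketch {
  sealed trait SimpleType
  case object IntT   extends SimpleType
  case object DateT  extends SimpleType   // internally an Int (days since epoch)
  case object OtherT extends SimpleType

  def slotFor(t: SimpleType): MutableValue = t match {
    case IntT | DateT => new MutableInt   // the proposed change: DateType gets an int slot
    case OtherT       => new MutableAny   // falling back to MutableAny is what triggers
                                          // the ClassCastException in getInt above
  }
}
{code}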



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7225) CombineLimits do not works

2015-04-29 Thread Zhongshuai Pei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongshuai Pei updated SPARK-7225:
--
Description: 
The optimized logical plan of {{select key from (select key from src limit 
100) t2 limit 10}} looks like this: 
{quote}
== Optimized Logical Plan ==
Limit 10
 Limit 100
  Project [key#3]
   MetastoreRelation default, src, None
{quote}

It did not 

 CombineLimits do not works
 --

 Key: SPARK-7225
 URL: https://issues.apache.org/jira/browse/SPARK-7225
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Zhongshuai Pei

 The optimized logical plan of {{select key from (select key from src limit 
 100) t2 limit 10}} looks like this: 
 {quote}
 == Optimized Logical Plan ==
 Limit 10
  Limit 100
   Project [key#3]
MetastoreRelation default, src, None
 {quote}
 It did not 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7223) Rename RPC askWithReply -> askWithRetry, sendWithReply -> ask

2015-04-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7223:
---

Assignee: Apache Spark  (was: Reynold Xin)

 Rename RPC askWithReply -> askWithRetry, sendWithReply -> ask
 -

 Key: SPARK-7223
 URL: https://issues.apache.org/jira/browse/SPARK-7223
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Apache Spark

 Current naming is too confusing between askWithReply and sendWithReply.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7202) Add SparseMatrixPickler to SerDe

2015-04-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7202:
---

Assignee: Apache Spark

 Add SparseMatrixPickler to SerDe
 

 Key: SPARK-7202
 URL: https://issues.apache.org/jira/browse/SPARK-7202
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Manoj Kumar
Assignee: Apache Spark

 We need a SparseMatrixPickler similar to the existing DenseMatrixPickler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7157) Add approximate stratified sampling to DataFrame

2015-04-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7157:
---
Summary: Add approximate stratified sampling to DataFrame  (was: Add 
sampleByKey, sampleByKeyExact methods to DataFrame)

 Add approximate stratified sampling to DataFrame
 

 Key: SPARK-7157
 URL: https://issues.apache.org/jira/browse/SPARK-7157
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Joseph K. Bradley
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6752) Allow StreamingContext to be recreated from checkpoint and existing SparkContext

2015-04-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518927#comment-14518927
 ] 

Apache Spark commented on SPARK-6752:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/5773

 Allow StreamingContext to be recreated from checkpoint and existing 
 SparkContext
 

 Key: SPARK-6752
 URL: https://issues.apache.org/jira/browse/SPARK-6752
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.1, 1.2.1, 1.3.1
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical
 Fix For: 1.4.0


 Currently, if you want to create a StreamingContext from checkpoint 
 information, the system will create a new SparkContext. This prevents a 
 StreamingContext from being recreated from checkpoints in managed environments 
 where the SparkContext is pre-created.
 Proposed solution: introduce the following methods on StreamingContext
 1. {{new StreamingContext(checkpointDirectory, sparkContext)}}
 - Recreate the StreamingContext from the checkpoint using the provided SparkContext
 2. {{new StreamingContext(checkpointDirectory, hadoopConf, sparkContext)}}
 - Recreate the StreamingContext from the checkpoint using the provided SparkContext 
 and Hadoop conf to read the checkpoint
 3. {{StreamingContext.getOrCreate(checkpointDirectory, sparkContext, 
 createFunction: SparkContext => StreamingContext)}}
 - If checkpoint data exists, recreate the StreamingContext using the 
 provided SparkContext (that is, option 1); otherwise create the StreamingContext 
 using the provided createFunction
 The corresponding Java and Python APIs have to be added as well. A usage sketch 
 of option 3 follows below.
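
(A hedged usage sketch of proposed option 3. The {{getOrCreate}} overload in the final comment is the proposal above, not a released API, so that call is left commented out; the rest uses only the existing StreamingContext constructor and checkpoint method.)

{code}
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingContextRecoverySketch {
  // Factory used when no checkpoint data exists; it receives the pre-created SparkContext.
  def createStreamingContext(sc: SparkContext): StreamingContext = {
    val ssc = new StreamingContext(sc, Seconds(10))
    // ... set up DStreams here ...
    ssc.checkpoint("/path/to/checkpointDirectory")
    ssc
  }

  // Proposed API (option 3), shown for illustration only:
  // val ssc = StreamingContext.getOrCreate("/path/to/checkpointDirectory",
  //                                        existingSparkContext, createStreamingContext _)
}
{code}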



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6825) Data sources implementation to support `sequenceFile`

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6825:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Data sources implementation to support `sequenceFile`
 -

 Key: SPARK-6825
 URL: https://issues.apache.org/jira/browse/SPARK-6825
 Project: Spark
  Issue Type: New Feature
  Components: SparkR, SQL
Reporter: Shivaram Venkataraman

 SequenceFiles are a widely used input format and right now they are not 
 supported in SparkR. 
 It would be good to add support for SequenceFiles by implementing a new data 
 source that can create a DataFrame from a SequenceFile. However as 
 SequenceFiles can have arbitrary types, we probably need to map them to 
 User-defined types in SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7133) Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python

2015-04-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7133:
---
Assignee: Wenchen Fan

 Implement struct, array, and map field accessor using apply in Scala and 
 __getitem__ in Python
 --

 Key: SPARK-7133
 URL: https://issues.apache.org/jira/browse/SPARK-7133
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Wenchen Fan
  Labels: starter

 Typing 
 {code}
 df.col[1]
 {code}
 and
 {code}
 df.col['field']
 {code}
 is so much easier than
 {code}
 df.col.getField('field')
 df.col.getItem(1)
 {code}
 This would require us to define (in Column) an apply function in Scala, and a 
 __getitem__ function in Python.
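
(A hedged sketch of how the Scala side could look, written as an implicit wrapper instead of a change to Column itself; it only delegates to the existing {{getField}} and {{getItem}} methods and is illustrative, not the actual implementation.)

{code}
import org.apache.spark.sql.Column

object ColumnAccessSketch {
  // Tiny wrapper to illustrate the dispatch; not a modification of Column.
  implicit class RichColumn(val col: Column) extends AnyVal {
    def apply(key: Any): Column = key match {
      case field: String => col.getField(field)  // struct field: df("c")("field")
      case ordinal: Int  => col.getItem(ordinal) // array/map element: df("c")(1)
      case other         => throw new IllegalArgumentException(s"Unsupported key: $other")
    }
  }
}
{code}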



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7225) CombineLimits in Optimizer does not works

2015-04-29 Thread Zhongshuai Pei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongshuai Pei updated SPARK-7225:
--
Summary: CombineLimits in Optimizer does not works  (was: CombineLimits in 
Optimizer do not works)

 CombineLimits in Optimizer does not works
 -

 Key: SPARK-7225
 URL: https://issues.apache.org/jira/browse/SPARK-7225
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Zhongshuai Pei

 The optimized logical plan of {{select key from (select key from src limit 
 100) t2 limit 10}} looks like this: 
 {quote}
 == Optimized Logical Plan ==
 Limit 10
  Limit 100
   Project [key#3]
    MetastoreRelation default, src, None
 {quote}
 It did not combine the limits.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7223) Rename RPC askWithReply -> askWithRetry, sendWithReply -> ask

2015-04-29 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-7223:
--

 Summary: Rename RPC askWithReply -> askWithRetry, sendWithReply -> ask
 Key: SPARK-7223
 URL: https://issues.apache.org/jira/browse/SPARK-7223
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin


Current naming is too confusing between askWithReply and sendWithReply.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7225) CombineLimits optimizer does not work

2015-04-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518813#comment-14518813
 ] 

Apache Spark commented on SPARK-7225:
-

User 'DoingDone9' has created a pull request for this issue:
https://github.com/apache/spark/pull/5770

 CombineLimits optimizer does not work
 -

 Key: SPARK-7225
 URL: https://issues.apache.org/jira/browse/SPARK-7225
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Zhongshuai Pei

 The optimized logical plan of {{select key from (select key from src limit 
 100) t2 limit 10}} looks like this: 
 {quote}
 == Optimized Logical Plan ==
 Limit 10
  Limit 100
   Project [key#3]
    MetastoreRelation default, src, None
 {quote}
 It did not combine the limits.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3808) PySpark fails to start in Windows

2015-04-29 Thread eminent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518795#comment-14518795
 ] 

eminent commented on SPARK-3808:


Yes, that was the cause. 
After updating the %PATH%, Spark launched successfully. Thanks so much for 
your help!

 PySpark fails to start in Windows
 -

 Key: SPARK-3808
 URL: https://issues.apache.org/jira/browse/SPARK-3808
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Windows
Affects Versions: 1.2.0
 Environment: Windows
Reporter: Masayoshi TSUZUKI
Assignee: Masayoshi TSUZUKI
Priority: Blocker
 Fix For: 1.2.0


 When we execute bin\pyspark.cmd in Windows, it fails to start.
 We get the following messages.
 {noformat}
 C:\bin\pyspark.cmd
 Running C:\\python.exe with 
 PYTHONPATH=C:\\bin\..\python\lib\py4j-0.8.2.1-src.zip;C:\\bin\..\python;
 Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)] on 
 win32
 Type "help", "copyright", "credits" or "license" for more information.
 =x was unexpected at this time.
 Traceback (most recent call last):
   File "C:\\bin\..\python\pyspark\shell.py", line 45, in <module>
     sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
   File "C:\\python\pyspark\context.py", line 103, in __init__
     SparkContext._ensure_initialized(self, gateway=gateway)
   File "C:\\python\pyspark\context.py", line 212, in _ensure_initialized
     SparkContext._gateway = gateway or launch_gateway()
   File "C:\\python\pyspark\java_gateway.py", line 71, in launch_gateway
     raise Exception(error_msg)
 Exception: Launching GatewayServer failed with exit code 255!
 Warning: Expected GatewayServer to output a port, but found no output.
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7035) Drop __getattr__ on pyspark.sql.DataFrame

2015-04-29 Thread Kalle Jepsen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518925#comment-14518925
 ] 

Kalle Jepsen commented on SPARK-7035:
-

I've created a PR to fix the error message in 
https://github.com/apache/spark/pull/5771. I didn't deem it necessary to open a 
JIRA for a minor change like this and hope that was the right thing to do.

 Drop __getattr__ on pyspark.sql.DataFrame
 -

 Key: SPARK-7035
 URL: https://issues.apache.org/jira/browse/SPARK-7035
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 1.4.0
Reporter: Kalle Jepsen

 I think the {{\_\_getattr\_\_}} method on the DataFrame should be removed.
 There is no point in having the possibility to address the DataFrames columns 
 as {{df.column}}, other than the questionable goal to please R developers. 
 And it seems R people can use Spark from their native API in the future.
 I see the following problems with {{\_\_getattr\_\_}} for column selection:
 * It's un-pythonic: There should only be one obvious way to solve a problem, 
 and we can already address columns on a DataFrame via the {{\_\_getitem\_\_}} 
 method, which in my opinion is by far superior and a lot more intuitive.
 * It leads to confusing Exceptions. When we mistype a method-name the 
 {{AttributeError}} will say 'No such column ... '.
 * And most importantly: we cannot load DataFrames that have columns with the 
 same name as any attribute on the DataFrame-object. Imagine having a 
 DataFrame with a column named {{cache}} or {{filter}}. Calling {{df.cache()}} 
 will be ambiguous and lead to broken code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6816) Add SparkConf API to configure SparkR

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6816:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Add SparkConf API to configure SparkR
 -

 Key: SPARK-6816
 URL: https://issues.apache.org/jira/browse/SPARK-6816
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

 Right now the only way to configure SparkR is to pass in arguments to 
 sparkR.init. The goal is to add an API similar to SparkConf on Scala/Python 
 to make configuration easier



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7223) Rename RPC askWithReply -> askWithRetry, sendWithReply -> ask

2015-04-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7223:
---

Assignee: Reynold Xin  (was: Apache Spark)

 Rename RPC askWithReply -> askWithRetry, sendWithReply -> ask
 -

 Key: SPARK-7223
 URL: https://issues.apache.org/jira/browse/SPARK-7223
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin

 Current naming is too confusing between askWithReply and sendWithReply.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7232) Add a Substitution batch for spark sql analyzer

2015-04-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7232:
---

Assignee: (was: Apache Spark)

 Add a Substitution batch for spark sql analyzer
 ---

 Key: SPARK-7232
 URL: https://issues.apache.org/jira/browse/SPARK-7232
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang

 Added a new batch named `Substitution` before the Resolution batch. The 
 motivation is that there are cases where we want to do some substitution on 
 the parsed logical plan before resolving it. 
 Consider these two cases:
 1. CTE: for a CTE we first build a raw logical plan
 'With Map(q1 -> 'Subquery q1
  'Project ['key]
   'UnresolvedRelation [src], None)
  'Project [*]
   'Filter ('key = 5)
    'UnresolvedRelation [q1], None
 The `With` logical plan holds a map of (q1 -> subquery); we first want to 
 remove the With command and substitute the q1 UnresolvedRelation with its 
 subquery.
 2. Another example is window functions: a user may define named windows, and 
 we also need to substitute the window name in the child by the concrete 
 window definition. This should also be done in the Substitution batch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7234) When codegen on DateType defaultPrimitive will throw type mismatch exception

2015-04-29 Thread Chen Song (JIRA)
Chen Song created SPARK-7234:


 Summary: When codegen on DateType defaultPrimitive will throw type 
mismatch exception
 Key: SPARK-7234
 URL: https://issues.apache.org/jira/browse/SPARK-7234
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Chen Song


When codegen is on, the defaultPrimitive of DateType is null. This raises the 
error below.

select COUNT(a) from table
a -> DateType

type mismatch;
 found   : Null(null)
 required: DateType.this.InternalType
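
(For reference, a self-contained toy sketch of the default-primitive idea; it is illustrative only, not the actual codegen code. Since DateType is internally an Int, the generated code needs an Int default such as 0 rather than an untyped null.)

{code}
object DefaultPrimitiveSketch {
  sealed trait SqlType
  case object IntT    extends SqlType
  case object DateT   extends SqlType   // internally an Int (days since epoch)
  case object StringT extends SqlType

  // Default value emitted into generated code for each type.
  def defaultPrimitive(t: SqlType): String = t match {
    case IntT | DateT => "0"      // an Int literal type-checks where the internal type is Int
    case StringT      => "null"   // reference types can still default to null
  }
}
{code}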




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7234) When codegen on DateType defaultPrimitive will throw type mismatch exception

2015-04-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7234:
---

Assignee: (was: Apache Spark)

 When codegen on DateType defaultPrimitive will throw type mismatch exception
 

 Key: SPARK-7234
 URL: https://issues.apache.org/jira/browse/SPARK-7234
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Chen Song

 When codegen is on, the defaultPrimitive of DateType is null. This raises the 
 error below.
 select COUNT(a) from table
 a -> DateType
 type mismatch;
  found   : Null(null)
  required: DateType.this.InternalType



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7234) When codegen on DateType defaultPrimitive will throw type mismatch exception

2015-04-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519176#comment-14519176
 ] 

Apache Spark commented on SPARK-7234:
-

User 'kaka1992' has created a pull request for this issue:
https://github.com/apache/spark/pull/5778

 When codegen on DateType defaultPrimitive will throw type mismatch exception
 

 Key: SPARK-7234
 URL: https://issues.apache.org/jira/browse/SPARK-7234
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Chen Song

 When codegen is on, the defaultPrimitive of DateType is null. This raises the 
 error below.
 select COUNT(a) from table
 a -> DateType
 type mismatch;
  found   : Null(null)
  required: DateType.this.InternalType



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7234) When codegen on DateType defaultPrimitive will throw type mismatch exception

2015-04-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7234:
---

Assignee: Apache Spark

 When codegen on DateType defaultPrimitive will throw type mismatch exception
 

 Key: SPARK-7234
 URL: https://issues.apache.org/jira/browse/SPARK-7234
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Chen Song
Assignee: Apache Spark

 When codegen is on, the defaultPrimitive of DateType is null. This raises the 
 error below.
 select COUNT(a) from table
 a -> DateType
 type mismatch;
  found   : Null(null)
  required: DateType.this.InternalType



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7233) ClosureCleaner#clean blocks concurrent job submitter threads

2015-04-29 Thread Oleksii Kostyliev (JIRA)
Oleksii Kostyliev created SPARK-7233:


 Summary: ClosureCleaner#clean blocks concurrent job submitter 
threads
 Key: SPARK-7233
 URL: https://issues.apache.org/jira/browse/SPARK-7233
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1, 1.4.0
Reporter: Oleksii Kostyliev


{{org.apache.spark.util.ClosureCleaner#clean}} method contains logic to 
determine if Spark is run in interpreter mode: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala#L120

While this behavior is indeed valuable in particular situations, it also causes 
concurrent submitter threads to be blocked on a native call to 
{{java.lang.Class#forName0}}, since it appears only one thread at a time can make 
the call.

This becomes a major issue when you have multiple threads concurrently 
submitting short-lived jobs. This is one of the patterns in which we use Spark 
in production, and the number of parallel requests is expected to be quite 
high, up to a couple of thousand at a time.

A typical stacktrace of a blocked thread looks like:
{code}
http-bio-8091-exec-14 [BLOCKED] [DAEMON]

java.lang.Class.forName0(String, boolean, ClassLoader, Class) Class.java 
(native)
java.lang.Class.forName(String) Class.java:260
org.apache.spark.util.ClosureCleaner$.clean(Object, boolean) 
ClosureCleaner.scala:122
org.apache.spark.SparkContext.clean(Object, boolean) SparkContext.scala:1623
org.apache.spark.rdd.RDD.reduce(Function2) RDD.scala:883
org.apache.spark.rdd.RDD.takeOrdered(int, Ordering) RDD.scala:1240
org.apache.spark.api.java.JavaRDDLike$class.takeOrdered(JavaRDDLike, int, 
Comparator) JavaRDDLike.scala:586
org.apache.spark.api.java.AbstractJavaRDDLike.takeOrdered(int, Comparator) 
JavaRDDLike.scala:46
...
{code}
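
(One possible mitigation, sketched below under the assumption that the interpreter check only needs to be computed once per JVM: cache the result of the Class.forName lookup in a lazy val instead of repeating the contended native call on every clean(). The class name in the sketch is an illustrative assumption; this is not the actual Spark change.)

{code}
object InterpreterCheckSketch {
  // Computed at most once per JVM, on first access, instead of on every clean() call.
  lazy val inInterpreter: Boolean =
    try {
      // Class name is an assumption for illustration only.
      Class.forName("org.apache.spark.repl.Main", false, getClass.getClassLoader)
      true
    } catch {
      case _: ClassNotFoundException => false
    }
}
{code}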



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6913) No suitable driver found loading JDBC dataframe using driver added by through SparkContext.addJar

2015-04-29 Thread Vyacheslav Baranov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519124#comment-14519124
 ] 

Vyacheslav Baranov commented on SPARK-6913:
---

The problem is in java.sql.DriverManager, which doesn't see drivers loaded by 
ClassLoaders other than the bootstrap ClassLoader.

The solution would be to create a proxy driver, included in the Spark assembly, 
that forwards all requests to the wrapped driver.

I have a working fix for this issue and am going to make a pull request soon.
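
(A hedged sketch of the proxy-driver idea described above; the class and variable names are assumptions, not the actual fix. The wrapper is a plain java.sql.Driver that DriverManager will accept, forwarding every call to the real driver, which may have been loaded by a different ClassLoader, e.g. via addJar.)

{code}
import java.sql.{Connection, Driver, DriverManager, DriverPropertyInfo}
import java.util.Properties
import java.util.logging.Logger

// Registered from the bootstrap-visible classpath; forwards everything to the real driver.
class DriverProxy(wrapped: Driver) extends Driver {
  override def connect(url: String, info: Properties): Connection = wrapped.connect(url, info)
  override def acceptsURL(url: String): Boolean = wrapped.acceptsURL(url)
  override def getPropertyInfo(url: String, info: Properties): Array[DriverPropertyInfo] =
    wrapped.getPropertyInfo(url, info)
  override def getMajorVersion(): Int = wrapped.getMajorVersion()
  override def getMinorVersion(): Int = wrapped.getMinorVersion()
  override def jdbcCompliant(): Boolean = wrapped.jdbcCompliant()
  override def getParentLogger(): Logger = wrapped.getParentLogger()
}

// Usage sketch: load the real driver through the current (e.g. task) ClassLoader
// and register the proxy so DriverManager accepts the connection request.
// val real = Class.forName("com.mysql.jdbc.Driver", true,
//   Thread.currentThread().getContextClassLoader).newInstance().asInstanceOf[Driver]
// DriverManager.registerDriver(new DriverProxy(real))
{code}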

 No suitable driver found loading JDBC dataframe using driver added by 
 through SparkContext.addJar
 ---

 Key: SPARK-6913
 URL: https://issues.apache.org/jira/browse/SPARK-6913
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Evan Yu

 val sc = new SparkContext(conf)
 sc.addJar("J:\mysql-connector-java-5.1.35.jar")
 val df = 
   sqlContext.jdbc("jdbc:mysql://localhost:3000/test_db?user=abc&password=123", 
   "table1")
 df.show()
 The following error occurs:
 2015-04-14 17:04:39,541 [task-result-getter-0] WARN  
 org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 
 0, dev1.test.dc2.com): java.sql.SQLException: No suitable driver found for 
 jdbc:mysql://localhost:3000/test_db?user=abc&password=123
   at java.sql.DriverManager.getConnection(DriverManager.java:689)
   at java.sql.DriverManager.getConnection(DriverManager.java:270)
   at 
 org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:158)
   at 
 org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:150)
   at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.init(JDBCRDD.scala:317)
   at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:309)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7196) decimal precision lost when loading DataFrame from JDBC

2015-04-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519158#comment-14519158
 ] 

Apache Spark commented on SPARK-7196:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/5777

 decimal precision lost when loading DataFrame from JDBC
 ---

 Key: SPARK-7196
 URL: https://issues.apache.org/jira/browse/SPARK-7196
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Ken Geis

 I have a decimal database field that is defined as 10.2 (i.e. ##.##). 
 When I load it into Spark via sqlContext.jdbc(..), the type of the 
 corresponding field in the DataFrame is DecimalType, with precisionInfo None. 
 Because of that loss of precision information, SPARK-4176 is triggered when I 
 try to .saveAsTable(..).
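
(For illustration, a hedged sketch of the kind of change the report suggests, not the actual patch: when building the schema from JDBC metadata, carry the column's precision and scale into DecimalType instead of dropping them. It assumes the DecimalType(precision, scale) constructor in org.apache.spark.sql.types.)

{code}
import java.sql.{ResultSetMetaData, Types}
import org.apache.spark.sql.types._

object JdbcDecimalSketch {
  // Map a DECIMAL/NUMERIC column to a DecimalType that keeps its precision info.
  def decimalFieldFor(meta: ResultSetMetaData, col: Int): DataType =
    meta.getColumnType(col) match {
      case Types.DECIMAL | Types.NUMERIC =>
        DecimalType(meta.getPrecision(col), meta.getScale(col)) // keep precision and scale
      case _ => StringType // fallback for this sketch only
    }
}
{code}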



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7196) decimal precision lost when loading DataFrame from JDBC

2015-04-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7196:
---

Assignee: Apache Spark

 decimal precision lost when loading DataFrame from JDBC
 ---

 Key: SPARK-7196
 URL: https://issues.apache.org/jira/browse/SPARK-7196
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Ken Geis
Assignee: Apache Spark

 I have a decimal database field that is defined as 10.2 (i.e. ##.##). 
 When I load it into Spark via sqlContext.jdbc(..), the type of the 
 corresponding field in the DataFrame is DecimalType, with precisionInfo None. 
 Because of that loss of precision information, SPARK-4176 is triggered when I 
 try to .saveAsTable(..).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7196) decimal precision lost when loading DataFrame from JDBC

2015-04-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7196:
---

Assignee: (was: Apache Spark)

 decimal precision lost when loading DataFrame from JDBC
 ---

 Key: SPARK-7196
 URL: https://issues.apache.org/jira/browse/SPARK-7196
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Ken Geis

 I have a decimal database field that is defined as 10.2 (i.e. ##.##). 
 When I load it into Spark via sqlContext.jdbc(..), the type of the 
 corresponding field in the DataFrame is DecimalType, with precisionInfo None. 
 Because of that loss of precision information, SPARK-4176 is triggered when I 
 try to .saveAsTable(..).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7233) ClosureCleaner#clean blocks concurrent job submitter threads

2015-04-29 Thread Oleksii Kostyliev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519169#comment-14519169
 ] 

Oleksii Kostyliev commented on SPARK-7233:
--

To illustrate the issue, I performed a test against local Spark.
Attached is the screenshot from the Threads view in Yourkit profiler.
The test was generating only 20 concurrent requests.
As you can see, job submitter threads mainly spend their time being blocked by 
each other.

 ClosureCleaner#clean blocks concurrent job submitter threads
 

 Key: SPARK-7233
 URL: https://issues.apache.org/jira/browse/SPARK-7233
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1, 1.4.0
Reporter: Oleksii Kostyliev
 Attachments: blocked_threads_closurecleaner.png


 {{org.apache.spark.util.ClosureCleaner#clean}} method contains logic to 
 determine if Spark is run in interpreter mode: 
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala#L120
 While this behavior is indeed valuable in particular situations, it also 
 causes concurrent submitter threads to be blocked on a native call 
 to {{java.lang.Class#forName0}}, since it appears only one thread at a time can 
 make the call.
 This becomes a major issue when you have multiple threads concurrently 
 submitting short-lived jobs. This is one of the patterns in which we use Spark 
 in production, and the number of parallel requests is expected to be quite 
 high, up to a couple of thousand at a time.
 A typical stacktrace of a blocked thread looks like:
 {code}
 http-bio-8091-exec-14 [BLOCKED] [DAEMON]
 java.lang.Class.forName0(String, boolean, ClassLoader, Class) Class.java 
 (native)
 java.lang.Class.forName(String) Class.java:260
 org.apache.spark.util.ClosureCleaner$.clean(Object, boolean) 
 ClosureCleaner.scala:122
 org.apache.spark.SparkContext.clean(Object, boolean) SparkContext.scala:1623
 org.apache.spark.rdd.RDD.reduce(Function2) RDD.scala:883
 org.apache.spark.rdd.RDD.takeOrdered(int, Ordering) RDD.scala:1240
 org.apache.spark.api.java.JavaRDDLike$class.takeOrdered(JavaRDDLike, int, 
 Comparator) JavaRDDLike.scala:586
 org.apache.spark.api.java.AbstractJavaRDDLike.takeOrdered(int, Comparator) 
 JavaRDDLike.scala:46
 ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7233) ClosureCleaner#clean blocks concurrent job submitter threads

2015-04-29 Thread Oleksii Kostyliev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleksii Kostyliev updated SPARK-7233:
-
Attachment: blocked_threads_closurecleaner.png

 ClosureCleaner#clean blocks concurrent job submitter threads
 

 Key: SPARK-7233
 URL: https://issues.apache.org/jira/browse/SPARK-7233
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1, 1.4.0
Reporter: Oleksii Kostyliev
 Attachments: blocked_threads_closurecleaner.png


 {{org.apache.spark.util.ClosureCleaner#clean}} method contains logic to 
 determine if Spark is run in interpreter mode: 
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala#L120
 While this behavior is indeed valuable in particular situations, it also 
 causes concurrent submitter threads to be blocked on a native call 
 to {{java.lang.Class#forName0}}, since it appears only one thread at a time can 
 make the call.
 This becomes a major issue when you have multiple threads concurrently 
 submitting short-lived jobs. This is one of the patterns in which we use Spark 
 in production, and the number of parallel requests is expected to be quite 
 high, up to a couple of thousand at a time.
 A typical stacktrace of a blocked thread looks like:
 {code}
 http-bio-8091-exec-14 [BLOCKED] [DAEMON]
 java.lang.Class.forName0(String, boolean, ClassLoader, Class) Class.java 
 (native)
 java.lang.Class.forName(String) Class.java:260
 org.apache.spark.util.ClosureCleaner$.clean(Object, boolean) 
 ClosureCleaner.scala:122
 org.apache.spark.SparkContext.clean(Object, boolean) SparkContext.scala:1623
 org.apache.spark.rdd.RDD.reduce(Function2) RDD.scala:883
 org.apache.spark.rdd.RDD.takeOrdered(int, Ordering) RDD.scala:1240
 org.apache.spark.api.java.JavaRDDLike$class.takeOrdered(JavaRDDLike, int, 
 Comparator) JavaRDDLike.scala:586
 org.apache.spark.api.java.AbstractJavaRDDLike.takeOrdered(int, Comparator) 
 JavaRDDLike.scala:46
 ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7077) Binary processing hash table for aggregation

2015-04-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7077.

   Resolution: Fixed
Fix Version/s: 1.4.0

 Binary processing hash table for aggregation
 

 Key: SPARK-7077
 URL: https://issues.apache.org/jira/browse/SPARK-7077
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin
Assignee: Josh Rosen
 Fix For: 1.4.0


 Let's start with a hash table for aggregations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6838) Explore using Reference Classes instead of S4 objects

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6838:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Explore using Reference Classes instead of S4 objects
 -

 Key: SPARK-6838
 URL: https://issues.apache.org/jira/browse/SPARK-6838
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

 The current RDD and PipelinedRDD are represented as S4 objects. R has a newer 
 OO system: Reference Classes (RC or R5). It is a more message-passing style of 
 OO, and instances are mutable objects. It is not an important issue, and it 
 should also require only trivial work. It could also remove the kind-of awkward 
 "@" operator of S4.
 R6 is also worth checking out. It feels closer to an ordinary object-oriented 
 language. https://github.com/wch/R6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7157) Add approximate stratified sampling to DataFrame

2015-04-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7157:
---
Description: def sampleBy(c

 Add approximate stratified sampling to DataFrame
 

 Key: SPARK-7157
 URL: https://issues.apache.org/jira/browse/SPARK-7157
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Joseph K. Bradley
Priority: Minor

 def sampleBy(c



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7202) Add SparseMatrixPickler to SerDe

2015-04-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519021#comment-14519021
 ] 

Apache Spark commented on SPARK-7202:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/5775

 Add SparseMatrixPickler to SerDe
 

 Key: SPARK-7202
 URL: https://issues.apache.org/jira/browse/SPARK-7202
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Manoj Kumar

 We need a SparseMatrixPickler similar to the existing DenseMatrixPickler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7227) Support fillna / dropna in R DataFrame

2015-04-29 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-7227:
--

 Summary: Support fillna / dropna in R DataFrame
 Key: SPARK-7227
 URL: https://issues.apache.org/jira/browse/SPARK-7227
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7232) Add a Substitution batch for spark sql analyzer

2015-04-29 Thread Fei Wang (JIRA)
Fei Wang created SPARK-7232:
---

 Summary: Add a Substitution batch for spark sql analyzer
 Key: SPARK-7232
 URL: https://issues.apache.org/jira/browse/SPARK-7232
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang


Added a new batch named `Substitution` before the Resolution batch. The 
motivation is that there are cases where we want to do some substitution on the 
parsed logical plan before resolving it. 
Consider these two cases:
1. CTE: for a CTE we first build a raw logical plan
'With Map(q1 -> 'Subquery q1
 'Project ['key]
   'UnresolvedRelation [src], None)
 'Project [*]
  'Filter ('key = 5)
   'UnresolvedRelation [q1], None

The `With` logical plan holds a map of (q1 -> subquery); we first want to 
remove the With command and substitute the q1 UnresolvedRelation with its 
subquery.

2. Another example is window functions: a user may define named windows, and we 
also need to substitute the window name in the child by the concrete window 
definition. This should also be done in the Substitution batch. A toy sketch of 
the CTE substitution follows below.
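
(For reference, a self-contained toy sketch of the CTE substitution step, in plain Scala; these are not Catalyst classes or the proposed rule. The With node is removed and each reference to a named CTE is replaced by its definition.)

{code}
// Toy plan tree for illustration only.
sealed trait Plan
case class Relation(name: String)                  extends Plan
case class Project(cols: Seq[String], child: Plan) extends Plan
case class Filter(cond: String, child: Plan)       extends Plan
case class With(ctes: Map[String, Plan], child: Plan) extends Plan

object SubstitutionSketch {
  def substituteCtes(plan: Plan): Plan = plan match {
    case With(ctes, child) =>
      def subst(p: Plan): Plan = p match {
        case Relation(n) if ctes.contains(n) => subst(ctes(n)) // replace q1 by its subquery
        case Project(cs, c) => Project(cs, subst(c))
        case Filter(f, c)   => Filter(f, subst(c))
        case other          => other
      }
      subst(child) // the With node itself disappears after substitution
    case other => other
  }
}
{code}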




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7225) CombineLimits optimizer does not work

2015-04-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7225:
---

Assignee: (was: Apache Spark)

 CombineLimits optimizer does not work
 -

 Key: SPARK-7225
 URL: https://issues.apache.org/jira/browse/SPARK-7225
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Zhongshuai Pei

 The optimized logical plan of {{select key from (select key from src limit 
 100) t2 limit 10}} looks like this: 
 {quote}
 == Optimized Logical Plan ==
 Limit 10
  Limit 100
   Project [key#3]
    MetastoreRelation default, src, None
 {quote}
 It did not combine the limits.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7205) Support local ivy cache in --packages

2015-04-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-7205.

Resolution: Fixed
  Assignee: Burak Yavuz

 Support local ivy cache in --packages
 -

 Key: SPARK-7205
 URL: https://issues.apache.org/jira/browse/SPARK-7205
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Reporter: Burak Yavuz
Assignee: Burak Yavuz
Priority: Critical
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7222) Added mathematical derivation in comment and compressed the model to LinearRegression with ElasticNet

2015-04-29 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-7222:
---
Description: Added a detailed mathematical derivation of how scaling and 
LeastSquaresAggregator work. Also refactored the code so the model is 
compressed based on the storage. We may try compressing based on the prediction 
time. TODO: add a test that fails when the correction terms are not correctly 
computed.  (was: Added a detailed mathematical derivation of how scaling and 
LeastSquaresAggregator work. Also refactored the code. TODO: add a test that 
fails when the correction terms are not correctly computed.)

 Added mathematical derivation in comment and compressed the model to 
 LinearRegression with ElasticNet
 -

 Key: SPARK-7222
 URL: https://issues.apache.org/jira/browse/SPARK-7222
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: DB Tsai

 Added a detailed mathematical derivation of how scaling and 
 LeastSquaresAggregator work. Also refactored the code so the model is 
 compressed based on the storage. We may try compressing based on the prediction 
 time. TODO: add a test that fails when the correction terms are not 
 correctly computed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7213) Exception while copying Hadoop config files due to permission issues

2015-04-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7213:
---
Component/s: YARN

 Exception while copying Hadoop config files due to permission issues
 

 Key: SPARK-7213
 URL: https://issues.apache.org/jira/browse/SPARK-7213
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Nishkam Ravi





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException

2015-04-29 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518807#comment-14518807
 ] 

Imran Rashid commented on SPARK-5945:
-

[~kayousterhout] can you please clarify -- did you want to just hardcode to 4, 
or did you want to reuse {{spark.task.maxFailures}} for stage failures as well?

 Spark should not retry a stage infinitely on a FetchFailedException
 ---

 Key: SPARK-5945
 URL: https://issues.apache.org/jira/browse/SPARK-5945
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Imran Rashid
Assignee: Ilya Ganelin

 While investigating SPARK-5928, I noticed some very strange behavior in the 
 way spark retries stages after a FetchFailedException.  It seems that on a 
 FetchFailedException, instead of simply killing the task and retrying, Spark 
 aborts the stage and retries.  If it just retried the task, the task might 
 fail 4 times and then trigger the usual job killing mechanism.  But by 
 killing the stage instead, the max retry logic is skipped (it looks to me 
 like there is no limit for retries on a stage).
 After a bit of discussion with Kay Ousterhout, it seems the idea is that if a 
 fetch fails, we assume that the block manager we are fetching from has 
 failed, and that it will succeed if we retry the stage w/out that block 
 manager.  In that case, it wouldn't make any sense to retry the task, since 
 it's doomed to fail every time, so we might as well kill the whole stage.  But 
 this raises two questions:
 1) Is it really safe to assume that a FetchFailedException means that the 
 BlockManager has failed, and it will work if we just try another one?  
 SPARK-5928 shows that there are at least some cases where that assumption is 
 wrong.  Even if we fix that case, this logic seems brittle to the next case 
 we find.  I guess the idea is that this behavior is what gives us the R in 
 RDD ... but it seems like it's not really that robust and maybe should be 
 reconsidered.
 2) Should stages only be retried a limited number of times?  It would be 
 pretty easy to put in a limited number of retries per stage.  Though again, 
 we encounter issues with keeping things resilient.  Theoretically one stage 
 could have many retries caused by failures in different stages further 
 downstream, so we might need to track the cause of each retry as well to 
 still get the desired behavior.
 In general it just seems there is some flakiness in the retry logic.  This is 
 the only reproducible example I have at the moment, but I vaguely recall 
 hitting other cases of strange behavior w/ retries when trying to run long 
 pipelines.  E.g., if one executor is stuck in a GC during a fetch, the fetch 
 fails, but the executor eventually comes back and the stage gets retried 
 again, but the same GC issues happen the second time around, etc.
 Copied from SPARK-5928, here's the example program that can regularly produce 
 a loop of stage failures.  Note that it will only fail from a remote fetch, 
 so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell 
 --num-executors 2 --executor-memory 4000m}}
 {code}
 val rdd = sc.parallelize(1 to 1e6.toInt, 1).map { ignore =>
   val n = 3e3.toInt
   val arr = new Array[Byte](n)
   //need to make sure the array doesn't compress to something small
   scala.util.Random.nextBytes(arr)
   arr
 }
 rdd.map { x => (1, x) }.groupByKey().count()
 {code}
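 Not Spark's actual DAGScheduler code -- just a minimal sketch of the kind of bounded, 
 per-stage retry bookkeeping being discussed here (the names and the cap of 4 are 
 assumptions for illustration):
 {code}
 import scala.collection.mutable

 val maxStageAttempts = 4  // hypothetical cap, analogous in spirit to spark.task.maxFailures
 val fetchFailureCounts = mutable.Map.empty[Int, Int].withDefaultValue(0)

 def onFetchFailed(stageId: Int, resubmit: Int => Unit): Unit = {
   fetchFailureCounts(stageId) += 1
   if (fetchFailureCounts(stageId) >= maxStageAttempts) {
     sys.error(s"Stage $stageId aborted after ${fetchFailureCounts(stageId)} fetch-failure retries")
   } else {
     resubmit(stageId)  // retry the whole stage, as today, but now with an upper bound
   }
 }

 def onStageSuccess(stageId: Int): Unit = fetchFailureCounts -= stageId  // reset on success
 {code}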



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6803) [SparkR] Support SparkR Streaming

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6803:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 [SparkR] Support SparkR Streaming
 -

 Key: SPARK-6803
 URL: https://issues.apache.org/jira/browse/SPARK-6803
 Project: Spark
  Issue Type: New Feature
  Components: SparkR, Streaming
Reporter: Hao
 Fix For: 1.4.0


 Adds R API for Spark Streaming.
 An experimental version is presented in repo [1], which follows the PySpark 
 streaming design. Also, this work can be further broken down into sub-task 
 issues.
 [1] https://github.com/hlin09/spark/tree/SparkR-streaming/ 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3601) Kryo NPE for output operations on Avro complex Objects even after registering.

2015-04-29 Thread Nicolas PHUNG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519005#comment-14519005
 ] 

Nicolas PHUNG commented on SPARK-3601:
--

For GenericData.Array Avro, I use the following snippet from 
[Flink|https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/typeutils/runtime/kryo/Serializers.java]:

{code}
// Avoid issue with Avro array serialization: https://issues.apache.org/jira/browse/FLINK-1391
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.serializers.CollectionSerializer;

import java.io.Serializable;
import java.util.Collection;

public class Serializers {
    /**
     * Special serializer for Java collections enforcing certain instance types.
     * Avro is serializing collections with a GenericData.Array type. Kryo is not able to handle
     * this type, so we use ArrayLists.
     */
    public static class SpecificInstanceCollectionSerializer<T extends java.util.ArrayList<?>>
            extends CollectionSerializer implements Serializable {
        private Class<T> type;

        public SpecificInstanceCollectionSerializer(Class<T> type) {
            this.type = type;
        }

        @Override
        protected Collection create(Kryo kryo, Input input, Class<Collection> type) {
            return kryo.newInstance(this.type);
        }

        @Override
        protected Collection createCopy(Kryo kryo, Collection original) {
            return kryo.newInstance(this.type);
        }
    }
}
{code}

And I have registered it with Kryo using the following Scala code:

{code}
kryo.register(classOf[GenericData.Array[_]], new 
SpecificInstanceCollectionSerializer(classOf[java.util.ArrayList[_]]));
{code}

I hope this helps.

 Kryo NPE for output operations on Avro complex Objects even after registering.
 --

 Key: SPARK-3601
 URL: https://issues.apache.org/jira/browse/SPARK-3601
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: local, standalone cluster
Reporter: mohan gaddam

 The Kryo serializer works well when Avro objects hold simple data, but when the 
 same Avro object has complex data (like unions/arrays), Kryo fails during output 
 operations, while map operations work fine. Note that I have registered all the 
 Avro-generated classes with Kryo. I'm using Java as the programming language.
 when used complex message throws NPE, stack trace as follows:
 ==
 ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 
 ms.0 
 org.apache.spark.SparkException: Job aborted due to stage failure: Exception 
 while getting task result: com.esotericsoftware.kryo.KryoException: 
 java.lang.NullPointerException 
 Serialization trace: 
 value (xyz.Datum) 
 data (xyz.ResMsg) 
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
  
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
  
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
  
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
 at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) 
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
  
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
  
 at scala.Option.foreach(Option.scala:236) 
 at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
  
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
  
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) 
 at akka.actor.ActorCell.invoke(ActorCell.scala:456) 
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) 
 at akka.dispatch.Mailbox.run(Mailbox.scala:219) 
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
  
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
  
 In the above exception, Datum and ResMsg are project specific classes 
 generated by avro using the below avdl snippet:
 ==
 record KeyValueObject { 
 union{boolean, int, long, float, double, bytes, string} name; 
 union {boolean, int, long, float, double, bytes, string, 
 array<union{boolean, int, long, float, double, bytes, string, 
 KeyValueObject}>, 

[jira] [Updated] (SPARK-6833) Extend `addPackage` so that any given R file can be sourced in the worker before functions are run.

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6833:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Extend `addPackage` so that any given R file can be sourced in the worker 
 before functions are run.
 ---

 Key: SPARK-6833
 URL: https://issues.apache.org/jira/browse/SPARK-6833
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

 Similar to how extra Python files or packages can be specified (in zip / egg 
 formats), it would be good to support the ability to add extra R files to the 
 executors' working directory.
 One thing that needs to be investigated is whether this will just work out of 
 the box using the spark-submit flag --files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6813) SparkR style guide

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6813:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 SparkR style guide
 --

 Key: SPARK-6813
 URL: https://issues.apache.org/jira/browse/SPARK-6813
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Shivaram Venkataraman

 We should develop a SparkR style guide document based on some of the 
 guidelines we already use and some of the best practices in R.
 Some examples of R style guides are:
 http://r-pkgs.had.co.nz/r.html#style 
 http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html
 A related issue is to work on an automatic style-checking tool. 
 https://github.com/jimhester/lintr seems promising
 We could have an R style guide based on the one from Google [1], and adjust 
 some of it to match the conventions used in Spark:
 1. Line length: maximum 100 characters
 2. No limit on function name length (the API should be similar to other languages)
 3. Allow S4 objects/methods



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7222) Added mathematical derivation in comment to LinearRegression with ElasticNet

2015-04-29 Thread DB Tsai (JIRA)
DB Tsai created SPARK-7222:
--

 Summary: Added mathematical derivation in comment to 
LinearRegression with ElasticNet
 Key: SPARK-7222
 URL: https://issues.apache.org/jira/browse/SPARK-7222
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: DB Tsai


Added a detailed mathematical derivation of how scaling and 
LeastSquaresAggregator work. Also refactored the code. TODO: Add a test that 
fails when the correction terms are not correctly computed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7225) CombineLimits optimizer does not work

2015-04-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7225:
---

Assignee: Apache Spark

 CombineLimits optimizer does not work
 -

 Key: SPARK-7225
 URL: https://issues.apache.org/jira/browse/SPARK-7225
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Zhongshuai Pei
Assignee: Apache Spark

 The optimized logical plan of "select key from (select key from src limit 
 100) t2 limit 10" looks like this:
 {quote}
 == Optimized Logical Plan ==
 Limit 10
  Limit 100
   Project [key#3]
    MetastoreRelation default, src, None
 {quote}
 The CombineLimits rule did not fire.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7080) Binary processing based aggregate operator

2015-04-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7080.

   Resolution: Fixed
Fix Version/s: 1.4.0

 Binary processing based aggregate operator
 --

 Key: SPARK-7080
 URL: https://issues.apache.org/jira/browse/SPARK-7080
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin
Assignee: Josh Rosen
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6809) Make numPartitions optional in pairRDD APIs

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6809:
-
Priority: Major  (was: Critical)

 Make numPartitions optional in pairRDD APIs
 ---

 Key: SPARK-6809
 URL: https://issues.apache.org/jira/browse/SPARK-6809
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Davies Liu





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7224) Mock repositories for testing with --packages

2015-04-29 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-7224:
--

 Summary: Mock repositories for testing with --packages
 Key: SPARK-7224
 URL: https://issues.apache.org/jira/browse/SPARK-7224
 Project: Spark
  Issue Type: Test
  Components: Spark Submit
Reporter: Burak Yavuz
Priority: Critical


Create mock repositories (folders with jars and POM files in Maven layout) for 
testing --packages without the need for an internet connection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6820) Convert NAs to null type in SparkR DataFrames

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6820:
-
Priority: Critical  (was: Major)

 Convert NAs to null type in SparkR DataFrames
 -

 Key: SPARK-6820
 URL: https://issues.apache.org/jira/browse/SPARK-6820
 Project: Spark
  Issue Type: New Feature
  Components: SparkR, SQL
Reporter: Shivaram Venkataraman
Priority: Critical

 While converting an RDD or a local R DataFrame to a SparkR DataFrame we need to 
 handle missing values or NAs.
 We should convert NAs to SparkSQL's null type to handle the conversion 
 correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6799) Add dataframe examples for SparkR

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6799:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-7228

 Add dataframe examples for SparkR
 -

 Key: SPARK-6799
 URL: https://issues.apache.org/jira/browse/SPARK-6799
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Critical

 We should add more DataFrame usage examples for SparkR. These can be similar 
 to the Python examples at 
 https://github.com/apache/spark/blob/1b2aab8d5b9cc2ff702506038bd71aa8debe7ca0/examples/src/main/python/sql.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7202) Add SparseMatrixPickler to SerDe

2015-04-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7202:
---

Assignee: (was: Apache Spark)

 Add SparseMatrixPickler to SerDe
 

 Key: SPARK-7202
 URL: https://issues.apache.org/jira/browse/SPARK-7202
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Manoj Kumar

 We need a SparseMatrixPickler similar to the existing DenseMatrixPickler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6809) Make numPartitions optional in pairRDD APIs

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6809:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Make numPartitions optional in pairRDD APIs
 ---

 Key: SPARK-6809
 URL: https://issues.apache.org/jira/browse/SPARK-6809
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Davies Liu





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7225) CombineLimits do not works

2015-04-29 Thread Zhongshuai Pei (JIRA)
Zhongshuai Pei created SPARK-7225:
-

 Summary: CombineLimits do not works
 Key: SPARK-7225
 URL: https://issues.apache.org/jira/browse/SPARK-7225
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Zhongshuai Pei






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6826) `hashCode` support for arbitrary R objects

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6826:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 `hashCode` support for arbitrary R objects
 --

 Key: SPARK-6826
 URL: https://issues.apache.org/jira/browse/SPARK-6826
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Shivaram Venkataraman

 From the SparkR JIRA
 digest::digest looks interesting, but it seems to be more heavyweight than 
 our requirements call for. One relatively easy way to do this is to serialize the 
 given R object into a string (serialize(object, ascii=T)) and then just call 
 the string hashCode function on this. FWIW it looks like digest follows a 
 similar strategy where the md5sum / shasum etc. are calculated on serialized 
 objects.
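 Sketched below in Scala purely for illustration; the SparkR version would do the 
 equivalent in R via serialize(object, ascii=T) followed by a string hashCode:
 {code}
 import java.io.{ByteArrayOutputStream, ObjectOutputStream}

 // Serialize the object, then hash the string form of the serialized bytes.
 def objectHashCode(obj: Serializable): Int = {
   val buffer = new ByteArrayOutputStream()
   val out = new ObjectOutputStream(buffer)
   out.writeObject(obj)
   out.close()
   new String(buffer.toByteArray, "ISO-8859-1").hashCode  // ISO-8859-1 preserves bytes 1:1
 }
 {code}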



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7032) SparkSQL incorrect results when using UNION/EXCEPT with GROUP BY clause

2015-04-29 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518944#comment-14518944
 ] 

Reynold Xin commented on SPARK-7032:


cc [~cloud_fan] would you have time to take a look at this?


 SparkSQL incorrect results when using UNION/EXCEPT with GROUP BY clause
 ---

 Key: SPARK-7032
 URL: https://issues.apache.org/jira/browse/SPARK-7032
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.2, 1.3.1
Reporter: Lior Chaga

 When using a UNION/EXCEPT clause together with a GROUP BY clause in Spark SQL, 
 the results do not match what is expected.
 In the following example, only 1 record should be in the first table and not in 
 the second (since, when grouping by the key field, the counter for key=1 is 10 
 in both tables).
 Each of the clauses by itself works properly when run on its own. 
 {code}
 //import com.addthis.metrics.reporter.config.ReporterConfig;
 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.sql.api.java.JavaSQLContext;
 import org.apache.spark.sql.api.java.Row;
 import java.io.IOException;
 import java.io.Serializable;
 import java.util.ArrayList;
 import java.util.List;

 public class SimpleApp {
     public static void main(String[] args) throws IOException {
         SparkConf conf = new SparkConf().setAppName("Simple Application")
                 .setMaster("local[1]");
         JavaSparkContext sc = new JavaSparkContext(conf);
         List<MyObject> firstList = new ArrayList<MyObject>(2);
         firstList.add(new MyObject(1, 10));
         firstList.add(new MyObject(2, 10));
         List<MyObject> secondList = new ArrayList<MyObject>(3);
         secondList.add(new MyObject(1, 4));
         secondList.add(new MyObject(1, 6));
         secondList.add(new MyObject(2, 8));
         JavaRDD<MyObject> firstRdd = sc.parallelize(firstList);
         JavaRDD<MyObject> secondRdd = sc.parallelize(secondList);
         JavaSQLContext sqlc = new JavaSQLContext(sc);
         sqlc.applySchema(firstRdd, MyObject.class).registerTempTable("table1");
         sqlc.sqlContext().cacheTable("table1");
         sqlc.applySchema(secondRdd, MyObject.class).registerTempTable("table2");
         sqlc.sqlContext().cacheTable("table2");
         List<Row> firstMinusSecond = sqlc.sql(
                 "SELECT key, counter FROM table1 " +
                 "EXCEPT " +
                 "SELECT key, SUM(counter) FROM table2 " +
                 "GROUP BY key ").collect();
         System.out.println("num of rows in first but not in second = ["
                 + firstMinusSecond.size() + "]");
         sc.close();
         System.exit(0);
     }

     public static class MyObject implements Serializable {
         public MyObject(Integer key, Integer counter) {
             this.key = key;
             this.counter = counter;
         }
         private Integer key;
         private Integer counter;
         public Integer getKey() {
             return key;
         }
         public void setKey(Integer key) {
             this.key = key;
         }
         public Integer getCounter() {
             return counter;
         }
         public void setCounter(Integer counter) {
             this.counter = counter;
         }
     }
 }
 {code}
 Same example, give or take, with DataFrames: without groupBy it works fine, 
 but with groupBy I get 2 rows instead of 1:
 {code}
 SparkConf conf = new SparkConf().setAppName("Simple Application")
         .setMaster("local[1]");
 JavaSparkContext sc = new JavaSparkContext(conf);
 List<MyObject> firstList = new ArrayList<MyObject>(2);
 firstList.add(new MyObject(1, 10));
 firstList.add(new MyObject(2, 10));
 List<MyObject> secondList = new ArrayList<MyObject>(3);
 secondList.add(new MyObject(1, 10));
 secondList.add(new MyObject(2, 8));
 JavaRDD<MyObject> firstRdd = sc.parallelize(firstList);
 JavaRDD<MyObject> secondRdd = sc.parallelize(secondList);
 SQLContext sqlc = new SQLContext(sc);
 DataFrame firstDataFrame = sqlc.createDataFrame(firstRdd, MyObject.class);
 DataFrame secondDataFrame = sqlc.createDataFrame(secondRdd, MyObject.class);
 Row[] collect = firstDataFrame.except(secondDataFrame).collect();
 System.out.println("num of rows in first but not in second = ["
         + collect.length + "]");
 DataFrame secondAggregated = secondDataFrame.groupBy("key").sum("counter");
 Row[] collectAgg = firstDataFrame.except(secondAggregated).collect();
 System.out.println("num of rows in first but not in second = ["
         + collectAgg.length + "]"); // should be 1 row, but there are 2
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Created] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-04-29 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-7230:


 Summary: Make RDD API private in SparkR for Spark 1.4
 Key: SPARK-7230
 URL: https://issues.apache.org/jira/browse/SPARK-7230
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Critical


This ticket proposes making the RDD API in SparkR private for the 1.4 release. 
The motivation for doing so are discussed in a larger design document aimed at 
a more top-down design of the SparkR APIs. A first cut that discusses 
motivation and proposed changes can be found at http://goo.gl/GLHKZI

The main points in that document that relate to this ticket are:
- The RDD API requires knowledge of the distributed system and is pretty low 
level. This is not very suitable for a number of R users who are used to more 
high-level packages that work out of the box.
- The RDD implementation in SparkR is not fully robust right now: we are 
missing features like spilling for aggregation, handling partitions which don't 
fit in memory etc. There are further limitations like lack of hashCode for 
non-native types etc. which might affect user experience.

The only change we will make for now is to not export the RDD functions as 
public methods in the SparkR package, and I will create another ticket to 
discuss the public API in more detail for 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7224) Mock repositories for testing with --packages

2015-04-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-7224:
---
Assignee: Burak Yavuz

 Mock repositories for testing with --packages
 -

 Key: SPARK-7224
 URL: https://issues.apache.org/jira/browse/SPARK-7224
 Project: Spark
  Issue Type: Test
  Components: Spark Submit
Reporter: Burak Yavuz
Assignee: Burak Yavuz
Priority: Critical

 Create mock repositories (folders with jars and POM files in Maven layout) for 
 testing --packages without the need for an internet connection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7076) Binary processing compact tuple representation

2015-04-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7076.

   Resolution: Fixed
Fix Version/s: 1.4.0

 Binary processing compact tuple representation
 --

 Key: SPARK-7076
 URL: https://issues.apache.org/jira/browse/SPARK-7076
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin
Assignee: Josh Rosen
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7232) Add a Substitution batch for spark sql analyzer

2015-04-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519090#comment-14519090
 ] 

Apache Spark commented on SPARK-7232:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/5776

 Add a Substitution batch for spark sql analyzer
 ---

 Key: SPARK-7232
 URL: https://issues.apache.org/jira/browse/SPARK-7232
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang

 Added a new batch named `Substitution` before the Resolution batch. The 
 motivation for this is that there are cases where we want to do some 
 substitution on the parsed logical plan before resolving it. 
 Consider these two cases:
 1. CTE: for a CTE we first build a raw logical plan
 'With Map(q1 -> 'Subquery q1
  'Project ['key]
   'UnresolvedRelation [src], None)
  'Project [*]
   'Filter ('key = 5)
    'UnresolvedRelation [q1], None
 In the `With` logical plan there is a map storing (q1 -> subquery); we want to 
 first take off the With command and substitute the q1 of UnresolvedRelation 
 with the subquery.
 2. Another example is window functions: a user may define some windows, and we 
 also need to substitute the window name in the child with the concrete window 
 definition. This should also be done in the Substitution batch.
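 A simplified, self-contained sketch of the CTE substitution idea (hypothetical mini 
 plan types for illustration, not the actual Catalyst classes):
 {code}
 sealed trait Plan
 case class UnresolvedRelation(name: String) extends Plan
 case class Project(columns: Seq[String], child: Plan) extends Plan
 case class Filter(condition: String, child: Plan) extends Plan
 case class With(child: Plan, cteRelations: Map[String, Plan]) extends Plan

 // Strip the With node and replace each UnresolvedRelation that names a CTE by its subquery.
 def substituteCTE(plan: Plan, ctes: Map[String, Plan]): Plan = plan match {
   case With(child, relations)   => substituteCTE(child, ctes ++ relations)
   case UnresolvedRelation(name) => ctes.getOrElse(name, UnresolvedRelation(name))
   case Project(cols, child)     => Project(cols, substituteCTE(child, ctes))
   case Filter(cond, child)      => Filter(cond, substituteCTE(child, ctes))
 }

 // The parsed plan from the example above, run through the substitution:
 // substituteCTE(
 //   With(Project(Seq("*"), Filter("key = 5", UnresolvedRelation("q1"))),
 //        Map("q1" -> Project(Seq("key"), UnresolvedRelation("src")))),
 //   Map.empty)
 // == Project(Seq("*"), Filter("key = 5", Project(Seq("key"), UnresolvedRelation("src"))))
 {code}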



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7226) Support math functions in R DataFrame

2015-04-29 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-7226:
--

 Summary: Support math functions in R DataFrame
 Key: SPARK-7226
 URL: https://issues.apache.org/jira/browse/SPARK-7226
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7204) Call sites in UI are not accurate for DataFrame operations

2015-04-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7204.

   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.2

 Call sites in UI are not accurate for DataFrame operations
 --

 Key: SPARK-7204
 URL: https://issues.apache.org/jira/browse/SPARK-7204
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Critical
 Fix For: 1.3.2, 1.4.0


 Spark core computes call sites by climbing up the stack until we reach the 
 stack frame at the boundary of user code and Spark code. The way we compute 
 whether a given frame is internal (Spark) or user code does not work 
 correctly with the new DataFrame API.
 Once the scope work goes in, we'll have a nicer way to express units of 
 operator scope, but until then there is a simple fix where we just make sure 
 the SQL internal classes are also skipped as we climb up the stack.
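 A conceptual sketch of that interim fix (not the actual Utils code; the prefixes 
 below are illustrative):
 {code}
 // Treat Spark-internal frames -- now including the SQL / DataFrame classes -- as non-user code.
 def isInternalFrame(className: String): Boolean =
   className.startsWith("org.apache.spark.") || className.startsWith("scala.")

 // The reported call site is the first frame that is not internal.
 def firstUserFrame(): Option[StackTraceElement] =
   Thread.currentThread().getStackTrace
     .drop(1)  // skip the getStackTrace frame itself
     .find(f => !isInternalFrame(f.getClassName))
 {code}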



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7232) Add a Substitution batch for spark sql analyzer

2015-04-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7232:
---

Assignee: Apache Spark

 Add a Substitution batch for spark sql analyzer
 ---

 Key: SPARK-7232
 URL: https://issues.apache.org/jira/browse/SPARK-7232
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang
Assignee: Apache Spark

 Added a new batch named `Substitution` before the Resolution batch. The 
 motivation for this is that there are cases where we want to do some 
 substitution on the parsed logical plan before resolving it. 
 Consider these two cases:
 1. CTE: for a CTE we first build a raw logical plan
 'With Map(q1 -> 'Subquery q1
  'Project ['key]
   'UnresolvedRelation [src], None)
  'Project [*]
   'Filter ('key = 5)
    'UnresolvedRelation [q1], None
 In the `With` logical plan there is a map storing (q1 -> subquery); we want to 
 first take off the With command and substitute the q1 of UnresolvedRelation 
 with the subquery.
 2. Another example is window functions: a user may define some windows, and we 
 also need to substitute the window name in the child with the concrete window 
 definition. This should also be done in the Substitution batch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7223) Rename RPC askWithReply -> askWithRetry, sendWithReply -> ask

2015-04-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518778#comment-14518778
 ] 

Apache Spark commented on SPARK-7223:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5768

 Rename RPC askWithReply -> askWithRetry, sendWithReply -> ask
 -

 Key: SPARK-7223
 URL: https://issues.apache.org/jira/browse/SPARK-7223
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin

 Current naming is too confusing between askWithReply and sendWithReply.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7229) SpecificMutableRow should take integer type as internal representation for DateType

2015-04-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7229:
---

Assignee: Apache Spark

 SpecificMutableRow should take integer type as internal representation for 
 DateType
 ---

 Key: SPARK-7229
 URL: https://issues.apache.org/jira/browse/SPARK-7229
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Assignee: Apache Spark

 {code}
   test("test DATE types in cache") {
     val rows = TestSQLContext.jdbc(urlWithUserAndPass, "TEST.TIMETYPES").collect()
     TestSQLContext.jdbc(urlWithUserAndPass, "TEST.TIMETYPES")
       .cache().registerTempTable("mycached_date")
     val cachedRows = sql("select * from mycached_date").collect()
     assert(rows(0).getAs[java.sql.Date](1) === java.sql.Date.valueOf("1996-01-01"))
     assert(cachedRows(0).getAs[java.sql.Date](1) === java.sql.Date.valueOf("1996-01-01"))
   }
 {code}
 java.lang.ClassCastException: 
 org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
 org.apache.spark.sql.catalyst.expressions.MutableInt
   at 
 org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.getInt(SpecificMutableRow.scala:252)
   at 
 org.apache.spark.sql.columnar.IntColumnStats.gatherStats(ColumnStats.scala:208)
   at 
 org.apache.spark.sql.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:56)
   at 
 org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:87)
   at 
 org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78)
   at 
 org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:87)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:148)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:124)
   at 
 org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:277)
   at 
 org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
   at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5890) Add FeatureDiscretizer

2015-04-29 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518862#comment-14518862
 ] 

Xusen Yin commented on SPARK-5890:
--

I am starting to work on it.

 Add FeatureDiscretizer
 --

 Key: SPARK-5890
 URL: https://issues.apache.org/jira/browse/SPARK-5890
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng

 A `FeatureDiscretizer` takes a column with continuous features and outputs a 
 column with binned categorical features.
 {code}
 val fd = new FeatureDiscretizer()
   .setInputCol("age")
   .setNumBins(32)
   .setOutputCol("ageBins")
 {code}
 This should be an automatic feature discretizer, which uses a simple algorithm 
 like approximate quantiles to discretize features. It should set the ML 
 attribute correctly in the output column.
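 A minimal sketch of the quantile-binning idea (assumed algorithm for illustration, 
 not the ML pipeline API):
 {code}
 // Bin edges come from evenly spaced quantiles of a (sampled) column.
 def quantileBinEdges(sample: Seq[Double], numBins: Int): Array[Double] = {
   val sorted = sample.sorted.toArray
   (1 until numBins).map { i =>
     sorted(((i.toDouble / numBins) * (sorted.length - 1)).toInt)
   }.toArray
 }

 // Each value maps to the index of its bin; values that sit on an edge go to the upper bin.
 def binIndex(edges: Array[Double], value: Double): Int = {
   val idx = java.util.Arrays.binarySearch(edges, value)
   if (idx >= 0) idx + 1 else -(idx + 1)
 }

 // val edges = quantileBinEdges(ages, numBins = 32)
 // val binned = ages.map(binIndex(edges, _))  // categorical bin index per "age" value
 {code}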



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7222) Added mathematical derivation in comment and compressed the model to LinearRegression with ElasticNet

2015-04-29 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-7222:
---
Summary: Added mathematical derivation in comment and compressed the model 
to LinearRegression with ElasticNet  (was: Added mathematical derivation in 
comment to LinearRegression with ElasticNet)

 Added mathematical derivation in comment and compressed the model to 
 LinearRegression with ElasticNet
 -

 Key: SPARK-7222
 URL: https://issues.apache.org/jira/browse/SPARK-7222
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: DB Tsai

 Added a detailed mathematical derivation of how scaling and 
 LeastSquaresAggregator work. Also refactored the code. TODO: Add a test that 
 fails when the correction terms are not correctly computed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7215) Make repartition and coalesce a part of the query plan

2015-04-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7215.

   Resolution: Fixed
Fix Version/s: 1.4.0
 Assignee: Burak Yavuz

 Make repartition and coalesce a part of the query plan
 --

 Key: SPARK-7215
 URL: https://issues.apache.org/jira/browse/SPARK-7215
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Burak Yavuz
Assignee: Burak Yavuz
Priority: Critical
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6814) Support sorting for any data type in SparkR

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6814:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Support sorting for any data type in SparkR
 ---

 Key: SPARK-6814
 URL: https://issues.apache.org/jira/browse/SPARK-6814
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Critical

 I get various "return status == 0 is false" and "unimplemented type" errors 
 trying to get data out of any RDD with top() or collect(). The errors are not 
 consistent. I think Spark is installed properly because some operations do 
 work. I apologize if I'm missing something easy or not providing the right 
 diagnostic info – I'm new to SparkR, and this seems to be the only resource 
 for SparkR issues.
 Some logs:
 {code}
 Browse[1]> top(estep.rdd, 1L)
 Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) : 
   unimplemented type 'list' in 'orderVector1'
 Calls: do.call ... Reduce -> Anonymous -> func -> FUN -> FUN -> order
 Execution halted
 15/02/13 19:11:57 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14)
 org.apache.spark.SparkException: R computation failed with
  Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) : 
   unimplemented type 'list' in 'orderVector1'
 Calls: do.call ... Reduce -> Anonymous -> func -> FUN -> FUN -> order
 Execution halted
   at edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:69)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 15/02/13 19:11:57 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 14, 
 localhost): org.apache.spark.SparkException: R computation failed with
  Error in order(unlist(part, recursive = FALSE), decreasing = !ascending) : 
   unimplemented type 'list' in 'orderVector1'
 Calls: do.call ... Reduce -> Anonymous -> func -> FUN -> FUN -> order
 Execution halted
 edu.berkeley.cs.amplab.sparkr.BaseRRDD.compute(RRDD.scala:69)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7229) SpecificMutableRow should take integer type as internal representation for DateType

2015-04-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7229:
---

Assignee: (was: Apache Spark)

 SpecificMutableRow should take integer type as internal representation for 
 DateType
 ---

 Key: SPARK-7229
 URL: https://issues.apache.org/jira/browse/SPARK-7229
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao

 {code}
   test("test DATE types in cache") {
     val rows = TestSQLContext.jdbc(urlWithUserAndPass, "TEST.TIMETYPES").collect()
     TestSQLContext.jdbc(urlWithUserAndPass, "TEST.TIMETYPES")
       .cache().registerTempTable("mycached_date")
     val cachedRows = sql("select * from mycached_date").collect()
     assert(rows(0).getAs[java.sql.Date](1) === java.sql.Date.valueOf("1996-01-01"))
     assert(cachedRows(0).getAs[java.sql.Date](1) === java.sql.Date.valueOf("1996-01-01"))
   }
 {code}
 java.lang.ClassCastException: 
 org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
 org.apache.spark.sql.catalyst.expressions.MutableInt
   at 
 org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.getInt(SpecificMutableRow.scala:252)
   at 
 org.apache.spark.sql.columnar.IntColumnStats.gatherStats(ColumnStats.scala:208)
   at 
 org.apache.spark.sql.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:56)
   at 
 org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:87)
   at 
 org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78)
   at 
 org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:87)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:148)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:124)
   at 
 org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:277)
   at 
 org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
   at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7225) CombineLimits do not works

2015-04-29 Thread Zhongshuai Pei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongshuai Pei updated SPARK-7225:
--
Description: 
The optimized logical plan of "select key from (select key from src limit 
100) t2 limit 10" looks like this:
{quote}
== Optimized Logical Plan ==
Limit 10
 Limit 100
  Project [key#3]
   MetastoreRelation default, src, None
{quote}

The CombineLimits rule did not fire.

  was:
The optimized logical plan of "select key from (select key from src limit 
100) t2 limit 10" looks like this:
{quote}
== Optimized Logical Plan ==
Limit 10
 Limit 100
  Project [key#3]
   MetastoreRelation default, src, None
{quote}

It did not 


 CombineLimits do not works
 --

 Key: SPARK-7225
 URL: https://issues.apache.org/jira/browse/SPARK-7225
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Zhongshuai Pei

 The optimized logical plan of "select key from (select key from src limit 
 100) t2 limit 10" looks like this:
 {quote}
 == Optimized Logical Plan ==
 Limit 10
  Limit 100
   Project [key#3]
    MetastoreRelation default, src, None
 {quote}
 The CombineLimits rule did not fire.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7228) SparkR public API for 1.4 release

2015-04-29 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-7228:


 Summary: SparkR public API for 1.4 release
 Key: SPARK-7228
 URL: https://issues.apache.org/jira/browse/SPARK-7228
 Project: Spark
  Issue Type: Umbrella
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Critical


This is an umbrella ticket to track the public APIs and documentation to be 
released as a part of SparkR in the 1.4 release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6832) Handle partial reads in SparkR JVM to worker communication

2015-04-29 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-6832:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Handle partial reads in SparkR JVM to worker communication
 --

 Key: SPARK-6832
 URL: https://issues.apache.org/jira/browse/SPARK-6832
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Minor

 After we move to using a socket between the R worker and the JVM, it's 
 possible that readBin() in R will return partial results (for example, when 
 interrupted by a signal).
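 The general remedy, illustrated here in Scala (the actual fix would live on the R 
 side, around readBin()): keep reading until the requested number of bytes has 
 actually arrived, since a single read may legitimately return fewer.
 {code}
 import java.io.{EOFException, InputStream}

 def readFully(in: InputStream, n: Int): Array[Byte] = {
   val buf = new Array[Byte](n)
   var off = 0
   while (off < n) {
     val read = in.read(buf, off, n - off)          // may return fewer than n - off bytes
     if (read == -1) throw new EOFException(s"stream ended after $off of $n bytes")
     off += read
   }
   buf
 }
 {code}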



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7188) Support math functions in DataFrames in Python

2015-04-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7188.

   Resolution: Fixed
Fix Version/s: 1.4.0

 Support math functions in DataFrames in Python
 --

 Key: SPARK-7188
 URL: https://issues.apache.org/jira/browse/SPARK-7188
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Burak Yavuz
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7225) CombineLimits optimizer does not work

2015-04-29 Thread Zhongshuai Pei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongshuai Pei updated SPARK-7225:
--
Summary: CombineLimits optimizer does not work  (was: CombineLimits 
optimizer does not works)

 CombineLimits optimizer does not work
 -

 Key: SPARK-7225
 URL: https://issues.apache.org/jira/browse/SPARK-7225
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Zhongshuai Pei

 The optimized logical plan of "select key from (select key from src limit 
 100) t2 limit 10" looks like this:
 {quote}
 == Optimized Logical Plan ==
 Limit 10
  Limit 100
   Project [key#3]
    MetastoreRelation default, src, None
 {quote}
 The CombineLimits rule did not fire.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7225) CombineLimits optimizer does not works

2015-04-29 Thread Zhongshuai Pei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongshuai Pei updated SPARK-7225:
--
Summary: CombineLimits optimizer does not works  (was: CombineLimits in 
Optimizer does not works)

 CombineLimits optimizer does not works
 --

 Key: SPARK-7225
 URL: https://issues.apache.org/jira/browse/SPARK-7225
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Zhongshuai Pei

 The optimized logical plan of "select key from (select key from src limit 
 100) t2 limit 10" looks like this:
 {quote}
 == Optimized Logical Plan ==
 Limit 10
  Limit 100
   Project [key#3]
    MetastoreRelation default, src, None
 {quote}
 The CombineLimits rule did not fire.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7222) Added mathematical derivation in comment and compressed the model to LinearRegression with ElasticNet

2015-04-29 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-7222:
---
Issue Type: Improvement  (was: Documentation)

 Added mathematical derivation in comment and compressed the model to 
 LinearRegression with ElasticNet
 -

 Key: SPARK-7222
 URL: https://issues.apache.org/jira/browse/SPARK-7222
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: DB Tsai

 Added a detailed mathematical derivation of how scaling and 
 LeastSquaresAggregator work. Also refactored the code. TODO: Add a test that 
 fails when the correction terms are not correctly computed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7217) Add configuration to disable stopping of SparkContext when StreamingContext.stop()

2015-04-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519593#comment-14519593
 ] 

Sean Owen commented on SPARK-7217:
--

FWIW I'd expect the current behavior since things like {{InputStream.close()}} 
would always close the underlying stream, if one exists, in the JDK. I assume 
you're not proposing changing that. How about a new optional param to control 
whether to stop the underlying stream?  Or make the implementation of 
SparkContext for a specific app un-stoppable instead?

 Add configuration to disable stopping of SparkContext when 
 StreamingContext.stop()
 --

 Key: SPARK-7217
 URL: https://issues.apache.org/jira/browse/SPARK-7217
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.1
Reporter: Tathagata Das
Assignee: Tathagata Das

 In environments like notebooks, the SparkContext is managed by the underlying 
 infrastructure and it is expected that the SparkContext will not be stopped. 
 However, StreamingContext.stop() calls SparkContext.stop() as a non-intuitive 
 side-effect. This JIRA is to add a configuration in SparkConf that sets the 
 default StreamingContext stop behavior. It should be such that the existing 
 behavior does not change for existing users.
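 A sketch of how such a configuration could interact with the stop(...) overload that 
 already exists (the config key below is an assumption for illustration, not a 
 confirmed name):
 {code}
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.{Seconds, StreamingContext}

 val conf = new SparkConf()
   .setMaster("local[2]")
   .setAppName("notebook-app")
   .set("spark.streaming.stopSparkContextByDefault", "false")  // assumed config key

 val ssc = new StreamingContext(conf, Seconds(1))
 // ... define and run streams ...

 // Per-call control already exists; the new configuration would only change the default:
 ssc.stop(stopSparkContext = false)  // leaves the shared SparkContext running
 {code}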



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7236) AkkaUtils askWithReply sleeps indefinitely when a timeout exception is thrown

2015-04-29 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-7236:

Attachment: SparkLongSleepAfterTimeout.scala

Attaching some code to reproduce this issue.

 AkkaUtils askWithReply sleeps indefinitely when a timeout exception is thrown
 -

 Key: SPARK-7236
 URL: https://issues.apache.org/jira/browse/SPARK-7236
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Bryan Cutler
Priority: Trivial
  Labels: quickfix
 Attachments: SparkLongSleepAfterTimeout.scala


 When {{AkkaUtils.askWithReply}} gets a TimeoutException, the default 
 parameters {{maxAttempts = 1}} and {{retryInterval = Int.Max}} lead to the 
 thread sleeping for a very long time (up to Int.MaxValue milliseconds) before the failure surfaces.
 I noticed this issue when testing for SPARK-6980 and using this function 
 without invoking Spark jobs, so perhaps it acts differently in another 
 context.
 If this function is on its final attempt to ask and it fails, it should 
 return immediately.  Also, perhaps a better default {{retryInterval}} would 
 be {{0}}.
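 A minimal sketch of the suggested behavior (not the actual AkkaUtils code): sleep 
 only between attempts, never after the final failed one.
 {code}
 import scala.concurrent.TimeoutException

 def askWithRetry[T](maxAttempts: Int, retryIntervalMs: Long)(attemptOnce: => T): T = {
   var lastError: Throwable = null
   for (attempt <- 1 to maxAttempts) {
     try {
       return attemptOnce
     } catch {
       case e: TimeoutException =>
         lastError = e
         // Sleep only if another attempt will follow; the last failure propagates immediately.
         if (attempt < maxAttempts) Thread.sleep(retryIntervalMs)
     }
   }
   throw new RuntimeException(s"Gave up after $maxAttempts attempts", lastError)
 }
 {code}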



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6989) Spark 1.3 REPL for Scala 2.11 (2.11.2) fails to start, emitting various arcane compiler errors

2015-04-29 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519730#comment-14519730
 ] 

Michael Allman commented on SPARK-6989:
---

Thank you for looking into this. I've been away on vacation for the past week. 
I've set aside our Scala 2.11 deployment in favor of the 2.10 deployment. We'll 
probably try again with Spark 1.4. Cheers.

 Spark 1.3 REPL for Scala 2.11 (2.11.2) fails to start, emitting various 
 arcane compiler errors
 --

 Key: SPARK-6989
 URL: https://issues.apache.org/jira/browse/SPARK-6989
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.3.0
 Environment: Java 1.8.0_40 on Ubuntu 14.04.1
Reporter: Michael Allman
Assignee: Prashant Sharma
 Attachments: spark_repl_2.11_errors.txt


 When starting the Spark 1.3 spark-shell compiled for Scala 2.11, I get a 
 random assortment of compiler errors. I will attach a transcript.
 One thing I've noticed is that they seem to be less frequent when I increase 
 the driver heap size to 5 GB or so. By comparison, the Spark 1.1 spark-shell 
 on Scala 2.10 has been rock solid with a 512 MB heap.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException

2015-04-29 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519650#comment-14519650
 ] 

Kay Ousterhout commented on SPARK-5945:
---

I wanted to hardcode it to 4 (I totally agree with the sentiment you expressed 
earlier in this thread, that it doesn't make sense / is very confusing to 
re-use a config parameter for two different things).

 Spark should not retry a stage infinitely on a FetchFailedException
 ---

 Key: SPARK-5945
 URL: https://issues.apache.org/jira/browse/SPARK-5945
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Imran Rashid
Assignee: Ilya Ganelin

 While investigating SPARK-5928, I noticed some very strange behavior in the 
 way spark retries stages after a FetchFailedException.  It seems that on a 
 FetchFailedException, instead of simply killing the task and retrying, Spark 
 aborts the stage and retries.  If it just retried the task, the task might 
 fail 4 times and then trigger the usual job killing mechanism.  But by 
 killing the stage instead, the max retry logic is skipped (it looks to me 
 like there is no limit for retries on a stage).
 After a bit of discussion with Kay Ousterhout, it seems the idea is that if a 
 fetch fails, we assume that the block manager we are fetching from has 
 failed, and that it will succeed if we retry the stage w/out that block 
 manager.  In that case, it wouldn't make any sense to retry the task, since 
 it's doomed to fail every time, so we might as well kill the whole stage.  But 
 this raises two questions:
 1) Is it really safe to assume that a FetchFailedException means that the 
 BlockManager has failed, and it will work if we just try another one?  
 SPARK-5928 shows that there are at least some cases where that assumption is 
 wrong.  Even if we fix that case, this logic seems brittle to the next case 
 we find.  I guess the idea is that this behavior is what gives us the R in 
 RDD ... but it seems like it's not really that robust and maybe should be 
 reconsidered.
 2) Should stages only be retried a limited number of times?  It would be 
 pretty easy to put in a limited number of retries per stage (a rough sketch of 
 this idea follows the reproduction below).  Though again, we encounter issues 
 with keeping things resilient.  Theoretically one stage could have many 
 retries, but due to failures in different stages further downstream, so we 
 might need to track the cause of each retry as well to still have the desired 
 behavior.
 In general it just seems there is some flakiness in the retry logic.  This is 
 the only reproducible example I have at the moment, but I vaguely recall 
 hitting other cases of strange behavior w/ retries when trying to run long 
 pipelines.  Eg., if one executor is stuck in a GC during a fetch, the fetch 
 fails, but the executor eventually comes back and the stage gets retried 
 again, but the same GC issues happen the second time around, etc.
 Copied from SPARK-5928, here's the example program that can regularly produce 
 a loop of stage failures.  Note that it will only fail from a remote fetch, 
 so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell 
 --num-executors 2 --executor-memory 4000m}}
 {code}
 val rdd = sc.parallelize(1 to 1e6.toInt, 1).map { ignore =>
   val n = 3e3.toInt
   val arr = new Array[Byte](n)
   //need to make sure the array doesn't compress to something small
   scala.util.Random.nextBytes(arr)
   arr
 }
 rdd.map { x => (1, x) }.groupByKey().count()
 {code}
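 To make point (2) above concrete, here is a rough sketch (not DAGScheduler code) 
 of what bounding stage retries could look like: count consecutive failed attempts 
 per stage and abort once a cap is hit.
 {code}
 import scala.collection.mutable

 val maxStageAttempts = 4  // illustrative cap, mirroring the task-level limit
 val stageFailures = mutable.Map.empty[Int, Int].withDefaultValue(0)

 def onStageFetchFailed(stageId: Int): Unit = {
   stageFailures(stageId) += 1
   if (stageFailures(stageId) >= maxStageAttempts) {
     sys.error(s"Stage $stageId failed $maxStageAttempts times; aborting instead of retrying")
   }
   // otherwise: resubmit the stage, as happens today
 }

 def onStageSucceeded(stageId: Int): Unit = stageFailures -= stageId  // reset on success
 {code}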



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated

2015-04-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519600#comment-14519600
 ] 

Sean Owen commented on SPARK-7189:
--

I thought that was the point, but maybe I misunderstand: you have to err on the 
side of re-processing a file even if it doesn't look like it changed. Right?

 History server will always reload the same file even when no log file is 
 updated
 

 Key: SPARK-7189
 URL: https://issues.apache.org/jira/browse/SPARK-7189
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Zhang, Liye
Priority: Minor

 The history server checks every log file by its modification time. It reloads 
 a file if that file's modification time is later than or equal to the latest 
 modification time it remembers. As a result, it periodically reloads the 
 file(s) with the latest modification time even when nothing has changed, which 
 is unnecessary.
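 A hedged sketch of one way to skip unchanged files while still erring on the side 
 of re-processing (illustrative only, not the actual history provider code):
 {code}
 import scala.collection.mutable

 // path -> modification time at which the file was last processed
 val processed = mutable.Map.empty[String, Long]

 def shouldReload(path: String, mtime: Long): Boolean = {
   // Reload unless this exact path was already processed at this (or a later) mtime.
   val reload = processed.get(path).forall(_ < mtime)
   if (reload) processed(path) = mtime
   reload
 }
 {code}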



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7209) Adding new Manning book Spark in Action to the official Spark Webpage

2015-04-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7209.
--
Resolution: Fixed

Looks like Patrick just added it, yes.

 Adding new Manning book Spark in Action to the official Spark Webpage
 ---

 Key: SPARK-7209
 URL: https://issues.apache.org/jira/browse/SPARK-7209
 Project: Spark
  Issue Type: Task
  Components: Documentation
Reporter: Aleksandar Dragosavljevic
Priority: Minor
 Attachments: Spark in Action.jpg

   Original Estimate: 1h
  Remaining Estimate: 1h

 Manning Publications is developing a book Spark in Action written by Marko 
 Bonaci and Petar Zecevic (http://www.manning.com/bonaci) and it would be 
 great if the book could be added to the list of books at the official Spark 
 Webpage (https://spark.apache.org/documentation.html).
 This book teaches readers to use Spark for stream and batch data processing. 
 It starts with an introduction to the Spark architecture and ecosystem 
 followed by a taste of Spark's command line interface. Readers then discover 
 the most fundamental concepts and abstractions of Spark, particularly 
 Resilient Distributed Datasets (RDDs) and the basic data transformations that 
 RDDs provide. The first part of the book also introduces you to writing Spark 
 applications using the core APIs. Next, you learn about different Spark 
 components: how to work with structured data using Spark SQL, how to process 
 near-real time data with Spark Streaming, how to apply machine learning 
 algorithms with Spark MLlib, how to apply graph algorithms on graph-shaped 
 data using Spark GraphX, and a clear introduction to Spark clustering.
 The book is already available to the public as a part of our Manning Early 
 Access Program (MEAP) where we deliver chapters to the public as soon as they 
 are written. We believe it will offer significant support to the Spark users 
 and the community.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7223) Rename RPC askWithReply -> askWithReply, sendWithReply -> ask

2015-04-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7223.

   Resolution: Fixed
Fix Version/s: 1.4.0

 Rename RPC askWithReply -> askWithReply, sendWithReply -> ask
 -

 Key: SPARK-7223
 URL: https://issues.apache.org/jira/browse/SPARK-7223
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.4.0


 Current naming is too confusing between askWithReply and sendWithReply.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7194) Vectors factory method for sparse vectors should accept the output of zipWithIndex

2015-04-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7194:
-
  Component/s: MLlib
 Priority: Minor  (was: Major)
Affects Version/s: 1.3.1

Go ahead and set priority and component, and maybe affects version for 
improvements. 

You can write {{Vectors.dense(array).toSparse}} - that may be simpler still and 
doesn't need a new method?

Or this could also be a little simpler with 
{{array.zipWithIndex.filter(_._1 != 0.0).map(_.swap)}}.
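
Both alternatives in a small, self-contained sketch (the array values are just the 
example from the description):
{code}
import org.apache.spark.mllib.linalg.Vectors

val arr = Array(0.0, 0.0, 3.2, 0.0, 1.5)

// Option 1: build dense, then convert; no new factory method needed.
val sv1 = Vectors.dense(arr).toSparse

// Option 2: zipWithIndex, drop zeros, swap each (value, index) pair into (index, value).
val sv2 = Vectors.sparse(arr.length, arr.zipWithIndex.filter(_._1 != 0.0).map(_.swap))
{code}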

 Vectors factory method for sparse vectors should accept the output of 
 zipWithIndex
 --

 Key: SPARK-7194
 URL: https://issues.apache.org/jira/browse/SPARK-7194
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.1
Reporter: Juliet Hougland
Priority: Minor

 Let's say we have an RDD of Array[Double] where zero values are explicitly 
 recorded, i.e. (0.0, 0.0, 3.2, 0.0...). If we want to transform this into an RDD 
 of sparse vectors, we currently have to:
 arr_doubles.map { array =>
   val indexElem: Seq[(Int, Double)] = 
     array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1))
   Vectors.sparse(array.length, indexElem)
 }
 Notice that there is a map step at the end to switch the order of the index 
 and the element value after .zipWithIndex. There should be a factory method 
 on the Vectors class that allows you to avoid this flipping of tuple elements 
 when using zipWithIndex.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7238) Upgrade protobuf-java (com.google.protobuf) version from 2.4.1 to 2.5.0

2015-04-29 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-7238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Favio Vázquez closed SPARK-7238.

Resolution: Won't Fix

 Upgrade protobuf-java (com.google.protobuf) version from 2.4.1 to 2.5.0
 ---

 Key: SPARK-7238
 URL: https://issues.apache.org/jira/browse/SPARK-7238
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Build
Affects Versions: 1.3.1
 Environment: Ubuntu 14.04. Apache Mesos in cluster mode with  HDFS 
 from cloudera 2.5.0-cdh5.3.3. 
Reporter: Favio Vázquez
Priority: Blocker
  Labels: 2.5.0-cdh5.3.3, CDH5, HDFS, Mesos
 Fix For: 1.3.1, 1.3.0


 This upgrade is needed when building Spark for CDH5 2.5.0-cdh5.3.3, due to 
 incompatibilities between the com.google.protobuf version used by Spark and the 
 one used in Hadoop. The default protobuf version is set to 2.4.1 in the global 
 properties, as stated in the pom.xml file:
 <!-- In theory we need not directly depend on protobuf since Spark does not 
 directly use it. However, when building with Hadoop/YARN 2.2 Maven doesn't 
 correctly bump the protobuf version up from the one Mesos gives. For now we 
 include this variable to explicitly bump the version when building with YARN. 
 It would be nice to figure out why Maven can't resolve this correctly (like 
 SBT does). -->
 So this upgrade would only affect the com.google.protobuf version of 
 protobuf-java. Tested with the Cloudera distribution 2.5.0-cdh5.3.3 using 
 Mesos 0.22.0 in cluster mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7248) Random number generators for DataFrames

2015-04-29 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-7248:


 Summary: Random number generators for DataFrames
 Key: SPARK-7248
 URL: https://issues.apache.org/jira/browse/SPARK-7248
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Xiangrui Meng
Assignee: Burak Yavuz


This is an umbrella JIRA for random number generators for DataFrames. The 
initial set of RNGs would be `rand` and `randn`, which take a seed.

{code}
df.select("*", rand(11L).as("rand"))
{code}

Where those methods should live is TBD.
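
A minimal Scala sketch of that usage, assuming the generators end up exposed as 
functions in org.apache.spark.sql.functions (exactly the open question above) and 
that {{df}} is an existing DataFrame:
{code}
import org.apache.spark.sql.functions.{col, rand, randn}

val withNoise = df.select(col("*"), rand(11L).as("rand"), randn(27L).as("randn"))
{code}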



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7247) Add Pandas' shift method to the Dataframe API

2015-04-29 Thread Olivier Girardot (JIRA)
Olivier Girardot created SPARK-7247:
---

 Summary: Add Pandas' shift method to the Dataframe API
 Key: SPARK-7247
 URL: https://issues.apache.org/jira/browse/SPARK-7247
 Project: Spark
  Issue Type: Wish
  Components: SQL
Affects Versions: 1.3.1
Reporter: Olivier Girardot
Priority: Minor


Spark's DataFrame provides several of the capabilities of Pandas and R DataFrames, 
but in a distributed fashion, which makes it almost easy to rewrite Pandas code 
for Spark.

Almost, but there is a feature that's difficult to work around right now: the 
shift method: 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html

I'm working with some data scientists who use this feature a lot to check for 
row equality after sorting by some keys. Example (in pandas):

{code}
df['delta'] = (df.START_DATE.shift(-1) - df.END_DATE).astype('timedelta64[D]')
{code}
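
Not the requested {{shift}} itself, but for comparison, a hedged sketch of how the 
same delta could be approached with a window function ({{lead}}), assuming a Spark 
version where {{org.apache.spark.sql.expressions.Window}} is available and assuming 
a sort key column named {{id}}:
{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead

// df is assumed to have START_DATE, END_DATE and a sort key column "id".
val w = Window.orderBy("id")
val shifted = df.withColumn("next_start", lead(df("START_DATE"), 1).over(w))
// The per-row delta would then be next_start - END_DATE, computed downstream.
{code}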

I think this would be troublesome to add, and I don't even know whether the 
change would be doable. But as a user, it would be useful for me.

Olivier.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7250) computeInverse for RowMatrix

2015-04-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7250:
---

Assignee: (was: Apache Spark)

 computeInverse for RowMatrix
 

 Key: SPARK-7250
 URL: https://issues.apache.org/jira/browse/SPARK-7250
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Stephanie Rivera





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7250) computeInverse for RowMatrix

2015-04-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520053#comment-14520053
 ] 

Apache Spark commented on SPARK-7250:
-

User 'SpyderRiverA' has created a pull request for this issue:
https://github.com/apache/spark/pull/5785

 computeInverse for RowMatrix
 

 Key: SPARK-7250
 URL: https://issues.apache.org/jira/browse/SPARK-7250
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Stephanie Rivera





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7250) computeInverse for RowMatrix

2015-04-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7250:
---

Assignee: Apache Spark

 computeInverse for RowMatrix
 

 Key: SPARK-7250
 URL: https://issues.apache.org/jira/browse/SPARK-7250
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Stephanie Rivera
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-04-29 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520058#comment-14520058
 ] 

Patrick Wendell commented on SPARK-7230:


I think this is a good idea. We should expose a narrower, higher-level API here 
and then look at user feedback to understand whether we want to support 
something lower level. From my experience with PySpark, it was a huge effort 
(probably more than 5X the original contribution) to actually implement 
everything in the lowest-level Spark APIs. And for the R community I don't 
think those low-level ETL APIs are that useful. So I'd be inclined to keep it 
simple at the beginning and then add complexity if we see new user demand.

 Make RDD API private in SparkR for Spark 1.4
 

 Key: SPARK-7230
 URL: https://issues.apache.org/jira/browse/SPARK-7230
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Critical

 This ticket proposes making the RDD API in SparkR private for the 1.4 
 release. The motivation for doing so is discussed in a larger design 
 document aimed at a more top-down design of the SparkR APIs. A first cut that 
 discusses motivation and proposed changes can be found at http://goo.gl/GLHKZI
 The main points in that document that relate to this ticket are:
 - The RDD API requires knowledge of the distributed system and is pretty low 
 level. This is not very suitable for a number of R users who are used to more 
 high-level packages that work out of the box.
 - The RDD implementation in SparkR is not fully robust right now: we are 
 missing features like spilling for aggregation, handling partitions which 
 don't fit in memory etc. There are further limitations like lack of hashCode 
 for non-native types etc. which might affect user experience.
 The only change we will make for now is to not export the RDD functions as 
 public methods in the SparkR package, and I will create another ticket to 
 discuss the public API for 1.5 in more detail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7252) Add support for creating new Hive and HBase delegation tokens

2015-04-29 Thread Hari Shreedharan (JIRA)
Hari Shreedharan created SPARK-7252:
---

 Summary: Add support for creating new Hive and HBase delegation 
tokens
 Key: SPARK-7252
 URL: https://issues.apache.org/jira/browse/SPARK-7252
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.3.1
Reporter: Hari Shreedharan


In SPARK-5342, support is being added for long-running apps to be able to write 
to HDFS, but this does not work for Hive and HBase. We need to add the same 
support for these too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7237) Many user provided closures are not actually cleaned

2015-04-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7237:
---

Assignee: Andrew Or  (was: Apache Spark)

 Many user provided closures are not actually cleaned
 

 Key: SPARK-7237
 URL: https://issues.apache.org/jira/browse/SPARK-7237
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: Andrew Or

 It appears that many operations throughout Spark do not actually 
 clean the closures provided by the user.
 Simple reproduction:
 {code}
 def test(): Unit = {
   sc.parallelize(1 to 10).mapPartitions { iter => return; iter }.collect()
 }
 {code}
 Clearly, the inner closure is not serializable, but when we serialize it we 
 should expect the ClosureCleaner to fail fast and complain loudly about 
 return statements. Instead, we get a mysterious stack trace:
 {code}
 java.io.NotSerializableException: java.lang.Object
 Serialization stack:
   - object not serializable (class: java.lang.Object, value: 
 java.lang.Object@6db4b914)
   - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, name: 
 nonLocalReturnKey1$1, type: class java.lang.Object)
   - object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, function1)
   - field (class: org.apache.spark.rdd.RDD$$anonfun$14, name: f$4, type: 
 interface scala.Function1)
   - object (class org.apache.spark.rdd.RDD$$anonfun$14, function3)
   at 
 org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
   at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
   at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:314)
 {code}
 What might have caused this? If you look at the code for mapPartitions, 
 you'll notice that we never explicitly clean the closure passed in by the 
 user. Instead, we only wrap it in another closure and clean the outer one:
 {code}
 def mapPartitions[U: ClassTag](
     f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U] = {
   val func = (context: TaskContext, index: Int, iter: Iterator[T]) => f(iter)
   new MapPartitionsRDD(this, sc.clean(func), preservesPartitioning)
 }
 {code}
 This is not sufficient, however, because the user provided closure is 
 actually a field of the outer closure, which doesn't get cleaned. If we 
 rewrite the above by cleaning the inner closure preemptively:
 {code}
 def mapPartitions[U: ClassTag](
     f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U] = {
   val cleanedFunc = clean(f)
   new MapPartitionsRDD(
     this,
     (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedFunc(iter),
     preservesPartitioning)
 }
 {code}
 Then we get the exception that we would expect by running the test() example 
 above:
 {code}
 org.apache.spark.SparkException: Return statements aren't allowed in Spark 
 closures
   at 
 org.apache.spark.util.ReturnStatementFinder$$anon$1.visitTypeInsn(ClosureCleaner.scala:357)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
  Source)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
  Source)
   at 
 org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:215)
   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
   at org.apache.spark.SparkContext.clean(SparkContext.scala:1759)
   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:640)
 {code}
 This needs to be done in a few places throughout Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7237) Many user provided closures are not actually cleaned

2015-04-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-7237:
-
Description: 
It appears that many operations throughout Spark do not actually clean the 
closures provided by the user.

Simple reproduction:
{code}
def test(): Unit = {
  sc.parallelize(1 to 10).mapPartitions { iter => return; iter }.collect()
}
{code}
Clearly, the inner closure is not serializable, but when we serialize it we 
should expect the ClosureCleaner to fail fast and complain loudly about return 
statements. Instead, we get a mysterious stack trace:
{code}
java.io.NotSerializableException: java.lang.Object
Serialization stack:
- object not serializable (class: java.lang.Object, value: 
java.lang.Object@6db4b914)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, name: 
nonLocalReturnKey1$1, type: class java.lang.Object)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, function1)
- field (class: org.apache.spark.rdd.RDD$$anonfun$14, name: f$4, type: 
interface scala.Function1)
- object (class org.apache.spark.rdd.RDD$$anonfun$14, function3)
at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:314)
{code}

What might have caused this? If you look at the code for mapPartitions, you'll 
notice that we never explicitly clean the closure passed in by the user. 
Instead, we only wrap it in another closure and clean only the outer one:
{code}
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U] = {
  val func = (context: TaskContext, index: Int, iter: Iterator[T]) => f(iter)
  new MapPartitionsRDD(this, sc.clean(func), preservesPartitioning)
}
{code}

This is not sufficient, however, because the user provided closure is actually 
a field of the outer closure, and this inner closure doesn't get cleaned. If we 
rewrite the above by cleaning the inner closure preemptively, as we have done 
in other places:

{code}
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U] = {
  val cleanedFunc = clean(f)
  new MapPartitionsRDD(
    this,
    (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedFunc(iter),
    preservesPartitioning)
}
{code}

Then we get the exception that we would expect by running the test() example 
above:
{code}
org.apache.spark.SparkException: Return statements aren't allowed in Spark 
closures
at 
org.apache.spark.util.ReturnStatementFinder$$anon$1.visitTypeInsn(ClosureCleaner.scala:357)
at 
com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
 Source)
at 
com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
 Source)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:215)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1759)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:640)
{code}

It seems to me that we simply forgot to do this in a few places (e.g. 
mapPartitions, keyBy, aggregateByKey), because in other similar places we do 
this correctly (e.g. groupBy, combineByKey, zipPartitions).

  was:
It appears that many operations throughout Spark do not actually clean the 
closures provided by the user.

Simple reproduction:
{code}
def test(): Unit = {
  sc.parallelize(1 to 10).mapPartitions { iter => return; iter }.collect()
}
{code}
Clearly, the inner closure is not serializable, but when we serialize it we 
should expect the ClosureCleaner to fail fast and complain loudly about return 
statements. Instead, we get a mysterious stack trace:
{code}
java.io.NotSerializableException: java.lang.Object
Serialization stack:
- object not serializable (class: java.lang.Object, value: 
java.lang.Object@6db4b914)
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, name: 
nonLocalReturnKey1$1, type: class java.lang.Object)
- object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, function1)
- field (class: org.apache.spark.rdd.RDD$$anonfun$14, name: f$4, type: 
interface scala.Function1)
- object (class org.apache.spark.rdd.RDD$$anonfun$14, function3)
at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at 
