[jira] [Updated] (SPARK-4359) Empty classifier in avro-mapred is misinterpreted in SBT
[ https://issues.apache.org/jira/browse/SPARK-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4359: - Description: In the parent pom, avro.mapred.classifier is set to hadoop2 for Yarn but not otherwise set. As a result, when an application that uses spark-hive_2.10 as a module is built with SBT, it will try to resolve a jar that is literally called the following: {code} [warn] Maven Repository: tried [warn] http://repo1.maven.org/maven2/org/apache/avro/avro-mapred/1.7.6/avro-mapred-1.7.6-${avro.mapred.classifier}.jar [warn] :: [warn] :: FAILED DOWNLOADS:: [warn] :: ^ see resolution messages for details ^ :: [warn] :: [warn] :: org.apache.avro#avro-mapred;1.7.6!avro-mapred.jar [warn] :: sbt.ResolveException: download failed: org.apache.avro#avro-mapred;1.7.6!avro-mapred.jar {code} This is because avro.mapred.classifier is not a variable according to SBT. was: In the parent pom, avro.mapred.classifier is set to hadoop2 for Yarn but not otherwise set. As a result, when an application that uses spark-hive_2.10 as a module is built with SBT, it will try to resolve a jar that is literally called {code} avro-mapred-1.7.6-${avro.mapred.classifier}.jar {code} This is because avro.mapred.classifier is not a variable according to SBT. Empty classifier in avro-mapred is misinterpreted in SBT -- Key: SPARK-4359 URL: https://issues.apache.org/jira/browse/SPARK-4359 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0, 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical In the parent pom, avro.mapred.classifier is set to hadoop2 for Yarn but not otherwise set. As a result, when an application that uses spark-hive_2.10 as a module is built with SBT, it will try to resolve a jar that is literally called the following: {code} [warn] Maven Repository: tried [warn] http://repo1.maven.org/maven2/org/apache/avro/avro-mapred/1.7.6/avro-mapred-1.7.6-${avro.mapred.classifier}.jar [warn]:: [warn]:: FAILED DOWNLOADS:: [warn]:: ^ see resolution messages for details ^ :: [warn]:: [warn]:: org.apache.avro#avro-mapred;1.7.6!avro-mapred.jar [warn]:: sbt.ResolveException: download failed: org.apache.avro#avro-mapred;1.7.6!avro-mapred.jar {code} This is because avro.mapred.classifier is not a variable according to SBT. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
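A possible SBT-side workaround (a sketch only, not the fix adopted in the parent pom): exclude the unresolvable avro-mapred artifact pulled in through spark-hive and add it back with an explicit classifier. The version and classifier values below are assumptions for illustration.
{code}
// build.sbt sketch (assumed coordinates)
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.1.0" excludeAll(
  ExclusionRule(organization = "org.apache.avro", name = "avro-mapred"))
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.6" classifier "hadoop2"
{code}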
[jira] [Commented] (SPARK-4341) Spark need to set num-executors automatically
[ https://issues.apache.org/jira/browse/SPARK-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207793#comment-14207793 ] Sean Owen commented on SPARK-4341: -- The problem is that the number of executors is then not appropriate for anything but the first action that is computed. Spark need to set num-executors automatically - Key: SPARK-4341 URL: https://issues.apache.org/jira/browse/SPARK-4341 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Hong Shen A MapReduce job can set the number of map tasks automatically, but in Spark we have to set num-executors, executor memory and cores. It's difficult for users to set these arguments, especially for users who want to use Spark SQL. So when the user hasn't set num-executors, Spark should set num-executors automatically according to the input partitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
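For context, these are the knobs the description says users must currently set by hand on YARN; the values below are purely illustrative, not recommendations.
{code}
// manual settings the description refers to
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.instances", "12")  // --num-executors
  .set("spark.executor.memory", "4g")     // --executor-memory
  .set("spark.executor.cores", "2")       // --executor-cores
{code}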
[jira] [Commented] (SPARK-4359) Empty classifier in avro-mapred is misinterpreted in SBT
[ https://issues.apache.org/jira/browse/SPARK-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207794#comment-14207794 ] Andrew Or commented on SPARK-4359: -- Ok, I reverted commit https://github.com/apache/spark/commit/78887f94a0ae9cdcfb851910ab9c7d51a1ef2acb for branch-1.1 for now. Empty classifier in avro-mapred is misinterpreted in SBT -- Key: SPARK-4359 URL: https://issues.apache.org/jira/browse/SPARK-4359 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0, 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical In the parent pom, avro.mapred.classifier is set to hadoop2 for Yarn but not otherwise set. As a result, when an application that uses spark-hive_2.10 as a module is built with SBT, it will try to resolve a jar that is literally called the following: {code} [warn] Maven Repository: tried [warn] http://repo1.maven.org/maven2/org/apache/avro/avro-mapred/1.7.6/avro-mapred-1.7.6-${avro.mapred.classifier}.jar [warn]:: [warn]:: FAILED DOWNLOADS:: [warn]:: ^ see resolution messages for details ^ :: [warn]:: [warn]:: org.apache.avro#avro-mapred;1.7.6!avro-mapred.jar [warn]:: sbt.ResolveException: download failed: org.apache.avro#avro-mapred;1.7.6!avro-mapred.jar {code} This is because avro.mapred.classifier is not a variable according to SBT. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207801#comment-14207801 ] Apache Spark commented on SPARK-2426: - User 'debasish83' has created a pull request for this issue: https://github.com/apache/spark/pull/3221 Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4353) Delete the val that never used in Catalog
[ https://issues.apache.org/jira/browse/SPARK-4353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4353. -- Resolution: Not a Problem Delete the val that never used in Catalog - Key: SPARK-4353 URL: https://issues.apache.org/jira/browse/SPARK-4353 Project: Spark Issue Type: Improvement Components: SQL Reporter: DoingDone9 Priority: Minor dbName in Catalog is never used, as in:
{code}
val (dbName, tblName) = processDatabaseAndTableName(databaseName, tableName)
tables -= tblName
{code}
I think it should be deleted; the assignment could instead be:
{code}
val tblName = processDatabaseAndTableName(databaseName, tableName)._2
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4341) Spark need to set num-executors automatically
[ https://issues.apache.org/jira/browse/SPARK-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207816#comment-14207816 ] Hong Shen commented on SPARK-4341: -- After the first action is computed, we can set minPartitions for the following HadoopRDD. Then the following HadoopRDD's partitions won't be fewer than num-executors, and it will prevent wasting resources. On the other hand, if the following HadoopRDD has many more partitions than num-executors, we can report the new numExecutors to the ApplicationMaster and allocate new executors. Spark need to set num-executors automatically - Key: SPARK-4341 URL: https://issues.apache.org/jira/browse/SPARK-4341 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Hong Shen A MapReduce job can set the number of map tasks automatically, but in Spark we have to set num-executors, executor memory and cores. It's difficult for users to set these arguments, especially for users who want to use Spark SQL. So when the user hasn't set num-executors, Spark should set num-executors automatically according to the input partitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
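A minimal sketch of the idea in the comment above; how targetParallelism is derived here is an assumption for illustration, not part of the proposal.
{code}
// Once the first action has run and the executor count is known,
// later Hadoop reads can request at least that many partitions.
val coresPerExecutor = 3                                        // assumed, normally taken from config
val targetParallelism = sc.getExecutorMemoryStatus.size * coresPerExecutor
val next = sc.textFile("/data/next-input", minPartitions = targetParallelism)
next.count()
{code}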
[jira] [Comment Edited] (SPARK-4341) Spark need to set num-executors automatically
[ https://issues.apache.org/jira/browse/SPARK-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207816#comment-14207816 ] Hong Shen edited comment on SPARK-4341 at 11/12/14 8:40 AM: After the first action computed, we can set nimPartition for the following HadoopRDD. So the following HadoopRDD's partitions won't less than num-executors, and it will prevent wasting of resources. On the other hand if the following HadoopRDD's partitions is much bigger than num-executors, we can reset numExecuors to ApplicaitonMaster and allocate new executors. was (Author: shenhong): After the first action computed, we can set set nimPartition for the following HadoopRDD. So the following HadoopRDD's partitions won't less than num-executors, and it will prevent wasting of resources. On the other hand if the following HadoopRDD's partitions is much bigger than num-executors, we can reset numExecuors to ApplicaitonMaster and allocate new executors. Spark need to set num-executors automatically - Key: SPARK-4341 URL: https://issues.apache.org/jira/browse/SPARK-4341 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Hong Shen The mapreduce job can set maptask automaticlly, but in spark, we have to set num-executors, executor memory and cores. It's difficult for users to set these args, especially for the users want to use spark sql. So when user havn't set num-executors, spark should set num-executors automatically accroding to the input partitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3206) Error in PageRank values
[ https://issues.apache.org/jira/browse/SPARK-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207822#comment-14207822 ] Ankur Dave commented on SPARK-3206: --- I just tested this with the standalone version of PageRank that was introduced in SPARK-3427, and it seems to be fixed, so I'm closing this.
{code}
scala> val e = sc.parallelize(List((1, 2), (1, 3), (3, 2), (3, 4), (5, 3), (6, 7), (7, 8), (8, 9), (9, 7)))
scala> val g = Graph.fromEdgeTuples(e.map(kv => (kv._1.toLong, kv._2.toLong)), 0)
scala> g.pageRank(0.0001).vertices.collect.foreach(println)
(8,1.2808550959634413)
(1,0.15)
(9,1.2387268204156412)
(2,0.358781244)
(3,0.341249994)
(4,0.295031247)
(5,0.15)
(6,0.15)
(7,1.330417786200011)
scala> g.staticPageRank(100).vertices.collect.foreach(println)
(8,1.2803346052504254)
(1,0.15)
(9,1.2381240056453071)
(2,0.358781244)
(3,0.341249994)
(4,0.295031247)
(5,0.15)
(6,0.15)
(7,1.3299054047985106)
{code}
Error in PageRank values Key: SPARK-3206 URL: https://issues.apache.org/jira/browse/SPARK-3206 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.2 Environment: UNIX with Hadoop Reporter: Peter Fontana I have found a small example where the PageRank values using run and runUntilConvergence differ quite a bit. I am running the Pagerank module on the following graph: Edge Table:
|| Node1 || Node2 ||
| 1 | 2 |
| 1 | 3 |
| 3 | 2 |
| 3 | 4 |
| 5 | 3 |
| 6 | 7 |
| 7 | 8 |
| 8 | 9 |
| 9 | 7 |
Node Table (note the extra node):
|| NodeID || NodeName ||
| a | 1 |
| b | 2 |
| c | 3 |
| d | 4 |
| e | 5 |
| f | 6 |
| g | 7 |
| h | 8 |
| i | 9 |
| j.longaddress.com | 10 |
with a default resetProb of 0.15. When I compute the pageRank with runUntilConvergence, running {{val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}} I get the ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,1.3299054047985106) (9,1.2381240056453071) (8,1.2803346052504254) (10,0.15) (5,0.15) (2,0.358781244) However, when I run page Rank with the run() method, running {{val ranksI = PageRank.run(graph,100).vertices}} I get the page ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,0.999387662847) (9,0.999256447741) (8,0.999256447741) (10,0.15) (5,0.15) (2,0.295031247) These are quite different, leading me to suspect that one of the PageRank methods is incorrect. I have examined the source, but I do not know what the correct fix is, or which set of values is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3206) Error in PageRank values
[ https://issues.apache.org/jira/browse/SPARK-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-3206. --- Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Ankur Dave Error in PageRank values Key: SPARK-3206 URL: https://issues.apache.org/jira/browse/SPARK-3206 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.2 Environment: UNIX with Hadoop Reporter: Peter Fontana Assignee: Ankur Dave Fix For: 1.2.0 I have found a small example where the PageRank values using run and runUntilConvergence differ quite a bit. I am running the Pagerank module on the following graph: Edge Table: || Node1 || Node2 || |1 | 2 | |1 | 3| |3 | 2| |3 | 4| |5 | 3| |6 | 7| |7 | 8| |8 | 9| |9 | 7| Node Table (note the extra node): || NodeID || NodeName || |a | 1| |b | 2| |c | 3| |d | 4| |e | 5| |f | 6| |g | 7| |h | 8| |i | 9| |j.longaddress.com | 10| with a default resetProb of 0.15. When I compute the pageRank with runUntilConvergence, running {{val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}} I get the ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,1.3299054047985106) (9,1.2381240056453071) (8,1.2803346052504254) (10,0.15) (5,0.15) (2,0.358781244) However, when I run page Rank with the run() method, running {{val ranksI = PageRank.run(graph,100).vertices}} I get the page ranks (4,0.295031247) (1,0.15) (6,0.15) (3,0.341249994) (7,0.999387662847) (9,0.999256447741) (8,0.999256447741) (10,0.15) (5,0.15) (2,0.295031247) These are quite different, leading me to suspect that one of the PageRank methods is incorrect. I have examined the source, but I do not know what the correct fix is, or which set of values is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4360) task only execute on one node when spark on yarn
seekerak created SPARK-4360: --- Summary: task only execute on one node when spark on yarn Key: SPARK-4360 URL: https://issues.apache.org/jira/browse/SPARK-4360 Project: Spark Issue Type: Bug Affects Versions: 1.0.2 Reporter: seekerak hadoop version: hadoop 2.0.3-alpha spark version: 1.0.2 When I run Spark jobs on YARN, I found that all the tasks run on only one node. My cluster has 4 nodes and 3 executors were started, but only one of them gets tasks; the others get none. My command looks like this: /opt/hadoopcluster/spark-1.0.2-bin-hadoop2/bin/spark-submit --class org.sr.scala.Spark_LineCount_G0 --executor-memory 2G --num-executors 12 --master yarn-cluster /home/Spark_G0.jar /data /output/ou_1 Does anyone know why? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
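One frequent cause of this symptom (stated here as an assumption, not a confirmed diagnosis for this report) is an input that yields very few partitions, so there are not enough tasks to occupy every executor. A quick check and mitigation, sketched in Scala:
{code}
val lines = sc.textFile("/data")
println(lines.partitions.length)   // if this is 1 (e.g. one unsplittable gzip file), only one task can run
val spread = lines.repartition(12) // force a wider spread across executors
println(spread.count())
{code}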
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207826#comment-14207826 ] zzc commented on SPARK-2468: Hi, Aaron Davidson, I am sure that I ran my last test with the patch #3155 applied. Configuration:
spark.shuffle.consolidateFiles true
spark.storage.memoryFraction 0.2
spark.shuffle.memoryFraction 0.2
spark.shuffle.file.buffer.kb 100
spark.reducer.maxMbInFlight 48
spark.shuffle.blockTransferService netty
spark.shuffle.io.mode nio
spark.shuffle.io.connectionTimeout 120
spark.shuffle.manager SORT
spark.shuffle.io.preferDirectBufs true
spark.shuffle.io.maxRetries 3
spark.shuffle.io.retryWaitMs 5000
spark.shuffle.io.maxUsableCores 3
Command: --num-executors 17 --executor-memory 12g --executor-cores 3 If spark.shuffle.io.preferDirectBufs=false, it's OK. Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user space buffer, and then back to a kernel send buffer before it reaches the NIC. It does multiple copies of the data and context switching between kernel/user. It also creates unnecessary buffer in the JVM that increases GC. Instead, we should use FileChannel.transferTo, which handles this in the kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207839#comment-14207839 ] zzc commented on SPARK-2468: Hi, Aaron Davidson, can you describe your test, including the environment, configuration, data volume? Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user space buffer, and then back to a kernel send buffer before it reaches the NIC. It does multiple copies of the data and context switching between kernel/user. It also creates unnecessary buffer in the JVM that increases GC Instead, we should use FileChannel.transferTo, which handles this in the kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4251) Add Restricted Boltzmann machine(RBM) algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207855#comment-14207855 ] Apache Spark commented on SPARK-4251: - User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/3222 Add Restricted Boltzmann machine(RBM) algorithm to MLlib Key: SPARK-4251 URL: https://issues.apache.org/jira/browse/SPARK-4251 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4361) SparkContext HadoopRDD is not clear about how to use a Hadoop Configuration
Shixiong Zhu created SPARK-4361: --- Summary: SparkContext HadoopRDD is not clear about how to use a Hadoop Configuration Key: SPARK-4361 URL: https://issues.apache.org/jira/browse/SPARK-4361 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Minor When I answered this question: http://apache-spark-user-list.1001560.n3.nabble.com/How-did-the-RDD-union-work-td18686.html, I found that SparkContext does not explain how to use a Hadoop Configuration. It would be better to add docs clarifying that the Configuration will be put into a Broadcast. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
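For context, a minimal sketch of the behavior the docs could call out (the input paths are assumptions): the Configuration handed to a Hadoop RDD is captured in a broadcast, so reusing and mutating one shared instance across several RDDs can be surprising; building a fresh Configuration per RDD avoids that.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf1 = new Configuration()
conf1.set("mapreduce.input.fileinputformat.inputdir", "/data/a")
val rdd1 = sc.newAPIHadoopRDD(conf1, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

val conf2 = new Configuration()   // a separate instance, not a mutation of conf1
conf2.set("mapreduce.input.fileinputformat.inputdir", "/data/b")
val rdd2 = sc.newAPIHadoopRDD(conf2, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
{code}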
[jira] [Resolved] (SPARK-4355) OnlineSummarizer doesn't merge mean correctly
[ https://issues.apache.org/jira/browse/SPARK-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4355. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3220 [https://github.com/apache/spark/pull/3220] OnlineSummarizer doesn't merge mean correctly - Key: SPARK-4355 URL: https://issues.apache.org/jira/browse/SPARK-4355 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.2.0 It happens when the mean on one side is zero. I will send an PR with some code clean-up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
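A generic sketch of the invariant the fix has to preserve (not the actual MLlib code): when two partial summaries are merged, the combined mean must be weighted by both sample counts, which holds even when one side's mean is exactly zero.
{code}
// merge two partial (count, mean) summaries
def mergeMean(n1: Long, mean1: Double, n2: Long, mean2: Double): Double =
  if (n1 + n2 == 0) 0.0 else (mean1 * n1 + mean2 * n2) / (n1 + n2)

mergeMean(10, 0.0, 5, 3.0)   // 1.0, even though the first mean is zero
{code}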
[jira] [Reopened] (SPARK-4355) OnlineSummarizer doesn't merge mean correctly
[ https://issues.apache.org/jira/browse/SPARK-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reopened SPARK-4355: -- Reopen this issue because we haven't fixed branch-1.1 and branch-1.0 yet. OnlineSummarizer doesn't merge mean correctly - Key: SPARK-4355 URL: https://issues.apache.org/jira/browse/SPARK-4355 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.2.0 It happens when the mean on one side is zero. I will send an PR with some code clean-up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4361) SparkContext HadoopRDD is not clear about how to use a Hadoop Configuration
[ https://issues.apache.org/jira/browse/SPARK-4361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207882#comment-14207882 ] Apache Spark commented on SPARK-4361: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3225 SparkContext HadoopRDD is not clear about how to use a Hadoop Configuration --- Key: SPARK-4361 URL: https://issues.apache.org/jira/browse/SPARK-4361 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Minor Labels: doc, easyfix When I answered this question: http://apache-spark-user-list.1001560.n3.nabble.com/How-did-the-RDD-union-work-td18686.html, I found SparkContext did not explain how to use a Hadoop Configuration. More docs to clarify that Configuration will be put into a Broadcast is better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4341) Spark need to set num-executors automatically
[ https://issues.apache.org/jira/browse/SPARK-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207885#comment-14207885 ] Sean Owen commented on SPARK-4341: -- So I think some of this is already done by Spark. For example, the number of partitions is determined in the same way that Hadoop does, and carries through a pipeline of transformations. Some of this is not necessarily the right thing to do. For example I could be running several transformations at once, and trying to match each's parallelism to the number of executors may be inefficient, not only because it may mean making partitions that are excessively small or large, but because it may require a shuffle, which is expensive. Finally I think the issue of resource usage is better dealt with by increasing/decreasing the number of executors dynamically in response to demand or load, and there is already work in progress on those. So maybe that covers what you are thinking of already. Spark need to set num-executors automatically - Key: SPARK-4341 URL: https://issues.apache.org/jira/browse/SPARK-4341 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Hong Shen The mapreduce job can set maptask automaticlly, but in spark, we have to set num-executors, executor memory and cores. It's difficult for users to set these args, especially for the users want to use spark sql. So when user havn't set num-executors, spark should set num-executors automatically accroding to the input partitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
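The dynamic-allocation work mentioned above is exposed through settings along these lines; shown only as a sketch of the direction, since the exact keys and their availability depend on the Spark version:
{code}
val conf = new org.apache.spark.SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "50")
  .set("spark.shuffle.service.enabled", "true")   // external shuffle service, so executors can be removed safely
{code}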
[jira] [Commented] (SPARK-4360) task only execute on one node when spark on yarn
[ https://issues.apache.org/jira/browse/SPARK-4360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207894#comment-14207894 ] Sean Owen commented on SPARK-4360: -- I don't think there's enough info here; this maybe should have been a question on the list first. Is there more than 1 partition in the input? did more than 1 executor actually allocate? are you definitely observing tasks running and not some single-threaded process on the driver? task only execute on one node when spark on yarn Key: SPARK-4360 URL: https://issues.apache.org/jira/browse/SPARK-4360 Project: Spark Issue Type: Bug Affects Versions: 1.0.2 Reporter: seekerak hadoop version: hadoop 2.0.3-alpha spark version: 1.0.2 when i run spark jobs on yarn, i found all the task only run on one node, my cluster has 4 nodes, executors has 3, but only one has task, the others hasn't, my command like this : /opt/hadoopcluster/spark-1.0.2-bin-hadoop2/bin/spark-submit --class org.sr.scala.Spark_LineCount_G0 --executor-memory 2G --num-executors 12 --master yarn-cluster /home/Spark_G0.jar /data /output/ou_1 is there any one knows why? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4341) Spark need to set num-executors automatically
[ https://issues.apache.org/jira/browse/SPARK-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207899#comment-14207899 ] Hong Shen commented on SPARK-4341: -- My main point is that when running Spark (especially Spark SQL), not all users want to set parallelism to match executors; we can provide an easy way for them to use Spark. Spark need to set num-executors automatically - Key: SPARK-4341 URL: https://issues.apache.org/jira/browse/SPARK-4341 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Hong Shen A MapReduce job can set the number of map tasks automatically, but in Spark we have to set num-executors, executor memory and cores. It's difficult for users to set these arguments, especially for users who want to use Spark SQL. So when the user hasn't set num-executors, Spark should set num-executors automatically according to the input partitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4038) Outlier Detection Algorithm for MLlib
[ https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207925#comment-14207925 ] Ashutosh Trivedi commented on SPARK-4038: - The questions raised are valid and we want the community to discuss them. This algorithm deals with categorical data; to my knowledge it uses the simplest approach, calculating the frequency of each attribute value in the data set. Some people in the community are already reviewing it and I am working on it. I did not find any other well-known algorithm that works on categorical data to find outliers. If you are aware of one, please share it with us. Outlier Detection Algorithm for MLlib - Key: SPARK-4038 URL: https://issues.apache.org/jira/browse/SPARK-4038 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ashutosh Trivedi Priority: Minor The aim of this JIRA is to discuss which parallel outlier detection algorithms can be included in MLlib. The one I am familiar with is Attribute Value Frequency (AVF). It scales linearly with the number of data points and attributes, relies on a single data scan, is not distance based, and is well suited for categorical data. In the original paper a parallel version is also given, which is not complicated to implement. I am working on the implementation and will soon submit the initial code for review. Here is the link to the paper: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4410382 As pointed out by Xiangrui in the discussion http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-td8880.html there are other algorithms as well. Let's discuss which will be more general and more easily parallelized. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
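A rough RDD-level sketch of AVF as described above (an illustration only, not the code under review): the score of a row is the mean frequency of its attribute values, and rows with the lowest scores are outlier candidates.
{code}
// tiny categorical data set, purely illustrative
val data = sc.parallelize(Seq(
  Array("red", "small"), Array("red", "large"), Array("red", "small"), Array("blue", "tiny")))

// frequency of each (column index, value) pair -- one scan over the data
val freqs = data.flatMap(_.zipWithIndex.map { case (v, i) => ((i, v), 1L) })
                .reduceByKey(_ + _)
                .collectAsMap()
val bcFreqs = sc.broadcast(freqs)

// AVF score per row: average frequency of its attribute values (lower = more anomalous)
val scored = data.map { row =>
  val score = row.zipWithIndex.map { case (v, i) => bcFreqs.value((i, v)).toDouble }.sum / row.length
  (row.mkString(","), score)
}
scored.sortBy(_._2).take(1)   // the most outlying row
{code}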
[jira] [Comment Edited] (SPARK-4038) Outlier Detection Algorithm for MLlib
[ https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207925#comment-14207925 ] Ashutosh Trivedi edited comment on SPARK-4038 at 11/12/14 10:53 AM: The questions raised are valid and we want community to discuss it. This algorithm deals with categorical data, It uses the simplest approach by calculating frequency of each attribute in the data set. Some of the people in community are already doing the review and I am working on it. I did not find any other algorithm which work on categorical data to find outliers. If you are aware of any other algorithm which is well known please share with us. was (Author: rusty): The questions raised are valid and we want community to discuss it. This algorithm deals with categorical data, In my knowledge it uses the simplest approach by calculating frequency of each attribute in the data set. Some of the people in community are already doing the review and I am working on it. I did not find any other algorithm which work on categorical data to find outliers. If you are aware of any other algorithm which is well known please share with us. Outlier Detection Algorithm for MLlib - Key: SPARK-4038 URL: https://issues.apache.org/jira/browse/SPARK-4038 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ashutosh Trivedi Priority: Minor The aim of this JIRA is to discuss about which parallel outlier detection algorithms can be included in MLlib. The one which I am familiar with is Attribute Value Frequency (AVF). It scales linearly with the number of data points and attributes, and relies on a single data scan. It is not distance based and well suited for categorical data. In original paper a parallel version is also given, which is not complected to implement. I am working on the implementation and soon submit the initial code for review. Here is the Link for the paper http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4410382 As pointed out by Xiangrui in discussion http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-td8880.html There are other algorithms also. Lets discuss about which will be more general and easily paralleled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4362) Make prediction probability available in Naive Baye's Model
Jatinpreet Singh created SPARK-4362: --- Summary: Make prediction probability available in Naive Baye's Model Key: SPARK-4362 URL: https://issues.apache.org/jira/browse/SPARK-4362 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jatinpreet Singh Priority: Minor Fix For: 1.2.0 There is currently no way to get the posterior probability of a prediction with Naive Baye's model during prediction. This should be made available along with the label. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
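A generic sketch of what exposing the posterior could look like (this is not the MLlib API; pi and theta stand for the model's log class priors and log conditional probabilities, and the whole function is only an illustration):
{code}
// returns one posterior probability per class for a multinomial Naive Bayes model
def posteriors(pi: Array[Double], theta: Array[Array[Double]], x: Array[Double]): Array[Double] = {
  val logProb = pi.indices.map { k =>
    pi(k) + theta(k).zip(x).map { case (t, xi) => t * xi }.sum
  }
  val m = logProb.max
  val unnorm = logProb.map(lp => math.exp(lp - m))   // stabilise before normalising
  val z = unnorm.sum
  unnorm.map(_ / z).toArray
}
{code}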
[jira] [Created] (SPARK-4363) The Broadcast example is out of date
Shixiong Zhu created SPARK-4363: --- Summary: The Broadcast example is out of date Key: SPARK-4363 URL: https://issues.apache.org/jira/browse/SPARK-4363 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Shixiong Zhu Priority: Trivial The Broadcast example is out of date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4363) The Broadcast example is out of date
[ https://issues.apache.org/jira/browse/SPARK-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207942#comment-14207942 ] Apache Spark commented on SPARK-4363: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3226 The Broadcast example is out of date Key: SPARK-4363 URL: https://issues.apache.org/jira/browse/SPARK-4363 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Shixiong Zhu Priority: Trivial The Broadcast example is out of date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4364) Some variable types in org.apache.spark.streaming.JavaAPISuite are wrong
Shixiong Zhu created SPARK-4364: --- Summary: Some variable types in org.apache.spark.streaming.JavaAPISuite are wrong Key: SPARK-4364 URL: https://issues.apache.org/jira/browse/SPARK-4364 Project: Spark Issue Type: Test Components: Streaming Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Trivial Because of type erasure, the unit tests still pass. However, the wrong variable types will confuse people. The locations of these variables can be found in my PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4364) Some variable types in org.apache.spark.streaming.JavaAPISuite are wrong
[ https://issues.apache.org/jira/browse/SPARK-4364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207981#comment-14207981 ] Sean Owen commented on SPARK-4364: -- This is already covered in SPARK-4297 Some variable types in org.apache.spark.streaming.JavaAPISuite are wrong Key: SPARK-4364 URL: https://issues.apache.org/jira/browse/SPARK-4364 Project: Spark Issue Type: Test Components: Streaming Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Trivial Labels: unit-test Because of the type erase, the unit tests will pass. However, the wrong variable types will confuse people. The locations of these variables can be found in my PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4362) Make prediction probability available in NaiveBayesModel
[ https://issues.apache.org/jira/browse/SPARK-4362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4362: - Summary: Make prediction probability available in NaiveBayesModel (was: Make prediction probability available in Naive Baye's Model) Make prediction probability available in NaiveBayesModel Key: SPARK-4362 URL: https://issues.apache.org/jira/browse/SPARK-4362 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jatinpreet Singh Priority: Minor Labels: naive-bayes Fix For: 1.2.0 There is currently no way to get the posterior probability of a prediction with Naive Baye's model during prediction. This should be made available along with the label. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4364) Some variable types in org.apache.spark.streaming.JavaAPISuite are wrong
[ https://issues.apache.org/jira/browse/SPARK-4364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207995#comment-14207995 ] Apache Spark commented on SPARK-4364: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3227 Some variable types in org.apache.spark.streaming.JavaAPISuite are wrong Key: SPARK-4364 URL: https://issues.apache.org/jira/browse/SPARK-4364 Project: Spark Issue Type: Test Components: Streaming Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Trivial Labels: unit-test Because of the type erase, the unit tests will pass. However, the wrong variable types will confuse people. The locations of these variables can be found in my PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2867) saveAsHadoopFile() in PairRDDFunction.scala should allow use other OutputCommiter class
[ https://issues.apache.org/jira/browse/SPARK-2867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207997#comment-14207997 ] Romi Kuntsman commented on SPARK-2867: -- In the latest code, this seems to be resolved:
{code}
// Use configured output committer if already set
if (conf.getOutputCommitter == null) {
  hadoopConf.setOutputCommitter(classOf[FileOutputCommitter])
}
{code}
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L934 saveAsHadoopFile() in PairRDDFunction.scala should allow use other OutputCommiter class --- Key: SPARK-2867 URL: https://issues.apache.org/jira/browse/SPARK-2867 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Joseph Su Priority: Minor The saveAsHadoopFile() in PairRDDFunction.scala hard-coded the OutputCommitter class as FileOutputCommitter because of the following code in the source: hadoopConf.setOutputCommitter(classOf[FileOutputCommitter]) However, OutputCommitter is a configurable option in a regular Hadoop MapReduce program. Users can specify mapred.output.committer.class to change the committer class used by other Hadoop programs. The saveAsHadoopFile() function should remove this hard-coded assignment and provide a way to specify the OutputCommitter used here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
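Given that check, a caller should be able to choose its own committer by configuring the JobConf before the save. A hedged sketch (MyCustomOutputCommitter and pairRdd are stand-ins introduced only for illustration):
{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileOutputCommitter, FileOutputFormat, JobConf, TextOutputFormat}

// hypothetical custom committer, standing in for whatever the user really wants
class MyCustomOutputCommitter extends FileOutputCommitter

val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.setOutputFormat(classOf[TextOutputFormat[Text, Text]])
jobConf.setOutputCommitter(classOf[MyCustomOutputCommitter])
FileOutputFormat.setOutputPath(jobConf, new Path("/out"))

// pairRdd: RDD[(Text, Text)], assumed to exist
pairRdd.saveAsHadoopDataset(jobConf)   // with the null check above, this committer is no longer overwritten
{code}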
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208028#comment-14208028 ] Cristian Opris commented on SPARK-3633: --- FWIW I get this as well, with a very straightforward job and setup. Spark 1.1.0, executors configured to 2GB, storage.fraction=0.2, shuffle.spill=true 50GB dataset on ext4, spread over 7000 files, hence the coalescing below The jobs is only doing: input.coalesce(72, false).groupBy(key).count The groupBy is successful then I get the dreaded fetch error on count stage (oddly enough), but it seems to me that's when it does the actual shuffling for groupBy ? Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Priority: Critical Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s): {code} 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120) 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages {code} In order to identify the problem, I carried out change set analysis. As I go back in time, the error message changes to: {code} 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files) java.io.FileOutputStream.open(Native Method) java.io.FileOutputStream.init(FileOutputStream.java:221) org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185) org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145) org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} All the way until Aug 4th. Turns out the problem changeset is 4fde28c. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4365) Remove unnecessary filter call on records returned from parquet library
Yash Datta created SPARK-4365: - Summary: Remove unnecessary filter call on records returned from parquet library Key: SPARK-4365 URL: https://issues.apache.org/jira/browse/SPARK-4365 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Yash Datta Priority: Minor Fix For: 1.2.0 Since the parquet library has been updated, we no longer need to filter the records returned from the parquet library for null records, as the library now skips those. From parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java:
{code}
public boolean nextKeyValue() throws IOException, InterruptedException {
  boolean recordFound = false;
  while (!recordFound) {
    // no more records left
    if (current >= total) { return false; }
    try {
      checkRead();
      currentValue = recordReader.read();
      current ++;
      if (recordReader.shouldSkipCurrentRecord()) {
        // this record is being filtered via the filter2 package
        if (DEBUG) LOG.debug("skipping record");
        continue;
      }
      if (currentValue == null) {
        // only happens with FilteredRecordReader at end of block
        current = totalCountLoadedSoFar;
        if (DEBUG) LOG.debug("filtered record reader reached end of block");
        continue;
      }
      recordFound = true;
      if (DEBUG) LOG.debug("read value: " + currentValue);
    } catch (RuntimeException e) {
      throw new ParquetDecodingException(format("Can not read value at %d in block %d in file %s", current, currentBlock, file), e);
    }
  }
  return true;
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4365) Remove unnecessary filter call on records returned from parquet library
[ https://issues.apache.org/jira/browse/SPARK-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208068#comment-14208068 ] Apache Spark commented on SPARK-4365: - User 'saucam' has created a pull request for this issue: https://github.com/apache/spark/pull/3229 Remove unnecessary filter call on records returned from parquet library --- Key: SPARK-4365 URL: https://issues.apache.org/jira/browse/SPARK-4365 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Yash Datta Priority: Minor Fix For: 1.2.0 Since parquet library has been updated , we no longer need to filter the records returned from parquet library for null records , as now the library skips those : from parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java public boolean nextKeyValue() throws IOException, InterruptedException { boolean recordFound = false; while (!recordFound) { // no more records left if (current = total) { return false; } try { checkRead(); currentValue = recordReader.read(); current ++; if (recordReader.shouldSkipCurrentRecord()) { // this record is being filtered via the filter2 package if (DEBUG) LOG.debug(skipping record); continue; } if (currentValue == null) { // only happens with FilteredRecordReader at end of block current = totalCountLoadedSoFar; if (DEBUG) LOG.debug(filtered record reader reached end of block); continue; } recordFound = true; if (DEBUG) LOG.debug(read value: + currentValue); } catch (RuntimeException e) { throw new ParquetDecodingException(format(Can not read value at %d in block %d in file %s, current, currentBlock, file), e); } } return true; } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4320) JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object
[ https://issues.apache.org/jira/browse/SPARK-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208084#comment-14208084 ] Corey J. Nolet commented on SPARK-4320: --- Since this is a simple change, I wanted to work on this myself to get more familiar with the code base. Could someone w/ the proper privileges give me access to be able to assign this ticket to myself? JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object Key: SPARK-4320 URL: https://issues.apache.org/jira/browse/SPARK-4320 Project: Spark Issue Type: Improvement Components: Input/Output, Spark Core Reporter: Corey J. Nolet Fix For: 1.1.1, 1.2.0 I am outputting data to Accumulo using a custom OutputFormat. I have tried using saveAsNewHadoopFile() and that works- though passing an empty path is a bit weird. Being that it isn't really a file I'm storing, but rather a generic Pair dataset, I'd be inclined to use the saveAsHadoopDataset() method, though I'm not at all interested in using the legacy mapred API. Perhaps we could supply a saveAsNewHadoopDateset method. Personally, I think there should be two ways of calling into this method. Instead of forcing the user to always set up the Job object explicitly, I'm in the camp of having the following method signature: saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : Class[? extends OutputFormat], conf : Configuration). This way, if I'm writing spark jobs that are going from Hadoop back into Hadoop, I can construct my Configuration once. Perhaps an overloaded method signature could be: saveAsNewHadoopDataset(job : Job) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
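For reference, the pattern described above -- saving to a non-file OutputFormat through saveAsNewAPIHadoopFile with an unused path -- looks roughly like this (the Accumulo classes and the rdd/conf values are assumed for illustration):
{code}
import org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat
import org.apache.accumulo.core.data.Mutation
import org.apache.hadoop.io.Text

// rdd: RDD[(Text, Mutation)], conf: a Configuration already populated for Accumulo (both assumed)
rdd.saveAsNewAPIHadoopFile(
  "",                              // path argument is ignored by this OutputFormat
  classOf[Text],
  classOf[Mutation],
  classOf[AccumuloOutputFormat],
  conf)
{code}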
[jira] [Created] (SPARK-4366) Aggregation Optimization
Cheng Hao created SPARK-4366: Summary: Aggregation Optimization Key: SPARK-4366 URL: https://issues.apache.org/jira/browse/SPARK-4366 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao This improvement actually includes couple of sub tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4367) Process the distinct value before shuffling for aggregation
Cheng Hao created SPARK-4367: Summary: Process the distinct value before shuffling for aggregation Key: SPARK-4367 URL: https://issues.apache.org/jira/browse/SPARK-4367 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Most of aggregate function(e.g average) with distinct value will requires all of the records in the same group to be shuffled into a single node, however, as part of the optimization, those records can be partially aggregated before shuffling, that probably reduces the overhead of shuffling significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
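A rough RDD-level illustration of the idea (the real change would live in the SQL aggregation operators, so this is only a sketch): de-duplicating (group, value) pairs on the map side means far fewer records cross the shuffle for an aggregate like AVG(DISTINCT value).
{code}
val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 1.0), ("a", 3.0), ("b", 2.0)))

// de-duplicate within each partition before anything is shuffled
val locallyDistinct = pairs.mapPartitions(iter => iter.toSet.iterator, preservesPartitioning = true)

val avgDistinct = locallyDistinct
  .distinct()                                               // global de-dup over the already-reduced data
  .mapValues(v => (v, 1L))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum / count }           // ("a", 2.0), ("b", 2.0)
{code}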
[jira] [Updated] (SPARK-4233) Simplify the Aggregation Function implementation
[ https://issues.apache.org/jira/browse/SPARK-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-4233: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-4366 Simplify the Aggregation Function implementation Key: SPARK-4233 URL: https://issues.apache.org/jira/browse/SPARK-4233 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Currently, the UDAF implementation is quite complicated, and we have to provide distinct non-distinct version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4367) Process the distinct value before shuffling for aggregation
[ https://issues.apache.org/jira/browse/SPARK-4367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-4367: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-4366 Process the distinct value before shuffling for aggregation - Key: SPARK-4367 URL: https://issues.apache.org/jira/browse/SPARK-4367 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Most of aggregate function(e.g average) with distinct value will requires all of the records in the same group to be shuffled into a single node, however, as part of the optimization, those records can be partially aggregated before shuffling, that probably reduces the overhead of shuffling significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208028#comment-14208028 ] Cristian Opris edited comment on SPARK-3633 at 11/12/14 3:20 PM: - FWIW I get this as well, with a very straightforward job and setup. Spark 1.1.0, executors configured to 2GB, storage.fraction=0.2, shuffle.spill=true 50GB dataset on ext4, spread over 7000 files, hence the coalescing below The jobs is only doing: input.coalesce(72, false).groupBy(key).count The groupBy is successful then I get the dreaded fetch error on count stage (oddly enough), but it seems to me that's when it does the actual shuffling for groupBy ? EDIT: This might be due to Full GC on the executors during the shuffle block transfer phase. What's interesting is that it doesn't go OOM and the same amount is collected every time. (Old gen is 1.5 GB) 2014-11-12T07:17:06.899-0800: 477.697: [Full GC [PSYoungGen: 248320K-0K(466432K)] [ParOldGen: 1355469K-1301675K(1398272K)] 1603789K-1301675K(1864704K) [PSPermGen: 39031K-39031K(39424K)], 0.6565240 secs] [Times: user=3.35 sys=0.00, real=0.66 secs] 2014-11-12T07:17:07.751-0800: 478.549: [Full GC [PSYoungGen: 248320K-0K(466432K)] [ParOldGen: 1301681K-1268312K(1398272K)] 1550001K-1268312K(1864704K) [PSPermGen: 39031K-39031K(39424K)], 0.5821160 secs] [Times: user=3.16 sys=0.00, real=0.58 secs] 2014-11-12T07:17:08.495-0800: 479.294: [Full GC [PSYoungGen: 248320K-0K(466432K)] [ParOldGen: 1268314K-1300497K(1398272K)] 1516634K-1300497K(1864704K) [PSPermGen: 39031K-39031K(39424K)], 0.6400670 secs] [Times: user=4.07 sys=0.01, real=0.64 secs] was (Author: onetoinfin...@yahoo.com): FWIW I get this as well, with a very straightforward job and setup. Spark 1.1.0, executors configured to 2GB, storage.fraction=0.2, shuffle.spill=true 50GB dataset on ext4, spread over 7000 files, hence the coalescing below The jobs is only doing: input.coalesce(72, false).groupBy(key).count The groupBy is successful then I get the dreaded fetch error on count stage (oddly enough), but it seems to me that's when it does the actual shuffling for groupBy ? Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Priority: Critical Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s): {code} 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120) 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages {code} In order to identify the problem, I carried out change set analysis. 
As I go back in time, the error message changes to: {code} 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files) java.io.FileOutputStream.open(Native Method) java.io.FileOutputStream.init(FileOutputStream.java:221) org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185) org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145) org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} All the way until Aug 4th. Turns out the problem changeset is 4fde28c. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For
[jira] [Updated] (SPARK-3056) Sort-based Aggregation
[ https://issues.apache.org/jira/browse/SPARK-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-3056: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-4366 Sort-based Aggregation -- Key: SPARK-3056 URL: https://issues.apache.org/jira/browse/SPARK-3056 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Currently, Spark SQL only supports hash-based aggregation, which may cause OOM if there are too many identical keys in the input tuples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
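For context, the idea behind sort-based aggregation is that once the input is sorted by key, each group can be aggregated in a single streaming pass, so memory use no longer grows with the number of keys. A minimal sketch of the idea in plain Scala (illustrative only, not Spark SQL's actual implementation):
{code}
// Sort-based aggregation sketch: with rows already sorted by key, only the
// current group's running state has to live in memory at any point.
object SortAggregationSketch {
  def countSorted[K](sortedKeys: Iterator[K]): Iterator[(K, Long)] =
    new Iterator[(K, Long)] {
      private val in = sortedKeys.buffered
      def hasNext: Boolean = in.hasNext
      def next(): (K, Long) = {
        val key = in.head
        var count = 0L
        while (in.hasNext && in.head == key) { in.next(); count += 1 }
        (key, count)
      }
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq("a", "a", "b", "b", "b", "c").iterator   // assumed pre-sorted by key
    countSorted(rows).foreach(println)                       // (a,2) (b,3) (c,1)
  }
}
{code}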
[jira] [Comment Edited] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208028#comment-14208028 ] Cristian Opris edited comment on SPARK-3633 at 11/12/14 3:36 PM: - FWIW I get this as well, with a very straightforward job and setup. Spark 1.1.0, executors configured to 2GB, storage.fraction=0.2, shuffle.spill=true 50GB dataset on ext4, spread over 7000 files, hence the coalescing below The jobs is only doing: input.coalesce(72, false).groupBy(key).count The groupBy is successful then I get the dreaded fetch error on count stage (oddly enough), but it seems to me that's when it does the actual shuffling for groupBy ? EDIT: This might be due to Full GC on the executors during the shuffle block transfer phase. What's interesting is that it doesn't go OOM and the same amount is collected every time. (Old gen is 1.5 GB) 2014-11-12T07:17:06.899-0800: 477.697: [Full GC [PSYoungGen: 248320K-0K(466432K)] [ParOldGen: 1355469K-1301675K(1398272K)] 1603789K-1301675K(1864704K) [PSPermGen: 39031K-39031K(39424K)], 0.6565240 secs] [Times: user=3.35 sys=0.00, real=0.66 secs] 2014-11-12T07:17:07.751-0800: 478.549: [Full GC [PSYoungGen: 248320K-0K(466432K)] [ParOldGen: 1301681K-1268312K(1398272K)] 1550001K-1268312K(1864704K) [PSPermGen: 39031K-39031K(39424K)], 0.5821160 secs] [Times: user=3.16 sys=0.00, real=0.58 secs] 2014-11-12T07:17:08.495-0800: 479.294: [Full GC [PSYoungGen: 248320K-0K(466432K)] [ParOldGen: 1268314K-1300497K(1398272K)] 1516634K-1300497K(1864704K) [PSPermGen: 39031K-39031K(39424K)], 0.6400670 secs] [Times: user=4.07 sys=0.01, real=0.64 secs] EDIT2: Changing to G1 collector actually causes it to go OOM. This must be related somehow to the number of shuffle files and hence perhaps open buffers as lowering the number of reducers from 72 to 10 runs without issues (note I'm using consolidated shuffle files). 14/11/12 07:30:53 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Connection manager future execution context-2,5,main] java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94) at org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176) at org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63) at org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$2.apply(BlockFetcherIterator.scala:124) at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$2.apply(BlockFetcherIterator.scala:121) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) was (Author: onetoinfin...@yahoo.com): FWIW I get this as well, with a very straightforward job and setup. 
Spark 1.1.0, executors configured to 2GB, storage.fraction=0.2, shuffle.spill=true 50GB dataset on ext4, spread over 7000 files, hence the coalescing below The jobs is only doing: input.coalesce(72, false).groupBy(key).count The groupBy is successful then I get the dreaded fetch error on count stage (oddly enough), but it seems to me that's when it does the actual shuffling for groupBy ? EDIT: This might be due to Full GC on the executors during the shuffle block transfer phase. What's interesting is that it doesn't go OOM and the same amount is collected every time. (Old gen is 1.5 GB) 2014-11-12T07:17:06.899-0800: 477.697: [Full GC [PSYoungGen: 248320K-0K(466432K)] [ParOldGen: 1355469K-1301675K(1398272K)] 1603789K-1301675K(1864704K) [PSPermGen: 39031K-39031K(39424K)], 0.6565240 secs] [Times: user=3.35 sys=0.00, real=0.66 secs] 2014-11-12T07:17:07.751-0800: 478.549: [Full GC [PSYoungGen: 248320K-0K(466432K)] [ParOldGen: 1301681K-1268312K(1398272K)] 1550001K-1268312K(1864704K) [PSPermGen: 39031K-39031K(39424K)], 0.5821160 secs] [Times: user=3.16 sys=0.00, real=0.58 secs] 2014-11-12T07:17:08.495-0800: 479.294: [Full GC [PSYoungGen: 248320K-0K(466432K)] [ParOldGen: 1268314K-1300497K(1398272K)] 1516634K-1300497K(1864704K) [PSPermGen: 39031K-39031K(39424K)], 0.6400670 secs] [Times: user=4.07 sys=0.01, real=0.64 secs] Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug
[jira] [Commented] (SPARK-1014) MultilogisticRegressionWithSGD
[ https://issues.apache.org/jira/browse/SPARK-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208226#comment-14208226 ] Sean Owen commented on SPARK-1014: -- I'm curious if this is still active -- where was the PR? was this just one-vs-all LR ? MultilogisticRegressionWithSGD -- Key: SPARK-1014 URL: https://issues.apache.org/jira/browse/SPARK-1014 Project: Spark Issue Type: New Feature Affects Versions: 0.9.0 Reporter: Kun Yang Multilogistic Regression With SGD based on mllib packages Use labeledpoint, gradientDescent to train the model -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
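For readers unfamiliar with the term, one-vs-all (one-vs-rest) reduces a k-class problem to k binary problems. A rough sketch built from the binary LogisticRegressionWithSGD that already exists in MLlib (illustrative only; not the code proposed for this ticket):
{code}
// One-vs-rest sketch: train one binary model per class, relabeling each point as
// 1.0 for "this class" and 0.0 for "any other class".
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def trainOneVsRest(data: RDD[LabeledPoint], numClasses: Int) = {
  (0 until numClasses).map { k =>
    val binary = data.map(p => LabeledPoint(if (p.label == k.toDouble) 1.0 else 0.0, p.features))
    LogisticRegressionWithSGD.train(binary, 100)   // model for "class k vs rest"
  }
}
{code}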
[jira] [Resolved] (SPARK-1245) Can't read EMR HBase cluster from properly built Cloudera Spark Cluster.
[ https://issues.apache.org/jira/browse/SPARK-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1245. -- Resolution: Not a Problem I'm guessing this is now either obsolete, or, a case of matching HBase / Hadoop versions exactly. Spark should be provided, and not marking as such may mean the Spark Hadoop / cluster Hadoop / HBase Hadoop deps are colliding. Can't read EMR HBase cluster from properly built Cloudera Spark Cluster. Key: SPARK-1245 URL: https://issues.apache.org/jira/browse/SPARK-1245 Project: Spark Issue Type: Bug Reporter: Sam Abeyratne Can't read EMR HBase cluster from properly built Cloudera Spark Cluster. If I scp hadoop-yarn-client-2.2.0.jar from our EMR hbase cluster lib dir and manually add it as a lib to my jar it does NOT give me a noSuchMethod error, but does give me a weird EOF exception (see below). Usually I use SBT to build Jars, but the EMR distros are very strange I can't find a proper repository for them. I'm thinking only thing we can do is get our sysadm to rebuild the hbase cluster to use a proper cloudera hbase / hadoop. SBT Dependencies include: org.apache.spark % spark-core_2.10 % 0.9.0-incubating, org.apache.hbase % hbase % 0.94.7, 14/03/11 19:08:06 WARN scheduler.TaskSetManager: Lost TID 95 (task 0.0:3) 14/03/11 19:08:06 WARN scheduler.TaskSetManager: Loss was due to java.io.EOFException java.io.EOFException at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2744) at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1015) at org.apache.hadoop.io.WritableUtils.readCompressedByteArray(WritableUtils.java:39) at org.apache.hadoop.io.WritableUtils.readCompressedString(WritableUtils.java:87) at org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:185) at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2433) at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:280) at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:75) at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39) at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:165) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at
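A minimal build.sbt sketch of the "Spark should be provided" suggestion above (artifact versions are illustrative and would need to match what the cluster actually runs):
{code}
// build.sbt sketch: mark Spark and hadoop-client as "provided" so the cluster's
// own jars are used at runtime, and keep the HBase/Hadoop client versions aligned
// with the cluster to avoid colliding Hadoop dependencies in the fat jar.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "0.9.0-incubating" % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "2.2.0"            % "provided",
  "org.apache.hbase"  %  "hbase"         % "0.94.7"
)
{code}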
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208258#comment-14208258 ] Anson Abraham commented on SPARK-1867: -- I'm running 1.1 (standalone) w/o yarn on CDH 5.2. I'm just doing a quick test: val source = sc.textFile(/tmp/testfile.txt) source.saveAsTextFile(/tmp/test_spark_output) and I'm hitting that issue, java.lang.IllegalStateException: unread block data. The versions on all the nodes are identical. I can't figure out what the exact issue is. Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. 
The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59] My code snippet: val conf = new SparkConf() .setMaster(clusterMaster) .setAppName(appName) .setSparkHome(sparkHome) .setJars(SparkContext.jarOfClass(this.getClass)) println(count = + new SparkContext(conf).textFile(someHdfsPath).count()) My SBT dependencies: // relevant org.apache.spark % spark-core_2.10 % 0.9.1, org.apache.hadoop % hadoop-client % 2.3.0-mr1-cdh5.0.0, // standard, probably unrelated com.github.seratch %% awscala % [0.2,), org.scalacheck %% scalacheck % 1.10.1 % test, org.specs2 %% specs2 % 1.14 % test, org.scala-lang % scala-reflect % 2.10.3, org.scalaz %% scalaz-core % 7.0.5, net.minidev % json-smart % 1.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
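For readability, here is the same driver setup as a self-contained sketch, with string quoting restored and placeholders for the cluster-specific values (master URL, Spark home and HDFS path are assumptions, not values from the report):
{code}
// Minimal sketch of the driver setup described in the report above.
import org.apache.spark.{SparkConf, SparkContext}

object UnreadBlockDataRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")      // placeholder
      .setAppName("unread-block-data-repro")
      .setSparkHome("/opt/spark")                 // placeholder
      .setJars(SparkContext.jarOfClass(this.getClass).toSeq)  // ship the fat jar to executors
    val count = new SparkContext(conf).textFile("hdfs:///some/path").count()
    println(s"count = $count")
  }
}
{code}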
[jira] [Created] (SPARK-4368) Ceph integration?
Serge Smertin created SPARK-4368: Summary: Ceph integration? Key: SPARK-4368 URL: https://issues.apache.org/jira/browse/SPARK-4368 Project: Spark Issue Type: Bug Components: Input/Output Reporter: Serge Smertin There is a use-case of storing a large number of relatively small BLOB objects (2-20Mb), which requires some ugly workarounds in HDFS environments. There is a need to process those BLOBs close to the data themselves, which is why the MapReduce paradigm is a good fit, as it guarantees data locality. Ceph seems to be one of the systems that maintains both properties (small files and data locality) - http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/032119.html. I already know that Spark supports GlusterFS - http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccf657f2b.5b3a1%25ven...@yarcdata.com%3E So I wonder, could there be an integration with this storage solution, and what would be the effort of doing that? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208318#comment-14208318 ] Cristian Opris commented on SPARK-3633: --- This looks like a memory leak in ConnectionManager where responses (BufferMessage) are retained by the TimerTask waiting for ACK even after the Future completes with Success, please see the Possibly related to https://github.com/apache/spark/commit/76fa0eaf515fd6771cdd69422b1259485debcae5 +--+--+--+-+ |Class | Objects| Shallow Size | Retained Size | +--+--+--+-+ | java.util.TaskQueue |10 % | 240 % | 885,048,168 100 % | | java.util.TimerTask[] |10 % | 2,0640 % | 885,048,144 99 % | | org.apache.spark.network.ConnectionManager$$anon$5 | 2865 % | 13,7280 % | ~ 885,046,080 99 % | | org.apache.spark.network.BufferMessage | 572 10 % | 36,6080 % | ~ 885,018,624 99 % | | scala.concurrent.impl.Promise$DefaultPromise| 2865 % | 4,5760 % | ~ 884,968,288 99 % | | scala.util.Success | 2865 % | 4,5760 % | ~ 884,963,712 99 % | | scala.collection.mutable.ArrayBuffer| 572 10 % | 13,7280 % | ~ 884,915,768 99 % | | java.lang.Object[] | 572 10 % | 45,7600 % | ~ 884,902,040 99 % | | java.nio.HeapByteBuffer | 2865 % | 13,7280 % | ~ 884,856,280 99 % | | byte[] | 2865 % | 884,842,552 99 % | ~ 884,842,552 99 % | | java.net.InetSocketAddress | 572 10 % | 9,1520 % | ~ 66,2480 % | | java.net.InetSocketAddress$InetSocketAddressHolder | 572 10 % | 13,7280 % | ~ 57,0960 % | | java.net.Inet4Address | 2865 % | 6,8640 % | ~ 43,3680 % | | java.net.InetAddress$InetAddressHolder | 2865 % | 6,8640 % | ~ 36,5040 % | | java.lang.String| 2855 % | 6,8400 % | ~ 29,6400 % | | char[] | 2855 % | 22,8000 % | ~ 22,8000 % | | java.lang.Object| 2865 % | 4,5760 % |~ 4,5760 % | +--+--+--+-+ Generated by YourKit Java Profiler 2014 build 14110 12-Nov-2014 17:44:32 Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Priority: Critical Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s): {code} 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120) 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages {code} In order to identify the problem, I carried out change set analysis. As I go back in time, the error message changes to: {code} 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files) java.io.FileOutputStream.open(Native Method) java.io.FileOutputStream.init(FileOutputStream.java:221) org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185) org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
[jira] [Comment Edited] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208318#comment-14208318 ] Cristian Opris edited comment on SPARK-3633 at 11/12/14 5:48 PM: - This looks like a memory leak in ConnectionManager where responses (BufferMessage) are retained by the TimerTask waiting for ACK even after the Future completes with Success, please see the reference chain from a heap dump below Possibly related to https://github.com/apache/spark/commit/76fa0eaf515fd6771cdd69422b1259485debcae5 +--+--+--+-+ |Class | Objects| Shallow Size | Retained Size | +--+--+--+-+ | java.util.TaskQueue |10 % | 240 % | 885,048,168 100 % | | java.util.TimerTask[] |10 % | 2,0640 % | 885,048,144 99 % | | org.apache.spark.network.ConnectionManager$$anon$5 | 2865 % | 13,7280 % | ~ 885,046,080 99 % | | org.apache.spark.network.BufferMessage | 572 10 % | 36,6080 % | ~ 885,018,624 99 % | | scala.concurrent.impl.Promise$DefaultPromise| 2865 % | 4,5760 % | ~ 884,968,288 99 % | | scala.util.Success | 2865 % | 4,5760 % | ~ 884,963,712 99 % | | scala.collection.mutable.ArrayBuffer| 572 10 % | 13,7280 % | ~ 884,915,768 99 % | | java.lang.Object[] | 572 10 % | 45,7600 % | ~ 884,902,040 99 % | | java.nio.HeapByteBuffer | 2865 % | 13,7280 % | ~ 884,856,280 99 % | | byte[] | 2865 % | 884,842,552 99 % | ~ 884,842,552 99 % | | java.net.InetSocketAddress | 572 10 % | 9,1520 % | ~ 66,2480 % | | java.net.InetSocketAddress$InetSocketAddressHolder | 572 10 % | 13,7280 % | ~ 57,0960 % | | java.net.Inet4Address | 2865 % | 6,8640 % | ~ 43,3680 % | | java.net.InetAddress$InetAddressHolder | 2865 % | 6,8640 % | ~ 36,5040 % | | java.lang.String| 2855 % | 6,8400 % | ~ 29,6400 % | | char[] | 2855 % | 22,8000 % | ~ 22,8000 % | | java.lang.Object| 2865 % | 4,5760 % |~ 4,5760 % | +--+--+--+-+ Generated by YourKit Java Profiler 2014 build 14110 12-Nov-2014 17:44:32 was (Author: onetoinfin...@yahoo.com): This looks like a memory leak in ConnectionManager where responses (BufferMessage) are retained by the TimerTask waiting for ACK even after the Future completes with Success, please see the Possibly related to https://github.com/apache/spark/commit/76fa0eaf515fd6771cdd69422b1259485debcae5 +--+--+--+-+ |Class | Objects| Shallow Size | Retained Size | +--+--+--+-+ | java.util.TaskQueue |10 % | 240 % | 885,048,168 100 % | | java.util.TimerTask[] |10 % | 2,0640 % | 885,048,144 99 % | | org.apache.spark.network.ConnectionManager$$anon$5 | 2865 % | 13,7280 % | ~ 885,046,080 99 % | | org.apache.spark.network.BufferMessage | 572 10 % | 36,6080 % | ~ 885,018,624 99 % | | scala.concurrent.impl.Promise$DefaultPromise| 2865 % | 4,5760 % | ~ 884,968,288 99 % | | scala.util.Success | 2865 % | 4,5760 % | ~ 884,963,712 99 % | | scala.collection.mutable.ArrayBuffer| 572 10 % | 13,7280 % | ~ 884,915,768 99 % | | java.lang.Object[] | 572 10 % | 45,7600 % | ~ 884,902,040 99 % | | java.nio.HeapByteBuffer
[jira] [Created] (SPARK-4369) TreeModel.predict does not work with RDD
Davies Liu created SPARK-4369: - Summary: TreeModel.predict does not work with RDD Key: SPARK-4369 URL: https://issues.apache.org/jira/browse/SPARK-4369 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Davies Liu Priority: Blocker {code} Stack Trace - Traceback (most recent call last): File /home/rprabhu/Coding/github/SDNDDoS/classification/DecisionTree.py, line 49, in module predictions = model.predict(parsedData.map(lambda x: x.features)) File /home/rprabhu/Software/spark/python/pyspark/mllib/tree.py, line 42, in predict return self.call(predict, x.map(_convert_to_vector)) File /home/rprabhu/Software/spark/python/pyspark/mllib/common.py, line 140, in call return callJavaFunc(self._sc, getattr(self._java_model, name), *a) File /home/rprabhu/Software/spark/python/pyspark/mllib/common.py, line 117, in callJavaFunc return _java2py(sc, func(*args)) File /home/rprabhu/Software/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /home/rprabhu/Software/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 304, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o39.predict. Trace: py4j.Py4JException: Method predict([class org.apache.spark.api.java.JavaRDD]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1014) MultilogisticRegressionWithSGD
[ https://issues.apache.org/jira/browse/SPARK-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208352#comment-14208352 ] Kun Yang commented on SPARK-1014: - I am not sure if you can find the pr on the repository. Please find it on my github: https://github.com/kunyang1987/incubator-spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/MultilogisticRegression.scala MultilogisticRegressionWithSGD -- Key: SPARK-1014 URL: https://issues.apache.org/jira/browse/SPARK-1014 Project: Spark Issue Type: New Feature Affects Versions: 0.9.0 Reporter: Kun Yang Multilogistic Regression With SGD based on mllib packages Use labeledpoint, gradientDescent to train the model -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4370) Limit cores used by Netty transfer service based on executor size
Aaron Davidson created SPARK-4370: - Summary: Limit cores used by Netty transfer service based on executor size Key: SPARK-4370 URL: https://issues.apache.org/jira/browse/SPARK-4370 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical Right now, the NettyBlockTransferService uses the total number of cores on the system as the number of threads and buffer arenas to create. The latter is more troubling -- this can lead to significant allocation of extra heap and direct memory in situations where executors are relatively small compared to the whole machine. For instance, on a machine with 32 cores, we will allocate (32 cores * 16MB per arena = 512MB) * 2 for client and server = 1GB direct and heap memory. This can be a huge overhead if you're only using, say, 8 of those cores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
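The memory estimate in the description can be reproduced with a quick back-of-the-envelope calculation (the 16MB-per-arena figure is taken from the description itself):
{code}
// Back-of-the-envelope from the description: arenas scale with machine cores,
// not with the cores the executor actually uses.
val machineCores = 32
val arenaSizeMb  = 16                           // per-arena size quoted in the description
val perSideMb    = machineCores * arenaSizeMb   // 512 MB for one transport service
val totalMb      = perSideMb * 2                // client + server => 1024 MB
println(s"estimated Netty buffer memory: $totalMb MB")
{code}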
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208378#comment-14208378 ] Cristian Opris commented on SPARK-3633: --- At first sight (haven't tested this) the problem is in the code below. The TimerTask is cancelled on Success but this doesn't actually remove it from the Timer TaskQueue since the TimerThread doesn't actually remove cancelled tasks until they're actually scheduled to run, which in this case is by default 60 secs ack timeout. A quick fix would be to call Timer.purge() after task cancel below, or better yet change to a better Timer like the HashedWheel one from Netty {code:title=|borderStyle=solid} val status = new MessageStatus(message, connectionManagerId, s = { timeoutTask.cancel() s.ackMessage match { case None = // Indicates a failure where we either never sent or never got ACK'd promise.failure(new IOException(sendMessageReliably failed without being ACK'd)) case Some(ackMessage) = if (ackMessage.hasError) { promise.failure( new IOException(sendMessageReliably failed with ACK that signalled a remote error)) } else { promise.success(ackMessage) } } }) {code} Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Priority: Critical Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s): {code} 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120) 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages {code} In order to identify the problem, I carried out change set analysis. As I go back in time, the error message changes to: {code} 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files) java.io.FileOutputStream.open(Native Method) java.io.FileOutputStream.init(FileOutputStream.java:221) org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185) org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145) org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} All the way until Aug 4th. Turns out the problem changeset is 4fde28c. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
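A minimal sketch of the purge suggestion from the comment above (illustrative only, not the actual ConnectionManager code):
{code}
// After cancelling the per-message ack-timeout task, also purge the Timer so the
// cancelled task (and the BufferMessage it references) is removed from the queue
// right away, instead of lingering until its 60s timeout slot would have fired.
import java.util.{Timer, TimerTask}

val ackTimeoutTimer = new Timer("ack-timeout", /* isDaemon = */ true)

def scheduleAckTimeout(onTimeout: () => Unit, timeoutMs: Long): TimerTask = {
  val task = new TimerTask { override def run(): Unit = onTimeout() }
  ackTimeoutTimer.schedule(task, timeoutMs)
  task
}

def onAckReceived(timeoutTask: TimerTask): Unit = {
  timeoutTask.cancel()     // marks the task cancelled but leaves it queued
  ackTimeoutTimer.purge()  // drops cancelled tasks, releasing their references
}
{code}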
[jira] [Commented] (SPARK-4369) TreeModel.predict does not work with RDD
[ https://issues.apache.org/jira/browse/SPARK-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208377#comment-14208377 ] Apache Spark commented on SPARK-4369: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3230 TreeModel.predict does not work with RDD Key: SPARK-4369 URL: https://issues.apache.org/jira/browse/SPARK-4369 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Davies Liu Priority: Blocker {code} Stack Trace - Traceback (most recent call last): File /home/rprabhu/Coding/github/SDNDDoS/classification/DecisionTree.py, line 49, in module predictions = model.predict(parsedData.map(lambda x: x.features)) File /home/rprabhu/Software/spark/python/pyspark/mllib/tree.py, line 42, in predict return self.call(predict, x.map(_convert_to_vector)) File /home/rprabhu/Software/spark/python/pyspark/mllib/common.py, line 140, in call return callJavaFunc(self._sc, getattr(self._java_model, name), *a) File /home/rprabhu/Software/spark/python/pyspark/mllib/common.py, line 117, in callJavaFunc return _java2py(sc, func(*args)) File /home/rprabhu/Software/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /home/rprabhu/Software/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 304, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o39.predict. Trace: py4j.Py4JException: Method predict([class org.apache.spark.api.java.JavaRDD]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4370) Limit cores used by Netty transfer service based on executor size
[ https://issues.apache.org/jira/browse/SPARK-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208391#comment-14208391 ] Apache Spark commented on SPARK-4370: - User 'aarondav' has created a pull request for this issue: https://github.com/apache/spark/pull/3155 Limit cores used by Netty transfer service based on executor size - Key: SPARK-4370 URL: https://issues.apache.org/jira/browse/SPARK-4370 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical Right now, the NettyBlockTransferService uses the total number of cores on the system as the number of threads and buffer arenas to create. The latter is more troubling -- this can lead to significant allocation of extra heap and direct memory in situations where executors are relatively small compared to the whole machine. For instance, on a machine with 32 cores, we will allocate (32 cores * 16MB per arena = 512MB) * 2 for client and server = 1GB direct and heap memory. This can be a huge overhead if you're only using, say, 8 of those cores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3530. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3099 [https://github.com/apache/spark/pull/3099] Pipeline and Parameters --- Key: SPARK-3530 URL: https://issues.apache.org/jira/browse/SPARK-3530 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.2.0 This part of the design doc is for pipelines and parameters. I put the design doc at https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at: https://github.com/mengxr/spark-ml/ Please help review the design and post your comments here. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3315) Support hyperparameter tuning
[ https://issues.apache.org/jira/browse/SPARK-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-3315. Resolution: Fixed Fix Version/s: 1.2.0 CrossValidator and ParamGridBuilder were included in the PR for SPARK-3530. I'm closing this now and I will create separate JIRAs for other tuning features. Support hyperparameter tuning - Key: SPARK-3315 URL: https://issues.apache.org/jira/browse/SPARK-3315 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.2.0 Tuning a pipeline and selecting the best set of parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
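For illustration, the shape of the Pipeline / CrossValidator / ParamGridBuilder API introduced by that PR looks roughly like the following sketch (exact class locations and signatures should be checked against the 1.2 release; the column names are assumptions):
{code}
// Rough sketch of the spark.ml pipeline-tuning usage.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)
val pipeline  = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Grid of hyperparameters to search over.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1000, 10000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// val cvModel = cv.fit(training)   // training: a SchemaRDD / DataFrame of (label, text) rows
{code}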
[jira] [Commented] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208472#comment-14208472 ] Manish Amde commented on SPARK-3717: [~bbnsumanth] Look forward to your details of your approach. This is an important ticket and want to make sure that we all agree on the architecture before pursuing the implementation work. Also, as [~josephkb] suggested it might be a good idea to get your feet wet with a couple of small patches to get used to the Spark contribution workflow. DecisionTree, RandomForest: Partition by feature Key: SPARK-3717 URL: https://issues.apache.org/jira/browse/SPARK-3717 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley h1. Summary Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees. h1. Details Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better. Notation: * P = # workers * N = # instances * M = # features * D = depth of tree h2. Partitioning Features Algorithm sketch: * Each worker stores: ** a subset of columns (i.e., a subset of features). If a worker stores feature j, then the worker stores the feature value for all instances (i.e., the whole column). ** all labels * Train one level at a time. * Invariants: ** Each worker stores a mapping: instance → node in current level * On each iteration: ** Each worker: For each node in level, compute (best feature to split, info gain). ** Reduce (P x M) values to M values to find best split for each node. ** Workers who have features used in best splits communicate left/right for relevant instances. Gather total of N bits to master, then broadcast. * Total communication: ** Depth D iterations ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each). ** Estimate: D * (M * 8 + N) h2. Partitioning Instances Algorithm sketch: * Train one group of nodes at a time. * Invariants: * Each worker stores a mapping: instance → node * On each iteration: ** Each worker: For each instance, add to aggregate statistics. ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) *** (“# classes” is for classification. 3 for regression) ** Reduce aggregate. ** Master chooses best split for each node in group and broadcasts. * Local training: Once all instances for a node fit on one machine, it can be best to shuffle data and training subtrees locally. This can mean shuffling the entire dataset for each tree trained. * Summing over all iterations, reduce to total of: ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) ** Estimate: 2^D * M * B * C * 8 h2. 
Comparing Partitioning Methods Partitioning features cost < partitioning instances cost when: * D * (M * 8 + N) < 2^D * M * B * C * 8 * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the right hand side) * N < [ 2^D * M * B * C * 8 ] / D Example: many instances: * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5) * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
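The example numbers can be checked with a few lines of arithmetic (values copied from the example above):
{code}
// Reproducing the example's communication estimates.
val n = 2000000L   // instances
val m = 3500L      // features
val b = 100L       // bins
val c = 5L         // classes
val featureCost  = 6L * (m * 8 + n)       // 6 level-iterations        => ~1.2e7
val instanceCost = 32L * m * b * c * 8    // 2^5 nodes at depth 5      => ~4.5e8
println(s"partition by feature: $featureCost, partition by instance: $instanceCost")
{code}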
[jira] [Comment Edited] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208472#comment-14208472 ] Manish Amde edited comment on SPARK-3717 at 11/12/14 7:02 PM: -- [~bbnsumanth] Look forward to the details of your approach. This is an important ticket and want to make sure that we all agree on the architecture before pursuing the implementation work. Also, as [~josephkb] suggested it might be a good idea to get your feet wet with a couple of small patches to get used to the Spark contribution workflow. was (Author: manishamde): [~bbnsumanth] Look forward to your details of your approach. This is an important ticket and want to make sure that we all agree on the architecture before pursuing the implementation work. Also, as [~josephkb] suggested it might be a good idea to get your feet wet with a couple of small patches to get used to the Spark contribution workflow. DecisionTree, RandomForest: Partition by feature Key: SPARK-3717 URL: https://issues.apache.org/jira/browse/SPARK-3717 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley h1. Summary Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees. h1. Details Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better. Notation: * P = # workers * N = # instances * M = # features * D = depth of tree h2. Partitioning Features Algorithm sketch: * Each worker stores: ** a subset of columns (i.e., a subset of features). If a worker stores feature j, then the worker stores the feature value for all instances (i.e., the whole column). ** all labels * Train one level at a time. * Invariants: ** Each worker stores a mapping: instance → node in current level * On each iteration: ** Each worker: For each node in level, compute (best feature to split, info gain). ** Reduce (P x M) values to M values to find best split for each node. ** Workers who have features used in best splits communicate left/right for relevant instances. Gather total of N bits to master, then broadcast. * Total communication: ** Depth D iterations ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each). ** Estimate: D * (M * 8 + N) h2. Partitioning Instances Algorithm sketch: * Train one group of nodes at a time. * Invariants: * Each worker stores a mapping: instance → node * On each iteration: ** Each worker: For each instance, add to aggregate statistics. ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) *** (“# classes” is for classification. 3 for regression) ** Reduce aggregate. ** Master chooses best split for each node in group and broadcasts. * Local training: Once all instances for a node fit on one machine, it can be best to shuffle data and training subtrees locally. This can mean shuffling the entire dataset for each tree trained. * Summing over all iterations, reduce to total of: ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) ** Estimate: 2^D * M * B * C * 8 h2. 
Comparing Partitioning Methods Partitioning features cost < partitioning instances cost when: * D * (M * 8 + N) < 2^D * M * B * C * 8 * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the right hand side) * N < [ 2^D * M * B * C * 8 ] / D Example: many instances: * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5) * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4371) Spark crashes with JBoss Logging 3.6.1
[ https://issues.apache.org/jira/browse/SPARK-4371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208525#comment-14208525 ] Sean Owen commented on SPARK-4371: -- SLF4J is pretty backwards compatible. The right thing to do in general is update your dependency to 1.7.x in your app. Does that not work? Spark crashes with JBoss Logging 3.6.1 -- Key: SPARK-4371 URL: https://issues.apache.org/jira/browse/SPARK-4371 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Florent Pellerin When using JBoss-logging which itself depends on slf4j 1.6.1, Since SLF4JBridgeHandler.removeHandlersForRootLogger() was added in slf4j 1.6.5, Since spark/Logging.scala is doing at line 147: bridgeClass.getMethod(removeHandlersForRootLogger).invoke(null) Spark is crashing: java.lang.ExceptionInInitializerError: null at java.lang.Class.getMethod(Class.java:1670) at org.apache.spark.Logging$.init(Logging.scala:147) at org.apache.spark.Logging$.clinit(Logging.scala) at org.apache.spark.Logging$class.initializeIfNecessary(Logging.scala:104) at org.apache.spark.Logging$class.log(Logging.scala:51) at org.apache.spark.SecurityManager.log(SecurityManager.scala:143) at org.apache.spark.Logging$class.logInfo(Logging.scala:59) at org.apache.spark.SecurityManager.logInfo(SecurityManager.scala:143) at org.apache.spark.SecurityManager.setViewAcls(SecurityManager.scala:208) at org.apache.spark.SecurityManager.init(SecurityManager.scala:167) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:151) at org.apache.spark.SparkContext.init(SparkContext.scala:203) I suggest Spark should at least silently swallow the exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
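A sketch of the "silently swallow the exception" suggestion from the report (illustrative only, not Spark's actual Logging code):
{code}
// Look up the slf4j bridge method reflectively and simply skip the call when the
// installed slf4j version predates removeHandlersForRootLogger (added in 1.6.5).
import scala.util.control.NonFatal

def uninstallJulHandlersIfPossible(): Unit = {
  try {
    val bridgeClass = Class.forName("org.slf4j.bridge.SLF4JBridgeHandler")
    bridgeClass.getMethod("removeHandlersForRootLogger").invoke(null)
  } catch {
    case _: ClassNotFoundException | _: NoSuchMethodException =>
      // old or absent slf4j-bridge: nothing to uninstall, carry on
    case NonFatal(_) =>
      // optionally log and continue rather than failing SparkContext creation
  }
}
{code}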
[jira] [Reopened] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reopened SPARK-3039: -- Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API -- Key: SPARK-3039 URL: https://issues.apache.org/jira/browse/SPARK-3039 Project: Spark Issue Type: Bug Components: Build, Input/Output, Spark Core Affects Versions: 0.9.1, 1.0.0, 1.1.0 Environment: hadoop2, hadoop-2.4.0, HDP-2.1 Reporter: Bertrand Bossy Assignee: Bertrand Bossy Fix For: 1.2.0 The spark assembly contains the artifact org.apache.avro:avro-mapred as a dependency of org.spark-project.hive:hive-serde. The avro-mapred package provides a hadoop FileInputFormat to read and write avro files. There are two versions of this package, distinguished by a classifier. avro-mapred for the new Hadoop API uses the classifier hadoop2. avro-mapred for the old Hadoop API uses no classifier. E.g. when reading avro files using {code} sc.newAPIHadoopFile[AvroKey[SomeClass]],NullWritable,AvroKeyInputFormat[SomeClass]](hdfs://path/to/file.avro) {code} The following error occurs: {code} java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:111) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} This error usually is a hint that there was a mix up of the old and the new Hadoop API. As a work-around, if avro-mapred for hadoop2 is forced to appear before the version that is bundled with Spark, reading avro files works fine. Also, if Spark is built using avro-mapred for hadoop2, it works fine as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
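The work-around in the last paragraph of the description can be expressed as an explicit dependency on the hadoop2-classified artifact. A build.sbt sketch (the version shown is illustrative, not taken from this report):
{code}
// Depend explicitly on the hadoop2 flavour of avro-mapred so it takes precedence
// over the hadoop1 flavour bundled in the Spark assembly.
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.6" classifier "hadoop2"
{code}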
[jira] [Commented] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208595#comment-14208595 ] Kousuke Saruta commented on SPARK-4267: --- Hi [~ozawa], On my YARN-2.5.1(JDK 1.7.0_60) cluster, Spark Shell works well. I built with following command. {code} sbt/sbt -Dhadoop.version=2.5.1 -Pyarn assembly {code} And launched Spark Shell with following command. {code} bin/spark-shell --master yarn --deploy-mode client --executor-cores 1 --driver-memory 512M --executor-memory 512M --num-executors 1 {code} And then, I ran job with following script. {code} sc.textFile(hdfs:///user/kou/LICENSE.txt).flatMap(line = line.split( )).map(word = (word, 1)).persist().reduceByKey((a, b) = a + b, 16).saveAsTextFile(hdfs:///user/kou/LICENSE.txt.count) {code} So I think the problem is not caused by the version of Hadoop. One possible case is that SparkContext#stop is called between instantiating SparkContext and running job accidentally. Did you see any ERROR log on the shell? Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later -- Key: SPARK-4267 URL: https://issues.apache.org/jira/browse/SPARK-4267 Project: Spark Issue Type: Bug Reporter: Tsuyoshi OZAWA Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0 so I compiled with protobuf 2.5.1 like this: {code} ./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0 {code} Then Spark on YARN fails to launch jobs with NPE. {code} $ bin/spark-shell --master yarn-client scala sc.textFile(hdfs:///user/ozawa/wordcountInput20G).flatMap(line = line.split( )).map(word = (word, 1)).persist().reduceByKey((a, b) = a + b, 16).saveAsTextFile(hdfs:///user/ozawa/sparkWordcountOutNew2); java.lang.NullPointerException at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284) at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291) at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480) at $iwC$$iwC$$iwC$$iwC.init(console:13) at $iwC$$iwC$$iwC.init(console:18) at $iwC$$iwC.init(console:20) at $iwC.init(console:22) at init(console:24) at .init(console:28) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
[jira] [Comment Edited] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208595#comment-14208595 ] Kousuke Saruta edited comment on SPARK-4267 at 11/12/14 8:07 PM: - Hi [~ozawa], On my YARN 2.5.1(JDK 1.7.0_60) cluster, Spark Shell works well. I built with following command. {code} sbt/sbt -Dhadoop.version=2.5.1 -Pyarn assembly {code} And launched Spark Shell with following command. {code} bin/spark-shell --master yarn --deploy-mode client --executor-cores 1 --driver-memory 512M --executor-memory 512M --num-executors 1 {code} And then, I ran job with following script. {code} sc.textFile(hdfs:///user/kou/LICENSE.txt).flatMap(line = line.split( )).map(word = (word, 1)).persist().reduceByKey((a, b) = a + b, 16).saveAsTextFile(hdfs:///user/kou/LICENSE.txt.count) {code} So I think the problem is not caused by the version of Hadoop. One possible case is that SparkContext#stop is called between instantiating SparkContext and running job accidentally. Did you see any ERROR log on the shell? was (Author: sarutak): Hi [~ozawa], On my YARN-2.5.1(JDK 1.7.0_60) cluster, Spark Shell works well. I built with following command. {code} sbt/sbt -Dhadoop.version=2.5.1 -Pyarn assembly {code} And launched Spark Shell with following command. {code} bin/spark-shell --master yarn --deploy-mode client --executor-cores 1 --driver-memory 512M --executor-memory 512M --num-executors 1 {code} And then, I ran job with following script. {code} sc.textFile(hdfs:///user/kou/LICENSE.txt).flatMap(line = line.split( )).map(word = (word, 1)).persist().reduceByKey((a, b) = a + b, 16).saveAsTextFile(hdfs:///user/kou/LICENSE.txt.count) {code} So I think the problem is not caused by the version of Hadoop. One possible case is that SparkContext#stop is called between instantiating SparkContext and running job accidentally. Did you see any ERROR log on the shell? Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later -- Key: SPARK-4267 URL: https://issues.apache.org/jira/browse/SPARK-4267 Project: Spark Issue Type: Bug Reporter: Tsuyoshi OZAWA Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0 so I compiled with protobuf 2.5.1 like this: {code} ./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0 {code} Then Spark on YARN fails to launch jobs with NPE. {code} $ bin/spark-shell --master yarn-client scala sc.textFile(hdfs:///user/ozawa/wordcountInput20G).flatMap(line = line.split( )).map(word = (word, 1)).persist().reduceByKey((a, b) = a + b, 16).saveAsTextFile(hdfs:///user/ozawa/sparkWordcountOutNew2); java.lang.NullPointerException at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284) at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291) at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480) at $iwC$$iwC$$iwC$$iwC.init(console:13) at $iwC$$iwC$$iwC.init(console:18) at $iwC$$iwC.init(console:20) at $iwC.init(console:22) at init(console:24) at .init(console:28) at .clinit(console) at .init(console:7) at .clinit(console)
[jira] [Resolved] (SPARK-3660) Initial RDD for updateStateByKey transformation
[ https://issues.apache.org/jira/browse/SPARK-3660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-3660. -- Resolution: Fixed Fix Version/s: 1.3.0 Initial RDD for updateStateByKey transformation --- Key: SPARK-3660 URL: https://issues.apache.org/jira/browse/SPARK-3660 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Soumitra Kumar Priority: Minor Fix For: 1.3.0 Original Estimate: 24h Remaining Estimate: 24h How to initialize the state transformation updateStateByKey? I have word counts from a previous spark-submit run, and want to load them in the next spark-submit job to continue counting from there. One proposal is to add the following argument to the updateStateByKey methods: initial : Option [RDD [(K, S)]] = None This maintains backward compatibility as well. I have working code as well. This thread started on the spark-user list at: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-initialize-updateStateByKey-operation-td14772.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
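A sketch of how the proposed argument would be used (the exact overload that shipped in 1.3 may differ; the path, partitioner and update function are illustrative):
{code}
// Seed the running word counts from a previous run's output, then keep updating
// them in the current streaming job.
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.StreamingContext._   // pair DStream operations

// sc: the SparkContext (e.g. from spark-shell); path is a placeholder
val previousCounts = sc.textFile("hdfs:///wordcounts/previous")
  .map(_.split("\t"))
  .map(parts => (parts(0), parts(1).toInt))

val updateFunc = (values: Seq[Int], state: Option[Int]) => Some(values.sum + state.getOrElse(0))

// words: DStream[(String, Int)] built from the current run's input stream
// val runningCounts = words.updateStateByKey(updateFunc, new HashPartitioner(4), previousCounts)
{code}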
[jira] [Updated] (SPARK-3660) Initial RDD for updateStateByKey transformation
[ https://issues.apache.org/jira/browse/SPARK-3660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-3660: - Priority: Major (was: Minor) Initial RDD for updateStateByKey transformation --- Key: SPARK-3660 URL: https://issues.apache.org/jira/browse/SPARK-3660 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Soumitra Kumar Fix For: 1.3.0 Original Estimate: 24h Remaining Estimate: 24h How do I initialize the state of the updateStateByKey transformation? I have word counts from a previous spark-submit run, and want to load them in the next spark-submit job to continue counting from there. One proposal is to add the following argument to the updateStateByKey methods: initial: Option[RDD[(K, S)]] = None. This will maintain backward compatibility as well. I have working code as well. This thread started on the spark-user list at: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-initialize-updateStateByKey-operation-td14772.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4372) Make LR and SVM's default parameters consistent in Scala and Python
[ https://issues.apache.org/jira/browse/SPARK-4372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208597#comment-14208597 ] Apache Spark commented on SPARK-4372: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/3232 Make LR and SVM's default parameters consistent in Scala and Python Key: SPARK-4372 URL: https://issues.apache.org/jira/browse/SPARK-4372 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Xiangrui Meng The current default regParam is 1.0 and regType is claimed to be none in Python (but actually it is l2), while regParam = 0.0 and regType is L2 in Scala. We should make the default values consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
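Until the defaults are aligned, one way to avoid surprises is to set the regularization explicitly rather than relying on either language's default. A rough Scala sketch; the numeric values are illustrative and {{training}} is an assumed RDD[LabeledPoint]:
{code}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.optimization.SquaredL2Updater

val lr = new LogisticRegressionWithSGD()
lr.optimizer
  .setNumIterations(100)
  .setRegParam(0.01)                 // state the regularization strength explicitly
  .setUpdater(new SquaredL2Updater)  // and the regularization type (L2)
val model = lr.run(training)         // training: RDD[LabeledPoint], assumed defined
{code}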
[jira] [Resolved] (SPARK-3666) Extract interfaces for EdgeRDD and VertexRDD
[ https://issues.apache.org/jira/browse/SPARK-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3666. Resolution: Fixed Fix Version/s: 1.2.0 Extract interfaces for EdgeRDD and VertexRDD Key: SPARK-3666 URL: https://issues.apache.org/jira/browse/SPARK-3666 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave Priority: Blocker Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208766#comment-14208766 ] Josh Rosen commented on SPARK-3630: --- Hi [~rdub], Thanks for the detailed logs. Do you have access to the executor logs from the executors where fetch failures occurred? I'd like to see whether those logs contain more information about why those fetches failed. Identify cause of Kryo+Snappy PARSING_ERROR --- Key: SPARK-3630 URL: https://issues.apache.org/jira/browse/SPARK-3630 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Andrew Ash Assignee: Josh Rosen A recent GraphX commit caused non-deterministic exceptions in unit tests so it was reverted (see SPARK-3400). Separately, [~aash] observed the same exception stacktrace in an application-specific Kryo registrator: {noformat} com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) com.esotericsoftware.kryo.io.Input.fill(Input.java:142) com.esotericsoftware.kryo.io.Input.require(Input.java:169) com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) ... {noformat} This ticket is to identify the cause of the exception in the GraphX commit so the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
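For context, an application-specific registrator of the kind mentioned above looks roughly like the sketch below; the registered classes and the class name are illustrative. The PARSING_ERROR surfaces while Kryo reads such data back through a Snappy-compressed stream.
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Register the application's frequently serialized types.
    kryo.register(classOf[Array[Float]])
    kryo.register(classOf[scala.collection.mutable.ArrayBuffer[_]])
  }
}

// Wiring it up (sketch):
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")
{code}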
[jira] [Commented] (SPARK-2996) Standalone and Yarn have different settings for adding the user classpath first
[ https://issues.apache.org/jira/browse/SPARK-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208767#comment-14208767 ] Apache Spark commented on SPARK-2996: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/3233 Standalone and Yarn have different settings for adding the user classpath first --- Key: SPARK-2996 URL: https://issues.apache.org/jira/browse/SPARK-2996 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Minor Standalone uses spark.files.userClassPathFirst while Yarn uses spark.yarn.user.classpath.first. Adding support for the former in Yarn should be pretty trivial. Don't know if Mesos has anything similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
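A minimal sketch of working around the inconsistency today: set both properties (names as given in the description), so the intent carries over whichever cluster manager runs the job.
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.files.userClassPathFirst", "true")   // honored by standalone
  .set("spark.yarn.user.classpath.first", "true")  // honored by YARN
{code}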
[jira] [Resolved] (SPARK-4369) TreeModel.predict does not work with RDD
[ https://issues.apache.org/jira/browse/SPARK-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4369. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3230 [https://github.com/apache/spark/pull/3230] TreeModel.predict does not work with RDD Key: SPARK-4369 URL: https://issues.apache.org/jira/browse/SPARK-4369 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Davies Liu Priority: Blocker Fix For: 1.2.0 {code} Stack Trace - Traceback (most recent call last): File /home/rprabhu/Coding/github/SDNDDoS/classification/DecisionTree.py, line 49, in module predictions = model.predict(parsedData.map(lambda x: x.features)) File /home/rprabhu/Software/spark/python/pyspark/mllib/tree.py, line 42, in predict return self.call(predict, x.map(_convert_to_vector)) File /home/rprabhu/Software/spark/python/pyspark/mllib/common.py, line 140, in call return callJavaFunc(self._sc, getattr(self._java_model, name), *a) File /home/rprabhu/Software/spark/python/pyspark/mllib/common.py, line 117, in callJavaFunc return _java2py(sc, func(*args)) File /home/rprabhu/Software/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /home/rprabhu/Software/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 304, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o39.predict. Trace: py4j.Py4JException: Method predict([class org.apache.spark.api.java.JavaRDD]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4369) TreeModel.predict does not work with RDD
[ https://issues.apache.org/jira/browse/SPARK-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4369: - Assignee: Davies Liu TreeModel.predict does not work with RDD Key: SPARK-4369 URL: https://issues.apache.org/jira/browse/SPARK-4369 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker Fix For: 1.2.0 {code} Stack Trace - Traceback (most recent call last): File /home/rprabhu/Coding/github/SDNDDoS/classification/DecisionTree.py, line 49, in module predictions = model.predict(parsedData.map(lambda x: x.features)) File /home/rprabhu/Software/spark/python/pyspark/mllib/tree.py, line 42, in predict return self.call(predict, x.map(_convert_to_vector)) File /home/rprabhu/Software/spark/python/pyspark/mllib/common.py, line 140, in call return callJavaFunc(self._sc, getattr(self._java_model, name), *a) File /home/rprabhu/Software/spark/python/pyspark/mllib/common.py, line 117, in callJavaFunc return _java2py(sc, func(*args)) File /home/rprabhu/Software/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /home/rprabhu/Software/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 304, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o39.predict. Trace: py4j.Py4JException: Method predict([class org.apache.spark.api.java.JavaRDD]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3667) Deprecate Graph#unpersistVertices and document how to correctly unpersist graphs
[ https://issues.apache.org/jira/browse/SPARK-3667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-3667. --- Resolution: Won't Fix Target Version/s: (was: 1.2.0) Deprecate Graph#unpersistVertices and document how to correctly unpersist graphs Key: SPARK-3667 URL: https://issues.apache.org/jira/browse/SPARK-3667 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave Assignee: Ankur Dave -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208857#comment-14208857 ] Ryan Williams commented on SPARK-3630: -- I ran a few more instances of this job, toggling {{spark.shuffle.manager}} between {{hash}} and {{sort}}, and wasn't able to continue reproducing the Snappy errors. Some jobs did go into a millions-of-FetchFailures death spiral, and some passed. Not sure how to help debug these transient failures. Identify cause of Kryo+Snappy PARSING_ERROR --- Key: SPARK-3630 URL: https://issues.apache.org/jira/browse/SPARK-3630 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Andrew Ash Assignee: Josh Rosen A recent GraphX commit caused non-deterministic exceptions in unit tests so it was reverted (see SPARK-3400). Separately, [~aash] observed the same exception stacktrace in an application-specific Kryo registrator: {noformat} com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) com.esotericsoftware.kryo.io.Input.fill(Input.java:142) com.esotericsoftware.kryo.io.Input.require(Input.java:169) com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) ... {noformat} This ticket is to identify the cause of the exception in the GraphX commit so the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
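For reference, the toggle described above is just a configuration switch, e.g.:
{code}
import org.apache.spark.SparkConf

// "sort" is the other value tried in the runs described above.
val conf = new SparkConf().set("spark.shuffle.manager", "hash")
{code}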
[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208863#comment-14208863 ] Ryan Williams commented on SPARK-3630: -- [~joshrosen] I do have access to the logs, though I don't remember exactly which job was which. Let me try to put them somewhere you can see them. Identify cause of Kryo+Snappy PARSING_ERROR --- Key: SPARK-3630 URL: https://issues.apache.org/jira/browse/SPARK-3630 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Andrew Ash Assignee: Josh Rosen A recent GraphX commit caused non-deterministic exceptions in unit tests so it was reverted (see SPARK-3400). Separately, [~aash] observed the same exception stacktrace in an application-specific Kryo registrator: {noformat} com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) com.esotericsoftware.kryo.io.Input.fill(Input.java:142) com.esotericsoftware.kryo.io.Input.require(Input.java:169) com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) ... {noformat} This ticket is to identify the cause of the exception in the GraphX commit so the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208876#comment-14208876 ] Ryan Williams commented on SPARK-3630: -- [~joshrosen] can you see [this dropbox folder|https://www.dropbox.com/sh/pn0bik3tvy73wfi/AAByFlQVJ3QUOqiKYKXt31RGa?dl=0]? The {{\*.logs}} and {{\*.stacks}} files there are the raw yarn logs and a histogram of stack traces, respectively, for four of my jobs that have Snappy exceptions in the logs (0005, 0006, 0007, and 0008). Let me know if that helps or I can provide other info, thanks. Identify cause of Kryo+Snappy PARSING_ERROR --- Key: SPARK-3630 URL: https://issues.apache.org/jira/browse/SPARK-3630 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Andrew Ash Assignee: Josh Rosen A recent GraphX commit caused non-deterministic exceptions in unit tests so it was reverted (see SPARK-3400). Separately, [~aash] observed the same exception stacktrace in an application-specific Kryo registrator: {noformat} com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) com.esotericsoftware.kryo.io.Input.fill(Input.java:142) com.esotericsoftware.kryo.io.Input.require(Input.java:169) com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) ... {noformat} This ticket is to identify the cause of the exception in the GraphX commit so the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4373) MLlib unit tests failed maven test
Xiangrui Meng created SPARK-4373: Summary: MLlib unit tests failed maven test Key: SPARK-4373 URL: https://issues.apache.org/jira/browse/SPARK-4373 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We should make sure there is at most one SparkContext running at any time inside the same JVM. Maven initializes all test classes first and then runs tests. So we cannot initialize sc as a member. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
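The pattern implied by the description, sketched with ScalaTest: create the SparkContext in beforeAll() and stop it in afterAll(), rather than as a member initialized at construction time (Maven instantiates every test class up front before running any tests). Suite and test names are illustrative.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class ExampleMLlibSuite extends FunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("ExampleMLlibSuite"))
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()  // keep at most one SparkContext per JVM
    sc = null
    super.afterAll()
  }

  test("parallelize and count") {
    assert(sc.parallelize(1 to 100).count() === 100)
  }
}
{code}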
[jira] [Updated] (SPARK-3665) Java API for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-3665: -- Description: The Java API will wrap the Scala API in a similar manner as JavaRDD. Components will include: # JavaGraph #- removes optional param from persist, subgraph, mapReduceTriplets, Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices #- merges multiple parameters lists #- incorporates GraphOps # JavaVertexRDD # JavaEdgeRDD was: The Java API will wrap the Scala API in a similar manner as JavaRDD. Components will include: # JavaGraph #- removes optional param from persist, subgraph, mapReduceTriplets, Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices #- merges multiple parameters lists #- incorporates GraphOps # JavaVertexRDD # JavaEdgeRDD # JavaGraphLoader #- removes optional params, or uses builder pattern Java API for GraphX --- Key: SPARK-3665 URL: https://issues.apache.org/jira/browse/SPARK-3665 Project: Spark Issue Type: Improvement Components: GraphX, Java API Reporter: Ankur Dave Assignee: Ankur Dave The Java API will wrap the Scala API in a similar manner as JavaRDD. Components will include: # JavaGraph #- removes optional param from persist, subgraph, mapReduceTriplets, Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices #- merges multiple parameters lists #- incorporates GraphOps # JavaVertexRDD # JavaEdgeRDD -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3665) Java API for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208916#comment-14208916 ] Apache Spark commented on SPARK-3665: - User 'ankurdave' has created a pull request for this issue: https://github.com/apache/spark/pull/3234 Java API for GraphX --- Key: SPARK-3665 URL: https://issues.apache.org/jira/browse/SPARK-3665 Project: Spark Issue Type: Improvement Components: GraphX, Java API Reporter: Ankur Dave Assignee: Ankur Dave The Java API will wrap the Scala API in a similar manner as JavaRDD. Components will include: # JavaGraph #- removes optional param from persist, subgraph, mapReduceTriplets, Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices #- merges multiple parameters lists #- incorporates GraphOps # JavaVertexRDD # JavaEdgeRDD -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4374) LibraryClientSuite has been flaky
Reynold Xin created SPARK-4374: -- Summary: LibraryClientSuite has been flaky Key: SPARK-4374 URL: https://issues.apache.org/jira/browse/SPARK-4374 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Timothy Hunter Priority: Critical https://github.com/databricks/universe/pull/1780#issuecomment-62809791 LibraryClientSuite: PROD-2230 sanity checks for old data (historical: 55.00% [n=20], recent: 55.00% [n=20]) LibraryClientSuite: A simple case for python (historical: 55.00% [n=20], recent: 55.00% [n=20]) I disabled the two test cases in LibraryClientSuite. Tim - can you look into that? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2672) Support compression in wholeFile()
[ https://issues.apache.org/jira/browse/SPARK-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2672. --- Resolution: Fixed Fix Version/s: 1.3.0 1.2.0 Issue resolved by pull request 3005 [https://github.com/apache/spark/pull/3005] Support compression in wholeFile() -- Key: SPARK-2672 URL: https://issues.apache.org/jira/browse/SPARK-2672 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.2.0, 1.3.0 Original Estimate: 72h Remaining Estimate: 72h The wholeFile() method cannot read compressed files; it should be able to, just like textFile(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
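With the fix, compressed inputs should be readable through the whole-file API just as they are through textFile(). A rough Scala sketch, assuming an existing SparkContext named sc and an illustrative path:
{code}
import org.apache.spark.SparkContext._

// Each record is (filePath, entireFileContents); compressed parts such as .gz
// should be decompressed transparently once compression is supported here.
val pages = sc.wholeTextFiles("hdfs:///data/pages/*.gz")
pages.mapValues(_.length).take(5).foreach(println)
{code}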
[jira] [Deleted] (SPARK-4374) LibraryClientSuite has been flaky
[ https://issues.apache.org/jira/browse/SPARK-4374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin deleted SPARK-4374: --- LibraryClientSuite has been flaky - Key: SPARK-4374 URL: https://issues.apache.org/jira/browse/SPARK-4374 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Timothy Hunter Priority: Critical https://github.com/databricks/universe/pull/1780#issuecomment-62809791 LibraryClientSuite: PROD-2230 sanity checks for old data (historical: 55.00% [n=20], recent: 55.00% [n=20]) LibraryClientSuite: A simple case for python (historical: 55.00% [n=20], recent: 55.00% [n=20]) I disabled the two test cases in LibraryClientSuite. Tim - can you look into that? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4326) unidoc is broken on master
[ https://issues.apache.org/jira/browse/SPARK-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208998#comment-14208998 ] Marcelo Vanzin commented on SPARK-4326: --- So, this is really weird. Unidoc is run by the sbt build, where none of the shading shenanigans from the maven build should apply. The root pom.xml adds guava as a dependency for everybody with compile scope when the sbt profile is enabled. That being said, if you look at the output of {{show allDependencies}} from within an sbt shell, it will show some components with a guava 11.0.2 provided dependency. So the profile isn't taking? Another fun fact is that the dependencies for the core project, where the errors above come from, are correct in the output of {{show allDependencies}}; it shows guava 14.0.1 compile as it should. I was able to workaround this by adding guava explicitly in SparkBuild.scala, in the {{sharedSettings}} variable: {code} libraryDependencies += com.google.guava % guava % 14.0.1 {code} That got rid of the above errors, but it didn't fix the overall build. Anyone more familiar with sbt/unidoc knows what's going on here? Here are the errors with that hack applied: {noformat} [error] /work/apache/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:55: not found: type Type [error] protected Type type() { return Type.UPLOAD_BLOCK; } [error] ^ [error] /work/apache/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/RegisterExecutor.java:44: not found: type Type [error] protected Type type() { return Type.REGISTER_EXECUTOR; } [error] ^ [error] /work/apache/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java:40: not found: type Type [error] protected Type type() { return Type.OPEN_BLOCKS; } [error] ^ [error] /work/apache/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/StreamHandle.java:39: not found: type Type [error] protected Type type() { return Type.STREAM_HANDLE; } [error] ^ {noformat} unidoc is broken on master -- Key: SPARK-4326 URL: https://issues.apache.org/jira/browse/SPARK-4326 Project: Spark Issue Type: Bug Components: Build, Documentation Affects Versions: 1.3.0 Reporter: Xiangrui Meng On master, `jekyll build` throws the following error: {code} [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/AppendOnlyMap.scala:205: value hashInt is not a member of com.google.common.hash.HashFunction [error] private def rehash(h: Int): Int = Hashing.murmur3_32().hashInt(h).asInt() [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala:426: value limit is not a member of object com.google.common.io.ByteStreams [error] val bufferedStream = new BufferedInputStream(ByteStreams.limit(fileStream, end - start)) [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala:558: value limit is not a member of object com.google.common.io.ByteStreams [error] val bufferedStream = new BufferedInputStream(ByteStreams.limit(fileStream, end - start)) [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:261: value hashInt is not a member of com.google.common.hash.HashFunction [error] private def hashcode(h: Int): Int = Hashing.murmur3_32().hashInt(h).asInt() [error]^ [error] 
/Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/Utils.scala:37: type mismatch; [error] found : java.util.Iterator[T] [error] required: Iterable[?] [error] collectionAsScalaIterable(ordering.leastOf(asJavaIterator(input), num)).iterator [error] ^ [error] /Users/meng/src/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:421: value putAll is not a member of com.google.common.cache.Cache[org.apache.hadoop.fs.FileStatus,parquet.hadoop.Footer] [error] footerCache.putAll(newFooters) [error] ^ [warn] /Users/meng/src/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/parquet/FakeParquetSerDe.scala:34: @deprecated now takes two arguments; see the scaladoc. [warn] @deprecated(No code should depend on FakeParquetHiveSerDe as it is only intended as a + [warn] ^
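The workaround described above amounts to one line in project/SparkBuild.scala; with sbt's usual string-literal syntax it reads roughly as follows (version pinned to the one core's pom declares):
{code}
// project/SparkBuild.scala, inside sharedSettings
libraryDependencies += "com.google.guava" % "guava" % "14.0.1"
{code}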
[jira] [Created] (SPARK-4375) assembly built with Maven is missing most of repl classes
Sandy Ryza created SPARK-4375: - Summary: assembly built with Maven is missing most of repl classes Key: SPARK-4375 URL: https://issues.apache.org/jira/browse/SPARK-4375 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4375) assembly built with Maven is missing most of repl classes
[ https://issues.apache.org/jira/browse/SPARK-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-4375: -- Description: In particular, the ones in the split scala-2.10/scala-2.11 directories aren't being added assembly built with Maven is missing most of repl classes - Key: SPARK-4375 URL: https://issues.apache.org/jira/browse/SPARK-4375 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Blocker In particular, the ones in the split scala-2.10/scala-2.11 directories aren't being added -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4326) unidoc is broken on master
[ https://issues.apache.org/jira/browse/SPARK-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209058#comment-14209058 ] Xiangrui Meng commented on SPARK-4326: -- [~vanzin] Thanks for looking into this issue! This is the commit that caused the problem: SPARK-3796: https://github.com/apache/spark/commit/f55218aeb1e9d638df6229b36a59a15ce5363482 It adds Guava 11.0.1 in the pom, which is perhaps not the correct way to specify Guava version. [~adav] Could you explain which Guava version you need per Hadoop profile? unidoc is broken on master -- Key: SPARK-4326 URL: https://issues.apache.org/jira/browse/SPARK-4326 Project: Spark Issue Type: Bug Components: Build, Documentation Affects Versions: 1.3.0 Reporter: Xiangrui Meng On master, `jekyll build` throws the following error: {code} [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/AppendOnlyMap.scala:205: value hashInt is not a member of com.google.common.hash.HashFunction [error] private def rehash(h: Int): Int = Hashing.murmur3_32().hashInt(h).asInt() [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala:426: value limit is not a member of object com.google.common.io.ByteStreams [error] val bufferedStream = new BufferedInputStream(ByteStreams.limit(fileStream, end - start)) [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala:558: value limit is not a member of object com.google.common.io.ByteStreams [error] val bufferedStream = new BufferedInputStream(ByteStreams.limit(fileStream, end - start)) [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:261: value hashInt is not a member of com.google.common.hash.HashFunction [error] private def hashcode(h: Int): Int = Hashing.murmur3_32().hashInt(h).asInt() [error]^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/Utils.scala:37: type mismatch; [error] found : java.util.Iterator[T] [error] required: Iterable[?] [error] collectionAsScalaIterable(ordering.leastOf(asJavaIterator(input), num)).iterator [error] ^ [error] /Users/meng/src/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:421: value putAll is not a member of com.google.common.cache.Cache[org.apache.hadoop.fs.FileStatus,parquet.hadoop.Footer] [error] footerCache.putAll(newFooters) [error] ^ [warn] /Users/meng/src/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/parquet/FakeParquetSerDe.scala:34: @deprecated now takes two arguments; see the scaladoc. [warn] @deprecated(No code should depend on FakeParquetHiveSerDe as it is only intended as a + [warn] ^ [info] No documentation generated with unsucessful compiler run [warn] two warnings found [error] 6 errors found [error] (spark/scalaunidoc:doc) Scaladoc generation failed [error] Total time: 48 s, completed Nov 10, 2014 1:31:01 PM {code} It doesn't happen on branch-1.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4326) unidoc is broken on master
[ https://issues.apache.org/jira/browse/SPARK-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4326: - Priority: Critical (was: Major) unidoc is broken on master -- Key: SPARK-4326 URL: https://issues.apache.org/jira/browse/SPARK-4326 Project: Spark Issue Type: Bug Components: Build, Documentation Affects Versions: 1.3.0 Reporter: Xiangrui Meng Priority: Critical On master, `jekyll build` throws the following error: {code} [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/AppendOnlyMap.scala:205: value hashInt is not a member of com.google.common.hash.HashFunction [error] private def rehash(h: Int): Int = Hashing.murmur3_32().hashInt(h).asInt() [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala:426: value limit is not a member of object com.google.common.io.ByteStreams [error] val bufferedStream = new BufferedInputStream(ByteStreams.limit(fileStream, end - start)) [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala:558: value limit is not a member of object com.google.common.io.ByteStreams [error] val bufferedStream = new BufferedInputStream(ByteStreams.limit(fileStream, end - start)) [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:261: value hashInt is not a member of com.google.common.hash.HashFunction [error] private def hashcode(h: Int): Int = Hashing.murmur3_32().hashInt(h).asInt() [error]^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/Utils.scala:37: type mismatch; [error] found : java.util.Iterator[T] [error] required: Iterable[?] [error] collectionAsScalaIterable(ordering.leastOf(asJavaIterator(input), num)).iterator [error] ^ [error] /Users/meng/src/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:421: value putAll is not a member of com.google.common.cache.Cache[org.apache.hadoop.fs.FileStatus,parquet.hadoop.Footer] [error] footerCache.putAll(newFooters) [error] ^ [warn] /Users/meng/src/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/parquet/FakeParquetSerDe.scala:34: @deprecated now takes two arguments; see the scaladoc. [warn] @deprecated(No code should depend on FakeParquetHiveSerDe as it is only intended as a + [warn] ^ [info] No documentation generated with unsucessful compiler run [warn] two warnings found [error] 6 errors found [error] (spark/scalaunidoc:doc) Scaladoc generation failed [error] Total time: 48 s, completed Nov 10, 2014 1:31:01 PM {code} It doesn't happen on branch-1.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4179) Streaming Linear Regression example has type mismatch
[ https://issues.apache.org/jira/browse/SPARK-4179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-4179. Resolution: Not a Problem Assignee: Xiangrui Meng (was: Anant Daksh Asthana) I'm closing this JIRA because this is already fixed in SPARK-3108 and the user guide is up-to-date. But please feel free to re-open it if I missed something. Streaming Linear Regression example has type mismatch - Key: SPARK-4179 URL: https://issues.apache.org/jira/browse/SPARK-4179 Project: Spark Issue Type: Bug Components: Examples, MLlib Affects Versions: 1.1.0 Reporter: Anant Daksh Asthana Assignee: Xiangrui Meng The example for Streaming Linear Regression on line 65 calls predictOn with a DStream of (Double, Vector) pairs when the expected argument type is DStream[Vector]. This throws a type error. examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala#65 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
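On the caller side, the fix is to map the labeled stream down to its feature vectors before calling predictOn, or to use predictOnValues to keep the labels. A sketch, with testData assumed to be a DStream[LabeledPoint] and model an assumed StreamingLinearRegressionWithSGD instance:
{code}
// predictOn expects a DStream[Vector]:
model.predictOn(testData.map(_.features)).print()

// or keep the label next to each prediction:
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
{code}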
[jira] [Created] (SPARK-4376) Put external modules behind build profiles
Patrick Wendell created SPARK-4376: -- Summary: Put external modules behind build profiles Key: SPARK-4376 URL: https://issues.apache.org/jira/browse/SPARK-4376 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell Assignee: Sandy Ryza Priority: Blocker Several people have asked me whether, to speed up the build, we can put the external projects behind build flags similar to the kinesis-asl module. Since these aren't in the assembly there isn't a great reason to build them by default. We can just modify our release script to build them and when we run tests. This doesn't technically block Spark 1.2 but it is going to be looped into a separate fix that does block Spark 1.2 so I'm upgrading it to blocker. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3325) Add a parameter to the method print in class DStream.
[ https://issues.apache.org/jira/browse/SPARK-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209153#comment-14209153 ] Apache Spark commented on SPARK-3325: - User 'watermen' has created a pull request for this issue: https://github.com/apache/spark/pull/3237 Add a parameter to the method print in class DStream. - Key: SPARK-3325 URL: https://issues.apache.org/jira/browse/SPARK-3325 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2 Reporter: Yadong Qi def print(num: Int = 10) The user can then control the number of elements to print. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
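Usage would look like the following sketch, assuming the proposed parameter is added and wordCounts is an existing DStream:
{code}
wordCounts.print()    // current behaviour: first 10 elements of each batch
wordCounts.print(25)  // with the proposal: first 25 elements of each batch
{code}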
[jira] [Closed] (SPARK-4364) Some variable types in org.apache.spark.streaming.JavaAPISuite are wrong
[ https://issues.apache.org/jira/browse/SPARK-4364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu closed SPARK-4364. --- Resolution: Duplicate Sorry, didn't notice SPARK-4297. Some variable types in org.apache.spark.streaming.JavaAPISuite are wrong Key: SPARK-4364 URL: https://issues.apache.org/jira/browse/SPARK-4364 Project: Spark Issue Type: Test Components: Streaming Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Trivial Labels: unit-test Because of type erasure, the unit tests still pass. However, the wrong variable types will confuse people. The locations of these variables can be found in my PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4373) MLlib unit tests failed maven test
[ https://issues.apache.org/jira/browse/SPARK-4373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4373. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3235 [https://github.com/apache/spark/pull/3235] MLlib unit tests failed maven test -- Key: SPARK-4373 URL: https://issues.apache.org/jira/browse/SPARK-4373 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.2.0 We should make sure there is at most one SparkContext running at any time inside the same JVM. Maven initializes all test classes first and then runs tests. So we cannot initialize sc as a member. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4370) Limit cores used by Netty transfer service based on executor size
[ https://issues.apache.org/jira/browse/SPARK-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4370. Resolution: Fixed Fix Version/s: 1.2.0 Limit cores used by Netty transfer service based on executor size - Key: SPARK-4370 URL: https://issues.apache.org/jira/browse/SPARK-4370 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical Fix For: 1.2.0 Right now, the NettyBlockTransferService uses the total number of cores on the system as the number of threads and buffer arenas to create. The latter is more troubling -- this can lead to significant allocation of extra heap and direct memory in situations where executors are relatively small compared to the whole machine. For instance, on a machine with 32 cores, we will allocate (32 cores * 16MB per arena = 512MB) * 2 for client and server = 1GB direct and heap memory. This can be a huge overhead if you're only using, say, 8 of those cores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4375) Assembly built with Maven is missing most of repl classes
[ https://issues.apache.org/jira/browse/SPARK-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-4375: -- Summary: Assembly built with Maven is missing most of repl classes (was: assembly built with Maven is missing most of repl classes) Assembly built with Maven is missing most of repl classes - Key: SPARK-4375 URL: https://issues.apache.org/jira/browse/SPARK-4375 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Blocker In particular, the ones in the split scala-2.10/scala-2.11 directories aren't being added -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4326) unidoc is broken on master
[ https://issues.apache.org/jira/browse/SPARK-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209170#comment-14209170 ] Marcelo Vanzin commented on SPARK-4326: --- Hmm, but core/pom.xml defines an explicit dependency on guava 14, so it should override the 11.0.2 dependency from the shuffle module (which is correct, btw). And maven's / sbt's dependency resolution seems to indicate that's happening, although unidoc doesn't. That's the weird part. Maybe some bug in the unidoc plugin? unidoc is broken on master -- Key: SPARK-4326 URL: https://issues.apache.org/jira/browse/SPARK-4326 Project: Spark Issue Type: Bug Components: Build, Documentation Affects Versions: 1.3.0 Reporter: Xiangrui Meng Priority: Critical On master, `jekyll build` throws the following error: {code} [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/AppendOnlyMap.scala:205: value hashInt is not a member of com.google.common.hash.HashFunction [error] private def rehash(h: Int): Int = Hashing.murmur3_32().hashInt(h).asInt() [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala:426: value limit is not a member of object com.google.common.io.ByteStreams [error] val bufferedStream = new BufferedInputStream(ByteStreams.limit(fileStream, end - start)) [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala:558: value limit is not a member of object com.google.common.io.ByteStreams [error] val bufferedStream = new BufferedInputStream(ByteStreams.limit(fileStream, end - start)) [error] ^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:261: value hashInt is not a member of com.google.common.hash.HashFunction [error] private def hashcode(h: Int): Int = Hashing.murmur3_32().hashInt(h).asInt() [error]^ [error] /Users/meng/src/spark/core/src/main/scala/org/apache/spark/util/collection/Utils.scala:37: type mismatch; [error] found : java.util.Iterator[T] [error] required: Iterable[?] [error] collectionAsScalaIterable(ordering.leastOf(asJavaIterator(input), num)).iterator [error] ^ [error] /Users/meng/src/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:421: value putAll is not a member of com.google.common.cache.Cache[org.apache.hadoop.fs.FileStatus,parquet.hadoop.Footer] [error] footerCache.putAll(newFooters) [error] ^ [warn] /Users/meng/src/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/parquet/FakeParquetSerDe.scala:34: @deprecated now takes two arguments; see the scaladoc. [warn] @deprecated(No code should depend on FakeParquetHiveSerDe as it is only intended as a + [warn] ^ [info] No documentation generated with unsucessful compiler run [warn] two warnings found [error] 6 errors found [error] (spark/scalaunidoc:doc) Scaladoc generation failed [error] Total time: 48 s, completed Nov 10, 2014 1:31:01 PM {code} It doesn't happen on branch-1.2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209183#comment-14209183 ] Matthew Daniel commented on SPARK-4267: --- Apologies, I don't know if we want log verbiage inline or as an attachment. I experienced this NPE on an EMR cluster, AMI 3.3.0 which is Amazon Hadoop 2.4.0 against a {{make-distribution.sh}} version with {{-Pyarn}} and {{-Phadoop-2.2}} with {{-Dhadoop.version=2.2.0}}. I built it against 2.2 because some of our jobs run on 2.2, and I thought 2.4 would be backwards compatible. I will try building as you said, using {{sbt assembly}}, but I wanted to reply to your comment that yes, I do see an {{ERROR}} line but it isn't helpful to me, so I hope it's meaningful to others. {noformat} 14/11/13 02:58:23 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1415847498993 yarnAppState: ACCEPTED 14/11/13 02:58:23 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, PROXY_HOST=10.166.39.198,PROXY_URI_BASE=http://10.166.39.198:9046/proxy/application_1415840940647_0001, /proxy/application_1415840940647_0001 14/11/13 02:58:23 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter 14/11/13 02:58:24 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: 0 appStartTime: 1415847498993 yarnAppState: RUNNING 14/11/13 02:58:29 ERROR cluster.YarnClientSchedulerBackend: Yarn application already ended: FINISHED 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/metrics/json,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/kill,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/static,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/json,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment/json,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd/json,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/json,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool/json,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/json,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/json,null} 14/11/13 02:58:29 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages,null} 14/11/13 02:58:29 INFO ui.SparkUI: Stopped Spark web UI at 
http://ip-10-166-39-198.ec2.internal:4040 14/11/13 02:58:29 INFO scheduler.DAGScheduler: Stopping DAGScheduler 14/11/13 02:58:29 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors 14/11/13 02:58:29 INFO cluster.YarnClientSchedulerBackend: Asking each executor to shut down 14/11/13 02:58:29 INFO cluster.YarnClientSchedulerBackend: Stopped 14/11/13 02:58:30 INFO spark.MapOutputTrackerMasterActor: MapOutputTrackerActor stopped! 14/11/13 02:58:30 INFO network.ConnectionManager: Selector thread was interrupted! 14/11/13 02:58:30 INFO network.ConnectionManager: ConnectionManager stopped 14/11/13 02:58:30 INFO storage.MemoryStore: MemoryStore cleared 14/11/13 02:58:30 INFO storage.BlockManager: BlockManager stopped 14/11/13 02:58:30 INFO storage.BlockManagerMaster: BlockManagerMaster stopped 14/11/13 02:58:30 INFO spark.SparkContext: Successfully stopped SparkContext 14/11/13 02:58:30 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 14/11/13 02:58:30 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 14/11/13 02:58:30 INFO Remoting: Remoting shut down 14/11/13 02:58:30 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down. 14/11/13 02:58:47 INFO
[jira] [Commented] (SPARK-750) LocalSparkContext should be included in Spark JAR
[ https://issues.apache.org/jira/browse/SPARK-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209167#comment-14209167 ] Nathan M commented on SPARK-750: +1. This shouldn't be hard; in Maven it's a plugin to add to the spark/core/pom.xml file, as described here: http://maven.apache.org/plugins/maven-jar-plugin/examples/create-test-jar.html LocalSparkContext should be included in Spark JAR - Key: SPARK-750 URL: https://issues.apache.org/jira/browse/SPARK-750 Project: Spark Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Josh Rosen Priority: Minor To aid third-party developers in writing unit tests with Spark, LocalSparkContext should be included in the Spark JAR. Right now, it appears to be excluded because it is located in one of the Spark test directories. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4294) The same function should have the same realization.
[ https://issues.apache.org/jira/browse/SPARK-4294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-4294: Assignee: Tathagata Das The same function should have the same realization. --- Key: SPARK-4294 URL: https://issues.apache.org/jira/browse/SPARK-4294 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Yadong Qi Assignee: Tathagata Das Priority: Minor Fix For: 1.2.0 In class TransformedDStream: require(parents.length > 0, "List of DStreams to transform is empty") require(parents.map(_.ssc).distinct.size == 1, "Some of the DStreams have different contexts") require(parents.map(_.slideDuration).distinct.size == 1, "Some of the DStreams have different slide durations") In class UnionDStream: if (parents.length == 0) { throw new IllegalArgumentException("Empty array of parents") } if (parents.map(_.ssc).distinct.size > 1) { throw new IllegalArgumentException("Array of parents have different StreamingContexts") } if (parents.map(_.slideDuration).distinct.size > 1) { throw new IllegalArgumentException("Array of parents have different slide times") } The function is the same, but the realization is not. I think they should be the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
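A sketch of what the unified version could look like, expressing UnionDStream's checks in the same require style already used by TransformedDStream:
{code}
require(parents.length > 0, "List of DStreams to union is empty")
require(parents.map(_.ssc).distinct.size == 1,
  "Some of the DStreams have different contexts")
require(parents.map(_.slideDuration).distinct.size == 1,
  "Some of the DStreams have different slide durations")
{code}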
[jira] [Commented] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209239#comment-14209239 ] Kousuke Saruta commented on SPARK-4267: --- Hi [~bugzi...@mdaniel.scdi.com]. The NPE is caused by the SparkContext having been stopped because the application finished unexpectedly. I don't yet know why your application finished before the job ran. Can you see any ERROR message in the logs of the ApplicationMaster or ResourceManager? Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later -- Key: SPARK-4267 URL: https://issues.apache.org/jira/browse/SPARK-4267 Project: Spark Issue Type: Bug Reporter: Tsuyoshi OZAWA Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0, so I compiled with protobuf 2.5.0 like this: {code} ./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0 {code} Then Spark on YARN fails to launch jobs with an NPE. {code} $ bin/spark-shell --master yarn-client scala> sc.textFile("hdfs:///user/ozawa/wordcountInput20G").flatMap(line => line.split(" ")).map(word => (word, 1)).persist().reduceByKey((a, b) => a + b, 16).saveAsTextFile("hdfs:///user/ozawa/sparkWordcountOutNew2"); java.lang.NullPointerException at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284) at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291) at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480) at $iwC$$iwC$$iwC$$iwC.<init>(<console>:13) at $iwC$$iwC$$iwC.<init>(<console>:18) at $iwC$$iwC.<init>(<console>:20) at $iwC.<init>(<console>:22) at <init>(<console>:24) at .<init>(<console>:28) at .<clinit>(<console>) at .<init>(<console>:7) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209255#comment-14209255 ] Derrick Burns commented on SPARK-2620: -- I also hit the bug when running Spark 1.1.0 in local mode. case class cannot be used as key for reduce --- Key: SPARK-2620 URL: https://issues.apache.org/jira/browse/SPARK-2620 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Environment: reproduced on spark-shell local[4] Reporter: Gerard Maas Priority: Critical Labels: case-class, core Using a case class as a key doesn't seem to work properly on Spark 1.0.0. A minimal example: {code} case class P(name: String) val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect {code} [Spark shell local mode] res: Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1)) In contrast to the expected behavior, which should be equivalent to: {code} sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect {code} Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) groupByKey and distinct also exhibit the same behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
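Until the underlying issue is fixed, a hypothetical workaround sketch for the spark-shell (assuming the same session and data as above) is to reduce on a plain value derived from the case class and rebuild the pairs afterwards:
{code}
// Hypothetical workaround: reduce on a String key, which is unaffected by the
// bug, then map the keys back into the case class for downstream use.
case class P(name: String)
val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
val counts = sc.parallelize(ps)
  .map(p => (p.name, 1))     // key by the field, not the case class
  .reduceByKey(_ + _)
  .map { case (name, n) => (P(name), n) }
counts.collect()             // P("bob") should now count as 2 (order may vary)
{code}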
[jira] [Comment Edited] (SPARK-4375) Assembly built with Maven is missing most of repl classes
[ https://issues.apache.org/jira/browse/SPARK-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209305#comment-14209305 ] Patrick Wendell edited comment on SPARK-4375 at 11/13/14 5:37 AM: -- Hey Sandy, What about the following solution: 1. For the repl case, we make the change you are suggesting and simply drop the need for -Pscala-2.10 to be there explicitly. 2. We no longer include the examples module or the external project modules by default in the build. There is a profile for each of the external projects and a profile for the examples. 3. When building the examples, you need to specify, somewhat pedantically, all of the necessary external sub projects and also -Pscala-2.10 or -Pscala-2.11. We can just give people the exact commands to run for the 2.10 and 2.11 examples in the maven docs. The main benefit I see is that there is no regression for someone doing a package for Scala 2.10, which is the common case. If someone wants to build the examples, they need to go and do a bit of extra work to look up the new command, but it's mostly straightforward. Of course, all of our packages will still have the examples pre-built. was (Author: pwendell): Hey Sandy, What about the following solution: 1. For the repl case, we make the change you are suggesting and simply drop the need for -Pscala-2.10 to be there explicitly. 2. We no longer include the examples module or the external project modules by default in the build. 3. When building the examples, you need to specify, somewhat pedantically, all of the necessary external sub projects and also -Pscala-2.10 or -Pscala-2.11. We can just give people the exact commands to run for the 2.10 and 2.11 examples in the maven docs. The main benefit I see is that there is no regression for someone doing a package for Scala 2.10, which is the common case. If someone wants to build the examples, they need to go and do a bit of extra work to look up the new command, but it's mostly straightforward. Of course, all of our packages will still have the examples pre-built. Assembly built with Maven is missing most of repl classes - Key: SPARK-4375 URL: https://issues.apache.org/jira/browse/SPARK-4375 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Blocker In particular, the ones in the split scala-2.10/scala-2.11 directories aren't being added. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4375) Assembly built with Maven is missing most of repl classes
[ https://issues.apache.org/jira/browse/SPARK-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209305#comment-14209305 ] Patrick Wendell commented on SPARK-4375: Hey Sandy, What about the following solution: 1. For the repl case, we make the change you are suggesting and simply drop the need for -Pscala-2.10 to be there explicitly. 2. We no longer include the examples module or the external project modules by default in the build. 3. When building the examples, you need to specify, somewhat pedantically, all of the necessary external sub projects and also -Pscala-2.10 or -Pscala-2.11. We can just give people the exact commands to run for the 2.10 and 2.11 examples in the maven docs. The main benefit I see is that there is no regression for someone doing a package for Scala 2.10, which is the common case. If someone wants to build the examples, they need to go and do a bit of extra work to look up the new command, but it's mostly straightforward. Of course, all of our packages will still have the examples pre-built. Assembly built with Maven is missing most of repl classes - Key: SPARK-4375 URL: https://issues.apache.org/jira/browse/SPARK-4375 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Blocker In particular, the ones in the split scala-2.10/scala-2.11 directories aren't being added. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4375) Assembly built with Maven is missing most of repl classes
[ https://issues.apache.org/jira/browse/SPARK-4375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14209314#comment-14209314 ] Patrick Wendell commented on SPARK-4375: One thing we could add on top of that, so the user has to type less, is something like this: {code} mvn package -Pexamples -Pscala-2.10 -Dexamples-2.10 {code} Then internally we have profiles that are activated by the examples-2.10 property and that add the relevant modules required by the 2.10 examples build. Assembly built with Maven is missing most of repl classes - Key: SPARK-4375 URL: https://issues.apache.org/jira/browse/SPARK-4375 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Blocker In particular, the ones in the split scala-2.10/scala-2.11 directories aren't being added. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org