[jira] [Commented] (SPARK-1914) Simplify CountFunction not to traverse to evaluate all child expressions.

2014-05-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14008629#comment-14008629
 ] 

Cheng Lian commented on SPARK-1914:
---

Added steps to reproduce this bug in {{sbt hive/console}}.

 Simplify CountFunction not to traverse to evaluate all child expressions.
 -

 Key: SPARK-1914
 URL: https://issues.apache.org/jira/browse/SPARK-1914
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin

 {{CountFunction}} should count up only if the child's evaluated value is not 
 null.
 Because it traverses and evaluates all child expressions, it counts up even 
 when the child itself evaluates to null, as long as any one of the traversed 
 children is not null.
 To reproduce this bug in {{sbt hive/console}}:
 {code}
 scala> hql("SELECT COUNT(*) FROM src1").collect()
 res1: Array[org.apache.spark.sql.Row] = Array([25])
 scala> hql("SELECT COUNT(*) FROM src1 WHERE key IS NULL").collect()
 res2: Array[org.apache.spark.sql.Row] = Array([10])
 scala> hql("SELECT COUNT(key + 1) FROM src1").collect()
 res3: Array[org.apache.spark.sql.Row] = Array([25])
 {code}
 {{res3}} should be 15 since there are 10 null keys.
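For illustration, here is a minimal sketch in plain Scala of an update step that counts a row only when the counted expression itself evaluates to non-null (the names and shape are assumptions made for this digest, not the Catalyst code):

{code}
// Count a row only if the single counted expression is non-null.
class CountNonNull[T](eval: T => Any) {
  private var count = 0L
  def update(row: T): Unit = if (eval(row) != null) count += 1L
  def result: Long = count
}

// 10 null keys and 15 non-null keys, so COUNT(key + 1) should be 15.
val rows: Seq[Option[Int]] = Seq.fill(10)(None) ++ (1 to 15).map(Some(_))
val counter = new CountNonNull[Option[Int]](r => r.map(_ + 1).getOrElse(null))
rows.foreach(counter.update)
assert(counter.result == 15L)
{code}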



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1914) Simplify CountFunction not to traverse to evaluate all child expressions.

2014-05-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-1914:
--

Description: 
{{CountFunction}} should count up only if the child's evaluated value is not null.

Because it traverses and evaluates all child expressions, it counts up even when the child itself evaluates to null, as long as any one of the traversed children is not null.

To reproduce this bug in {{sbt hive/console}}:

{code}
scala> hql("SELECT COUNT(*) FROM src1").collect()
res1: Array[org.apache.spark.sql.Row] = Array([25])

scala> hql("SELECT COUNT(*) FROM src1 WHERE key IS NULL").collect()
res2: Array[org.apache.spark.sql.Row] = Array([10])

scala> hql("SELECT COUNT(key + 1) FROM src1").collect()
res3: Array[org.apache.spark.sql.Row] = Array([25])
{code}

{{res3}} should be 15 since there are 10 null keys.

  was:
{{CountFunction}} should count up only if the child's evaluated value is not null.

Because it traverses and evaluates all child expressions, it counts up even when the child itself evaluates to null, as long as any one of the traversed children is not null.


 Simplify CountFunction not to traverse to evaluate all child expressions.
 -

 Key: SPARK-1914
 URL: https://issues.apache.org/jira/browse/SPARK-1914
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin

 {{CountFunction}} should count up only if the child's evaluated value is not 
 null.
 Because it traverses and evaluates all child expressions, it counts up even 
 when the child itself evaluates to null, as long as any one of the traversed 
 children is not null.
 To reproduce this bug in {{sbt hive/console}}:
 {code}
 scala> hql("SELECT COUNT(*) FROM src1").collect()
 res1: Array[org.apache.spark.sql.Row] = Array([25])
 scala> hql("SELECT COUNT(*) FROM src1 WHERE key IS NULL").collect()
 res2: Array[org.apache.spark.sql.Row] = Array([10])
 scala> hql("SELECT COUNT(key + 1) FROM src1").collect()
 res3: Array[org.apache.spark.sql.Row] = Array([25])
 {code}
 {{res3}} should be 15 since there are 10 null keys.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1914) Simplify CountFunction not to traverse to evaluate all child expressions.

2014-05-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-1914.


   Resolution: Fixed
Fix Version/s: 1.0.0

 Simplify CountFunction not to traverse to evaluate all child expressions.
 -

 Key: SPARK-1914
 URL: https://issues.apache.org/jira/browse/SPARK-1914
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin
 Fix For: 1.0.0


 {{CountFunction}} should count up only if the child's evaluated value is not 
 null.
 Because it traverses and evaluates all child expressions, it counts up even 
 when the child itself evaluates to null, as long as any one of the traversed 
 children is not null.
 To reproduce this bug in {{sbt hive/console}}:
 {code}
 scala> hql("SELECT COUNT(*) FROM src1").collect()
 res1: Array[org.apache.spark.sql.Row] = Array([25])
 scala> hql("SELECT COUNT(*) FROM src1 WHERE key IS NULL").collect()
 res2: Array[org.apache.spark.sql.Row] = Array([10])
 scala> hql("SELECT COUNT(key + 1) FROM src1").collect()
 res3: Array[org.apache.spark.sql.Row] = Array([25])
 {code}
 {{res3}} should be 15 since there are 10 null keys.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1914) Simplify CountFunction not to traverse to evaluate all child expressions.

2014-05-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1914:
---

Assignee: Takuya Ueshin

 Simplify CountFunction not to traverse to evaluate all child expressions.
 -

 Key: SPARK-1914
 URL: https://issues.apache.org/jira/browse/SPARK-1914
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin
Assignee: Takuya Ueshin
 Fix For: 1.0.0


 {{CountFunction}} should count up only if the child's evaluated value is not 
 null.
 Because it traverses and evaluates all child expressions, it counts up even 
 when the child itself evaluates to null, as long as any one of the traversed 
 children is not null.
 To reproduce this bug in {{sbt hive/console}}:
 {code}
 scala> hql("SELECT COUNT(*) FROM src1").collect()
 res1: Array[org.apache.spark.sql.Row] = Array([25])
 scala> hql("SELECT COUNT(*) FROM src1 WHERE key IS NULL").collect()
 res2: Array[org.apache.spark.sql.Row] = Array([10])
 scala> hql("SELECT COUNT(key + 1) FROM src1").collect()
 res3: Array[org.apache.spark.sql.Row] = Array([25])
 {code}
 {{res3}} should be 15 since there are 10 null keys.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1925) Typo in org.apache.spark.mllib.tree.DecisionTree.isSampleValid

2014-05-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1925:
---

Description: 
I believe this is a typo:

{code}
  if ((level > 0) & (parentFilters.length == 0)) {
    return false
  }
{code}

Should use && here.

  was:
I believe this is a typo:

  if ((level > 0) & (parentFilters.length == 0)) {
    return false
  }

Should use && here.


 Typo in org.apache.spark.mllib.tree.DecisionTree.isSampleValid
 --

 Key: SPARK-1925
 URL: https://issues.apache.org/jira/browse/SPARK-1925
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Shixiong Zhu
Priority: Minor
  Labels: easyfix

 I believe this is a typo:
 {code}
   if ((level > 0) & (parentFilters.length == 0)) {
     return false
   }
 {code}
 Should use && here.
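For illustration, a minimal Scala sketch of the corrected guard with the short-circuiting operator (method and parameter names here are assumptions, not the MLlib source):

{code}
// On Booleans, & evaluates both operands, while && short-circuits;
// && is the idiomatic choice for a guard like the one quoted above.
def shouldReturnFalse(level: Int, parentFilters: Seq[Int]): Boolean =
  (level > 0) && parentFilters.isEmpty

assert(shouldReturnFalse(1, Nil))   // non-root level with no parent filters
assert(!shouldReturnFalse(0, Nil))  // root level
{code}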



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1926) Nullability of Max/Min/First should be true.

2014-05-26 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14008631#comment-14008631
 ] 

Takuya Ueshin commented on SPARK-1926:
--

PRed: https://github.com/apache/spark/pull/881

 Nullability of Max/Min/First should be true.
 

 Key: SPARK-1926
 URL: https://issues.apache.org/jira/browse/SPARK-1926
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin

 Nullability of {{Max}}/{{Min}}/{{First}} should be {{true}} because they 
 return {{null}} if there are no rows.
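For illustration, a query of the kind that motivates this, assuming the same {{sbt hive/console}} session and {{src1}} table used elsewhere in this digest (the expected output is inferred, not taken from a run):

{code}
// An aggregate over zero input rows yields NULL, so the attribute
// produced by MAX/MIN/FIRST must be declared nullable.
hql("SELECT MAX(key) FROM src1 WHERE 1 = 0").collect()
// expected: Array([null])
{code}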



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1329) ArrayIndexOutOfBoundsException if graphx.Graph has more edge partitions than node partitions

2014-05-26 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14008771#comment-14008771
 ] 

Daniel Darabos commented on SPARK-1329:
---

Sorry, I haven't looked into fixing this. We ended up not using GraphX, so we 
are no longer affected by the bug.

 ArrayIndexOutOfBoundsException if graphx.Graph has more edge partitions than 
 node partitions
 

 Key: SPARK-1329
 URL: https://issues.apache.org/jira/browse/SPARK-1329
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 0.9.0
Reporter: Daniel Darabos

 To reproduce, let's look at a graph with two nodes in one partition, and two 
 edges between them split across two partitions:
 scala> val vs = sc.makeRDD(Seq(1L -> null, 2L -> null), 1)
 scala> val es = sc.makeRDD(Seq(graphx.Edge(1, 2, null), graphx.Edge(2, 1, 
 null)), 2)
 scala> val g = graphx.Graph(vs, es)
 Everything seems fine, until GraphX needs to join the two RDDs:
 scala> g.triplets.collect
 [...]
 java.lang.ArrayIndexOutOfBoundsException: 1
   at 
 org.apache.spark.graphx.impl.RoutingTable$$anonfun$2$$anonfun$apply$3.apply(RoutingTable.scala:76)
   at 
 org.apache.spark.graphx.impl.RoutingTable$$anonfun$2$$anonfun$apply$3.apply(RoutingTable.scala:75)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 org.apache.spark.graphx.impl.RoutingTable$$anonfun$2.apply(RoutingTable.scala:75)
   at 
 org.apache.spark.graphx.impl.RoutingTable$$anonfun$2.apply(RoutingTable.scala:73)
   at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
   at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:71)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:85)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
   at org.apache.spark.scheduler.Task.run(Task.scala:53)
   at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 The bug is fairly obvious in RoutingTable.createPid2Vid() -- it creates an 
 array of length vertices.partitions.size, and then looks up partition IDs 
 from the edges.partitionsRDD in it.
 A graph usually has more edges than nodes. So it is natural to have more edge 
 partitions than node partitions.
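For illustration, a self-contained Scala sketch of the indexing mismatch described above (this is not the GraphX code; the names and sizes are assumptions):

{code}
// An array sized by the number of *vertex* partitions is indexed with an
// *edge* partition ID; with 2 edge partitions and 1 vertex partition,
// edgePid = 1 throws ArrayIndexOutOfBoundsException, as in the trace above.
val numVertexPartitions = 1
val numEdgePartitions = 2
val pid2vid = Array.fill(numVertexPartitions)(Vector.empty[Long])
for (edgePid <- 0 until numEdgePartitions)
  pid2vid(edgePid) = pid2vid(edgePid) :+ 1L
{code}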



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1927) Implicits declared in companion objects not found in Spark shell

2014-05-26 Thread Piotr Kołaczkowski (JIRA)
Piotr Kołaczkowski created SPARK-1927:
-

 Summary: Implicits declared in companion objects not found in 
Spark shell
 Key: SPARK-1927
 URL: https://issues.apache.org/jira/browse/SPARK-1927
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
 Environment: Ubuntu Linux 14.04, Oracle Java 7u55
Reporter: Piotr Kołaczkowski


{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

trait Mapper[T]
class Foo
object Foo { implicit object FooMapper extends Mapper[Foo] }

// Exiting paste mode, now interpreting.

defined trait Mapper
defined class Foo
defined module Foo

scala> implicitly[Mapper[Foo]]
<console>:28: error: could not find implicit value for parameter e: Mapper[Foo]
              implicitly[Mapper[Foo]]
              ^
{code}

Exactly same example in the official Scala REPL (2.10.4):
{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

trait Mapper[T]
class Foo
object Foo { implicit object FooMapper extends Mapper[Foo] }

// Exiting paste mode, now interpreting.

defined trait Mapper
defined class Foo
defined module Foo

scala> implicitly[Mapper[Foo]]
res0: Mapper[Foo] = Foo$FooMapper$@4a20e9c6
{code}

I guess it might be another manifestation of the problem of everything being an 
inner object in the Spark REPL.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1929) DAGScheduler suspended by local task OOM

2014-05-26 Thread Zhen Peng (JIRA)
Zhen Peng created SPARK-1929:


 Summary: DAGScheduler suspended by local task OOM
 Key: SPARK-1929
 URL: https://issues.apache.org/jira/browse/SPARK-1929
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Zhen Peng
 Fix For: 1.0.0


DAGScheduler does not handle local task OOM properly, and will wait for the job 
result forever.
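For illustration, a hypothetical local-mode job of the kind that could trigger this (the allocation size and heap setting are assumptions, not taken from the reporter):

{code}
// Run with a small driver heap (e.g. -Xmx64m). Each local task tries to
// materialize a ~1 GB array and throws OutOfMemoryError; the reported issue
// is that the DAGScheduler then waits for the job result indefinitely.
import org.apache.spark.{SparkConf, SparkContext}

object LocalOomRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("oom-repro"))
    sc.parallelize(1 to 4, 4).map(_ => Array.ofDim[Byte](1 << 30).length).collect()
    sc.stop()
  }
}
{code}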



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1928) DAGScheduler suspended by local task OOM

2014-05-26 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14008850#comment-14008850
 ] 

Guoqiang Li commented on SPARK-1928:


How to reproduce the issue?

 DAGScheduler suspended by local task OOM
 

 Key: SPARK-1928
 URL: https://issues.apache.org/jira/browse/SPARK-1928
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Zhen Peng
 Fix For: 1.0.0


 DAGScheduler does not handle local task OOM properly, and will wait for the 
 job result forever.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1930) When the yarn containers occupies 8G memory ,the containers were killed

2014-05-26 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-1930:
---

Description: 
When the containers occupy 8 GB of memory, they are killed.
YARN NodeManager log:
{code}
2014-05-23 13:35:30,776 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Container [pid=4947,containerID=container_1400809535638_0015_01_05] is 
running beyond physical memory limits. Current usage: 8.6 GB of 8.5 GB physical 
memory used; 10.0 GB of 17.8 GB virtual memory used. Killing container.
Dump of the process-tree for container_1400809535638_0015_01_05 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 4947 25417 4947 4947 (bash) 0 0 110804992 335 /bin/bash -c 
/usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill 
%p' -Xms8192m -Xmx8192m  -Xss2m 
-Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp
  -Dlog4j.configuration=log4j-spark-container.properties 
-Dspark.akka.askTimeout=120 -Dspark.akka.timeout=120 
-Dspark.akka.frameSize=20 
org.apache.spark.executor.CoarseGrainedExecutorBackend 
akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 
10dian72.domain.test 4 1> 
/var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stdout
2> 
/var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stderr
 
|- 4957 4947 4947 4947 (java) 157809 12620 10667016192 2245522 
/usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill %p 
-Xms8192m -Xmx8192m -Xss2m 
-Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp
 -Dlog4j.configuration=log4j-spark-container.properties 
-Dspark.akka.askTimeout=120 -Dspark.akka.timeout=120 -Dspark.akka.frameSize=20 
org.apache.spark.executor.CoarseGrainedExecutorBackend 
akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 
10dian72.domain.test 4 

2014-05-23 13:35:30,776 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Removed ProcessTree with root 4947
2014-05-23 13:35:30,776 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_1400809535638_0015_01_05 transitioned from RUNNING to 
KILLING
2014-05-23 13:35:30,777 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
 Cleaning up container container_1400809535638_0015_01_05
2014-05-23 13:35:30,788 WARN 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
from container container_1400809535638_0015_01_05 is : 143
2014-05-23 13:35:30,829 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_1400809535638_0015_01_05 transitioned from KILLING to 
CONTAINER_CLEANEDUP_AFTER_KILL
2014-05-23 13:35:30,830 INFO 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
absolute path : 
/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05
2014-05-23 13:35:30,830 INFO 
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=spark
OPERATION=Container Finished - Killed   TARGET=ContainerImplRESULT=SUCCESS  
APPID=application_1400809535638_0015
CONTAINERID=container_1400809535638_0015_01_05
2014-05-23 13:35:30,830 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_1400809535638_0015_01_05 transitioned from 
CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2014-05-23 13:35:30,830 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
 Removing container_1400809535638_0015_01_05 from application 
application_1400809535638_0015
{code}
I think it is related to {{YarnAllocationHandler.MEMORY_OVERHEAD}}: 
https://github.com/apache/spark/blob/master/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala#L562

Relative to 8 GB, 384 MB is too small.
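For illustration, a small Scala sketch of the point being made; the proportional formula below is an assumption used only for comparison, not Spark's actual behavior:

{code}
// With an 8192 MB executor heap, a fixed 384 MB overhead leaves little room
// for off-heap usage (threads, NIO buffers, JVM bookkeeping), which is how the
// container exceeds its limit and is killed by YARN.
val executorMemoryMb = 8192
val fixedOverheadMb = 384
val proportionalOverheadMb = math.max(384, (executorMemoryMb * 0.10).toInt)      // e.g. 10% of the heap
println(s"fixed limit: ${executorMemoryMb + fixedOverheadMb} MB")                // 8576 MB
println(s"proportional limit: ${executorMemoryMb + proportionalOverheadMb} MB")  // 9011 MB
{code}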

  was:
When the containers occupy 8 GB of memory, they are killed.
YARN NodeManager log:
{code}
2014-05-23 13:00:23,856 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Container [pid=42542,containerID=container_1400809535638_0013_01_15] is 
running beyond physical memory limits. Current usage: 8.5 GB of 8.5 GB physical 
memory used; 9.6 GB of 17.8 GB virtual memory used. Killing container.
Dump of the process-tree for container_1400809535638_0013_01_15 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 

[jira] [Updated] (SPARK-1930) Container memory beyond limit, were killed

2014-05-26 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-1930:
---

Summary: Container memory beyond limit, were killed  (was: When the yarn 
containers occupies 8G memory ,the containers were killed)

 Container memory beyond limit, were killed
 --

 Key: SPARK-1930
 URL: https://issues.apache.org/jira/browse/SPARK-1930
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Guoqiang Li

 When the containers occupy 8 GB of memory, they are killed.
 YARN NodeManager log:
 {code}
 2014-05-23 13:35:30,776 WARN 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
  Container [pid=4947,containerID=container_1400809535638_0015_01_05] is 
 running beyond physical memory limits. Current usage: 8.6 GB of 8.5 GB 
 physical memory used; 10.0 GB of 17.8 GB virtual memory used. Killing 
 container.
 Dump of the process-tree for container_1400809535638_0015_01_05 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
 SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 4947 25417 4947 4947 (bash) 0 0 110804992 335 /bin/bash -c 
 /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill 
 %p' -Xms8192m -Xmx8192m  -Xss2m 
 -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp
   -Dlog4j.configuration=log4j-spark-container.properties 
 -Dspark.akka.askTimeout=120 -Dspark.akka.timeout=120 
 -Dspark.akka.frameSize=20 
 org.apache.spark.executor.CoarseGrainedExecutorBackend 
 akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 
 10dian72.domain.test 4 1> 
 /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stdout
 2> 
 /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stderr
  
 |- 4957 4947 4947 4947 (java) 157809 12620 10667016192 2245522 
 /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill 
 %p -Xms8192m -Xmx8192m -Xss2m 
 -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp
  -Dlog4j.configuration=log4j-spark-container.properties 
 -Dspark.akka.askTimeout=120 -Dspark.akka.timeout=120 
 -Dspark.akka.frameSize=20 
 org.apache.spark.executor.CoarseGrainedExecutorBackend 
 akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 
 10dian72.domain.test 4 
 2014-05-23 13:35:30,776 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
  Removed ProcessTree with root 4947
 2014-05-23 13:35:30,776 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
  Container container_1400809535638_0015_01_05 transitioned from RUNNING 
 to KILLING
 2014-05-23 13:35:30,777 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
  Cleaning up container container_1400809535638_0015_01_05
 2014-05-23 13:35:30,788 WARN 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
 from container container_1400809535638_0015_01_05 is : 143
 2014-05-23 13:35:30,829 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
  Container container_1400809535638_0015_01_05 transitioned from KILLING 
 to CONTAINER_CLEANEDUP_AFTER_KILL
 2014-05-23 13:35:30,830 INFO 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
 absolute path : 
 /yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05
 2014-05-23 13:35:30,830 INFO 
 org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=spark
 OPERATION=Container Finished - Killed   TARGET=ContainerImpl
 RESULT=SUCCESS  APPID=application_1400809535638_0015
 CONTAINERID=container_1400809535638_0015_01_05
 2014-05-23 13:35:30,830 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
  Container container_1400809535638_0015_01_05 transitioned from 
 CONTAINER_CLEANEDUP_AFTER_KILL to DONE
 2014-05-23 13:35:30,830 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
  Removing container_1400809535638_0015_01_05 from application 
 application_1400809535638_0015
 {code}
 I think it is related to {{YarnAllocationHandler.MEMORY_OVERHEAD}}: 
 https://github.com/apache/spark/blob/master/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala#L562
 Relative to 8 GB, 384 MB is too small.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables

2014-05-26 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-1931:
--

Description: 
Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy 
where, after repartitioning the edges, it reuses the VertexRDD without updating 
the routing tables to reflect the new edge layout. This causes the following 
test to fail:

{code}
  val g = Graph(
    sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
    sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
  assert(g.triplets.collect.map(_.toTuple).toSet ===
    Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
  val gPart = g.partitionBy(EdgePartition2D)
  assert(gPart.triplets.collect.map(_.toTuple).toSet ===
    Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
{code}
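For illustration, a plain-Scala sketch of why the routing data must be rebuilt (the names are assumptions, not GraphX internals): routing maps each vertex ID to the set of edge partitions that reference it, so it has to be derived from the new edge partitioning produced by partitionBy rather than reused from the old layout.

{code}
def buildRouting(
    edges: Seq[(Long, Long)],           // (srcId, dstId)
    partitionOf: ((Long, Long)) => Int  // the *new* edge partitioner
): Map[Long, Set[Int]] =
  edges
    .flatMap { case (src, dst) =>
      val pid = partitionOf((src, dst))
      Seq(src -> pid, dst -> pid)
    }
    .groupBy(_._1)
    .map { case (vid, pairs) => vid -> pairs.map(_._2).toSet }
{code}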

  was:
Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy 
where, after repartitioning the edges, it reuses the VertexRDD without updating 
the routing tables to reflect the new edge layout. This causes the following 
test to fail:

{code:scala}
  val g = Graph(
    sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
    sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
  assert(g.triplets.collect.map(_.toTuple).toSet ===
    Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
  val gPart = g.partitionBy(EdgePartition2D)
  assert(gPart.triplets.collect.map(_.toTuple).toSet ===
    Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
{code}


 Graph.partitionBy does not reconstruct routing tables
 -

 Key: SPARK-1931
 URL: https://issues.apache.org/jira/browse/SPARK-1931
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.0
Reporter: Ankur Dave

 Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in 
 partitionBy where, after repartitioning the edges, it reuses the VertexRDD 
 without updating the routing tables to reflect the new edge layout. This 
 causes the following test to fail:
 {code}
   val g = Graph(
     sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
     sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
   assert(g.triplets.collect.map(_.toTuple).toSet ===
     Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
   val gPart = g.partitionBy(EdgePartition2D)
   assert(gPart.triplets.collect.map(_.toTuple).toSet ===
     Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables

2014-05-26 Thread Ankur Dave (JIRA)
Ankur Dave created SPARK-1931:
-

 Summary: Graph.partitionBy does not reconstruct routing tables
 Key: SPARK-1931
 URL: https://issues.apache.org/jira/browse/SPARK-1931
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.0
Reporter: Ankur Dave


Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy 
where, after repartitioning the edges, it reuses the VertexRDD without updating 
the routing tables to reflect the new edge layout. This causes the following 
test to fail:

{code:scala}
  val g = Graph(
    sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
    sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
  assert(g.triplets.collect.map(_.toTuple).toSet ===
    Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
  val gPart = g.partitionBy(EdgePartition2D)
  assert(gPart.triplets.collect.map(_.toTuple).toSet ===
    Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables

2014-05-26 Thread Ankur Dave (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14008985#comment-14008985
 ] 

Ankur Dave commented on SPARK-1931:
---

The fix is in PR #885: https://github.com/apache/spark/pull/885

 Graph.partitionBy does not reconstruct routing tables
 -

 Key: SPARK-1931
 URL: https://issues.apache.org/jira/browse/SPARK-1931
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.0
Reporter: Ankur Dave

 Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in 
 partitionBy where, after repartitioning the edges, it reuses the VertexRDD 
 without updating the routing tables to reflect the new edge layout. This 
 causes the following test to fail:
 {code}
   val g = Graph(
     sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
     sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
   assert(g.triplets.collect.map(_.toTuple).toSet ===
     Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
   val gPart = g.partitionBy(EdgePartition2D)
   assert(gPart.triplets.collect.map(_.toTuple).toSet ===
     Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Issue Comment Deleted] (SPARK-1750) EdgePartition is not serialized properly

2014-05-26 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-1750:
--

Comment: was deleted

(was: Resolved in PR #742: https://github.com/apache/spark/pull/742)

 EdgePartition is not serialized properly
 

 Key: SPARK-1750
 URL: https://issues.apache.org/jira/browse/SPARK-1750
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Ankur Dave
 Fix For: 1.0.0


 The GraphX design attempts to avoid moving edges across the network, instead 
 shipping the vertices to the edge partitions. However, Spark sometimes needs 
 to move the edges, such as for straggler mitigation.
 All EdgePartition fields are currently declared transient, so the edges will 
 not be serialized properly. Even if they are not marked transient, Kryo is 
 unable to serialize the EdgePartition, failing with the following error:
 {code}
 java.lang.IllegalArgumentException: Can not set final 
 org.apache.spark.graphx.util.collection.PrimitiveKeyOpenHashMap field 
 org.apache.spark.graphx.impl.EdgePartition.index to 
 scala.collection.immutable.$colon$colon
 {code}
 A workaround is to discourage Spark from moving the edges by setting 
 {{spark.locality.wait}} to a high value such as 10.
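For illustration, the workaround might be applied from application code like this (a sketch; the master URL and app name are placeholders, and the value is the one suggested above):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Discourage Spark from moving edge-partition tasks by raising the locality wait.
val conf = new SparkConf()
  .setMaster("local[*]")             // placeholder
  .setAppName("graphx-app")          // placeholder
  .set("spark.locality.wait", "10")  // "a high value such as 10", per the description
val sc = new SparkContext(conf)
{code}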



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1750) EdgePartition is not serialized properly

2014-05-26 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave resolved SPARK-1750.
---

   Resolution: Fixed
Fix Version/s: 1.0.0

https://github.com/apache/spark/pull/742

 EdgePartition is not serialized properly
 

 Key: SPARK-1750
 URL: https://issues.apache.org/jira/browse/SPARK-1750
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Ankur Dave
 Fix For: 1.0.0


 The GraphX design attempts to avoid moving edges across the network, instead 
 shipping the vertices to the edge partitions. However, Spark sometimes needs 
 to move the edges, such as for straggler mitigation.
 All EdgePartition fields are currently declared transient, so the edges will 
 not be serialized properly. Even if they are not marked transient, Kryo is 
 unable to serialize the EdgePartition, failing with the following error:
 {code}
 java.lang.IllegalArgumentException: Can not set final 
 org.apache.spark.graphx.util.collection.PrimitiveKeyOpenHashMap field 
 org.apache.spark.graphx.impl.EdgePartition.index to 
 scala.collection.immutable.$colon$colon
 {code}
 A workaround is to discourage Spark from moving the edges by setting 
 {{spark.locality.wait}} to a high value such as 10.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1750) EdgePartition is not serialized properly

2014-05-26 Thread Ankur Dave (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14008988#comment-14008988
 ] 

Ankur Dave commented on SPARK-1750:
---

Resolved in PR #742: https://github.com/apache/spark/pull/742

 EdgePartition is not serialized properly
 

 Key: SPARK-1750
 URL: https://issues.apache.org/jira/browse/SPARK-1750
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Ankur Dave
 Fix For: 1.0.0


 The GraphX design attempts to avoid moving edges across the network, instead 
 shipping the vertices to the edge partitions. However, Spark sometimes needs 
 to move the edges, such as for straggler mitigation.
 All EdgePartition fields are currently declared transient, so the edges will 
 not be serialized properly. Even if they are not marked transient, Kryo is 
 unable to serialize the EdgePartition, failing with the following error:
 {code}
 java.lang.IllegalArgumentException: Can not set final 
 org.apache.spark.graphx.util.collection.PrimitiveKeyOpenHashMap field 
 org.apache.spark.graphx.impl.EdgePartition.index to 
 scala.collection.immutable.$colon$colon
 {code}
 A workaround is to discourage Spark from moving the edges by setting 
 {{spark.locality.wait}} to a high value such as 10.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1577) GraphX mapVertices with KryoSerialization

2014-05-26 Thread Ankur Dave (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14008992#comment-14008992
 ] 

Ankur Dave commented on SPARK-1577:
---

Resolved by re-enabling Kryo reference tracking in #742: 
https://github.com/apache/spark/pull/742
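For reference, a sketch of the relevant configuration set programmatically (one option among several; the property names exist in Spark):

{code}
import org.apache.spark.SparkConf

// Reference tracking lets Kryo handle shared/nested object graphs instead of
// recursing until a StackOverflowError like the one in the trace below.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "org.apache.spark.graphx.GraphKryoRegistrator")
  .set("spark.kryo.referenceTracking", "true")
{code}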

 GraphX mapVertices with KryoSerialization
 -

 Key: SPARK-1577
 URL: https://issues.apache.org/jira/browse/SPARK-1577
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Joseph E. Gonzalez

 If Kryo is enabled by setting:
 {code}
 SPARK_JAVA_OPTS+="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer "
 SPARK_JAVA_OPTS+="-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator "
 {code}
 in conf/spark_env.conf and running the following block of code in the shell:
 {code}
 import org.apache.spark.graphx._
 import org.apache.spark.graphx.lib._
 import org.apache.spark.rdd.RDD
 val vertexArray = Array(
   (1L, ("Alice", 28)),
   (2L, ("Bob", 27)),
   (3L, ("Charlie", 65)),
   (4L, ("David", 42)),
   (5L, ("Ed", 55)),
   (6L, ("Fran", 50))
   )
 val edgeArray = Array(
   Edge(2L, 1L, 7),
   Edge(2L, 4L, 2),
   Edge(3L, 2L, 4),
   Edge(3L, 6L, 3),
   Edge(4L, 1L, 1),
   Edge(5L, 2L, 2),
   Edge(5L, 3L, 8),
   Edge(5L, 6L, 3)
   )
 val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
 val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
 val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)
 // Define a class to more clearly model the user property
 case class User(name: String, age: Int, inDeg: Int, outDeg: Int)
 // Transform the graph
 val userGraph = graph.mapVertices { case (id, (name, age)) => User(name, age, 0, 0) }
 {code}
 The following block of code works:
 {code}
 userGraph.vertices.count
 {code}
 and the following block of code generates a Kryo error:
 {code}
 userGraph.vertices.collect
 {code}
 The error:
 {code}
 java.lang.StackOverflowError
   at 
 sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54)
   at 
 sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
   at java.lang.reflect.Field.get(Field.java:379)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
 {code}
  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1329) ArrayIndexOutOfBoundsException if graphx.Graph has more edge partitions than node partitions

2014-05-26 Thread Ankur Dave (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14008994#comment-14008994
 ] 

Ankur Dave commented on SPARK-1329:
---

This was resolved by #368: https://github.com/apache/spark/pull/368

 ArrayIndexOutOfBoundsException if graphx.Graph has more edge partitions than 
 node partitions
 

 Key: SPARK-1329
 URL: https://issues.apache.org/jira/browse/SPARK-1329
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 0.9.0
Reporter: Daniel Darabos

 To reproduce, let's look at a graph with two nodes in one partition, and two 
 edges between them split across two partitions:
 scala> val vs = sc.makeRDD(Seq(1L -> null, 2L -> null), 1)
 scala> val es = sc.makeRDD(Seq(graphx.Edge(1, 2, null), graphx.Edge(2, 1, 
 null)), 2)
 scala> val g = graphx.Graph(vs, es)
 Everything seems fine, until GraphX needs to join the two RDDs:
 scala> g.triplets.collect
 [...]
 java.lang.ArrayIndexOutOfBoundsException: 1
   at 
 org.apache.spark.graphx.impl.RoutingTable$$anonfun$2$$anonfun$apply$3.apply(RoutingTable.scala:76)
   at 
 org.apache.spark.graphx.impl.RoutingTable$$anonfun$2$$anonfun$apply$3.apply(RoutingTable.scala:75)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 org.apache.spark.graphx.impl.RoutingTable$$anonfun$2.apply(RoutingTable.scala:75)
   at 
 org.apache.spark.graphx.impl.RoutingTable$$anonfun$2.apply(RoutingTable.scala:73)
   at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
   at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:71)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:85)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
   at org.apache.spark.scheduler.Task.run(Task.scala:53)
   at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 The bug is fairly obvious in RoutingTable.createPid2Vid() -- it creates an 
 array of length vertices.partitions.size, and then looks up partition IDs 
 from the edges.partitionsRDD in it.
 A graph usually has more edges than nodes. So it is natural to have more edge 
 partitions than node partitions.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1311) Use map side distinct in collect vertex ids from edges graphx

2014-05-26 Thread Ankur Dave (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14008997#comment-14008997
 ] 

Ankur Dave commented on SPARK-1311:
---

This was resolved by PR #497, which removed collectVertexIds and instead 
performed the operation as a side effect of constructing the routing tables: 
https://github.com/apache/spark/pull/497/files#diff-8ea535724b3f014cfef17284b3e783feR397

 Use map side distinct in collect vertex ids from edges graphx
 -

 Key: SPARK-1311
 URL: https://issues.apache.org/jira/browse/SPARK-1311
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Holden Karau
Priority: Minor

 See GRAPH-1



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables

2014-05-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1931:
---

Assignee: Ankur Dave

 Graph.partitionBy does not reconstruct routing tables
 -

 Key: SPARK-1931
 URL: https://issues.apache.org/jira/browse/SPARK-1931
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.0
Reporter: Ankur Dave
Assignee: Ankur Dave

 Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in 
 partitionBy where, after repartitioning the edges, it reuses the VertexRDD 
 without updating the routing tables to reflect the new edge layout. This 
 causes the following test to fail:
 {code}
   val g = Graph(
     sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
     sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
   assert(g.triplets.collect.map(_.toTuple).toSet ===
     Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
   val gPart = g.partitionBy(EdgePartition2D)
   assert(gPart.triplets.collect.map(_.toTuple).toSet ===
     Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1311) Use map side distinct in collect vertex ids from edges graphx

2014-05-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1311:
---

Assignee: Ankur Dave

 Use map side distinct in collect vertex ids from edges graphx
 -

 Key: SPARK-1311
 URL: https://issues.apache.org/jira/browse/SPARK-1311
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Holden Karau
Assignee: Ankur Dave
Priority: Minor

 See GRAPH-1



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1329) ArrayIndexOutOfBoundsException if graphx.Graph has more edge partitions than node partitions

2014-05-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-1329.


   Resolution: Fixed
Fix Version/s: 1.0.0
 Assignee: Ankur Dave

 ArrayIndexOutOfBoundsException if graphx.Graph has more edge partitions than 
 node partitions
 

 Key: SPARK-1329
 URL: https://issues.apache.org/jira/browse/SPARK-1329
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 0.9.0
Reporter: Daniel Darabos
Assignee: Ankur Dave
 Fix For: 1.0.0


 To reproduce, let's look at a graph with two nodes in one partition, and two 
 edges between them split across two partitions:
 scala> val vs = sc.makeRDD(Seq(1L -> null, 2L -> null), 1)
 scala> val es = sc.makeRDD(Seq(graphx.Edge(1, 2, null), graphx.Edge(2, 1, 
 null)), 2)
 scala> val g = graphx.Graph(vs, es)
 Everything seems fine, until GraphX needs to join the two RDDs:
 scala> g.triplets.collect
 [...]
 java.lang.ArrayIndexOutOfBoundsException: 1
   at 
 org.apache.spark.graphx.impl.RoutingTable$$anonfun$2$$anonfun$apply$3.apply(RoutingTable.scala:76)
   at 
 org.apache.spark.graphx.impl.RoutingTable$$anonfun$2$$anonfun$apply$3.apply(RoutingTable.scala:75)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 org.apache.spark.graphx.impl.RoutingTable$$anonfun$2.apply(RoutingTable.scala:75)
   at 
 org.apache.spark.graphx.impl.RoutingTable$$anonfun$2.apply(RoutingTable.scala:73)
   at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
   at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:71)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at 
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:85)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
   at org.apache.spark.scheduler.Task.run(Task.scala:53)
   at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 The bug is fairly obvious in RoutingTable.createPid2Vid() -- it creates an 
 array of length vertices.partitions.size, and then looks up partition IDs 
 from the edges.partitionsRDD in it.
 A graph usually has more edges than nodes. So it is natural to have more edge 
 partitions than node partitions.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1577) GraphX mapVertices with KryoSerialization

2014-05-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-1577.


   Resolution: Fixed
Fix Version/s: 1.0.0

 GraphX mapVertices with KryoSerialization
 -

 Key: SPARK-1577
 URL: https://issues.apache.org/jira/browse/SPARK-1577
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Joseph E. Gonzalez
Assignee: Ankur Dave
 Fix For: 1.0.0


 If Kryo is enabled by setting:
 {code}
 SPARK_JAVA_OPTS+="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer "
 SPARK_JAVA_OPTS+="-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator "
 {code}
 in conf/spark_env.conf and running the following block of code in the shell:
 {code}
 import org.apache.spark.graphx._
 import org.apache.spark.graphx.lib._
 import org.apache.spark.rdd.RDD
 val vertexArray = Array(
   (1L, ("Alice", 28)),
   (2L, ("Bob", 27)),
   (3L, ("Charlie", 65)),
   (4L, ("David", 42)),
   (5L, ("Ed", 55)),
   (6L, ("Fran", 50))
   )
 val edgeArray = Array(
   Edge(2L, 1L, 7),
   Edge(2L, 4L, 2),
   Edge(3L, 2L, 4),
   Edge(3L, 6L, 3),
   Edge(4L, 1L, 1),
   Edge(5L, 2L, 2),
   Edge(5L, 3L, 8),
   Edge(5L, 6L, 3)
   )
 val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
 val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
 val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)
 // Define a class to more clearly model the user property
 case class User(name: String, age: Int, inDeg: Int, outDeg: Int)
 // Transform the graph
 val userGraph = graph.mapVertices { case (id, (name, age)) => User(name, age, 0, 0) }
 {code}
 The following block of code works:
 {code}
 userGraph.vertices.count
 {code}
 and the following block of code generates a Kryo error:
 {code}
 userGraph.vertices.collect
 {code}
 The error:
 {code}
 java.lang.StackOverflowError
   at 
 sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54)
   at 
 sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
   at java.lang.reflect.Field.get(Field.java:379)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
 {code}
  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1750) EdgePartition is not serialized properly

2014-05-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1750:
---

Assignee: Joseph E. Gonzalez

 EdgePartition is not serialized properly
 

 Key: SPARK-1750
 URL: https://issues.apache.org/jira/browse/SPARK-1750
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Ankur Dave
Assignee: Joseph E. Gonzalez
 Fix For: 1.0.0


 The GraphX design attempts to avoid moving edges across the network, instead 
 shipping the vertices to the edge partitions. However, Spark sometimes needs 
 to move the edges, such as for straggler mitigation.
 All EdgePartition fields are currently declared transient, so the edges will 
 not be serialized properly. Even if they are not marked transient, Kryo is 
 unable to serialize the EdgePartition, failing with the following error:
 {code}
 java.lang.IllegalArgumentException: Can not set final 
 org.apache.spark.graphx.util.collection.PrimitiveKeyOpenHashMap field 
 org.apache.spark.graphx.impl.EdgePartition.index to 
 scala.collection.immutable.$colon$colon
 {code}
 A workaround is to discourage Spark from moving the edges by setting 
 {{spark.locality.wait}} to a high value such as 10.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1311) Use map side distinct in collect vertex ids from edges graphx

2014-05-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-1311.


   Resolution: Fixed
Fix Version/s: 1.0.0

 Use map side distinct in collect vertex ids from edges graphx
 -

 Key: SPARK-1311
 URL: https://issues.apache.org/jira/browse/SPARK-1311
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Holden Karau
Assignee: Ankur Dave
Priority: Minor
 Fix For: 1.0.0


 See GRAPH-1



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1577) GraphX mapVertices with KryoSerialization

2014-05-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1577:
---

Assignee: Ankur Dave

 GraphX mapVertices with KryoSerialization
 -

 Key: SPARK-1577
 URL: https://issues.apache.org/jira/browse/SPARK-1577
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Joseph E. Gonzalez
Assignee: Ankur Dave
 Fix For: 1.0.0


 If Kryo is enabled by setting:
 {code}
 SPARK_JAVA_OPTS+="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer "
 SPARK_JAVA_OPTS+="-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator "
 {code}
 in conf/spark_env.conf and running the following block of code in the shell:
 {code}
 import org.apache.spark.graphx._
 import org.apache.spark.graphx.lib._
 import org.apache.spark.rdd.RDD
 val vertexArray = Array(
   (1L, ("Alice", 28)),
   (2L, ("Bob", 27)),
   (3L, ("Charlie", 65)),
   (4L, ("David", 42)),
   (5L, ("Ed", 55)),
   (6L, ("Fran", 50))
   )
 val edgeArray = Array(
   Edge(2L, 1L, 7),
   Edge(2L, 4L, 2),
   Edge(3L, 2L, 4),
   Edge(3L, 6L, 3),
   Edge(4L, 1L, 1),
   Edge(5L, 2L, 2),
   Edge(5L, 3L, 8),
   Edge(5L, 6L, 3)
   )
 val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
 val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
 val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)
 // Define a class to more clearly model the user property
 case class User(name: String, age: Int, inDeg: Int, outDeg: Int)
 // Transform the graph
 val userGraph = graph.mapVertices { case (id, (name, age)) => User(name, age, 0, 0) }
 {code}
 The following block of code works:
 {code}
 userGraph.vertices.count
 {code}
 and the following block of code generates a Kryo error:
 {code}
 userGraph.vertices.collect
 {code}
 The error:
 {code}
 java.lang.StackOverflowError
   at 
 sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54)
   at 
 sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38)
   at java.lang.reflect.Field.get(Field.java:379)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
 {code}
  



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-26 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1518:
---

Assignee: Colin Patrick McCabe

 Spark master doesn't compile against hadoop-common trunk
 

 Key: SPARK-1518
 URL: https://issues.apache.org/jira/browse/SPARK-1518
 Project: Spark
  Issue Type: Bug
Reporter: Marcelo Vanzin
Assignee: Colin Patrick McCabe

 FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
 FileLogger.scala is calling it.
 I've changed it locally to hsync() so I can compile the code, but haven't 
 checked yet whether those are equivalent. hsync() seems to have been there 
 forever, so it hopefully works with all versions Spark cares about.
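For illustration, a small sketch of the replacement call (the path and write are placeholders; whether hsync() matches the old sync() semantics is exactly the open question above):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val out = fs.create(new Path("/tmp/spark-event-log"))  // placeholder path
out.writeBytes("event\n")
out.hsync()  // replaces the removed FSDataOutputStream.sync(); hflush() is the weaker variant
out.close()
{code}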



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()

2014-05-26 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14009145#comment-14009145
 ] 

Patrick Wendell commented on SPARK-983:
---

We are actually looking at this problem in a few different places in the code 
base (we did this already for the external aggregations, and we also have 
SPARK-1777).

Relying on GCs to decide when to spill is an interesting approach, but I'd 
rather have control over the heuristics ourselves. I think you'd get thrashing 
behavior where a GC occurs and suddenly a million threads start writing to 
disk. In the past we've used a different mechanism (the size estimator), which 
approximates memory usage.

It might make sense to introduce a simple memory allocation mechanism that is 
shared between the external aggregation maps, partition unrolling, etc. This is 
something where a design doc would be helpful.
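For illustration, a minimal sketch of a size-estimator-style spill check, with 
hypothetical names (this is not the actual ExternalAppendOnlyMap logic):
{code}
// Hypothetical sketch: spill based on an estimated in-memory size rather than
// on GC activity. `estimateSize` and `memoryThresholdBytes` are illustrative
// stand-ins, not real Spark APIs.
import scala.collection.mutable.ArrayBuffer

class SpillableBuffer[T](memoryThresholdBytes: Long,
                         estimateSize: ArrayBuffer[T] => Long) {
  private val buffer = new ArrayBuffer[T]

  def insert(elem: T): Unit = {
    buffer += elem
    // Check the estimate only every N inserts to keep the overhead low.
    if (buffer.size % 1000 == 0 && estimateSize(buffer) > memoryThresholdBytes) {
      spillToDisk()
    }
  }

  private def spillToDisk(): Unit = {
    // Sort and write the buffered elements to a temporary file, then release
    // the memory (details omitted in this sketch).
    buffer.clear()
  }
}
{code}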


 Support external sorting for RDD#sortByKey()
 

 Key: SPARK-983
 URL: https://issues.apache.org/jira/browse/SPARK-983
 Project: Spark
  Issue Type: New Feature
Affects Versions: 0.9.0
Reporter: Reynold Xin

 Currently, RDD#sortByKey() is implemented with a mapPartitions call that creates a 
 buffer to hold the entire partition, then sorts it. This will cause an OOM if 
 an entire partition cannot fit in memory, which is especially problematic for 
 skewed data. Rather than OOMing, the behavior should be similar to the 
 [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
  where we fall back to disk if we detect memory pressure.
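 For reference, a rough sketch of the buffering pattern described above 
 (simplified; not the actual sortByKey implementation):
 {code}
 // Simplified sketch of the "buffer the whole partition, then sort" pattern;
 // materializing the entire partition here is what can OOM on skewed data.
 def sortPartition[K, V](iter: Iterator[(K, V)])(implicit ord: Ordering[K]): Iterator[(K, V)] = {
   val buf = iter.toBuffer
   buf.sortBy(_._1).iterator
 }
 {code}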



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1784) Add a partitioner which partitions an RDD with each partition having specified # of keys

2014-05-26 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14009184#comment-14009184
 ] 

Patrick Wendell commented on SPARK-1784:


I think this might be subsumed by the fix to SPARK-1770.

https://github.com/apache/spark/pull/727/files

 Add a partitioner which partitions an RDD with each partition having 
 specified # of keys
 

 Key: SPARK-1784
 URL: https://issues.apache.org/jira/browse/SPARK-1784
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Syed A. Hashmi
Priority: Minor
 Fix For: 1.0.0


 At times on mailing lists, I have seen people complaining about having no 
 control over the # of keys per partition. RangePartitioner partitions keys into 
 roughly equal-sized partitions, but in cases where a user wants full control 
 over the exact partition size, that is not possible today.
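 As an illustration, a minimal sketch of a custom Partitioner that caps the # of 
 keys per partition, assuming a dense key-to-index map is computed up front 
 (hypothetical class, not an existing Spark API):
 {code}
 // Hypothetical sketch: assign each key to a partition holding at most
 // `keysPerPartition` keys, given a precomputed key-to-index map.
 import org.apache.spark.Partitioner

 class FixedSizePartitioner(keyIndex: Map[Any, Int], keysPerPartition: Int) extends Partitioner {
   require(keysPerPartition > 0, "keysPerPartition must be positive")

   override val numPartitions: Int =
     math.max(1, (keyIndex.size + keysPerPartition - 1) / keysPerPartition)

   override def getPartition(key: Any): Int =
     keyIndex(key) / keysPerPartition
 }
 {code}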



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables

2014-05-26 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave resolved SPARK-1931.
---

   Resolution: Fixed
Fix Version/s: 1.0.0

 Graph.partitionBy does not reconstruct routing tables
 -

 Key: SPARK-1931
 URL: https://issues.apache.org/jira/browse/SPARK-1931
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.0
Reporter: Ankur Dave
Assignee: Ankur Dave
 Fix For: 1.0.0


 Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in 
 partitionBy where, after repartitioning the edges, it reuses the VertexRDD 
 without updating the routing tables to reflect the new edge layout. This 
 causes the following test to fail:
 {code}
   val g = Graph(
     sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))),
     sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2))
   assert(g.triplets.collect.map(_.toTuple).toSet ===
     Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
   val gPart = g.partitionBy(EdgePartition2D)
   assert(gPart.triplets.collect.map(_.toTuple).toSet ===
     Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1)))
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1925) Typo in org.apache.spark.mllib.tree.DecisionTree.isSampleValid

2014-05-26 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-1925.
-

   Resolution: Fixed
Fix Version/s: 1.0.0

 Typo in org.apache.spark.mllib.tree.DecisionTree.isSampleValid
 --

 Key: SPARK-1925
 URL: https://issues.apache.org/jira/browse/SPARK-1925
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Shixiong Zhu
Priority: Minor
  Labels: easyfix
 Fix For: 1.0.0


 I believe this is a typo:
 {code}
   if ((level > 0) & (parentFilters.length == 0)) {
 return false
   }
 {code}
 Should use && here.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1932) Race conditions in BlockManager.cachedPeers and ConnectionManager.onReceiveCallback

2014-05-26 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-1932:
---

 Summary: Race conditions in BlockManager.cachedPeers and 
ConnectionManager.onReceiveCallback
 Key: SPARK-1932
 URL: https://issues.apache.org/jira/browse/SPARK-1932
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Shixiong Zhu
Priority: Minor


BlockManager.cachedPeers and ConnectionManager.onReceiveCallback are read and 
written in different threads without proper protection.
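For illustration, one generic way to guard a lazily cached field against such a 
race (a sketch only, not necessarily the fix that will be applied here):
{code}
// Generic sketch: publish a lazily computed value safely across threads using
// double-checked locking. `fetchPeers` is an illustrative placeholder.
class PeerCache(fetchPeers: () => Seq[String]) {
  @volatile private var cachedPeers: Seq[String] = null

  def getPeers: Seq[String] = {
    if (cachedPeers == null) {
      synchronized {
        if (cachedPeers == null) {
          cachedPeers = fetchPeers()
        }
      }
    }
    cachedPeers
  }
}
{code}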



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1933) FileNotFoundException when a directory is passed to SparkContext.addJar/addFile

2014-05-26 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-1933:
--

 Summary: FileNotFoundException when a directory is passed to 
SparkContext.addJar/addFile
 Key: SPARK-1933
 URL: https://issues.apache.org/jira/browse/SPARK-1933
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.0.1


When SparkContext.addJar/addFile is used to add a directory (which is not 
supported), the runtime exception is 
{code}
java.io.FileNotFoundException: [file] (No such file or directory)
{code}

This exception is extremely confusing because the directory does exist. We 
should throw a more meaningful exception when a directory is passed to 
addJar/addFile.
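For illustration, a minimal sketch of the kind of up-front check that would give 
a clearer error (hypothetical helper; not necessarily how this will be fixed):
{code}
// Hypothetical validation sketch; names are illustrative.
import java.io.File

def validateIsFile(path: String): Unit = {
  val f = new File(path)
  if (f.isDirectory) {
    throw new IllegalArgumentException(
      s"Directory ($path) is not supported by addJar/addFile; pass individual files instead.")
  }
}
{code}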



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1933) FileNotFoundException when a directory is passed to SparkContext.addJar/addFile

2014-05-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14009239#comment-14009239
 ] 

Reynold Xin commented on SPARK-1933:


Pull request added https://github.com/apache/spark/pull/888

 FileNotFoundException when a directory is passed to 
 SparkContext.addJar/addFile
 ---

 Key: SPARK-1933
 URL: https://issues.apache.org/jira/browse/SPARK-1933
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.0.1


 When SparkContext.addJar/addFile is used to add a directory (which is not 
 supported), the runtime exception is 
 {code}
 java.io.FileNotFoundException: [file] (No such file or directory)
 {code}
 This exception is extremely confusing because the directory does exist. We 
 should throw a more meaningful exception when a directory is passed to 
 addJar/addFile.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1935) Set commons-codec 1.4 as a dependency

2014-05-26 Thread Yin Huai (JIRA)
Yin Huai created SPARK-1935:
---

 Summary: Set commons-codec 1.4 as a dependency
 Key: SPARK-1935
 URL: https://issues.apache.org/jira/browse/SPARK-1935
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 0.9.1
Reporter: Yin Huai
Priority: Minor


Right now, commons-codec is only a transitive dependency. When Spark is built with 
Maven for Hadoop 1, jets3t 0.7.1 pulls in commons-codec 1.3, which is an older 
version than the 1.4 that Hadoop 1.0.4 depends on. This older version can cause 
problems because 1.4 introduces new methods and incompatible changes that callers 
compiled against 1.4 rely on.
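For illustration, one way the dependency could be pinned explicitly on the sbt 
side (a sketch only; the Maven build would need an equivalent explicit entry):
{code}
// build.sbt sketch: declare commons-codec 1.4 explicitly so it wins over the
// 1.3 version pulled in transitively by jets3t 0.7.1.
libraryDependencies += "commons-codec" % "commons-codec" % "1.4"
{code}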



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1935) Explicitly add commons-codec 1.4 as a dependency

2014-05-26 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14009283#comment-14009283
 ] 

Patrick Wendell edited comment on SPARK-1935 at 5/27/14 5:11 AM:
-

Does commons-codec 1.4 really break compatibility with commons-codec 1.3? Or is 
the issue just that Hadoop is compiled against 1.4 but Maven is selecting 1.3, 
so the new 1.4 functions aren't available?

Also, would you mind giving the exact permutation of the build that is causing 
this error? I just want to see if it's also a problem in Spark 1.0 or if it's 
only in 0.9.


was (Author: pwendell):
Does commons-codec 1.4 really break compatibility with commons-codec 1.3? Or is 
the issue just that Hadoop is compiled against 1.4 but maven is selecting 1.3 
and so the new 1.4 functions aren't available.

 Explicitly add commons-codec 1.4 as a dependency
 

 Key: SPARK-1935
 URL: https://issues.apache.org/jira/browse/SPARK-1935
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 0.9.1
Reporter: Yin Huai
Priority: Minor

 Right now, commons-codec is a transitive dependency. When Spark is built by 
 maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3 which is an 
 older version (Hadoop 1.0.4 depends on 1.4). This older version can cause 
 problems because 1.4 introduces incompatible changes and new methods.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1935) Explicitly add commons-codec 1.4 as a dependency

2014-05-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14009284#comment-14009284
 ] 

Reynold Xin commented on SPARK-1935:


If you build Spark with Maven, commons-codec 1.3 is included. If you build 
Spark with SBT, commons-codec 1.4 is included.

Hive uses a Base64 decode(String) method that was introduced in 1.4. 

 Explicitly add commons-codec 1.4 as a dependency
 

 Key: SPARK-1935
 URL: https://issues.apache.org/jira/browse/SPARK-1935
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 0.9.1
Reporter: Yin Huai
Priority: Minor

 Right now, commons-codec is a transitive dependency. When Spark is built by 
 maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3 which is an 
 older version (Hadoop 1.0.4 depends on 1.4). This older version can cause 
 problems because 1.4 introduces incompatible changes and new methods.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-26 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14009293#comment-14009293
 ] 

Aaron Davidson commented on SPARK-1855:
---

I agree that significant improvements can be made to Spark's block replication 
model, but there's no reason it shouldn't work (albeit with potentially poor 
write performance and fewer guarantees than one would like) if you increase the 
replication level higher than 2, which is possible using 
[StorageLevel#apply|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L155].
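For illustration, a minimal sketch of creating such a storage level, assuming the 
argument order of the apply overload linked above (useDisk, useMemory, 
deserialized, replication):
{code}
// Sketch: a memory-and-disk storage level with replication factor 3.
// Argument order is assumed from the linked StorageLevel.apply overload.
import org.apache.spark.storage.StorageLevel

val memDiskReplicated3 = StorageLevel(
  /* useDisk = */ true, /* useMemory = */ true, /* deserialized = */ false, /* replication = */ 3)
// someRdd.persist(memDiskReplicated3)  // someRdd is a placeholder RDD
{code}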

 Provide memory-and-local-disk RDD checkpointing
 ---

 Key: SPARK-1855
 URL: https://issues.apache.org/jira/browse/SPARK-1855
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Spark Core
Affects Versions: 1.0.0
Reporter: Xiangrui Meng

 Checkpointing is used to cut long lineage while maintaining fault tolerance. 
 The current implementation is HDFS-based. Using the BlockRDD, we can create 
 in-memory-and-local-disk (with replication) checkpoints that are not as 
 reliable as the HDFS-based solution but are faster.
 This can help applications that require many iterations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1935) Explicitly add commons-codec 1.4 as a dependency

2014-05-26 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14009295#comment-14009295
 ] 

Yin Huai commented on SPARK-1935:
-

Thanks, [~rxin]. Let me add more info. 

Commands I used:
{code}
mvn clean -DskipTests clean package
{code}
{code}
sbt/sbt assembly
{code}
You can also check the pre-built Hadoop 1 package which has the 1.3 codec. 

There are a few methods in the Base64 class that were introduced in 1.4 
(http://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Base64.html).
I noticed the problem when Hive was calling 
{code}
public static byte[] decodeBase64(String base64String)
{code}

 Explicitly add commons-codec 1.4 as a dependency
 

 Key: SPARK-1935
 URL: https://issues.apache.org/jira/browse/SPARK-1935
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 0.9.1
Reporter: Yin Huai
Priority: Minor

 Right now, commons-codec is a transitive dependency. When Spark is built by 
 maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3 which is an 
 older version (Hadoop 1.0.4 depends on 1.4). This older version can cause 
 problems because 1.4 introduces incompatible changes and new methods.



--
This message was sent by Atlassian JIRA
(v6.2#6252)