[jira] [Commented] (SPARK-1914) Simplify CountFunction so it does not evaluate all child expressions.
[ https://issues.apache.org/jira/browse/SPARK-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008629#comment-14008629 ] Cheng Lian commented on SPARK-1914: --- Added steps to reproduce this bug in {{sbt hive/console}}. Simplify CountFunction so it does not evaluate all child expressions. - Key: SPARK-1914 URL: https://issues.apache.org/jira/browse/SPARK-1914 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin {{CountFunction}} should count up only if the child's evaluated value is not null. Because it traverses and evaluates all child expressions, it counts up whenever any of them is non-null, even when the child itself is null. To reproduce this bug in {{sbt hive/console}}: {code} scala> hql("SELECT COUNT(*) FROM src1").collect() res1: Array[org.apache.spark.sql.Row] = Array([25]) scala> hql("SELECT COUNT(*) FROM src1 WHERE key IS NULL").collect() res2: Array[org.apache.spark.sql.Row] = Array([10]) scala> hql("SELECT COUNT(key + 1) FROM src1").collect() res3: Array[org.apache.spark.sql.Row] = Array([25]) {code} {{res3}} should be 15 since there are 10 null keys. -- This message was sent by Atlassian JIRA (v6.2#6252)
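For context, a minimal sketch of the counting semantics being requested (names are illustrative, not taken from the actual patch):
{code}
// Count should increment only when the counted child expression itself
// evaluates to a non-null value for the current row.
def updateCount(childValue: Any, count: Long): Long =
  if (childValue != null) count + 1L else count
{code}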
[jira] [Updated] (SPARK-1914) Simplify CountFunction so it does not evaluate all child expressions.
[ https://issues.apache.org/jira/browse/SPARK-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-1914: -- Description: {{CountFunction}} should count up only if the child's evaluated value is not null. Because it traverses and evaluates all child expressions, it counts up whenever any of them is non-null, even when the child itself is null. To reproduce this bug in {{sbt hive/console}}: {code} scala> hql("SELECT COUNT(*) FROM src1").collect() res1: Array[org.apache.spark.sql.Row] = Array([25]) scala> hql("SELECT COUNT(*) FROM src1 WHERE key IS NULL").collect() res2: Array[org.apache.spark.sql.Row] = Array([10]) scala> hql("SELECT COUNT(key + 1) FROM src1").collect() res3: Array[org.apache.spark.sql.Row] = Array([25]) {code} {{res3}} should be 15 since there are 10 null keys. was: {{CountFunction}} should count up only if the child's evaluated value is not null. Because it traverses and evaluates all child expressions, it counts up whenever any of them is non-null, even when the child itself is null. Simplify CountFunction so it does not evaluate all child expressions. - Key: SPARK-1914 URL: https://issues.apache.org/jira/browse/SPARK-1914 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin {{CountFunction}} should count up only if the child's evaluated value is not null. Because it traverses and evaluates all child expressions, it counts up whenever any of them is non-null, even when the child itself is null. To reproduce this bug in {{sbt hive/console}}: {code} scala> hql("SELECT COUNT(*) FROM src1").collect() res1: Array[org.apache.spark.sql.Row] = Array([25]) scala> hql("SELECT COUNT(*) FROM src1 WHERE key IS NULL").collect() res2: Array[org.apache.spark.sql.Row] = Array([10]) scala> hql("SELECT COUNT(key + 1) FROM src1").collect() res3: Array[org.apache.spark.sql.Row] = Array([25]) {code} {{res3}} should be 15 since there are 10 null keys. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1914) Simplify CountFunction so it does not evaluate all child expressions.
[ https://issues.apache.org/jira/browse/SPARK-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-1914. Resolution: Fixed Fix Version/s: 1.0.0 Simplify CountFunction so it does not evaluate all child expressions. - Key: SPARK-1914 URL: https://issues.apache.org/jira/browse/SPARK-1914 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Fix For: 1.0.0 {{CountFunction}} should count up only if the child's evaluated value is not null. Because it traverses and evaluates all child expressions, it counts up whenever any of them is non-null, even when the child itself is null. To reproduce this bug in {{sbt hive/console}}: {code} scala> hql("SELECT COUNT(*) FROM src1").collect() res1: Array[org.apache.spark.sql.Row] = Array([25]) scala> hql("SELECT COUNT(*) FROM src1 WHERE key IS NULL").collect() res2: Array[org.apache.spark.sql.Row] = Array([10]) scala> hql("SELECT COUNT(key + 1) FROM src1").collect() res3: Array[org.apache.spark.sql.Row] = Array([25]) {code} {{res3}} should be 15 since there are 10 null keys. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1914) Simplify CountFunction so it does not evaluate all child expressions.
[ https://issues.apache.org/jira/browse/SPARK-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1914: --- Assignee: Takuya Ueshin Simplify CountFunction so it does not evaluate all child expressions. - Key: SPARK-1914 URL: https://issues.apache.org/jira/browse/SPARK-1914 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Fix For: 1.0.0 {{CountFunction}} should count up only if the child's evaluated value is not null. Because it traverses and evaluates all child expressions, it counts up whenever any of them is non-null, even when the child itself is null. To reproduce this bug in {{sbt hive/console}}: {code} scala> hql("SELECT COUNT(*) FROM src1").collect() res1: Array[org.apache.spark.sql.Row] = Array([25]) scala> hql("SELECT COUNT(*) FROM src1 WHERE key IS NULL").collect() res2: Array[org.apache.spark.sql.Row] = Array([10]) scala> hql("SELECT COUNT(key + 1) FROM src1").collect() res3: Array[org.apache.spark.sql.Row] = Array([25]) {code} {{res3}} should be 15 since there are 10 null keys. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1925) Typo in org.apache.spark.mllib.tree.DecisionTree.isSampleValid
[ https://issues.apache.org/jira/browse/SPARK-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1925: --- Description: I believe this is a typo: {code} if ((level > 0) & (parentFilters.length == 0)) { return false } {code} Should use && here. was: I believe this is a typo: if ((level > 0) & (parentFilters.length == 0)) { return false } Should use && here. Typo in org.apache.spark.mllib.tree.DecisionTree.isSampleValid -- Key: SPARK-1925 URL: https://issues.apache.org/jira/browse/SPARK-1925 Project: Spark Issue Type: Bug Components: MLlib Reporter: Shixiong Zhu Priority: Minor Labels: easyfix I believe this is a typo: {code} if ((level > 0) & (parentFilters.length == 0)) { return false } {code} Should use && here. -- This message was sent by Atlassian JIRA (v6.2#6252)
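As an aside, a small illustration of why {{&}} versus {{&&}} matters in Scala (a hypothetical snippet, not from the patch): {{&}} on Booleans evaluates both operands, while {{&&}} short-circuits when the left side is false.
{code}
def rightSide(): Boolean = { println("right side evaluated"); true }
val level = 0
(level > 0) & rightSide()   // prints "right side evaluated", then returns false
(level > 0) && rightSide()  // short-circuits; prints nothing, returns false
{code}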
[jira] [Commented] (SPARK-1926) Nullability of Max/Min/First should be true.
[ https://issues.apache.org/jira/browse/SPARK-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008631#comment-14008631 ] Takuya Ueshin commented on SPARK-1926: -- PRed: https://github.com/apache/spark/pull/881 Nullability of Max/Min/First should be true. Key: SPARK-1926 URL: https://issues.apache.org/jira/browse/SPARK-1926 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Nullability of {{Max}}/{{Min}}/{{First}} should be {{true}} because they return {{null}} if there are no rows. -- This message was sent by Atlassian JIRA (v6.2#6252)
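A quick illustration of the behavior described, runnable in {{sbt hive/console}} (the table and the always-false predicate are assumptions for the example):
{code}
scala> hql("SELECT MAX(key) FROM src WHERE 1 = 0").collect()
// expected: Array([null]) -- the aggregate over an empty input is null,
// so the result attribute's nullability must be true
{code}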
[jira] [Commented] (SPARK-1329) ArrayIndexOutOfBoundsException if graphx.Graph has more edge partitions than node partitions
[ https://issues.apache.org/jira/browse/SPARK-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008771#comment-14008771 ] Daniel Darabos commented on SPARK-1329: --- Sorry, I haven't looked into fixing this. We ended up not using GraphX, so we are no longer affected by the bug. ArrayIndexOutOfBoundsException if graphx.Graph has more edge partitions than node partitions Key: SPARK-1329 URL: https://issues.apache.org/jira/browse/SPARK-1329 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 0.9.0 Reporter: Daniel Darabos To reproduce, let's look at a graph with two nodes in one partition, and two edges between them split across two partitions: scala> val vs = sc.makeRDD(Seq(1L -> null, 2L -> null), 1) scala> val es = sc.makeRDD(Seq(graphx.Edge(1, 2, null), graphx.Edge(2, 1, null)), 2) scala> val g = graphx.Graph(vs, es) Everything seems fine, until GraphX needs to join the two RDDs: scala> g.triplets.collect [...] java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2$$anonfun$apply$3.apply(RoutingTable.scala:76) at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2$$anonfun$apply$3.apply(RoutingTable.scala:75) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2.apply(RoutingTable.scala:75) at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2.apply(RoutingTable.scala:73) at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450) at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:71) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:85) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102) at org.apache.spark.scheduler.Task.run(Task.scala:53) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) The bug is fairly obvious in RoutingTable.createPid2Vid() -- it creates an array of length vertices.partitions.size, and then looks up partition IDs from the edges.partitionsRDD in it. A graph usually has more edges than nodes, so it is natural to have more edge partitions than node partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
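A minimal sketch of the indexing mismatch Daniel describes (the sizes and names are illustrative, not the actual GraphX code):
{code}
val numVertexPartitions = 1
val numEdgePartitions = 2
// createPid2Vid sizes the array by the number of vertex partitions...
val pid2vid = Array.fill(numVertexPartitions)(List.empty[Long])
// ...but it is later indexed by an edge-partition ID, e.g. pid2vid(1), which
// throws ArrayIndexOutOfBoundsException whenever edge partitions outnumber
// vertex partitions; sizing the array by numEdgePartitions avoids the crash.
{code}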
[jira] [Created] (SPARK-1927) Implicits declared in companion objects not found in Spark shell
Piotr Kołaczkowski created SPARK-1927: - Summary: Implicits declared in companion objects not found in Spark shell Key: SPARK-1927 URL: https://issues.apache.org/jira/browse/SPARK-1927 Project: Spark Issue Type: Bug Affects Versions: 0.9.0 Environment: Ubuntu Linux 14.04, Oracle Java 7u55 Reporter: Piotr Kołaczkowski {code} scala> :paste // Entering paste mode (ctrl-D to finish) trait Mapper[T] class Foo object Foo { implicit object FooMapper extends Mapper[Foo] } // Exiting paste mode, now interpreting. defined trait Mapper defined class Foo defined module Foo scala> implicitly[Mapper[Foo]] <console>:28: error: could not find implicit value for parameter e: Mapper[Foo] implicitly[Mapper[Foo]] ^ {code} Exactly the same example in the official Scala REPL (2.10.4): {code} scala> :paste // Entering paste mode (ctrl-D to finish) trait Mapper[T] class Foo object Foo { implicit object FooMapper extends Mapper[Foo] } // Exiting paste mode, now interpreting. defined trait Mapper defined class Foo defined module Foo scala> implicitly[Mapper[Foo]] res0: Mapper[Foo] = Foo$FooMapper$@4a20e9c6 {code} I guess it might be another manifestation of the problem of everything being an inner object in the Spark REPL. -- This message was sent by Atlassian JIRA (v6.2#6252)
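A possible workaround in the Spark shell until this is fixed (an untested assumption, not from the report): import the companion's implicit explicitly so it is found in lexical scope rather than through companion-object search.
{code}
scala> import Foo.FooMapper
scala> implicitly[Mapper[Foo]]
{code}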
[jira] [Created] (SPARK-1929) DAGScheduler suspended by local task OOM
Zhen Peng created SPARK-1929: Summary: DAGScheduler suspended by local task OOM Key: SPARK-1929 URL: https://issues.apache.org/jira/browse/SPARK-1929 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Zhen Peng Fix For: 1.0.0 DAGScheduler does not handle local task OOM properly, and will wait for the job result forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1928) DAGScheduler suspended by local task OOM
[ https://issues.apache.org/jira/browse/SPARK-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008850#comment-14008850 ] Guoqiang Li commented on SPARK-1928: How to reproduce the issue? DAGScheduler suspended by local task OOM Key: SPARK-1928 URL: https://issues.apache.org/jira/browse/SPARK-1928 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Zhen Peng Fix For: 1.0.0 DAGScheduler does not handle local task OOM properly, and will wait for the job result forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
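One way to provoke a local-task OOM for testing (an assumption, not taken from the report): run in local mode with a small heap and allocate an oversized array inside a task.
{code}
// e.g. start the shell with a small heap (SPARK_MEM=64m in this era), then:
sc.parallelize(1 to 1, 1).map(_ => new Array[Long](Int.MaxValue)).collect()
// the task dies with OutOfMemoryError and, per this report, DAGScheduler
// never delivers a job result
{code}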
[jira] [Updated] (SPARK-1930) When the YARN containers occupy 8G of memory, the containers are killed
[ https://issues.apache.org/jira/browse/SPARK-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-1930: --- Description: When the containers occupy 8G of memory, the containers are killed. YARN NodeManager log: {code} 2014-05-23 13:35:30,776 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=4947,containerID=container_1400809535638_0015_01_05] is running beyond physical memory limits. Current usage: 8.6 GB of 8.5 GB physical memory used; 10.0 GB of 17.8 GB virtual memory used. Killing container. Dump of the process-tree for container_1400809535638_0015_01_05 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 4947 25417 4947 4947 (bash) 0 0 110804992 335 /bin/bash -c /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms8192m -Xmx8192m -Xss2m -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp -Dlog4j.configuration=log4j-spark-container.properties -Dspark.akka.askTimeout=120 -Dspark.akka.timeout=120 -Dspark.akka.frameSize=20 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 10dian72.domain.test 4 1 /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stdout 2 /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stderr |- 4957 4947 4947 4947 (java) 157809 12620 10667016192 2245522 /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms8192m -Xmx8192m -Xss2m -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp -Dlog4j.configuration=log4j-spark-container.properties -Dspark.akka.askTimeout=120 -Dspark.akka.timeout=120 -Dspark.akka.frameSize=20 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 10dian72.domain.test 4 2014-05-23 13:35:30,776 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Removed ProcessTree with root 4947 2014-05-23 13:35:30,776 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1400809535638_0015_01_05 transitioned from RUNNING to KILLING 2014-05-23 13:35:30,777 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1400809535638_0015_01_05 2014-05-23 13:35:30,788 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1400809535638_0015_01_05 is : 143 2014-05-23 13:35:30,829 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1400809535638_0015_01_05 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL 2014-05-23 13:35:30,830 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05 2014-05-23 13:35:30,830 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=spark OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1400809535638_0015 CONTAINERID=container_1400809535638_0015_01_05 2014-05-23 
13:35:30,830 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1400809535638_0015_01_05 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE 2014-05-23 13:35:30,830 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1400809535638_0015_01_05 from application application_1400809535638_0015 {code} I think it is related to {{YarnAllocationHandler.MEMORY_OVERHEAD}} https://github.com/apache/spark/blob/master/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala#L562 Relative to 8G, a fixed overhead of 384 MB is too small. was: When the containers occupy 8G of memory, the containers are killed. YARN NodeManager log: {code} 2014-05-23 13:00:23,856 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=42542,containerID=container_1400809535638_0013_01_15] is running beyond physical memory limits. Current usage: 8.5 GB of 8.5 GB physical memory used; 9.6 GB of 17.8 GB virtual memory used. Killing container. Dump of the process-tree for container_1400809535638_0013_01_15 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
[jira] [Updated] (SPARK-1930) Containers exceeding memory limits were killed
[ https://issues.apache.org/jira/browse/SPARK-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-1930: --- Summary: Containers exceeding memory limits were killed (was: When the YARN containers occupy 8G of memory, the containers are killed) Containers exceeding memory limits were killed -- Key: SPARK-1930 URL: https://issues.apache.org/jira/browse/SPARK-1930 Project: Spark Issue Type: Bug Components: YARN Reporter: Guoqiang Li When the containers occupy 8G of memory, the containers are killed. YARN NodeManager log: {code} 2014-05-23 13:35:30,776 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=4947,containerID=container_1400809535638_0015_01_05] is running beyond physical memory limits. Current usage: 8.6 GB of 8.5 GB physical memory used; 10.0 GB of 17.8 GB virtual memory used. Killing container. Dump of the process-tree for container_1400809535638_0015_01_05 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 4947 25417 4947 4947 (bash) 0 0 110804992 335 /bin/bash -c /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms8192m -Xmx8192m -Xss2m -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp -Dlog4j.configuration=log4j-spark-container.properties -Dspark.akka.askTimeout=120 -Dspark.akka.timeout=120 -Dspark.akka.frameSize=20 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 10dian72.domain.test 4 1 /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stdout 2 /var/log/hadoop-yarn/container/application_1400809535638_0015/container_1400809535638_0015_01_05/stderr |- 4957 4947 4947 4947 (java) 157809 12620 10667016192 2245522 /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms8192m -Xmx8192m -Xss2m -Djava.io.tmpdir=/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05/tmp -Dlog4j.configuration=log4j-spark-container.properties -Dspark.akka.askTimeout=120 -Dspark.akka.timeout=120 -Dspark.akka.frameSize=20 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sp...@10dian71.domain.test:45477/user/CoarseGrainedScheduler 3 10dian72.domain.test 4 2014-05-23 13:35:30,776 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Removed ProcessTree with root 4947 2014-05-23 13:35:30,776 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1400809535638_0015_01_05 transitioned from RUNNING to KILLING 2014-05-23 13:35:30,777 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1400809535638_0015_01_05 2014-05-23 13:35:30,788 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1400809535638_0015_01_05 is : 143 2014-05-23 13:35:30,829 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1400809535638_0015_01_05 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL 2014-05-23 13:35:30,830 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : 
/yarn/nm/usercache/spark/appcache/application_1400809535638_0015/container_1400809535638_0015_01_05 2014-05-23 13:35:30,830 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=spark OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1400809535638_0015 CONTAINERID=container_1400809535638_0015_01_05 2014-05-23 13:35:30,830 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1400809535638_0015_01_05 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE 2014-05-23 13:35:30,830 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1400809535638_0015_01_05 from application application_1400809535638_0015 {code} I think it is related to {{YarnAllocationHandler.MEMORY_OVERHEAD}} https://github.com/apache/spark/blob/master/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala#L562 Relative to 8G, a fixed overhead of 384 MB is too small. -- This message was sent by Atlassian JIRA (v6.2#6252)
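The allocation arithmetic behind the report, as a sketch (the constants come from the description; the variable names are illustrative):
{code}
val executorMemoryMb = 8192            // -Xmx8192m in the container command line
val memoryOverheadMb = 384             // the fixed overhead being criticized
val containerLimitMb = executorMemoryMb + memoryOverheadMb  // 8576 MB
// JVM native overhead (thread stacks, direct buffers, metaspace, etc.) can
// easily exceed 384 MB at this heap size, so the NodeManager sees usage
// beyond the limit and kills the container.
{code}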
[jira] [Updated] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables
[ https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-1931: -- Description: Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. This causes the following test to fail: {code} val g = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) assert(g.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) val gPart = g.partitionBy(EdgePartition2D) assert(gPart.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) {code} was: Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. This causes the following test to fail: {code:scala} val g = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) assert(g.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) val gPart = g.partitionBy(EdgePartition2D) assert(gPart.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) {code} Graph.partitionBy does not reconstruct routing tables - Key: SPARK-1931 URL: https://issues.apache.org/jira/browse/SPARK-1931 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.0 Reporter: Ankur Dave Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. This causes the following test to fail: {code} val g = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) assert(g.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) val gPart = g.partitionBy(EdgePartition2D) assert(gPart.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables
Ankur Dave created SPARK-1931: - Summary: Graph.partitionBy does not reconstruct routing tables Key: SPARK-1931 URL: https://issues.apache.org/jira/browse/SPARK-1931 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.0 Reporter: Ankur Dave Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. This causes the following test to fail: {code:scala} val g = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) assert(g.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) val gPart = g.partitionBy(EdgePartition2D) assert(gPart.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables
[ https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008985#comment-14008985 ] Ankur Dave commented on SPARK-1931: --- The fix is in PR #885: https://github.com/apache/spark/pull/885 Graph.partitionBy does not reconstruct routing tables - Key: SPARK-1931 URL: https://issues.apache.org/jira/browse/SPARK-1931 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.0 Reporter: Ankur Dave Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. This causes the following test to fail: {code} val g = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) assert(g.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) val gPart = g.partitionBy(EdgePartition2D) assert(gPart.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
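A possible interim workaround until the fix lands (an assumption, not taken from PR #885): rebuild the graph from the repartitioned vertices and edges, which forces the routing tables to be reconstructed.
{code}
val gPart = g.partitionBy(EdgePartition2D)
val gRebuilt = Graph(gPart.vertices, gPart.edges)  // fresh routing tables
{code}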
[jira] [Issue Comment Deleted] (SPARK-1750) EdgePartition is not serialized properly
[ https://issues.apache.org/jira/browse/SPARK-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-1750: -- Comment: was deleted (was: Resolved in PR #742: https://github.com/apache/spark/pull/742) EdgePartition is not serialized properly Key: SPARK-1750 URL: https://issues.apache.org/jira/browse/SPARK-1750 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Ankur Dave Fix For: 1.0.0 The GraphX design attempts to avoid moving edges across the network, instead shipping the vertices to the edge partitions. However, Spark sometimes needs to move the edges, such as for straggler mitigation. All EdgePartition fields are currently declared transient, so the edges will not be serialized properly. Even if they are not marked transient, Kryo is unable to serialize the EdgePartition, failing with the following error: {code} java.lang.IllegalArgumentException: Can not set final org.apache.spark.graphx.util.collection.PrimitiveKeyOpenHashMap field org.apache.spark.graphx.impl.EdgePartition.index to scala.collection.immutable.$colon$colon {code} A workaround is to discourage Spark from moving the edges by setting {{spark.locality.wait}} to a high value such as 10. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1750) EdgePartition is not serialized properly
[ https://issues.apache.org/jira/browse/SPARK-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-1750. --- Resolution: Fixed Fix Version/s: 1.0.0 https://github.com/apache/spark/pull/742 EdgePartition is not serialized properly Key: SPARK-1750 URL: https://issues.apache.org/jira/browse/SPARK-1750 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Ankur Dave Fix For: 1.0.0 The GraphX design attempts to avoid moving edges across the network, instead shipping the vertices to the edge partitions. However, Spark sometimes needs to move the edges, such as for straggler mitigation. All EdgePartition fields are currently declared transient, so the edges will not be serialized properly. Even if they are not marked transient, Kryo is unable to serialize the EdgePartition, failing with the following error: {code} java.lang.IllegalArgumentException: Can not set final org.apache.spark.graphx.util.collection.PrimitiveKeyOpenHashMap field org.apache.spark.graphx.impl.EdgePartition.index to scala.collection.immutable.$colon$colon {code} A workaround is to discourage Spark from moving the edges by setting {{spark.locality.wait}} to a high value such as 10. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1750) EdgePartition is not serialized properly
[ https://issues.apache.org/jira/browse/SPARK-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008988#comment-14008988 ] Ankur Dave commented on SPARK-1750: --- Resolved in PR #742: https://github.com/apache/spark/pull/742 EdgePartition is not serialized properly Key: SPARK-1750 URL: https://issues.apache.org/jira/browse/SPARK-1750 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Ankur Dave Fix For: 1.0.0 The GraphX design attempts to avoid moving edges across the network, instead shipping the vertices to the edge partitions. However, Spark sometimes needs to move the edges, such as for straggler mitigation. All EdgePartition fields are currently declared transient, so the edges will not be serialized properly. Even if they are not marked transient, Kryo is unable to serialize the EdgePartition, failing with the following error: {code} java.lang.IllegalArgumentException: Can not set final org.apache.spark.graphx.util.collection.PrimitiveKeyOpenHashMap field org.apache.spark.graphx.impl.EdgePartition.index to scala.collection.immutable.$colon$colon {code} A workaround is to discourage Spark from moving the edges by setting {{spark.locality.wait}} to a high value such as 10. -- This message was sent by Atlassian JIRA (v6.2#6252)
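Spelling out the workaround mentioned in the description (the wait value here is illustrative only):
{code}
import org.apache.spark.SparkConf

// A long locality wait discourages the scheduler from launching non-local
// tasks, which is what would force the edges to move between executors.
val conf = new SparkConf().set("spark.locality.wait", "100000")
{code}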
[jira] [Commented] (SPARK-1577) GraphX mapVertices with KryoSerialization
[ https://issues.apache.org/jira/browse/SPARK-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008992#comment-14008992 ] Ankur Dave commented on SPARK-1577: --- Resolved by re-enabling Kryo reference tracking in #742: https://github.com/apache/spark/pull/742 GraphX mapVertices with KryoSerialization - Key: SPARK-1577 URL: https://issues.apache.org/jira/browse/SPARK-1577 Project: Spark Issue Type: Bug Components: GraphX Reporter: Joseph E. Gonzalez If Kryo is enabled by setting: {code} SPARK_JAVA_OPTS+="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" SPARK_JAVA_OPTS+="-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator" {code} in conf/spark_env.conf and running the following block of code in the shell: {code} import org.apache.spark.graphx._ import org.apache.spark.graphx.lib._ import org.apache.spark.rdd.RDD val vertexArray = Array( (1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65)), (4L, ("David", 42)), (5L, ("Ed", 55)), (6L, ("Fran", 50)) ) val edgeArray = Array( Edge(2L, 1L, 7), Edge(2L, 4L, 2), Edge(3L, 2L, 4), Edge(3L, 6L, 3), Edge(4L, 1L, 1), Edge(5L, 2L, 2), Edge(5L, 3L, 8), Edge(5L, 6L, 3) ) val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray) val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray) val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD) // Define a class to more clearly model the user property case class User(name: String, age: Int, inDeg: Int, outDeg: Int) // Transform the graph val userGraph = graph.mapVertices{ case (id, (name, age)) => User(name, age, 0, 0) } {code} The following block of code works: {code} userGraph.vertices.count {code} and the following block of code generates a Kryo error: {code} userGraph.vertices.collect {code} The error: {code} java.lang.StackOverflowError at sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54) at sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38) at java.lang.reflect.Field.get(Field.java:379) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
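For background on the fix mentioned above, a standalone sketch of what reference tracking does in Kryo (not Spark's actual serializer code):
{code}
import com.esotericsoftware.kryo.Kryo

val kryo = new Kryo()
// With references disabled, shared or cyclic object graphs are re-serialized
// on every encounter, which can recurse until a StackOverflowError like the
// one above; enabling tracking writes back-references instead.
kryo.setReferences(true)
{code}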
[jira] [Commented] (SPARK-1329) ArrayIndexOutOfBoundsException if graphx.Graph has more edge partitions than node partitions
[ https://issues.apache.org/jira/browse/SPARK-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008994#comment-14008994 ] Ankur Dave commented on SPARK-1329: --- This was resolved by #368: https://github.com/apache/spark/pull/368 ArrayIndexOutOfBoundsException if graphx.Graph has more edge partitions than node partitions Key: SPARK-1329 URL: https://issues.apache.org/jira/browse/SPARK-1329 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 0.9.0 Reporter: Daniel Darabos To reproduce, let's look at a graph with two nodes in one partition, and two edges between them split across two partitions: scala> val vs = sc.makeRDD(Seq(1L -> null, 2L -> null), 1) scala> val es = sc.makeRDD(Seq(graphx.Edge(1, 2, null), graphx.Edge(2, 1, null)), 2) scala> val g = graphx.Graph(vs, es) Everything seems fine, until GraphX needs to join the two RDDs: scala> g.triplets.collect [...] java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2$$anonfun$apply$3.apply(RoutingTable.scala:76) at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2$$anonfun$apply$3.apply(RoutingTable.scala:75) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2.apply(RoutingTable.scala:75) at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2.apply(RoutingTable.scala:73) at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450) at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:71) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:85) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102) at org.apache.spark.scheduler.Task.run(Task.scala:53) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) The bug is fairly obvious in RoutingTable.createPid2Vid() -- it creates an array of length vertices.partitions.size, and then looks up partition IDs from the edges.partitionsRDD in it. A graph usually has more edges than nodes, so it is natural to have more edge partitions than node partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1311) Use map side distinct in collect vertex ids from edges graphx
[ https://issues.apache.org/jira/browse/SPARK-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008997#comment-14008997 ] Ankur Dave commented on SPARK-1311: --- This was resolved by PR #497, which removed collectVertexIds and instead performed the operation as a side effect of constructing the routing tables: https://github.com/apache/spark/pull/497/files#diff-8ea535724b3f014cfef17284b3e783feR397 Use map side distinct in collect vertex ids from edges graphx - Key: SPARK-1311 URL: https://issues.apache.org/jira/browse/SPARK-1311 Project: Spark Issue Type: Bug Components: GraphX Reporter: Holden Karau Priority: Minor See GRAPH-1 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables
[ https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1931: --- Assignee: Ankur Dave Graph.partitionBy does not reconstruct routing tables - Key: SPARK-1931 URL: https://issues.apache.org/jira/browse/SPARK-1931 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.0 Reporter: Ankur Dave Assignee: Ankur Dave Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. This causes the following test to fail: {code} val g = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) assert(g.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) val gPart = g.partitionBy(EdgePartition2D) assert(gPart.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1311) Use map side distinct in collect vertex ids from edges graphx
[ https://issues.apache.org/jira/browse/SPARK-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1311: --- Assignee: Ankur Dave Use map side distinct in collect vertex ids from edges graphx - Key: SPARK-1311 URL: https://issues.apache.org/jira/browse/SPARK-1311 Project: Spark Issue Type: Bug Components: GraphX Reporter: Holden Karau Assignee: Ankur Dave Priority: Minor See GRAPH-1 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1329) ArrayIndexOutOfBoundsException if graphx.Graph has more edge partitions than node partitions
[ https://issues.apache.org/jira/browse/SPARK-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-1329. Resolution: Fixed Fix Version/s: 1.0.0 Assignee: Ankur Dave ArrayIndexOutOfBoundsException if graphx.Graph has more edge partitions than node partitions Key: SPARK-1329 URL: https://issues.apache.org/jira/browse/SPARK-1329 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 0.9.0 Reporter: Daniel Darabos Assignee: Ankur Dave Fix For: 1.0.0 To reproduce, let's look at a graph with two nodes in one partition, and two edges between them split across two partitions: scala> val vs = sc.makeRDD(Seq(1L -> null, 2L -> null), 1) scala> val es = sc.makeRDD(Seq(graphx.Edge(1, 2, null), graphx.Edge(2, 1, null)), 2) scala> val g = graphx.Graph(vs, es) Everything seems fine, until GraphX needs to join the two RDDs: scala> g.triplets.collect [...] java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2$$anonfun$apply$3.apply(RoutingTable.scala:76) at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2$$anonfun$apply$3.apply(RoutingTable.scala:75) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2.apply(RoutingTable.scala:75) at org.apache.spark.graphx.impl.RoutingTable$$anonfun$2.apply(RoutingTable.scala:73) at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450) at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:71) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:85) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102) at org.apache.spark.scheduler.Task.run(Task.scala:53) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) The bug is fairly obvious in RoutingTable.createPid2Vid() -- it creates an array of length vertices.partitions.size, and then looks up partition IDs from the edges.partitionsRDD in it. A graph usually has more edges than nodes, so it is natural to have more edge partitions than node partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1577) GraphX mapVertices with KryoSerialization
[ https://issues.apache.org/jira/browse/SPARK-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-1577. Resolution: Fixed Fix Version/s: 1.0.0 GraphX mapVertices with KryoSerialization - Key: SPARK-1577 URL: https://issues.apache.org/jira/browse/SPARK-1577 Project: Spark Issue Type: Bug Components: GraphX Reporter: Joseph E. Gonzalez Assignee: Ankur Dave Fix For: 1.0.0 If Kryo is enabled by setting: {code} SPARK_JAVA_OPTS+="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" SPARK_JAVA_OPTS+="-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator" {code} in conf/spark_env.conf and running the following block of code in the shell: {code} import org.apache.spark.graphx._ import org.apache.spark.graphx.lib._ import org.apache.spark.rdd.RDD val vertexArray = Array( (1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65)), (4L, ("David", 42)), (5L, ("Ed", 55)), (6L, ("Fran", 50)) ) val edgeArray = Array( Edge(2L, 1L, 7), Edge(2L, 4L, 2), Edge(3L, 2L, 4), Edge(3L, 6L, 3), Edge(4L, 1L, 1), Edge(5L, 2L, 2), Edge(5L, 3L, 8), Edge(5L, 6L, 3) ) val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray) val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray) val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD) // Define a class to more clearly model the user property case class User(name: String, age: Int, inDeg: Int, outDeg: Int) // Transform the graph val userGraph = graph.mapVertices{ case (id, (name, age)) => User(name, age, 0, 0) } {code} The following block of code works: {code} userGraph.vertices.count {code} and the following block of code generates a Kryo error: {code} userGraph.vertices.collect {code} The error: {code} java.lang.StackOverflowError at sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54) at sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38) at java.lang.reflect.Field.get(Field.java:379) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1750) EdgePartition is not serialized properly
[ https://issues.apache.org/jira/browse/SPARK-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1750: --- Assignee: Joseph E. Gonzalez EdgePartition is not serialized properly Key: SPARK-1750 URL: https://issues.apache.org/jira/browse/SPARK-1750 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Ankur Dave Assignee: Joseph E. Gonzalez Fix For: 1.0.0 The GraphX design attempts to avoid moving edges across the network, instead shipping the vertices to the edge partitions. However, Spark sometimes needs to move the edges, such as for straggler mitigation. All EdgePartition fields are currently declared transient, so the edges will not be serialized properly. Even if they are not marked transient, Kryo is unable to serialize the EdgePartition, failing with the following error: {code} java.lang.IllegalArgumentException: Can not set final org.apache.spark.graphx.util.collection.PrimitiveKeyOpenHashMap field org.apache.spark.graphx.impl.EdgePartition.index to scala.collection.immutable.$colon$colon {code} A workaround is to discourage Spark from moving the edges by setting {{spark.locality.wait}} to a high value such as 10. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1311) Use map side distinct in collect vertex ids from edges graphx
[ https://issues.apache.org/jira/browse/SPARK-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-1311. Resolution: Fixed Fix Version/s: 1.0.0 Use map side distinct in collect vertex ids from edges graphx - Key: SPARK-1311 URL: https://issues.apache.org/jira/browse/SPARK-1311 Project: Spark Issue Type: Bug Components: GraphX Reporter: Holden Karau Assignee: Ankur Dave Priority: Minor Fix For: 1.0.0 See GRAPH-1 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1577) GraphX mapVertices with KryoSerialization
[ https://issues.apache.org/jira/browse/SPARK-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1577: --- Assignee: Ankur Dave GraphX mapVertices with KryoSerialization - Key: SPARK-1577 URL: https://issues.apache.org/jira/browse/SPARK-1577 Project: Spark Issue Type: Bug Components: GraphX Reporter: Joseph E. Gonzalez Assignee: Ankur Dave Fix For: 1.0.0 If Kryo is enabled by setting: {code} SPARK_JAVA_OPTS+="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" SPARK_JAVA_OPTS+="-Dspark.kryo.registrator=org.apache.spark.graphx.GraphKryoRegistrator" {code} in conf/spark_env.conf and running the following block of code in the shell: {code} import org.apache.spark.graphx._ import org.apache.spark.graphx.lib._ import org.apache.spark.rdd.RDD val vertexArray = Array( (1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65)), (4L, ("David", 42)), (5L, ("Ed", 55)), (6L, ("Fran", 50)) ) val edgeArray = Array( Edge(2L, 1L, 7), Edge(2L, 4L, 2), Edge(3L, 2L, 4), Edge(3L, 6L, 3), Edge(4L, 1L, 1), Edge(5L, 2L, 2), Edge(5L, 3L, 8), Edge(5L, 6L, 3) ) val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray) val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray) val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD) // Define a class to more clearly model the user property case class User(name: String, age: Int, inDeg: Int, outDeg: Int) // Transform the graph val userGraph = graph.mapVertices{ case (id, (name, age)) => User(name, age, 0, 0) } {code} The following block of code works: {code} userGraph.vertices.count {code} and the following block of code generates a Kryo error: {code} userGraph.vertices.collect {code} The error: {code} java.lang.StackOverflowError at sun.reflect.UnsafeFieldAccessorImpl.ensureObj(UnsafeFieldAccessorImpl.java:54) at sun.reflect.UnsafeQualifiedObjectFieldAccessorImpl.get(UnsafeQualifiedObjectFieldAccessorImpl.java:38) at java.lang.reflect.Field.get(Field.java:379) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:552) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1518: --- Assignee: Colin Patrick McCabe Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe FSDataOutputStream::sync() has disappeared from Hadoop trunk; FileLogger.scala still calls it. I've changed it locally to hsync() so I can compile the code, but haven't checked yet whether the two are equivalent. hsync() seems to have been there forever, so it hopefully works with all the versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
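A sketch of the substitution Marcelo describes (assuming a Hadoop 2-era client where {{hsync()}} is available on the stream):
{code}
import org.apache.hadoop.fs.FSDataOutputStream

// hsync() flushes client buffers and syncs to the datanodes, roughly the
// behavior the removed sync() provided for FileLogger.scala.
def flushLog(out: FSDataOutputStream): Unit = out.hsync()
{code}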
[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()
[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009145#comment-14009145 ] Patrick Wendell commented on SPARK-983: --- We are actually looking at this problem in a few different places in the code base (we did this already for the external aggregations, and we also have SPARK-1777). Relying on GCs to decide when to spill is an interesting approach, but I'd rather have control of the heuristics ourselves. I think you'd get thrashing behavior where a GC occurs and suddenly a million threads start writing to disk. In the past we've used a different mechanism (the size estimator) which approximates memory usage. It might make sense to introduce a simple memory allocation mechanism that is shared between the external aggregation maps, partition unrolling, etc. This is something where a design doc would be helpful. Support external sorting for RDD#sortByKey() Key: SPARK-983 URL: https://issues.apache.org/jira/browse/SPARK-983 Project: Spark Issue Type: New Feature Affects Versions: 0.9.0 Reporter: Reynold Xin Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a buffer to hold the entire partition, then sorts it. This will cause an OOM if an entire partition cannot fit in memory, which is especially problematic for skewed data. Rather than OOMing, the behavior should be similar to the [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala], where we fall back to disk if we detect memory pressure. -- This message was sent by Atlassian JIRA (v6.2#6252)
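A minimal sketch of the size-estimator style heuristic Patrick prefers (all names are assumptions for illustration):
{code}
// Spill based on an explicit estimate of in-memory size rather than reacting
// to GC events; the threshold stays under the caller's control.
def maybeSpill(estimatedBytes: Long, thresholdBytes: Long)(spill: () => Unit): Unit = {
  if (estimatedBytes >= thresholdBytes) spill()
}
{code}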
[jira] [Commented] (SPARK-1784) Add a partitioner which partitions an RDD with each partition having a specified # of keys
[ https://issues.apache.org/jira/browse/SPARK-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009184#comment-14009184 ] Patrick Wendell commented on SPARK-1784: --- I think this might be subsumed by the fix to SPARK-1770. https://github.com/apache/spark/pull/727/files Add a partitioner which partitions an RDD with each partition having a specified # of keys Key: SPARK-1784 URL: https://issues.apache.org/jira/browse/SPARK-1784 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 0.9.0 Reporter: Syed A. Hashmi Priority: Minor Fix For: 1.0.0 At times on mailing lists, I have seen people complaining about having no control over the # of keys per partition. RangePartitioner partitions keys into roughly equal-sized partitions, but in cases where the user wants full control over specifying an exact size, it is not possible today. -- This message was sent by Atlassian JIRA (v6.2#6252)
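A hypothetical partitioner in the spirit of the request (an illustration only, assuming dense non-negative Int keys in a known range):
{code}
import org.apache.spark.Partitioner

// Each partition holds at most keysPerPartition consecutive keys.
class FixedSizePartitioner(maxKey: Int, keysPerPartition: Int) extends Partitioner {
  override def numPartitions: Int = maxKey / keysPerPartition + 1
  override def getPartition(key: Any): Int = key.asInstanceOf[Int] / keysPerPartition
}
{code}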
[jira] [Resolved] (SPARK-1931) Graph.partitionBy does not reconstruct routing tables
[ https://issues.apache.org/jira/browse/SPARK-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-1931. --- Resolution: Fixed Fix Version/s: 1.0.0 Graph.partitionBy does not reconstruct routing tables - Key: SPARK-1931 URL: https://issues.apache.org/jira/browse/SPARK-1931 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.0 Reporter: Ankur Dave Assignee: Ankur Dave Fix For: 1.0.0 Commit 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. This causes the following test to fail: {code} val g = Graph( sc.parallelize(List((0L, "a"), (1L, "b"), (2L, "c"))), sc.parallelize(List(Edge(0L, 1L, 1), Edge(0L, 2L, 1)), 2)) assert(g.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) val gPart = g.partitionBy(EdgePartition2D) assert(gPart.triplets.collect.map(_.toTuple).toSet === Set(((0L, "a"), (1L, "b"), 1), ((0L, "a"), (2L, "c"), 1))) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1925) Typo in org.apache.spark.mllib.tree.DecisionTree.isSampleValid
[ https://issues.apache.org/jira/browse/SPARK-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-1925. - Resolution: Fixed Fix Version/s: 1.0.0 Typo in org.apache.spark.mllib.tree.DecisionTree.isSampleValid -- Key: SPARK-1925 URL: https://issues.apache.org/jira/browse/SPARK-1925 Project: Spark Issue Type: Bug Components: MLlib Reporter: Shixiong Zhu Priority: Minor Labels: easyfix Fix For: 1.0.0 I believe this is a typo: {code} if ((level > 0) & (parentFilters.length == 0)) { return false } {code} Should use {{&&}} here. -- This message was sent by Atlassian JIRA (v6.2#6252)
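For context: on Booleans, Scala's {{&}} evaluates both operands while {{&&}} short-circuits. A small standalone illustration (hypothetical values, not the DecisionTree code):
{code}
def expensiveCheck(): Boolean = { println("evaluated"); true }
val level = 0
val a = (level > 0) & expensiveCheck()   // prints "evaluated"; a == false
val b = (level > 0) && expensiveCheck()  // prints nothing; b == false
{code}
The result is the same either way here, so the typo costs an unnecessary evaluation rather than changing behavior.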
[jira] [Created] (SPARK-1932) Race conditions in BlockManager.cachedPeers and ConnectionManager.onReceiveCallback
Shixiong Zhu created SPARK-1932: --- Summary: Race conditions in BlockManager.cachedPeers and ConnectionManager.onReceiveCallback Key: SPARK-1932 URL: https://issues.apache.org/jira/browse/SPARK-1932 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Shixiong Zhu Priority: Minor BlockManager.cachedPeers and ConnectionManager.onReceiveCallback are read and written in different threads without proper protection. -- This message was sent by Atlassian JIRA (v6.2#6252)
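A common pattern for guarding such a lazily computed, shared field, sketched generically (not the actual patch):
{code}
// Hypothetical double-checked caching of a shared field. @volatile makes the
// write visible to other threads; synchronized prevents duplicate computation.
class PeerCache(fetchPeers: () => Seq[String]) {
  @volatile private var cachedPeers: Seq[String] = _
  def get: Seq[String] = {
    if (cachedPeers == null) {
      synchronized {
        if (cachedPeers == null) cachedPeers = fetchPeers()
      }
    }
    cachedPeers
  }
}
{code}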
[jira] [Created] (SPARK-1933) FileNotFoundException when a directory is passed to SparkContext.addJar/addFile
Reynold Xin created SPARK-1933: -- Summary: FileNotFoundException when a directory is passed to SparkContext.addJar/addFile Key: SPARK-1933 URL: https://issues.apache.org/jira/browse/SPARK-1933 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.0.1 When SparkContext.addJar/addFile is used to add a directory (which is not supported), the runtime exception is {code} java.io.FileNotFoundException: [file] (No such file or directory) {code} This exception is extremely confusing because the directory does exist. We should throw a more meaningful exception when a directory is passed to addJar/addFile. -- This message was sent by Atlassian JIRA (v6.2#6252)
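One possible shape for the improved check (a hypothetical sketch; see the pull request referenced in the next comment for the actual change):
{code}
import java.io.File

// Hypothetical guard: fail fast with a clear message instead of surfacing a
// confusing FileNotFoundException later.
def validateAddedPath(path: String): Unit = {
  val file = new File(path)
  if (file.isDirectory) {
    throw new IllegalArgumentException(
      s"Directory '$path' is not supported; addJar/addFile expects a file.")
  }
}
{code}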
[jira] [Commented] (SPARK-1933) FileNotFoundException when a directory is passed to SparkContext.addJar/addFile
[ https://issues.apache.org/jira/browse/SPARK-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14009239#comment-14009239 ] Reynold Xin commented on SPARK-1933: Pull request added https://github.com/apache/spark/pull/888 FileNotFoundException when a directory is passed to SparkContext.addJar/addFile --- Key: SPARK-1933 URL: https://issues.apache.org/jira/browse/SPARK-1933 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.0.1 When SparkContext.addJar/addFile is used to add a directory (which is not supported), the runtime exception is {code} java.io.FileNotFoundException: [file] (No such file or directory) {code} This exception is extremely confusing because the directory does exist. We should throw a more meaningful exception when a directory is passed to addJar/addFile. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1935) Set commons-codec 1.4 as a dependency
Yin Huai created SPARK-1935: --- Summary: Set commons-codec 1.4 as a dependency Key: SPARK-1935 URL: https://issues.apache.org/jira/browse/SPARK-1935 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1 Reporter: Yin Huai Priority: Minor Right now, commons-codec is a transitive dependency. When Spark is built with Maven for Hadoop 1, jets3t 0.7.1 pulls in commons-codec 1.3, an older version (Hadoop 1.0.4 depends on 1.4). This older version can cause problems because 1.4 introduces incompatible changes and new methods. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1935) Explicitly add commons-codec 1.4 as a dependency
[ https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14009283#comment-14009283 ] Patrick Wendell edited comment on SPARK-1935 at 5/27/14 5:11 AM: - Does commons-codec 1.4 really break compatibility with commons-codec 1.3? Or is the issue just that Hadoop is compiled against 1.4 but Maven is selecting 1.3, so the new 1.4 functions aren't available? Also, would you mind giving the exact permutation of the build that is causing this error? I just want to see if it's also a problem in Spark 1.0 or if it's only in 0.9. was (Author: pwendell): Does commons-codec 1.4 really break compatibility with commons-codec 1.3? Or is the issue just that Hadoop is compiled against 1.4 but Maven is selecting 1.3, so the new 1.4 functions aren't available? Explicitly add commons-codec 1.4 as a dependency Key: SPARK-1935 URL: https://issues.apache.org/jira/browse/SPARK-1935 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1 Reporter: Yin Huai Priority: Minor Right now, commons-codec is a transitive dependency. When Spark is built with Maven for Hadoop 1, jets3t 0.7.1 pulls in commons-codec 1.3, an older version (Hadoop 1.0.4 depends on 1.4). This older version can cause problems because 1.4 introduces incompatible changes and new methods. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1935) Explicitly add commons-codec 1.4 as a dependency
[ https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14009284#comment-14009284 ] Reynold Xin commented on SPARK-1935: If you build Spark with Maven, commons-codec 1.3 is included. If you build Spark with SBT, commons-codec 1.4 is included. Hive uses {{Base64.decodeBase64(String)}}, which was introduced in 1.4. Explicitly add commons-codec 1.4 as a dependency Key: SPARK-1935 URL: https://issues.apache.org/jira/browse/SPARK-1935 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1 Reporter: Yin Huai Priority: Minor Right now, commons-codec is a transitive dependency. When Spark is built with Maven for Hadoop 1, jets3t 0.7.1 pulls in commons-codec 1.3, an older version (Hadoop 1.0.4 depends on 1.4). This older version can cause problems because 1.4 introduces incompatible changes and new methods. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing
[ https://issues.apache.org/jira/browse/SPARK-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14009293#comment-14009293 ] Aaron Davidson commented on SPARK-1855: --- I agree that significant improvements can be made to Spark's block replication model, but there's no reason it shouldn't work (albeit with potentially poor write performance and fewer guarantees than one would like) if you increase the replication level above 2, which is possible using [StorageLevel#apply|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L155]. Provide memory-and-local-disk RDD checkpointing --- Key: SPARK-1855 URL: https://issues.apache.org/jira/browse/SPARK-1855 Project: Spark Issue Type: New Feature Components: MLlib, Spark Core Affects Versions: 1.0.0 Reporter: Xiangrui Meng Checkpointing is used to cut long lineage while maintaining fault tolerance. The current implementation is HDFS-based. Using the BlockRDD we can create in-memory-and-local-disk (with replication) checkpoints that are not as reliable as the HDFS-based solution, but faster. It can help applications that require many iterations. -- This message was sent by Atlassian JIRA (v6.2#6252)
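For instance, a three-way-replicated memory-and-disk level can be requested like this (a sketch assuming the four-argument {{StorageLevel.apply}} overload linked above and a SparkContext {{sc}} as in the Spark shell):
{code}
import org.apache.spark.storage.StorageLevel

// Arguments to this StorageLevel.apply overload:
// useDisk, useMemory, deserialized, replication.
val memoryAndDisk3x = StorageLevel(true, true, true, 3)
sc.parallelize(1 to 100).persist(memoryAndDisk3x)
{code}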
[jira] [Commented] (SPARK-1935) Explicitly add commons-codec 1.4 as a dependency
[ https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14009295#comment-14009295 ] Yin Huai commented on SPARK-1935: - Thanks, [~rxin]. Let me add more info. Commands I used: {code} mvn clean -DskipTests clean package {code} {code} sbt/sbt assembly {code} You can also check the pre-built Hadoop 1 package, which includes the 1.3 codec. There are a few methods in the {{Base64}} class that were introduced in 1.4 (http://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Base64.html). I noticed the problem when Hive called {code} public static byte[] decodeBase64(String base64String) {code} Explicitly add commons-codec 1.4 as a dependency Key: SPARK-1935 URL: https://issues.apache.org/jira/browse/SPARK-1935 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1 Reporter: Yin Huai Priority: Minor Right now, commons-codec is a transitive dependency. When Spark is built with Maven for Hadoop 1, jets3t 0.7.1 pulls in commons-codec 1.3, an older version (Hadoop 1.0.4 depends on 1.4). This older version can cause problems because 1.4 introduces incompatible changes and new methods. -- This message was sent by Atlassian JIRA (v6.2#6252)
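The incompatibility is easy to trigger directly: the call below compiles against commons-codec 1.4 but fails with a {{NoSuchMethodError}} at runtime when the 1.3 jar is on the classpath (a hypothetical reproduction):
{code}
import org.apache.commons.codec.binary.Base64

// Base64.decodeBase64(String) exists in commons-codec 1.4 but not in 1.3;
// 1.3 only offers decodeBase64(byte[]).
val decoded: Array[Byte] = Base64.decodeBase64("U3Bhcms=")
println(new String(decoded, "UTF-8"))  // prints "Spark"
{code}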