[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289677#comment-14289677 ]

Alexander Ulanov commented on SPARK-5386:
-----------------------------------------

My spark-env.sh contains:

export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_INSTANCES=2

I run spark-shell with ./spark-shell --executor-memory 8G --driver-memory 8G. In the Spark UI each worker has 8GB of memory. By the way, I ran this code once again and this time it does not crash; instead it keeps trying to schedule the job on the failing node, which tries to allocate memory, fails, and so on. Is this normal behavior?

> Reduce fails with vectors of big length
> ---------------------------------------
>
>                 Key: SPARK-5386
>                 URL: https://issues.apache.org/jira/browse/SPARK-5386
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.0
>         Environment: Overall: 6-machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each machine runs 2 Workers
> Spark: ./spark-shell --executor-memory 8G --driver-memory 8G
> spark.driver.maxResultSize 0
> java.io.tmpdir and spark.local.dir set to a disk with a lot of free space
>            Reporter: Alexander Ulanov
>             Fix For: 1.3.0
>
> Code:
> import org.apache.spark.mllib.rdd.RDDFunctions._
> import breeze.linalg._
> import org.apache.log4j._
> Logger.getRootLogger.setLevel(Level.OFF)
> val n = 60000000
> val p = 12
> val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n))
> vv.reduce(_ + _)
> When executing in the shell it crashes after some period of time. One of the nodes contains the following in stdout:
> Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot allocate memory' (errno=12)
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (malloc) failed to allocate 2863661056 bytes for committing reserved memory.
> # An error report file with more information is saved as:
> # /datac/spark/app-20150123091936-/89/hs_err_pid2247.log
> During the execution there is a message: Job aborted due to stage failure: Exception while getting task result: java.io.IOException: Connection from server-12.net/10.10.10.10:54701 closed

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
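[Editor's note] Taking the thread's numbers at face value (60M Doubles per vector, 2 workers with 2 cores each per 16GB machine), a back-of-envelope estimate suggests why the 8G heaps come under pressure. The 3-copies-per-task factor below is an assumption about reduce-time copies, not a measured figure:

```scala
// Back-of-envelope sizing for this workload. All numbers are estimates;
// the "3 live copies per task" factor is an assumption, not measured.
val n = 60000000L                 // elements per DenseVector[Double]
val bytesPerVector = n * 8L       // one Double = 8 bytes -> 480 MB
val tasksPerMachine = 4L          // 2 worker instances x 2 cores each
// During a reduce, each task may hold an input vector, an accumulator,
// and a freshly allocated result at once -> roughly 3 live copies.
val peakPerMachine = tasksPerMachine * 3L * bytesPerVector
println(f"per vector:   ${bytesPerVector / 1e9}%.2f GB")  // ~0.48 GB
println(f"peak/machine: ${peakPerMachine / 1e9}%.2f GB")  // ~5.76 GB
```

Under these assumptions a single machine can transiently need several gigabytes of contiguous double arrays on top of Spark's own overhead, which is consistent with the native allocation failure in the report.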
[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289607#comment-14289607 ]

Sean Owen commented on SPARK-5386:
----------------------------------

Yes, you're creating ~5GB vectors and have at least 2 in memory at once. The error actually indicates your machine doesn't even have enough memory to store that much contiguously, let alone the Java heap. What's the Spark-specific issue?
[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289633#comment-14289633 ]

Sean Owen commented on SPARK-5386:
----------------------------------

Are you allocating 8G for the executors, or just the workers? Standalone mode? Someone who knows a little more might be able to confirm or deny, but I think you're hitting trouble allocating such a large chunk of memory at once. It may be that there is enough heap but not all in one place, since making a huge dense vector means allocating a huge contiguous array of doubles. Or it could simply be a really abrupt out-of-memory condition, because in fact it's holding several of these vectors in memory at once and running out.
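[Editor's note] Sean's point about holding several vectors at once can be illustrated with plain arrays (a hypothetical stand-in for the Breeze vectors; Breeze itself is not used here). A reduce whose operator allocates a fresh result keeps more full-size buffers live than one that mutates its accumulator in place:

```scala
// Illustration only: `_ + _` style reduction vs in-place accumulation.
// Plain Array[Double] stands in for breeze DenseVector[Double].
def addFresh(a: Array[Double], b: Array[Double]): Array[Double] = {
  val out = new Array[Double](a.length)  // extra full-size allocation
  var i = 0
  while (i < a.length) { out(i) = a(i) + b(i); i += 1 }
  out
}
def addInPlace(acc: Array[Double], b: Array[Double]): Array[Double] = {
  var i = 0
  while (i < acc.length) { acc(i) += b(i); i += 1 }  // reuses acc's buffer
  acc
}
val vs = Seq.fill(4)(Array.fill(5)(1.0))
val r1 = vs.reduce(addFresh)                 // fresh array per step
val r2 = vs.map(_.clone).reduce(addInPlace)  // mutates the accumulator
```

Note the in-place variant must only mutate data it owns (hence the `clone` before reducing); a mutating operator is unsafe in a Spark reduce unless the inputs are known not to be reused.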
[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289679#comment-14289679 ]

Shivaram Venkataraman commented on SPARK-5386:
----------------------------------------------

A couple of things might be worth inspecting:
1. It might be interesting to see whether this is a problem in `reduce` or in the `map` stage, i.e. does running a `count` after the `parallelize` work?
2. The error message indicates a request for around 2.8G of memory, which seems to indicate that a bunch of these vectors are being created at once. It'd be interesting to see what happens when, say, p = 2 in your script.
[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289621#comment-14289621 ]

Alexander Ulanov commented on SPARK-5386:
-----------------------------------------

I allocate 8G for the driver and for each worker. Could you suggest why that is not enough to handle a reduce operation on 60M-element vectors of Doubles?
[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289731#comment-14289731 ]

Shivaram Venkataraman commented on SPARK-5386:
----------------------------------------------

Note that having 2 worker instances and 2 cores per worker makes it 4 tasks per machine. And if `count` works and `reduce` fails, then it looks like it has something to do with allocating the extra vectors that hold the result in each partition ([1]) etc. I don't know much about the Scala implementation of reduceLeft, or about ways to trace down where the memory allocations are coming from.

[1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L865
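[Editor's note] The per-partition reduceLeft that [1] leads into can be modelled locally. This hypothetical stand-in (plain arrays again, not Breeze, and not Spark's actual code) simply counts how many full-size buffers an allocating `+` operator produces within one partition:

```scala
// Local model of the per-partition reduceLeft step, counting how many
// full-size buffers an allocating operator creates. Illustration only.
var allocations = 0
def add(a: Array[Double], b: Array[Double]): Array[Double] = {
  allocations += 1  // each reduce step builds a fresh result array
  Array.tabulate(a.length)(i => a(i) + b(i))
}
val partition = Seq.fill(3)(Array.fill(4)(1.0))
val reduced = partition.reduceLeft(add)
// 3 inputs -> 2 intermediate results; at the peak, the accumulator,
// the next input, and the fresh result are all live simultaneously.
```

With ~480MB vectors, even a small number of simultaneously live buffers per task, times 4 tasks per machine, adds up quickly.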
[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289708#comment-14289708 ]

Alexander Ulanov commented on SPARK-5386:
-----------------------------------------

Thank you for the suggestions.
1. count() does work; it returns 12.
2. It failed with p = 2. However, in some of my previous experiments it did not fail even for p up to 5 or 7 (in different runs).
[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289880#comment-14289880 ]

Alexander Ulanov commented on SPARK-5386:
-----------------------------------------

Thank you, that might be the problem. I tried running GC before each operation, but it did not help. Probably it takes a lot of memory to initialize a Breeze DenseVector. Assuming the problem is insufficient memory on the Worker node, I am curious what will happen on the Driver: will it receive 12 vectors of 60M Doubles each and then do the aggregation? Is that feasible? (P.S. I know there is a treeReduce function that forces partial aggregation on the Workers. However, for a big number of Workers the problem will remain in treeReduce as well, as far as I understand.)
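[Editor's note] The tree-shaped partial aggregation that treeReduce performs across executors can be sketched with a local simulation. This is only a model of the combining pattern, not Spark's implementation; with an associative operator, pairwise combining until one value remains means no single node has to merge all p partial results at once:

```scala
// Local sketch of tree-shaped aggregation (the pattern behind
// treeReduce), not Spark's actual code. Combines pairwise per level.
def treeCombine[A](xs: Seq[A])(op: (A, A) => A): A = {
  require(xs.nonEmpty, "cannot combine an empty collection")
  xs match {
    case Seq(only) => only
    case _ =>
      val next = xs.grouped(2).map {
        case Seq(a, b) => op(a, b)  // combine a pair at this level
        case Seq(a)    => a         // odd element carries over
      }.toSeq
      treeCombine(next)(op)
  }
}
val parts = (1 to 12).map(_.toDouble)  // 12 partial results, as in the report
val total = treeCombine(parts)(_ + _)  // same value as parts.sum
```

As Alexander notes, this reduces fan-in at each step but does not change per-element memory cost; each level still needs two operands and a result live per combine.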
[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length
[ https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289957#comment-14289957 ]

Shivaram Venkataraman commented on SPARK-5386:
----------------------------------------------

Results are merged on the driver one at a time. You can see the merge function that is called right below, at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L873. However, I don't know if there is anything that limits the rate at which results are fetched.
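[Editor's note] The driver-side merge described above can be modelled locally. Assuming task results are folded into a single accumulator as they arrive (a sketch of the pattern, not the actual RDD code), only the accumulator and the newest result need to be live at once, fetch buffering aside:

```scala
// Local model of driver-side sequential merging: fold each incoming
// task result into one accumulator. Illustration only.
def driverMerge(results: Iterator[Array[Double]]): Option[Array[Double]] =
  if (!results.hasNext) None
  else {
    val first = results.next()
    Some(results.foldLeft(first) { (acc, r) =>
      var i = 0
      while (i < acc.length) { acc(i) += r(i); i += 1 }  // merge in place
      acc
    })
  }

// 12 task results of 60M-Double vectors in the report; tiny arrays here.
val taskResults = Iterator.fill(12)(Array.fill(3)(1.0))
val merged = driverMerge(taskResults).get  // each element sums to 12.0
```

So in principle the driver only needs memory for about two vectors at a time, although, as Shivaram notes, whether anything throttles how many results are fetched concurrently is a separate question.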