[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289677#comment-14289677
 ] 

Alexander Ulanov commented on SPARK-5386:
-

My spark-env.sh contains:
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_INSTANCES=2
I run spark-shell with ./spark-shell --executor-memory 8G --driver-memory 8G. 
In Spark-UI each worker has 8GB of memory. 

Btw, I ran this code once again, and this time it does not crash but keeps 
trying to schedule the job on the failing node, which tries to allocate memory, 
fails, and so on. Is this normal behavior?

 Reduce fails with vectors of big length
 ---

 Key: SPARK-5386
 URL: https://issues.apache.org/jira/browse/SPARK-5386
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
 Environment: Overall:
 6 machine cluster (Xeon 3.3GHz 4 cores, 16GB RAM, Ubuntu), each runs 2 Workers
 Spark:
 ./spark-shell --executor-memory 8G --driver-memory 8G
 spark.driver.maxResultSize 0
 java.io.tmpdir and spark.local.dir set to a disk with a lot of free space
Reporter: Alexander Ulanov
 Fix For: 1.3.0


 Code:
 import org.apache.spark.mllib.rdd.RDDFunctions._
 import breeze.linalg._
 import org.apache.log4j._
 Logger.getRootLogger.setLevel(Level.OFF)
 val n = 60000000
 val p = 12
 val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n))
 vv.reduce(_ + _)
 When executing it in the shell, it crashes after some period of time. One of the nodes 
 contains the following in stdout:
 Java HotSpot(TM) 64-Bit Server VM warning: INFO: 
 os::commit_memory(0x00075550, 2863661056, 0) failed; error='Cannot 
 allocate memory' (errno=12)
 #
 # There is insufficient memory for the Java Runtime Environment to continue.
 # Native memory allocation (malloc) failed to allocate 2863661056 bytes for 
 committing reserved memory.
 # An error report file with more information is saved as:
 # /datac/spark/app-20150123091936-/89/hs_err_pid2247.log
 During the execution there is a message: Job aborted due to stage failure: 
 Exception while getting task result: java.io.IOException: Connection from 
 server-12.net/10.10.10.10:54701 closed






[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289607#comment-14289607
 ] 

Sean Owen commented on SPARK-5386:
--

Yes, you're creating ~5GB vectors and have at least 2 in memory at once. The 
error actually indicates your machine doesn't even have enough memory to store 
that much contiguously, let alone the Java heap. What's the Spark-specific 
issue?




[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289633#comment-14289633
 ] 

Sean Owen commented on SPARK-5386:
--

You are allocating 8G for the executors, or just the workers? Standalone mode?

Someone who knows a little more might be able to confirm or deny, but I think 
you're hitting trouble allocating such a large chunk of memory at once. It may 
be that there is enough heap but not all in one place, and making a huge dense 
vector means allocating a huge array of doubles. Or it could simply be a really 
abrupt out-of-memory condition, because in fact it's holding several of these 
vectors in memory at once and running out.
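
For scale, a rough sketch of the sizes involved (taking n = 60,000,000, as the 
reporter's "60M vector of Double" later in the thread suggests; back-of-the-envelope 
figures, not measurements):

    // Assumed length, per the "60M vector of Double" mentioned below
    val n = 60000000
    val bytesPerVector = n * 8L          // one contiguous Array[Double] backs each DenseVector
    println(s"~${bytesPerVector / (1024 * 1024)} MiB per vector, in a single block")  // ~457 MiB
    // Each task materializes one such vector and also has to serialize it to ship
    // the result back, so a task can transiently need roughly twice that amount.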




[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289679#comment-14289679
 ] 

Shivaram Venkataraman commented on SPARK-5386:
--

A couple of things might be worth inspecting:

1. It might be interesting to see whether this is a problem in `reduce` or in the 
`map` stage, i.e. does running a count after the parallelize work?

2. The error message indicates a request for around 2.3G of memory, which seems to 
suggest that a bunch of these vectors are being created at once. It would be 
interesting to see what happens with, say, p = 2 in your script (see the sketch below).
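
A minimal spark-shell sketch of those two checks (illustrative only; it assumes the 
same imports, n and p from the listing in the report):

    // 1. Exercise only the parallelize + map stage; if this already fails,
    //    the problem is not specific to reduce.
    val vv = sc.parallelize(0 until p, p).map(i => DenseVector.rand[Double](n))
    vv.count()                                   // should return p

    // 2. Repeat with only two partitions, so far fewer vectors exist at once.
    val vv2 = sc.parallelize(0 until 2, 2).map(i => DenseVector.rand[Double](n))
    vv2.reduce(_ + _)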




[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289621#comment-14289621
 ] 

Alexander Ulanov commented on SPARK-5386:
-

I allocate 8G for the driver and for each worker. Could you suggest why that is not 
enough to handle a reduce operation on vectors of 60M Doubles?
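
One way to do the accounting, as a sketch based only on the settings quoted at the 
top of this thread (it assumes each of the two worker instances launches one 8G 
executor for the shell):

    // Per 16GB machine, from spark-env.sh and --executor-memory above:
    val physicalRamGb   = 16          // per the environment description
    val executorHeapGb  = 8           // --executor-memory 8G
    val executorsPerBox = 2           // SPARK_WORKER_INSTANCES=2, one executor each
    val jvmHeapGb       = executorHeapGb * executorsPerBox   // 16 GB of heap alone
    // The requested heaps already equal physical RAM, before the OS, the JVMs'
    // own overhead and any serialization buffers are counted. That would match
    // the native "Cannot allocate memory" (errno=12) failure rather than a Java
    // OutOfMemoryError: the operating system, not the heap, runs out.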




[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289731#comment-14289731
 ] 

Shivaram Venkataraman commented on SPARK-5386:
--

Note that having 2 worker instances and 2 cores per worker makes it 4 tasks per 
machine. And if the `count` works and `reduce` fails, then it looks like it has 
something to do with allocating extra vectors to hold the result in each partition 
([1]), etc. I don't know much about the Scala implementation of reduceLeft or about 
ways to track down where the memory allocations are coming from.

[1] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L865
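
If the extra result vectors are the culprit, one hedged variant (not verified on this 
cluster) is to fold with Breeze's in-place +=, so each merge reuses the left operand's 
backing array instead of allocating a fresh full-size vector. Assuming the vv from the 
report:

    import breeze.linalg._

    // Mutating the left operand is acceptable here because vv is not cached and
    // the intermediate partial sums are not used anywhere else.
    val total = vv.reduce { (a, b) => a += b; a }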




[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289708#comment-14289708
 ] 

Alexander Ulanov commented on SPARK-5386:
-

Thank you for the suggestions.
1. count() does work; it returns 12.
2. It failed with p = 2. However, in some of my previous experiments it did not 
fail even for p up to 5 or 7 (in different runs).




[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289880#comment-14289880
 ] 

Alexander Ulanov commented on SPARK-5386:
-

Thank you, that might be the problem. I tried running GC before each operation, but 
it did not help. It probably takes a lot of memory to initialize the Breeze 
DenseVector. Assuming the problem is due to insufficient memory on the Worker node, 
I am curious what will happen on the Driver. Will it receive 12 vectors of 60M 
Doubles each and then do the aggregation? Is that feasible? (P.S. I know that there 
is a treeReduce function that forces partial aggregation on the Workers. However, 
for a big number of Workers the problem will remain in treeReduce as well, as far 
as I understand.)
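
For reference, treeReduce is already importable through the RDDFunctions import in 
the report; a sketch of the call on the same vv (depth is tunable, and as noted it 
reduces how many full vectors reach the driver at once rather than the per-executor 
allocation pressure):

    import org.apache.spark.mllib.rdd.RDDFunctions._   // adds treeReduce to RDDs in Spark 1.2

    // Combines partial sums on the executors in rounds before the driver sees
    // anything, so the driver receives only a few already-merged vectors.
    val total = vv.treeReduce((a, b) => a + b, depth = 2)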




[jira] [Commented] (SPARK-5386) Reduce fails with vectors of big length

2015-01-23 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289957#comment-14289957
 ] 

Shivaram Venkataraman commented on SPARK-5386:
--

Results are merged on the driver one at a time. You can see the merge function 
that is called right below, at 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L873

However, I don't know if there is anything that limits the rate at which results 
are fetched.
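
A simplified model of that driver-side merge (a sketch for intuition only, not the 
actual Spark code at the link above):

    import breeze.linalg._

    // One task result per partition arrives at the driver as a full vector and
    // is folded into the running result, one at a time.
    var jobResult: Option[DenseVector[Double]] = None
    def mergeResult(taskResult: DenseVector[Double]): Unit = {
      jobResult = jobResult.map(_ + taskResult).orElse(Some(taskResult))
    }
    // With p partitions, up to p such results may be fetched and buffered before
    // they are merged; nothing in this sketch throttles that, which mirrors the
    // open question above about rate limiting.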

