Re: Python vs Scala performance
Interesting thread Marius, Btw, I'm curious about your cluster size. How small it is in terms of ram and cores. Arian 2014-10-22 13:17 GMT+01:00 Nicholas Chammas nicholas.cham...@gmail.com: Total guess without knowing anything about your code: Do either of these two notes from the 1.1.0 release notes http://spark.apache.org/releases/spark-release-1-1-0.html affect things at all? - PySpark now performs external spilling during aggregations. Old behavior can be restored by setting spark.shuffle.spill to false. - PySpark uses a new heuristic for determining the parallelism of shuffle operations. Old behavior can be restored by setting spark.default.parallelism to the number of cores in the cluster. Nick On Wed, Oct 22, 2014 at 7:29 AM, Marius Soutier mps@gmail.com wrote: We’re using 1.1.0. Yes I expected Scala to be maybe twice as fast, but not that... On 22.10.2014, at 13:02, Nicholas Chammas nicholas.cham...@gmail.com wrote: What version of Spark are you running? Some recent changes https://spark.apache.org/releases/spark-release-1-1-0.html to how PySpark works relative to Scala Spark may explain things. PySpark should not be that much slower, not by a stretch. On Wed, Oct 22, 2014 at 6:11 AM, Ashic Mahtab as...@live.com wrote: I'm no expert, but looked into how the python bits work a while back (was trying to assess what it would take to add F# support). It seems python hosts a jvm inside of it, and talks to scala spark in that jvm. The python server bit translates the python calls to those in the jvm. The python spark context is like an adapter to the jvm spark context. If you're seeing performance discrepancies, this might be the reason why. If the code can be organised to require fewer interactions with the adapter, that may improve things. Take this with a pinch of salt...I might be way off on this :) Cheers, Ashic. From: mps@gmail.com Subject: Python vs Scala performance Date: Wed, 22 Oct 2014 12:00:41 +0200 To: user@spark.apache.org Hi there, we have a small Spark cluster running and are processing around 40 GB of Gzip-compressed JSON data per day. I have written a couple of word count-like Scala jobs that essentially pull in all the data, do some joins, group bys and aggregations. A job takes around 40 minutes to complete. Now one of the data scientists on the team wants to do write some jobs using Python. To learn Spark, he rewrote one of my Scala jobs in Python. From the API-side, everything looks more or less identical. However his jobs take between 5-8 hours to complete! We can also see that the execution plan is quite different, I’m seeing writes to the output much later than in Scala. Is Python I/O really that slow? Thanks - Marius - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
That's true Guillaume. I'm currently aggregating documents considering a week as time range. I will have to make it daily and aggregate the results later. thanks for your hints anyway Arian Pasquali http://about.me/arianpasquali 2014-10-20 13:53 GMT+01:00 Guillaume Pitel guillaume.pi...@exensa.com: Hi, The array size you (or the serializer) tries to allocate is just too big for the JVM. No configuration can help : https://plumbr.eu/outofmemoryerror/requested-array-size-exceeds-vm-limit The only option is to split you problem further by increasing parallelism. Guillaume Hi, I’m using Spark 1.1.0 and I’m having some issues to setup memory options. I get “Requested array size exceeds VM limit” and I’m probably missing something regarding memory configuration https://spark.apache.org/docs/1.1.0/configuration.html. My server has 30G of memory and this are my current settings. ##this one seams that was deprecated export SPARK_MEM=‘25g’ ## worker memory options seams to be the memory for each worker (by default we have a worker for each core) export SPARK_WORKER_MEMORY=‘5g’ I probably need to specify some options using SPARK_DAEMON_JAVA_OPTS, but I’m not quite sure how. I have tried some different options like the following, but I still couldn’t make it right: export SPARK_DAEMON_JAVA_OPTS='-Xmx8G -XX:+UseCompressedOops' export JAVA_OPTS='-Xmx8G -XX:+UseCompressedOops' Does anyone has any idea how can I approach this? 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1566 non-empty blocks out of 1566 blocks 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 4 ms 14/10/11 13:02:06 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory map of 3925 MB to disk (1 time so far) 14/10/11 13:05:17 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory map of 3925 MB to disk (2 times so far) 14/10/11 13:09:15 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 1566) java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/10/11 13:09:15 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140 Arian -- [image: eXenSa] *Guillaume PITEL, Président* +33(0)626 222 431 eXenSa S.A.S. http://www.exensa.com/ 41, rue Périer - 92120 Montrouge - FRANCE Tel +33(0)184 163 677 / Fax +33(0)972 283 705
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
Hi, I’m using Spark 1.1.0 and I’m having some issues to setup memory options. I get “Requested array size exceeds VM limit” and I’m probably missing something regarding memory configuration (https://spark.apache.org/docs/1.1.0/configuration.html). My server has 30G of memory and this are my current settings. ##this one seams that was deprecated export SPARK_MEM=‘25g’ ## worker memory options seams to be the memory for each worker (by default we have a worker for each core) export SPARK_WORKER_MEMORY=‘5g’ I probably need to specify some options using SPARK_DAEMON_JAVA_OPTS, but I’m not quite sure how. I have tried some different options like the following, but I still couldn’t make it right: export SPARK_DAEMON_JAVA_OPTS='-Xmx8G -XX:+UseCompressedOops' export JAVA_OPTS='-Xmx8G -XX:+UseCompressedOops' Does anyone has any idea how can I approach this? 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1566 non-empty blocks out of 1566 blocks 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 4 ms 14/10/11 13:02:06 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory map of 3925 MB to disk (1 time so far) 14/10/11 13:05:17 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory map of 3925 MB to disk (2 times so far) 14/10/11 13:09:15 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 1566) java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/10/11 13:09:15 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140 Arian
Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
Hi Akhil, thanks for your help but I was originally running without xmx option. With that I was just trying to push the limit of my heap size, but obviously doing it wrong. Arian Pasquali http://about.me/arianpasquali 2014-10-20 12:24 GMT+01:00 Akhil Das ak...@sigmoidanalytics.com: Hi Arian, You will get this exception because you are trying to create an array that is larger than the maximum contiguous block of memory in your Java VMs heap. Here since you are setting Worker memory as *5Gb* and you are exporting the *_OPTS as *8Gb*, your application actually thinks it has 8Gb of memory where as it only has 5Gb and hence it exceeds the VM Limit. Thanks Best Regards On Mon, Oct 20, 2014 at 4:42 PM, Arian Pasquali ar...@arianpasquali.com wrote: Hi, I’m using Spark 1.1.0 and I’m having some issues to setup memory options. I get “Requested array size exceeds VM limit” and I’m probably missing something regarding memory configuration https://spark.apache.org/docs/1.1.0/configuration.html. My server has 30G of memory and this are my current settings. ##this one seams that was deprecated export SPARK_MEM=‘25g’ ## worker memory options seams to be the memory for each worker (by default we have a worker for each core) export SPARK_WORKER_MEMORY=‘5g’ I probably need to specify some options using SPARK_DAEMON_JAVA_OPTS, but I’m not quite sure how. I have tried some different options like the following, but I still couldn’t make it right: export SPARK_DAEMON_JAVA_OPTS='-Xmx8G -XX:+UseCompressedOops' export JAVA_OPTS='-Xmx8G -XX:+UseCompressedOops' Does anyone has any idea how can I approach this? 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1566 non-empty blocks out of 1566 blocks 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 4 ms 14/10/11 13:02:06 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory map of 3925 MB to disk (1 time so far) 14/10/11 13:05:17 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory map of 3925 MB to disk (2 times so far) 14/10/11 13:09:15 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 1566) java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/10/11 13:09:15 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140 Arian