Re: Python vs Scala performance

2014-10-22 Thread Arian Pasquali
Interesting thread, Marius.
Btw, I'm curious about your cluster size.
How small is it in terms of RAM and cores?

Arian

2014-10-22 13:17 GMT+01:00 Nicholas Chammas nicholas.cham...@gmail.com:

 Total guess without knowing anything about your code: Do either of these
 two notes from the 1.1.0 release notes
 (http://spark.apache.org/releases/spark-release-1-1-0.html) affect things
 at all?


- PySpark now performs external spilling during aggregations. Old
behavior can be restored by setting spark.shuffle.spill to false.
- PySpark uses a new heuristic for determining the parallelism of
shuffle operations. Old behavior can be restored by setting
spark.default.parallelism to the number of cores in the cluster.
(Both settings are sketched below.)
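
 For reference, a minimal sketch of restoring both old behaviors from
 PySpark (illustrative only; the app name and the core count of 16 are
 assumptions, not from this thread):

     from pyspark import SparkConf, SparkContext

     # Hedged sketch: restore the pre-1.1 defaults mentioned above.
     conf = (SparkConf()
             .setAppName("perf-comparison")             # hypothetical app name
             .set("spark.shuffle.spill", "false")       # disable external spilling
             .set("spark.default.parallelism", "16"))   # assumed cluster core count
     sc = SparkContext(conf=conf)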

  Nick

 On Wed, Oct 22, 2014 at 7:29 AM, Marius Soutier mps@gmail.com wrote:

 We’re using 1.1.0. Yes, I expected Scala to be maybe twice as fast, but
 not by that much...

 On 22.10.2014, at 13:02, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 What version of Spark are you running? Some recent changes to how
 PySpark works relative to Scala Spark
 (https://spark.apache.org/releases/spark-release-1-1-0.html) may explain
 things.

 PySpark should not be that much slower, not by a stretch.

 On Wed, Oct 22, 2014 at 6:11 AM, Ashic Mahtab as...@live.com wrote:

 I'm no expert, but I looked into how the Python bits work a while back
 (I was trying to assess what it would take to add F# support). It seems
 Python hosts a JVM inside of it and talks to Scala Spark in that JVM.
 The Python server bit translates the Python calls into calls in the JVM.
 The Python SparkContext is like an adapter to the JVM SparkContext. If
 you're seeing performance discrepancies, this might be the reason why. If
 the code can be organised to require fewer interactions with the adapter,
 that may improve things. Take this with a pinch of salt... I might be way
 off on this :)
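
 A hedged illustration of "fewer interactions with the adapter" (not from
 the original mail; toy data, and collectAsMap() is just one way to fetch
 a small final result):

     from pyspark import SparkContext

     sc = SparkContext(appName="roundtrip-demo")   # assumed setup
     pairs = sc.parallelize(["a,1", "b,2", "a,3"]).map(
         lambda line: (line.split(",")[0], int(line.split(",")[1])))

     # Costly pattern: collect() drags every element through the Python driver.
     # totals = {}
     # for k, v in pairs.collect():
     #     totals[k] = totals.get(k, 0) + v

     # Cheaper: the aggregation stays distributed; only the small final
     # result crosses back into Python.
     totals = pairs.reduceByKey(lambda a, b: a + b).collectAsMap()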

 Cheers,
 Ashic.

  From: mps@gmail.com
  Subject: Python vs Scala performance
  Date: Wed, 22 Oct 2014 12:00:41 +0200
  To: user@spark.apache.org

 
  Hi there,
 
   we have a small Spark cluster running and are processing around 40 GB
  of Gzip-compressed JSON data per day. I have written a couple of
  word-count-like Scala jobs that essentially pull in all the data, do some
  joins, group-bys, and aggregations. A job takes around 40 minutes to
  complete.
 
   Now one of the data scientists on the team wants to write some jobs
  using Python. To learn Spark, he rewrote one of my Scala jobs in Python.
  From the API side, everything looks more or less identical. However, his
  jobs take between 5 and 8 hours to complete! We can also see that the
  execution plan is quite different; I'm seeing writes to the output much
  later than in Scala.
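
   (For context, a rough sketch of the shape of such a job in PySpark; the
  paths and field names are hypothetical, not taken from the actual job:)

      import json
      from pyspark import SparkContext

      sc = SparkContext(appName="daily-aggregation")               # hypothetical
      records = (sc.textFile("hdfs:///logs/2014-10-22/*.json.gz")  # assumed layout
                   .map(json.loads))
      # word-count-like aggregation, as described above
      counts = (records.map(lambda r: (r.get("user"), 1))          # hypothetical field
                       .reduceByKey(lambda a, b: a + b))
      counts.saveAsTextFile("hdfs:///out/daily-counts")            # assumed output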
 
  Is Python I/O really that slow?
 
 
  Thanks
  - Marius
 
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-10-21 Thread Arian Pasquali
That's true, Guillaume.
I'm currently aggregating documents over a week-long time range.
I will have to make it daily and aggregate the results later.
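
(Something along these lines, perhaps; a hedged sketch assuming an existing
SparkContext sc, with hypothetical dates, paths, and key function:)

    # Aggregate each day separately, then merge the per-day results.
    daily = []
    for day in ["2014-10-13", "2014-10-14", "2014-10-15"]:    # hypothetical dates
        rdd = sc.textFile("hdfs:///docs/%s/*.json.gz" % day)  # assumed layout
        daily.append(rdd.map(lambda doc: (doc[:8], 1))        # toy key function
                        .reduceByKey(lambda a, b: a + b))
    weekly = sc.union(daily).reduceByKey(lambda a, b: a + b)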

thanks for your hints anyway

Arian Pasquali
http://about.me/arianpasquali

2014-10-20 13:53 GMT+01:00 Guillaume Pitel guillaume.pi...@exensa.com:

  Hi,

 The array size you (or the serializer) are trying to allocate is just too
 big for the JVM. No configuration can help:

 https://plumbr.eu/outofmemoryerror/requested-array-size-exceeds-vm-limit

 The only option is to split your problem further by increasing parallelism.
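
 A minimal sketch of what that can look like in PySpark (the RDD and the
 partition count of 2000 are assumptions, not from this thread):

     # More partitions -> smaller chunks per task, smaller serialized arrays.
     repartitioned = rdd.repartition(2000)

     # Shuffle operations also take an explicit partition count.
     counts = (rdd.map(lambda x: (x, 1))
                  .reduceByKey(lambda a, b: a + b, numPartitions=2000))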

 Guillaume

 Hi,
 I’m using Spark 1.1.0 and I’m having some issues setting up memory options.
 I get “Requested array size exceeds VM limit” and I’m probably missing
 something regarding memory configuration
 (https://spark.apache.org/docs/1.1.0/configuration.html).

 My server has 30 GB of memory and these are my current settings.

 ## this one seems to be deprecated
 export SPARK_MEM='25g'

 ## worker memory seems to be the memory for each worker (by default we
 have a worker for each core)
 export SPARK_WORKER_MEMORY='5g'

 I probably need to specify some options using SPARK_DAEMON_JAVA_OPTS,
 but I’m not quite sure how.
 I have tried a few different options, like the following, but I still
 couldn’t get it right:

 export SPARK_DAEMON_JAVA_OPTS='-Xmx8G -XX:+UseCompressedOops'
 export JAVA_OPTS='-Xmx8G -XX:+UseCompressedOops'

 Does anyone have any idea how I can approach this?

 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1566 non-empty blocks out of 1566 blocks
 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 4 ms
 14/10/11 13:02:06 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory map of 3925 MB to disk (1 time so far)
 14/10/11 13:05:17 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory map of 3925 MB to disk (2 times so far)
 14/10/11 13:09:15 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 1566)
 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
     at java.util.Arrays.copyOf(Arrays.java:2271)
     at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
     at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
     at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
     at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
     at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
     at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
     at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:745)
 14/10/11 13:09:15 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main]
 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
     at java.util.Arrays.copyOf(Arrays.java:2271)
     at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
     at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140


  Arian



 --
  Guillaume PITEL, Président
 +33(0)626 222 431

 eXenSa S.A.S. http://www.exensa.com/
  41, rue Périer - 92120 Montrouge - FRANCE
 Tel +33(0)184 163 677 / Fax +33(0)972 283 705



java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-10-20 Thread Arian Pasquali
Hi,
I’m using Spark 1.1.0 and I’m having some issues setting up memory options.
I get “Requested array size exceeds VM limit” and I’m probably missing
something regarding memory configuration
(https://spark.apache.org/docs/1.1.0/configuration.html).

My server has 30 GB of memory and these are my current settings.

## this one seems to be deprecated
export SPARK_MEM='25g'

## worker memory seems to be the memory for each worker (by default we
have a worker for each core)
export SPARK_WORKER_MEMORY='5g'

I probably need to specify some options using SPARK_DAEMON_JAVA_OPTS, but I’m
not quite sure how.
I have tried a few different options, like the following, but I still
couldn’t get it right:

export SPARK_DAEMON_JAVA_OPTS='-Xmx8G -XX:+UseCompressedOops'
export JAVA_OPTS='-Xmx8G -XX:+UseCompressedOops'

Does anyone have any idea how I can approach this?

14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1566 non-empty blocks out of 1566 blocks
14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 4 ms
14/10/11 13:02:06 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory map of 3925 MB to disk (1 time so far)
14/10/11 13:05:17 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory map of 3925 MB to disk (2 times so far)
14/10/11 13:09:15 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 1566)
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
14/10/11 13:09:15 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main]
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140


Arian

Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-10-20 Thread Arian Pasquali
Hi Akhil,
thanks for your help.

But I was originally running without the -Xmx option. With it I was just
trying to push my heap size limit, but obviously I was doing it wrong.

Arian Pasquali
http://about.me/arianpasquali

2014-10-20 12:24 GMT+01:00 Akhil Das ak...@sigmoidanalytics.com:

 Hi Arian,

 You will get this exception because you are trying to create an array
 that is larger than the maximum contiguous block of memory in your Java
 VM's heap.

 Here, since you are setting worker memory to 5 GB but exporting the
 *_OPTS as 8 GB, your application actually thinks it has 8 GB of memory
 whereas it only has 5 GB, and hence it exceeds the VM limit.
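
 If that's the cause, a minimal sketch of keeping the two aligned
 (illustrative values; the app name is hypothetical):

     from pyspark import SparkConf, SparkContext

     # Keep the executor heap within what each worker can hand out
     # (SPARK_WORKER_MEMORY=5g in the settings quoted below).
     conf = (SparkConf()
             .setAppName("aggregation-job")          # hypothetical app name
             .set("spark.executor.memory", "4g"))    # stays under the 5g cap
     sc = SparkContext(conf=conf)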



 Thanks
 Best Regards

 On Mon, Oct 20, 2014 at 4:42 PM, Arian Pasquali ar...@arianpasquali.com
 wrote:

 Hi,
 I’m using Spark 1.1.0 and I’m having some issues setting up memory options.
 I get “Requested array size exceeds VM limit” and I’m probably missing
 something regarding memory configuration
 (https://spark.apache.org/docs/1.1.0/configuration.html).

 My server has 30 GB of memory and these are my current settings.

 ## this one seems to be deprecated
 export SPARK_MEM='25g'

 ## worker memory seems to be the memory for each worker (by default we
 have a worker for each core)
 export SPARK_WORKER_MEMORY='5g'

 I probably need to specify some options using SPARK_DAEMON_JAVA_OPTS, but
 I’m not quite sure how.
 I have tried a few different options, like the following, but I still
 couldn’t get it right:

 export SPARK_DAEMON_JAVA_OPTS='-Xmx8G -XX:+UseCompressedOops'
 export JAVA_OPTS='-Xmx8G -XX:+UseCompressedOops'

 Does anyone have any idea how I can approach this?

 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1566 non-empty blocks out of 1566 blocks
 14/10/11 13:00:16 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 4 ms
 14/10/11 13:02:06 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory map of 3925 MB to disk (1 time so far)
 14/10/11 13:05:17 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory map of 3925 MB to disk (2 times so far)
 14/10/11 13:09:15 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 1566)
 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
     at java.util.Arrays.copyOf(Arrays.java:2271)
     at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
     at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
     at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
     at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
     at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
     at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
     at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:745)
 14/10/11 13:09:15 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main]
 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
     at java.util.Arrays.copyOf(Arrays.java:2271)
     at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
     at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140


 Arian