Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-23 Thread Martin Goodson
Have you tried to repartition() your original data to make more partitions
before you aggregate?
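For readers of the archive: a minimal sketch of this suggestion applied to the code from the original post. The partition count of 2000 is a hypothetical illustrative value, not a recommendation.

```scala
import org.apache.spark.sql.functions._  // for sum()

// Sketch only: repartition before the groupBy so each aggregation task
// handles fewer rows and fewer distinct groups.
val people = sqlContext.parquetFile("/data.parquet").repartition(2000)
val res = people.groupBy("name", "date")
  .agg(sum("power"), sum("supply"))
  .take(10)
res.foreach(println)
```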


-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240

On Mon, Mar 23, 2015 at 4:12 PM, Yiannis Gkoufas johngou...@gmail.com
wrote:

 Hi Yin,

 Yes, I have set spark.executor.memory to 8g and the worker memory to 16g
 without any success.
 I cannot figure out how to increase the number of mapPartitions tasks.

 Thanks a lot

 On 20 March 2015 at 18:44, Yin Huai yh...@databricks.com wrote:

 spark.sql.shuffle.partitions only controls the number of tasks in the
 second stage (the number of reducers). For your case, I'd say that the
 number of tasks in the first stage (the number of mappers) will be the number
 of files you have.

 Actually, have you changed spark.executor.memory (it controls the
 memory for an executor of your application)? I did not see it in your
 original email. The difference between worker memory and executor memory
 can be found at http://spark.apache.org/docs/1.3.0/spark-standalone.html:

 SPARK_WORKER_MEMORY
 Total amount of memory to allow Spark applications to use on the machine,
 e.g. 1000m, 2g (default: total memory minus 1 GB); note that each
 application's individual memory is configured using its
 spark.executor.memory property.
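To make the distinction concrete, a hypothetical standalone setup could look like this (the values are illustrative only):

```
# conf/spark-env.sh on each worker machine
SPARK_WORKER_MEMORY=16g

# conf/spark-defaults.conf for a given application
spark.executor.memory 8g
```

With these values a single worker could host executors for up to two such applications at once.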


 On Fri, Mar 20, 2015 at 9:25 AM, Yiannis Gkoufas johngou...@gmail.com
 wrote:

 Actually I realized that the correct way is:

 sqlContext.sql("set spark.sql.shuffle.partitions=1000")

 but I am still experiencing the same behavior/error.

 On 20 March 2015 at 16:04, Yiannis Gkoufas johngou...@gmail.com wrote:

 Hi Yin,

 the way I set the configuration is:

 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 sqlContext.setConf("spark.sql.shuffle.partitions", "1000")

 it is the correct way right?
 In the mapPartitions task (the first task which is launched), I get
 again the same number of tasks and again the same error. :(

 Thanks a lot!

 On 19 March 2015 at 17:40, Yiannis Gkoufas johngou...@gmail.com
 wrote:

 Hi Yin,

 thanks a lot for that! Will give it a shot and let you know.

 On 19 March 2015 at 16:30, Yin Huai yh...@databricks.com wrote:

 Was the OOM thrown during the execution of first stage (map) or the
 second stage (reduce)? If it was the second stage, can you increase the
 value of spark.sql.shuffle.partitions and see if the OOM disappears?

 This setting controls the number of reducers Spark SQL will use and
 the default is 200. Maybe there are too many distinct values and the 
 memory
 pressure on every task (of those 200 reducers) is pretty high. You can
 start with 400 and increase it until the OOM disappears. Hopefully this
 will help.

 Thanks,

 Yin
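For anyone following along in the archive, the two usual ways to apply this setting (400 here is just the suggested starting point from the message above):

```scala
// Option 1: set it on the SQLContext (note both arguments are strings)
sqlContext.setConf("spark.sql.shuffle.partitions", "400")

// Option 2: issue a SQL SET statement
sqlContext.sql("SET spark.sql.shuffle.partitions=400")
```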


 On Wed, Mar 18, 2015 at 4:46 PM, Yiannis Gkoufas 
 johngou...@gmail.com wrote:

 Hi Yin,

 Thanks for your feedback. I have 1700 parquet files, sized 100MB
 each. The number of tasks launched is equal to the number of parquet 
 files.
 Do you have any idea on how to deal with this situation?

 Thanks a lot
 On 18 Mar 2015 17:35, Yin Huai yh...@databricks.com wrote:

 Seems there are too many distinct groups processed in a task, which
 triggers the problem.

 How many files does your dataset have, and how large is each file? Seems
 your query will be executed with two stages, table scan and map-side
 aggregation in the first stage and the final round of reduce-side
 aggregation in the second stage. Can you take a look at the numbers of
 tasks launched in these two stages?

 Thanks,

 Yin

 On Wed, Mar 18, 2015 at 11:42 AM, Yiannis Gkoufas 
 johngou...@gmail.com wrote:

 Hi there, I set the executor memory to 8g but it didn't help

 On 18 March 2015 at 13:59, Cheng Lian lian.cs@gmail.com
 wrote:

 You should probably increase executor memory by setting
 spark.executor.memory.

 Full list of available configurations can be found here
 http://spark.apache.org/docs/latest/configuration.html

 Cheng


 On 3/18/15 9:15 PM, Yiannis Gkoufas wrote:

 Hi there,

 I was trying the new DataFrame API with some basic operations on
 a parquet dataset.
 I have 7 nodes of 12 cores and 8GB RAM allocated to each worker
 in a standalone cluster mode.
 The code is the following:

 val people = sqlContext.parquetFile("/data.parquet")
 val res = people.groupBy("name", "date").
 agg(sum("power"), sum("supply")).take(10)
 res.foreach(println)

 The dataset consists of 16 billion entries.
 The error I get is java.lang.OutOfMemoryError: GC overhead limit
 exceeded

 My configuration is:

 spark.serializer org.apache.spark.serializer.KryoSerializer
 spark.driver.memory 6g
 spark.executor.extraJavaOptions -XX:+UseCompressedOops
 spark.shuffle.manager sort

 Any idea how can I workaround this?

 Thanks a lot













Job using Spark for Machine Learning

2014-07-29 Thread Martin Goodson
I'm not sure if job adverts are allowed on here - please let me know if
not.

Otherwise, if you're interested in using Spark in an R&D machine learning
project then please get in touch. We are a startup based in London.

Our data sets are on a massive scale: we collect data on over a billion
users per month and are second only to Google in the contextual advertising
space (ok - a distant second!).

Details here:
http://grnh.se/rl8f25

-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240


Re: Configuring Spark Memory

2014-07-24 Thread Martin Goodson
Thank you Nishkam,
I have read your code. So, for the sake of my understanding, it seems that
for each spark context there is one executor per node? Can anyone confirm
this?


-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240


On Thu, Jul 24, 2014 at 6:12 AM, Nishkam Ravi nr...@cloudera.com wrote:

 See if this helps:

 https://github.com/nishkamravi2/SparkAutoConfig/

 It's a very simple tool for auto-configuring default parameters in Spark.
 Takes as input high-level parameters (like number of nodes, cores per node,
 memory per node, etc) and spits out default configuration, user advice and
 command line. Compile (javac SparkConfigure.java) and run (java
 SparkConfigure).

 Also cc'ing dev in case others are interested in helping evolve this over
 time (by refining the heuristics and adding more parameters).








Re: Configuring Spark Memory

2014-07-24 Thread Martin Goodson
Great - thanks for the clarification Aaron. The offer stands for me to
write some documentation and an example that covers this without leaving
*any* room for ambiguity.




-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240


On Thu, Jul 24, 2014 at 6:09 PM, Aaron Davidson ilike...@gmail.com wrote:

 Whoops, I was mistaken in my original post last year. By default, there is
 one executor per node per Spark Context, as you said.
 spark.executor.memory is the amount of memory that the application
 requests for each of its executors. SPARK_WORKER_MEMORY is the amount of
 memory a Spark Worker is willing to allocate in executors.

 So if you were to set SPARK_WORKER_MEMORY to 8g everywhere on your
 cluster, and spark.executor.memory to 4g, you would be able to run 2
 simultaneous Spark Contexts, each of which gets 4g per node. Similarly, if
 spark.executor.memory were 8g, you could only run 1 Spark Context at a time
 on the cluster, but it would get all the cluster's memory.


Configuring Spark Memory

2014-07-23 Thread Martin Goodson
We are having difficulties configuring Spark, partly because we still don't
understand some key concepts. For instance, how many executors are there
per machine in standalone mode? This is after having closely read the
documentation several times:

http://spark.apache.org/docs/latest/configuration.html
http://spark.apache.org/docs/latest/spark-standalone.html
http://spark.apache.org/docs/latest/tuning.html
http://spark.apache.org/docs/latest/cluster-overview.html

The cluster overview has some information here about executors but is
ambiguous about whether there are single executors or multiple executors on
each machine.

 This message from Aaron Davidson implies that the executor memory should
be set to total available memory on the machine divided by the number of
cores:
http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCANGvG8o5K1SxgnFMT_9DK=vj_plbve6zh_dn5sjwpznpbcp...@mail.gmail.com%3E

But other messages imply that the executor memory should be set to the
*total* available memory of each machine.

We would very much appreciate some clarity on this and the myriad of other
memory settings available (daemon memory, worker memory etc). Perhaps a
worked example could be added to the docs? I would be happy to provide some
text as soon as someone can enlighten me on the technicalities!

Thank you

-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240


Re: Configuring Spark Memory

2014-07-23 Thread Martin Goodson
Thanks Andrew,

So if there is only one SparkContext there is only one executor per
machine? This seems to contradict Aaron's message from the link above:

"If each machine has 16 GB of RAM and 4 cores, for example, you might set
spark.executor.memory between 2 and 3 GB, totaling 8-12 GB used by Spark."

Am I reading this incorrectly?

Anyway our configuration is 21 machines (one master and 20 slaves) each
with 60Gb. We would like to use 4 cores per machine. This is pyspark so we
want to leave say 16Gb on each machine for python processes.

Thanks again for the advice!
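A hypothetical sizing for the cluster described above (assuming one executor per machine in standalone mode; the numbers simply follow the figures in the message and are not a recommendation):

```
# conf/spark-env.sh on each of the 20 slaves
SPARK_WORKER_CORES=4
SPARK_WORKER_MEMORY=44g       # 60 GB minus ~16 GB reserved for Python processes

# conf/spark-defaults.conf
spark.executor.memory 40g     # a little below SPARK_WORKER_MEMORY for JVM headroom
```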



-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240


On Wed, Jul 23, 2014 at 4:19 PM, Andrew Ash and...@andrewash.com wrote:

 Hi Martin,

 In standalone mode, each SparkContext you initialize gets its own set of
 executors across the cluster.  So for example if you have two shells open,
 they'll each get two JVMs on each worker machine in the cluster.

 As far as the other docs, you can configure the total number of cores
 requested for the SparkContext, the amount of memory for the executor JVM
 on each machine, the amount of memory for the Master/Worker daemons (little
 needed since work is done in executors), and several other settings.

 Which of those are you interested in?  What spec hardware do you have and
 how do you want to configure it?

 Andrew


 On Wed, Jul 23, 2014 at 6:10 AM, Martin Goodson mar...@skimlinks.com
 wrote:

 We are having difficulties configuring Spark, partly because we still
 don't understand some key concepts. For instance, how many executors are
 there per machine in standalone mode? This is after having closely read
 the documentation several times:

 http://spark.apache.org/docs/latest/configuration.html
 http://spark.apache.org/docs/latest/spark-standalone.html
 http://spark.apache.org/docs/latest/tuning.html
 http://spark.apache.org/docs/latest/cluster-overview.html

 The cluster overview has some information here about executors but is
 ambiguous about whether there are single executors or multiple executors on
 each machine.

  This message from Aaron Davidson implies that the executor memory
 should be set to total available memory on the machine divided by the
 number of cores:
 http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCANGvG8o5K1SxgnFMT_9DK=vj_plbve6zh_dn5sjwpznpbcp...@mail.gmail.com%3E

 But other messages imply that the executor memory should be set to the
 *total* available memory of each machine.

 We would very much appreciate some clarity on this and the myriad of
 other memory settings available (daemon memory, worker memory etc). Perhaps
 a worked example could be added to the docs? I would be happy to provide
 some text as soon as someone can enlighten me on the technicalities!

 Thank you

 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240





Re: Problem running Spark shell (1.0.0) on EMR

2014-07-22 Thread Martin Goodson
I am also having exactly the same problem, calling using pyspark. Has
anyone managed to get this script to work?


-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240


On Wed, Jul 16, 2014 at 2:10 PM, Ian Wilkinson ia...@me.com wrote:

 Hi,

 I’m trying to run the Spark (1.0.0) shell on EMR and encountering a
 classpath issue.
 I suspect I’m missing something gloriously obviously, but so far it is
 eluding me.

 I launch the EMR Cluster (using the aws cli) with:

 aws emr create-cluster --name "Test Cluster" \
 --ami-version 3.0.3 \
 --no-auto-terminate \
 --ec2-attributes KeyName=... \
 --bootstrap-actions
 Path=s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb \
 --instance-groups
 InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium  \
 InstanceGroupType=CORE,InstanceCount=1,InstanceType=m1.medium
 --region eu-west-1

 then,

 $ aws emr ssh --cluster-id ... --key-pair-file ... --region eu-west-1

 On the master node, I then launch the shell with:

 [hadoop@ip-... spark]$ ./bin/spark-shell

 and try performing:

 scala> val logs = sc.textFile("s3n://.../")

 this produces:

 14/07/16 12:40:35 WARN storage.BlockManager: Putting block broadcast_0
 failed
 java.lang.NoSuchMethodError:
 com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;


 Any help mighty welcome,
 ian




Re: Spark vs Google cloud dataflow

2014-06-27 Thread Martin Goodson
My experience is that gaining 20 spot instances accounts for a tiny
fraction of the total time of provisioning a cluster with spark-ec2. This
is not (solely) an AWS issue.


-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240


On Thu, Jun 26, 2014 at 10:14 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Hmm, I remember a discussion on here about how the way in which spark-ec2
 rsyncs stuff to the cluster for setup could be improved, and I’m assuming
 there are other such improvements to be made. Perhaps those improvements
 don’t matter much when compared to EC2 instance launch times, but I’m not
 sure.


 On Thu, Jun 26, 2014 at 4:48 PM, Aureliano Buendia buendia...@gmail.com
 wrote:




 On Thu, Jun 26, 2014 at 9:42 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:


 That’s technically true, but I’d be surprised if there wasn’t a lot of
 room for improvement in spark-ec2 regarding cluster launch+config
 times.

 Unfortunately, this is not a Spark issue, but an AWS one. Starting a
 few months ago, Amazon AWS services have been having bigger and bigger
 lags. Indeed, the default timeout hard-coded in spark-ec2 is no longer
 able to launch the cluster successfully, and many people here have reported
 that they had to increase it.








Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Martin Goodson
How about London?


-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240


On Mon, Mar 31, 2014 at 6:28 PM, Andy Konwinski andykonwin...@gmail.com wrote:

 Hi folks,

 We have seen a lot of community growth outside of the Bay Area and we are
 looking to help spur even more!

 For starters, the organizers of the Spark meetups here in the Bay Area
 want to help anybody that is interested in setting up a meetup in a new
 city.

 Some amazing Spark champions have stepped forward in Seattle, Vancouver,
 Boulder/Denver, and a few other areas already.

 Right now, we are looking to connect with you Spark enthusiasts in NYC
 about helping to run an inaugural Spark Meetup in your area.

 You can reply to me directly if you are interested and I can tell you
 about all of the resources we have to offer (speakers from the core
 community, a budget for food, help scheduling, etc.), and let's make this
 happen!

 Andy
