Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
 So increasing Executors without increasing physical resources
If I have a system with 16 GB of RAM, I allocate 1 GB to each executor, and I
set the number of executors to 8, then I am increasing the resources, right?
How do you explain this case?
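For concreteness, such a setup would look roughly like this in standalone mode
(a sketch only; these are the standard standalone-mode settings, and the values
are illustrative):

# conf/spark-env.sh on the 16 GB machine
export SPARK_WORKER_INSTANCES=8   # eight worker daemons on this one machine
export SPARK_WORKER_MEMORY=1g     # memory each worker may grant to executors

# per application (conf/spark-defaults.conf, or the equivalent SparkConf calls)
spark.executor.memory   1g        # 8 x 1g still carves up the same 16 GB

Note that nothing here adds memory: the eight workers just divide the machine's
existing 16 GB among themselves.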

Thank You

Re: Worker and Nodes

2015-02-21 Thread Aaron Davidson
Note that the parallelism (i.e., number of partitions) is just an upper
bound on how much of the work can be done in parallel. If you have 200
partitions, then you can divide the work among anywhere from 1 to 200 cores and
all resources will remain utilized. If you have more than 200 cores,
though, then some will not be used, so you would want to increase
parallelism further. (There are other rules-of-thumb -- for instance, it's
generally good to have at least 2x more partitions than cores for straggler
mitigation, but these are essentially just optimizations.)

Further note that when you increase the number of Executors for the same
set of resources (i.e., starting 10 Executors on a single machine instead
of 1), you make Spark's job harder. Spark has to communicate in an
all-to-all manner across Executors for shuffle operations, and it uses TCP
sockets to do so whether or not the Executors happen to be on the same
machine. So increasing Executors without increasing physical resources
means Spark has to do more communication to do the same work.

We expect that increasing the number of Executors by a factor of 10, given
an increase in the number of physical resources by the same factor, would
also improve performance by 10x. This is not always the case for the
precise reason above (increased communication overhead), but typically we
can get close. The actual observed improvement is very algorithm-dependent,
though; for instance, some ML algorithms become hard to scale out past a
certain point because the increase in communication overhead outweighs the
increase in parallelism.
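
To make the rule of thumb above concrete, a rough sketch (Scala, assuming an
existing SparkContext sc; the path and core count are illustrative):

// say the cluster has 8 cores in total
val totalCores = 8
// read with an explicit lower bound on the number of partitions
val data = sc.textFile("hdfs://host/points.txt", 3 * totalCores)
// or re-chop an existing RDD to the desired parallelism
val resized = data.repartition(2 * totalCores)
println(resized.partitions.size)   // confirm the parallelism actually achieved

Since repartition shuffles the data, it is worth doing once up front rather
than once per stage.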

Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
Also, take SparkPageRank (org.apache.spark.examples) as an example: various
RDDs are created and transformed in the code. If I want to test out what the
optimum number of partitions is, the one that gives me the best performance, I
have to change the number of partitions on each run, right? But there are
several RDDs there, so which RDD do I partition? In other words, if I partition
the first RDD, the one created from the data in HDFS, am I assured that the
other RDDs transformed from it will be partitioned in the same way?
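
A sketch of the behavior in question (Scala; Link, parseLine, and the path are
hypothetical stand-ins, not code from the PageRank example):

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD operations (needed in Spark 1.x)

case class Link(src: String, dst: String)              // hypothetical record type
def parseLine(line: String): Link = {
  val fields = line.split("\\s+")
  Link(fields(0), fields(1))
}

val links  = sc.textFile("hdfs://host/links.txt", 32)  // 32 partitions
val parsed = links.map(parseLine)                      // still 32: map is narrow
val pairs  = parsed.map(l => (l.src, l.dst))           // still 32, but no partitioner
val grouped = pairs.groupByKey(64)                     // a shuffle takes its own count: 64
// to pin a partitioning that key-preserving operations can then reuse:
val pinned = pairs.partitionBy(new HashPartitioner(32)).cache()

So narrow transformations inherit the parent's partition count, but every
shuffle boundary is a chance for the count to change, which is why
spark.default.parallelism is usually the single knob worth sweeping.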

Thank You

Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
In this case, I just wanted to know whether a single-node cluster with several
workers acts like a simulator of a multi-node cluster with several nodes.
That is, if we have a single-node cluster with, say, 10 workers, can we say
that the same behavior will take place on a cluster of 10 nodes?
Put differently, without having the 10-node cluster, could I learn the behavior
of the application on a 10-node cluster by using a single node with 10 workers?
The time taken may vary, but I am asking about the behavior. Can we say that?

Re: Worker and Nodes

2015-02-21 Thread Frank Austin Nothaft
There could be many different things causing this. For example, if you only 
have a single partition of data, increasing the number of tasks will only 
increase execution time due to higher scheduling overhead. Additionally, how 
large is a single partition in your application relative to the amount of 
memory on the machine? If you are running on a machine with a small amount of 
memory, increasing the number of executors per machine may increase GC/memory 
pressure. On a single node, since your executors share the same memory and I/O system, 
you could just thrash everything.

In any case, you can’t normally generalize between increased parallelism on a 
single node and increased parallelism across a cluster. If you are purely 
limited by CPU, then yes, you can normally make that generalization. However, 
when you increase the number of workers in a cluster, you are providing your 
app with more resources (memory capacity and bandwidth, and disk bandwidth). 
When you increase the number of tasks executing on a single node, you do not 
increase the pool of available resources.

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466

Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
So, with the increase in the number of worker instances, if I also increase
the degree of parallelism, will it make any difference?
And can I use this model the other way round too? That is, can I always predict
the deterioration in an app's performance as the number of worker instances
increases?

Thank You

Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
Yes, I am talking about a standalone single-node cluster.

No, I am not increasing parallelism. I just wanted to know if this is natural.
Does message passing across the workers account for what is happening?

I am running SparkKMeans, just to validate one prediction model, using several
data sets. It is in standalone mode, and I am varying the workers from 1 to 16.

Re: Worker and Nodes

2015-02-21 Thread Sean Owen
What's your storage like? Are you adding worker machines that are
remote from where the data lives? I wonder if it just means you are
spending more and more time sending the data over the network as you
try to ship more of it to more remote workers.

To answer your question: no, in general more workers means more
parallelism and therefore faster execution. But that depends on a lot
of things. For example, if your process isn't parallelized to use all
available execution slots, adding more slots doesn't do anything.

Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
No, I just have a single-node standalone cluster.

I am not tweaking the code to increase parallelism. I am just running the
SparkKMeans example that ships with Spark 1.0.0.
I just wanted to know if this behavior is natural, and if so, what causes it?

Thank you

Re: Worker and Nodes

2015-02-21 Thread Sean Owen
"Workers" has a specific meaning in Spark. Are you running many on one
machine? That's possible, but not usual.

Each worker's executors then have access to only a fraction of your machine's
resources. If you're not increasing parallelism, maybe you're not
actually using the additional workers, so you are using fewer resources for
your problem.

Or because the resulting executors are smaller, maybe you're hitting
GC thrashing in these executors with smaller heaps.

Or if you're not actually configuring the executors to use less
memory, maybe you're over-committing your RAM and swapping?

Bottom line, you wouldn't use multiple workers on one small standalone
node. This isn't a good way to estimate performance on a distributed
cluster either.
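
One way to check the GC and over-commit hypotheses directly (a sketch using
standard Spark 1.x properties; the values are illustrative):

# conf/spark-defaults.conf (or the equivalent SparkConf calls)
spark.executor.memory            1g
# surface GC activity in the executor logs to spot thrashing:
spark.executor.extraJavaOptions  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

If the executor heaps are genuinely capped like this, the workers' combined
memory stays below physical RAM and swapping is ruled out, leaving GC and
scheduling overhead as the likelier causes of any remaining slowdown.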

Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
Yes, I have decreased the executor memory.
But, if I have to do this, then I have to tweak the code corresponding to each
configuration, right?

Re: Worker and Nodes

2015-02-21 Thread Deep Pradhan
So, if I keep the number of instances constant and increase the degree of
parallelism in steps, can I expect the performance to increase?

Thank You

Re: Worker and Nodes

2015-02-21 Thread Sean Owen
I can imagine a few reasons. Adding workers might cause fewer tasks to
execute locally (?), so you may be executing more remotely.

Are you increasing parallelism? For trivial jobs, chopping them up
further may cause you to pay more overhead managing so many small
tasks, for no speedup in execution time.

Can you provide any more specifics, though? You haven't said what
you're running, what mode, how many workers, how long it takes, etc.
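
A sketch of the kind of measurement that would show where that crossover
happens (Scala, assuming an existing SparkContext sc; the path and the trivial
stand-in job are illustrative):

val data = sc.textFile("hdfs://host/points.txt").cache()
data.count()                                       // materialize the cache before timing
for (p <- Seq(1, 2, 4, 8, 16, 32)) {
  val t0 = System.nanoTime()
  data.repartition(p).map(_.length).reduce(_ + _)  // trivial stand-in job
  println(s"partitions=$p  seconds=${(System.nanoTime() - t0) / 1e9}")
}

Past the point where every core has work, per-task scheduling cost dominates
and the timing curve should flatten or turn upward.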

Re: Worker and Nodes

2015-02-21 Thread Yiannis Gkoufas
Hi,

I have experienced the same behavior. You are talking about standalone
cluster mode, right?

BR

On 21 February 2015 at 14:37, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Hi,
 I have been running some jobs on my local single-node standalone cluster.
 I am varying the worker instances for the same job, and the time taken for
 the job to complete increases as the number of workers increases. I repeated
 some experiments varying the number of nodes in a cluster too, and the same
 behavior is seen.
 Can the idea of worker instances be extrapolated to the nodes in a cluster?

 Thank You