Re: where are my python lambda functions run in yarn-client mode?

2014-10-11 Thread Evan Samanas
It's true that it is an implementation detail, but it's a very important
one to document, because it can change results depending on whether I use
take or collect.  The issue I was running into was that the executor had a
different operating system than the driver, and I was using 'pipe' with a
binary I compiled myself.  I needed to make sure I used the binary compiled
for the operating system I expect it to run on.  So in cases where I was
only interested in the first value, my code was breaking horribly on 1.0.2,
but working fine on 1.1.

My only suggestion would be to backport 'spark.localExecution.enabled' to
the 1.0 line.  Thanks for all your help!

Evan

On Fri, Oct 10, 2014 at 10:40 PM, Davies Liu dav...@databricks.com wrote:

 This is an implementation detail, so it's not documented :-(

 If you think this is a blocker for you, you could create a JIRA; maybe
 it could be fixed in 1.0.3+.

 Davies

 On Fri, Oct 10, 2014 at 5:11 PM, Evan evan.sama...@gmail.com wrote:
  Thank you!  I was looking for a config variable to that end, but I was
  looking in the Spark 1.0.2 documentation, since that was the version I
  had the problem with.  Is this behavior documented in 1.0.2's
  documentation?
 
  Evan
 
  On 10/09/2014 04:12 PM, Davies Liu wrote:
 
  When you call rdd.take() or rdd.first(), it may[1] execute the job
  locally (on the driver); otherwise, all jobs are executed on the cluster.

  There is a config called `spark.localExecution.enabled` (since 1.1) to
  change this.  It is not enabled by default, so all functions will be
  executed on the cluster.  If you set this to `true`, you get the same
  behavior as in 1.0.

  [1] If it does not get enough items from the first partition, it will
  try multiple partitions at a time, and those jobs will be executed on
  the cluster.
 
  On Thu, Oct 9, 2014 at 12:14 PM, esamanas evan.sama...@gmail.com wrote:
 
  Hi,
 
  I am using PySpark and I'm trying to support both Spark 1.0.2 and 1.1.0
  with my app, which will run in yarn-client mode.  However, when I use
  'map' to run a Python lambda function over an RDD, it appears to run on
  different machines depending on the version, and this is causing
  problems.
 
  In both cases, I am using a Hadoop cluster that runs Linux on all of its
  nodes.  I am submitting my jobs from a machine running Mac OS X 10.9.
  As a reproducer, here is my script:
 
  import platform
  print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]
 
  The answer in Spark 1.1.0:
  'Linux'
 
  The answer in Spark 1.0.2:
  'Darwin'
 
  In other experiments, I changed the size of the list that gets
  parallelized, thinking maybe 1.0.2 just runs jobs on the driver node if
  they're small enough.  I got the same answer (even with 1 million
  numbers).
 
  This is a troubling difference.  I would expect all functions run on an
  RDD to be executed on my worker nodes in the Hadoop cluster, but this is
  clearly not the case for 1.0.2.  Why does this difference exist?  How
  can I accurately detect which jobs will run where?
 
  Thank you,
 
  Evan
 
 
 
 
 
 
 



Re: where are my python lambda functions run in yarn-client mode?

2014-10-11 Thread Davies Liu
Created JIRA for this: https://issues.apache.org/jira/browse/SPARK-3915

On Sat, Oct 11, 2014 at 12:40 PM, Evan Samanas evan.sama...@gmail.com wrote:
 It's true that it is an implementation detail, but it's a very important
 one to document, because it can change results depending on whether I use
 take or collect.  The issue I was running into was that the executor had a
 different operating system than the driver, and I was using 'pipe' with a
 binary I compiled myself.  I needed to make sure I used the binary
 compiled for the operating system I expect it to run on.  So in cases
 where I was only interested in the first value, my code was breaking
 horribly on 1.0.2, but working fine on 1.1.

 My only suggestion would be to backport 'spark.localExecution.enabled' to
 the 1.0 line.  Thanks for all your help!

 Evan

 On Fri, Oct 10, 2014 at 10:40 PM, Davies Liu dav...@databricks.com wrote:

 This is an implementation detail, so it's not documented :-(

 If you think this is a blocker for you, you could create a JIRA; maybe
 it could be fixed in 1.0.3+.

 Davies

 On Fri, Oct 10, 2014 at 5:11 PM, Evan evan.sama...@gmail.com wrote:
  Thank you!  I was looking for a config variable to that end, but I was
  looking in the Spark 1.0.2 documentation, since that was the version I
  had the problem with.  Is this behavior documented in 1.0.2's
  documentation?
 
  Evan
 
  On 10/09/2014 04:12 PM, Davies Liu wrote:
 
  When you call rdd.take() or rdd.first(), it may[1] execute the job
  locally (on the driver); otherwise, all jobs are executed on the cluster.

  There is a config called `spark.localExecution.enabled` (since 1.1) to
  change this.  It is not enabled by default, so all functions will be
  executed on the cluster.  If you set this to `true`, you get the same
  behavior as in 1.0.

  [1] If it does not get enough items from the first partition, it will
  try multiple partitions at a time, and those jobs will be executed on
  the cluster.
 
  On Thu, Oct 9, 2014 at 12:14 PM, esamanas evan.sama...@gmail.com wrote:
 
  Hi,
 
  I am using PySpark and I'm trying to support both Spark 1.0.2 and 1.1.0
  with my app, which will run in yarn-client mode.  However, when I use
  'map' to run a Python lambda function over an RDD, it appears to run on
  different machines depending on the version, and this is causing
  problems.
 
  In both cases, I am using a Hadoop cluster that runs Linux on all of its
  nodes.  I am submitting my jobs from a machine running Mac OS X 10.9.
  As a reproducer, here is my script:
 
  import platform
  print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]
 
  The answer in Spark 1.1.0:
  'Linux'
 
  The answer in Spark 1.0.2:
  'Darwin'
 
  In other experiments, I changed the size of the list that gets
  parallelized, thinking maybe 1.0.2 just runs jobs on the driver node if
  they're small enough.  I got the same answer (even with 1 million
  numbers).
 
  This is a troubling difference.  I would expect all functions run on an
  RDD to be executed on my worker nodes in the Hadoop cluster, but this is
  clearly not the case for 1.0.2.  Why does this difference exist?  How
  can I accurately detect which jobs will run where?
 
  Thank you,
 
  Evan
 
 
 
 
 
 
 






Re: where are my python lambda functions run in yarn-client mode?

2014-10-10 Thread Evan
Thank you!  I was looking for a config variable to that end, but I was
looking in the Spark 1.0.2 documentation, since that was the version I had
the problem with.  Is this behavior documented in 1.0.2's documentation?


Evan


On 10/09/2014 04:12 PM, Davies Liu wrote:

When you call rdd.take() or rdd.first(), it may[1] execute the job
locally (on the driver); otherwise, all jobs are executed on the cluster.

There is a config called `spark.localExecution.enabled` (since 1.1) to
change this.  It is not enabled by default, so all functions will be
executed on the cluster.  If you set this to `true`, you get the same
behavior as in 1.0.

[1] If it does not get enough items from the first partition, it will
try multiple partitions at a time, and those jobs will be executed on
the cluster.

On Thu, Oct 9, 2014 at 12:14 PM, esamanas evan.sama...@gmail.com wrote:

Hi,

I am using PySpark and I'm trying to support both Spark 1.0.2 and 1.1.0 with
my app, which will run in yarn-client mode.  However, when I use 'map' to
run a Python lambda function over an RDD, it appears to run on different
machines depending on the version, and this is causing problems.

In both cases, I am using a Hadoop cluster that runs Linux on all of its
nodes.  I am submitting my jobs from a machine running Mac OS X 10.9.  As a
reproducer, here is my script:

import platform
print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]

The answer in Spark 1.1.0:
'Linux'

The answer in Spark 1.0.2:
'Darwin'

In other experiments, I changed the size of the list that gets parallelized,
thinking maybe 1.0.2 just runs jobs on the driver node if they're small
enough.  I got the same answer (even with 1 million numbers).

This is a troubling difference.  I would expect all functions run on an RDD
to be executed on my worker nodes in the Hadoop cluster, but this is clearly
not the case for 1.0.2.  Why does this difference exist?  How can I
accurately detect which jobs will run where?

Thank you,

Evan












Re: where are my python lambda functions run in yarn-client mode?

2014-10-10 Thread Davies Liu
This is an implementation detail, so it's not documented :-(

If you think this is a blocker for you, you could create a JIRA; maybe
it could be fixed in 1.0.3+.

Davies

On Fri, Oct 10, 2014 at 5:11 PM, Evan evan.sama...@gmail.com wrote:
 Thank you!  I was looking for a config variable to that end, but I was
 looking in the Spark 1.0.2 documentation, since that was the version I had
 the problem with.  Is this behavior documented in 1.0.2's documentation?

 Evan

 On 10/09/2014 04:12 PM, Davies Liu wrote:

 When you call rdd.take() or rdd.first(), it may[1] execute the job
 locally (on the driver); otherwise, all jobs are executed on the cluster.

 There is a config called `spark.localExecution.enabled` (since 1.1) to
 change this.  It is not enabled by default, so all functions will be
 executed on the cluster.  If you set this to `true`, you get the same
 behavior as in 1.0.

 [1] If it does not get enough items from the first partition, it will
 try multiple partitions at a time, and those jobs will be executed on
 the cluster.

 On Thu, Oct 9, 2014 at 12:14 PM, esamanas evan.sama...@gmail.com wrote:

 Hi,

 I am using PySpark and I'm trying to support both Spark 1.0.2 and 1.1.0
 with my app, which will run in yarn-client mode.  However, when I use
 'map' to run a Python lambda function over an RDD, it appears to run on
 different machines depending on the version, and this is causing problems.

 In both cases, I am using a Hadoop cluster that runs Linux on all of its
 nodes.  I am submitting my jobs from a machine running Mac OS X 10.9.  As
 a reproducer, here is my script:

 import platform
 print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]

 The answer in Spark 1.1.0:
 'Linux'

 The answer in Spark 1.0.2:
 'Darwin'

 In other experiments, I changed the size of the list that gets
 parallelized, thinking maybe 1.0.2 just runs jobs on the driver node if
 they're small enough.  I got the same answer (even with 1 million numbers).

 This is a troubling difference.  I would expect all functions run on an
 RDD to be executed on my worker nodes in the Hadoop cluster, but this is
 clearly not the case for 1.0.2.  Why does this difference exist?  How can
 I accurately detect which jobs will run where?

 Thank you,

 Evan











Re: where are my python lambda functions run in yarn-client mode?

2014-10-09 Thread Davies Liu
When you call rdd.take() or rdd.first(), it may[1] execute the job
locally (on the driver); otherwise, all jobs are executed on the cluster.

There is a config called `spark.localExecution.enabled` (since 1.1) to
change this.  It is not enabled by default, so all functions will be
executed on the cluster.  If you set this to `true`, you get the same
behavior as in 1.0.

[1] If it does not get enough items from the first partition, it will
try multiple partitions at a time, and those jobs will be executed on
the cluster.
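
For example, a minimal sketch of turning it on when building the context
(the app name here is made up):

from pyspark import SparkConf, SparkContext

# Re-enable driver-local execution of take()/first() on 1.1+, mimicking
# the 1.0 behavior.  It is off by default, so jobs go to the cluster.
conf = (SparkConf()
        .setAppName('local-execution-demo')
        .set('spark.localExecution.enabled', 'true'))
sc = SparkContext(conf=conf)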

On Thu, Oct 9, 2014 at 12:14 PM, esamanas evan.sama...@gmail.com wrote:
 Hi,

 I am using PySpark and I'm trying to support both Spark 1.0.2 and 1.1.0 with
 my app, which will run in yarn-client mode.  However, when I use 'map' to
 run a Python lambda function over an RDD, it appears to run on different
 machines depending on the version, and this is causing problems.

 In both cases, I am using a Hadoop cluster that runs Linux on all of its
 nodes.  I am submitting my jobs from a machine running Mac OS X 10.9.  As a
 reproducer, here is my script:

 import platform
 print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]

 The answer in Spark 1.1.0:
 'Linux'

 The answer in Spark 1.0.2:
 'Darwin'

 In other experiments, I changed the size of the list that gets parallelized,
 thinking maybe 1.0.2 just runs jobs on the driver node if they're small
 enough.  I got the same answer (even with 1 million numbers).

 This is a troubling difference.  I would expect all functions run on an RDD
 to be executed on my worker nodes in the Hadoop cluster, but this is clearly
 not the case for 1.0.2.  Why does this difference exist?  How can I
 accurately detect which jobs will run where?
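
 One way to make the difference visible directly (a rough sketch along the
 same lines as the reproducer above; per the explanation in this thread,
 collect() always runs the job on the cluster, while take(1) may run it
 locally on the driver):

 import platform

 probe = sc.parallelize([1], 1).map(lambda x: platform.system())
 # On 1.0.2 the take(1) below may print 'Darwin' (driver-local execution),
 # while the collect() prints 'Linux' (executed on the cluster).
 print 'take(1) saw: %s' % probe.take(1)[0]
 print 'collect() saw: %s' % probe.collect()[0]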

 Thank you,

 Evan






