Re: where are my python lambda functions run in yarn-client mode?

2014-10-11 Thread Evan Samanas
It's true that it is an implementation detail, but it's a very important
one to document, because it can change results depending on whether I use
take or collect.  The issue I was running into was that the executor had a
different operating system than the driver, and I was using 'pipe' with a
binary I compiled myself.  I needed to make sure I used the binary compiled
for the operating system I expect it to run on.  So in cases where I was
only interested in the first value, my code was breaking horribly on 1.0.2,
but working fine on 1.1.

My only suggestion would be to backport 'spark.localExecution.enabled' to
the 1.0 line.  Thanks for all your help!

Evan

On Fri, Oct 10, 2014 at 10:40 PM, Davies Liu dav...@databricks.com wrote:

 This is an implementation detail, so it's not documented :-(

 If you think this is a blocker for you, you could create a JIRA; maybe
 it could be fixed in 1.0.3+.

 Davies

 On Fri, Oct 10, 2014 at 5:11 PM, Evan evan.sama...@gmail.com wrote:
  Thank you!  I was looking for a config variable to that end, but I was
  looking in the Spark 1.0.2 documentation, since that was the version I
  had the problem with.  Is this behavior documented in 1.0.2's
  documentation?
 
  Evan
 
  On 10/09/2014 04:12 PM, Davies Liu wrote:
 
  When you call rdd.take() or rdd.first(), it may[1] execute the job
  locally (on the driver); otherwise, all jobs are executed on the cluster.

  There is a config called `spark.localExecution.enabled` (since 1.1) to
  change this.  It is not enabled by default, so all functions will be
  executed on the cluster.  If you set this to `true`, you get the same
  behavior as in 1.0.

  [1] If it does not get enough items from the first partition, it will
  try multiple partitions at a time, and those jobs will be executed on
  the cluster.
 
  On Thu, Oct 9, 2014 at 12:14 PM, esamanas evan.sama...@gmail.com wrote:
 
  Hi,
 
  I am using PySpark and I'm trying to support both Spark 1.0.2 and 1.1.0
  with my app, which will run in yarn-client mode.  However, when I use
  'map' to run a Python lambda function over an RDD, it appears to run on
  different machines depending on the version, and this is causing
  problems.
 
  In both cases, I am using a Hadoop cluster that runs Linux on all of its
  nodes.  I am submitting my jobs from a machine running Mac OS X 10.9.
  As a reproducer, here is my script:
 
  import platform
  print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]
 
  The answer in Spark 1.1.0:
  'Linux'
 
  The answer in Spark 1.0.2:
  'Darwin'
 
  In other experiments, I changed the size of the list that gets
  parallelized, thinking maybe 1.0.2 just runs jobs on the driver node if
  they're small enough.  I got the same answer (even with 1 million
  numbers).
 
  This is a troubling difference.  I would expect all functions run on an
  RDD to be executed on my worker nodes in the Hadoop cluster, but this is
  clearly not the case for 1.0.2.  Why does this difference exist?  How
  can I accurately detect which jobs will run where?
 
  Thank you,
 
  Evan
 
 
 
 
 
 
 



Re: where are my python lambda functions run in yarn-client mode?

2014-10-11 Thread Davies Liu
Created JIRA for this: https://issues.apache.org/jira/browse/SPARK-3915

On Sat, Oct 11, 2014 at 12:40 PM, Evan Samanas evan.sama...@gmail.com wrote:
 It's true that it is an implementation detail, but it's a very important
 one to document, because it can change results depending on whether I use
 take or collect.  The issue I was running into was that the executor had a
 different operating system than the driver, and I was using 'pipe' with a
 binary I compiled myself.  I needed to make sure I used the binary
 compiled for the operating system I expect it to run on.  So in cases
 where I was only interested in the first value, my code was breaking
 horribly on 1.0.2, but working fine on 1.1.

 My only suggestion would be to backport 'spark.localExecution.enabled' to
 the 1.0 line.  Thanks for all your help!

 Evan

 On Fri, Oct 10, 2014 at 10:40 PM, Davies Liu dav...@databricks.com wrote:

 This is an implementation detail, so it's not documented :-(

 If you think this is a blocker for you, you could create a JIRA; maybe
 it could be fixed in 1.0.3+.

 Davies

 On Fri, Oct 10, 2014 at 5:11 PM, Evan evan.sama...@gmail.com wrote:
  Thank you!  I was looking for a config variable to that end, but I was
  looking in the Spark 1.0.2 documentation, since that was the version I
  had the problem with.  Is this behavior documented in 1.0.2's
  documentation?
 
  Evan
 
  On 10/09/2014 04:12 PM, Davies Liu wrote:
 
  When you call rdd.take() or rdd.first(), it may[1] execute the job
  locally (on the driver); otherwise, all jobs are executed on the cluster.

  There is a config called `spark.localExecution.enabled` (since 1.1) to
  change this.  It is not enabled by default, so all functions will be
  executed on the cluster.  If you set this to `true`, you get the same
  behavior as in 1.0.

  [1] If it does not get enough items from the first partition, it will
  try multiple partitions at a time, and those jobs will be executed on
  the cluster.
 
  On Thu, Oct 9, 2014 at 12:14 PM, esamanas evan.sama...@gmail.com wrote:
 
  Hi,
 
  I am using PySpark and I'm trying to support both Spark 1.0.2 and 1.1.0
  with my app, which will run in yarn-client mode.  However, when I use
  'map' to run a Python lambda function over an RDD, it appears to run on
  different machines depending on the version, and this is causing
  problems.
 
  In both cases, I am using a Hadoop cluster that runs Linux on all of its
  nodes.  I am submitting my jobs from a machine running Mac OS X 10.9.
  As a reproducer, here is my script:
 
  import platform
  print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]
 
  The answer in Spark 1.1.0:
  'Linux'
 
  The answer in Spark 1.0.2:
  'Darwin'
 
  In other experiments, I changed the size of the list that gets
  parallelized, thinking maybe 1.0.2 just runs jobs on the driver node if
  they're small enough.  I got the same answer (even with 1 million
  numbers).
 
  This is a troubling difference.  I would expect all functions run on an
  RDD to be executed on my worker nodes in the Hadoop cluster, but this is
  clearly not the case for 1.0.2.  Why does this difference exist?  How
  can I accurately detect which jobs will run where?
 
  Thank you,
 
  Evan
 
 
 
 
 
 
 






Re: where are my python lambda functions run in yarn-client mode?

2014-10-10 Thread Evan
Thank you!  I was looking for a config variable to that end, but I was
looking in the Spark 1.0.2 documentation, since that was the version I had
the problem with.  Is this behavior documented in 1.0.2's documentation?


Evan


On 10/09/2014 04:12 PM, Davies Liu wrote:

When you call rdd.take() or rdd.first(), it may[1] execute the job
locally (on the driver); otherwise, all jobs are executed on the cluster.

There is a config called `spark.localExecution.enabled` (since 1.1) to
change this.  It is not enabled by default, so all functions will be
executed on the cluster.  If you set this to `true`, you get the same
behavior as in 1.0.

[1] If it does not get enough items from the first partition, it will
try multiple partitions at a time, and those jobs will be executed on
the cluster.

On Thu, Oct 9, 2014 at 12:14 PM, esamanas evan.sama...@gmail.com wrote:

Hi,

I am using PySpark and I'm trying to support both Spark 1.0.2 and 1.1.0 with
my app, which will run in yarn-client mode.  However, when I use 'map' to
run a Python lambda function over an RDD, it appears to run on different
machines depending on the version, and this is causing problems.

In both cases, I am using a Hadoop cluster that runs Linux on all of its
nodes.  I am submitting my jobs from a machine running Mac OS X 10.9.  As a
reproducer, here is my script:

import platform
print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]

The answer in Spark 1.1.0:
'Linux'

The answer in Spark 1.0.2:
'Darwin'

In other experiments, I changed the size of the list that gets parallelized,
thinking maybe 1.0.2 just runs jobs on the driver node if they're small
enough.  I got the same answer (even with 1 million numbers).

This is a troubling difference.  I would expect all functions run on an RDD
to be executed on my worker nodes in the Hadoop cluster, but this is clearly
not the case for 1.0.2.  Why does this difference exist?  How can I
accurately detect which jobs will run where?

Thank you,

Evan












Re: where are my python lambda functions run in yarn-client mode?

2014-10-10 Thread Davies Liu
This is an implementation detail, so it's not documented :-(

If you think this is a blocker for you, you could create a JIRA; maybe
it could be fixed in 1.0.3+.

Davies

On Fri, Oct 10, 2014 at 5:11 PM, Evan evan.sama...@gmail.com wrote:
 Thank you!  I was looking for a config variable to that end, but I was
 looking in the Spark 1.0.2 documentation, since that was the version I had
 the problem with.  Is this behavior documented in 1.0.2's documentation?

 Evan

 On 10/09/2014 04:12 PM, Davies Liu wrote:

 When you call rdd.take() or rdd.first(), it may[1] execute the job
 locally (on the driver); otherwise, all jobs are executed on the cluster.

 There is a config called `spark.localExecution.enabled` (since 1.1) to
 change this.  It is not enabled by default, so all functions will be
 executed on the cluster.  If you set this to `true`, you get the same
 behavior as in 1.0.

 [1] If it does not get enough items from the first partition, it will
 try multiple partitions at a time, and those jobs will be executed on
 the cluster.

 On Thu, Oct 9, 2014 at 12:14 PM, esamanas evan.sama...@gmail.com wrote:

 Hi,

 I am using PySpark and I'm trying to support both Spark 1.0.2 and 1.1.0
 with my app, which will run in yarn-client mode.  However, when I use
 'map' to run a Python lambda function over an RDD, it appears to run on
 different machines depending on the version, and this is causing problems.

 In both cases, I am using a Hadoop cluster that runs Linux on all of its
 nodes.  I am submitting my jobs from a machine running Mac OS X 10.9.  As
 a reproducer, here is my script:

 import platform
 print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]

 The answer in Spark 1.1.0:
 'Linux'

 The answer in Spark 1.0.2:
 'Darwin'

 In other experiments, I changed the size of the list that gets
 parallelized, thinking maybe 1.0.2 just runs jobs on the driver node if
 they're small enough.  I got the same answer (even with 1 million numbers).

 This is a troubling difference.  I would expect all functions run on an
 RDD to be executed on my worker nodes in the Hadoop cluster, but this is
 clearly not the case for 1.0.2.  Why does this difference exist?  How can
 I accurately detect which jobs will run where?

 Thank you,

 Evan











Re: where are my python lambda functions run in yarn-client mode?

2014-10-09 Thread Davies Liu
When you call rdd.take() or rdd.first(), it may[1] execute the job
locally (on the driver); otherwise, all jobs are executed on the cluster.

There is a config called `spark.localExecution.enabled` (since 1.1) to
change this.  It is not enabled by default, so all functions will be
executed on the cluster.  If you set this to `true`, you get the same
behavior as in 1.0.

[1] If it does not get enough items from the first partition, it will
try multiple partitions at a time, and those jobs will be executed on
the cluster.
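
For example, a minimal sketch of turning it on when building the context
(the app name here is made up):

from pyspark import SparkConf, SparkContext

# Re-enable driver-local execution of take()/first() on 1.1+, mimicking
# the 1.0 behavior.  It is off by default, so jobs go to the cluster.
conf = (SparkConf()
        .setAppName('local-execution-demo')
        .set('spark.localExecution.enabled', 'true'))
sc = SparkContext(conf=conf)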

On Thu, Oct 9, 2014 at 12:14 PM, esamanas evan.sama...@gmail.com wrote:
 Hi,

 I am using PySpark and I'm trying to support both Spark 1.0.2 and 1.1.0 with
 my app, which will run in yarn-client mode.  However, when I use 'map' to
 run a Python lambda function over an RDD, it appears to run on different
 machines depending on the version, and this is causing problems.

 In both cases, I am using a Hadoop cluster that runs Linux on all of its
 nodes.  I am submitting my jobs from a machine running Mac OS X 10.9.  As a
 reproducer, here is my script:

 import platform
 print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]

 The answer in Spark 1.1.0:
 'Linux'

 The answer in Spark 1.0.2:
 'Darwin'

 In other experiments, I changed the size of the list that gets parallelized,
 thinking maybe 1.0.2 just runs jobs on the driver node if they're small
 enough.  I got the same answer (even with 1 million numbers).

 This is a troubling difference.  I would expect all functions run on an RDD
 to be executed on my worker nodes in the Hadoop cluster, but this is clearly
 not the case for 1.0.2.  Why does this difference exist?  How can I
 accurately detect which jobs will run where?
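
 One way to make the difference visible directly (a rough sketch along the
 same lines as the reproducer above; per the explanation in this thread,
 collect() always runs the job on the cluster, while take(1) may run it
 locally on the driver):

 import platform

 probe = sc.parallelize([1], 1).map(lambda x: platform.system())
 # On 1.0.2 the take(1) below may print 'Darwin' (driver-local execution),
 # while the collect() prints 'Linux' (executed on the cluster).
 print 'take(1) saw: %s' % probe.take(1)[0]
 print 'collect() saw: %s' % probe.collect()[0]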

 Thank you,

 Evan






