Re: where are my python lambda functions run in yarn-client mode?
It's true that it is an implementation detail, but it's a very important one to document, because it can change results depending on whether I use take or collect. The issue I was running into was that the executor had a different operating system than the driver, and I was using 'pipe' with a binary I compiled myself. I needed to make sure I used the binary compiled for the operating system it would actually run on. So in cases where I was only interested in the first value, my code was breaking horribly on 1.0.2 but working fine on 1.1. My only suggestion would be to backport 'spark.localExecution.enabled' to the 1.0 line. Thanks for all your help!

Evan
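The workaround implied in this message can be sketched in plain Python. The helper name and binary paths below are hypothetical illustrations; the point is that the OS check runs inside the task, so it picks the right build wherever the lambda actually executes (driver or executor):

```python
# Hypothetical sketch: pick the binary that matches the OS the code is
# actually running on, not the OS of the submitting machine.
import platform

# Illustrative paths; a real app would ship both builds with the job.
BINARIES = {
    "Linux": "bin/linux/mytool",    # cluster executors
    "Darwin": "bin/osx/mytool",     # local execution on an OS X driver
}

def select_binary(os_name=None):
    """Return the binary path for the OS this process is running on.

    When called inside an RDD operation, platform.system() reports the
    executor's OS, so the matching build is chosen wherever the task lands.
    """
    if os_name is None:
        os_name = platform.system()
    try:
        return BINARIES[os_name]
    except KeyError:
        raise RuntimeError("no binary compiled for %s" % os_name)

# Inside a job this would be used as, e.g.: rdd.pipe(select_binary())
```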
Re: where are my python lambda functions run in yarn-client mode?
Created JIRA for this: https://issues.apache.org/jira/browse/SPARK-3915
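For reference, the setting under discussion can be pinned explicitly on 1.1+ (the key does not exist on the 1.0 line, which is what the backport request above is about). A sketch of a spark-defaults.conf entry; the value shown is the default described earlier in the thread:

```
# spark-defaults.conf (Spark 1.1+ only; the key is absent in 1.0.x)
# false (the default) runs take()/first() jobs on the cluster;
# true restores the 1.0-style local execution in the driver.
spark.localExecution.enabled   false
```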
Re: where are my python lambda functions run in yarn-client mode?
Thank you! I was looking for a config variable to that end, but I was looking in the Spark 1.0.2 documentation, since that was the version I had the problem with. Is this behavior documented in 1.0.2's documentation?

Evan
Re: where are my python lambda functions run in yarn-client mode?
This is an implementation detail, so it is not documented :-( If you think this is a blocker for you, you could create a JIRA; maybe it could be fixed in 1.0.3+.

Davies
Re: where are my python lambda functions run in yarn-client mode?
When you call rdd.take() or rdd.first(), it may[1] execute the job locally (in the driver); otherwise, all jobs are executed on the cluster. There is a config called `spark.localExecution.enabled` (since 1.1+) to change this. It is not enabled by default, so all functions will be executed on the cluster. If you set it to `true`, you get the same behavior as 1.0.

[1] If it does not get enough items from the first partition, it will try multiple partitions at a time, and those jobs will be executed on the cluster.

On Thu, Oct 9, 2014 at 12:14 PM, esamanas evan.sama...@gmail.com wrote:

Hi,

I am using pyspark and I'm trying to support both Spark 1.0.2 and 1.1.0 with my app, which will run in yarn-client mode. However, it appears that when I use 'map' to run a Python lambda function over an RDD, it can run on different machines depending on the Spark version, and this is causing problems. In both cases, I am using a Hadoop cluster that runs Linux on all of its nodes. I am submitting my jobs from a machine running Mac OS X 10.9. As a reproducer, here is my script:

import platform
print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]

The answer in Spark 1.1.0: 'Linux'
The answer in Spark 1.0.2: 'Darwin'

In other experiments I changed the size of the list that gets parallelized, thinking maybe 1.0.2 just runs jobs on the driver node if they're small enough. I got the same answer (with up to 1 million numbers). This is a troubling difference. I would expect all functions run on an RDD to be executed on my worker nodes in the Hadoop cluster, but this is clearly not the case for 1.0.2. Why does this difference exist? How can I accurately detect which jobs will run where?

Thank you,
Evan

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/where-are-my-python-lambda-functions-run-in-yarn-client-mode-tp16059.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
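Davies' footnote describes a two-stage strategy for take(). A plain-Python sketch of that control flow, for illustration only (this is not Spark's actual scheduler code; the partition lists and the 'driver'/'cluster' labels are stand-ins):

```python
# Illustrative sketch of the take() behavior described above, with local
# execution enabled (Spark 1.0.x, or 1.1+ with
# spark.localExecution.enabled=true): try the first partition locally in
# the driver, and only fall back to cluster jobs if that yields too few
# items. With local execution disabled, every batch runs on the cluster.

def take(partitions, n, local_execution=True):
    """Return (first n items, where each batch of partitions ran)."""
    ran_on = []
    items = []
    remaining = list(partitions)
    if local_execution and remaining:
        # Stage 1: run the first partition locally in the driver.
        items.extend(remaining.pop(0))
        ran_on.append("driver")
    # Stage 2: still short? scan further partitions as cluster jobs.
    while len(items) < n and remaining:
        items.extend(remaining.pop(0))
        ran_on.append("cluster")
    return items[:n], ran_on
```

Under this model, a take(1) that is satisfied by the first partition never leaves the driver, which is why the reproducer above prints 'Darwin' on 1.0.2 from an OS X submitter but 'Linux' on 1.1.0.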