Re: Typed dataset from Avro generated classes?

2020-04-20 Thread Elkhan Dadashov
Hi Spark users,

Did anyone resolve this issue?

Encoder<AvroGeneratedClass> encoder =
    Encoders.bean(AvroGeneratedClass.class);
Dataset<AvroGeneratedClass> ds =
    sparkSession.read().parquet(filename).as(encoder);

I'm also facing the same problem: "Cannot have circular references in bean
class, but got the circular reference of class class
org.apache.avro.Schema"

This happens due to the getSchema() method in the generated Avro Java class.

How can I get a typed dataset from Avro generated classes?

Thanks.
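One workaround I have seen suggested elsewhere (not confirmed in this thread) is to skip Encoders.bean entirely: read the Parquet file as rows and map each Row onto the Avro object with a Kryo encoder. A rough, untested sketch - the class name AvroGeneratedClass and the field "field1" are just placeholders for the real generated schema:

// Sketch of a possible workaround (untested here): skip Encoders.bean on the Avro class,
// read the Parquet file as rows, and map each Row onto the Avro object using a Kryo
// encoder. "AvroGeneratedClass" and the field "field1" are hypothetical placeholders.
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AvroDatasetSketch {
  public static Dataset<AvroGeneratedClass> load(SparkSession spark, String filename) {
    // Kryo serializes the whole object as a binary blob, so it does not trip over
    // the circular reference in org.apache.avro.Schema.
    Encoder<AvroGeneratedClass> encoder = Encoders.kryo(AvroGeneratedClass.class);
    return spark.read().parquet(filename)
        .map((MapFunction<Row, AvroGeneratedClass>) row ->
                AvroGeneratedClass.newBuilder()                          // Avro-generated builder
                    .setField1(row.getString(row.fieldIndex("field1")))  // hypothetical field
                    .build(),
            encoder);
  }
}

As far as I understand, the trade-off is that the Kryo-encoded objects are stored as an opaque binary column, so you lose Catalyst's column pruning.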

On Wed, Sep 27, 2017 at 3:23 AM Joaquin Tarraga 
wrote:

> Hi all,
>
> I have an Avro generated class (e.g., AvroGeneratedClass) and I am using
> Encoders.bean to get a typed dataset (e.g., Dataset<AvroGeneratedClass>):
>
> Encoder<AvroGeneratedClass> encoder =
>     Encoders.bean(AvroGeneratedClass.class);
>
> Dataset<AvroGeneratedClass> ds =
>     sparkSession.read().parquet(filename).as(encoder);
>
> I am getting an exception from the Encoders.bean call:
> "java.lang.UnsupportedOperationException: Cannot have circular references
> in bean class, but got the circular reference of class class
> org.apache.avro.Schema"
>
> How can I get a typed dataset from Avro generated classes?
>
> Thanks.
> --
> Joaquín
>
>

-- 

Best regards,
Elkhan Dadashov


Re: Does the delegator map task of SparkLauncher need to stay alive until Spark job finishes ?

2016-11-15 Thread Elkhan Dadashov
Thanks for the clarification, Marcelo.

On Tue, Nov 15, 2016 at 6:20 PM Marcelo Vanzin <van...@cloudera.com> wrote:

> On Tue, Nov 15, 2016 at 5:57 PM, Elkhan Dadashov <elkhan8...@gmail.com>
> wrote:
> > This is confusing in the sense that, the client needs to stay alive for
> > Spark Job to finish successfully.
> >
> > Actually the client can die  or finish (in Yarn-cluster mode), and the
> spark
> > job will successfully finish.
>
> That's an internal class, and you're looking at an internal javadoc
> that describes how the app handle works. For the app handle to be
> updated, the "client" (i.e. the sub process) needs to stay alive. So
> the javadoc is correct. It has nothing to do with whether the
> application succeeds or not.
>
>
> --
> Marcelo
>


Re: Does the delegator map task of SparkLauncher need to stay alive until Spark job finishes ?

2016-11-15 Thread Elkhan Dadashov
Hi Marcelo,

This part of the JavaDoc is confusing:

https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java


"
* In *cluster mode*, this means that the client that launches the
* application *must remain alive for the duration of the application* (or
until the app handle is
* disconnected).
*/
class LauncherServer implements Closeable {
"
This is confusing in the sense that it implies the client needs to stay alive
for the Spark job to finish successfully.

Actually the client can die or finish (in yarn-cluster mode), and the Spark
job will still finish successfully.

Yeah, the client needs to stay alive until the appHandle state is SUBMITTED (or
maybe RUNNING), but not until a final state, unless you want to query the
state of the Spark app using the appHandle.

I still do not get the meaning of the comment above.

Thanks.
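To be concrete, this is roughly what I mean by only staying alive until submission - just my own illustration, not anything from the launcher docs; the jar path and main class below are placeholders:

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

// Sketch: block only until the application has been handed over to the cluster manager,
// then let this "client" process exit (accepting that further state updates are lost).
public class SubmitAndExitSketch {
  public static void main(String[] args) throws Exception {
    SparkAppHandle handle = new SparkLauncher()
        .setAppResource("/path/to/app.jar")     // placeholder
        .setMainClass("com.example.Main")       // placeholder
        .setMaster("yarn")
        .setDeployMode("cluster")
        .startApplication();

    while (handle.getState() == SparkAppHandle.State.UNKNOWN
        || handle.getState() == SparkAppHandle.State.CONNECTED) {
      Thread.sleep(500);   // poll with a small back-off instead of busy-waiting
    }
    // At this point the state is SUBMITTED, RUNNING or already final; the JVM can exit.
  }
}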


On Tue, Oct 18, 2016 at 3:07 PM Marcelo Vanzin <van...@cloudera.com> wrote:

> On Tue, Oct 18, 2016 at 3:01 PM, Elkhan Dadashov <elkhan8...@gmail.com>
> wrote:
> > Does my map task need to wait until Spark job finishes ?
>
> No...
>
> > Or is there any way, my map task finishes after launching Spark job, and
> I
> > can still query and get status of Spark job outside of map task (or
> failure
> > reason, if it has failed) ? (maybe by querying Spark job id ?)
>
> ...but if the SparkLauncher handle goes away, then you lose the
> ability to track the app's state, unless you talk directly to the
> cluster manager.
>
> > I guess also if i want my Spark job to be killed, if corresponding
> delegator
> > map task is killed, that means my map task needs to stay alive, so i
> still
> > have SparkAppHandle reference ?
>
> Correct, unless you talk directly to the cluster manager.
>
> --
> Marcelo
>


Re: Correct SparkLauncher usage

2016-11-12 Thread Elkhan Dadashov
Hey Mohammad,

I implemented the code using CountDownLatch, and SparkLauncher works as
expected. Hope it helps.

Whenever appHandle.getState() reaches one of the final states, the
countDownLatch is decremented and execution returns to the main program.


...
final CountDownLatch countDownLatch = new CountDownLatch(1);
SparkAppListener sparkAppListener = new SparkAppListener(countDownLatch);
SparkAppHandle appHandle = sparkLauncher.startApplication(sparkAppListener);

// More generic way of the above lines:
// SparkAppHandle.Listener sparkAppListener = new SparkAppListener(countDownLatch); // use a (factory kind of) getter method
// SparkAppHandle appHandle = sparkLauncher.startApplication(sparkAppListener);
// Thread sparkAppListenerThread = new Thread((Runnable) sparkAppListener);

Thread sparkAppListenerThread = new Thread(sparkAppListener);
sparkAppListenerThread.start();

long timeout = 120;
countDownLatch.await(timeout, TimeUnit.SECONDS);
...

private static class SparkAppListener implements SparkAppHandle.Listener, Runnable {
    private static final Log log = LogFactory.getLog(SparkAppListener.class);
    private final CountDownLatch countDownLatch;

    public SparkAppListener(CountDownLatch countDownLatch) {
        this.countDownLatch = countDownLatch;
    }

    @Override
    public void stateChanged(SparkAppHandle handle) {
        String sparkAppId = handle.getAppId();
        State appState = handle.getState();
        if (sparkAppId != null) {
            log.info("Spark job with app id: " + sparkAppId + ",\tState changed to: " + appState
                    + " - " + SPARK_STATE_MSG.get(appState));
        } else {
            log.info("Spark job's state changed to: " + appState + " - " + SPARK_STATE_MSG.get(appState));
        }
        if (appState != null && appState.isFinal()) {
            countDownLatch.countDown();
        }
    }

    @Override
    public void infoChanged(SparkAppHandle handle) {}

    @Override
    public void run() {}
}


On Thu, Nov 10, 2016 at 3:08 PM Mohammad Tariq  wrote:

> Sure, will look into the tests.
>
> Thank you so much for your time!
>
>
> Tariq, Mohammad
> about.me/mti
>
> On Fri, Nov 11, 2016 at 4:35 AM, Marcelo Vanzin 
> wrote:
>
> Sorry, it's kinda hard to give any more feedback from just the info you
> provided.
>
> I'd start with some working code like this from Spark's own unit tests:
>
> https://github.com/apache/spark/blob/a8ea4da8d04c1ed621a96668118f20739145edd2/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala#L164
>
>
> On Thu, Nov 10, 2016 at 3:00 PM, Mohammad Tariq 
> wrote:
>
> All I want to do is submit a job, keep getting state updates as they change,
> and come out once the job is over. I'm sorry to be a pest with questions. I'm
> having a bit of a tough time making this work.
>
>
> Tariq, Mohammad
> about.me/mti
>
> On Fri, Nov 11, 2016 at 4:27 AM, Mohammad Tariq 
> wrote:
>
> Yeah, that definitely makes sense. I was just trying to make it work
> somehow. The problem is that it's not at all calling the listeners, hence
> I'm unable to do anything. Just wanted to cross-check it by looping inside.
> But I get the point. Thank you for that!
>
> I'm on YARN (cluster mode).
>
>
> Tariq, Mohammad
> about.me/mti
>
> On Fri, Nov 11, 2016 at 4:19 AM, Marcelo Vanzin 
> wrote:
>
> On Thu, Nov 10, 2016 at 2:43 PM, Mohammad Tariq 
> wrote:
> >   @Override
> >   public void stateChanged(SparkAppHandle handle) {
> > System.out.println("Spark App Id [" + handle.getAppId() + "]. State
> [" + handle.getState() + "]");
> > while(!handle.getState().isFinal()) {
>
> You shouldn't loop in an event handler. That's not really how
> listeners work. Instead, use the event handler to update some local
> state, or signal some thread that's waiting for the state change.
>
> Also be aware that handles currently only work in local and yarn
> modes; the state updates haven't been hooked up to standalone mode

Re: SparkDriver memory calculation mismatch

2016-11-12 Thread Elkhan Dadashov
In my particular case (to make Spark launching asynchronous), I launch a
Hadoop job which consists of only one Spark job, launched via
SparkLauncher#startApplication().

My App --- launches Map task ---> Cluster
Map Task --- SparkLauncher.startApplication() from the Map YARN container --->
new child process is spawned (SparkSubmit is this child process)

I was not sure whether, in this case, the map task configs and YARN configs
impose any restrictions on the SparkSubmit process, which is started as a
child process, because SparkLauncher#startApplication() launches SparkSubmit
as a new child process of the map YARN container.

If I understood it correctly, the driver will use Spark's default memory
config (1g) or the value specified by the user via spark.driver.memory.

I have not used Spark since version 1.5 last year; I am now transitioning
directly to 2.0 and will read about the Unified Memory Manager.

Thanks, Owen.
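In the meantime, my rough reading of how the 366.3 MB in the log could come out of the unified memory manager's defaults - a sketch only, assuming 300 MB reserved system memory, spark.memory.fraction = 0.6, and an effective -Xmx1g heap of roughly 910 MB (the exact heap figure reported by Runtime.getRuntime().maxMemory() is JVM-dependent, so the numbers are illustrative):

// Rough sketch of the Spark 2.0 unified-memory arithmetic under the assumptions above.
public class UnifiedMemorySketch {
  public static void main(String[] args) {
    double heapMb = 910.5;          // assumed effective heap for -Xmx1g (JVM-dependent)
    double reservedMb = 300.0;      // reserved system memory
    double memoryFraction = 0.6;    // spark.memory.fraction default in 2.0
    double usableMb = (heapMb - reservedMb) * memoryFraction;
    System.out.printf("storage + execution memory ~= %.1f MB%n", usableMb);  // ~366.3 MB
  }
}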



On Sat, Nov 12, 2016 at 1:40 AM Sean Owen <so...@cloudera.com> wrote:

> Indeed, you get default values if you don't specify concrete values
> otherwise. Yes, you should see the docs for the version you're using.
>
> Note that there are different configs for the new 'unified' memory manager
> since 1.6, and so some older resources may be correctly explaining the
> older 'legacy' memory manager configs.
>
> Yes all containers would have to be smaller than YARN's max allowed size.
> The driver just consumes what the driver consumes; I don't know of any
> extra 'appmaster' component.
> What do you mean by 'launched by the map task'? jobs are launched by the
> driver only.
>
> On Sat, Nov 12, 2016 at 9:14 AM Elkhan Dadashov <elkhan8...@gmail.com>
> wrote:
>
> @Sean Owen,
>
> Thanks for your reply.
>
> I put the wrong link to the blog post. Here is the correct link
> <https://www.altiscale.com/blog/tips-and-tricks-for-running-spark-on-hadoop-part-4-memory-settings/>
> which describes Spark Memory settings on Yarn. I guess they have misused
> the terms Spark driver/BlockManager, and explained memory usage of driver
> falsely.
>
> 1) Then does that mean if nothing specified, then Spark will use defaults
> specified in Spark config site
> <http://spark.apache.org/docs/latest/running-on-yarn.html> ?
>
> 2) Let me clarify, if i understood it correctly:
>
> (due to Yarn restrictions)
>
> *Yarn-cluster mode*:
> SparkAppMaster+Driver Memory < Yarn container max size allocation
> SparkExecutor Memory < Yarn container max size allocation
>
> *Yarn-client mode* (assume Spark Job is launched from the map task):
> Driver memory is independent of any Yarn properties, only limited by
> machines memory.
> SparkAppMaster Memory < Yarn container max size allocation
> SparkExecutor Memory < Yarn container max size allocation
>
> Did i get it correctly ?
>
> 3) Any resource for Spark components memory calculations for Yarn cluster
> ? (other than this site which describes default config values
> http://spark.apache.org/docs/latest/running-on-yarn.html )
>
> Thanks.
>
> On Sat, Nov 12, 2016 at 12:24 AM, Sean Owen <so...@cloudera.com> wrote:
>
> If you're pointing at the 336MB, then it's not really related any of the
> items you cite here. This is the memory managed internally by MemoryStore.
> The blog post refers to the legacy memory manager. You can see a bit of how
> it works in the code, but this is the sum of the on-heap and off-heap
> memory it can manage. See the memory config docs, however, to understand
what user-facing settings you can make; you don't really need to worry
> about this value.
>
> mapreduce settings are irrelevant to Spark.
> Spark doesn't pay attention to the YARN settings, but YARN does. It
> enforces them, yes. It is not exempt from YARN.
>
> 896MB is correct there. yarn-client mode does not ignore driver
> properties, no.
>
> On Sat, Nov 12, 2016 at 2:18 AM Elkhan Dadashov <elkhan8...@gmail.com>
> wrote:
>
> Hi,
>
> Spark website <http://spark.apache.org/docs/latest/running-on-yarn.html> 
> indicates
> default spark properties as like this:
> I did not override any properties in spark-defaults.conf file, but when I
> launch Spark in YarnClient mode:
>
> spark.driver.memory 1g
> spark.yarn.am.memory 512m
> spark.yarn.am.memoryOverhead : max(spark.yarn.am.memory * 0.10, 384m)
> spark.yarn.driver.memoryOverhead : max(spark.driver.memory * 0.10, 384m)
>
> I launch Spark job via SparkLauncher#startApplication() in *Yarn-client
> mode from the Map task of Hadoop job*.
>
> *My cluster settings*:
> yarn.scheduler.minimum-allocation-mb 256
> yarn.scheduler.maximum-allocation-mb 2048

Re: SparkDriver memory calculation mismatch

2016-11-12 Thread Elkhan Dadashov
@Sean Owen,

Thanks for your reply.

I put the wrong link to the blog post. Here is the correct link
<https://www.altiscale.com/blog/tips-and-tricks-for-running-spark-on-hadoop-part-4-memory-settings/>
which describes Spark memory settings on YARN. I guess they have misused
the terms Spark driver/BlockManager and explained the driver's memory usage
incorrectly.

1) Then does that mean if nothing specified, then Spark will use defaults
specified in Spark config site
<http://spark.apache.org/docs/latest/running-on-yarn.html> ?

2) Let me clarify, if i understood it correctly:

(due to Yarn restrictions)

*Yarn-cluster mode*:
SparkAppMaster+Driver Memory < Yarn container max size allocation
SparkExecutor Memory < Yarn container max size allocation

*Yarn-client mode* (assume Spark Job is launched from the map task):
Driver memory is independent of any Yarn properties, only limited by
machines memory.
SparkAppMaster Memory < Yarn container max size allocation
SparkExecutor Memory < Yarn container max size allocation

Did i get it correctly ?

3) Any resource for Spark components memory calculations for Yarn cluster ?
(other than this site which describes default config values
http://spark.apache.org/docs/latest/running-on-yarn.html )

Thanks.

On Sat, Nov 12, 2016 at 12:24 AM, Sean Owen <so...@cloudera.com> wrote:

If you're pointing at the 336MB, then it's not really related any of the
items you cite here. This is the memory managed internally by MemoryStore.
The blog post refers to the legacy memory manager. You can see a bit of how
it works in the code, but this is the sum of the on-heap and off-heap
memory it can manage. See the memory config docs, however, to understand
what user-facing settings you can make; you don't really need to worry
about this value.

mapreduce settings are irrelevant to Spark.
Spark doesn't pay attention to the YARN settings, but YARN does. It
enforces them, yes. It is not exempt from YARN.

896MB is correct there. yarn-client mode does not ignore driver properties,
no.

On Sat, Nov 12, 2016 at 2:18 AM Elkhan Dadashov <elkhan8...@gmail.com>
wrote:

Hi,

The Spark website <http://spark.apache.org/docs/latest/running-on-yarn.html>
indicates the default Spark properties like this. I did not override any
properties in the spark-defaults.conf file, and I launch Spark in yarn-client
mode:

spark.driver.memory 1g
spark.yarn.am.memory 512m
spark.yarn.am.memoryOverhead : max(spark.yarn.am.memory * 0.10, 384m)
spark.yarn.driver.memoryOverhead : max(spark.driver.memory * 0.10, 384m)

I launch Spark job via SparkLauncher#startApplication() in *Yarn-client
mode from the Map task of Hadoop job*.

*My cluster settings*:
yarn.scheduler.minimum-allocation-mb 256
yarn.scheduler.maximum-allocation-mb 2048
yarn.app.mapreduce.am.resource.mb 512
mapreduce.map.memory.mb 640
mapreduce.map.java.opts -Xmx400m
yarn.app.mapreduce.am.command-opts -Xmx448m

*Logs of Spark job*:

INFO Client: Verifying our application has not requested more than the
maximum memory capability of the cluster (2048 MB per container)
INFO Client: Will allocate *AM container*, with 896 MB memory including 384
MB overhead

INFO MemoryStore: MemoryStore started with capacity 366.3 MB

./application_1478727394310_0005/container_1478727394310_0005_01_02/stderr:INFO:
16/11/09 14:18:42 INFO BlockManagerMasterEndpoint: Registering block
manager :57246 with *366.3* MB RAM, BlockManagerId(driver,
, 57246)

*Questions*:
1) How is driver memory calculated ?

How did Spark decide for 366 MB for driver based on properties described
above ?

I thought the memory allocation is based on this formula (
https://www.altiscale.com/blog/spark-on-hadoop/ ):

"Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction ,where
memoryFraction=0.6, and safetyFraction=0.9. This is 1024MB x 0.6 x 0.9 =
552.96MB. However, 552.96MB is a little larger than the value as shown in
the log. This is because of the runtime overhead imposed by Scala, which is
usually around 3-7%, more or less. If you do the calculation using 982MB x
0.6 x 0.9 (982MB being approximately 4% less than 1024MB), then you will derive the
number 530.28MB, which is what is indicated in the log file after rounding
up to 530.30MB."

2) If Spark job is launched from the Map task via
SparkLauncher#startApplication() will driver memory respect
(mapreduce.map.memory.mb and mapreduce.map.java.opts) OR
(yarn.scheduler.maximum-allocation-mb) when launching Spark Job as child
process ?

The confusion is that SparkSubmit is a new JVM process - it is launched as a
child process of the map task and does not depend on YARN configs. But if it
does not obey any limits, that will make things tricky for the NodeManager
when reporting memory usage back.

3) Is this the correct formula for calculating AM memory?

For the AM it matches this formula (
https://www.altiscale.com/blog/spark-on-hadoop/ ): how much memory to
allocate to the AM: amMemory + amMemoryOverhead

Exception not failing Python applications (in yarn-client mode) - SparkLauncher says app succeeded, whereas the app actually failed

2016-11-11 Thread Elkhan Dadashov
Hi,

*Problem*:
Spark job fails, but RM page says the job succeeded, also

appHandle = sparkLauncher.startApplication()
...

appHandle.getState() returns Finished state - which indicates The
application finished with a successful status, whereas the Spark job
actually failed.

*Environment*: Macintosh (El Capitan), Hadoop 2.7.2, Spark 2.0,
SparkLauncher 2.0.1

I have Spark job (pagerank.py) running in yarn-client mode.

*Reason of job failure*: The job fails because dependency package
pagerank.zip is missing.

*Related Jira (which indicate that bug is fixed)*:
https://issues.apache.org/jira/browse/SPARK-7736 - this was in Yarn-cluster
mode, now i face this issue in yarn-client mode.
https://issues.apache.org/jira/browse/SPARK-9416 (duplicate)

I faced same issue last year with SparkLauncher (spark-launcher_2.11) 1.4.0
version, then Marcelo had pull request which fixed the issue, and it was
working at that time (after Marcelo's fix) for yarn-cluster mode.

*Description*:
I'm launching Spark job via SparkLauncher#startApplication(),
1) in the RM page, it says the job succeeded, even though the Spark job has
failed.
2) in the container logs, i see that appHandle.getState() returned Finished
state - which also means The application finished with a successful status.

But in the same map container log lines I see that *the job actually
failed* (I launched the Spark job from the map task):

493 INFO: ImportError: ('No module named pagerank', , ('pagerank',))
557 INFO: ImportError: ('No module named pagerank', , ('pagerank',))
591 INFO: ImportError: ('No module named pagerank', , ('pagerank',))
655 INFO: ImportError: ('No module named pagerank', , ('pagerank',))
659 INFO: 16/11/11 18:25:37 ERROR TaskSetManager: Task 0 in stage 0.0
failed 4 times; aborting job
665 INFO: 16/11/11 18:25:37 INFO DAGScheduler: ShuffleMapStage 0 (distinct
at
/private/var/folders/9x/4m9lx2wj4qd8vwwq8n_qb2vx7mkj6g/T/hadoop/hadoop/nm-local-dir/usercache//appcache/application_1478901028064_0016/container_1478901028064_0016_01_02/pag
erank.py:52) failed in 3.221 s
667 INFO: 16/11/11 18:25:37 INFO DAGScheduler: *Job 0 failed*: collect at
/private/var/folders/9x/4m9lx2wj4qd8vwwq8n_qb2vx7mkj6g/T/hadoop/hadoop/nm-local-dir/usercache//appcache/application_1478901028064_0016/container_1478901028064_0016_01_02/pagerank.
py:68, took 3.303328 s
681 INFO: py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
683 INFO: : org.apache.spark.SparkException: *Job aborted due to stage
failure*: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost
task 0.3 in stage 0.0 (TID 3, ):
org.apache.spark.api.python.PythonException: Traceback (most recent call
last):
705 INFO: ImportError: ('No module named pagerank', , ('pagerank',))
745 INFO: at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
757 INFO: at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
759 INFO: at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
763 INFO: at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
841 INFO: ImportError: ('No module named pagerank', , ('pagerank',))

887 INFO: Spark job with app id: application_1478901028064_0017, *State
changed to: FINISHED* - The application finished with a successful status.

And here are the log lines from the Spark job container:
16/11/11 18:25:37 ERROR Executor: Exception in task 0.2 in stage 0.0 (TID 2)
org.apache.spark.api.python.PythonException: Traceback (most recent call
last):
File
"/var/folders/9x/4m9lx2wj4qd8vwwq8n_qb2vx7mkj6g/T/hadoop/hadoop/nm-local-dir/usercache//appcache/application_1478901028064_0017/container_1478901028064_0017_01_02/pyspark.zip/pyspark/worker.py",
line 161, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File
"/var/folders/9x/4m9lx2wj4qd8vwwq8n_qb2vx7mkj6g/T/hadoop/hadoop/nm-local-dir/usercache//appcache/application_1478901028064_0017/container_1478901028064_0017_01_02/pyspark.zip/pyspark/worker.py",
line 54, in read_command
command = serializer._read_with_length(file)
File
"/var/folders/9x/4m9lx2wj4qd8vwwq8n_qb2vx7mkj6g/T/hadoop/hadoop/nm-local-dir/usercache//appcache/application_1478901028064_0017/container_1478901028064_0017_01_02/pyspark.zip/pyspark/serializers.py",
line 164, in _read_with_length
return self.loads(obj)
File
"/var/folders/9x/4m9lx2wj4qd8vwwq8n_qb2vx7mkj6g/T/hadoop/hadoop/nm-local-dir/usercache//appcache/application_1478901028064_0017/container_1478901028064_0017_01_02/pyspark.zip/pyspark/serializers.py",
line 422, in loads
return pickle.loads(obj)
File

SparkDriver memory calculation mismatch

2016-11-11 Thread Elkhan Dadashov
Hi,

The Spark website <http://spark.apache.org/docs/latest/running-on-yarn.html>
indicates the default Spark properties like this. I did not override any
properties in the spark-defaults.conf file, and I launch Spark in yarn-client
mode:

spark.driver.memory 1g
spark.yarn.am.memory 512m
spark.yarn.am.memoryOverhead : max(spark.yarn.am.memory * 0.10, 384m)
spark.yarn.driver.memoryOverhead : max(spark.driver.memory * 0.10, 384m)

I launch Spark job via SparkLauncher#startApplication() in *Yarn-client
mode from the Map task of Hadoop job*.

*My cluster settings*:
yarn.scheduler.minimum-allocation-mb 256
yarn.scheduler.maximum-allocation-mb 2048
yarn.app.mapreduce.am.resource.mb 512
mapreduce.map.memory.mb 640
mapreduce.map.java.opts -Xmx400m
yarn.app.mapreduce.am.command-opts -Xmx448m

*Logs of Spark job*:

INFO Client: Verifying our application has not requested more than the
maximum memory capability of the cluster (2048 MB per container)
INFO Client: Will allocate *AM container*, with 896 MB memory including 384
MB overhead

INFO MemoryStore: MemoryStore started with capacity 366.3 MB

./application_1478727394310_0005/container_1478727394310_0005_01_02/stderr:INFO:
16/11/09 14:18:42 INFO BlockManagerMasterEndpoint: Registering block
manager :57246 with *366.3* MB RAM, BlockManagerId(driver,
, 57246)

*Questions*:
1) How is driver memory calculated ?

How did Spark decide for 366 MB for driver based on properties described
above ?

I thought the memory allocation is based on this formula (
https://www.altiscale.com/blog/spark-on-hadoop/ ):

"Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction ,where
memoryFraction=0.6, and safetyFraction=0.9. This is 1024MB x 0.6 x 0.9 =
552.96MB. However, 552.96MB is a little larger than the value as shown in
the log. This is because of the runtime overhead imposed by Scala, which is
usually around 3-7%, more or less. If you do the calculation using 982MB x
0.6 x 0.9 (982MB being approximately 4% less than 1024MB), then you will derive the
number 530.28MB, which is what is indicated in the log file after rounding
up to 530.30MB."

2) If Spark job is launched from the Map task via
SparkLauncher#startApplication() will driver memory respect
(mapreduce.map.memory.mb and mapreduce.map.java.opts) OR
(yarn.scheduler.maximum-allocation-mb) when launching Spark Job as child
process ?

The confusion is that SparkSubmit is a new JVM process - it is launched as a
child process of the map task and does not depend on YARN configs. But if it
does not obey any limits, that will make things tricky for the NodeManager
when reporting memory usage back.

3) Is this correct formula for calculating AM memory ?

For AM it matches to this formula calculation (
https://www.altiscale.com/blog/spark-on-hadoop/ ):how much memory to
allocate to the AM: amMemory + amMemoryOverhead
amMemoryOverhead is set to 384MB via spark.yarn.driver.memoryOverhead.
args.amMemory is fixed at 512MB by Spark when it’s running in yarn-client
mode. Adding 384MB of overhead to 512MB provides the 896MB figure requested
by Spark.

4) For Spark Yarn-client mode, are all spark.driver properties ignored, and
only spark.yarn.am properties used ?

Thanks.


appHandle.kill(), SparkSubmit Process, JVM questions related to SparkLauncher design and Spark Driver

2016-11-11 Thread Elkhan Dadashov
Few more questions to Marcelo.

Sorry Marcelo, for very long question list. I'd really appreciate your kind
help and answer to these questions in order to fully understand design
decision and architecture you have in mind while implementing very helpful
SparkLauncher.

*Scenario*: Spark job is launched via SparkLauncher#startApplication()
inside the Map task in Cluster. So i launch Hadoop Map only job, and inside
that Map job I launch Spark job.

These are the related processes launched when Spark job is launched from
the map task:

40588 YarnChild (this is map task container process)

40550 MRAppMaster(this is MR APP MASTER container)


*Spark Related processes:*

40602 SparkSubmit

40875 CoarseGrainedExecutorBackend

40846 CoarseGrainedExecutorBackend
40815 ExecutorLauncher

When the Spark app is started via SparkLauncher#startApplication(), the Spark
driver (inside SparkSubmit) is started as a child process - a new JVM process.

1) This child process lives outside the map task's YARN container and JVM
process, but on the same machine, right?
The child process (SparkSubmit) will have its own JVM, right?

As shown in the process list above, SparkSubmit is a separate process.

2) Since everything is external to the map task's JVM - the Spark app/driver
(inside SparkSubmit) runs in its own JVM on the same machine where the map
container is running - you use the Process API, which offers the destroy()
and destroyForcibly() methods that apply the appropriate platform-specific
process-stopping procedures.

*In order to keep the parent-child process tie, and make sure the child
process dies when the parent process dies or is killed (even not gracefully),
you used this technique* (see the sketch below):

You created a thread with a server-side socket in accept mode on the parent,
listening on some port. When the child starts, that port number is passed to
it as a parameter (environment variable). The child creates a thread and
opens a connection to that socket, and the thread sits on the socket forever.
If the connection ever drops, the child exits.

Marcelo, *please correct me if I am wrong*. Is this how you make sure the
child process is also killed when the parent process is killed?
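For my own understanding, here is a minimal sketch of that parent/child tie as I described it above - purely illustrative, not the actual LauncherServer code:

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Illustrative sketch of the parent/child liveness tie described above.
// parentSide() and childSide() would run in two different JVM processes.
public class LivenessTieSketch {

  // Parent: open a server socket and hand the port to the child (e.g. as an env variable).
  static ServerSocket parentSide() throws IOException {
    ServerSocket server = new ServerSocket(0);   // ephemeral port
    new Thread(() -> {
      try { server.accept().getInputStream().read(); } catch (IOException ignored) { }
    }, "child-connection").start();
    return server;                               // port = server.getLocalPort()
  }

  // Child: connect to the parent's port and block; if the connection drops, exit.
  static void childSide(int parentPort) {
    new Thread(() -> {
      try (Socket s = new Socket("localhost", parentPort)) {
        s.getInputStream().read();               // blocks until the parent disappears
      } catch (IOException ignored) { }
      System.exit(1);                            // parent gone -> child exits too
    }, "parent-liveness-watcher").start();
  }
}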

3) Let's say I kill the map task forcefully, or use the Hadoop client to kill
the job by jobId, where the map task spawned the Spark job using
sparkLauncher.startApplication().

a) The Spark driver (SparkSubmit process) will also be killed, right?
Even if the code never gets a chance to call appHandle.stop() and
appHandle.kill(), the child process will die too because of the parent-child
relationship I described above. Is this correct?

b) Assuming (3a) is correct and the driver was killed due to the parent-child
relationship, *without* appHandle.stop() or appHandle.kill() being executed,
will the executors clean up the environment (remove temp files) before
stopping?

4) To add another level of safety, is it a good idea to attach a shutdown hook
(Runtime.getRuntime().addShutdownHook(new ShutdownThread());) to the map task,
and inside it call these 2 functions (a sketch follows below):

 appHandle.stop();
 appHandle.kill();
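Roughly what I have in mind for (4), as a sketch only - and I realize a shutdown hook runs on normal exit or SIGTERM, but not when the container is killed with SIGKILL (kill -9), so it would be best-effort:

import java.io.IOException;
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

// Sketch: register a JVM shutdown hook in the map task that tries to stop, then kill,
// the Spark app launched via SparkLauncher.
public class KillOnShutdownSketch {
  public static SparkAppHandle launch(SparkLauncher sparkLauncher) throws IOException {
    final SparkAppHandle appHandle = sparkLauncher.startApplication();
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
      try {
        appHandle.stop();   // try a clean stop first
      } catch (Exception ignored) {
        // fall through to the hard kill
      }
      appHandle.kill();     // forcefully kill if the clean stop did not work
    }));
    return appHandle;
  }
}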

Thanks.

P.S: *In the thread below you will find the design decisions behind the
appHandle.kill() implementation, explained by Marcelo (thanks a lot) - which
is interesting to know.*

On Thu, Nov 10, 2016 at 9:22 AM Marcelo Vanzin <van...@cloudera.com> wrote:

> Hi Elkhan,
>
> I'd prefer if these questions were asked in the mailing list.
>
> The launcher code cannot call YARN APIs directly because Spark
> supports more than just YARN. So its API and implementation has to be
> cluster-agnostic.
>
> As for kill, this is what the docs say:
>
> """
> This will not send a {@link #stop()} message to the application, so
> it's recommended that users first try to
> stop the application cleanly and only resort to this method if that fails.
> """
>
> So if you want to stop the application first, call stop().
>
>
> On Thu, Nov 10, 2016 at 12:55 AM, Elkhan Dadashov <elkhan8...@gmail.com>
> wrote:
> > Hi Marcelo,
> >
> > I have few more questions related to SparkLauncher. Will be glad and
> > thankful if you could answer them.
> >
> > It seems SparkLauncher Yarn-client or Yarn-Cluster deploy mode does not
> > matter much, as even in yarn-cluster mode  the client that launches the
> > application must remain alive for the duration of the application (or
> until
> > the app handle is disconnected) which is described in LauncherServer.java
> > JavaDoc.
> >
> > 1) In yarn-cluster mode, if the client dies, then will only the appHandle
> > will be lost, or the Spark application will also die ?
> >
> > 2) May i know why did you prefer implementing appHandle.kill() with
> killing
> > process instead of let's say :
> >
> > a) yarn application -kill 
> > b) ./bin/spark-class org.apache.spark.deploy.Client kill 
> >
> > 3) If Spark Driver is killed (by killing the process, not gracefully),
> will
> > Executors clean the environment (remove temp files) ?
> >
> > Thanks a lot.
> >
> >
>
>
>
> --
> Marcelo
>


Re: SparkLauncher 2.0.1 version working inconsistently in yarn-client mode

2016-11-10 Thread Elkhan Dadashov
Thanks Marcelo.

I changed the code using CountDownLatch, and it works as expected.

...
final CountDownLatch countDownLatch = new CountDownLatch(1);
SparkAppListener sparkAppListener = new SparkAppListener(countDownLatch);
SparkAppHandle appHandle = sparkLauncher.startApplication(sparkAppListener);
Thread sparkAppListenerThread = new Thread(sparkAppListener);
sparkAppListenerThread.start();

long timeout = 120;
countDownLatch.await(timeout, TimeUnit.SECONDS);
...

private static class SparkAppListener implements SparkAppHandle.Listener, Runnable {
    private static final Log log = LogFactory.getLog(SparkAppListener.class);
    private final CountDownLatch countDownLatch;

    public SparkAppListener(CountDownLatch countDownLatch) {
        this.countDownLatch = countDownLatch;
    }

    @Override
    public void stateChanged(SparkAppHandle handle) {
        String sparkAppId = handle.getAppId();
        State appState = handle.getState();
        if (sparkAppId != null) {
            log.info("Spark job with app id: " + sparkAppId + ",\tState changed to: " + appState
                    + " - " + SPARK_STATE_MSG.get(appState));
        } else {
            log.info("Spark job's state changed to: " + appState + " - " + SPARK_STATE_MSG.get(appState));
        }
        if (appState != null && appState.isFinal()) {
            countDownLatch.countDown();
        }
    }

    @Override
    public void infoChanged(SparkAppHandle handle) {}

    @Override
    public void run() {}
}


On Mon, Nov 7, 2016 at 9:46 AM Marcelo Vanzin <van...@cloudera.com> wrote:

> On Sat, Nov 5, 2016 at 2:54 AM, Elkhan Dadashov <elkhan8...@gmail.com>
> wrote:
> > while (appHandle.getState() == null || !appHandle.getState().isFinal()) {
> > if (appHandle.getState() != null) {
> > log.info("while: Spark job state is : " + appHandle.getState());
> > if (appHandle.getAppId() != null) {
> > log.info("\t App id: " + appHandle.getAppId() + "\tState: "
> +
> > appHandle.getState());
> > }
> > }
> > }
>
> This is a ridiculously expensive busy loop, even more so if you
> comment out the log lines. Use listeners, or at least sleep a little
> bit every once in a while. You're probably starving other processes /
> threads of cpu.
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
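For completeness, the other quick fix Marcelo mentions (sleeping between checks instead of spinning) would look roughly like this sketch; the listener + CountDownLatch approach above is still what I ended up using:

import java.util.concurrent.TimeUnit;
import org.apache.spark.launcher.SparkAppHandle;

// Sketch: keep the polling loop but back off between checks instead of spinning,
// so other threads are not starved of CPU. Listeners remain the better option.
class PollWithBackoffSketch {
  static void waitForFinalState(SparkAppHandle appHandle) throws InterruptedException {
    while (appHandle.getState() == null || !appHandle.getState().isFinal()) {
      TimeUnit.MILLISECONDS.sleep(500);
    }
  }
}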
>


SparkLauncher 2.0.1 version working inconsistently in yarn-client mode

2016-11-05 Thread Elkhan Dadashov
Hi,

I'm running Spark 2.0.1 with SparkLauncher 2.0.1 on a YARN cluster. I launch a
map task which spawns a Spark job via SparkLauncher#startApplication().

Deploy mode is yarn-client. I'm running on a Mac laptop.

I have this snippet of code:

SparkAppHandle appHandle = sparkLauncher.startApplication();
while (appHandle.getState() == null || !appHandle.getState().isFinal()) {
    if (appHandle.getState() != null) {
        *log.info("while: Spark job state is : " + appHandle.getState());*
        if (appHandle.getAppId() != null) {
            log.info("\t App id: " + appHandle.getAppId() + "\tState: " + appHandle.getState());
        }
    }
}

The above snippet of code works fine, both spark job and the map task which
spawns that Spark job successfully completes.

But if I comment out the highlighted log.info line (marked with asterisks
above), the Spark job still launches and finishes successfully, but the map
task hangs for a while (in the Running state) and then fails with the
exception below.

I run the exact same code in the exact same environment, except with that one
line commented out.

When the highlighted line is commented out, I don't even see the 2nd log line
in the stderr; it seems the appHandle hook never returns anything back
(neither app id nor app state), even though the Spark application starts,
runs and finishes successfully. Inside the same stderr, I can see the Spark
job related logs, the Spark job results printed, and the application report
indicating the status.

You can see the exception below (this is from the stderr of the mapper
container which launches Spark job):
---

INFO: Communication exception: java.net.ConnectException: Call From
/10.3.8.118 to :53567 failed on connection
exception: java.net.ConnectException: Connection refused;

Caused by: java.net.ConnectException: Connection refused

at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)

at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)

at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)

at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)

at
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)

at
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)

at
org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)

at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)

at org.apache.hadoop.ipc.Client.call(Client.java:1451)

... 5 more

---

Nov 05, 2016 2:41:54 AM org.apache.hadoop.ipc.Client handleConnectionFailure

INFO: Retrying connect to server: /10.3.8.118:53567. Already
tried 9 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
MILLISECONDS)

Nov 05, 2016 2:41:54 AM org.apache.hadoop.mapred.Task run

INFO: Communication exception: java.net.ConnectException: Call From
/10.3.8.118 to :53567 failed on connection
exception: java.net.ConnectException: Connection refused; For more details
see:  http://wiki.apache.org/hadoop/ConnectionRefused

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)

at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)

at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)

at org.apache.hadoop.ipc.Client.call(Client.java:1479)

at org.apache.hadoop.ipc.Client.call(Client.java:1412)

at
org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:242)

at com.sun.proxy.$Proxy9.ping(Unknown Source)

at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:767)

at java.lang.Thread.run(Thread.java:745)

Caused by: java.net.ConnectException: Connection refused

at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)

at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)

at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)

at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)

at
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)

at
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)

at
org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)

at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)

at org.apache.hadoop.ipc.Client.call(Client.java:1451)

... 5 more

---

Nov 05, 2016 2:41:54 AM org.apache.hadoop.mapred.Task 

Re: Can i get callback notification on Spark job completion ?

2016-10-28 Thread Elkhan Dadashov
Hi Marcelo,

Thanks for the reply.

But that means the SparkAppHandle needs to stay alive until the Spark job
completes.

In my case I launch the Spark job from the delegator map task in the cluster.
That means the map task container needs to stay alive and wait until the
Spark job completes.

But if the map task finishes before the Spark job finishes, the SparkLauncher
will go away. If the SparkLauncher handle goes away, then I lose the ability
to track the app's state, right?

I'm investigating whether there is a way to know about Spark job completion
(without the Spark Job History Server) in an asynchronous manner.

Like getting a callback on MapReduce job completion - a notification instead
of polling.

*[Another option]*
According to Spark docs
<http://spark.apache.org/docs/latest/monitoring.html#rest-api> Spark
Metrics can be configured with different sinks.

I wonder whether it is possible to determine the job completion status from
these metrics.

Do Spark metrics provide job state information for each job id too ?

On Fri, Oct 28, 2016 at 11:05 AM Marcelo Vanzin <van...@cloudera.com> wrote:

If you look at the "startApplication" method it takes listeners as
parameters.

On Fri, Oct 28, 2016 at 10:23 AM, Elkhan Dadashov <elkhan8...@gmail.com>
wrote:
> Hi,
>
> I know that we can use SparkAppHandle (introduced in SparkLauncher version
>>=1.6), and lt the delegator map task stay alive until the Spark job
> finishes. But i wonder, if this can be done via callback notification
> instead of polling.
>
> Can i get callback notification on Spark job completion ?
>
> Similar to Hadoop, get a callback on MapReduce job completion - getting a
> notification instead of polling.
>
> At job completion, an HTTP request will be sent to
> “job.end.notification.url” value. Can be retrieved from notification URL
> both the JOB_ID and JOB_STATUS.
>
> ...
> Configuration conf = this.getConf();
> // Set the callback parameters
> conf.set("job.end.notification.url",
> "
https://hadoopi.wordpress.com/api/hadoop/notification/$jobId?status=$jobStatus
");
> ...
> // Submit your job in background
> job.submit();
>
> At job completion, an HTTP request will be sent to
> “job.end.notification.url” value:
>
> https://
/api/hadoop/notification/job_1379509275868_0002?status=SUCCEEDED
>
> Reference:
>
https://hadoopi.wordpress.com/2013/09/18/hadoop-get-a-callback-on-mapreduce-job-completion/
>
> Thanks.



--
Marcelo


Can i get callback notification on Spark job completion ?

2016-10-28 Thread Elkhan Dadashov
Hi,

I know that we can use SparkAppHandle (introduced in SparkLauncher version
>= 1.6) and let the delegator map task stay alive until the Spark job
finishes. But I wonder if this can be done via a callback notification
instead of polling.

Can i get callback notification on Spark job completion ?

Similar to Hadoop, get a callback on MapReduce job completion - getting a
notification instead of polling.

At job completion, an HTTP request will be sent to the
“job.end.notification.url” value. Both the JOB_ID and JOB_STATUS can be
retrieved from the notification URL.

...
Configuration conf = this.getConf();
// Set the callback parameters
conf.set("job.end.notification.url",
    "https://hadoopi.wordpress.com/api/hadoop/notification/$jobId?status=$jobStatus");
...
// Submit your job in background
job.submit();

At job completion, an HTTP request will be sent to
“job.end.notification.url” value:

https://
/api/hadoop/notification/job_1379509275868_0002?status=SUCCEEDED

Reference:
https://hadoopi.wordpress.com/2013/09/18/hadoop-get-a-callback-on-mapreduce-job-completion/


Thanks.
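The closest thing to a callback I can think of today is to register a SparkAppHandle.Listener and fire the HTTP notification myself when the state becomes final. A rough sketch - the notification endpoint is hypothetical, and this still requires the launcher JVM to stay alive:

import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

// Sketch: emulate Hadoop's job.end.notification.url by calling a URL from a launcher-side
// listener once the Spark app reaches a final state. The endpoint below is hypothetical.
public class NotifyOnCompletionSketch {
  public static void launch(SparkLauncher launcher) throws Exception {
    launcher.startApplication(new SparkAppHandle.Listener() {
      @Override
      public void stateChanged(SparkAppHandle handle) {
        if (handle.getState() != null && handle.getState().isFinal()) {
          notifyUrl("https://example.com/api/spark/notification/"
              + handle.getAppId() + "?status=" + handle.getState());
        }
      }
      @Override
      public void infoChanged(SparkAppHandle handle) { }
    });
  }

  private static void notifyUrl(String url) {
    try {
      HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
      conn.setRequestMethod("GET");
      conn.getResponseCode();   // fire the notification; ignore the response body
      conn.disconnect();
    } catch (Exception ignored) {
      // best-effort notification
    }
  }
}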


Re: Does the delegator map task of SparkLauncher need to stay alive until Spark job finishes ?

2016-10-28 Thread Elkhan Dadashov
I figured out that the job id returned from sparkAppHandle.getAppId() is a
unique ApplicationId, which looks like this:

For local-mode Spark: local-1477184581895
For distributed Spark mode: application_1477504900821_0005

ApplicationId represents the globally unique identifier for an application.

The globally unique nature of the identifier is achieved by using the cluster
timestamp, i.e. the start time of the ResourceManager, along with a
monotonically increasing counter for the application.
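So "talking directly to the cluster manager" with that id could look roughly like this - a sketch using the Hadoop YARN client API, assuming the cluster's YARN configuration is available on the classpath:

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.ConverterUtils;

// Sketch: given the id returned by sparkAppHandle.getAppId()
// (e.g. "application_1477504900821_0005"), ask the ResourceManager for the app's
// status even after the SparkLauncher handle is gone.
public class YarnStatusSketch {
  public static void printStatus(String appIdString) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();
    try {
      ApplicationId appId = ConverterUtils.toApplicationId(appIdString);
      ApplicationReport report = yarnClient.getApplicationReport(appId);
      System.out.println("State: " + report.getYarnApplicationState()
          + ", final status: " + report.getFinalApplicationStatus());
    } finally {
      yarnClient.stop();
    }
  }
}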




On Sat, Oct 22, 2016 at 5:18 PM Elkhan Dadashov <elkhan8...@gmail.com>
wrote:

> I found answer regarding logging in the JavaDoc of SparkLauncher:
>
> "Currently, all applications are launched as child processes. The child's
> stdout and stderr are merged and written to a logger (see
> java.util.logging)."
>
> One last question. sparkAppHandle.getAppId() - does this function
> return org.apache.hadoop.mapred.*JobID* which makes it easy tracking in
> Yarn ? Or is appId just the Spark app name we assign ?
>
> If it is JobID, then even if the SparkLauncher handle goes away, by
> talking directly to the cluster manager, i can get Job details.
>
> Thanks.
>
> On Sat, Oct 22, 2016 at 4:53 PM Elkhan Dadashov <elkhan8...@gmail.com>
> wrote:
>
> Thanks, Marcelo.
>
> One more question regarding getting logs.
>
> In previous implementation of SparkLauncer we could read logs from :
>
> sparkLauncher.getInputStream()
> sparkLauncher.getErrorStream()
>
> What is the recommended way of getting logs and logging of Spark execution
> while using sparkLauncer#startApplication() ?
>
> Thanks.
>
> On Tue, Oct 18, 2016 at 3:07 PM Marcelo Vanzin <van...@cloudera.com>
> wrote:
>
> On Tue, Oct 18, 2016 at 3:01 PM, Elkhan Dadashov <elkhan8...@gmail.com>
> wrote:
> > Does my map task need to wait until Spark job finishes ?
>
> No...
>
> > Or is there any way, my map task finishes after launching Spark job, and
> I
> > can still query and get status of Spark job outside of map task (or
> failure
> > reason, if it has failed) ? (maybe by querying Spark job id ?)
>
> ...but if the SparkLauncher handle goes away, then you lose the
> ability to track the app's state, unless you talk directly to the
> cluster manager.
>
> > I guess also if i want my Spark job to be killed, if corresponding
> delegator
> > map task is killed, that means my map task needs to stay alive, so i
> still
> > have SparkAppHandle reference ?
>
> Correct, unless you talk directly to the cluster manager.
>
> --
> Marcelo
>
>


Re: Does the delegator map task of SparkLauncher need to stay alive until Spark job finishes ?

2016-10-22 Thread Elkhan Dadashov
I found the answer regarding logging in the JavaDoc of SparkLauncher:

"Currently, all applications are launched as child processes. The child's
stdout and stderr are merged and written to a logger (see
java.util.logging)."

One last question: sparkAppHandle.getAppId() - does this function return an
org.apache.hadoop.mapred.*JobID*, which would make it easy to track in YARN?
Or is the appId just the Spark app name we assign?

If it is a JobID, then even if the SparkLauncher handle goes away, I can get
the job details by talking directly to the cluster manager.

Thanks.

On Sat, Oct 22, 2016 at 4:53 PM Elkhan Dadashov <elkhan8...@gmail.com>
wrote:

> Thanks, Marcelo.
>
> One more question regarding getting logs.
>
> In previous implementation of SparkLauncer we could read logs from :
>
> sparkLauncher.getInputStream()
> sparkLauncher.getErrorStream()
>
> What is the recommended way of getting logs and logging of Spark execution
> while using sparkLauncer#startApplication() ?
>
> Thanks.
>
> On Tue, Oct 18, 2016 at 3:07 PM Marcelo Vanzin <van...@cloudera.com>
> wrote:
>
> On Tue, Oct 18, 2016 at 3:01 PM, Elkhan Dadashov <elkhan8...@gmail.com>
> wrote:
> > Does my map task need to wait until Spark job finishes ?
>
> No...
>
> > Or is there any way, my map task finishes after launching Spark job, and
> I
> > can still query and get status of Spark job outside of map task (or
> failure
> > reason, if it has failed) ? (maybe by querying Spark job id ?)
>
> ...but if the SparkLauncher handle goes away, then you lose the
> ability to track the app's state, unless you talk directly to the
> cluster manager.
>
> > I guess also if i want my Spark job to be killed, if corresponding
> delegator
> > map task is killed, that means my map task needs to stay alive, so i
> still
> > have SparkAppHandle reference ?
>
> Correct, unless you talk directly to the cluster manager.
>
> --
> Marcelo
>
>


Re: Does the delegator map task of SparkLauncher need to stay alive until Spark job finishes ?

2016-10-22 Thread Elkhan Dadashov
Thanks, Marcelo.

One more question regarding getting logs.

In the previous implementation of SparkLauncher we could read logs from:

sparkLauncher.getInputStream()
sparkLauncher.getErrorStream()

What is the recommended way of getting logs and logging of Spark execution
while using SparkLauncher#startApplication()?

Thanks.

On Tue, Oct 18, 2016 at 3:07 PM Marcelo Vanzin <van...@cloudera.com> wrote:

> On Tue, Oct 18, 2016 at 3:01 PM, Elkhan Dadashov <elkhan8...@gmail.com>
> wrote:
> > Does my map task need to wait until Spark job finishes ?
>
> No...
>
> > Or is there any way, my map task finishes after launching Spark job, and
> I
> > can still query and get status of Spark job outside of map task (or
> failure
> > reason, if it has failed) ? (maybe by querying Spark job id ?)
>
> ...but if the SparkLauncher handle goes away, then you lose the
> ability to track the app's state, unless you talk directly to the
> cluster manager.
>
> > I guess also if i want my Spark job to be killed, if corresponding
> delegator
> > map task is killed, that means my map task needs to stay alive, so i
> still
> > have SparkAppHandle reference ?
>
> Correct, unless you talk directly to the cluster manager.
>
> --
> Marcelo
>


Does the delegator map task of SparkLauncher need to stay alive until Spark job finishes ?

2016-10-18 Thread Elkhan Dadashov
Hi,

Does the delegator map task of SparkLauncher need to stay alive until Spark
job finishes ?

1)
Currently, I have mapper tasks which launch a Spark job via
SparkLauncher#startApplication().

Does my map task need to wait until Spark job finishes ?

Or is there any way, my map task finishes after launching Spark job, and I
can still query and get status of Spark job outside of map task (or failure
reason, if it has failed) ? (maybe by querying Spark job id ?)

2)
I guess also if i want my Spark job to be killed, if corresponding
delegator map task is killed, that means my map task needs to stay alive,
so i still have SparkAppHandle reference ?

Thanks.


Re: How does the # of tasks affect # of threads?

2015-08-04 Thread Elkhan Dadashov

-- 

Best regards,
Elkhan Dadashov


Re: SparkLauncher not notified about finished job - hangs infinitely.

2015-07-31 Thread Elkhan Dadashov
Nope, output stream of that subprocess should be spark.getInputStream()

According to the Oracle doc
https://docs.oracle.com/javase/8/docs/api/java/lang/Process.html#getOutputStream-- :

public abstract InputStream getInputStream()
Returns the input stream connected to the normal output of the subprocess.
The stream obtains data piped from the standard output of the process
represented by this Process object.

On Fri, Jul 31, 2015 at 10:10 AM, Ted Yu yuzhih...@gmail.com wrote:

 minor typo:

 bq. output (spark.getInputStream())

 Should be spark.getOutputStream()

 Cheers

 On Fri, Jul 31, 2015 at 10:02 AM, Elkhan Dadashov elkhan8...@gmail.com
 wrote:

 Hi Tomasz,

 *Answer to your 1st question*:

 Clear/read the error (spark.getErrorStream()) and output
 (spark.getInputStream()) stream buffers before you call spark.waitFor(), it
 would be better to clear/read them with 2 different threads. Then it should
 work fine.

 As Spark job is launched as subprocess, and according to Oracle
 documentation
 https://docs.oracle.com/javase/8/docs/api/java/lang/Process.html:

 By default, the created subprocess does not have its own terminal or
 console. All its standard I/O (i.e. stdin, stdout, stderr) operations will
 be redirected to the parent process, where they can be accessed via the
  streams obtained using the methods getOutputStream(), getInputStream(), and
 getErrorStream(). The parent process uses these streams to feed input to
 and get output from the subprocess. Because some native platforms only
 provide limited buffer size for standard input and output streams, failure
 to promptly write the input stream or read the output stream of the
 subprocess may cause the subprocess to block, or even deadlock.
 



 On Fri, Jul 31, 2015 at 2:45 AM, Tomasz Guziałek 
 tomasz.guzia...@humaninference.com wrote:

 I am trying to submit a JAR with Spark job into the YARN cluster from
 Java code. I am using SparkLauncher to submit SparkPi example:

 Process spark = new SparkLauncher()
     .setAppResource("C:\\spark-1.4.1-bin-hadoop2.6\\lib\\spark-examples-1.4.1-hadoop2.6.0.jar")
     .setMainClass("org.apache.spark.examples.SparkPi")
     .setMaster("yarn-cluster")
     .launch();
 System.out.println("Waiting for finish...");
 int exitCode = spark.waitFor();
 System.out.println("Finished! Exit code: " + exitCode);

 There are two problems:

 1. While submitting in yarn-cluster mode, the application is
 successfully submitted to YARN and executes successfully (it is visible in
 the YARN UI, reported as SUCCESS and PI value is printed in the output).
 However, the submitting application is never notified that processing is
 finished - it hangs infinitely after printing Waiting to finish... The
 log of the container can be found here: http://pastebin.com/LscBjHQc
 2. While submitting in yarn-client mode, the application does not
 appear in YARN UI and the submitting application hangs at Waiting to
 finish... When hanging code is killed, the application shows up in YARN UI
 and it is reported as SUCCESS, but the output is empty (PI value is not
 printed out). The log of the container can be found here:
 http://pastebin.com/9KHi81r4

 I tried to execute the submitting application both with Oracle Java 8
 and 7.



 Any hints what might be wrong?



 Best regards,

 Tomasz




 --

 Best regards,
 Elkhan Dadashov





-- 

Best regards,
Elkhan Dadashov


Re: SparkLauncher not notified about finished job - hangs infinitely.

2015-07-31 Thread Elkhan Dadashov
Hi Tomasz,

*Answer to your 1st question*:

Clear/read the error (spark.getErrorStream()) and output
(spark.getInputStream()) stream buffers before you call spark.waitFor(); it
would be better to clear/read them with 2 different threads. Then it should
work fine.

As the Spark job is launched as a subprocess, according to the Oracle
documentation
https://docs.oracle.com/javase/8/docs/api/java/lang/Process.html:

"By default, the created subprocess does not have its own terminal or
console. All its standard I/O (i.e. stdin, stdout, stderr) operations will
be redirected to the parent process, where they can be accessed via the
streams obtained using the methods getOutputStream(), getInputStream(), and
getErrorStream(). The parent process uses these streams to feed input to
and get output from the subprocess. Because some native platforms only
provide limited buffer size for standard input and output streams, failure
to promptly write the input stream or read the output stream of the
subprocess may cause the subprocess to block, or even deadlock."
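A minimal sketch of what I mean - two drainer threads, then waitFor(); this assumes the Process came from SparkLauncher#launch():

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

// Sketch: drain the subprocess's stdout and stderr on separate threads so its buffers
// never fill up, then wait for the exit code.
public class DrainStreamsSketch {
  static Thread drain(final InputStream in, final String name) {
    Thread t = new Thread(() -> {
      try (BufferedReader r = new BufferedReader(new InputStreamReader(in))) {
        String line;
        while ((line = r.readLine()) != null) {
          System.out.println("[" + name + "] " + line);
        }
      } catch (IOException ignored) { }
    }, name + "-drainer");
    t.setDaemon(true);
    t.start();
    return t;
  }

  public static int waitFor(Process spark) throws InterruptedException {
    drain(spark.getInputStream(), "stdout");   // child's stdout
    drain(spark.getErrorStream(), "stderr");   // child's stderr
    return spark.waitFor();
  }
}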




On Fri, Jul 31, 2015 at 2:45 AM, Tomasz Guziałek 
tomasz.guzia...@humaninference.com wrote:

 I am trying to submit a JAR with Spark job into the YARN cluster from Java
 code. I am using SparkLauncher to submit SparkPi example:

 Process spark = new SparkLauncher()
     .setAppResource("C:\\spark-1.4.1-bin-hadoop2.6\\lib\\spark-examples-1.4.1-hadoop2.6.0.jar")
     .setMainClass("org.apache.spark.examples.SparkPi")
     .setMaster("yarn-cluster")
     .launch();
 System.out.println("Waiting for finish...");
 int exitCode = spark.waitFor();
 System.out.println("Finished! Exit code: " + exitCode);

 There are two problems:

 1. While submitting in yarn-cluster mode, the application is
 successfully submitted to YARN and executes successfully (it is visible in
 the YARN UI, reported as SUCCESS and PI value is printed in the output).
 However, the submitting application is never notified that processing is
 finished - it hangs infinitely after printing Waiting to finish... The
 log of the container can be found here: http://pastebin.com/LscBjHQc
 2. While submitting in yarn-client mode, the application does not appear
 in YARN UI and the submitting application hangs at Waiting to finish...
 When hanging code is killed, the application shows up in YARN UI and it is
 reported as SUCCESS, but the output is empty (PI value is not printed out).
 The log of the container can be found here: http://pastebin.com/9KHi81r4

 I tried to execute the submitting application both with Oracle Java 8 and
 7.



 Any hints what might be wrong?



 Best regards,

 Tomasz




-- 

Best regards,
Elkhan Dadashov


Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Elkhan Dadashov
Thanks Corey for your answer,

Do you mean that final status : SUCCEEDED in terminal logs means that
YARN RM could clean the resources after the application has finished
(application finishing does not necessarily mean succeeded or failed) ?

With that logic it totally makes sense.

Basically, the YARN logs do not say anything about the Spark job itself.
They just say that the Spark job's resources have been cleaned up after the
job completed and returned back to YARN.

It would be great if the YARN logs could also report the outcome of the job,
because the user is more interested in the job's final status.

Detailed YARN-related logs can be found in the RM, NM, DN and NN log files.

Thanks again.

On Mon, Jul 27, 2015 at 7:45 PM, Corey Nolet cjno...@gmail.com wrote:

 Elkhan,

 What does the ResourceManager say about the final status of the job?
 Spark jobs that run as Yarn applications can fail but still successfully
 clean up their resources and give them back to the Yarn cluster. Because of
 this, there's a difference between your code throwing an exception in an
 executor/driver and the Yarn application failing. Generally you'll see a
 yarn application fail when there's a memory problem (too much memory being
 allocated or not enough causing executors to fail multiple times not
 allowing your job to finish).

 What I'm seeing from your post is that you had an exception in your
 application which was caught by the Spark framework which then proceeded to
 clean up the job and shut itself down- which it did successfully. When you
 aren't running in the Yarn modes, you aren't seeing any Yarn status that's
 telling you the Yarn application was successfully shut down, you are just
 seeing the failure(s) from your drivers/executors.



 On Mon, Jul 27, 2015 at 2:11 PM, Elkhan Dadashov elkhan8...@gmail.com
 wrote:

 Any updates on this bug ?

 Why Spark log results & Job final status does not match ? (one saying
 that job has failed, another stating that job has succeeded)

 Thanks.


 On Thu, Jul 23, 2015 at 4:43 PM, Elkhan Dadashov elkhan8...@gmail.com
 wrote:

 Hi all,

 While running Spark Word count python example with intentional mistake
 in *Yarn cluster mode*, Spark terminal states final status as
 SUCCEEDED, but log files state correct results indicating that the job
 failed.

 Why terminal log output & application log output contradict each other ?

 If i run same job on *local mode* then terminal logs and application
 logs match, where both state that job has failed to expected error in
 python script.

 More details: Scenario

 While running Spark Word count python example on *Yarn cluster mode*,
 if I make intentional error in wordcount.py by changing this line (I'm
 using Spark 1.4.1, but this problem exists in Spark 1.4.0 and in 1.3.0
 versions - which i tested):

 lines = sc.textFile(sys.argv[1], 1)

 into this line:

 lines = sc.textFile(*nonExistentVariable*,1)

 where nonExistentVariable variable was never created and initialized.

 then i run that example with this command (I put README.md into HDFS
 before running this command):

 *./bin/spark-submit --master yarn-cluster wordcount.py /README.md*

 The job runs and finishes successfully according the log printed in the
 terminal :
 *Terminal logs*:
 ...
 15/07/23 16:19:17 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:18 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:19 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:20 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:21 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: FINISHED)
 15/07/23 16:19:21 INFO yarn.Client:
  client token: N/A
  diagnostics: Shutdown hook called before final status was reported.
  ApplicationMaster host: 10.0.53.59
  ApplicationMaster RPC port: 0
  queue: default
  start time: 1437693551439
  final status: *SUCCEEDED*
  tracking URL:
 http://localhost:8088/proxy/application_1437612288327_0013/history/application_1437612288327_0013/1
  user: edadashov
 15/07/23 16:19:21 INFO util.Utils: Shutdown hook called
 15/07/23 16:19:21 INFO util.Utils: Deleting directory
 /tmp/spark-eba0a1b5-a216-4afa-9c54-a3cb67b16444

 But if look at log files generated for this application in HDFS - it
 indicates failure of the job with correct reason:
 *Application log files*:
 ...
 \00 stdout\00 179Traceback (most recent call last):
   File wordcount.py, line 32, in module
 lines = sc.textFile(nonExistentVariable,1)
 *NameError: name 'nonExistentVariable' is not defined*


 Why terminal output - final status: *SUCCEEDED* - is not matching
 application log results - failure of the job (NameError: name
 'nonExistentVariable' is not defined) ?

 Is this bug ? Is there Jira ticket related to this issue ? (Is someone
 assigned

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Elkhan Dadashov
But then how can we get to know, in a programmatic way (Java), whether the job
is making progress?

Or whether the job has failed or succeeded?

Is looking at the application log files the only way of knowing the job's
final status (failed/succeeded)?

Because when a job fails, the Job History server does not have much info about
it.

In my particular case, if there was an error before any job stage started,
then the Job History server will not have useful info about the job's progress
or status.

Then how can the user track the Spark job status ?

I'm launching Spark jobs through SparkLauncher in Java, so it becomes even
more difficult to know the job status.
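
For what it's worth, newer Spark releases (1.6+) added exactly this to the
launcher API: SparkLauncher.startApplication() returns a SparkAppHandle whose
listener is called on state transitions, so there is no need to parse the
child process output. A minimal sketch, with the Spark home and app resource
paths as placeholders:

import java.util.concurrent.CountDownLatch;
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class StatusAwareLauncher {
  public static void main(String[] args) throws Exception {
    final CountDownLatch done = new CountDownLatch(1);
    SparkAppHandle handle = new SparkLauncher()
        .setSparkHome("/path/to/spark")          // placeholder
        .setAppResource("/path/to/wordcount.py") // placeholder
        .setMaster("yarn-cluster")
        .startApplication(new SparkAppHandle.Listener() {
          @Override
          public void stateChanged(SparkAppHandle h) {
            System.out.println("State changed: " + h.getState());
            if (h.getState().isFinal()) {        // FINISHED, FAILED or KILLED
              done.countDown();
            }
          }
          @Override
          public void infoChanged(SparkAppHandle h) {
            System.out.println("Application id: " + h.getAppId());
          }
        });
    done.await();
    System.out.println("Final state: " + handle.getState());
  }
}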

On Tue, Jul 28, 2015 at 11:17 AM, Elkhan Dadashov elkhan8...@gmail.com
wrote:

 Thanks Corey for your answer,

 Do you mean that final status : SUCCEEDED in terminal logs means that
 YARN RM could clean the resources after the application has finished
 (application finishing does not necessarily mean succeeded or failed) ?

 With that logic it totally makes sense.

 Basically the YARN logs does not say anything about the Spark job itself.
 It just says that Spark job resources have been cleaned up after the job
 completed and returned back to Yarn.

 It would be great if Yarn logs could also say about the consequence of the
 job, because the user is interested in more about the job final status.

 Yarn related logs can be found in RM ,NM, DN, NN log files in detail.

 Thanks again.

 On Mon, Jul 27, 2015 at 7:45 PM, Corey Nolet cjno...@gmail.com wrote:

 Elkhan,

 What does the ResourceManager say about the final status of the job?
 Spark jobs that run as Yarn applications can fail but still successfully
 clean up their resources and give them back to the Yarn cluster. Because of
 this, there's a difference between your code throwing an exception in an
 executor/driver and the Yarn application failing. Generally you'll see a
 yarn application fail when there's a memory problem (too much memory being
 allocated or not enough causing executors to fail multiple times not
 allowing your job to finish).

 What I'm seeing from your post is that you had an exception in your
 application which was caught by the Spark framework which then proceeded to
 clean up the job and shut itself down- which it did successfully. When you
 aren't running in the Yarn modes, you aren't seeing any Yarn status that's
 telling you the Yarn application was successfully shut down, you are just
 seeing the failure(s) from your drivers/executors.



 On Mon, Jul 27, 2015 at 2:11 PM, Elkhan Dadashov elkhan8...@gmail.com
 wrote:

 Any updates on this bug ?

 Why Spark log results & Job final status does not match ? (one saying
 that job has failed, another stating that job has succeeded)

 Thanks.


 On Thu, Jul 23, 2015 at 4:43 PM, Elkhan Dadashov elkhan8...@gmail.com
 wrote:

 Hi all,

 While running Spark Word count python example with intentional mistake
 in *Yarn cluster mode*, Spark terminal states final status as
 SUCCEEDED, but log files state correct results indicating that the job
 failed.

 Why terminal log output & application log output contradict each other ?

 If i run same job on *local mode* then terminal logs and application
 logs match, where both state that job has failed to expected error in
 python script.

 More details: Scenario

 While running Spark Word count python example on *Yarn cluster mode*,
 if I make intentional error in wordcount.py by changing this line (I'm
 using Spark 1.4.1, but this problem exists in Spark 1.4.0 and in 1.3.0
 versions - which i tested):

 lines = sc.textFile(sys.argv[1], 1)

 into this line:

 lines = sc.textFile(*nonExistentVariable*,1)

 where nonExistentVariable variable was never created and initialized.

 then i run that example with this command (I put README.md into HDFS
 before running this command):

 *./bin/spark-submit --master yarn-cluster wordcount.py /README.md*

 The job runs and finishes successfully according the log printed in the
 terminal :
 *Terminal logs*:
 ...
 15/07/23 16:19:17 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:18 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:19 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:20 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:21 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: FINISHED)
 15/07/23 16:19:21 INFO yarn.Client:
  client token: N/A
  diagnostics: Shutdown hook called before final status was reported.
  ApplicationMaster host: 10.0.53.59
  ApplicationMaster RPC port: 0
  queue: default
  start time: 1437693551439
  final status: *SUCCEEDED*
  tracking URL:
 http://localhost:8088/proxy/application_1437612288327_0013/history/application_1437612288327_0013/1
  user: edadashov
 15/07/23 16:19:21 INFO util.Utils: Shutdown hook

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Elkhan Dadashov
Thanks a lot for feedback, Marcelo.

I've filed a bug just now - SPARK-9416
https://issues.apache.org/jira/browse/SPARK-9416



On Tue, Jul 28, 2015 at 12:14 PM, Marcelo Vanzin van...@cloudera.com
wrote:

 BTW this is most probably caused by this line in PythonRunner.scala:

 System.exit(process.waitFor())

 The YARN backend doesn't like applications calling System.exit().


 On Tue, Jul 28, 2015 at 12:00 PM, Marcelo Vanzin van...@cloudera.com
 wrote:

 This might be an issue with how pyspark propagates the error back to the
 AM. I'm pretty sure this does not happen for Scala / Java apps.

 Have you filed a bug?

 On Tue, Jul 28, 2015 at 11:17 AM, Elkhan Dadashov elkhan8...@gmail.com
 wrote:

 Thanks Corey for your answer,

 Do you mean that final status : SUCCEEDED in terminal logs means that
 YARN RM could clean the resources after the application has finished
 (application finishing does not necessarily mean succeeded or failed) ?

 With that logic it totally makes sense.

 Basically the YARN logs does not say anything about the Spark job
 itself. It just says that Spark job resources have been cleaned up after
 the job completed and returned back to Yarn.

 It would be great if Yarn logs could also say about the consequence of
 the job, because the user is interested in more about the job final status.

 Yarn related logs can be found in RM ,NM, DN, NN log files in detail.

 Thanks again.

 On Mon, Jul 27, 2015 at 7:45 PM, Corey Nolet cjno...@gmail.com wrote:

 Elkhan,

 What does the ResourceManager say about the final status of the job?
 Spark jobs that run as Yarn applications can fail but still successfully
 clean up their resources and give them back to the Yarn cluster. Because of
 this, there's a difference between your code throwing an exception in an
 executor/driver and the Yarn application failing. Generally you'll see a
 yarn application fail when there's a memory problem (too much memory being
 allocated or not enough causing executors to fail multiple times not
 allowing your job to finish).

 What I'm seeing from your post is that you had an exception in your
 application which was caught by the Spark framework which then proceeded to
 clean up the job and shut itself down- which it did successfully. When you
 aren't running in the Yarn modes, you aren't seeing any Yarn status that's
 telling you the Yarn application was successfully shut down, you are just
 seeing the failure(s) from your drivers/executors.



 On Mon, Jul 27, 2015 at 2:11 PM, Elkhan Dadashov elkhan8...@gmail.com
 wrote:

 Any updates on this bug ?

 Why Spark log results & Job final status does not match ? (one saying
 that job has failed, another stating that job has succeeded)

 Thanks.


 On Thu, Jul 23, 2015 at 4:43 PM, Elkhan Dadashov elkhan8...@gmail.com
  wrote:

 Hi all,

 While running Spark Word count python example with intentional
 mistake in *Yarn cluster mode*, Spark terminal states final status
 as SUCCEEDED, but log files state correct results indicating that the job
 failed.

 Why terminal log output & application log output contradict each
 other ?

 If i run same job on *local mode* then terminal logs and application
 logs match, where both state that job has failed to expected error in
 python script.

 More details: Scenario

 While running Spark Word count python example on *Yarn cluster mode*,
 if I make intentional error in wordcount.py by changing this line (I'm
 using Spark 1.4.1, but this problem exists in Spark 1.4.0 and in 1.3.0
 versions - which i tested):

 lines = sc.textFile(sys.argv[1], 1)

 into this line:

 lines = sc.textFile(*nonExistentVariable*,1)

 where nonExistentVariable variable was never created and initialized.

 then i run that example with this command (I put README.md into HDFS
 before running this command):

 *./bin/spark-submit --master yarn-cluster wordcount.py /README.md*

 The job runs and finishes successfully according the log printed in
 the terminal :
 *Terminal logs*:
 ...
 15/07/23 16:19:17 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:18 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:19 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:20 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:21 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: FINISHED)
 15/07/23 16:19:21 INFO yarn.Client:
  client token: N/A
  diagnostics: Shutdown hook called before final status was reported.
  ApplicationMaster host: 10.0.53.59
  ApplicationMaster RPC port: 0
  queue: default
  start time: 1437693551439
  final status: *SUCCEEDED*
  tracking URL:
 http://localhost:8088/proxy/application_1437612288327_0013/history/application_1437612288327_0013/1
  user: edadashov
 15/07/23 16:19:21 INFO util.Utils: Shutdown hook

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Elkhan Dadashov
I run Spark in yarn-cluster mode, and yes, log aggregation is enabled. In the
Yarn aggregated logs I can see the job status correctly.

The issue is that the Yarn Client logs (which are written to stdout in the
terminal) state that the job has succeeded even though the job has failed.

The user is not checking whether the Yarn RM successfully cleared all
resources, but rather the Spark job's final status.

If a Spark job fails, then the Spark Job server does not give any useful
information about the job failure.

As Marcelo stated, it is a bug in Spark Python jobs (PythonRunner.scala,
maybe). I've created a bug report - SPARK-9416
https://issues.apache.org/jira/browse/SPARK-9416.

@Marcelo
*Question1*:
Do you know why, when launching a Spark job through SparkLauncher in Java, the
stdout logs (i.e., INFO yarn.Client) are written into the error stream
(spark.getErrorStream()) instead of the output stream ?

The output stream (spark.getInputStream()) is always empty.

Process spark = new SparkLauncher()
        .setSparkHome("/home/edadashov/tools/myspark/spark")
        .setAppName(sparkScriptName)
        .setMaster("yarn-cluster")
        .setAppResource(sparkScriptPath.toString())
        .addAppArgs(params)
        .addPyFile(sparkScriptPath.toString())
        .addPyFile(dependencyPackagePath.toString())
        .launch();

Then I need to drain/read the buffered streams so that the process does not
enter a deadlock. According to the Oracle doc
https://docs.oracle.com/javase/8/docs/api/java/lang/Process.html#getOutputStream--
:

By default, the created subprocess does not have its own terminal or
console. All its standard I/O (i.e. stdin, stdout, stderr) operations will
be redirected to the parent process, where they can be accessed via the
streams obtained using the methods getOutputStream(), getInputStream(), and
getErrorStream(). The parent process uses these streams to feed input to
and get output from the subprocess. Because some native platforms only
provide limited buffer size for standard input and output streams, failure
to promptly write the input stream or read the output stream of the
subprocess may cause the subprocess to block, or even deadlock.
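
Regarding Question 1, my understanding is that Spark's default log4j
configuration sends its console logging to System.err, which would explain why
the INFO yarn.Client lines show up on spark.getErrorStream() while
spark.getInputStream() stays empty.

For the buffering concern, a minimal sketch of draining a child-process stream
on a background thread (the class and tag names are just illustrative):

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

public class StreamDrainer {
  // Continuously read one of the child's streams so the limited pipe buffer
  // never fills up and blocks the subprocess.
  public static Thread drain(final InputStream in, final String tag) {
    Thread t = new Thread(new Runnable() {
      public void run() {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in))) {
          String line;
          while ((line = reader.readLine()) != null) {
            System.out.println(tag + line);
          }
        } catch (Exception e) {
          // The stream closes when the subprocess exits; nothing more to do.
        }
      }
    });
    t.setDaemon(true);
    t.start();
    return t;
  }
}

Calling StreamDrainer.drain(spark.getInputStream(), "[stdout] ") and
StreamDrainer.drain(spark.getErrorStream(), "[stderr] ") right after launch()
and before spark.waitFor() keeps both pipes empty.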

*Question2*:

What is the best way to know about Spark job progress & final status in
Java ?

Thanks.
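
One option that does not rely on parsing logs is to ask the YARN
ResourceManager for the application report directly. A rough sketch using the
Hadoop YarnClient API - the application id is the one YARN prints (e.g.
application_1437612288327_0013), and the string-to-ApplicationId helper may
differ between Hadoop versions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class YarnStatusCheck {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new Configuration()); // picks up yarn-site.xml from the classpath
    yarn.start();
    try {
      ApplicationId appId = ConverterUtils.toApplicationId(args[0]);
      ApplicationReport report = yarn.getApplicationReport(appId);
      // YarnApplicationState is the lifecycle state (RUNNING, FINISHED, ...);
      // FinalApplicationStatus is what the ApplicationMaster reported.
      System.out.println("State: " + report.getYarnApplicationState());
      System.out.println("Final status: " + report.getFinalApplicationStatus());
      System.out.println("Diagnostics: " + report.getDiagnostics());
    } finally {
      yarn.stop();
    }
  }
}

Of course, as this thread shows, the final status the ApplicationMaster
reports can itself be misleading for Python apps until SPARK-9416 is fixed,
but the diagnostics and the tracking URL are reachable from the same report.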



On Tue, Jul 28, 2015 at 1:17 PM, Corey Nolet cjno...@gmail.com wrote:



 On Tue, Jul 28, 2015 at 2:17 PM, Elkhan Dadashov elkhan8...@gmail.com
 wrote:

 Thanks Corey for your answer,

 Do you mean that final status : SUCCEEDED in terminal logs means that
 YARN RM could clean the resources after the application has finished
 (application finishing does not necessarily mean succeeded or failed) ?

 Correct.


 With that logic it totally makes sense.

 Basically the YARN logs does not say anything about the Spark job itself.
 It just says that Spark job resources have been cleaned up after the job
 completed and returned back to Yarn.


 If you have log aggregation enabled of your cluster, the yarn log
 command should give you any exceptions that were thrown in the driver /
 executors when you are running in yarn cluster mode. If you were running in
 yarn-client mode, you'd see the errors that caused a job to fail in your
 local log (errors that would cause a job to fail will be caught by the
 SparkContext on the driver) because the driver is running locally instead
 of being deployed in a yarn container. Also, using the Spark HistoryServer
 will give you a more visual insight into the exact problems (like which
 partitions failed, which executors died trying to process them, etc...)



 It would be great if Yarn logs could also say about the consequence of
 the job, because the user is interested in more about the job final status.


 This is just an artifact of running with yarn-cluster mode. It's still
 easy enough to run the yarn log command to see all the logs (you can grep
 for the node designated as the application master to find any exceptions in
 your driver that may show you why your application failed).  The
 HistoryServer would still give you enough information after the fact to see
 the failures.

 Generally, I submit my jobs in yarn-client mode while i'm testing so that
 I can spot errors right away. I generally only use yarn-cluster mode for
 jobs that are deployed onto operational hardware- that way if a job does
 fail, I can still use yarn log to find out why, but I don't need a local
 process running on the machine that submitted the job taking up resources
 (see the waitForAppCompletion property introduced into Spark 1.4).

 I'll also caveat my response and say that I have not used Spark's Python
 API so I can only give you a general overview of how the Yarn integration
 works from the Scala point of view.


 Hope this helps.


 Yarn related logs can be found in RM ,NM, DN, NN log files in detail.

 Thanks again.

 On Mon, Jul 27, 2015 at 7:45 PM, Corey Nolet cjno...@gmail.com wrote:

 Elkhan,

 What does the ResourceManager say about the final status of the job?
 Spark jobs that run as Yarn applications can fail but still successfully
 clean up their resources

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-27 Thread Elkhan Dadashov
Any updates on this bug ?

Why do the Spark log results & job final status not match ? (one saying that
the job has failed, another stating that the job has succeeded)

Thanks.


On Thu, Jul 23, 2015 at 4:43 PM, Elkhan Dadashov elkhan8...@gmail.com
wrote:

 Hi all,

 While running Spark Word count python example with intentional mistake in 
 *Yarn
 cluster mode*, Spark terminal states final status as SUCCEEDED, but log
 files state correct results indicating that the job failed.

 Why terminal log output & application log output contradict each other ?

 If i run same job on *local mode* then terminal logs and application logs
 match, where both state that job has failed to expected error in python
 script.

 More details: Scenario

 While running Spark Word count python example on *Yarn cluster mode*, if
 I make intentional error in wordcount.py by changing this line (I'm using
 Spark 1.4.1, but this problem exists in Spark 1.4.0 and in 1.3.0 versions -
 which i tested):

 lines = sc.textFile(sys.argv[1], 1)

 into this line:

 lines = sc.textFile(*nonExistentVariable*,1)

 where nonExistentVariable variable was never created and initialized.

 then i run that example with this command (I put README.md into HDFS
 before running this command):

 *./bin/spark-submit --master yarn-cluster wordcount.py /README.md*

 The job runs and finishes successfully according the log printed in the
 terminal :
 *Terminal logs*:
 ...
 15/07/23 16:19:17 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:18 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:19 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:20 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:21 INFO yarn.Client: Application report for
 application_1437612288327_0013 (state: FINISHED)
 15/07/23 16:19:21 INFO yarn.Client:
  client token: N/A
  diagnostics: Shutdown hook called before final status was reported.
  ApplicationMaster host: 10.0.53.59
  ApplicationMaster RPC port: 0
  queue: default
  start time: 1437693551439
  final status: *SUCCEEDED*
  tracking URL:
 http://localhost:8088/proxy/application_1437612288327_0013/history/application_1437612288327_0013/1
  user: edadashov
 15/07/23 16:19:21 INFO util.Utils: Shutdown hook called
 15/07/23 16:19:21 INFO util.Utils: Deleting directory
 /tmp/spark-eba0a1b5-a216-4afa-9c54-a3cb67b16444

 But if look at log files generated for this application in HDFS - it
 indicates failure of the job with correct reason:
 *Application log files*:
 ...
 \00 stdout\00 179Traceback (most recent call last):
   File wordcount.py, line 32, in module
 lines = sc.textFile(nonExistentVariable,1)
 *NameError: name 'nonExistentVariable' is not defined*


 Why terminal output - final status: *SUCCEEDED* - is not matching
 application log results - failure of the job (NameError: name
 'nonExistentVariable' is not defined) ?

 Is this bug ? Is there Jira ticket related to this issue ? (Is someone
 assigned to this issue ?)

 If i run this wordcount .py example (with mistake line) in local mode,
 then terminal log states that the job has failed in terminal logs too.

 *./bin/spark-submit wordcount.py /README.md*

 *Terminal logs*:

 ...
 15/07/23 16:31:55 INFO scheduler.EventLoggingListener: Logging events to
 hdfs:///app-logs/local-1437694314943
 Traceback (most recent call last):
   File /home/edadashov/tools/myspark/spark/wordcount.py, line 32, in
 module
 lines = sc.textFile(nonExistentVariable,1)
 NameError: name 'nonExistentVariable' is not defined
 15/07/23 16:31:55 INFO spark.SparkContext: Invoking stop() from shutdown
 hook


 Thanks.




-- 

Best regards,
Elkhan Dadashov


[ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-23 Thread Elkhan Dadashov
Hi all,

While running the Spark word count Python example with an intentional mistake
in *Yarn cluster mode*, the Spark terminal states the final status as
SUCCEEDED, but the log files state the correct result, indicating that the job
failed.

Why do the terminal log output & application log output contradict each other ?

If I run the same job in *local mode*, then the terminal logs and application
logs match: both state that the job has failed due to the expected error in
the Python script.

More details: Scenario

While running the Spark word count Python example in *Yarn cluster mode*, if I
make an intentional error in wordcount.py by changing this line (I'm using
Spark 1.4.1, but this problem also exists in the Spark 1.4.0 and 1.3.0
versions - which I tested):

lines = sc.textFile(sys.argv[1], 1)

into this line:

lines = sc.textFile(*nonExistentVariable*,1)

where the nonExistentVariable variable was never created and initialized.

Then I run that example with this command (I put README.md into HDFS before
running this command):

*./bin/spark-submit --master yarn-cluster wordcount.py /README.md*

The job runs and finishes successfully according to the log printed in the
terminal:
*Terminal logs*:
...
15/07/23 16:19:17 INFO yarn.Client: Application report for
application_1437612288327_0013 (state: RUNNING)
15/07/23 16:19:18 INFO yarn.Client: Application report for
application_1437612288327_0013 (state: RUNNING)
15/07/23 16:19:19 INFO yarn.Client: Application report for
application_1437612288327_0013 (state: RUNNING)
15/07/23 16:19:20 INFO yarn.Client: Application report for
application_1437612288327_0013 (state: RUNNING)
15/07/23 16:19:21 INFO yarn.Client: Application report for
application_1437612288327_0013 (state: FINISHED)
15/07/23 16:19:21 INFO yarn.Client:
 client token: N/A
 diagnostics: Shutdown hook called before final status was reported.
 ApplicationMaster host: 10.0.53.59
 ApplicationMaster RPC port: 0
 queue: default
 start time: 1437693551439
 final status: *SUCCEEDED*
 tracking URL:
http://localhost:8088/proxy/application_1437612288327_0013/history/application_1437612288327_0013/1
 user: edadashov
15/07/23 16:19:21 INFO util.Utils: Shutdown hook called
15/07/23 16:19:21 INFO util.Utils: Deleting directory
/tmp/spark-eba0a1b5-a216-4afa-9c54-a3cb67b16444

But if I look at the log files generated for this application in HDFS, they
indicate failure of the job with the correct reason:
*Application log files*:
...
\00 stdout\00 179Traceback (most recent call last):
  File wordcount.py, line 32, in module
lines = sc.textFile(nonExistentVariable,1)
*NameError: name 'nonExistentVariable' is not defined*


Why is the terminal output - final status: *SUCCEEDED* - not matching the
application log results - failure of the job (NameError: name
'nonExistentVariable' is not defined) ?

Is this a bug ? Is there a Jira ticket related to this issue ? (Is someone
assigned to it ?)

If I run this wordcount.py example (with the mistaken line) in local mode,
then the terminal log states that the job has failed too.

*./bin/spark-submit wordcount.py /README.md*

*Terminal logs*:

...
15/07/23 16:31:55 INFO scheduler.EventLoggingListener: Logging events to
hdfs:///app-logs/local-1437694314943
Traceback (most recent call last):
  File /home/edadashov/tools/myspark/spark/wordcount.py, line 32, in
module
lines = sc.textFile(nonExistentVariable,1)
NameError: name 'nonExistentVariable' is not defined
15/07/23 16:31:55 INFO spark.SparkContext: Invoking stop() from shutdown
hook


Thanks.


Has anyone run Python Spark application on Yarn-cluster mode ? (which has 3rd party Python modules (i.e., numpy) to be shipped with)

2015-07-17 Thread Elkhan Dadashov
Hi all,

After the SPARK-5479 https://issues.apache.org/jira/browse/SPARK-5479 issue
fix (thanks to Marcelo Vanzin), pyspark now correctly handles the addition of
several Python files (or a zip folder with __init__.py) to PYTHONPATH in
yarn-cluster mode.

But adding a Python module as a zip folder still fails if that zip folder
contains file types other than Python files (compiled byte code or C code).

Let's say you want to provide the numpy package to the --py-files flag,
downloaded as numpy-1.9.2.zip from this link
https://pypi.python.org/pypi/numpy - it does not work, complaining that the
'import numpy' line has failed.

The numpy module needs to be *installed* before importing it in a Spark Python
script.

So does that mean you need to install the required Python modules on all
machines before using pyspark ?

Or what is the best pattern for using any 3rd-party Python module in a Spark
Python job ?

Thanks.


On Thu, Jun 25, 2015 at 12:55 PM, Marcelo Vanzin van...@cloudera.com
wrote:

 Please take a look at the pull request with the actual fix; that will
 explain why it's the same issue.

 On Thu, Jun 25, 2015 at 12:51 PM, Elkhan Dadashov elkhan8...@gmail.com
 wrote:

 Thanks Marcelo.

 But my case is different. My mypython/libs/numpy-1.9.2.zip is in *local
 directory* (can also put in HDFS), but still fails.

 But SPARK-5479 https://issues.apache.org/jira/browse/SPARK-5479 is :
 PySpark on yarn mode need to support *non-local* python files.

 The job fails only when i try to include 3rd party dependency from local
 computer with --py-files (in Spark 1.4)

 Both of these commands succeed:

 ./bin/spark-submit --master yarn-cluster --verbose hdfs:///pi.py
 ./bin/spark-submit --master yarn-cluster --deploy-mode cluster  --verbose
 examples/src/main/python/pi.py

 But in this particular example with 3rd party numpy module:

 ./bin/spark-submit --verbose --master yarn-cluster --py-files
  mypython/libs/numpy-1.9.2.zip --deploy-mode cluster
 mypython/scripts/kmeans.py /kmeans_data.txt 5 1.0


 All these files :

 mypython/libs/numpy-1.9.2.zip,  mypython/scripts/kmeans.py are local
 files, kmeans_data.txt is in HDFS.


 Thanks.


 On Thu, Jun 25, 2015 at 12:22 PM, Marcelo Vanzin van...@cloudera.com
 wrote:

 That sounds like SPARK-5479 which is not in 1.4...

 On Thu, Jun 25, 2015 at 12:17 PM, Elkhan Dadashov elkhan8...@gmail.com
 wrote:

 In addition to previous emails, when i try to execute this command from
 command line:

 ./bin/spark-submit --verbose --master yarn-cluster --py-files
  mypython/libs/numpy-1.9.2.zip --deploy-mode cluster
 mypython/scripts/kmeans.py /kmeans_data.txt 5 1.0


 - numpy-1.9.2.zip - is downloaded numpy package
 - kmeans.py is default example which comes with Spark 1.4
 - kmeans_data.txt  - is default data file which comes with Spark 1.4


 It fails saying that it could not find numpy:

 File kmeans.py, line 31, in module
 import numpy
 ImportError: No module named numpy

 Has anyone run Python Spark application on Yarn-cluster mode ? (which
 has 3rd party Python modules to be shipped with)

 What are the configurations or installations to be done before running
 Python Spark job with 3rd party dependencies on Yarn-cluster ?

 Thanks in advance.


 --
 Marcelo




 --

 Best regards,
 Elkhan Dadashov




 --
 Marcelo




-- 

Best regards,
Elkhan Dadashov


Re: Command builder problem when running worker in Windows

2015-07-17 Thread Elkhan Dadashov
Run Spark with the --verbose flag to see what it read for that path.

I guess on Windows, if you are using backslashes, you need two of them (\\),
or you can just use forward slashes everywhere.

On Fri, Jul 17, 2015 at 2:40 PM, Julien Beaudan jbeau...@stottlerhenke.com
wrote:

 Hi,

 I running a stand-alone cluster in Windows 7, and when I try to run any
 worker on the machine, I get the following error:

 15/07/17 14:14:43 ERROR ExecutorRunner: Error running executor
 java.io.IOException: Cannot run program
 C:\cygdrive\c\Users\jbeaudan\Spark\spark-1.3.1-bin-hadoop2.4/bin/compute-classpath.cmd
 (in directory .): CreateProcess error=2, The system cannot find the file
 specified
 at java.lang.ProcessBuilder.start(Unknown Source)
 at org.apache.spark.util.Utils$.executeCommand(Utils.scala:1067)
 at
 org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1084)
 at
 org.apache.spark.deploy.worker.CommandUtils$.buildJavaOpts(CommandUtils.scala:112)
 at
 org.apache.spark.deploy.worker.CommandUtils$.buildCommandSeq(CommandUtils.scala:61)
 at
 org.apache.spark.deploy.worker.CommandUtils$.buildProcessBuilder(CommandUtils.scala:47)
 at
 org.apache.spark.deploy.worker.ExecutorRunner.fetchAndRunExecutor(ExecutorRunner.scala:132)
 at
 org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:68)
 Caused by: java.io.IOException: CreateProcess error=2, The system cannot
 find the file specified
 at java.lang.ProcessImpl.create(Native Method)
 at java.lang.ProcessImpl.init(Unknown Source)
 at java.lang.ProcessImpl.start(Unknown Source)
 ... 8 more


 I'm pretty sure the problem is that Spark is looking for the following
 path, which mixes forward and back slashes:


 C:\cygdrive\c\Users\jbeaudan\Spark\spark-1.3.1-bin-hadoop2.4/bin/compute-classpath.cmd

 Is there anyway to fix this?

 (Also, I have also tried running this from a normal terminal, instead of
 from cygwin, and I get the same issue, except this time the path is:
 C:\Users\jbeaudan\Spark\spark-1.3.1-bin-hadoop2.4\bin../bin/compute-classpath.cmd
 )

 Thank you!

 Julien






-- 

Best regards,
Elkhan Dadashov


Re: Command builder problem when running worker in Windows

2015-07-17 Thread Elkhan Dadashov
Are you running it from command line (CLI) or through SparkLauncher ?

If you can share the command (./bin/spark-submit ...) or the code snippet
you are running, then it can give some clue.

On Fri, Jul 17, 2015 at 3:30 PM, Julien Beaudan jbeau...@stottlerhenke.com
wrote:

  Hi Elkhan,

 I ran Spark with --verbose, but the output looked the same to me - what
 should I be looking for? At the beginning, the system properties which are
 set are:

 System properties:
 SPARK_SUBMIT - true
 spark.app.name - tests.testFileReader
 spark.jars -
 file:/C:/Users/jbeaudan/Spark/spark-1.3.1-bin-hadoop2.4/sparkTest1.jar
 spark.master - spark://192.168.194.128:7077
 Classpath elements:
 file:/C:/Users/jbeaudan/Spark/spark-1.3.1-bin-hadoop2.4/sparkTest1.jar

 I'm not sure why, but the file paths here seem formatted correctly (it is
 same from the command terminal and Cygwin), so the path must get edited
 afterwards?

 Julien


 On 07/17/2015 03:00 PM, Elkhan Dadashov wrote:

 Run Spark with --verbose flag, to see what it read for that path.

  I guess in Windows if you are using backslash, you need 2 of them (\\),
 or just use forward slashes everywhere.

 On Fri, Jul 17, 2015 at 2:40 PM, Julien Beaudan 
 jbeau...@stottlerhenke.com wrote:

 Hi,

 I running a stand-alone cluster in Windows 7, and when I try to run any
 worker on the machine, I get the following error:

 15/07/17 14:14:43 ERROR ExecutorRunner: Error running executor
 java.io.IOException: Cannot run program
 C:\cygdrive\c\Users\jbeaudan\Spark\spark-1.3.1-bin-hadoop2.4/bin/compute-classpath.cmd
 (in directory .): CreateProcess error=2, The system cannot find the file
 specified
 at java.lang.ProcessBuilder.start(Unknown Source)
 at org.apache.spark.util.Utils$.executeCommand(Utils.scala:1067)
 at
 org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1084)
 at
 org.apache.spark.deploy.worker.CommandUtils$.buildJavaOpts(CommandUtils.scala:112)
 at
 org.apache.spark.deploy.worker.CommandUtils$.buildCommandSeq(CommandUtils.scala:61)
 at
 org.apache.spark.deploy.worker.CommandUtils$.buildProcessBuilder(CommandUtils.scala:47)
 at
 org.apache.spark.deploy.worker.ExecutorRunner.fetchAndRunExecutor(ExecutorRunner.scala:132)
 at
 org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:68)
 Caused by: java.io.IOException: CreateProcess error=2, The system cannot
 find the file specified
 at java.lang.ProcessImpl.create(Native Method)
 at java.lang.ProcessImpl.init(Unknown Source)
 at java.lang.ProcessImpl.start(Unknown Source)
 ... 8 more


 I'm pretty sure the problem is that Spark is looking for the following
 path, which mixes forward and back slashes:


 C:\cygdrive\c\Users\jbeaudan\Spark\spark-1.3.1-bin-hadoop2.4/bin/compute-classpath.cmd

 Is there anyway to fix this?

 (Also, I have also tried running this from a normal terminal, instead of
 from cygwin, and I get the same issue, except this time the path is:
 C:\Users\jbeaudan\Spark\spark-1.3.1-bin-hadoop2.4\bin../bin/compute-classpath.cmd
 )

 Thank you!

 Julien






  --

  Best regards,
 Elkhan Dadashov





-- 

Best regards,
Elkhan Dadashov


Why does SparkSubmit process takes so much virtual memory in yarn-cluster mode ?

2015-07-14 Thread Elkhan Dadashov
More particular example:

I run pi.py Spark Python example in *yarn-cluster* mode (--master) through
SparkLauncher in Java.

While the program is running, these are the stats of how much memory each
process takes:

SparkSubmit process: 11.266 *gigabytes* of virtual memory

ApplicationMaster process: 2303480 *bytes* of virtual memory

Why does the SparkSubmit process take so much virtual memory in yarn-cluster
mode ? (This usually causes your Yarn container to be killed because of an
out-of-memory exception.)

On Tue, Jul 14, 2015 at 9:39 AM, Elkhan Dadashov elkhan8...@gmail.com
wrote:

 Hi all,

 If you want to launch Spark job from Java in programmatic way, then you
 need to Use SparkLauncher.

 SparkLauncher uses ProcessBuilder for creating new process - Java seems
 handle process creation in an inefficient way.

 
 When you execute a process, you must first fork() and then exec(). Forking
 creates a child process by duplicating the current process. Then, you call
 exec() to change the “process image” to a new “process image”, essentially
 executing different code within the child process.
 ...
 When we want to fork a new process, we have to copy the ENTIRE Java JVM…
 What we really are doing is requesting the same amount of memory the JVM
 been allocated.
 
 Source: http://bryanmarty.com/2012/01/14/forking-jvm/
 This link http://bryanmarty.com/2012/01/14/forking-jvm/ shows different
 solutions for launching new processes in Java.

 If our main program JVM already uses big amount of memory (let's say 6GB),
 then for creating new process while using SparkLauncher, we need 12 GB
 (virtual) memory available, even though we will not use it.

 It will be very helpful if someone could share his/her experience for
 handing this memory inefficiency in creating new processes in Java.




-- 

Best regards,
Elkhan Dadashov


ProcessBuilder in SparkLauncher is memory inefficient for launching new process

2015-07-14 Thread Elkhan Dadashov
Hi all,

If you want to launch a Spark job from Java in a programmatic way, then you
need to use SparkLauncher.

SparkLauncher uses ProcessBuilder for creating the new process - and Java
seems to handle process creation in an inefficient way.


When you execute a process, you must first fork() and then exec(). Forking
creates a child process by duplicating the current process. Then, you call
exec() to change the “process image” to a new “process image”, essentially
executing different code within the child process.
...
When we want to fork a new process, we have to copy the ENTIRE Java JVM…
What we really are doing is requesting the same amount of memory the JVM
been allocated.

Source: http://bryanmarty.com/2012/01/14/forking-jvm/
This link http://bryanmarty.com/2012/01/14/forking-jvm/ shows different
solutions for launching new processes in Java.

If our main program's JVM already uses a big amount of memory (let's say
6 GB), then to create a new process while using SparkLauncher we need 12 GB of
(virtual) memory available, even though we will not use it.

It would be very helpful if someone could share his/her experience handling
this memory inefficiency when creating new processes in Java.


Re: Why does SparkSubmit process takes so much virtual memory in yarn-cluster mode ?

2015-07-14 Thread Elkhan Dadashov
Thanks, Marcelo.

That article confused me, thanks for correcting it & helpful tips.

I looked into the virtual memory usage: jmap + jvisualvm do not show that
11.5 GB of virtual memory usage - what they show is much less. I get the
11.5 GB virtual memory figure using the top -p pid command for the SparkSubmit
process.

The virtual memory consumed by a process is the total of everything that's
in the process memory map. This includes data (eg, the Java heap), but also
all of the shared libraries and memory-mapped files used by the program.
(source:
http://stackoverflow.com/questions/561245/virtual-memory-usage-from-java-under-linux-too-much-memory-used
)

But if I use *pmap* on the SparkSubmit process id, then it shows lots of
things attached to that process which add up in virtual memory (libgcc_s.so.1,
libsunec.so, libnss_files-2.19.so, libmanagement.so, cldrdata.jar,
datanucleus-core-3.2.10.jar, spark-assembly-1.5.0-SNAPSHOT-hadoop2.3.0.jar,
locale-archive, rt.jar, sunjce_provider.jar, sunec.jar, sunpkcs11.jar,
jsse.jar, libzip.so, ..., and lots of huge anon mappings).

That is the main reason why it adds up to 11.5 GB of virtual memory usage.

P.S: SPARK_PRINT_LAUNCH_CMD=1 did not have any effect.

On Tue, Jul 14, 2015 at 10:57 AM, Marcelo Vanzin van...@cloudera.com
wrote:

 On Tue, Jul 14, 2015 at 9:53 AM, Elkhan Dadashov elkhan8...@gmail.com
 wrote:

 While the program is running, these are the stats of how much memory each
 process takes:

 SparkSubmit process : 11.266 *gigabyte* Virtual Memory

 ApplicationMaster process: 2303480 *byte *Virtual Memory


 That SparkSubmit number looks very suspicious. In yarn-cluster mode,
 SparkSubmit doesn't do much and should not use a lot of memory. You could
 set SPARK_PRINT_LAUNCH_CMD=1 before launching the app to see the exact
 java command line being used, and see whether it has any suspicious
 configuration. You could also use jmap to dump the heap and look at it with
 jvisualvm, and see if there's any low hanging fruit w.r.t. what's using the
 memory.

 Regarding the fork / exec comment, that's very misleading. OSes are very
 efficient when forking - they'll not copy the entire parent process,
 instead they'll do COW on memory pages that change. So if you do an exec
 right afterwards, you're basically copying very little memory.

 --
 Marcelo




-- 

Best regards,
Elkhan Dadashov


Does Spark driver talk to NameNode directly or Yarn Resource Manager talks to NameNode to know the nodes which has required input blocks and informs Spark Driver ? (for launching Executors on nodes wh

2015-07-13 Thread Elkhan Dadashov
Hi folks,

I have a question regarding the scheduling of a Spark job on a Yarn cluster.

Let's say there are 5 nodes in the Yarn cluster: A, B, C, D, E.

In the Spark job I'll be reading some huge text file (sc.textFile(fileName))
from HDFS and creating an RDD.

Assume that only nodes A and E contain the blocks of that text file.

I'm curious whether the Spark driver talks to the NameNode, or the Yarn
Resource Manager talks to the NameNode, to learn the nodes which have the
required input blocks.

How does Spark or Yarn launch executors on the nodes having the required
blocks ?

Which component gets that info (data blocks in each node) from the NameNode ?

This info is required while launching executors on worker nodes in order to
use data locality.

How do Spark executors get launched on nodes A and E, but not B, C, D ?

Or does the Yarn RM launch executors on random nodes, and then, if the data
block does not exist on that node, an extra duplication/copy is done on HDFS ?
(copy data from A or E to one of {B, C, D} where the executor got launched)

Thanks.


Re: Does Spark driver talk to NameNode directly or Yarn Resource Manager talks to NameNode to know the nodes which has required input blocks and informs Spark Driver ? (for launching Executors on node

2015-07-13 Thread Elkhan Dadashov
Thanks Michael for your answer.

But Yarn today does not manage HDFS. How does the Yarn RM get to know the HDFS
blocks in each data node ?

Do you mean that the Yarn RM contacts the NameNode for the HDFS block data of
each node, and then decides to launch executors on the nodes which have the
required input data blocks ?

Either the Yarn RM or the Spark driver needs to talk to the NameNode, which
knows the HDFS blocks on each node. I'm still curious which one talks to the
NameNode.

Thanks.



On Mon, Jul 13, 2015 at 12:38 PM, Michael Segel michael_se...@hotmail.com
wrote:

 I believe the short answer is that YARN is responsible for the scheduling
 and will pick where the job runs.

 Look at it this way… you’re running a YARN job that runs spark.

 Yarn should run the job on A and E, however… if there aren’t enough free
 resources, it will run the job elsewhere.


 On Jul 13, 2015, at 10:10 AM, Elkhan Dadashov elkhan8...@gmail.com
 wrote:

 Hi folks,

 I have a question regarding scheduling of Spark job on Yarn cluster.

 Let's say there are 5 nodes on Yarn cluster: A,B,C, D, E

 In Spark job I'll be reading some huge text file (sc.textFile(fileName))
 from HDFS and create an RDD.

 Assume that only nodes A, E contain the blocks of that text file.

 I'm curious if Spark driver talks to NameNode or Yarn Resource Manager
 talks to NameNode to know the nodes which has required input blocks ?

 How does Spark or Yarn launch executors on the nodes having required
 blocks ?

 Which component gets that info (data blocks in each node) from NameNode ?

 This info is required while launching executors on WorkerNodes for using
 data locality.

 How do Spark executors gets launched on nodes A,E ? But not B,C,D ?

 Or does Yarn RM launches executors on random nodes, and then if data block
 does not exist in that node extra duplication/copy is one on HDFS ? (copy
 data from A or E to one of {B,C,D} where executor got launched)

 Thanks.


 The opinions expressed here are mine, while they may reflect a cognitive
 thought, that is purely accidental.
 Use at your own risk.
 Michael Segel
 michael_segel (AT) hotmail.com








-- 

Best regards,
Elkhan Dadashov


Re: Pyspark not working on yarn-cluster mode

2015-07-10 Thread Elkhan Dadashov
Yes, you can launch (from Java code) pyspark scripts with yarn-cluster mode
without using the spark-submit script.

Check SparkLauncher code in this link
https://github.com/apache/spark/tree/master/launcher/src/main/java/org/apache/spark/launcher
. SparkLauncher is not dependent on Spark core jars, so it is very easy to
integrate it into your project.

Code example for launching Spark job without spark-submit script:

Process spark = new SparkLauncher()
        .setSparkHome(path_to_spark)
        .setAppName(pythonScriptName)
        .setMaster("yarn-cluster")
        .setAppResource(sparkScriptPath.toString())
        .addAppArgs(params)
        .addPyFile(otherPythonScriptPath.toString())
        .launch();

But in order to get correct handling of the Python path addition of 3rd-party
packages, which Marcelo has implemented in the patch for SPARK-5479
https://issues.apache.org/jira/browse/SPARK-5479, download the latest source
code of Spark and build it yourself with Maven.

Other pre-built Spark versions do not include that patch.



On Fri, Jul 10, 2015 at 9:52 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 To add to this, conceptually, it makes no sense to launch something in
 yarn-cluster mode by creating a SparkContext on the client - the whole
 point of yarn-cluster mode is that the SparkContext runs on the cluster,
 not on the client.

 On Thu, Jul 9, 2015 at 2:35 PM, Marcelo Vanzin van...@cloudera.com
 wrote:

 You cannot run Spark in cluster mode by instantiating a SparkContext like
 that.

 You have to launch it with the spark-submit command line script.

 On Thu, Jul 9, 2015 at 2:23 PM, jegordon jgordo...@gmail.com wrote:

 Hi to all,

 Is there any way to run pyspark scripts with yarn-cluster mode without
 using
 the spark-submit script? I need it in this way because i will integrate
 this
 code into a django web app.

 When i try to run any script in yarn-cluster mode i got the following
 error
 :

 org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't
 running on a cluster. Deployment to YARN is not supported directly by
 SparkContext. Please use spark-submit.


 I'm creating the sparkContext in the following way :

 conf = (SparkConf()
 .setMaster("yarn-cluster")
 .setAppName("DataFrameTest"))

 sc = SparkContext(conf = conf)

 #Dataframe code 

 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-not-working-on-yarn-cluster-mode-tp23755.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 --
 Marcelo





-- 

Best regards,
Elkhan Dadashov


Re: Has anyone run Python Spark application on Yarn-cluster mode ? (which has 3rd party Python modules to be shipped with)

2015-06-25 Thread Elkhan Dadashov
Thanks Marcelo.

But my case is different. My mypython/libs/numpy-1.9.2.zip is in a *local
directory* (I can also put it in HDFS), but it still fails.

But SPARK-5479 https://issues.apache.org/jira/browse/SPARK-5479 is :
PySpark on yarn mode need to support *non-local* python files.

The job fails only when I try to include a 3rd-party dependency from the local
computer with --py-files (in Spark 1.4).

Both of these commands succeed:

./bin/spark-submit --master yarn-cluster --verbose hdfs:///pi.py
./bin/spark-submit --master yarn-cluster --deploy-mode cluster  --verbose
examples/src/main/python/pi.py

But in this particular example with 3rd party numpy module:

./bin/spark-submit --verbose --master yarn-cluster --py-files
 mypython/libs/numpy-1.9.2.zip --deploy-mode cluster
mypython/scripts/kmeans.py /kmeans_data.txt 5 1.0


All these files :

mypython/libs/numpy-1.9.2.zip,  mypython/scripts/kmeans.py are local files,
kmeans_data.txt is in HDFS.


Thanks.


On Thu, Jun 25, 2015 at 12:22 PM, Marcelo Vanzin van...@cloudera.com
wrote:

 That sounds like SPARK-5479 which is not in 1.4...

 On Thu, Jun 25, 2015 at 12:17 PM, Elkhan Dadashov elkhan8...@gmail.com
 wrote:

 In addition to previous emails, when i try to execute this command from
 command line:

 ./bin/spark-submit --verbose --master yarn-cluster --py-files
  mypython/libs/numpy-1.9.2.zip --deploy-mode cluster
 mypython/scripts/kmeans.py /kmeans_data.txt 5 1.0


 - numpy-1.9.2.zip - is downloaded numpy package
 - kmeans.py is default example which comes with Spark 1.4
 - kmeans_data.txt  - is default data file which comes with Spark 1.4


 It fails saying that it could not find numpy:

 File kmeans.py, line 31, in module
 import numpy
 ImportError: No module named numpy

 Has anyone run Python Spark application on Yarn-cluster mode ? (which has
 3rd party Python modules to be shipped with)

 What are the configurations or installations to be done before running
 Python Spark job with 3rd party dependencies on Yarn-cluster ?

 Thanks in advance.


 --
 Marcelo




-- 

Best regards,
Elkhan Dadashov


Re: How to run kmeans.py Spark example in yarn-cluster ?

2015-06-25 Thread Elkhan Dadashov
Hi all,

Does Spark 1.4 version support Python applications on Yarn-cluster ?
(--master yarn-cluster)

Does Spark 1.4 version support Python applications with deploy-mode cluster
? (--deploy-mode cluster)

How can we ship 3rd party Python dependencies with Python Spark job ?
(running on Yarn cluster)

Thanks.






On Wed, Jun 24, 2015 at 3:13 PM, Elkhan Dadashov elkhan8...@gmail.com
wrote:

 Hi all,

 I'm trying to run kmeans.py Spark example on Yarn cluster mode. I'm using
 Spark 1.4.0.

 I'm passing numpy-1.9.2.zip with --py-files flag.

 Here is the command I'm trying to execute but it fails:

 ./bin/spark-submit --master yarn-cluster --verbose  --py-files
mypython/libs/numpy-1.9.2.zip mypython/scripts/kmeans.py
 /kmeans_data.txt 5 1.0


 - I have kmeans_data.txt in HDFS in / directory.


 I receive this error:

 
 ...
 15/06/24 15:08:21 INFO yarn.ApplicationMaster: Final app status:
 SUCCEEDED, exitCode: 0, (reason: Shutdown hook called before final status
 was reported.)
 15/06/24 15:08:21 INFO yarn.ApplicationMaster: Unregistering
 ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before
 final status was reported.)
 15/06/24 15:08:21 INFO yarn.ApplicationMaster: Deleting staging directory
 .sparkStaging/application_1435182120590_0009
 15/06/24 15:08:22 INFO util.Utils: Shutdown hook called
 \00 stdout\00 134Traceback (most recent call last):
   File kmeans.py, line 31, in module
 import numpy as np
 ImportError: No module named numpy
 ...

 

 Any idea why it cannot import numpy-1.9.2.zip while running kmeans.py
 example provided with Spark ?

 How can we run python script which has other 3rd-party python module
 dependency on yarn-cluster ?

 Thanks.




-- 

Best regards,
Elkhan Dadashov


Has anyone run Python Spark application on Yarn-cluster mode ? (which has 3rd party Python modules to be shipped with)

2015-06-25 Thread Elkhan Dadashov
In addition to my previous emails, when I try to execute this command from the
command line:

./bin/spark-submit --verbose --master yarn-cluster --py-files
 mypython/libs/numpy-1.9.2.zip --deploy-mode cluster
mypython/scripts/kmeans.py /kmeans_data.txt 5 1.0


- numpy-1.9.2.zip - is downloaded numpy package
- kmeans.py is default example which comes with Spark 1.4
- kmeans_data.txt  - is default data file which comes with Spark 1.4


It fails saying that it could not find numpy:

File kmeans.py, line 31, in module
import numpy
ImportError: No module named numpy

Has anyone run Python Spark application on Yarn-cluster mode ? (which has
3rd party Python modules to be shipped with)

What are the configurations or installations to be done before running
Python Spark job with 3rd party dependencies on Yarn-cluster ?

Thanks in advance.

On Thu, Jun 25, 2015 at 12:09 PM, Elkhan Dadashov elkhan8...@gmail.com
wrote:

 Hi all,

 Does Spark 1.4 version support Python applications on Yarn-cluster ?
 (--master yarn-cluster)

 Does Spark 1.4 version support Python applications with deploy-mode
 cluster ? (--deploy-mode cluster)

 How can we ship 3rd party Python dependencies with Python Spark job ?
 (running on Yarn cluster)

 Thanks.






 On Wed, Jun 24, 2015 at 3:13 PM, Elkhan Dadashov elkhan8...@gmail.com
 wrote:

 Hi all,

 I'm trying to run kmeans.py Spark example on Yarn cluster mode. I'm using
 Spark 1.4.0.

 I'm passing numpy-1.9.2.zip with --py-files flag.

 Here is the command I'm trying to execute but it fails:

 ./bin/spark-submit --master yarn-cluster --verbose  --py-files
mypython/libs/numpy-1.9.2.zip mypython/scripts/kmeans.py
 /kmeans_data.txt 5 1.0


 - I have kmeans_data.txt in HDFS in / directory.


 I receive this error:

 
 ...
 15/06/24 15:08:21 INFO yarn.ApplicationMaster: Final app status:
 SUCCEEDED, exitCode: 0, (reason: Shutdown hook called before final status
 was reported.)
 15/06/24 15:08:21 INFO yarn.ApplicationMaster: Unregistering
 ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before
 final status was reported.)
 15/06/24 15:08:21 INFO yarn.ApplicationMaster: Deleting staging directory
 .sparkStaging/application_1435182120590_0009
 15/06/24 15:08:22 INFO util.Utils: Shutdown hook called
 \00 stdout\00 134Traceback (most recent call last):
   File kmeans.py, line 31, in module
 import numpy as np
 ImportError: No module named numpy
 ...

 

 Any idea why it cannot import numpy-1.9.2.zip while running kmeans.py
 example provided with Spark ?

 How can we run python script which has other 3rd-party python module
 dependency on yarn-cluster ?

 Thanks.




 --

 Best regards,
 Elkhan Dadashov



How to run kmeans.py Spark example in yarn-cluster ?

2015-06-24 Thread Elkhan Dadashov
Hi all,

I'm trying to run kmeans.py Spark example on Yarn cluster mode. I'm using
Spark 1.4.0.

I'm passing numpy-1.9.2.zip with --py-files flag.

Here is the command I'm trying to execute but it fails:

./bin/spark-submit --master yarn-cluster --verbose  --py-files
   mypython/libs/numpy-1.9.2.zip mypython/scripts/kmeans.py
/kmeans_data.txt 5 1.0


- I have kmeans_data.txt in HDFS in / directory.


I receive this error:


...
15/06/24 15:08:21 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED,
exitCode: 0, (reason: Shutdown hook called before final status was
reported.)
15/06/24 15:08:21 INFO yarn.ApplicationMaster: Unregistering
ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before
final status was reported.)
15/06/24 15:08:21 INFO yarn.ApplicationMaster: Deleting staging directory
.sparkStaging/application_1435182120590_0009
15/06/24 15:08:22 INFO util.Utils: Shutdown hook called
\00 stdout\00 134Traceback (most recent call last):
  File kmeans.py, line 31, in module
import numpy as np
ImportError: No module named numpy
...



Any idea why it cannot import numpy-1.9.2.zip while running kmeans.py
example provided with Spark ?

How can we run python script which has other 3rd-party python module
dependency on yarn-cluster ?

Thanks.


Re: What files/folders/jars spark-submit script depend on ?

2015-06-19 Thread Elkhan Dadashov
Thanks Andrew.
We cannot include Spark in our Java project due to dependency issues. Spark
will not be exposed to clients.
What we want to do is to put the Spark tarball (in the worst case) into HDFS,
so that our Java app, which runs in local mode, can launch the spark-submit
script with the users' Python files.
So the only input to spark-submit will be the users' Python scripts.
What do I need to set up to call spark-submit directly from HDFS ?
What is the reason that spark-submit cannot be run from HDFS directly if the
Spark tarball is in HDFS ?
My intention is to launch the spark-submit script through a Hadoop map job
(see the sketch below). The Hadoop map job will run from our app, but it will
launch the Spark job on the Yarn cluster by running the spark-submit script.
Thanks.
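
A rough sketch of what that map task could look like, assuming the Spark
tarball has already been localized onto the node (for example via the
distributed cache) and unpacked as ./spark next to the task's working
directory - the paths, arguments and class name are hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SparkSubmitMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // 'value' is assumed to carry the local path of the user's Python script.
    ProcessBuilder pb = new ProcessBuilder(
        "./spark/bin/spark-submit",  // unpacked Spark tarball (hypothetical layout)
        "--master", "yarn-cluster",
        value.toString());
    // Send the child's output straight into the task's own stdout/stderr logs,
    // so the pipe buffers can never fill up and block spark-submit.
    pb.inheritIO();
    int exitCode = pb.start().waitFor();
    context.write(new Text("spark-submit exitCode=" + exitCode),
        NullWritable.get());
  }
}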

On Fri, Jun 19, 2015, 6:58 PM Andrew Or and...@databricks.com wrote:

 Hi Elkhan,

 Spark submit depends on several things: the launcher jar (1.3.0+ only),
 the spark-core jar, and the spark-yarn jar (in your case). Why do you want
 to put it in HDFS though? AFAIK you can't execute scripts directly from
 HDFS; you need to copy them to a local file system first. I don't see clear
 benefits of not just running Spark submit from source or from one of the
 distributions.

 -Andrew

 2015-06-19 10:12 GMT-07:00 Elkhan Dadashov elkhan8...@gmail.com:

 Hi all,

 If I want to ship the spark-submit script to HDFS and then call it from its
 HDFS location to start a Spark job, which other files/folders/jars need to
 be transferred into HDFS along with the spark-submit script ?

 Due to some dependency issues, we cannot include Spark in our Java
 application, so instead we will allow limited usage of Spark, only with
 Python files.

 So if I want to put the spark-submit script into HDFS and call it to execute
 a Spark job on a Yarn cluster, what else needs to be put into HDFS with it ?

 (Using Spark only for executing Spark jobs written in Python)

 Thanks.





What files/folders/jars spark-submit script depend on ?

2015-06-19 Thread Elkhan Dadashov
Hi all,

If I want to ship the spark-submit script to HDFS and then call it from its
HDFS location to start a Spark job, which other files/folders/jars need to be
transferred into HDFS along with the spark-submit script ?

Due to some dependency issues, we cannot include Spark in our Java
application, so instead we will allow limited usage of Spark, only with
Python files.

So if I want to put the spark-submit script into HDFS and call it to execute
a Spark job on a Yarn cluster, what else needs to be put into HDFS with it ?

(Using Spark only for executing Spark jobs written in Python)

Thanks.
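
One way to avoid executing anything from HDFS is to ship the whole Spark
tarball through the YARN distributed cache and exec the localized copy from
the map task with a ProcessBuilder. A minimal, untested sketch follows; the
HDFS path, the #spark link name, and the directory layout inside the tarball
are assumptions you would need to adapt:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SparkSubmitMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Each input record is assumed to hold the path of one user Python script.
    String pyScript = value.toString().trim();

    // Assumes the MR job was submitted with
    // -archives hdfs:///apps/spark-1.3.1-bin-hadoop2.4.tgz#spark (hypothetical),
    // so YARN unpacks the tarball into the container working directory as ./spark
    ProcessBuilder pb = new ProcessBuilder(
        "./spark/spark-1.3.1-bin-hadoop2.4/bin/spark-submit",
        "--master", "yarn-cluster",
        pyScript);
    pb.redirectErrorStream(true);

    Process p = pb.start();
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(p.getInputStream()));
    String line;
    while ((line = reader.readLine()) != null) {
      // Surface spark-submit output in the task status for the MR web UI.
      context.setStatus(line);
    }
    reader.close();
    int exitCode = p.waitFor();
    context.write(new Text(pyScript), new Text("spark-submit exit code: " + exitCode));
  }
}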


Is there programmatic way running Spark job on Yarn cluster without using spark-submit script ?

2015-06-17 Thread Elkhan Dadashov
Hi all,

Is there any way to run a Spark job programmatically on a Yarn cluster
without using the spark-submit script ?

I cannot include Spark jars in my Java application (due to dependency
conflicts and other reasons), so I'll be shipping the Spark assembly uber jar
(spark-assembly-1.3.1-hadoop2.3.0.jar) to the Yarn cluster and then executing
the job (Python or Java) on the Yarn cluster.

So is there any way to run a Spark job implemented in a Python file/Java
class without calling it through the spark-submit script ?

Thanks.


Re: Is there programmatic way running Spark job on Yarn cluster without using spark-submit script ?

2015-06-17 Thread Elkhan Dadashov
This is not an independent, programmatic way of running a Spark job on a Yarn
cluster.

That example demonstrates running in *Yarn-client* mode, and it also depends
on Jetty. Users writing Spark programs do not want to depend on that.

I found the SparkLauncher class introduced in Spark 1.4 (
https://github.com/apache/spark/tree/master/launcher), which allows running
Spark jobs programmatically.

SparkLauncher exists in the Java and Scala APIs, but I could not find it in
the Python API.

I have not tried it yet, but it seems promising.

Example:

import org.apache.spark.launcher.SparkLauncher;

public class MyLauncher {
  public static void main(String[] args) throws Exception {
    Process spark = new SparkLauncher()
      .setAppResource("/my/app.jar")
      .setMainClass("my.spark.app.Main")
      .setMaster("local")
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .launch();
    spark.waitFor();
  }
}
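
For the Python-on-Yarn case asked about above, the same API should be able to
point at a .py file and a yarn-cluster master. A rough variant (the Spark
home, script, and zip paths are made-up examples):

import org.apache.spark.launcher.SparkLauncher;

public class MyPythonLauncher {
  public static void main(String[] args) throws Exception {
    Process spark = new SparkLauncher()
      .setSparkHome("/opt/spark-1.4.0-bin-hadoop2.4")        // Spark install on the launching host
      .setAppResource("/home/me/mypython/scripts/kmeans.py") // user script to run
      .addPyFile("/home/me/mypython/libs/helpers.zip")       // optional pure-Python dependencies
      .setMaster("yarn-cluster")
      .launch();
    int exitCode = spark.waitFor();
    System.out.println("spark-submit exited with code " + exitCode);
  }
}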



On Wed, Jun 17, 2015 at 5:51 PM, Corey Nolet cjno...@gmail.com wrote:

 An example of being able to do this is provided in the Spark Jetty Server
 project [1]

 [1] https://github.com/calrissian/spark-jetty-server

 On Wed, Jun 17, 2015 at 8:29 PM, Elkhan Dadashov elkhan8...@gmail.com
 wrote:

 Hi all,

 Is there any way to run a Spark job programmatically on a Yarn cluster
 without using the spark-submit script ?

 I cannot include Spark jars in my Java application (due to dependency
 conflicts and other reasons), so I'll be shipping the Spark assembly uber
 jar (spark-assembly-1.3.1-hadoop2.3.0.jar) to the Yarn cluster and then
 executing the job (Python or Java) on the Yarn cluster.

 So is there any way to run a Spark job implemented in a Python file/Java
 class without calling it through the spark-submit script ?

 Thanks.






-- 

Best regards,
Elkhan Dadashov


Re: Spark Java API and minimum set of 3rd party dependencies

2015-06-12 Thread Elkhan Dadashov
Thanks for prompt response, Sean.

The issue is that we are restricted in which dependencies we can include in
our project.

There are two issues with including dependencies:

1) there are several dependencies which both we and Spark have, but the
versions conflict.
2) there are dependencies which Spark has and our project does not.

How do we handle these two cases when including Spark dependencies (direct
and transitive ones)?

We need to declare all dependencies in our POM file (so there should be no
unlisted 3rd-party transitive dependencies), more like the Apache Ivy style
of dependency management (which lists all transitive dependencies in the POM
file) rather than the Maven style.

Our main goal is: we want to integrate Spark into our Java application using
the Spark Java API and then run it on Yarn clusters.

Thanks a lot.


On Fri, Jun 12, 2015 at 11:17 AM, Sean Owen so...@cloudera.com wrote:

 You don't add dependencies to your app -- you mark Spark as 'provided'
 in the build and you rely on the deployed Spark environment to provide
 it.

 On Fri, Jun 12, 2015 at 7:14 PM, Elkhan Dadashov elkhan8...@gmail.com
 wrote:
  Hi all,
 
  We want to integrate Spark into our Java application using the Spark Java
  API and then run it on Yarn clusters.

  If I want to run Spark on Yarn, which dependencies must be included ?

  I looked at the Spark POM, which lists 50+ 3rd-party dependencies for
  Spark.

  Is there a minimum set of Spark dependencies necessary for the Spark Java
  API (for a Spark client run on a Yarn cluster) ?
 
  Thanks in advance.
 




-- 

Best regards,
Elkhan Dadashov


Spark Java API and minimum set of 3rd party dependencies

2015-06-12 Thread Elkhan Dadashov
Hi all,

We want to integrate Spark into our Java application using the Spark Java API
and then run it on Yarn clusters.

If I want to run Spark on Yarn, which dependencies must be included ?

I looked at the Spark POM
http://central.maven.org/maven2/org/apache/spark/spark-core_2.10/1.3.1/spark-core_2.10-1.3.1.pom
which lists 50+ 3rd-party dependencies for Spark.

Is there a minimum set of Spark dependencies necessary for the Spark Java API
(for a Spark client run on a Yarn cluster) ?

Thanks in advance.
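
As the reply above suggests, the usual approach is to declare Spark with
'provided' scope, so you compile against it but rely on the cluster's Spark
installation at runtime. A minimal sketch of the Maven entry (version and
Scala suffix taken from the 1.3.1 POM linked above; adjust them to your
cluster):

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.3.1</version>
  <scope>provided</scope>
</dependency>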


Is it possible to see Spark jobs on MapReduce job history ? (running Spark on YARN cluster)

2015-06-11 Thread Elkhan Dadashov
Hi all,

I wonder if anyone has used the MapReduce Job History server to show Spark
jobs.

I can see my Spark jobs (Spark running on a Yarn cluster) on the Resource
Manager (RM).

I start the Spark History Server, and then through Spark's web-based user
interface I can monitor the cluster (and track cluster and job statistics).
Basically the Yarn RM gets linked to the Spark History Server, which enables
monitoring.
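
For reference, that linking usually comes down to a few properties, roughly
like this in spark-defaults.conf (the log directory and host name are
placeholders):

spark.eventLog.enabled            true
spark.eventLog.dir                hdfs:///spark-history
spark.yarn.historyServer.address  historyhost.example.com:18080

and, on the History Server side, spark.history.fs.logDirectory pointing at
the same hdfs:///spark-history directory.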

But instead of using the Spark History Server, is it possible to see Spark
jobs in the MapReduce Job History ? (in addition to seeing them on the RM)

(I know that through "yarn logs -applicationId <app ID>" we can get all logs
after the Spark job has completed, but my concern is to see the logs and
completed jobs through a common web UI - the MapReduce Job History.)

Thanks in advance.


Running SparkPi ( or JavaWordCount) example fails with Job aborted due to stage failure: Task serialization failed

2015-06-08 Thread Elkhan Dadashov
Hello,

Running the Spark examples fails on one machine, but succeeds in a virtual
machine with the exact same Spark and Java versions installed.

The weird part is that it fails on one machine but runs successfully on the
VM.

Did anyone face the same problem ? Any solution tips ?

Thanks in advance.

*Spark version*: spark-1.3.1-bin-hadoop2.4
*Java Version*:
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

When I want to execute the SparkPi example using this command:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
local lib/spark-examples*.jar 4

The job fails with this log below:

15/06/08 19:03:37 INFO spark.SparkContext: Running Spark version 1.3.1
15/06/08 19:03:37 WARN util.Utils: Your hostname, edadashov-wsl resolves to
a loopback address: 127.0.1.1; using 10.0.53.59 instead (on interface em1)
15/06/08 19:03:37 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind
to another address
15/06/08 19:03:37 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
15/06/08 19:03:37 INFO spark.SecurityManager: Changing view acls to:
edadashov
15/06/08 19:03:37 INFO spark.SecurityManager: Changing modify acls to:
edadashov
15/06/08 19:03:37 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(edadashov); users with modify permissions: Set(edadashov)
15/06/08 19:03:37 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/06/08 19:03:37 INFO Remoting: Starting remoting
15/06/08 19:03:38 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkdri...@edadashov-wsl.internal.salesforce.com:46368]
15/06/08 19:03:38 INFO util.Utils: Successfully started service
'sparkDriver' on port 46368.
15/06/08 19:03:38 INFO spark.SparkEnv: Registering MapOutputTracker
15/06/08 19:03:38 INFO spark.SparkEnv: Registering BlockManagerMaster
15/06/08 19:03:38 INFO storage.DiskBlockManager: Created local directory at
/tmp/spark-0d14f274-6724-4ead-89f2-8ff1975d6e72/blockmgr-2b418147-083f-4bc9-9375-25066f3a2495
15/06/08 19:03:38 INFO storage.MemoryStore: MemoryStore started with
capacity 265.1 MB
15/06/08 19:03:38 INFO spark.HttpFileServer: HTTP File server directory is
/tmp/spark-e1576f6e-9aa7-4102-92e8-d227e9a00ff6/httpd-1f2ca961-d98e-4f19-9591-874dd74c833f
15/06/08 19:03:38 INFO spark.HttpServer: Starting HTTP Server
15/06/08 19:03:38 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/06/08 19:03:38 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:47487
15/06/08 19:03:38 INFO util.Utils: Successfully started service 'HTTP file
server' on port 47487.
15/06/08 19:03:38 INFO spark.SparkEnv: Registering OutputCommitCoordinator
15/06/08 19:03:38 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/06/08 19:03:38 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
15/06/08 19:03:38 INFO util.Utils: Successfully started service 'SparkUI'
on port 4040.
15/06/08 19:03:38 INFO ui.SparkUI: Started SparkUI at
http://edadashov-wsl.internal.salesforce.com:4040
15/06/08 19:03:38 INFO spark.SparkContext: Added JAR
file:/home/edadashov/tools/spark-1.3.1-bin-cdh4/lib/spark-examples-1.3.1-hadoop2.0.0-mr1-cdh4.2.0.jar
at
http://10.0.53.59:47487/jars/spark-examples-1.3.1-hadoop2.0.0-mr1-cdh4.2.0.jar
with timestamp 1433815418402
15/06/08 19:03:38 INFO executor.Executor: Starting executor ID driver on
host localhost
15/06/08 19:03:38 INFO util.AkkaUtils: Connecting to HeartbeatReceiver:
akka.tcp://
sparkdri...@edadashov-wsl.internal.salesforce.com:46368/user/HeartbeatReceiver
15/06/08 19:03:38 INFO netty.NettyBlockTransferService: Server created on
41404
15/06/08 19:03:38 INFO storage.BlockManagerMaster: Trying to register
BlockManager
15/06/08 19:03:38 INFO storage.BlockManagerMasterActor: Registering block
manager localhost:41404 with 265.1 MB RAM, BlockManagerId(driver,
localhost, 41404)
15/06/08 19:03:38 INFO storage.BlockManagerMaster: Registered BlockManager
15/06/08 19:03:38 INFO spark.SparkContext: Starting job: reduce at
SparkPi.scala:35
15/06/08 19:03:38 INFO scheduler.DAGScheduler: Got job 0 (reduce at
SparkPi.scala:35) with 4 output partitions (allowLocal=false)
15/06/08 19:03:38 INFO scheduler.DAGScheduler: Final stage: Stage 0(reduce
at SparkPi.scala:35)
15/06/08 19:03:38 INFO scheduler.DAGScheduler: Parents of final stage:
List()
15/06/08 19:03:38 INFO scheduler.DAGScheduler: Missing parents: List()
15/06/08 19:03:38 INFO scheduler.DAGScheduler: Submitting Stage 0
(MapPartitionsRDD[1] at map at SparkPi.scala:31), which has no missing
parents
15/06/08 19:03:38 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
15/06/08 19:03:38 INFO scheduler.DAGScheduler: Stage 0 (reduce at
SparkPi.scala:35) failed in Unknown s
15/06/08 19:03:38 INFO scheduler.DAGScheduler: Job 0 failed: reduce at
SparkPi.scala:35, took 0.063253 s
Exception in thread "main" org.apache.spark.SparkException: Job