Re: How to extract data in parallel from RDBMS tables

2019-03-28 Thread Surendra , Manchikanti
Hi Jason,

Thanks for your reply, but I am looking for a way to extract all of the
tables in a database in parallel.
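
One way to approach this (a minimal sketch, not an official recipe) is to drive
one JDBC read per table from the launching application, using Scala futures so
the per-table Spark jobs run concurrently. The connection settings, table names,
and output path below are hypothetical, and a SparkSession named spark is assumed:

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical connection settings; adjust for your RDBMS and environment.
val url   = "jdbc:postgresql://dbhost:5432/mydb"
val props = new java.util.Properties()
props.setProperty("user", "scott")
props.setProperty("password", "tiger")

// The table list could also be pulled from the database catalog over JDBC.
val tables = Seq("customers", "orders", "line_items")

// A small thread pool on the driver; each thread submits an independent Spark job.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))

val copies = tables.map { t =>
  Future {
    spark.read.jdbc(url, t, props)                      // one DataFrame per table
      .write.mode("overwrite").parquet(s"/warehouse/raw/$t")
  }
}
Await.result(Future.sequence(copies), Duration.Inf)     // wait for all copies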


On Thu, Mar 28, 2019 at 2:50 PM Jason Nerothin 
wrote:

> Yes.
>
> If you use the numPartitions option, your max parallelism will be that
> number. See also: partitionColumn, lowerBound, and upperBound
>
> https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
>
> On Wed, Mar 27, 2019 at 23:06 Surendra , Manchikanti <
> surendra.manchika...@gmail.com> wrote:
>
>> Hi All,
>>
>> Is there any way to copy all the tables in parallel from RDBMS using
>> Spark? We are looking for a functionality similar to Sqoop.
>>
>> Thanks,
>> Surendra
>>
>> --
> Thanks,
> Jason
>




Re: spark.submit.deployMode: cluster

2019-03-28 Thread Jason Nerothin
Meant this one: https://docs.databricks.com/api/latest/jobs.html

On Thu, Mar 28, 2019 at 5:06 PM Pat Ferrel  wrote:

> Thanks, are you referring to
> https://github.com/spark-jobserver/spark-jobserver or the undocumented
> REST job server included in Spark?
>
>
> From: Jason Nerothin  
> Reply: Jason Nerothin  
> Date: March 28, 2019 at 2:53:05 PM
> To: Pat Ferrel  
> Cc: Felix Cheung  , 
> Marcelo
> Vanzin  , user
>  
> Subject:  Re: spark.submit.deployMode: cluster
>
> Check out the Spark Jobs API... it sits behind a REST service...
>
>
> On Thu, Mar 28, 2019 at 12:29 Pat Ferrel  wrote:
>
>> ;-)
>>
>> Great idea. Can you suggest a project?
>>
>> Apache PredictionIO uses spark-submit (very ugly) and Apache Mahout only
>> launches trivially in test apps since most uses are as a lib.
>>
>>
>> From: Felix Cheung 
>> 
>> Reply: Felix Cheung 
>> 
>> Date: March 28, 2019 at 9:42:31 AM
>> To: Pat Ferrel  , Marcelo
>> Vanzin  
>> Cc: user  
>> Subject:  Re: spark.submit.deployMode: cluster
>>
>> If anyone wants to improve docs please create a PR.
>>
>> lol
>>
>>
>> But seriously you might want to explore other projects that manage job
>> submission on top of spark instead of rolling your own with spark-submit.
>>
>>
>> --
>> *From:* Pat Ferrel 
>> *Sent:* Tuesday, March 26, 2019 2:38 PM
>> *To:* Marcelo Vanzin
>> *Cc:* user
>> *Subject:* Re: spark.submit.deployMode: cluster
>>
>> Ahh, thank you indeed!
>>
>> It would have saved us a lot of time if this had been documented. I know,
>> OSS so contributions are welcome… I can also imagine your next comment; “If
>> anyone wants to improve docs see the Apache contribution rules and create a
>> PR.” or something like that.
>>
>> BTW the code where the context is known and can be used is what I’d call
>> a Driver and since all code is copied to nodes and is known in jars, it was
>> not obvious to us that this rule existed but it does make sense.
>>
>> We will need to refactor our code to use spark-submit it appears.
>>
>> Thanks again.
>>
>>
>> From: Marcelo Vanzin  
>> Reply: Marcelo Vanzin  
>> Date: March 26, 2019 at 1:59:36 PM
>> To: Pat Ferrel  
>> Cc: user  
>> Subject:  Re: spark.submit.deployMode: cluster
>>
>> If you're not using spark-submit, then that option does nothing.
>>
>> If by "context creation API" you mean "new SparkContext()" or an
>> equivalent, then you're explicitly creating the driver inside your
>> application.
>>
>> On Tue, Mar 26, 2019 at 1:56 PM Pat Ferrel  wrote:
>> >
>> > I have a server that starts a Spark job using the context creation API.
>> It DOES NOT use spark-submit.
>> >
>> > I set spark.submit.deployMode = “cluster”
>> >
>> > In the GUI I see 2 workers with 2 executors. The link for running
>> application “name” goes back to my server, the machine that launched the
>> job.
>> >
>> > This is spark.submit.deployMode = “client” according to the docs. I set
>> the Driver to run on the cluster but it runs on the client, ignoring the
>> spark.submit.deployMode.
>> >
>> > Is this as expected? It is documented nowhere I can find.
>> >
>>
>>
>> --
>> Marcelo
>>
>> --
> Thanks,
> Jason
>
>

-- 
Thanks,
Jason


BLAS library class def not found error

2019-03-28 Thread Serena S Yuan
Hi,
   I was using the Apache Spark machine learning library in Java
(I posted this issue at
https://stackoverflow.com/questions/55367722/apache-spark-in-java-machine-learning-com-github-fommil-netlib-f2jblas-dscalf?noredirect=1#comment97464462_55367722
) and hit the following error while trying to train a logistic regression
classifier:

WARN  BLAS:61 - Failed to load implementation from:
com.github.fommil.netlib.NativeSystemBLAS
WARN  BLAS:61 - Failed to load implementation from:
com.github.fommil.netlib.NativeRefBLAS
Exception in thread "main" java.lang.NoClassDefFoundError: org/netlib/blas/Dscal
at com.github.fommil.netlib.F2jBLAS.dscal(F2jBLAS.java:176)
at org.apache.spark.ml.linalg.BLAS$.scal(BLAS.scala:223)

I have included the following in my pom file:

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>javaspark</groupId>
  <artifactId>javaspark</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.0.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.0.2</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_2.11</artifactId>
      <version>2.0.2</version>
    </dependency>
    <dependency>
      <groupId>com.github.fommil.netlib</groupId>
      <artifactId>all</artifactId>
      <version>1.1.2</version>
      <type>pom</type>
    </dependency>
    <dependency>
      <groupId>net.sourceforge.f2j</groupId>
      <artifactId>arpack_combined_all</artifactId>
      <version>0.1</version>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src</sourceDirectory>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.8.0</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>


Any suggestions?
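
A hedged diagnostic rather than a definitive fix: org/netlib/blas/Dscal is
expected to come from the F2J jars pulled in by the netlib "all" and
arpack_combined_all dependencies above, so it may help to confirm that they
actually resolve and that they reach the driver and executors at runtime (for
example via a shaded/fat jar or spark-submit --jars) instead of only being on
the compile classpath:

# Show the resolved dependency tree and look for com.github.fommil.netlib
# and net.sourceforge.f2j entries
mvn dependency:tree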

Thank you,
 Serena Sian Yuan

-- 
Sian Ees Super.




Re: spark.submit.deployMode: cluster

2019-03-28 Thread Pat Ferrel
Thanks, are you referring to
https://github.com/spark-jobserver/spark-jobserver or the undocumented REST
job server included in Spark?


From: Jason Nerothin  
Reply: Jason Nerothin  
Date: March 28, 2019 at 2:53:05 PM
To: Pat Ferrel  
Cc: Felix Cheung 
, Marcelo
Vanzin  , user
 
Subject:  Re: spark.submit.deployMode: cluster

Check out the Spark Jobs API... it sits behind a REST service...


On Thu, Mar 28, 2019 at 12:29 Pat Ferrel  wrote:

> ;-)
>
> Great idea. Can you suggest a project?
>
> Apache PredictionIO uses spark-submit (very ugly) and Apache Mahout only
> launches trivially in test apps since most uses are as a lib.
>
>
> From: Felix Cheung  
> Reply: Felix Cheung 
> 
> Date: March 28, 2019 at 9:42:31 AM
> To: Pat Ferrel  , Marcelo
> Vanzin  
> Cc: user  
> Subject:  Re: spark.submit.deployMode: cluster
>
> If anyone wants to improve docs please create a PR.
>
> lol
>
>
> But seriously you might want to explore other projects that manage job
> submission on top of spark instead of rolling your own with spark-submit.
>
>
> --
> *From:* Pat Ferrel 
> *Sent:* Tuesday, March 26, 2019 2:38 PM
> *To:* Marcelo Vanzin
> *Cc:* user
> *Subject:* Re: spark.submit.deployMode: cluster
>
> Ahh, thank you indeed!
>
> It would have saved us a lot of time if this had been documented. I know,
> OSS so contributions are welcome… I can also imagine your next comment; “If
> anyone wants to improve docs see the Apache contribution rules and create a
> PR.” or something like that.
>
> BTW the code where the context is known and can be used is what I’d call a
> Driver and since all code is copied to nodes and is known in jars, it was
> not obvious to us that this rule existed but it does make sense.
>
> We will need to refactor our code to use spark-submit it appears.
>
> Thanks again.
>
>
> From: Marcelo Vanzin  
> Reply: Marcelo Vanzin  
> Date: March 26, 2019 at 1:59:36 PM
> To: Pat Ferrel  
> Cc: user  
> Subject:  Re: spark.submit.deployMode: cluster
>
> If you're not using spark-submit, then that option does nothing.
>
> If by "context creation API" you mean "new SparkContext()" or an
> equivalent, then you're explicitly creating the driver inside your
> application.
>
> On Tue, Mar 26, 2019 at 1:56 PM Pat Ferrel  wrote:
> >
> > I have a server that starts a Spark job using the context creation API.
> It DOES NOT use spark-submit.
> >
> > I set spark.submit.deployMode = “cluster”
> >
> > In the GUI I see 2 workers with 2 executors. The link for running
> application “name” goes back to my server, the machine that launched the
> job.
> >
> > This is spark.submit.deployMode = “client” according to the docs. I set
> the Driver to run on the cluster but it runs on the client, ignoring the
> spark.submit.deployMode.
> >
> > Is this as expected? It is documented nowhere I can find.
> >
>
>
> --
> Marcelo
>
> --
Thanks,
Jason


Re: spark.submit.deployMode: cluster

2019-03-28 Thread Jason Nerothin
Check out the Spark Jobs API... it sits behind a REST service...


On Thu, Mar 28, 2019 at 12:29 Pat Ferrel  wrote:

> ;-)
>
> Great idea. Can you suggest a project?
>
> Apache PredictionIO uses spark-submit (very ugly) and Apache Mahout only
> launches trivially in test apps since most uses are as a lib.
>
>
> From: Felix Cheung  
> Reply: Felix Cheung 
> 
> Date: March 28, 2019 at 9:42:31 AM
> To: Pat Ferrel  , Marcelo
> Vanzin  
> Cc: user  
> Subject:  Re: spark.submit.deployMode: cluster
>
> If anyone wants to improve docs please create a PR.
>
> lol
>
>
> But seriously you might want to explore other projects that manage job
> submission on top of spark instead of rolling your own with spark-submit.
>
>
> --
> *From:* Pat Ferrel 
> *Sent:* Tuesday, March 26, 2019 2:38 PM
> *To:* Marcelo Vanzin
> *Cc:* user
> *Subject:* Re: spark.submit.deployMode: cluster
>
> Ahh, thank you indeed!
>
> It would have saved us a lot of time if this had been documented. I know,
> OSS so contributions are welcome… I can also imagine your next comment; “If
> anyone wants to improve docs see the Apache contribution rules and create a
> PR.” or something like that.
>
> BTW the code where the context is known and can be used is what I’d call a
> Driver and since all code is copied to nodes and is known in jars, it was
> not obvious to us that this rule existed but it does make sense.
>
> We will need to refactor our code to use spark-submit it appears.
>
> Thanks again.
>
>
> From: Marcelo Vanzin  
> Reply: Marcelo Vanzin  
> Date: March 26, 2019 at 1:59:36 PM
> To: Pat Ferrel  
> Cc: user  
> Subject:  Re: spark.submit.deployMode: cluster
>
> If you're not using spark-submit, then that option does nothing.
>
> If by "context creation API" you mean "new SparkContext()" or an
> equivalent, then you're explicitly creating the driver inside your
> application.
>
> On Tue, Mar 26, 2019 at 1:56 PM Pat Ferrel  wrote:
> >
> > I have a server that starts a Spark job using the context creation API.
> It DOES NOT use spark-submit.
> >
> > I set spark.submit.deployMode = “cluster”
> >
> > In the GUI I see 2 workers with 2 executors. The link for running
> application “name” goes back to my server, the machine that launched the
> job.
> >
> > This is spark.submit.deployMode = “client” according to the docs. I set
> the Driver to run on the cluster but it runs on the client, ignoring the
> spark.submit.deployMode.
> >
> > Is this as expected? It is documented nowhere I can find.
> >
>
>
> --
> Marcelo
>
> --
Thanks,
Jason


Re: How to extract data in parallel from RDBMS tables

2019-03-28 Thread Jason Nerothin
Yes.

If you use the numPartitions option, your max parallelism will be that
number. See also: partitionColumn, lowerBound, and upperBound

https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
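
For illustration, a minimal sketch of a single partitioned JDBC read using those
options; the URL, table, column, and bounds are hypothetical, partitionColumn must
be numeric, date, or timestamp, and numPartitions also caps the number of
concurrent JDBC connections:

val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "orders")
  .option("user", "scott")
  .option("password", "tiger")
  .option("partitionColumn", "order_id")  // column used to split the table into ranges
  .option("lowerBound", "1")              // min value of partitionColumn
  .option("upperBound", "1000000")        // max value of partitionColumn
  .option("numPartitions", "8")           // upper limit on parallel reads/connections
  .load()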

On Wed, Mar 27, 2019 at 23:06 Surendra , Manchikanti <
surendra.manchika...@gmail.com> wrote:

> Hi All,
>
> Is there any way to copy all the tables in parallel from RDBMS using
> Spark? We are looking for a functionality similar to Sqoop.
>
> Thanks,
> Surendra
>
> --
Thanks,
Jason


Re: Spark Profiler

2019-03-28 Thread bo yang
Yeah, these options are very valuable. Just to add another option :) We built
a JVM profiler (https://github.com/uber-common/jvm-profiler) to monitor and
profile Spark applications at large scale (e.g. sending metrics to Kafka /
Hive for batch analysis). People could try it as well.


On Wed, Mar 27, 2019 at 1:49 PM Jack Kolokasis 
wrote:

> Thanks for your reply. Your help is very valuable and all these links are
> helpful (especially your example).
>
> Best Regards
>
> --Iacovos
> On 3/27/19 10:42 PM, Luca Canali wrote:
>
> I find that the Spark metrics system is quite useful to gather resource
> utilization metrics of Spark applications, including CPU, memory and I/O.
>
> If you are interested, there is an example of how this works for us at:
> https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark
> If instead you are rather looking at ways to instrument your Spark code
> with performance metrics, Spark task metrics and event listeners are quite
> useful for that. See also
> https://github.com/apache/spark/blob/master/docs/monitoring.md and
> https://github.com/LucaCanali/sparkMeasure
>
>
>
> Regards,
>
> Luca
>
>
>
> *From:* manish ranjan  
> *Sent:* Tuesday, March 26, 2019 15:24
> *To:* Jack Kolokasis  
> *Cc:* user  
> *Subject:* Re: Spark Profiler
>
>
>
> I have found Ganglia very helpful in understanding network I/O, CPU and
> memory usage for a given Spark cluster.
>
> I have not used it, but have heard good things about Dr. Elephant (which I
> think was contributed by LinkedIn, but I'm not 100% sure).
>
>
>
> On Tue, Mar 26, 2019, 5:59 AM Jack Kolokasis 
> wrote:
>
> Hello all,
>
>  I am looking for a spark profiler to trace my application to find
> the bottlenecks. I need to trace CPU usage, Memory Usage and I/O usage.
>
> I am looking forward for your reply.
>
> --Iacovos
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
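
As a concrete illustration of the sparkMeasure link quoted above, here is a
minimal sketch of collecting stage-level task metrics around a query, assuming
the API shown in the sparkMeasure README (package coordinates depend on your
Spark/Scala version):

// Start e.g. with: spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.11:0.13
val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)

// Runs the enclosed action and reports aggregated stage metrics
// (CPU time, shuffle bytes, I/O, etc.) when it finishes.
stageMetrics.runAndMeasure {
  spark.sql("select count(*) from range(1000) cross join range(1000)").show()
}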


Re: spark.submit.deployMode: cluster

2019-03-28 Thread Pat Ferrel
;-)

Great idea. Can you suggest a project?

Apache PredictionIO uses spark-submit (very ugly) and Apache Mahout only
launches trivially in test apps since most uses are as a lib.


From: Felix Cheung  
Reply: Felix Cheung  
Date: March 28, 2019 at 9:42:31 AM
To: Pat Ferrel  , Marcelo
Vanzin  
Cc: user  
Subject:  Re: spark.submit.deployMode: cluster

If anyone wants to improve docs please create a PR.

lol


But seriously you might want to explore other projects that manage job
submission on top of spark instead of rolling your own with spark-submit.


--
*From:* Pat Ferrel 
*Sent:* Tuesday, March 26, 2019 2:38 PM
*To:* Marcelo Vanzin
*Cc:* user
*Subject:* Re: spark.submit.deployMode: cluster

Ahh, thank you indeed!

It would have saved us a lot of time if this had been documented. I know,
OSS so contributions are welcome… I can also imagine your next comment; “If
anyone wants to improve docs see the Apache contribution rules and create a
PR.” or something like that.

BTW the code where the context is known and can be used is what I’d call a
Driver and since all code is copied to nodes and is known in jars, it was
not obvious to us that this rule existed but it does make sense.

We will need to refactor our code to use spark-submit it appears.

Thanks again.


From: Marcelo Vanzin  
Reply: Marcelo Vanzin  
Date: March 26, 2019 at 1:59:36 PM
To: Pat Ferrel  
Cc: user  
Subject:  Re: spark.submit.deployMode: cluster

If you're not using spark-submit, then that option does nothing.

If by "context creation API" you mean "new SparkContext()" or an
equivalent, then you're explicitly creating the driver inside your
application.

On Tue, Mar 26, 2019 at 1:56 PM Pat Ferrel  wrote:
>
> I have a server that starts a Spark job using the context creation API.
It DOES NOT use spark-submit.
>
> I set spark.submit.deployMode = “cluster”
>
> In the GUI I see 2 workers with 2 executors. The link for running
application “name” goes back to my server, the machine that launched the
job.
>
> This is spark.submit.deployMode = “client” according to the docs. I set
the Driver to run on the cluster but it runs on the client, ignoring the
spark.submit.deployMode.
>
> Is this as expected? It is documented nowhere I can find.
>


--
Marcelo


Re: Where does the Driver run?

2019-03-28 Thread Pat Ferrel
Thanks for the pointers. We’ll investigate.

We have been told that the “Driver” is run in the launching JVM because
deployMode = cluster is ignored if spark-submit is not used to launch.

You are saying that there is a loophole and if you use one of these client
classes there is a way to run part of the app on the cluster, and you have
seen this for Yarn?

To explain more, we create a SparkConf, and then a SparkContext, which we
pass around implicitly to functions that I would define as the Spark
Driver. It seems that if you do not use spark-submit, the entire launching
app/JVM process is considered the Driver AND is always run in client mode.

I hope your loophole pays off or we will have to do a major refactoring.
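
One programmatic route that is not mentioned in this exchange, and which might
avoid the refactoring, is org.apache.spark.launcher.SparkLauncher: it shells out
to spark-submit under the hood and therefore honors a cluster deploy mode. A
minimal sketch, with the Spark home, jar path, main class, and master URL as
placeholders:

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Launches the application jar so that the driver is placed by the cluster
// manager, because the deploy mode goes through the regular spark-submit path.
val handle: SparkAppHandle = new SparkLauncher()
  .setSparkHome("/opt/spark")                      // placeholder
  .setAppResource("/path/to/my-app-assembly.jar")  // placeholder: jar with the driver code
  .setMainClass("com.example.TrainingJob")         // placeholder
  .setMaster("spark://master-address:7077")
  .setDeployMode("cluster")
  .setConf(SparkLauncher.DRIVER_MEMORY, "4g")
  .startApplication()                              // returns a handle for monitoring state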


From: Jianneng Li  
Reply: Jianneng Li  
Date: March 28, 2019 at 2:03:47 AM
To: p...@occamsmachete.com  
Cc: andrew.m...@gmail.com  ,
user@spark.apache.org  ,
ak...@hacked.work  
Subject:  Re: Where does the Driver run?

Hi Pat,

The driver runs in the same JVM as SparkContext. You didn't go into detail
about how you "launch" the job (i.e. how the SparkContext is created), so
it's hard for me to guess where the driver is.

For reference, we've had success launching Spark programmatically to YARN
in cluster mode by creating a SparkConf like you did and using it to call
this class:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala

I haven't tried this myself, but for standalone mode you might be able to
use this:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/Client.scala

Lastly, you can always check where Spark processes run by executing ps on
the machine, i.e. `ps aux | grep java`.

Best,

Jianneng



*From:* Pat Ferrel 
*Date:* Monday, March 25, 2019 at 12:58 PM
*To:* Andrew Melo 
*Cc:* user , Akhil Das 
*Subject:* Re: Where does the Driver run?



I’m beginning to agree with you and find it rather surprising that this is
mentioned nowhere explicitly (maybe I missed?). It is possible to serialize
code to be executed in executors to various nodes. It also seems possible
to serialize the “driver” bits of code although I’m not sure how the
boundary would be defined. All code is in the jars we pass to Spark so
until now I did not question the docs.



I see no mention of a distinction between running a driver in spark-submit
vs being programmatically launched for any of the Spark Master types:
Standalone, Yarn, Mesos, k8s.



We are building a Machine Learning Server in OSS. It has pluggable Engines
for different algorithms. Some of these use Spark so it is highly desirable
to offload driver code to the cluster since we don’t want the driver
embedded in the Server process. The Driver portion of our training workflow
could be very large indeed and so could force the scaling of the server to
worst case.



I hope someone knows how to run “Driver” code on the cluster when our
server is launching the code. So deployMode = cluster, deploy method =
programmatic launch.




From: Andrew Melo  
Reply: Andrew Melo  
Date: March 25, 2019 at 11:40:07 AM
To: Pat Ferrel  
Cc: Akhil Das  , user
 
Subject:  Re: Where does the Driver run?



Hi Pat,



Indeed, I don't think that it's possible to use cluster mode w/o
spark-submit. All the docs I see appear to always describe needing to use
spark-submit for cluster mode -- it's not even compatible with spark-shell.
But it makes sense to me -- if you want Spark to run your application's
driver, you need to package it up and send it to the cluster manager. You
can't start spark one place and then later migrate it to the cluster. It's
also why you can't use spark-shell in cluster mode either, I think.



Cheers

Andrew



On Mon, Mar 25, 2019 at 11:22 AM Pat Ferrel  wrote:

In the GUI while the job is running the app-id link brings up logs to both
executors, The “name” link goes to 4040 of the machine that launched the
job but is not resolvable right now so the page is not shown. I’ll try the
netstat but the use of port 4040 was a good clue.



By what you say below this indicates the Driver is running on the launching
machine, the client to the Spark Cluster. This should be the case in
deployMode = client.



Can someone explain what is going on? The evidence seems to say that
deployMode = cluster *does not work* as described unless you use
spark-submit (and I’m only guessing at that).



Further; if we don’t use spark-submit we can’t use deployMode = cluster ???




From: Akhil Das  
Reply: Akhil Das  
Date: March 24, 2019 at 7:45:07 PM
To: Pat Ferrel  
Cc: user  
Subject:  Re: Where does the Driver run?



There's also a driver UI (usually available on port 4040). After running
your code (I assume you are running it on your machine), visit
localhost:4040 and you will get the driver UI.



If you think the driver is running on your master/executor nodes, login to
those machines and do a



   netstat -napt | grep -I listen



You will see the driver 

Re: spark.submit.deployMode: cluster

2019-03-28 Thread Felix Cheung
If anyone wants to improve docs please create a PR.

lol


But seriously you might want to explore other projects that manage job 
submission on top of spark instead of rolling your own with spark-submit.



From: Pat Ferrel 
Sent: Tuesday, March 26, 2019 2:38 PM
To: Marcelo Vanzin
Cc: user
Subject: Re: spark.submit.deployMode: cluster

Ahh, thank you indeed!

It would have saved us a lot of time if this had been documented. I know, OSS 
so contributions are welcome… I can also imagine your next comment; “If anyone 
wants to improve docs see the Apache contribution rules and create a PR.” or 
something like that.

BTW the code where the context is known and can be used is what I’d call a 
Driver and since all code is copied to nodes and is known in jars, it was not 
obvious to us that this rule existed but it does make sense.

We will need to refactor our code to use spark-submit it appears.

Thanks again.


From: Marcelo Vanzin 
Reply: Marcelo Vanzin 
Date: March 26, 2019 at 1:59:36 PM
To: Pat Ferrel 
Cc: user 
Subject:  Re: spark.submit.deployMode: cluster

If you're not using spark-submit, then that option does nothing.

If by "context creation API" you mean "new SparkContext()" or an
equivalent, then you're explicitly creating the driver inside your
application.

On Tue, Mar 26, 2019 at 1:56 PM Pat Ferrel 
mailto:p...@occamsmachete.com>> wrote:
>
> I have a server that starts a Spark job using the context creation API. It 
> DOES NOT use spark-submit.
>
> I set spark.submit.deployMode = “cluster”
>
> In the GUI I see 2 workers with 2 executors. The link for running application 
> “name” goes back to my server, the machine that launched the job.
>
> This is spark.submit.deployMode = “client” according to the docs. I set the 
> Driver to run on the cluster but it runs on the client, ignoring the 
> spark.submit.deployMode.
>
> Is this as expected? It is documented nowhere I can find.
>


--
Marcelo


Adaptive query execution and CBO

2019-03-28 Thread Tomasz Krol
I asked this question a while ago on StackOverflow but got no response, so
I am trying here :)

What's your experience with using adaptive query execution and CBO? Do you
enable them together, or separately? Do you experience any issues using
them? For example, I've seen that bucketing doesn't work properly (a sort
merge join still happens) with adaptive QE enabled.
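
For reference, a minimal sketch of the configuration involved (property names as
in the Spark SQL configuration docs; the table and columns are placeholders, and
CBO only helps once statistics have been collected):

// Cost-based optimizer and join reordering
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

// Adaptive query execution
spark.conf.set("spark.sql.adaptive.enabled", "true")

// CBO relies on table/column statistics being available
spark.sql("ANALYZE TABLE sales.orders COMPUTE STATISTICS FOR COLUMNS order_id, amount")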

Thanks

Tom


-- 
Tomasz Krol
patric...@gmail.com


Re: Where does the Driver run?

2019-03-28 Thread Mich Talebzadeh
Hi,

I have explained this in my LinkedIn article "The Operational
Advantages of Spark as a Distributed Processing Framework".

An extract

*2) YARN Deployment Modes*

The term *deployment mode of Spark* simply means where the driver
program will be run. There are two modes, namely *Spark Client Mode*
and *Spark Cluster Mode*. These are described below:

*In the Client mode,* *the driver daemon runs in the node through which you
submit the Spark job to your cluster.* This is often done through the Edge
Node. This mode is valuable when you want to use Spark interactively, as
in our case where we would like to display high-value prices in the
dashboard. In Client mode you do not need to reserve any resources from
your cluster for the driver daemon.

*In Cluster mode,* *you submit the Spark job to your cluster and the driver
daemon runs inside your cluster, in the application master*. In this mode you
do not get to use the Spark job interactively, as the client through which
you submit the job is gone as soon as it successfully submits the job to the
cluster. You will have to reserve some resources for the driver daemon
process as it will be running in your cluster.
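
To make the distinction concrete, the same application submitted in the two
modes differs only in the --deploy-mode flag (class name and jar are placeholders):

# Client mode: the driver JVM runs on the submitting (edge) node
spark-submit --master yarn --deploy-mode client \
  --class com.example.MyJob my-app.jar

# Cluster mode: the driver runs inside the YARN application master on the cluster
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyJob my-app.jar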

HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 23 Mar 2019 at 21:13, Pat Ferrel  wrote:

> I have researched this for a significant amount of time and find answers
> that seem to be for a slightly different question than mine.
>
> The Spark 2.3.3 cluster is running fine. I see the GUI on
> "http://master-address:8080", there are 2 idle workers, as configured.
>
> I have a Scala application that creates a context and starts execution of
> a Job. I *do not use spark-submit*, I start the Job programmatically and
> this is where many explanations forks from my question.
>
> In "my-app" I create a new SparkConf, with the following code (slightly
> abbreviated):
>
>   conf.setAppName(“my-job")
>   conf.setMaster(“spark://master-address:7077”)
>   conf.set(“deployMode”, “cluster”)
>   // other settings like driver and executor memory requests
>   // the driver and executor memory requests are for all mem on the
> slaves, more than
>   // mem available on the launching machine with “my-app"
>   val jars = listJars(“/path/to/lib")
>   conf.setJars(jars)
>   …
>
> When I launch the job I see 2 executors running on the 2 workers/slaves.
> Everything seems to run fine and sometimes completes successfully. Frequent
> failures are the reason for this question.
>
> Where is the Driver running? I don’t see it in the GUI, I see 2 Executors
> taking all cluster resources. With a Yarn cluster I would expect the
> “Driver" to run on/in the Yarn Master but I am using the Spark Standalone
> Master, where is the Drive part of the Job running?
>
> If it is running in the Master, we are in trouble because I start the
> Master on one of my 2 Workers sharing resources with one of the Executors.
> Executor mem + driver mem is > available mem on a Worker. I can change this
> but need to understand where the Driver part of the Spark Job runs. Is it
> in the Spark Master, or inside an Executor, or ???
>
> The “Driver” creates and broadcasts some large data structures so the need
> for an answer is more critical than with more typical tiny Drivers.
>
> Thanks for you help!
>


Re: Streaming data out of spark to a Kafka topic

2019-03-28 Thread Mich Talebzadeh
Hi Gabor,

I will look at the link and see what it provides.

Thanks,


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 27 Mar 2019 at 21:23, Gabor Somogyi 
wrote:

> Hi Mich,
>
> Please take a look at how to write data into Kafka topic with DStreams:
> https://github.com/gaborgsomogyi/spark-dstream-secure-kafka-sink-app/blob/62d64ce368bc07b385261f85f44971b32fe41327/src/main/scala/com/cloudera/spark/examples/DirectKafkaSinkWordCount.scala#L77
> (DStreams has no native Kafka sink, if you need it use Structured
> Streaming)
>
> BR,
> G
>
>
> On Wed, Mar 27, 2019 at 8:47 PM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> In a traditional setup we get data via Kafka into Spark Streaming, do some work
>> and write to a NoSQL database like Mongo, HBase or Aerospike.
>>
>> That part can be done below and is best explained by the code as follows:
>>
>> Once a high-value DF of lookups is created, I want to send the data to a new
>> topic for recipients!
>>
>> val kafkaParams = Map[String, String](
>>   "bootstrap.servers" ->
>> bootstrapServers,
>>   "schema.registry.url" ->
>> schemaRegistryURL,
>>"zookeeper.connect" ->
>> zookeeperConnect,
>>"group.id" -> sparkAppName,
>>"zookeeper.connection.timeout.ms"
>> -> zookeeperConnectionTimeoutMs,
>>"rebalance.backoff.ms" ->
>> rebalanceBackoffMS,
>>"zookeeper.session.timeout.ms" ->
>> zookeeperSessionTimeOutMs,
>>"auto.commit.interval.ms" ->
>> autoCommitIntervalMS
>>  )
>> //val topicsSet = topics.split(",").toSet
>> val topics = Set(topicsValue)
>> val dstream = KafkaUtils.createDirectStream[String, String,
>> StringDecoder, StringDecoder](streamingContext, kafkaParams, topics)
>> // This returns a tuple of key and value (since messages in Kafka are
>> optionally keyed). In this case it is of type (String, String)
>> dstream.cache()
>> //
>> val topicsOut = Set(topicsValueOut)
>> val dstreamOut = KafkaUtils.createDirectStream[String, String,
>> StringDecoder, StringDecoder](streamingContext, kafkaParams, topicsOut)
>> dstreamOut.cache()
>>
>>
>> dstream.foreachRDD
>> { pricesRDD =>
>>   if (!pricesRDD.isEmpty)  // data exists in RDD
>>   {
>> val op_time = System.currentTimeMillis.toString
>> val spark =
>> SparkSessionSingleton.getInstance(pricesRDD.sparkContext.getConf)
>> val sc = spark.sparkContext
>> import spark.implicits._
>> var operation = new operationStruct(op_type, op_time)
>> // Convert RDD[String] to RDD[case class] to DataFrame
>> val RDDString = pricesRDD.map { case (_, value) =>
>> value.split(',') }.map(p =>
>> priceDocument(priceStruct(p(0).toString,p(1).toString,p(2).toString,p(3).toDouble,
>> currency), operation))
>> val df = spark.createDataFrame(RDDString)
>> //df.printSchema
>> var document = df.filter('priceInfo.getItem("price") > 90.0)
>> MongoSpark.save(document, writeConfig)
>>  println("Current time is: " + Calendar.getInstance.getTime)
>>  totalPrices += document.count
>>  var endTimeQuery = System.currentTimeMillis
>>  println("Total Prices added to the collection so far: "
>> +totalPrices+ " , Running for  " + (endTimeQuery -
>> startTimeQuery)/(1000*60)+" Minutes")
>>  // Check if running time > runTime exit
>>  if( (endTimeQuery - startTimeQuery)/(1000*60) > runTime)
>>  {
>>println("\nDuration exceeded " + runTime + " minutes exiting")
>>System.exit(0)
>>  }
>>  // picking up individual arrays -->
>> df.select('otherDetails.getItem("tickerQuotes")(0)) shows first element
>>  //val lookups = df.filter('priceInfo.getItem("ticker") ===
>> tickerWatch && 'priceInfo.getItem("price") > priceWatch)
>>  val lookups = df.filter('priceInfo.getItem("price") > priceWatch)
>>  if(lookups.count > 0) {
>>println("High value tickers")
>>lookups.select('priceInfo.getItem("timeissued").as("Time
>> issued"), 'priceInfo.getItem("ticker").as("Ticker"),
>> 'priceInfo.getItem("price").cast("Double").as("Latest price")).show
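
To actually publish those high-value rows to a second topic (the original
question), a common DStream-era pattern is to write from foreachPartition with a
plain kafka-clients producer, since DStreams have no built-in Kafka sink (as
Gabor notes above). A minimal sketch, assuming the lookups DataFrame and
bootstrapServers value from the quoted code, with the output topic name as a
placeholder:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

lookups.toJSON.rdd.foreachPartition { rows =>
  // One producer per partition, created on the executor (producers are not serializable).
  val props = new Properties()
  props.put("bootstrap.servers", bootstrapServers)
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)
  rows.foreach { json =>
    producer.send(new ProducerRecord[String, String]("highValueTickers", json))
  }
  producer.close()
}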

Re: Where does the Driver run?

2019-03-28 Thread Jianneng Li
Hi Pat,

The driver runs in the same JVM as SparkContext. You didn't go into detail 
about how you "launch" the job (i.e. how the SparkContext is created), so it's 
hard for me to guess where the driver is.

For reference, we've had success launching Spark programmatically to YARN in 
cluster mode by creating a SparkConf like you did and using it to call this 
class: 
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala

I haven't tried this myself, but for standalone mode you might be able to use 
this: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/Client.scala

Lastly, you can always check where Spark processes run by executing ps on the 
machine, i.e. `ps aux | grep java`.

Best,

Jianneng




From: Pat Ferrel 
Date: Monday, March 25, 2019 at 12:58 PM
To: Andrew Melo 
Cc: user , Akhil Das 
Subject: Re: Where does the Driver run?



I’m beginning to agree with you and find it rather surprising that this is 
mentioned nowhere explicitly (maybe I missed?). It is possible to serialize 
code to be executed in executors to various nodes. It also seems possible to 
serialize the “driver” bits of code although I’m not sure how the boundary 
would be defined. All code is in the jars we pass to Spark so until now I did 
not question the docs.



I see no mention of a distinction between running a driver in spark-submit vs 
being programmatically launched for any of the Spark Master types: Standalone, 
Yarn, Mesos, k8s.



We are building a Machine Learning Server in OSS. It has pluggable Engines for 
different algorithms. Some of these use Spark so it is highly desirable to 
offload driver code to the cluster since we don’t want the driver embedded in 
the Server process. The Driver portion of our training workflow could be very 
large indeed and so could force the scaling of the server to worst case.



I hope someone knows how to run “Driver” code on the cluster when our server is 
launching the code. So deployMode = cluster, deploy method = programmatic launch.



From: Andrew Melo 
Reply: Andrew Melo 
Date: March 25, 2019 at 11:40:07 AM
To: Pat Ferrel 
Cc: Akhil Das , user 

Subject:  Re: Where does the Driver run?



Hi Pat,



Indeed, I don't think that it's possible to use cluster mode w/o spark-submit. 
All the docs I see appear to always describe needing to use spark-submit for 
cluster mode -- it's not even compatible with spark-shell. But it makes sense 
to me -- if you want Spark to run your application's driver, you need to 
package it up and send it to the cluster manager. You can't start spark one 
place and then later migrate it to the cluster. It's also why you can't use 
spark-shell in cluster mode either, I think.



Cheers

Andrew



On Mon, Mar 25, 2019 at 11:22 AM Pat Ferrel 
mailto:p...@occamsmachete.com>> wrote:

In the GUI while the job is running the app-id link brings up logs to both 
executors, The “name” link goes to 4040 of the machine that launched the job 
but is not resolvable right now so the page is not shown. I’ll try the netstat 
but the use of port 4040 was a good clue.



By what you say below this indicates the Driver is running on the launching 
machine, the client to the Spark Cluster. This should be the case in deployMode 
= client.



Can someone explain what is going on? The evidence seems to say that deployMode 
= cluster does not work as described unless you use spark-submit (and I’m only 
guessing at that).



Further; if we don’t use spark-submit we can’t use deployMode = cluster ???



From: Akhil Das 
Reply: Akhil Das 
Date: March 24, 2019 at 7:45:07 PM
To: Pat Ferrel 
Cc: user 
Subject:  Re: Where does the Driver run?



There's also a driver UI (usually available on port 4040). After running your
code (I assume you are running it on your machine), visit localhost:4040 and you
will get the driver UI.



If you think the driver is running on your master/executor nodes, login to 
those machines and do a



   netstat -napt | grep -I listen



You will see the driver listening on 404x there, this won't be the case mostly 
as you are not doing Spark-submit or using the deployMode=cluster.



On Mon, 25 Mar 2019, 01:03 Pat Ferrel, 
mailto:p...@occamsmachete.com>> wrote:

Thanks, I have seen this many times in my research. Paraphrasing docs: “in 
deployMode ‘cluster' the Driver runs on a Worker in the cluster”



When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1 with 
addresses that match slaves). When I look at memory usage while the job runs I 
see virtually identical usage on the 2 Workers. This would support your claim 
and contradict Spark docs for deployMode = cluster.