Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Sean Owen
NFS is a simple option for this kind of usage, yes.
But --files makes N copies of the data, so you may not want to use it for
large data, or for data that you need to mutate.


Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Artemis User
Ah, I almost forgot that there is an even easier solution for your
problem, namely to use the --files option in spark-submit. Usage as follows:

  --files FILES   Comma-separated list of files to be placed in the working
                  directory of each executor. File paths of these files in
                  executors can be accessed via SparkFiles.get(fileName).
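
To make that concrete, here is a rough sketch (the file name and the submit
command are made up, not from this thread) of reading a file shipped with
--files from inside a task:

  // Submitted with something like:
  //   spark-submit --files /opt/data/lookup.csv ... my-app.jar
  // Each executor receives its own copy of lookup.csv in its working
  // directory, and SparkFiles.get resolves the executor-local path to it.
  import org.apache.spark.SparkFiles
  import scala.io.Source

  val firstLines = spark.sparkContext
    .parallelize(1 to 4)
    .map { _ =>
      val src = Source.fromFile(SparkFiles.get("lookup.csv"))
      try src.getLines().next() finally src.close()
    }
    .collect()

Note that the path given to --files has to be readable by spark-submit on the
machine where it runs.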


-- ND

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Artemis User
This is a typical file-sharing problem in Spark.  Just setting up HDFS
won't solve it unless you make your local machine part of the cluster.
The Spark server doesn't share files with your local machine unless drives
are mounted on each other.  The best/easiest way to share the data between
your local machine and the Spark server machine is to use NFS (as the Spark
manual suggests).  You can use a common NFS server and mount the /opt/data
drive on both the local and the server machine, or run NFS on either machine
and mount /opt/data on the other.  Regardless, you have to ensure that
/opt/data on both the local and the server machine points to the same
physical drive.  Also don't forget to relax the read/write permissions for
all users on the drive, or map the user IDs between the two machines.


Using FUSE may be an option on the Mac, but NFS is the standard solution for
this type of problem (macOS supports NFS as well).


-- ND


Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Jeff Evans
In your situation, I'd try to do one of the following (in decreasing order
of personal preference):

   1. Restructure things so that you can operate on a local data file, at
   least for the purpose of developing your driver logic.  Don't rely on the
   Metastore or HDFS until you have to.  Structure the application logic so it
   operates on a DataFrame (or Dataset) and doesn't care where it came from.
   Build this data file from your real data (probably a small subset); see the
   sketch after this list.
   2. Develop the logic using spark-shell running on a cluster node, since
   the environment will be all set up already (which, of course, you already
   mentioned).
   3. Set up remote debugging of the driver, open an SSH tunnel to the
   node, and connect from your local laptop to debug/iterate.  Figure out the
   fastest way to rebuild the jar and scp it up to try again.
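
A rough sketch of option 1, with hypothetical class and column names: the
transformation logic only sees a DataFrame, so during development it can be
fed a small local extract and, once submitted on the server, the real data.

  import org.apache.spark.sql.{DataFrame, SparkSession}

  object TransactionReport {
    // All business logic works on a DataFrame and does not know or care
    // where it came from.
    def summarize(transactions: DataFrame): DataFrame =
      transactions.groupBy("account_id").sum("amount")

    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("TransactionReport").getOrCreate()
      // The input path is the only environment-specific piece: a small local
      // sample while iterating on the laptop, the server-side path when
      // running there.
      val input = args.headOption.getOrElse("data/transactions-sample.parquet")
      summarize(spark.read.parquet(input)).show()
    }
  }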




Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Ryan Victory
A key part of what I'm trying to do involves NOT having to bring the data
"through" the driver in order to get the cluster to work on it (which would
involve a network hop from server to laptop and another from laptop to
server). I'd rather have the data stay on the server and the driver stay on
my laptop if possible, but I'm guessing the Spark APIs/topology weren't
designed that way.

What I was hoping for was some way to be able to say val df =
spark.sql("SELECT * FROM parquet.`local:///opt/data/transactions.parquet`")
or similar to convince Spark not to move the data. I'd imagine that if I used
HDFS, data locality would kick in anyway to prevent network shuffles between
the driver and the cluster, but even then I wonder (based on what you guys
are saying) if I'm wrong.

Perhaps I'll just have to modify the workflow to move the JAR to the server
and execute it from there. This isn't ideal but it's better than nothing.

-Ryan



Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Chris Coutinho
I'm also curious if this is possible, so while I can't offer a solution
maybe you could try the following.

The driver and executor nodes need to have access to the same
(distributed) file system, so you could try to mount the file system on
your laptop, locally, and then try to submit jobs and/or use the
spark-shell while connected to the same system.

A quick Google search led me to this article where someone shows how to
mount an HDFS filesystem locally. It appears that Cloudera supports some
kind of FUSE-based library, which may be useful for your use case.

https://idata.co.il/2018/10/how-to-connect-hdfs-to-local-filesystem/






Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Ryan Victory
Thanks Apostolos,

I'm trying to avoid standing up HDFS just for this use case (single node).

-Ryan



Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Apostolos N. Papadopoulos

Hi Ryan,

since the driver is on your laptop, I guess you need to specify a full URL in
order to access a remote file.

For example, when I use Spark over HDFS I specify the file like
hdfs://blablabla, i.e. a URL pointing to where the namenode can answer. I
believe something similar must be done here.
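
As a minimal illustration of that idea (the namenode host and port below are
made up), the path then carries its own scheme and host, so it resolves the
same way from the laptop-side driver as from the executors:

  // Works only if a distributed filesystem (e.g. HDFS) is actually serving
  // the data at that URL.
  val df = spark.sql(
    "SELECT * FROM parquet.`hdfs://spark-server:8020/opt/data/transactions.parquet`")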

all the best,

Apostolos




--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papad...@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Ryan Victory
Hello!

I have been tearing my hair out trying to solve this problem. Here is my
setup:

1. I have Spark running on a server in standalone mode with data on the
filesystem of the server itself (/opt/data/).
2. I have an instance of a Hive Metastore server running (backed by
MariaDB) on the same server
3. I have a laptop where I am developing my spark jobs (Scala)

I have configured Spark to use the metastore and set the warehouse
directory to be in /opt/data/warehouse/. What I am trying to accomplish are
a couple of things:

1. I am trying to submit Spark jobs (via JARs) using spark-submit, but have
the driver run on my local machine (my laptop). I want the jobs to use the
data ON THE SERVER and not try to reference it from my local machine. If I
do something like this:

val df = spark.sql("SELECT * FROM parquet.`/opt/data/transactions.parquet`")

I get an error that the path doesn't exist (because it's trying to find it
on my laptop). If I run the same thing in a spark-shell on the spark server
itself, there isn't an issue because the driver has access to the data. If
I submit the job with submit-mode=cluster then it works too, because the
driver is on the cluster. I don't want this; I want to get the results on
my laptop.

How can I force Spark to read the data from the cluster's filesystem and
not the driver's?

2. I have set up a Hive Metastore and created a table (in the spark shell on
the spark server itself). The data in the warehouse is on the local
filesystem. When I create a Spark application JAR and try to run it from my
laptop, I get the same problem as #1, namely that it tries to find the
warehouse directory on my laptop itself.

Am I crazy? Perhaps this isn't a supported way to use Spark? Any help or
insights are much appreciated!

-Ryan Victory


Re: how to manage HBase connections in Executors of Spark Streaming ?

2020-11-25 Thread chen kevin
  1.  On Kerberos ticket expiry:
 *   Usually you don't need to worry about it; you can use the local keytab
present on every node in the Hadoop cluster.
 *   If the keytab is not available on the nodes of your Hadoop cluster, you
will need to refresh the keytab in every executor periodically.
  2.  On best practices for managing HBase connections with Kerberos
authentication: the attached demo.java shows how to obtain the HBase connection.
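
For illustration only (this is a minimal sketch, not the attached demo.java;
the ZooKeeper quorum, principal and keytab path are placeholders), one common
pattern is a lazily created, per-executor connection that tasks obtain from a
singleton:

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}
  import org.apache.hadoop.security.UserGroupInformation

  object HBaseConnectionHolder {
    private var connection: Connection = _

    // Called from inside tasks (e.g. in foreachPartition), so the login and
    // connection happen once per executor JVM.
    def get(): Connection = synchronized {
      if (connection == null || connection.isClosed) {
        val conf = HBaseConfiguration.create()
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3")        // placeholder quorum
        UserGroupInformation.setConfiguration(conf)
        UserGroupInformation.loginUserFromKeytab(
          "sparkuser@EXAMPLE.COM",                               // placeholder principal
          "/etc/security/keytabs/sparkuser.keytab")              // keytab local to each node
        connection = ConnectionFactory.createConnection(conf)
        sys.addShutdownHook(if (!connection.isClosed) connection.close())  // close on executor exit
      }
      // Re-login from the keytab if the ticket is about to expire (no-op otherwise).
      UserGroupInformation.getLoginUser.checkTGTAndReloginFromKeytab()
      connection
    }
  }

A streaming job would then call HBaseConnectionHolder.get() inside
foreachPartition rather than creating a connection per record.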




From: big data 
Date: Tuesday, November 24, 2020 at 1:58 PM
To: "user@spark.apache.org" 
Subject: how to manage HBase connections in Executors of Spark Streaming ?


Hi,

Are there any best practices for managing HBase connections with Kerberos
authentication in a Spark Streaming (YARN) environment?

I want to know how executors manage the HBase connections: how to create
them, close them, and refresh Kerberos tickets when they expire.

Thanks.


demo.java
Description: demo.java

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: How to submit a job via REST API?

2020-11-25 Thread Zhou Yang
Hi all,

I found the solution through the source code. Appending the --conf key-value
pairs to `sparkProperties` works.
For example:

./spark-submit \
--conf foo=bar \
xxx

is equivalent to

{
  "xxx" : "yyy",
  "sparkProperties" : {
    "foo" : "bar"
  }
}
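
For reference, a fuller request body for the standalone master's REST
endpoint (POST http://<master>:6066/v1/submissions/create, as in the gist
linked in the original question) might look roughly like the following; the
class name, jar path and versions are placeholders:

{
  "action" : "CreateSubmissionRequest",
  "clientSparkVersion" : "3.0.1",
  "appResource" : "file:/path/to/my-app.jar",
  "mainClass" : "com.example.MyApp",
  "appArgs" : [ "arg1", "arg2" ],
  "environmentVariables" : { "SPARK_ENV_LOADED" : "1" },
  "sparkProperties" : {
    "spark.master" : "spark://spark-server:7077",
    "spark.app.name" : "MyApp",
    "spark.submit.deployMode" : "cluster",
    "spark.jars" : "file:/path/to/my-app.jar",
    "foo" : "bar"
  }
}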

Thanks for your reply.

On Nov 25, 2020 at 3:55 PM, vaquar khan wrote:

Hi Yang,

Please find the following link:

https://stackoverflow.com/questions/63677736/spark-application-as-a-rest-service/63678337#63678337

Regards,
Vaquar khan

On Wed, Nov 25, 2020 at 12:40 AM Sonal Goyal wrote:
You should be able to supply the --conf and its values as part of the appArgs
argument

Cheers,
Sonal
Nube Technologies
Join me at
Data Con LA Oct 23 | Big Data Conference Europe. Nov 24 | GIDS AI/ML Dec 3




On Tue, Nov 24, 2020 at 11:31 AM Dennis Suhari wrote:
Hi Yang,

I am using Livy Server for submitting jobs.

Br,

Dennis



Sent from my iPhone

On Nov 24, 2020 at 03:34, Zhou Yang wrote:


Dear experts,

I found a convenient way to submit a job via the REST API at
https://gist.github.com/arturmkrtchyan/5d8559b2911ac951d34a#file-submit_job-sh.
But I do not know whether I can append a `--conf` parameter the way I do in
spark-submit. Can someone help me with this issue?

Regards, Yang



--
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago