Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Ryan Victory
A key part of what I'm trying to do involves NOT having to bring the data
"through" the driver in order to get the cluster to work on it (which would
involve a network hop from server to laptop and another from laptop to
server). I'd rather have the data stay on the server and the driver stay on
my laptop if possible, but I'm guessing the Spark APIs/topology weren't
designed that way.

What I was hoping for was some way to be able to say
val df = spark.sql("SELECT * FROM parquet.`local:///opt/data/transactions.parquet`")
or similar to convince Spark not to move the data. I'd imagine that if I used
HDFS, data locality would kick in anyway and avoid moving data between the
driver and the cluster, but even then I wonder (based on what you guys are
saying) if I'm wrong.
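
I assume being explicit about the scheme wouldn't change anything either; as
far as I can tell the driver still does the file listing itself, so something
like this (purely illustrative) would fail from the laptop in the same way:

// Expected to fail with "path does not exist" when the driver runs on the
// laptop, because file:// resolves on whatever machine does the listing.
val df = spark.sql(
  "SELECT * FROM parquet.`file:///opt/data/transactions.parquet`")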

Perhaps I'll just have to modify the workflow to move the JAR to the server
and execute it from there. This isn't ideal but it's better than nothing.
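
Concretely, something along these lines (hostnames, paths, and class names are
just placeholders):

# copy the JAR to the server and submit it there, so the driver sits next to the data
scp target/scala-2.12/my-job.jar sparkserver:/tmp/
ssh sparkserver spark-submit \
  --master spark://sparkserver:7077 \
  --class com.example.TransactionsJob \
  /tmp/my-job.jar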

-Ryan

On Wed, Nov 25, 2020 at 9:13 AM Chris Coutinho wrote:

> I'm also curious if this is possible, so while I can't offer a solution
> maybe you could try the following.
>
> The driver and executor nodes need to have access to the same
> (distributed) file system, so you could try to mount the file system to
> your laptop, locally, and then try to submit jobs and/or use the
> spark-shell while connected to the same system.
>
> A quick Google search turned up this article, where someone shows how to
> mount HDFS locally. It appears that Cloudera supports some kind of
> FUSE-based library, which may be useful for your use case.
>
> https://idata.co.il/2018/10/how-to-connect-hdfs-to-local-filesystem/
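>
> Judging by the article, the mount ends up looking something like this
> (untested on my end; hostname, port, and mount point are placeholders, and
> the exact command depends on the Hadoop/Cloudera packaging):
>
> hadoop-fuse-dfs dfs://namenode-host:8020 /mnt/hdfs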
>
> On Wed, 2020-11-25 at 08:51 -0600, Ryan Victory wrote:
>
> Hello!
>
> I have been tearing my hair out trying to solve this problem. Here is my
> setup:
>
> 1. I have Spark running on a server in standalone mode with data on the
> filesystem of the server itself (/opt/data/).
> 2. I have an instance of a Hive Metastore server running (backed by
> MariaDB) on the same server
> 3. I have a laptop where I am developing my spark jobs (Scala)
>
> I have configured Spark to use the metastore and set the warehouse
> directory to be in /opt/data/warehouse/. What I am trying to accomplish are
> a couple of things:
>
> 1. I am trying to submit Spark jobs (via JARs) using spark-submit, but
> have the driver run on my local machine (my laptop). I want the jobs to use
> the data ON THE SERVER and not try to reference it from my local machine.
> If I do something like this:
>
> val df = spark.sql("SELECT * FROM
> parquet.`/opt/data/transactions.parquet`")
>
> I get an error that the path doesn't exist (because it's trying to find it
> on my laptop). If I run the same thing in a spark-shell on the spark server
> itself, there isn't an issue because the driver has access to the data. If
> I submit the job with submit-mode=cluster then it works too because the
> driver is on the cluster. I don't want this, I want to get the results on
> my laptop.
>
> How can I force Spark to read the data from the cluster's filesystem and
> not the driver's?
>
> 2. I have set up a Hive Metastore and created a table (in the spark shell
> on the spark server itself). The data in the warehouse is in the local
> filesystem. When I create a spark application JAR and try to run it from my
> laptop, I get the same problem as #1, namely that it tries to find the
> warehouse directory on my laptop itself.
>
> Am I crazy? Perhaps this isn't a supported way to use Spark? Any help or
> insights are much appreciated!
>
> -Ryan Victory
>
>
>


Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Ryan Victory
Thanks Apostolos,

I'm trying to avoid standing up HDFS just for this use case (single node).

-Ryan

On Wed, Nov 25, 2020 at 8:56 AM Apostolos N. Papadopoulos <
papad...@csd.auth.gr> wrote:

> Hi Ryan,
>
> since the driver is on your laptop, in order to access a remote file you
> need to specify a URL for it, I guess.
>
> For example, when I am using Spark over HDFS, I specify the file like
> hdfs://blablabla, which is a URL that the namenode can answer. I believe
> that something similar must be done here.
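>
> Something like this, perhaps (hostname and port are placeholders; 8020 is
> just a common namenode default):
>
> val df = spark.sql(
>   "SELECT * FROM parquet.`hdfs://namenode-host:8020/opt/data/transactions.parquet`")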
>
> all the best,
>
> Apostolos
>
>
> On 25/11/20 16:51, Ryan Victory wrote:
> > Hello!
> >
> > I have been tearing my hair out trying to solve this problem. Here is
> > my setup:
> >
> > 1. I have Spark running on a server in standalone mode with data on
> > the filesystem of the server itself (/opt/data/).
> > 2. I have an instance of a Hive Metastore server running (backed by
> > MariaDB) on the same server
> > 3. I have a laptop where I am developing my spark jobs (Scala)
> >
> > I have configured Spark to use the metastore and set the warehouse
> > directory to be in /opt/data/warehouse/. What I am trying to
> > accomplish are a couple of things:
> >
> > 1. I am trying to submit Spark jobs (via JARs) using spark-submit, but
> > have the driver run on my local machine (my laptop). I want the jobs
> > to use the data ON THE SERVER and not try to reference it from my
> > local machine. If I do something like this:
> >
> > val df = spark.sql("SELECT * FROM
> > parquet.`/opt/data/transactions.parquet`")
> >
> > I get an error that the path doesn't exist (because it's trying to
> > find it on my laptop). If I run the same thing in a spark-shell on the
> > spark server itself, there isn't an issue because the driver has
> > access to the data. If I submit the job with --deploy-mode cluster then
> > it works too because the driver is on the cluster. I don't want this;
> > I want to get the results on my laptop.
> >
> > How can I force Spark to read the data from the cluster's filesystem
> > and not the driver's?
> >
> > 2. I have set up a Hive Metastore and created a table (in the spark
> > shell on the spark server itself). The data in the warehouse is in the
> > local filesystem. When I create a spark application JAR and try to run
> > it from my laptop, I get the same problem as #1, namely that it tries
> > to find the warehouse directory on my laptop itself.
> >
> > Am I crazy? Perhaps this isn't a supported way to use Spark? Any help
> > or insights are much appreciated!
> >
> > -Ryan Victory
>
> --
> Apostolos N. Papadopoulos, Associate Professor
> Department of Informatics
> Aristotle University of Thessaloniki
> Thessaloniki, GREECE
> tel: ++0030312310991918
> email: papad...@csd.auth.gr
> twitter: @papadopoulos_ap
> web: http://datalab.csd.auth.gr/~apostol
>
>
>
>


Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Ryan Victory
Hello!

I have been tearing my hair out trying to solve this problem. Here is my
setup:

1. I have Spark running on a server in standalone mode with data on the
filesystem of the server itself (/opt/data/).
2. I have an instance of a Hive Metastore server running (backed by
MariaDB) on the same server
3. I have a laptop where I am developing my spark jobs (Scala)

I have configured Spark to use the metastore and set the warehouse
directory to be in /opt/data/warehouse/. What I am trying to accomplish are
a couple of things:

1. I am trying to submit Spark jobs (via JARs) using spark-submit, but have
the driver run on my local machine (my laptop). I want the jobs to use the
data ON THE SERVER and not try to reference it from my local machine. If I
do something like this:

val df = spark.sql("SELECT * FROM parquet.`/opt/data/transactions.parquet`")

I get an error that the path doesn't exist (because it's trying to find it
on my laptop). If I run the same thing in a spark-shell on the spark server
itself, there isn't an issue because the driver has access to the data. If
I submit the job with --deploy-mode cluster then it works too because the
driver is on the cluster. I don't want this; I want to get the results on
my laptop.

How can I force Spark to read the data from the cluster's filesystem and
not the driver's?

2. I have set up a Hive Metastore and created a table (in the spark shell on
the spark server itself). The data in the warehouse is in the local
filesystem. When I create a spark application JAR and try to run it from my
laptop, I get the same problem as #1, namely that it tries to find the
warehouse directory on my laptop itself.
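
For context, the SparkSession in the JAR is built more or less like this
(hostnames and ports are placeholders rather than my exact config):

import org.apache.spark.sql.SparkSession

// Driver runs on the laptop; master, metastore, and warehouse all point at
// the server.
val spark = SparkSession.builder()
  .appName("transactions-job")
  .master("spark://sparkserver:7077")
  .config("hive.metastore.uris", "thrift://sparkserver:9083")
  .config("spark.sql.warehouse.dir", "/opt/data/warehouse")
  .enableHiveSupport()
  .getOrCreate()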

Am I crazy? Perhaps this isn't a supported way to use Spark? Any help or
insights are much appreciated!

-Ryan Victory


Unsubscribe

2019-12-11 Thread Ryan Victory



Re: Appropriate Apache Users List Uses

2016-02-09 Thread Ryan Victory
Yeah, I'm a little disappointed with this; I wouldn't expect to be sent
unsolicited mail based on my membership in this list.

-Ryan Victory

On Tue, Feb 9, 2016 at 1:36 PM, John Omernik <j...@omernik.com> wrote:

> All, I received this today. Is this appropriate list use? Note: this was
> unsolicited.
>
> Thanks
> John
>
>
>
> From: Pierce Lamb <pl...@snappydata.io>
>
> Hi John,
>
> I saw you on the Spark Mailing List and noticed you worked for * and
> wanted to reach out. My company, SnappyData, just launched an open source
> OLTP + OLAP Database built on Spark. Our lead investor is Pivotal, whose
> largest owner is EMC which makes * like a father figure :)
>
> SnappyData’s goal is twofold: operationalize Spark and deliver truly
> interactive queries. To do this, we first integrated Spark with an
> in-memory database with a pedigree of production customer deployments:
> GemFireXD (GemXD).
>
> GemXD operationalized Spark via:
>
> -- True high availability
>
> -- A highly concurrent environment
>
> -- An OLTP engine that can process transactions (mutable state)
>
> With GemXD as a storage engine, we packaged SnappyData with Approximate
> Query Processing (AQP) technology. AQP enables interactive response times
> even when data volumes are huge because it allows the developer to trade
> latency for accuracy. AQP queries (SQL queries with a specified error rate)
> execute on sample tables -- tables that have taken a stratified sample of
> the full dataset. As such, AQP queries enable much faster decisions when
> 100% accuracy isn’t needed and sample tables require far fewer resources to
> manage.
>
> If that sounds interesting to you, please check out our Github repo (our
> release is hosted there under “releases”):
>
> https://github.com/SnappyDataInc/snappydata
>
> We also have a technical paper that dives into the architecture:
> http://www.snappydata.io/snappy-industrial
>
> Are you currently using Spark at ? I’d love to set up a call with you
> and hear about how you’re using it and see if SnappyData could be a fit.
>
> In addition to replying to this email, there are many ways to chat with
> us: https://github.com/SnappyDataInc/snappydata#community-support
>
> Hope to hear from you,
>
> Pierce
>
> pl...@snappydata.io
>
> http://www.twitter.com/snappydata
>