Re: Submitting job with external dependencies to pyspark

2020-01-28 Thread Chris Teoh
Usually this isn't done, as the data is meant to be on shared/distributed
storage, e.g. HDFS, S3, etc.

Spark should then read this data into a DataFrame, and your code logic
applies to the DataFrame in a distributed manner.
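
For example, a minimal sketch of that pattern (the path and column name
below are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-shared-data").getOrCreate()

# Read from shared storage that every executor can reach.
df = spark.read.parquet("hdfs:///data/events")  # or "s3a://bucket/events"

# Transformations apply to the DataFrame in a distributed manner.
df.groupBy("event_type").count().show()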

On Wed, 29 Jan 2020 at 09:37, Tharindu Mathew 
wrote:

> That was really helpful. Thanks! I actually solved my problem by
> creating a venv and using the venv flags. Wondering now how to submit the
> data as an archive? Any idea?
>
> On Mon, Jan 27, 2020, 9:25 PM Chris Teoh  wrote:
>
>> Use --py-files
>>
>> See
>> https://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies
>>
>> I hope that helps.
>>
>> On Tue, 28 Jan 2020, 9:46 am Tharindu Mathew, 
>> wrote:
>>
>>> Hi,
>>>
>>> Newbie to pyspark/spark here.
>>>
>>> I'm trying to submit a job to pyspark with a dependency, Spark DL in
>>> this case. While the local environment has it, pyspark does not see it.
>>> How do I correctly start pyspark so that it sees this dependency?
>>>
>>> Using Spark 2.3.0 in a Cloudera setup.
>>>
>>> --
>>> Regards,
>>> Tharindu Mathew
>>> http://tharindumathew.com
>>>
>>

-- 
Chris


Re: Submitting job with external dependencies to pyspark

2020-01-28 Thread Tharindu Mathew
That was really helpful. Thanks! I actually solved my problem by creating
a venv and using the venv flags. Wondering now how to submit the data as
an archive? Any idea?
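
(For reference, one common way to do this on YARN is spark-submit's
--archives flag, which ships an archive and unpacks it into each
container's working directory. A sketch, assuming a venv packed with
venv-pack; the archive and script names are hypothetical:

spark-submit \
  --master yarn \
  --archives pyspark_venv.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
  my_job.py

The "#environment" suffix sets the directory name the archive is unpacked
under on each node.)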

On Mon, Jan 27, 2020, 9:25 PM Chris Teoh  wrote:

> Use --py-files
>
> See
> https://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies
>
> I hope that helps.
>
> On Tue, 28 Jan 2020, 9:46 am Tharindu Mathew, 
> wrote:
>
>> Hi,
>>
>> Newbie to pyspark/spark here.
>>
>> I'm trying to submit a job to pyspark with a dependency, Spark DL in
>> this case. While the local environment has it, pyspark does not see it.
>> How do I correctly start pyspark so that it sees this dependency?
>>
>> Using Spark 2.3.0 in a Cloudera setup.
>>
>> --
>> Regards,
>> Tharindu Mathew
>> http://tharindumathew.com
>>
>


Re: Submitting job with external dependencies to pyspark

2020-01-27 Thread Chris Teoh
Use --py-files

See
https://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies

I hope that helps.
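
For example (a sketch; the file names here are hypothetical):

spark-submit \
  --py-files deps.zip,other_dep.py \
  my_job.py

--py-files takes a comma-separated list of .zip, .egg, or .py files to
place on the PYTHONPATH of the driver and executors.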

On Tue, 28 Jan 2020, 9:46 am Tharindu Mathew, 
wrote:

> Hi,
>
> Newbie to pyspark/spark here.
>
> I'm trying to submit a job to pyspark with a dependency, Spark DL in
> this case. While the local environment has it, pyspark does not see it.
> How do I correctly start pyspark so that it sees this dependency?
>
> Using Spark 2.3.0 in a Cloudera setup.
>
> --
> Regards,
> Tharindu Mathew
> http://tharindumathew.com
>


Submitting job with external dependencies to pyspark

2020-01-27 Thread Tharindu Mathew
Hi,

Newbie to pyspark/spark here.

I'm trying to submit a job to pyspark with a dependency, Spark DL in this
case. While the local environment has it, pyspark does not see it. How do
I correctly start pyspark so that it sees this dependency?

Using Spark 2.3.0 in a Cloudera setup.

-- 
Regards,
Tharindu Mathew
http://tharindumathew.com