Usually this isn't done, as the data is meant to live on shared/distributed
storage, e.g. HDFS, S3, etc.

Spark should then read this data into a DataFrame, and your code logic is
applied to the DataFrame in a distributed manner.
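
For example, once the data is on HDFS or S3, the job just reads it from
there; a minimal sketch (the paths, bucket, and column name below are made
up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("example").getOrCreate()

    # Read directly from distributed storage; the read is parallelised
    # across executors, so the data never needs to be shipped with the job.
    df = spark.read.parquet("hdfs:///data/events.parquet")
    # or from S3:
    # df = spark.read.csv("s3a://my-bucket/events.csv", header=True)

    # Transformations on the DataFrame run distributed across the cluster.
    df.groupBy("event_type").count().show()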

On Wed, 29 Jan 2020 at 09:37, Tharindu Mathew <tharindu.mat...@gmail.com>
wrote:

> That was really helpful. Thanks! I actually solved my problem by creating
> a venv and using the venv flags. Wondering now how to submit the data as
> an archive? Any ideas?
>
> On Mon, Jan 27, 2020, 9:25 PM Chris Teoh <chris.t...@gmail.com> wrote:
>
>> Use --py-files
>>
>> See
>> https://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies
>>
>> I hope that helps.
>>
>> On Tue, 28 Jan 2020, 9:46 am Tharindu Mathew, <tharindu.mat...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Newbie to pyspark/spark here.
>>>
>>> I'm trying to submit a job to pyspark with a dependency, Spark DL in
>>> this case. While the local environment has it, pyspark does not see it.
>>> How do I correctly start pyspark so that it sees this dependency?
>>>
>>> Using Spark 2.3.0 in a Cloudera setup.
>>>
>>> --
>>> Regards,
>>> Tharindu Mathew
>>> http://tharindumathew.com
>>>
>>
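
For the dependency side of the thread (--py-files and the venv approach
mentioned above), one common pattern on YARN is to pack the virtualenv and
ship it alongside any extra Python modules. A rough sketch only; the archive,
zip and script names are made up, it assumes the venv was packed with a tool
like venv-pack, and the exact env-var confs needed can vary by cluster setup:

    venv-pack -o pyspark_env.tar.gz

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --archives pyspark_env.tar.gz#environment \
      --py-files extra_deps.zip \
      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
      --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
      my_job.py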

-- 
Chris
