Re: better way to schedule pyspark with SparkOperator on Airflow

2019-02-07 Thread Tao Feng
Thanks Fokko, will take a look. On Thu, Feb 7, 2019 at 12:08 AM Driesprong, Fokko wrote: > Hi Tao, > > For Dataproc, which is the managed Hadoop of GCP, I implemented a > method a while ago. It will check if the Python file is local, and if this > is the case, it will be uploaded to the temporary bucket …

Re: better way to schedule pyspark with SparkOperator on Airflow

2019-02-07 Thread Driesprong, Fokko
Hi Tao, For Dataproc, which is the managed Hadoop of GCP, I implemented a method a while ago. It will check if the Python file is local, and if this is the case, it will be uploaded to the temporary bucket which is provided with the cluster: https://github.com/apache/airflow/blob/master/air …
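
The upload-if-local check Fokko describes can be sketched roughly like this. This is a minimal illustration, not the actual Airflow code behind the truncated link: the helper name, the temp_bucket parameter, and the use of the google-cloud-storage client are all assumptions.

import os
from google.cloud import storage

def upload_if_local(main_file, temp_bucket):
    """Sketch (hypothetical helper): if `main_file` is a local path, stage it
    in the cluster's temporary GCS bucket and return the gs:// URI;
    otherwise return the path unchanged."""
    # Already a remote URI: nothing to upload.
    if main_file.startswith(("gs://", "hdfs://")):
        return main_file
    # Local file: copy it into the temporary bucket provided with the cluster.
    blob_name = os.path.basename(main_file)
    client = storage.Client()
    client.bucket(temp_bucket).blob(blob_name).upload_from_filename(main_file)
    return "gs://{}/{}".format(temp_bucket, blob_name)

The operator can then pass the returned gs:// URI to the Dataproc job submission, so callers never have to think about whether their entry-point script is local or already staged.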

better way to schedule pyspark with SparkOperator on Airflow

2019-02-06 Thread Tao Feng
Hi, I wonder if anyone has suggestions on how to use SparkOperator to send a PySpark file to the Spark cluster, and how to specify PySpark dependencies. We currently push the user's PySpark file and its dependencies to an S3 location, where they get picked up by our Spark cluster, and we would like to explore …
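
For reference, the S3-based flow described here usually looks something like the following with Airflow's SparkSubmitOperator. This is a hedged sketch, not Tao's actual setup: the S3 paths, DAG id, and the use of the default spark connection are made up, the import path matches the 2019-era Airflow 1.10 contrib layout, and s3a:// URIs only resolve if the cluster has the S3 filesystem configured.

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

with DAG(dag_id="pyspark_from_s3",
         start_date=datetime(2019, 2, 1),
         schedule_interval="@daily") as dag:
    submit = SparkSubmitOperator(
        task_id="submit_pyspark_job",
        conn_id="spark_default",
        # Main PySpark application, pushed to S3 ahead of time (hypothetical path).
        application="s3a://my-bucket/jobs/etl_job.py",
        # Extra modules shipped to the executors via --py-files (hypothetical path).
        py_files="s3a://my-bucket/jobs/deps.zip",
    )

Here `application` becomes the positional script argument to spark-submit, and `py_files` is passed through as --py-files, which is the standard way to distribute PySpark dependencies to the executors.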