Re: better way to schedule pyspark with SparkOperator on Airflow

2019-02-07 Thread Tao Feng
Thanks Fokko, will take a look. On Thu, Feb 7, 2019 at 12:08 AM Driesprong, Fokko wrote: > Hi Tao, > > For Dataproc, which is the managed Hadoop of GCP, I implemented a > method a while ago. It will check if the Python file is local, and if this > is the case, it will be uploaded to the temporary bucket …

Re: better way to schedule pyspark with SparkOperator on Airflow

2019-02-07 Thread Driesprong, Fokko
Hi Tao, For Dataproc, which is the managed Hadoop of GCP, I implemented a method a while ago. It will check if the Python file is local, and if this is the case, it will be uploaded to the temporary bucket which is provided with the cluster: https://github.com/apache/airflow/blob/master/air …
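
The upload-if-local check Fokko describes can be sketched roughly like this. This is a minimal illustration, not the actual Airflow code behind the truncated link: the helper name, the temp_bucket parameter, and the use of the google-cloud-storage client are all assumptions.

import os
from google.cloud import storage

def upload_if_local(main_file, temp_bucket):
    """Sketch (hypothetical helper): if `main_file` is a local path, stage it
    in the cluster's temporary GCS bucket and return the gs:// URI;
    otherwise return the path unchanged."""
    # Already a remote URI: nothing to upload.
    if main_file.startswith(("gs://", "hdfs://")):
        return main_file
    # Local file: copy it into the temporary bucket provided with the cluster.
    blob_name = os.path.basename(main_file)
    client = storage.Client()
    client.bucket(temp_bucket).blob(blob_name).upload_from_filename(main_file)
    return "gs://{}/{}".format(temp_bucket, blob_name)

The operator can then pass the returned gs:// URI to the Dataproc job submission, so callers never have to think about whether their entry-point script is local or already staged.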

better way to schedule pyspark with SparkOperator on Airflow

2019-02-06 Thread Tao Feng
Hi, I wonder if anyone has suggestions on how to use SparkOperator to send a PySpark file to the Spark cluster, and how to specify PySpark dependencies. We currently push the user's PySpark file and its dependencies to an S3 location, where they get picked up by our Spark cluster, and we would like to explore …
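
For reference, the S3-based flow described here usually looks something like the following with Airflow's SparkSubmitOperator. This is a hedged sketch, not Tao's actual setup: the S3 paths, DAG id, and the use of the default spark connection are made up, the import path matches the 2019-era Airflow 1.10 contrib layout, and s3a:// URIs only resolve if the cluster has the S3 filesystem configured.

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

with DAG(dag_id="pyspark_from_s3",
         start_date=datetime(2019, 2, 1),
         schedule_interval="@daily") as dag:
    submit = SparkSubmitOperator(
        task_id="submit_pyspark_job",
        conn_id="spark_default",
        # Main PySpark application, pushed to S3 ahead of time (hypothetical path).
        application="s3a://my-bucket/jobs/etl_job.py",
        # Extra modules shipped to the executors via --py-files (hypothetical path).
        py_files="s3a://my-bucket/jobs/deps.zip",
    )

Here `application` becomes the positional script argument to spark-submit, and `py_files` is passed through as --py-files, which is the standard way to distribute PySpark dependencies to the executors.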