Re: Return results optionally from spark_sql_hook

2017-10-16 Thread Boris Tyukin
Great, this is what I expected to hear, but I wanted to double check. Thanks for all your help, Fokko.

On Mon, Oct 16, 2017 at 1:08 PM, Driesprong, Fokko wrote:
> Hi Boris,
>
> When kicking off Spark jobs using Airflow, cluster mode is highly
> recommended since the workload of the driver is on the ...

Re: Return results optionally from spark_sql_hook

2017-10-16 Thread Driesprong, Fokko
Hi Boris, When kicking off Spark jobs using Airflow, cluster mode is highly recommended, since the workload of the driver is then on the Hadoop cluster and not on the Airflow machine itself. Personally, I prefer the spark-submit operator, since it will pull all the connection variables directly from Airflow ...
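
A minimal sketch of that spark-submit route, assuming the Airflow 1.x contrib import path; the application script path and connection id are hypothetical, and the master URL and deploy mode are expected to come from the connection rather than the operator:

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

    dag = DAG("spark_submit_example", start_date=datetime(2017, 10, 1),
              schedule_interval="@daily")

    submit = SparkSubmitOperator(
        task_id="run_partition_load",
        application="/path/to/partition_load.py",  # hypothetical PySpark script
        conn_id="spark_default",        # master URL and deploy mode live here
        application_args=["{{ ds }}"],  # hand the execution date to the job
        dag=dag,
    )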

Re: Return results optionally from spark_sql_hook

2017-10-15 Thread Boris
Thanks Fokko. Do you know if it is better to use pyspark directly within a Python operator, or to invoke spark-submit instead? My understanding is that in both cases Airflow uses the yarn-client deployment mode, not yarn-cluster, so the Spark driver always runs on the same node as the Airflow worker. Not sure it is the best ...
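
For context, with the contrib SparkSubmitHook the deploy mode is driven by the Spark connection rather than fixed to yarn-client, so yarn-cluster is reachable from Airflow. A sketch, with a hypothetical connection id and extras spelled the way the contrib hook reads them:

    from airflow.models import Connection

    # Host "yarn" plus a deploy-mode extra asks spark-submit to run with
    # --master yarn --deploy-mode cluster, so the driver lands on the
    # Hadoop cluster rather than on the Airflow worker.
    spark_conn = Connection(
        conn_id="spark_yarn_cluster",  # hypothetical connection id
        conn_type="spark",
        host="yarn",
        extra='{"deploy-mode": "cluster", "queue": "default"}',
    )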

Re: Return results optionally from spark_sql_hook

2017-10-15 Thread Driesprong, Fokko
Hi Boris, Instead of writing it to a file, you can also write it to XCom; this will keep everything inside of Airflow. My personal opinion on this: spark-sql is a bit limited by nature, since it only supports SQL. If you want to do more dynamic stuff, you will eventually have to move to spark-submit anyway ...
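
A hedged sketch of the XCom route with the Airflow 1.x PythonOperator: a value returned from the callable is pushed to XCom automatically, and a downstream task pulls it. All task and function names here are illustrative:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    dag = DAG("xcom_example", start_date=datetime(2017, 10, 1),
              schedule_interval="@daily")

    def fetch_count(**context):
        count = 42    # placeholder: in practice, read this from Spark
        return count  # the return value is pushed to XCom as "return_value"

    def use_count(**context):
        count = context["ti"].xcom_pull(task_ids="fetch_count")
        print("partition row count: %s" % count)

    fetch = PythonOperator(task_id="fetch_count", python_callable=fetch_count,
                           provide_context=True, dag=dag)
    consume = PythonOperator(task_id="use_count", python_callable=use_count,
                             provide_context=True, dag=dag)
    fetch >> consume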

Re: Return results optionally from spark_sql_hook

2017-10-14 Thread Boris
Thanks Fokko, I think that will do it, but my concern is that in this case my DAG will initiate two separate Spark sessions, and it takes about 20 seconds in our YARN environment to create one. I need to run 600 DAGs like that every morning. I am now thinking of creating a pyspark job that will do the insert and ...
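
One way that single-job idea could look, so the roughly 20-second YARN startup is paid once: a single PySpark script does both the INSERT OVERWRITE and the count inside the same SparkSession. The database, table, and partition names below are hypothetical:

    import sys
    from pyspark.sql import SparkSession

    # Hypothetical script submitted via spark-submit with the partition
    # date as its only argument, e.g.: partition_load.py 2017-10-14
    ds = sys.argv[1]
    spark = (SparkSession.builder
             .appName("insert_and_count")
             .enableHiveSupport()
             .getOrCreate())
    spark.sql("INSERT OVERWRITE TABLE my_db.my_table "
              "PARTITION (load_date='%s') "
              "SELECT * FROM my_db.staging WHERE load_date='%s'" % (ds, ds))
    cnt = spark.sql("SELECT COUNT(*) AS cnt FROM my_db.my_table "
                    "WHERE load_date='%s'" % ds).collect()[0]["cnt"]
    print("row count: %d" % cnt)
    spark.stop()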

Re: Return results optionally from spark_sql_hook

2017-10-14 Thread Driesprong, Fokko
Hi Boris, That sounds like a nice DAG. This is how I would do it: first, run the long-running query in a spark-sql operator, like you have now. Then create a Python function that builds a SparkSession within Python (using the pyspark API) and fetches the count from the Spark partition that you've ...
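
One possible shape for that Python function, under stated assumptions: pyspark must be importable on the Airflow worker, and the database, table, and partition column are hypothetical. Returning the count lets Airflow push it to XCom for downstream tasks:

    from pyspark.sql import SparkSession

    def count_partition(ds, **kwargs):
        # Build a session against YARN and count the rows in one partition.
        spark = (SparkSession.builder
                 .master("yarn")
                 .appName("count_partition")
                 .enableHiveSupport()
                 .getOrCreate())
        try:
            row = spark.sql("SELECT COUNT(*) AS cnt FROM my_db.my_table "
                            "WHERE load_date = '%s'" % ds).collect()[0]
            return row["cnt"]  # pushed to XCom as "return_value"
        finally:
            spark.stop()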

Re: Return results optionally from spark_sql_hook

2017-10-14 Thread Boris Tyukin
Hi Fokko, thanks for your response, really appreciate it! Basically, in my case I have two Spark SQL queries:
1) the first query does an INSERT OVERWRITE to a partition and may take a while;
2) then I run a second query right after it to get the count of rows in that partition;
3) I need to pass ...
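
Those first two steps, sketched with the contrib SparkSqlOperator; the database, table, and partition column are hypothetical, and the templated {{ ds }} stands in for the partition date:

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.spark_sql_operator import SparkSqlOperator

    dag = DAG("two_query_example", start_date=datetime(2017, 10, 1),
              schedule_interval="@daily")

    load = SparkSqlOperator(
        task_id="insert_overwrite",
        sql="INSERT OVERWRITE TABLE my_db.my_table "
            "PARTITION (load_date='{{ ds }}') "
            "SELECT * FROM my_db.staging",
        dag=dag,
    )
    count = SparkSqlOperator(
        task_id="count_rows",
        sql="SELECT COUNT(*) FROM my_db.my_table WHERE load_date='{{ ds }}'",
        dag=dag,
    )
    load >> count

As written, the count from the second task only shows up in the spark-sql logs, which is exactly the gap the proposed hook change is about.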

Re: Return results optionally from spark_sql_hook

2017-10-14 Thread Driesprong, Fokko
Hi Boris, Thank you for your question, and excuse me for the late response; I'm currently on holiday. The solution that you suggest would not be my preferred choice: extracting results from a log using a regex is expensive in terms of computational cost, and error-prone. My question is, what are ...

Return results optionally from spark_sql_hook

2017-10-13 Thread Boris
Hi guys, I opened a JIRA on this and will be working on a PR: https://issues.apache.org/jira/browse/AIRFLOW-1713 Any objections or suggestions, conceptually? Fokko, I see you have been actively contributing to the Spark hooks and operators, so I could use your opinion before I implement this. Boris