Great, this is what I expected to hear, but I wanted to double-check. Thanks
for all your help, Fokko
On Mon, Oct 16, 2017 at 1:08 PM, Driesprong, Fokko wrote:
Hi Boris,
When kicking off Spark jobs using Airflow, cluster mode is highly
recommended since the workload of the driver is on the Hadoop cluster, and
not on the Airflow machine itself. Personally I prefer the spark-submit
operator since it will pull all the connection variables directly from
Airflow.
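A rough sketch of what that looks like (the application path is a placeholder, and the contrib import path assumes Airflow 1.x):

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

    dag = DAG('spark_submit_example', start_date=datetime(2017, 10, 1))

    # The YARN master and deploy mode are resolved from the 'spark_default'
    # connection, so the operator itself stays free of cluster details.
    submit_job = SparkSubmitOperator(
        task_id='submit_my_job',
        application='/path/to/my_job.py',  # placeholder application
        conn_id='spark_default',
        dag=dag)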
Thanks Fokko. Do you know if it is better to use pyspark directly within a
python operator or to invoke spark-submit instead? My understanding is that in
both cases Airflow uses the yarn-client deployment mode, not yarn-cluster, and
the Spark driver always runs on the same node as the Airflow worker. Not sure
that is the best setup.
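If the hook honors the connection extras the way I'd expect, the deploy mode would not have to be yarn-client: as far as I can tell, the contrib SparkSubmitHook reads the master from the connection's host and the deploy mode from its extra field, so a yarn-cluster connection could be set up roughly like this (the connection id is made up):

    from airflow import settings
    from airflow.models import Connection

    # Hypothetical connection: host 'yarn' plus deploy-mode 'cluster' in the
    # extras should make spark-submit run the driver inside the cluster.
    conn = Connection(
        conn_id='spark_yarn_cluster',
        conn_type='spark',
        host='yarn',
        extra='{"deploy-mode": "cluster"}')

    session = settings.Session()
    session.add(conn)
    session.commit()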
Hi Boris,
Instead of writing it to a file, you can also write it to XCom; this will
keep everything inside of Airflow. My personal opinion on this: spark-sql
is a bit limited by nature, as it only supports SQL. If you want to do more
dynamic stuff, you will eventually have to move to spark-submit anyway.
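A rough sketch of the XCom route (the task ids and the count value are placeholders):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    dag = DAG('xcom_example', start_date=datetime(2017, 10, 1))

    def push_count(**context):
        count = 12345  # placeholder for the real partition count
        return count   # the return value is pushed to XCom automatically

    def pull_count(**context):
        count = context['ti'].xcom_pull(task_ids='push_count')
        print('row count from XCom: %s' % count)

    push = PythonOperator(task_id='push_count', python_callable=push_count,
                          provide_context=True, dag=dag)
    pull = PythonOperator(task_id='pull_count', python_callable=pull_count,
                          provide_context=True, dag=dag)
    push >> pull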
Thanks Fokko, I think that will do it, but my concern is that in this case my
DAG will initiate two separate Spark sessions, and it takes about 20 seconds
in our YARN environment to create one. I need to run 600 DAGs like that every
morning.
I am now thinking of creating a pyspark job that will do the insert and the
count in a single session.
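Something along these lines, so that each run pays the YARN startup cost only once (the table and partition names are made up):

    from pyspark.sql import SparkSession

    if __name__ == '__main__':
        # One session does both the insert and the count, so the ~20 second
        # YARN application startup happens only once per run.
        spark = (SparkSession.builder
                 .appName('insert_and_count')
                 .enableHiveSupport()
                 .getOrCreate())

        spark.sql("INSERT OVERWRITE TABLE my_table PARTITION (ds='2017-10-16') "
                  "SELECT * FROM staging_table")
        count = spark.sql("SELECT COUNT(*) FROM my_table "
                          "WHERE ds='2017-10-16'").collect()[0][0]
        print('partition row count: %d' % count)
        spark.stop()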
Hi Boris,
That sounds like a nice DAG.
This is how I would do it: first run the long-running query in a spark-sql
operator like you have now. Then create a Python function that builds a
SparkSession within Python (using the PySpark API) and fetches the
count from the Spark partition that you've just written.
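In code, the idea would look roughly like this (the table, partition column, and task names are all placeholders):

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.spark_sql_operator import SparkSqlOperator
    from airflow.operators.python_operator import PythonOperator
    from pyspark.sql import SparkSession

    dag = DAG('insert_then_count', start_date=datetime(2017, 10, 1))

    def fetch_partition_count(ds, **kwargs):
        # Build a SparkSession via the PySpark API and count the rows of
        # the partition the previous task just wrote.
        spark = SparkSession.builder.enableHiveSupport().getOrCreate()
        count = spark.sql("SELECT COUNT(*) FROM my_table "
                          "WHERE ds = '%s'" % ds).collect()[0][0]
        spark.stop()
        return count  # lands in XCom for downstream tasks

    insert = SparkSqlOperator(
        task_id='insert_overwrite',
        sql="INSERT OVERWRITE TABLE my_table PARTITION (ds='{{ ds }}') "
            "SELECT * FROM staging_table",
        dag=dag)

    count = PythonOperator(
        task_id='fetch_partition_count',
        python_callable=fetch_partition_count,
        provide_context=True,
        dag=dag)

    insert >> count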
Hi Fokko, thanks for your response, really appreciate it!
Basically in my case I have two Spark SQL queries:
1) the first query does an INSERT OVERWRITE to a partition and may run for a
while
2) then I run a second query right after it to get the count of rows in that
partition
3) I need to pass that count to a downstream task
Hi Boris,
Thank you for your question, and excuse me for the late response; I'm
currently on holiday.
The solution that you suggest would not be my preferred choice. Extracting
results from a log using a regex is expensive in terms of computational
cost, and error-prone. My question is: what are you trying to achieve?
Hi guys,
I opened a JIRA on this and will be working on a PR:
https://issues.apache.org/jira/browse/AIRFLOW-1713
Any objections/suggestions conceptually?
Fokko, I see you have been actively contributing to the Spark hooks and
operators, so I could use your opinion before I implement this.
Boris