Re: Meetup Interest?

2017-10-14 Thread Marc Bollinger
+1 We'd definitely be in. Would love to chat more about K8s/Airflow--Data Eng has been a little twitchy about being the guinea pigs in our org, but the production app is now serving all traffic from it, so we're planning out our strategy. On Fri, Oct 13, 2017 at 1:29 PM, Daniel Imberman (BLOOMBER

Re: Redshift operation examples

2017-10-14 Thread Veeranagouda Mukkanagoudar
Thanks Andy, I think this is helpful. On Sat, Oct 14, 2017 at 12:22 PM, Andy Hadjigeorgiou wrote: > Hello, > > If you want to query Redshift clusters, PostgresOperator and > PostgresHook are what you are looking for. Here are the docs >

Re: Redshift operation examples

2017-10-14 Thread Andy Hadjigeorgiou
This blog post has a good example of using PostgresHook to query a database - all you'd need to change is the connection info (to however you'd normally access your Redshift cluster to query). Andy --- Software Engineer | Fundera
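
As a rough illustration of the PostgresHook approach described above, here is a minimal sketch. The connection id `my_redshift` and the helper names `make_redshift_hook` and `count_rows` are hypothetical and not taken from the thread or the linked blog post. The import path for `PostgresHook` depends on the Airflow version, so both the provider-package path and the older path are tried.

```python
def make_redshift_hook(conn_id="my_redshift"):
    """Build a PostgresHook for a Redshift connection.

    `my_redshift` is a hypothetical Airflow connection id; replace it with
    whatever connection points at your cluster. Airflow is imported lazily
    so the helper below can be exercised without Airflow installed.
    """
    try:
        # Newer Airflow releases ship the hook in a provider package.
        from airflow.providers.postgres.hooks.postgres import PostgresHook
    except ImportError:
        # Older releases (including those current in 2017) use this path.
        from airflow.hooks.postgres_hook import PostgresHook
    return PostgresHook(postgres_conn_id=conn_id)


def count_rows(hook, table):
    """Return the row count of `table` using the hook's get_records().

    The table name is interpolated directly into the SQL string, so only
    pass trusted identifiers.
    """
    rows = hook.get_records("SELECT COUNT(*) FROM {}".format(table))
    return rows[0][0]
```

Inside a task this might be called as `count_rows(make_redshift_hook(), "events")`, where `events` is a placeholder table name. Passing the hook in as an argument keeps the query logic easy to test with a stand-in object.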

Re: Redshift operation examples

2017-10-14 Thread Andy Hadjigeorgiou
Hello, If you want to query Redshift clusters, PostgresOperator and PostgresHook are what you are looking for. Here are the docs for both. If you are looking to manage Redshift clusters, right now you'd have to use bo

Redshift operation examples

2017-10-14 Thread Veeranagouda Mukkanagoudar
I am new to Airflow. Can anyone point me to Redshift/Postgres operator or task implementation examples? -Thanks, Veera

Re: spark sql hook with multiple queries

2017-10-14 Thread Boris Tyukin
Hi Fokko, looks like you've fixed the issue that was causing it :) [AIRFLOW-1562] Spark-sql logging contains deadlock. This is exactly what I was seeing: the process would just freeze on the second query, I guess waiting for the lock on the log file. Thanks! On Sat, Oct 14, 2017 at 5:07 AM, Dries

Re: Return results optionally from spark_sql_hook

2017-10-14 Thread Boris
Thanks Fokko, I think it will do it, but my concern is that in this case my DAG will initiate two separate Spark sessions, and it takes about 20 seconds in our YARN environment to create one. I need to run 600 DAGs like that every morning. I am now thinking of creating a PySpark job that will do insert an

Re: Return results optionally from spark_sql_hook

2017-10-14 Thread Driesprong, Fokko
Hi Boris, That sounds like a nice DAG. This is how I would do it: First run the long-running query in a spark-sql operator like you have now. Create a Python function that builds a SparkSession within Python (using the PySpark API) and fetches the count from the spark partition that you've
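
The second step of the suggestion above (a Python function that builds a SparkSession and fetches a count) might look roughly like the sketch below. This is an illustration under assumptions, not the code from the thread: the helper names, the app name, and the table and partition column in the usage note are all hypothetical, and the original message is cut off before giving details.

```python
def get_spark_session(app_name="partition_row_count"):
    """Build (or reuse) a SparkSession; pyspark is imported lazily.

    The app name is a hypothetical placeholder.
    """
    from pyspark.sql import SparkSession
    return (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()
        .getOrCreate()
    )


def count_partition_rows(spark, table, partition_col, partition_value):
    """Return the number of rows in a single partition of `table`.

    Identifiers and the partition value are interpolated directly into
    the SQL string, so only pass trusted values.
    """
    query = "SELECT COUNT(*) AS n FROM {} WHERE {} = '{}'".format(
        table, partition_col, partition_value
    )
    # collect() returns a list of Row objects; take the first column
    # of the first row.
    return spark.sql(query).collect()[0][0]
```

A callable for a PythonOperator could then return `count_partition_rows(get_spark_session(), "staging.events", "ds", "2017-10-14")`, where the table and partition names are placeholders.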

Re: Return results optionally from spark_sql_hook

2017-10-14 Thread Boris Tyukin
Hi Fokko, thanks for your response, really appreciate it! Basically in my case I have two Spark SQL queries: 1) the first query does INSERT OVERWRITE to a partition and may take a while 2) then I run a second query right after it to get the count of rows of that partition. 3) I need to pa

Re: Question about skipping, state propagation, and trigger rules.

2017-10-14 Thread Daniel Lamblin [Data Science & Platform Center]
Thanks Alek, this is an interesting alternative approach that would accomplish what I'm looking for. I've got 24 such data staging tasks in a daily DAG, so going from 2 to 4 tasks per data source is only mildly more work. The sensor was an S3 prefix sensor from https://airflow.incubator.apache.org/_

Re: spark sql hook with multiple queries

2017-10-14 Thread Driesprong, Fokko
Hi Boris, Interesting. Multiple queries are supported by the spark-sql operator and this should work using Airflow. Executing SQL from a file: Fokkos-MBP:~ fokkodriesprong$ spark-sql --driver-java-options "-Dlog4j.configuration=file:///tmp/log4j.properties" -f query.sql 1 Time taken: 1.976 seconds

Re: Return results optionally from spark_sql_hook

2017-10-14 Thread Driesprong, Fokko
Hi Boris, Thank you for your question and excuse me for the late response; I'm currently on holiday. The solution that you suggest would not be my preferred choice. Extracting results from a log using a regex is computationally expensive and error-prone. My question is, what are