Once I had Drill working with Parquet files on dfs, the next step was to move the Parquet files to S3.
I get pretty good performance: I can query for events by date range within 10 seconds (out of a total of ~800M events across 25 years). However, there seems to be some threshold beyond which queries start timing out:

    SYSTEM ERROR: ConnectionPoolTimeoutException: Timeout waiting for connection from pool

My first question: is there a default timeout for queries against S3? Anything that takes longer than ~150 seconds seems to hit the timeout error.

My second question is about the conditions that trigger the prolonged query time. It seems that if I increase the number of filter values beyond a certain point (it doesn't take much), the query times out. For example, this query works fine:

    select * from events where YEAR in (2012, 2013)

However, this one fails with a timeout:

    select * from events where YEAR in (2012, 2013, 2014)

To make it worse, I can't run the first query either until I restart Drill...
