Drill will make efforts to execute portions of queries locally, but that doesn't look like a powerful enough mechanism for your use case since S3 isn't really local to anything.
Also, as a philosophy, Drill delegates all handling of materialized views to you rather than taking responsibility for it. Speaking concretely, I think that the following would meet your requirements. I will be speaking from the point of view of hosting data on a MapR cluster since that is what I know best. You should be able to implement similar mechanisms on other Hadoop distributions, although with degraded guarantees. The basic idea as I understand it is that you have a large amount of data in S3 that you periodically want to aggregate using Drill using an EMR/EC2 based cluster and also magically have access to the most recent aggregates in another non-Amazon cluster for other kinds of queries. One process that I would suggest for meeting this requirement would be: *Option 1: Ephemeral EC2/VPC cluster with long-lived storage, mirror transfer of data* 1) restart previously stopped virtual private cloud (VPC) based cluster. This will continue from the last cluster state, including knowledge of previously started data mirrors. It will also launch all Drill bits. All cluster data is on EBS and so will survive VPC cluster shutdowns. This operation would be initiated from your local machines. 2) launch Drill query to do aggregations from S3. Results should be written to MapR FS volume that is configured for mirroring back to your local cluster. This launch could be done from your local machines. 3) once aggregations complete, initiate mirroring of aggregation product back to local cluster. This will do a file-system level snapshot and initiate transfer of only the changed blocks. These blocks will contain your new aggregation results. This process has to happen on the VPC cluster, but could be initiated securely via a number of mechanisms. 4) when the mirror operation completes (depends on network speeds, transfer overhead should be <20% of total available bandwidth), pause VPC cluster. Billing will continue for data stored on EBS, but billing for cluster hosts will stop. This action would be initiated from your local machines. 5) on your local cluster, the mirrored volume containing your aggregates will now contain the latest aggregates. These aggregates will appear atomically so you won't have any point in time when part of the aggregates are visible and part are not. 6) repeat process on whatever schedule you like. *Option 2: Ephemeral VPC cluster, asynchronous table mirroring of aggregates* The first option has the virtue that no aggregate data will be visible locally until all aggregate is visible. This may sometimes be a vice instead of a virtue. In such a case, you can get similar guarantees around reliable transfer, but also transfer aggregates a record at a time by using a table replication strategy. The process would proceed as in Option 1, but steps 3-5 would be implemented using table replication. The aggregation process would insert aggregates into a MapR DB table which is configured to replicate back to a mirrored table on your local cluster. Shortly after the each aggregate record is inserted into the table on the VPC side, that record will appear on the local cluster. After the last transfer completes, the VPC cluster can be stopped and you continue as before. *Summary* As you can see, these options use local and remote Drill clusters for different purposes. You can initiate queries remotely, but you will have to specify which cluster explicitly. Drill also does not address data motion so you need to handle that yourself. 2) configure On Tue, Jun 2, 2015 at 9:56 AM, Jeroen van Dijk <[email protected]> wrote: > Hi all, > > (I only know Drill superficially, apologies for implied ignorance) > > Is it currently possible to have a local Drill cluster interact with a > remote Drill cluster to offload work and data transfer? To be more precise, > I'm thinking about the following. I have a big set of local data on a local > hdfs cluster (Europe) which I want to join with data that is hosted on S3 > (US-east). I would like to prevent large data transfers where I'm mostly > interested in aggregate data. Would it be possible to have my local cluster > send instructions to a remote (EMR) cluster to achieve this? If not, are > there other effective ways to deal with this situation in Drill? > > Thanks, > Jeroen >
