Thanks a lot for the complete and clear response, Ted. I hope to start an experiment soon and see if I can get this to work.

Jeroen

On Tue, Jun 2, 2015 at 10:29 AM, Ted Dunning <[email protected]> wrote:

> Drill will make efforts to execute portions of queries locally, but that
> doesn't look like a powerful enough mechanism for your use case since S3
> isn't really local to anything.
>
> Also, as a philosophy, Drill delegates all handling of materialized views
> to you rather than taking responsibility for it.
>
> Speaking concretely, I think that the following would meet your
> requirements. I will be speaking from the point of view of hosting data on
> a MapR cluster since that is what I know best. You should be able to
> implement similar mechanisms on other Hadoop distributions, although with
> degraded guarantees.
>
> The basic idea as I understand it is that you have a large amount of data
> in S3 that you periodically want to aggregate with Drill on an
> EMR/EC2-based cluster, and you also want to magically have access to the
> most recent aggregates in another, non-Amazon cluster for other kinds of
> queries.
>
> One process that I would suggest for meeting this requirement would be:
>
> *Option 1: Ephemeral EC2/VPC cluster with long-lived storage, mirror
> transfer of data*
>
> 1) restart previously stopped virtual private cloud (VPC) based cluster.
> This will continue from the last cluster state, including knowledge of
> previously started data mirrors. It will also launch all Drillbits. All
> cluster data is on EBS and so will survive VPC cluster shutdowns. This
> operation would be initiated from your local machines.
>
> 2) launch Drill query to do aggregations from S3. Results should be
> written to a MapR FS volume that is configured for mirroring back to your
> local cluster. This launch could be done from your local machines.
>
> 3) once aggregations complete, initiate mirroring of aggregation product
> back to local cluster. This will do a file-system level snapshot and
> initiate transfer of only the changed blocks. These blocks will contain
> your new aggregation results.
> This process has to happen on the VPC cluster, but could be initiated
> securely via a number of mechanisms.
>
> 4) when the mirror operation completes (depends on network speeds, transfer
> overhead should be <20% of total available bandwidth), pause VPC cluster.
> Billing will continue for data stored on EBS, but billing for cluster hosts
> will stop. This action would be initiated from your local machines.
>
> 5) on your local cluster, the mirrored volume containing your aggregates
> will now contain the latest aggregates. These aggregates will appear
> atomically so you won't have any point in time when part of the aggregates
> are visible and part are not.
>
> 6) repeat process on whatever schedule you like.
>
>
> *Option 2: Ephemeral VPC cluster, asynchronous table mirroring of
> aggregates*
>
> The first option has the virtue that no aggregate data will be visible
> locally until all aggregate data is visible. This may sometimes be a vice
> instead of a virtue.
>
> In such a case, you can get similar guarantees around reliable transfer,
> but also transfer aggregates a record at a time by using a table
> replication strategy. The process would proceed as in Option 1, but steps
> 3-5 would be implemented using table replication. The aggregation process
> would insert aggregates into a MapR DB table which is configured to
> replicate back to a mirrored table on your local cluster. Shortly after
> each aggregate record is inserted into the table on the VPC side, that
> record will appear on the local cluster. After the last transfer
> completes, the VPC cluster can be stopped and you continue as before.
>
>
> *Summary*
>
> As you can see, these options use local and remote Drill clusters for
> different purposes. You can initiate queries remotely, but you will have
> to specify which cluster explicitly. Drill also does not address data
> motion so you need to handle that yourself.
>
>
> 2) configure
>
>
> On Tue, Jun 2, 2015 at 9:56 AM, Jeroen van Dijk <[email protected]>
> wrote:
>
> > Hi all,
> >
> > (I only know Drill superficially, apologies for implied ignorance)
> >
> > Is it currently possible to have a local Drill cluster interact with a
> > remote Drill cluster to offload work and data transfer? To be more
> > precise, I'm thinking about the following. I have a big set of local
> > data on a local hdfs cluster (Europe) which I want to join with data
> > that is hosted on S3 (US-east). I would like to prevent large data
> > transfers where I'm mostly interested in aggregate data. Would it be
> > possible to have my local cluster send instructions to a remote (EMR)
> > cluster to achieve this? If not, are there other effective ways to deal
> > with this situation in Drill?
> >
> > Thanks,
> > Jeroen
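
The Option 1 cycle described in the quoted reply (restart the cluster, aggregate, mirror, wait, pause) could be driven from a local machine by a small orchestration script. The following is only a rough sketch under stated assumptions: the step names and the `run_cycle` function are hypothetical, and each action is a placeholder callable that you would back with whatever cloud, Drill, and mirroring tooling you actually use. Stopping the cluster even when an earlier step fails is an assumption made here to avoid paying for idle hosts; it is not something the thread prescribes.

```python
from typing import Callable, Dict, List

# Hypothetical step names, in the order of steps 1-4 of Option 1.
STEP_ORDER = [
    "start_vpc_cluster",   # step 1: restart the stopped VPC cluster
    "run_aggregation",     # step 2: launch the Drill aggregation query
    "start_mirror",        # step 3: initiate mirroring to the local cluster
    "wait_for_mirror",     # step 4a: block until the mirror completes
    "stop_vpc_cluster",    # step 4b: pause the VPC cluster
]


def run_cycle(actions: Dict[str, Callable[[], None]]) -> List[str]:
    """Run one aggregation cycle and return the names of the steps that ran.

    `actions` maps every name in STEP_ORDER to a zero-argument callable.
    The cluster is stopped even if an earlier step raises (an assumption
    of this sketch); the original exception still propagates.
    """
    completed: List[str] = []
    try:
        for name in STEP_ORDER[:-1]:
            actions[name]()
            completed.append(name)
    finally:
        # Cluster data lives on EBS, so stopping the hosts loses nothing.
        actions["stop_vpc_cluster"]()
        completed.append("stop_vpc_cluster")
    return completed


if __name__ == "__main__":
    # Placeholder actions that only print; replace with real tooling.
    demo = {name: (lambda n=name: print("running", n)) for name in STEP_ORDER}
    run_cycle(demo)
```

Step 6 (repeat on a schedule) would then amount to calling `run_cycle` from cron or a similar scheduler. For Option 2, the `start_mirror` and `wait_for_mirror` placeholders would be replaced by whatever check confirms that table replication has caught up.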
