What you suggest would work exactly as you say, but you would lack some control over whether the Drill bits actually run on the remote cluster. If the remote bits were busy, Drill might opt to run the bits locally. Accurately accounting for the cost of non-locality would, I think, be difficult in practice.
You definitely can do something like what you say by issuing the aggregation query to the remote Drill cluster, then issuing a create-table-as-select (CTAS) command to the local Drill cluster specifying input from the remote (EC2) cluster and output locally. For such a CTAS command, you would need a universal namespace for your dfs data source (MapR provides this, HDFS does not).

Such an approach would not give you an atomic view of the data or the results, but that may or may not be critical for you. Atomicity relative to the S3 input will be essentially impossible, but atomicity for the aggregation output and the CTAS command can be achieved using snapshots on the remote system. I think (but am not sure) that even HDFS snapshots would work if you take the snapshot after all aggregation results are complete (HDFS snapshots are not semantically correct with respect to open files and only apply to closed files).

Doing the aggregation and moving the results is probably the easiest part of this. Starting and stopping the remote cluster and orchestrating the steps will probably be more expensive than any other aspect of your system.
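To make the CTAS step above concrete, here is a minimal sketch. It assumes a MapR-style universal namespace so that the local cluster can see the remote cluster's files by path; the cluster name, directories, and snapshot name are placeholders, not anything Drill provides out of the box:

  -- issued to the LOCAL Drill cluster
  -- reads the remote aggregation output via a snapshot taken after the
  -- aggregation finished, and writes a local copy
  CREATE TABLE dfs.tmp.`latest_aggregates` AS
  SELECT *
  FROM dfs.`/mapr/remote.cluster/aggregates/.snapshot/agg_done`;

Reading from the snapshot directory rather than the live output directory is what gives the CTAS a stable, point-in-time view of the aggregation results.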
On Tue, Jun 2, 2015 at 12:30 PM, Jeroen van Dijk <[email protected]> wrote:

> When I posted the question I was hoping that there was already something available like the following below (using an adapted example from the Drill site):
>
> SELECT * FROM (SELECT t.trans_id,
>                t.trans_info.prod_id[0] AS prod_id,
>                t.trans_info.purch_flag AS purchased
>         FROM remote-drill-cluster.dfs.`clicks/clicks.json` t) sq
> WHERE sq.prod_id BETWEEN 700 AND 750 AND
>       sq.purchased = 'true'
> ORDER BY sq.prod_id;
>
> The table name 'remote-drill-cluster' would tell Drill that it would have to connect to a remote Drill cluster and use the result of the query as input for the other query. I am not sure if this fits in the plugin architecture and whether this is hard to implement. I can imagine though that this would be very powerful and would make the mental overhead very small for the end user.
>
> I'll first try your suggestion before I dive into any plugin details though. Thanks again
>
> On Tue, Jun 2, 2015 at 11:17 AM, Ted Dunning <[email protected]> wrote:
>
>> Happy to help. Let me know when you find the ways that my answer was incomplete or not according to your needs. I have no illusions that something typed too quickly in a hotel room has great depth and would love to improve the suggestion.
>>
>> On Tue, Jun 2, 2015 at 10:59 AM, Jeroen van Dijk <[email protected]> wrote:
>>
>>> Thanks a lot for the complete and clear response, Ted. I hope to start an experiment soon and see if I can get this to work.
>>>
>>> Jeroen
>>>
>>> On Tue, Jun 2, 2015 at 10:29 AM, Ted Dunning <[email protected]> wrote:
>>>
>>> > Drill will make efforts to execute portions of queries locally, but that doesn't look like a powerful enough mechanism for your use case since S3 isn't really local to anything.
>>> >
>>> > Also, as a philosophy, Drill delegates all handling of materialized views to you rather than taking responsibility for it.
>>> >
>>> > Speaking concretely, I think that the following would meet your requirements. I will be speaking from the point of view of hosting data on a MapR cluster since that is what I know best. You should be able to implement similar mechanisms on other Hadoop distributions, although with degraded guarantees.
>>> >
>>> > The basic idea as I understand it is that you have a large amount of data in S3 that you periodically want to aggregate with Drill using an EMR/EC2-based cluster, and also magically have access to the most recent aggregates in another, non-Amazon cluster for other kinds of queries.
>>> >
>>> > One process that I would suggest for meeting this requirement would be:
>>> >
>>> > *Option 1: Ephemeral EC2/VPC cluster with long-lived storage, mirror transfer of data*
>>> >
>>> > 1) Restart the previously stopped virtual private cloud (VPC) based cluster. This will continue from the last cluster state, including knowledge of previously started data mirrors. It will also launch all Drill bits. All cluster data is on EBS and so will survive VPC cluster shutdowns. This operation would be initiated from your local machines.
>>> >
>>> > 2) Launch a Drill query to do the aggregations from S3. Results should be written to a MapR FS volume that is configured for mirroring back to your local cluster. This launch could be done from your local machines.
>>> >
>>> > 3) Once the aggregations complete, initiate mirroring of the aggregation product back to the local cluster. This will do a file-system-level snapshot and initiate transfer of only the changed blocks. These blocks will contain your new aggregation results. This process has to happen on the VPC cluster, but could be initiated securely via a number of mechanisms.
>>> >
>>> > 4) When the mirror operation completes (this depends on network speeds; transfer overhead should be <20% of total available bandwidth), pause the VPC cluster. Billing will continue for data stored on EBS, but billing for cluster hosts will stop. This action would be initiated from your local machines.
>>> >
>>> > 5) On your local cluster, the mirrored volume will now contain the latest aggregates. These aggregates will appear atomically, so you won't have any point in time when part of the aggregates are visible and part are not.
>>> >
>>> > 6) Repeat the process on whatever schedule you like.
>>> >
>>> > *Option 2: Ephemeral VPC cluster, asynchronous table mirroring of aggregates*
>>> >
>>> > The first option has the virtue that no aggregate data will be visible locally until all of it is visible. This may sometimes be a vice instead of a virtue.
>>> >
>>> > In such a case, you can get similar guarantees around reliable transfer, but also transfer aggregates a record at a time by using a table replication strategy. The process would proceed as in Option 1, but steps 3-5 would be implemented using table replication. The aggregation process would insert aggregates into a MapR DB table which is configured to replicate back to a mirrored table on your local cluster. Shortly after each aggregate record is inserted into the table on the VPC side, that record will appear on the local cluster. After the last transfer completes, the VPC cluster can be stopped and you continue as before.
>>> >
>>> > *Summary*
>>> >
>>> > As you can see, these options use local and remote Drill clusters for different purposes. You can initiate queries remotely, but you will have to specify which cluster explicitly. Drill also does not address data motion, so you need to handle that yourself.
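As a concrete illustration of step 2 of Option 1 above, the aggregation on the VPC cluster could itself be a CTAS whose output lands in the mirrored volume. This is a rough sketch only; the s3 storage plugin, the dfs workspace pointing at the mirrored volume, and the JSON fields are placeholders you would adapt to your own setup:

  -- issued to the EC2/VPC Drill cluster
  -- writes into the MapR FS volume that mirrors back to the local cluster
  CREATE TABLE dfs.aggregates.`daily` AS
  SELECT sq.prod_id, COUNT(*) AS purchases
  FROM (SELECT t.trans_info.prod_id[0] AS prod_id
        FROM s3.`clicks/clicks.json` t) sq
  GROUP BY sq.prod_id;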
>>> > On Tue, Jun 2, 2015 at 9:56 AM, Jeroen van Dijk <[email protected]> wrote:
>>> >
>>> > > Hi all,
>>> > >
>>> > > (I only know Drill superficially, apologies for implied ignorance)
>>> > >
>>> > > Is it currently possible to have a local Drill cluster interact with a remote Drill cluster to offload work and data transfer? To be more precise, I'm thinking about the following. I have a big set of local data on a local hdfs cluster (Europe) which I want to join with data that is hosted on S3 (US-east). I would like to prevent large data transfers where I'm mostly interested in aggregate data. Would it be possible to have my local cluster send instructions to a remote (EMR) cluster to achieve this? If not, are there other effective ways to deal with this situation in Drill?
>>> > >
>>> > > Thanks,
>>> > > Jeroen
