Drill will make efforts to execute portions of queries locally, but that
doesn't look like a powerful enough mechanism for your use case since S3
isn't really local to anything.

Also, as a philosophy, Drill delegates all handling of materialized views
to you rather than taking responsibility for it.

Speaking concretely, I think that the following would meet your
requirements.  I will be speaking from the point of view of hosting data on
a MapR cluster since that is what I know best.  You should be able to
implement similar mechanisms on other Hadoop distributions, although with
degraded guarantees.

The basic idea, as I understand it, is that you have a large amount of data
in S3 that you periodically want to aggregate with Drill on an EMR/EC2-based
cluster, and that you also want to magically have access to the most recent
aggregates in another, non-Amazon cluster for other kinds of queries.

One process that I would suggest for meeting this requirement would be:

*Option 1: Ephemeral EC2/VPC cluster with long-lived storage, mirror
transfer of data*

1) restart previously stopped virtual private cloud (VPC) based cluster.
This will continue from the last cluster state, including knowledge of
previously started data mirrors.  It will also launch all Drillbits.  All
cluster data is on EBS and so will survive VPC cluster shutdowns.  This
operation would be initiated from your local machines.

2) launch Drill query to do aggregations from S3.  Results should be
written to MapR FS volume that is configured for mirroring back to your
local cluster.  This launch could be done from your local machines.

3) once aggregations complete, initiate mirroring of aggregation product
back to local cluster. This will do a file-system level snapshot and
initiate transfer of only the changed blocks.  These blocks will contain
your new aggregation results.  This process has to happen on the VPC
cluster, but could be initiated securely via a number of mechanisms.

4) when the mirror operation completes (how long this takes depends on
network speed; transfer overhead should be <20% of total available
bandwidth), pause the VPC cluster.
Billing will continue for data stored on EBS, but billing for cluster hosts
will stop.  This action would be initiated from your local machines.

5) on your local cluster, the mirrored volume containing your aggregates
will now contain the latest aggregates.  These aggregates will appear
atomically so you won't have any point in time when part of the aggregates
are visible and part are not.

6) repeat process on whatever schedule you like.
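
For what it's worth, the cycle above could be driven from your local machines
by a small script along the following lines.  This is only a sketch under
stated assumptions: the four step hooks are hypothetical placeholders that you
would wire to your own cluster start/stop, query submission and mirror
commands, and the table name and S3 path in the usage example are invented.
The payload shape (`queryType` plus `query`) is how I understand Drill's REST
query interface to work; check it against your Drill version.

```python
import json
from typing import Callable, Dict, List

# Order mirrors steps 1-4 above. Step 5 needs no action on the local side and
# step 6 is whatever scheduler calls run_cycle().
STEP_ORDER = ["start_cluster", "run_aggregation", "mirror_results", "stop_cluster"]


def build_ctas_payload(target_table: str, select_sql: str) -> str:
    """Build a JSON body for a Drill query that writes aggregates via CTAS.

    Assumption: target_table lives on the volume configured for mirroring.
    """
    query = f"CREATE TABLE {target_table} AS {select_sql}"
    return json.dumps({"queryType": "SQL", "query": query})


def run_cycle(steps: Dict[str, Callable[[], None]]) -> List[str]:
    """Run one aggregation cycle and return the names of the steps that ran.

    Each hook is expected to block until its step has finished; in
    particular, mirror_results must not return before the mirror completes.
    The cluster is stopped even if an earlier step raises, so that a failed
    run does not keep billing for cluster hosts.
    """
    completed: List[str] = []
    try:
        for name in STEP_ORDER[:-1]:
            steps[name]()
            completed.append(name)
    finally:
        steps["stop_cluster"]()
        completed.append("stop_cluster")
    return completed


if __name__ == "__main__":
    # Dry run: every hook just prints its name instead of touching a cluster.
    dry_run = {name: (lambda n=name: print("would run:", n)) for name in STEP_ORDER}
    run_cycle(dry_run)
    # Hypothetical table and S3 path, for illustration only.
    print(build_ctas_payload(
        "dfs.aggregates.`daily`",
        "SELECT customer, COUNT(*) AS events FROM s3.`logs/` GROUP BY customer",
    ))
```

Stopping the cluster in the `finally` clause is a design choice, not a
requirement.  Since cluster data lives on EBS, a failed run can be picked up
again on the next restart.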


*Option 2: Ephemeral VPC cluster, asynchronous table mirroring of
aggregates*

The first option has the virtue that no aggregate data will be visible
locally until all of the aggregate data is visible.  This may sometimes be
a vice instead of a virtue.

In such a case, you can get similar guarantees around reliable transfer,
but also transfer aggregates a record at a time by using a table
replication strategy.  The process would proceed as in Option 1, but steps
3-5 would be implemented using table replication.  The aggregation process
would insert aggregates into a MapR-DB table which is configured to
replicate back to a mirrored table on your local cluster.  Shortly after
each aggregate record is inserted into the table on the VPC side, that
record will appear on the local cluster.  After the last transfer
completes, the VPC cluster can be stopped and you continue as before.


*Summary*

As you can see, these options use local and remote Drill clusters for
different purposes.  You can initiate queries remotely, but you will have
to specify which cluster explicitly.  Drill also does not address data
motion so you need to handle that yourself.





On Tue, Jun 2, 2015 at 9:56 AM, Jeroen van Dijk <[email protected]>
wrote:

> Hi all,
>
> (I only know Drill superficially, apologies for implied ignorance)
>
> Is it currently possible to have a local Drill cluster interact with a
> remote Drill cluster to offload work and data transfer? To be more precise,
> I'm thinking about the following. I have a big set of local data on a local
> hdfs cluster (Europe) which I want to join with data that is hosted on S3
> (US-east). I would like to prevent large data transfers where I'm mostly
> interested in aggregate data. Would it be possible to have my local cluster
> send instructions to a remote (EMR) cluster to achieve this? If not, are
> there other effective ways to deal with this situation in Drill?
>
> Thanks,
> Jeroen
>
