Thanks a lot for the complete and clear response, Ted. I hope to start an
experiment soon and see if can get this to work.

Jeroen

On Tue, Jun 2, 2015 at 10:29 AM, Ted Dunning <[email protected]> wrote:

> Drill will make efforts to execute portions of queries locally, but that
> doesn't look like a powerful enough mechanism for your use case since S3
> isn't really local to anything.
>
> Also, as a philosophy, Drill delegates all handling of materialized views
> to you rather than taking responsibility for it.
>
> Speaking concretely, I think that the following would meet your
> requirements.  I will be speaking from the point of view of hosting data on
> a MapR cluster since that is what I know best.  You should be able to
> implement similar mechanisms on other Hadoop distributions, although with
> degraded guarantees.
>
> The basic idea as I understand it is that you have a large amount of data
> in S3 that you periodically want to aggregate using Drill using an EMR/EC2
> based cluster and also magically have access to the most recent aggregates
> in another non-Amazon cluster for other kinds of queries.
>
> One process that I would suggest for meeting this requirement would be:
>
> *Option 1: Ephemeral EC2/VPC cluster with long-lived storage, mirror
> transfer of data*
>
> 1) restart previously stopped virtual private cloud (VPC) based cluster.
> This will continue from the last cluster state, including knowledge of
> previously started data mirrors.  It will also launch all Drill bits. All
> cluster data is on EBS and so will survive VPC cluster shutdowns.  This
> operation would be initiated from your local machines.
>
> 2) launch Drill query to do aggregations from S3.  Results should be
> written to MapR FS volume that is configured for mirroring back to your
> local cluster.  This launch could be done from your local machines.
>
> 3) once aggregations complete, initiate mirroring of aggregation product
> back to local cluster. This will do a file-system level snapshot and
> initiate transfer of only the changed blocks.  These blocks will contain
> your new aggregation results.  This process has to happen on the VPC
> cluster, but could be initiated securely via a number of mechanisms.
>
> 4) when the mirror operation completes (depends on network speeds, transfer
> overhead should be <20% of total available bandwidth), pause VPC cluster.
> Billing will continue for data stored on EBS, but billing for cluster hosts
> will stop.  This action would be initiated from your local machines.
>
> 5) on your local cluster, the mirrored volume containing your aggregates
> will now contain the latest aggregates.  These aggregates will appear
> atomically so you won't have any point in time when part of the aggregates
> are visible and part are not.
>
> 6) repeat process on whatever schedule you like.
>
>
> *Option 2: Ephemeral VPC cluster, asynchronous table mirroring of
> aggregates*
>
> The first option has the virtue that no aggregate data will be visible
> locally until all aggregate is visible.  This may sometimes be a vice
> instead of a virtue.
>
> In such a case, you can get similar guarantees around reliable transfer,
> but also transfer aggregates a record at a time by using a table
> replication strategy.  The process would proceed as in Option 1, but steps
> 3-5 would be implemented using table replication.  The aggregation process
> would insert aggregates into a MapR DB table which is configured to
> replicate back to a mirrored table on your local cluster.  Shortly after
> the each aggregate record is inserted into the table on the VPC side, that
> record will appear on the local cluster.  After the last transfer
> completes, the VPC cluster can be stopped and you continue as before.
>
>
> *Summary*
>
> As you can see, these options use local and remote Drill clusters for
> different purposes.  You can initiate queries remotely, but you will have
> to specify which cluster explicitly.  Drill also does not address data
> motion so you need to handle that yourself.
>
>
>
> 2) configure
>
>
> On Tue, Jun 2, 2015 at 9:56 AM, Jeroen van Dijk <
> [email protected]>
> wrote:
>
> > Hi all,
> >
> > (I only know Drill superficially, apologies for implied ignorance)
> >
> > Is it currently possible to have a local Drill cluster interact with a
> > remote Drill cluster to offload work and data transfer? To be more
> precise,
> > I'm thinking about the following. I have a big set of local data on a
> local
> > hdfs cluster (Europe) which I want to join with data that is hosted on S3
> > (US-east). I would like to prevent large data transfers where I'm mostly
> > interested in aggregate data. Would it be possible to have my local
> cluster
> > send instructions to a remote (EMR) cluster to achieve this? If not, are
> > there other effective ways to deal with this situation in Drill?
> >
> > Thanks,
> > Jeroen
> >
>

Reply via email to