When I posted the question I was hoping that there was already something
available like the following below (using an adapted example from the drill
site):
SELECT * FROM (SELECT t.trans_id,
t.trans_info.prod_id[0] AS prod_id,
t.trans_info.purch_flag AS purchased
FROM remote-drill-cluster.dfs.`clicks/clicks.json` t)
sqWHERE sq.prod_id BETWEEN 700 AND 750 AND
sq.purchased = 'true'ORDER BY sq.prod_id;
The table name 'remote-drill-cluster' would tell Drill that it would have
to connect to a remote Drill cluster and use the result of the query as
input for the other query. I am not sure if this fits in the plugin
architecture and whether this is hard to implement. I can imagine though
that this will be very powerful and makes the mental overhead very small
for the enduser.
I'll first try your suggestion before I dive into any plugin details
though. Thanks again
On Tue, Jun 2, 2015 at 11:17 AM, Ted Dunning <[email protected]> wrote:
>
> Happy to help. Let me know when you find the ways that my answer was
> incomplete or not according to your needs. I have no illusions that
> something typed too quickly in a hotel room has great depth and would love
> to improve the suggestion.
>
>
>
> On Tue, Jun 2, 2015 at 10:59 AM, Jeroen van Dijk <
> [email protected]> wrote:
>
>> Thanks a lot for the complete and clear response, Ted. I hope to start an
>> experiment soon and see if can get this to work.
>>
>> Jeroen
>>
>> On Tue, Jun 2, 2015 at 10:29 AM, Ted Dunning <[email protected]>
>> wrote:
>>
>> > Drill will make efforts to execute portions of queries locally, but that
>> > doesn't look like a powerful enough mechanism for your use case since S3
>> > isn't really local to anything.
>> >
>> > Also, as a philosophy, Drill delegates all handling of materialized
>> views
>> > to you rather than taking responsibility for it.
>> >
>> > Speaking concretely, I think that the following would meet your
>> > requirements. I will be speaking from the point of view of hosting
>> data on
>> > a MapR cluster since that is what I know best. You should be able to
>> > implement similar mechanisms on other Hadoop distributions, although
>> with
>> > degraded guarantees.
>> >
>> > The basic idea as I understand it is that you have a large amount of
>> data
>> > in S3 that you periodically want to aggregate using Drill using an
>> EMR/EC2
>> > based cluster and also magically have access to the most recent
>> aggregates
>> > in another non-Amazon cluster for other kinds of queries.
>> >
>> > One process that I would suggest for meeting this requirement would be:
>> >
>> > *Option 1: Ephemeral EC2/VPC cluster with long-lived storage, mirror
>> > transfer of data*
>> >
>> > 1) restart previously stopped virtual private cloud (VPC) based cluster.
>> > This will continue from the last cluster state, including knowledge of
>> > previously started data mirrors. It will also launch all Drill bits.
>> All
>> > cluster data is on EBS and so will survive VPC cluster shutdowns. This
>> > operation would be initiated from your local machines.
>> >
>> > 2) launch Drill query to do aggregations from S3. Results should be
>> > written to MapR FS volume that is configured for mirroring back to your
>> > local cluster. This launch could be done from your local machines.
>> >
>> > 3) once aggregations complete, initiate mirroring of aggregation product
>> > back to local cluster. This will do a file-system level snapshot and
>> > initiate transfer of only the changed blocks. These blocks will contain
>> > your new aggregation results. This process has to happen on the VPC
>> > cluster, but could be initiated securely via a number of mechanisms.
>> >
>> > 4) when the mirror operation completes (depends on network speeds,
>> transfer
>> > overhead should be <20% of total available bandwidth), pause VPC
>> cluster.
>> > Billing will continue for data stored on EBS, but billing for cluster
>> hosts
>> > will stop. This action would be initiated from your local machines.
>> >
>> > 5) on your local cluster, the mirrored volume containing your aggregates
>> > will now contain the latest aggregates. These aggregates will appear
>> > atomically so you won't have any point in time when part of the
>> aggregates
>> > are visible and part are not.
>> >
>> > 6) repeat process on whatever schedule you like.
>> >
>> >
>> > *Option 2: Ephemeral VPC cluster, asynchronous table mirroring of
>> > aggregates*
>> >
>> > The first option has the virtue that no aggregate data will be visible
>> > locally until all aggregate is visible. This may sometimes be a vice
>> > instead of a virtue.
>> >
>> > In such a case, you can get similar guarantees around reliable transfer,
>> > but also transfer aggregates a record at a time by using a table
>> > replication strategy. The process would proceed as in Option 1, but
>> steps
>> > 3-5 would be implemented using table replication. The aggregation
>> process
>> > would insert aggregates into a MapR DB table which is configured to
>> > replicate back to a mirrored table on your local cluster. Shortly after
>> > the each aggregate record is inserted into the table on the VPC side,
>> that
>> > record will appear on the local cluster. After the last transfer
>> > completes, the VPC cluster can be stopped and you continue as before.
>> >
>> >
>> > *Summary*
>> >
>> > As you can see, these options use local and remote Drill clusters for
>> > different purposes. You can initiate queries remotely, but you will
>> have
>> > to specify which cluster explicitly. Drill also does not address data
>> > motion so you need to handle that yourself.
>> >
>> >
>> >
>> > 2) configure
>> >
>> >
>> > On Tue, Jun 2, 2015 at 9:56 AM, Jeroen van Dijk <
>> > [email protected]>
>> > wrote:
>> >
>> > > Hi all,
>> > >
>> > > (I only know Drill superficially, apologies for implied ignorance)
>> > >
>> > > Is it currently possible to have a local Drill cluster interact with a
>> > > remote Drill cluster to offload work and data transfer? To be more
>> > precise,
>> > > I'm thinking about the following. I have a big set of local data on a
>> > local
>> > > hdfs cluster (Europe) which I want to join with data that is hosted
>> on S3
>> > > (US-east). I would like to prevent large data transfers where I'm
>> mostly
>> > > interested in aggregate data. Would it be possible to have my local
>> > cluster
>> > > send instructions to a remote (EMR) cluster to achieve this? If not,
>> are
>> > > there other effective ways to deal with this situation in Drill?
>> > >
>> > > Thanks,
>> > > Jeroen
>> > >
>> >
>>
>
>