I am not aware of anyone doing something like this today, but it seems like
something best handled outside of Drill right now. Drill considers itself
essentially stateless; we do not manage indexes, table constraints, or
cached data for any of our current storage systems. There has been some work
on caching Parquet metadata, where we place all of the Parquet footers in a
single file that has to be refreshed manually. That work has not made it
into the mainline yet, but you can follow the progress here:

https://issues.apache.org/jira/browse/DRILL-2743

I would take a look around for general-purpose local caching systems for
S3. To work with Drill today, such a system would have to re-expose the
HDFS API. There might be something out there that already does this, but
since many of the primary users of S3 are web application developers, they
may not have worried about providing the HDFS API on top of any caching
systems developed to date.

One thing to note: the HDFS API is already available on top of the local
file system; this is what enables us to read from local disk in embedded
mode. If you can get a caching system to expose itself over NFS, you could
mount it at the same path on all of your nodes, and Drill should be able to
read from that path through the local file system.
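
As a small illustration of that last point, the snippet below uses nothing
but the standard Hadoop FileSystem API to list files under a hypothetical
mount point (/mnt/s3cache is a made-up path); this is the same
local-file-system code path that embedded mode relies on:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalMountCheck {
  public static void main(String[] args) throws Exception {
    // HDFS API over the local file system; anything mounted at this path
    // (for example an NFS-exported cache) is readable the same way on every node.
    FileSystem localFs = FileSystem.getLocal(new Configuration());
    for (FileStatus status : localFs.listStatus(new Path("/mnt/s3cache/parquet"))) {
      System.out.println(status.getPath() + " " + status.getLen());
    }
  }
}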



On Tue, Jul 14, 2015 at 1:06 AM, Stefán Baxter <[email protected]>
wrote:

> Hi,
>
> I'm wondering if the people who use Drill with S3 are using some sort of
> local cache on the drillbit nodes for historical, non-changing Parquet
> segments.
>
> I'm pretty sure that I'm not using the correct terminology and that the
> correct question is this: are there any ways to optimize S3 with Drill so
> that "hot segments" are stored locally while they are hot and then dropped
> from the local nodes when they are not?
>
> I guess this only really matters where the networking speed between the
> drillbit nodes and S3 is not optimal.
>
> Regards,
>  -Stefan
>
