Re: Optimizing S3 access for Drill using Parquet files

Stefán Baxter Tue, 14 Jul 2015 10:13:37 -0700

Hi,

Thank you.


I was not suggesting that this be part of Drill, only asking whether anyone
has experience in this area. :)

I'm trying to evaluate S3-almost-only vs. HDFS, so your points are handy.

Regards,
 -Stefan



On Tue, Jul 14, 2015 at 5:08 PM, Jason Altekruse <[email protected]>
wrote:

> I am not aware of anyone doing something like this today, but it seems like
> something best handled outside of Drill right now. Drill considers itself
> essentially stateless: we do not manage indexes, table constraints, or
> cached data for any of our current storage systems. There was some work
> being done to cache Parquet metadata; in this case we were placing all of
> the Parquet footers in a single file, which would need to be manually
> refreshed. This work has not made it into the mainline, but you can follow
> the progress here:
>
> https://issues.apache.org/jira/browse/DRILL-2743
>
> I would take a look around for general purpose local caching systems for
> S3. To make these work with Drill today they will have to re-expose the
> HDFS API. There might be something out there that already does this, but as
> some of the primary users of S3 are web application developers, they might
> not have worried about providing the HDFS API on top of any caching systems
> developed to date.
>
> One thing to note: the HDFS API is already available on top of the local
> file system, which is what enables us to read from the local disk in
> embedded mode. If you can get a caching system to expose NFS, you could
> mount it at the same path on all of your nodes, and Drill should be able
> to read from that path as part of your local FS.
>
>
>
> On Tue, Jul 14, 2015 at 1:06 AM, Stefán Baxter <[email protected]>
> wrote:
>
> > Hi,
> >
> > I'm wondering if the people that use Drill with S3 are using some sort of
> > local cache on the drillbit nodes for historical, unchanging Parquet
> > segments.
> >
> > I'm pretty sure that I'm not using the correct terminology and that the
> > correct question is this: Are there any ways to optimize S3 with Drill so
> > that "hot segments" are stored locally while hot and then just dropped
> > from local nodes when they are not?
> >
> > I guess this only really matters where the network speed between the
> > drillbit nodes and S3 is not optimal.
> >
> > Regards,
> >  -Stefan
> >
>
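
For illustration only, the "hot segments" idea from the original question
could be sketched as a small read-through cache that sits outside Drill, as
suggested above. This is a minimal sketch under stated assumptions, not
something Drill provides: the class name HotSegmentCache and the fetch
callable (standing in for whatever downloads one object from S3) are
hypothetical, and the key-to-filename mapping is deliberately simplistic.
Files are kept on local disk while recently used, and the least recently
used ones are dropped once a size limit is exceeded.

```python
import os
from collections import OrderedDict
from typing import Callable


class HotSegmentCache:
    """Read-through cache: keeps recently used files on local disk and
    evicts the least recently used ones when the size limit is exceeded."""

    def __init__(self, cache_dir: str, max_bytes: int,
                 fetch: Callable[[str], bytes]):
        self.cache_dir = cache_dir
        self.max_bytes = max_bytes
        self.fetch = fetch  # hypothetical: downloads one object by key
        self.sizes = OrderedDict()  # key -> size in bytes, coldest first
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, key: str) -> str:
        # Flatten the object key into one file name (simplification:
        # keys that differ only by "/" vs "__" would collide).
        return os.path.join(self.cache_dir, key.replace("/", "__"))

    def get(self, key: str) -> str:
        """Return a local path for key, downloading it on a cache miss."""
        if key in self.sizes:
            self.sizes.move_to_end(key)  # mark as most recently used
            return self._path(key)
        data = self.fetch(key)
        with open(self._path(key), "wb") as f:
            f.write(data)
        self.sizes[key] = len(data)
        self._evict()
        return self._path(key)

    def _evict(self) -> None:
        # Drop cold segments until the cache fits, always keeping the
        # segment that was just added.
        while sum(self.sizes.values()) > self.max_bytes and len(self.sizes) > 1:
            cold_key, _ = self.sizes.popitem(last=False)
            os.remove(self._path(cold_key))
```

The paths returned by get() are ordinary local files, so whatever populates
such a cache would still need to expose them at a path Drill can read, for
example through the local file system route mentioned in the reply above.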
