Re: Optimizing S3 access for Drill using Parquet files

Paul Mogren Tue, 14 Jul 2015 12:14:30 -0700

Stefan,

You might be interested in http://tachyon-project.org

On 7/14/15, 1:12 PM, "Stefán Baxter" <[email protected]> wrote:

>Hi,
>
>Thank you.
>
>I was not suggesting this to be a part of Drill, only asking if any
>experience exists in this area. :)
>
>I'm trying to evaluate S3-almost-only vs. HDFS so your points are handy.
>
>Regards,
> -Stefan
>
>
>
>On Tue, Jul 14, 2015 at 5:08 PM, Jason Altekruse <[email protected]> wrote:
>
>> I am not aware of anyone doing something like this today, but it seems
>> like something best handled outside of Drill right now. Drill considers
>> itself essentially stateless: we do not manage indexes, table
>> constraints, or cached data for any of our current storage systems.
>> There was some work being done to cache Parquet metadata; in this case
>> we were placing all of the Parquet footers in a single file, which
>> would need to be manually refreshed. This work has not made it into
>> the mainline, but you can follow the progress here:
>>
>> https://issues.apache.org/jira/browse/DRILL-2743
>>
>> I would take a look around for general-purpose local caching systems
>> for S3. To make these work with Drill today they will have to
>> re-expose the HDFS API. There might be something out there that
>> already does this, but as some of the primary users of S3 are web
>> application developers, they might not have worried about providing
>> the HDFS API on top of any caching systems developed to date.
>>
>> One thing to note: the HDFS API is already available on top of the
>> local file system; this is what enables us to read from the local disk
>> in embedded mode. If you can get a caching system to expose NFS, you
>> could mount it at the same path on all of your nodes, and Drill should
>> be able to read from that path mounted on your local FS.
>>
>>
>>
>> On Tue, Jul 14, 2015 at 1:06 AM, Stefán Baxter <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > I'm wondering if the people that use Drill with S3 are using some sort
>> > of local cache on the drillbit nodes for historical, non-changing
>> > Parquet segments.
>> >
>> > I'm pretty sure that I'm not using the correct terminology and that
>> > the correct question is this: Are there any ways to optimize S3 with
>> > Drill so that "hot segments" are stored locally while hot and then
>> > just dropped from local nodes when they are not?
>> >
>> > I guess this only really matters where networking speeds between the
>> > drillbit nodes and S3 are not optimal.
>> >
>> > Regards,
>> >  -Stefan
>> >
>>
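
The "hot segments" idea raised at the start of the thread (keep recently read Parquet segments local, drop them once they go cold) can be illustrated with a small least-recently-used cache. This is only a sketch of the general technique, not something Drill provides: the class name HotSegmentCache, the fetch callable (a stand-in for whatever actually reads an object from S3), and the max_bytes budget are all hypothetical names chosen for the example, and the cache holds bytes in memory rather than on local disk to keep it short.

```python
from collections import OrderedDict
from typing import Callable, List


class HotSegmentCache:
    """Size-bounded LRU cache: hot segments stay local, cold ones are dropped."""

    def __init__(self, fetch: Callable[[str], bytes], max_bytes: int) -> None:
        self._fetch = fetch            # hypothetical remote reader, e.g. an S3 GET
        self._max_bytes = max_bytes    # total bytes the cache may hold
        self._used = 0                 # bytes currently held
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> bytes:
        """Return the segment for key, fetching remotely only on a miss."""
        if key in self._items:
            # Cache hit: mark the segment as most recently used.
            self._items.move_to_end(key)
            return self._items[key]

        data = self._fetch(key)
        # Segments larger than the whole budget are passed through uncached.
        if len(data) <= self._max_bytes:
            self._items[key] = data
            self._used += len(data)
            # Evict least recently used segments until we fit the budget.
            while self._used > self._max_bytes:
                _, evicted = self._items.popitem(last=False)
                self._used -= len(evicted)
        return data

    def cached_keys(self) -> List[str]:
        """Keys currently held, ordered from coldest to hottest."""
        return list(self._items)
```

Because the segments in question are historical and never change, a cache like this needs no invalidation logic; eviction by recency is enough. To be usable from Drill, such a cache would still have to sit behind the HDFS API or a locally mounted path, as discussed in the replies.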
