Stefan,

You might be interested in http://tachyon-project.org

On 7/14/15, 1:12 PM, "Stefán Baxter" <[email protected]> wrote:

>Hi,
>
>Thank you.
>
>I was not suggesting this to be a part of Drill, only asking if any
>experience exists in this area. :)
>
>I'm trying to evaluate S3-almost-only vs. HDFS, so your points are handy.
>
>Regards,
> -Stefan
>
>On Tue, Jul 14, 2015 at 5:08 PM, Jason Altekruse <[email protected]> wrote:
>
>> I am not aware of anyone doing something like this today, but it seems
>> like something best handled outside of Drill right now. Drill considers
>> itself essentially stateless; we do not manage indexes, table constraints,
>> or cached data for any of our current storage systems. There was some work
>> being done to cache Parquet metadata; in this case we were placing all of
>> the Parquet footers in a single file, which would need to be manually
>> refreshed. This work has not made it into the mainline, but you can follow
>> the progress here:
>>
>> https://issues.apache.org/jira/browse/DRILL-2743
>>
>> I would take a look around for general-purpose local caching systems for
>> S3. To make these work with Drill today they will have to re-expose the
>> HDFS API. There might be something out there that already does this, but
>> as some of the primary users of S3 are web application developers, they
>> might not have worried about providing the HDFS API on top of any caching
>> systems developed to date.
>>
>> One thing to note: the HDFS API is already available on top of the local
>> file system; this is what enables us to read from the local disk in
>> embedded mode. If you can get a caching system to expose NFS, you could
>> mount this to the same path on all of your nodes and Drill should be able
>> to read from that path mounted on your local FS.
>>
>> On Tue, Jul 14, 2015 at 1:06 AM, Stefán Baxter <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > I'm wondering if the people that use Drill with S3 are using some sort
>> > of local cache on the drillbit nodes for historical, non-changing
>> > Parquet segments.
>> >
>> > I'm pretty sure that I'm not using the correct terminology and that the
>> > correct question is this: are there any ways to optimize S3 with Drill
>> > so that "hot segments" are stored locally while hot and then just
>> > dropped from local nodes when they are not?
>> >
>> > I guess this only really matters where networking speeds between the
>> > drillbit nodes and S3 are not optimal.
>> >
>> > Regards,
>> > -Stefan
