Hi Paul,

This sounds interesting.
Can you elaborate a bit more on:

- Do all the files need to be loaded in memory, or will it swap hot files in/out of memory?
- Have you used it with Drill / Parquet?
- Is the built-in columnar store relevant (native store for row tables)? (I would think not.)

Regards,
-Stefan

On Tue, Jul 14, 2015 at 7:13 PM, Paul Mogren <[email protected]> wrote:

> Stefan,
>
> You might be interested in http://tachyon-project.org
>
> On 7/14/15, 1:12 PM, "Stefán Baxter" <[email protected]> wrote:
>
>> Hi,
>>
>> Thank you.
>>
>> I was not suggesting this to be a part of Drill, only asking if any
>> experience exists in this area. :)
>>
>> I'm trying to evaluate S3-almost-only vs. HDFS, so your points are handy.
>>
>> Regards,
>> -Stefan
>>
>> On Tue, Jul 14, 2015 at 5:08 PM, Jason Altekruse <[email protected]> wrote:
>>
>>> I am not aware of anyone doing something like this today, but it seems
>>> like something best handled outside of Drill right now. Drill considers
>>> itself essentially stateless; we do not manage indexes, table constraints,
>>> or caching data for any of our current storage systems. There was some
>>> work being done to cache Parquet metadata; in this case we were placing
>>> all of the Parquet footers in a single file, which would need to be
>>> manually refreshed. This work has not made it into the mainline, but you
>>> can follow the progress here:
>>>
>>> https://issues.apache.org/jira/browse/DRILL-2743
>>>
>>> I would take a look around for general-purpose local caching systems for
>>> S3. To make these work with Drill today, they will have to re-expose the
>>> HDFS API. There might be something out there that already does this, but
>>> as some of the primary users of S3 are web application developers, they
>>> might not have worried about providing the HDFS API on top of any caching
>>> systems developed to date.
>>>
>>> One thing to note: the HDFS API is already available on top of the local
>>> file system; this is what enables us to read from the local disk in
>>> embedded mode. If you can get a caching system to expose NFS, you could
>>> mount it to the same path on all of your nodes, and Drill should be able
>>> to read from that path mounted on your local FS.
>>>
>>> On Tue, Jul 14, 2015 at 1:06 AM, Stefán Baxter <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm wondering if the people who use Drill with S3 are using some sort of
>>>> local cache on the drillbit nodes for historical, non-changing Parquet
>>>> segments.
>>>>
>>>> I'm pretty sure that I'm not using the correct terminology and that the
>>>> correct question is this: are there any ways to optimize S3 with Drill
>>>> so that "hot segments" are stored locally while hot and then just dropped
>>>> from the local nodes when they are not?
>>>>
>>>> I guess this only really matters where the networking speed between the
>>>> drillbit nodes and S3 is not optimal.
>>>>
>>>> Regards,
>>>> -Stefan
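P.S. To make sure I follow Jason's point above about the HDFS API already
being available on top of the local file system: here is a minimal sketch
(not Drill code; the /mnt/s3cache path is just a made-up mount point for an
NFS-backed cache) showing the plain Hadoop FileSystem API reading a
file:/// URI, which is the same interface a drillbit would use against a
locally mounted cache directory.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LocalFsThroughHdfsApi {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // "file:///" selects Hadoop's LocalFileSystem behind the same
            // org.apache.hadoop.fs.FileSystem interface used for HDFS or S3,
            // so a mounted cache path looks no different to the reader.
            FileSystem fs = FileSystem.get(new Path("file:///").toUri(), conf);

            // List the (hypothetical) locally mounted cache of Parquet segments.
            for (FileStatus status : fs.listStatus(new Path("/mnt/s3cache/parquet"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }
    }

If I understand correctly, pointing a file-based storage plugin at that same
mount path on every node would then let Drill read the cached segments
directly.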
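Also, regarding the Parquet metadata work: my rough understanding (this is
only an illustrative sketch using parquet-mr, not the DRILL-2743
implementation, and the file path is made up) is that the footer already
carries the schema and row-group metadata, so caching the footers would
avoid re-reading every file at planning time.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;

    public class FooterPeek {
        public static void main(String[] args) throws Exception {
            // Hypothetical locally cached segment pulled down from S3.
            Path file = new Path("/mnt/s3cache/parquet/segment-000.parquet");

            // Read only the footer: the schema plus per-row-group metadata.
            ParquetMetadata footer = ParquetFileReader.readFooter(new Configuration(), file);

            System.out.println("schema: " + footer.getFileMetaData().getSchema());

            long rows = 0;
            for (BlockMetaData block : footer.getBlocks()) {
                rows += block.getRowCount();
            }
            System.out.println("row groups: " + footer.getBlocks().size() + ", rows: " + rows);
        }
    }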
