Hi Hafiz,

We are trying to find out whether the Tachyon project (http://tachyon-project.org/) can be used as a bridge between the two worlds (a local S3 cache with some built-in intelligence).
Others may know of alternative "S3 sweeteners".

Regards,
-Stefan

On Sat, Jul 25, 2015 at 7:32 PM, David Tucker <[email protected]> wrote:

> One more thing to remember ... S3 is an object store, not a file system in
> the traditional sense. That means that when a drillbit accesses a file
> from S3, the whole thing is transferred ... whether it's 100 bytes or 100
> megabytes. The advantages of the Parquet format are far more obvious in a
> file-system environment, where basic operations like lseek and partial
> file reads are supported.
>
> Stefan is absolutely correct that you're still better off with Parquet
> files ... if only because the absolute volume of data you'll be pulling in
> from S3 will be reduced. In terms of file size, you'll likely see better
> performance with larger files rather than smaller ones (thousands of
> GB-sized files for your TB of data rather than millions of MB-sized
> files). This will definitely be a balancing act; you'll want to test the
> scaling __slowly__ and identify the sweet spot. IMHO, you really may be
> better off pulling the data down from S3 onto local storage if you'll be
> accessing it with multiple queries. Ephemeral storage on Amazon EC2
> instances is fairly cheap if you're only talking about a few TB.
>
> -- David
>
> On Jul 25, 2015, at 7:16 AM, Hafiz Mujadid <[email protected]> wrote:
>
> > Thanks a lot Stefan :)
> >
> > On Sat, Jul 25, 2015 at 2:58 PM, Stefán Baxter <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I'm pretty new around here but let me attempt to answer you.
> >>
> >> - Parquet will always be (a lot) faster than CSV, especially if you're
> >>   querying only a subset of the columns in the CSV.
> >>   - Parquet has various compression techniques and is more "scan
> >>     friendly" (optimized for scanning compressed data).
> >>
> >> - The optimal file size is linked to the file-system segment sizes (I'm
> >>   not sure how that affects S3) and block sizes.
> >>   - Have a look at this:
> >>     http://ingest.tips/2015/01/31/parquet-row-group-size/
> >>
> >> - Read up on the partitioning of Parquet files that Drill supports; it
> >>   can improve your performance quite a bit (see the partition-pruning
> >>   sketch at the end of this thread).
> >>   - Partitioning helps you filter data efficiently and prevents
> >>     scanning of data not relevant to your query.
> >>
> >> - Spend a little bit of time planning how you will map your CSV to
> >>   Parquet, to make sure columns are imported as the appropriate data
> >>   type (see the CTAS sketch at the end of this thread).
> >>   - This matters for compression and efficiency (storing numbers as
> >>     strings, for example, will prevent Parquet from doing some of its
> >>     optimization magic).
> >>   - See this:
> >>     http://www.slideshare.net/julienledem/th-210pledem?next_slideshow=2
> >>     (or some of the other presentations on Parquet)
> >>
> >> - Optimize your drillbits (Drill machines) so they are sharing the
> >>   workload.
> >>
> >> - Get to know S3 best practices:
> >>   - https://www.youtube.com/watch?v=_FHRzq7eHQc
> >>   - https://aws.amazon.com/articles/1904
> >>
> >> Hope this helps,
> >> -Stefan
> >>
> >> On Sat, Jul 25, 2015 at 9:08 AM, Hafiz Mujadid <[email protected]> wrote:
> >>
> >>> Hi!
> >>>
> >>> I have terabytes of data on S3 and I want to query this data using
> >>> Drill. I want to know which data format gives the best performance
> >>> with Drill: will CSV be best, or Parquet? Also, what should the file
> >>> size be? Will small files or large files be more appropriate for
> >>> Drill?
> >>>
> >>> Thanks
> >>>
> >
> > --
> > Regards: HAFIZ MUJADID
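
To make the CSV-to-Parquet mapping concrete, here is a minimal CTAS sketch. The workspace (dfs.tmp), paths, column positions and names are all hypothetical and need adjusting to your own storage plugin setup; store.format and store.parquet.block-size are standard Drill session options, and CSV columns arrive in Drill as the columns[] array, so the CASTs are what give Parquet real types instead of strings:

    -- write Parquet output; row-group/block size set to 256 MB here
    ALTER SESSION SET `store.format` = 'parquet';
    ALTER SESSION SET `store.parquet.block-size` = 268435456;

    -- CAST each CSV column so Parquet stores typed values, not strings
    CREATE TABLE dfs.tmp.`events_parquet` AS
    SELECT
      CAST(columns[0] AS BIGINT) AS event_id,
      CAST(columns[1] AS DATE)   AS event_date,
      CAST(columns[2] AS DOUBLE) AS amount,
      columns[3]                 AS country
    FROM dfs.`/data/raw/events.csv`;

Larger block sizes mean fewer, bigger row groups, which lines up with David's advice to favor fewer, larger files.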
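And a sketch of the directory-based partition pruning Stefan describes, assuming the Parquet files have been laid out by year and month (again, the layout and column names are hypothetical). Drill exposes the directory levels as dir0, dir1, ..., and filtering on them keeps Drill from scanning (or pulling down from S3) files outside the requested range:

    -- assumed layout: /data/events/2015/07/part-000.parquet, ...
    SELECT event_date, SUM(amount) AS total
    FROM dfs.`/data/events`
    WHERE dir0 = '2015' AND dir1 = '07'   -- only the 2015/07 directory is read
    GROUP BY event_date;

Newer Drill releases can also create such a layout for you with a PARTITION BY clause on CTAS, but the dir0/dir1 pruning above works with a plain directory structure.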
