Thanks David and Stefan :)

On Sun, Jul 26, 2015 at 12:40 AM, Stefán Baxter <[email protected]> wrote:
> Hi Hafiz,
>
> We are trying to discover if the Tachyon project
> (http://tachyon-project.org/) can be used as a bridge between the two
> worlds (a local S3 cache with some built-in intelligence).
>
> Others may know of alternative "S3 sweeteners".
>
> Regards,
>  -Stefan
>
>
> On Sat, Jul 25, 2015 at 7:32 PM, David Tucker <[email protected]> wrote:
>
> > One more thing to remember ... S3 is an object store, not a file system
> > in the traditional sense. That means that when a drillbit accesses a
> > file from S3, the whole thing is transferred ... whether it's 100 bytes
> > or 100 megabytes. The advantages of the Parquet format are far more
> > obvious in a file-system environment, where basic operations like lseek
> > and partial file reads are supported.
> >
> > Stefan is absolutely correct that you're still better off with Parquet
> > files ... if only because the absolute volume of data you'll be pulling
> > in from S3 will be reduced. In terms of file size, you'll likely see
> > better performance with larger files rather than smaller ones (thousands
> > of GB-sized files for your TB of data rather than millions of MB-sized
> > files). This will definitely be a balancing act; you'll want to test the
> > scaling __slowly__ and identify the sweet spot. IMHO, you really may be
> > better off pulling the data down from S3 onto local storage if you'll be
> > accessing it with multiple queries. Ephemeral storage on Amazon EC2
> > instances is fairly cheap if you're only talking about a few TB.
> >
> > -- David
> >
> > On Jul 25, 2015, at 7:16 AM, Hafiz Mujadid <[email protected]> wrote:
> >
> > > Thanks a lot Stefan :)
> > >
> > > On Sat, Jul 25, 2015 at 2:58 PM, Stefán Baxter <[email protected]> wrote:
> > >
> > >> Hi,
> > >>
> > >> I'm pretty new around here but let me attempt to answer you.
> > >>
> > >> - Parquet will always be (a lot) faster than CSV, especially if you're
> > >>   querying only a part of the columns in the CSV
> > >>   - Parquet has various compression techniques and is more "scan
> > >>     friendly" (optimized for scanning compressed data)
> > >>
> > >> - The optimal file size is linked to the fs segment sizes (I'm not
> > >>   sure how that affects S3) and block sizes
> > >>   - have a look at this:
> > >>     http://ingest.tips/2015/01/31/parquet-row-group-size/
> > >>     (a sketch of tuning Drill's Parquet block size is appended after
> > >>     this thread)
> > >>
> > >> - Read up on the partitioning of Parquet files that is supported by
> > >>   Drill and can improve your performance quite a bit
> > >>   - partitioning lets you filter data efficiently and prevents
> > >>     scanning of data not relevant to your query
> > >>
> > >> - Spend a little bit of time planning how you will map your CSV to
> > >>   Parquet to make sure columns are imported as the appropriate data
> > >>   type (a CTAS sketch is appended after this thread)
> > >>   - this matters for compression and efficiency (storing numbers as
> > >>     strings, for example, will prevent Parquet from doing some of its
> > >>     optimization magic)
> > >>   - see this:
> > >>     http://www.slideshare.net/julienledem/th-210pledem?next_slideshow=2
> > >>     (or some of the other presentations on Parquet)
> > >>
> > >> - Optimize your drillbits (Drill machines) so they are sharing the
> > >>   workload
> > >>
> > >> - Get to know S3 best practices
> > >>   - https://www.youtube.com/watch?v=_FHRzq7eHQc
> > >>   - https://aws.amazon.com/articles/1904
> > >>
> > >> Hope this helps,
> > >> -Stefan
> > >>
> > >> On Sat, Jul 25, 2015 at 9:08 AM, Hafiz Mujadid <[email protected]> wrote:
> > >>
> > >>> Hi!
> > >>>
> > >>> I have terabytes of data on S3 and I want to query this data using
> > >>> Drill. I want to know which data format gives the best performance
> > >>> with Drill: will CSV or Parquet be better? Also, what should the file
> > >>> size be? Are small files or large files more appropriate for Drill?
> > >>>
> > >>> Thanks
> > >>
> > >
> > > --
> > > Regards: HAFIZ MUJADID
> >


--
Regards: HAFIZ MUJADID
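[Editor's note: the CTAS sketch referenced in Stefán's bullets above. This is
only a minimal illustration of converting CSV on S3 into typed, partitioned
Parquet with Drill, not anything from the thread itself. The storage plugin
name "s3", the "raw"/"parquet" workspaces, the "events" table, and its three
columns are all hypothetical; CTAS with PARTITION BY assumes Drill 1.1 or
later, and the CSV workspace is assumed to read headerless .csv files via
Drill's columns[] array.]

    -- Write CTAS output as Parquet (this is already Drill's default format).
    ALTER SESSION SET `store.format` = 'parquet';

    -- Hypothetical: read headerless CSV from the s3.raw workspace, cast each
    -- column to a real type, and write partitioned Parquet to s3.parquet.
    CREATE TABLE s3.parquet.`events`
    PARTITION BY (event_date)
    AS
    SELECT
      CAST(columns[0] AS BIGINT) AS event_id,   -- store numbers as numbers,
      CAST(columns[1] AS DOUBLE) AS amount,     -- not strings, so Parquet can
      CAST(columns[2] AS DATE)   AS event_date  -- encode them efficiently
                                                -- (assumes yyyy-MM-dd dates)
    FROM s3.raw.`events`;

    -- Queries that select only the columns they need and filter on the
    -- partition column scan far less data than a full CSV scan would.
    SELECT event_id, amount
    FROM s3.parquet.`events`
    WHERE event_date = DATE '2015-07-01';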
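[Editor's note: the block-size sketch referenced above, for experimenting with
the file-size sweet spot David describes. The size of the Parquet row groups
Drill's CTAS writer produces is governed by a session option; the value below
is just an arbitrary starting point for testing, and the 512 MB default should
be verified against your Drill version.]

    -- Value is in bytes; Drill's documented default is 536870912 (512 MB).
    ALTER SESSION SET `store.parquet.block-size` = 268435456;  -- try 256 MB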
