One more thing to remember: S3 is an object store, not a file system in the traditional sense. That means that when a drillbit accesses a file from S3, typically the whole object is transferred, whether it's 100 bytes or 100 megabytes. The advantages of the Parquet format are far more obvious in a file-system environment, where basic operations like lseek and partial file reads are supported.
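To make that concrete, here's a minimal sketch in plain Python of the footer read a Parquet reader performs ("data.parquet" is just a placeholder path). On a real file system this costs two tiny seek-and-read operations; with a plain whole-object GET there is no way to skip the bytes in between:

    import struct

    # Parquet puts its metadata at the end of the file: the last 8 bytes
    # are a 4-byte little-endian footer length followed by the "PAR1" magic.
    with open("data.parquet", "rb") as f:
        f.seek(-8, 2)                          # lseek to the last 8 bytes
        footer_len = struct.unpack("<I", f.read(4))[0]
        assert f.read(4) == b"PAR1"            # magic that ends every Parquet file
        f.seek(-(8 + footer_len), 2)           # jump straight to the footer
        footer = f.read(footer_len)            # row-group/column metadata only

From there a reader can fetch just the row groups and columns a query touches, which is exactly the access pattern an object store makes expensive.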
Stefan is absolutely correct that you're still better off with Parquet files, if only because the absolute volume of data you'll be pulling in from S3 will be reduced. In terms of file size, you'll likely see better performance with larger files rather than smaller ones (thousands of GB-sized files for your terabytes of data rather than millions of MB-sized files). This will definitely be a balancing act; you'll want to test the scaling __slowly__ and identify the sweet spot. IMHO, you really may be better off pulling the data down from S3 onto local storage if you'll be accessing it with multiple queries. Ephemeral storage on Amazon EC2 instances is fairly cheap if you're only talking about a few TB.

-- David

On Jul 25, 2015, at 7:16 AM, Hafiz Mujadid <[email protected]> wrote:

> Thanks a lot, Stefan :)
>
> On Sat, Jul 25, 2015 at 2:58 PM, Stefán Baxter <[email protected]>
> wrote:
>
>> Hi,
>>
>> I'm pretty new around here, but let me attempt to answer you.
>>
>> - Parquet will always be (a lot) faster than CSV, especially if you're
>> querying only some of the columns in the CSV.
>>   - Parquet has various compression techniques and is more "scan
>> friendly" (optimized for scanning compressed data).
>>
>> - The optimal file size is linked to the file system's segment and block
>> sizes (I'm not sure how that applies to S3).
>>   - Have a look at this:
>> http://ingest.tips/2015/01/31/parquet-row-group-size/
>>
>> - Read up on partitioning of Parquet files, which is supported by Drill
>> and can improve your performance quite a bit.
>>   - Partitioning helps you filter data efficiently and prevents scanning
>> of data not relevant to your query.
>>
>> - Spend a little bit of time planning how you will map your CSV to
>> Parquet, to make sure columns are imported as the appropriate data types.
>>   - This matters for compression and efficiency (storing numbers as
>> strings, for example, will prevent Parquet from doing some of its
>> optimization magic).
>>   - See this:
>> http://www.slideshare.net/julienledem/th-210pledem?next_slideshow=2 (or
>> some of the other presentations on Parquet)
>>
>> - Optimize your drillbits (Drill machines) so they share the workload.
>>
>> - Get to know S3 best practices:
>>   - https://www.youtube.com/watch?v=_FHRzq7eHQc
>>   - https://aws.amazon.com/articles/1904
>>
>> Hope this helps,
>> -Stefan
>>
>> On Sat, Jul 25, 2015 at 9:08 AM, Hafiz Mujadid <[email protected]>
>> wrote:
>>
>>> Hi!
>>>
>>> I have terabytes of data on S3, and I want to query this data using
>>> Drill. I want to know which data format gives the best performance with
>>> Drill: CSV or Parquet? Also, what should the file size be? Are small
>>> files or large files more appropriate for Drill?
>>>
>>>
>>> Thanks
>>>
>
>
>
> --
> Regards: HAFIZ MUJADID
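P.S. As a footnote to Stefan's points about types and partitioning: here's a rough sketch, for illustration only, of preparing typed, partitioned Parquet outside of Drill with the pyarrow library (the file name, column names, and partition column are all hypothetical; Drill's own CTAS can do an equivalent conversion server-side):

    import pyarrow as pa
    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    # Declare column types up front so numbers don't land in Parquet as
    # strings, which would defeat its encodings and compression.
    table = pv.read_csv(
        "events.csv",                          # hypothetical input file
        convert_options=pv.ConvertOptions(column_types={
            "id": pa.int64(),
            "amount": pa.float64(),
            "event_date": pa.string(),
        }),
    )

    # Write one directory per partition value; Drill exposes such
    # directories to queries as dir0 and can prune the ones a filter
    # rules out.
    pq.write_to_dataset(
        table,
        root_path="events_parquet",
        partition_cols=["event_date"],
    )

Row-group size is also tunable at write time; the ingest.tips link above is a good guide to picking it.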
