Thanks a lot, Stefan :)

On Sat, Jul 25, 2015 at 2:58 PM, Stefán Baxter <[email protected]> wrote:
> Hi,
>
> I'm pretty new around here, but let me attempt to answer you.
>
> - Parquet will always be (a lot) faster than CSV, especially if you're
>   querying only a subset of the columns in the CSV.
> - Parquet has various compression techniques and is more "scan
>   friendly" (optimized for scanning compressed data).
>
> - The optimal file size is linked to the filesystem segment and block
>   sizes (I'm not sure how that applies to S3).
> - Have a look at this:
>   http://ingest.tips/2015/01/31/parquet-row-group-size/
>
> - Read up on the partitioning of Parquet files that Drill supports; it
>   can improve your performance quite a bit.
> - Partitioning lets you filter data efficiently and prevents scanning
>   of data not relevant to your query.
>
> - Spend a little time planning how you will map your CSV to Parquet so
>   that columns are imported as the appropriate data types.
> - This matters for compression and efficiency (storing numbers as
>   strings, for example, will prevent Parquet from doing some
>   optimization magic).
> - See this:
>   http://www.slideshare.net/julienledem/th-210pledem?next_slideshow=2
>   (or some of the other presentations on Parquet)
>
> - Optimize your drillbits (Drill machines) so they share the workload.
>
> - Get to know S3 best practices:
>   - https://www.youtube.com/watch?v=_FHRzq7eHQc
>   - https://aws.amazon.com/articles/1904
>
> Hope this helps,
> -Stefan
>
> On Sat, Jul 25, 2015 at 9:08 AM, Hafiz Mujadid <[email protected]> wrote:
>
> > Hi!
> >
> > I have terabytes of data on S3 and I want to query this data using
> > Drill. I want to know which data format gives the best performance:
> > will CSV or Parquet be better? Also, what should the file size be?
> > Are small files or large files more appropriate for Drill?
> >
> > Thanks

--
Regards: HAFIZ MUJADID
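The CSV-to-Parquet conversion with explicit types that Stefán recommends could be sketched as a Drill CTAS statement along these lines. This is only an illustration, not a tested recipe: the storage-plugin paths, table name, and column names (`dfs.tmp.events_parquet`, `event_id`, etc.) are hypothetical, and the row-group size shown is just one plausible value to tune per the ingest.tips article above.

```sql
-- Hypothetical sketch: convert a headerless CSV into typed Parquet with Drill.
-- Drill exposes headerless CSV rows as a `columns` array; casting each field
-- to an explicit type lets Parquet apply its type-specific encodings.
ALTER SESSION SET `store.format` = 'parquet';
-- Optional: tune the Parquet row-group (block) size, in bytes.
ALTER SESSION SET `store.parquet.block-size` = 268435456;

CREATE TABLE dfs.tmp.`events_parquet` AS
SELECT
  CAST(columns[0] AS BIGINT)    AS event_id,
  CAST(columns[1] AS TIMESTAMP) AS event_time,
  CAST(columns[2] AS DOUBLE)    AS amount,
  columns[3]                    AS country
FROM dfs.`/data/events.csv`;
```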
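The partitioning point above could likewise be sketched with CTAS's `PARTITION BY` (supported in recent Drill releases); again the table and column names are made up, and this is a sketch of the idea rather than a verified setup:

```sql
-- Hypothetical sketch: write Parquet partitioned by a filter column so that
-- queries restricted to that column scan only the matching files
-- (partition pruning) instead of the whole dataset.
CREATE TABLE dfs.tmp.`events_by_country`
PARTITION BY (country) AS
SELECT event_id, event_time, amount, country
FROM dfs.tmp.`events_parquet`;

-- A query filtering on the partition column should touch only the
-- files for country = 'PK', not the whole table:
SELECT COUNT(*)
FROM dfs.tmp.`events_by_country`
WHERE country = 'PK';
```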
