Thanks David and Stefan :)

On Sun, Jul 26, 2015 at 12:40 AM, Stefán Baxter <[email protected]> wrote:
> Hi Hafiz,
>
> We are trying to discover if the Tachyon project
> (http://tachyon-project.org/) can be used as a bridge between the two
> worlds (a local S3 cache with some built-in intelligence).
>
> Others may know of alternative "S3 sweeteners".
>
> Regards,
>  -Stefan
>
>
> On Sat, Jul 25, 2015 at 7:32 PM, David Tucker <[email protected]> wrote:
>
> > One more thing to remember ... S3 is an object store, not a file system
> > in the traditional sense. That means that when a drillbit accesses a
> > file from S3, the whole thing is transferred ... whether it's 100 bytes
> > or 100 megabytes. The advantages of the Parquet format are far more
> > obvious in a file-system environment, where basic operations like lseek
> > and partial file reads are supported.
> >
> > Stefan is absolutely correct that you're still better off with Parquet
> > files ... if only because the absolute volume of data you'll be pulling
> > in from S3 will be reduced. In terms of file size, you'll likely see
> > better performance with larger files rather than smaller ones (thousands
> > of GB-sized files for your TB of data rather than millions of MB-sized
> > files). This will definitely be a balancing act; you'll want to test the
> > scaling __slowly__ and identify the sweet spot. IMHO, you really may be
> > better off pulling the data down from S3 onto local storage if you'll be
> > accessing it with multiple queries. Ephemeral storage on Amazon EC2
> > instances is fairly cheap if you're only talking about a few TB.
> >
> > -- David
> >
> > On Jul 25, 2015, at 7:16 AM, Hafiz Mujadid <[email protected]> wrote:
> >
> > > Thanks a lot Stefan :)
> > >
> > > On Sat, Jul 25, 2015 at 2:58 PM, Stefán Baxter <[email protected]> wrote:
> > >
> > >> Hi,
> > >>
> > >> I'm pretty new around here but let me attempt to answer you.
> > >>
> > >> - Parquet will always be (a lot) faster than CSV, especially if you're
> > >>   querying only a part of the columns in the CSV
> > >>   - Parquet has various compression techniques and is more "scan
> > >>     friendly" (optimized for scanning compressed data)
> > >>
> > >> - The optimal file size is linked to the fs segment sizes (I'm not
> > >>   sure how that affects S3) and block sizes
> > >>   - have a look at this:
> > >>     http://ingest.tips/2015/01/31/parquet-row-group-size/
> > >>     (a sketch of tuning Drill's Parquet block size is appended after
> > >>     this thread)
> > >>
> > >> - Read up on the partitioning of Parquet files that is supported by
> > >>   Drill and can improve your performance quite a bit
> > >>   - partitioning lets you filter data efficiently and prevents
> > >>     scanning of data not relevant to your query
> > >>
> > >> - Spend a little bit of time planning how you will map your CSV to
> > >>   Parquet to make sure columns are imported as the appropriate data
> > >>   type (a CTAS sketch is appended after this thread)
> > >>   - this matters for compression and efficiency (storing numbers as
> > >>     strings, for example, will prevent Parquet from doing some of its
> > >>     optimization magic)
> > >>   - see this:
> > >>     http://www.slideshare.net/julienledem/th-210pledem?next_slideshow=2
> > >>     (or some of the other presentations on Parquet)
> > >>
> > >> - Optimize your drillbits (Drill machines) so they are sharing the
> > >>   workload
> > >>
> > >> - Get to know S3 best practices
> > >>   - https://www.youtube.com/watch?v=_FHRzq7eHQc
> > >>   - https://aws.amazon.com/articles/1904
> > >>
> > >> Hope this helps,
> > >> -Stefan
> > >>
> > >> On Sat, Jul 25, 2015 at 9:08 AM, Hafiz Mujadid <[email protected]> wrote:
> > >>
> > >>> Hi!
> > >>>
> > >>> I have terabytes of data on S3 and I want to query this data using
> > >>> Drill. I want to know which data format gives the best performance
> > >>> with Drill: will CSV or Parquet be better? Also, what should the file
> > >>> size be? Are small files or large files more appropriate for Drill?
> > >>>
> > >>> Thanks
> > >>
> > >
> > > --
> > > Regards: HAFIZ MUJADID
> >


--
Regards: HAFIZ MUJADID
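[Editor's note: the CTAS sketch referenced in Stefán's bullets above. This is
only a minimal illustration of converting CSV on S3 into typed, partitioned
Parquet with Drill, not anything from the thread itself. The storage plugin
name "s3", the "raw"/"parquet" workspaces, the "events" table, and its three
columns are all hypothetical; CTAS with PARTITION BY assumes Drill 1.1 or
later, and the CSV workspace is assumed to read headerless .csv files via
Drill's columns[] array.]

    -- Write CTAS output as Parquet (this is already Drill's default format).
    ALTER SESSION SET `store.format` = 'parquet';

    -- Hypothetical: read headerless CSV from the s3.raw workspace, cast each
    -- column to a real type, and write partitioned Parquet to s3.parquet.
    CREATE TABLE s3.parquet.`events`
    PARTITION BY (event_date)
    AS
    SELECT
      CAST(columns[0] AS BIGINT) AS event_id,   -- store numbers as numbers,
      CAST(columns[1] AS DOUBLE) AS amount,     -- not strings, so Parquet can
      CAST(columns[2] AS DATE)   AS event_date  -- encode them efficiently
                                                -- (assumes yyyy-MM-dd dates)
    FROM s3.raw.`events`;

    -- Queries that select only the columns they need and filter on the
    -- partition column scan far less data than a full CSV scan would.
    SELECT event_id, amount
    FROM s3.parquet.`events`
    WHERE event_date = DATE '2015-07-01';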
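[Editor's note: the block-size sketch referenced above, for experimenting with
the file-size sweet spot David describes. The size of the Parquet row groups
Drill's CTAS writer produces is governed by a session option; the value below
is just an arbitrary starting point for testing, and the 512 MB default should
be verified against your Drill version.]

    -- Value is in bytes; Drill's documented default is 536870912 (512 MB).
    ALTER SESSION SET `store.parquet.block-size` = 268435456;  -- try 256 MB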
