Hi Hafiz,

We are trying to find out whether the Tachyon project (
http://tachyon-project.org/) can be used as a bridge between the two worlds
(a local S3 cache with some built-in intelligence).

Others may know of alternative "S3 sweeteners".

Regards,
 -Stefan



On Sat, Jul 25, 2015 at 7:32 PM, David Tucker <[email protected]> wrote:

> One more thing to remember ... S3 is an object store, not a file system in
> the traditional sense.   That means that when a drill-bit accesses a file
> from S3, the whole thing is transferred ... whether it's 100 bytes or 100
> megabytes.   The advantages of the parquet format are far more obvious in a
> file-system environment, where basic operations like lseek and partial file
> reads are supported.
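>
> As a rough illustration of that difference (a minimal Python sketch, not how
> Drill itself reads the data; the bucket, key and column names are made up):
>
> import boto3
> import pyarrow.parquet as pq
>
> # From S3 the whole object comes down, however little of it we actually need.
> s3 = boto3.client("s3")
> s3.download_file("my-bucket", "data/events.parquet", "/tmp/events.parquet")
>
> # On a local file system, Parquet lets us seek and read only selected columns.
> table = pq.read_table("/tmp/events.parquet", columns=["user_id", "ts"])
> print(table.num_rows)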
>
> Stefan is absolutely correct that you're still better off with parquet
> files ... if only because the absolute volume of data you'll be pulling in
> from S3 will be reduced.   In terms of file size, you'll likely see better
> performance with larger files rather than smaller ones (thousands of
> GB-sized files for your TB of data rather than millions of  MB-sized
> files).   This will definitely be a balancing act; you'll want to test the
> scaling __slowly__ and identify the sweet spot.   IMHO, you really may be
> better off pulling the data down from S3 onto local storage if you'll be
> accessing it with multiple queries.   Ephemeral storage on Amazon EC2
> instances is fairly cheap if you're only talking about a few TB.
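>
> If you do pull the data down first, something along these lines would mirror
> an S3 prefix onto instance storage before you point Drill at it (a rough
> boto3 sketch; the bucket, prefix and target directory are placeholders):
>
> import os
> import boto3
>
> s3 = boto3.client("s3")
> bucket, prefix, target = "my-bucket", "warehouse/events/", "/mnt/ephemeral/events"
>
> # Walk the prefix and copy every object onto local (ephemeral) storage.
> paginator = s3.get_paginator("list_objects_v2")
> for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
>     for obj in page.get("Contents", []):
>         if obj["Key"].endswith("/"):
>             continue  # skip directory markers
>         local_path = os.path.join(target, os.path.relpath(obj["Key"], prefix))
>         os.makedirs(os.path.dirname(local_path), exist_ok=True)
>         s3.download_file(bucket, obj["Key"], local_path)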
>
> -- David
>
> On Jul 25, 2015, at 7:16 AM, Hafiz Mujadid <[email protected]>
> wrote:
>
> > Thanks alot Stefan :)
> >
> > On Sat, Jul 25, 2015 at 2:58 PM, Stefán Baxter <
> [email protected]>
> > wrote:
> >
> >> Hi,
> >>
> >> I'm pretty new around here but let me attempt to answer you.
> >>
> >>   - Parquet will always be (a lot) faster than CSV, especially if you're
> >>   querying only a subset of the columns in the CSV
> >>   - Parquet has various compression techniques and is more "scan
> >>   friendly" (optimized for scanning compressed data)
> >>
> >>   - The optimal file size is linked to the file system segment and block
> >>   sizes (I'm not sure how that affects S3)
> >>   - have a look at this:
> >>   http://ingest.tips/2015/01/31/parquet-row-group-size/
> >>
> >>   - Read up on partitioning of Parquet files, which is supported by Drill
> >>   and can improve your performance quite a bit
> >>   - partitioning helps you filter data efficiently and prevents scanning
> >>   of data that is not relevant to your query
> >>
> >>   - Spend a little time planning how you will map your CSV to Parquet
> >>   so that columns are imported as the appropriate data types (see the
> >>   sketch after this list)
> >>   - this matters for compression and efficiency (storing numbers as
> >>   strings, for example, will prevent Parquet from doing some of its
> >>   optimization magic)
> >>   - See this:
> >>   http://www.slideshare.net/julienledem/th-210pledem?next_slideshow=2
> >>   (or some of the other presentations on Parquet)
> >>
> >>   - Optimize your drillbits (Drill machines) so they are sharing the
> >>   workload
> >>
> >>   - Get to know S3 best practices
> >>   - https://www.youtube.com/watch?v=_FHRzq7eHQc
> >>   - https://aws.amazon.com/articles/1904
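> >>
> >> For the CSV-to-Parquet step, a minimal pyarrow sketch (file names, column
> >> types and the partition column are only examples; the row group size is a
> >> row count and is something to tune per the article linked above):
> >>
> >> import pyarrow as pa
> >> import pyarrow.csv as pv
> >> import pyarrow.parquet as pq
> >>
> >> # Read the CSV with explicit column types so numbers are not stored as strings.
> >> table = pv.read_csv(
> >>     "events.csv",
> >>     convert_options=pv.ConvertOptions(
> >>         column_types={"user_id": pa.int64(), "amount": pa.float64()}
> >>     ),
> >> )
> >>
> >> # One compressed file with an explicit row group size (rows per group).
> >> pq.write_table(table, "events.parquet", compression="snappy", row_group_size=1000000)
> >>
> >> # Or a directory partitioned on a column, so Drill can skip partitions
> >> # that a query never touches.
> >> pq.write_to_dataset(table, root_path="events_parquet", partition_cols=["event_date"])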
> >>
> >> Hope this helps,
> >> -Stefan
> >>
> >> On Sat, Jul 25, 2015 at 9:08 AM, Hafiz Mujadid <
> [email protected]>
> >> wrote:
> >>
> >>> Hi!
> >>>
> >>> I have terabytes of data on S3 and I want to query this data using
> >>> Drill. I want to know which data format gives the best performance with
> >>> Drill: will CSV be best, or Parquet? Also, what should the file size be?
> >>> Are small files or large files more appropriate for Drill?
> >>>
> >>>
> >>> Thanks
> >>>
> >>
> >
> >
> >
> > --
> > Regards: HAFIZ MUJADID
>
>
