Re: Auto-splitting delimited files

Ted Dunning Thu, 21 May 2015 09:17:45 -0700

Can you publish the test queries and associated logical and physical plans?




On Thu, May 21, 2015 at 7:06 AM, Yousef Lasi <[email protected]> wrote:

> We do expect to use MapRFS at some point so data locality will be
> available to Drill once that happens. In the interim, we're trying to
> leverage Drill to pre-process large data sets. As an example, we're
> creating a view into a join across 4 large files (the largest of which is
> 20 GB). This join currently takes about 40 minutes on a single server using a
> local file system. By manually splitting the files, we gain some
> performance: the elapsed time drops to about 30 minutes.
>
> The part where we get a little lost is in understanding the optimization
> process. Based on the query plan, it appears that the majority of the time
> is spent on the hash joins. Logically, it would make sense that if we split
> the files into smaller chunks we would gain increasing efficiency. However,
> this doesn't appear to be the case as we're not really getting much
> improvement beyond the 30-minute range despite increasing parallelization
> by adding more drill bits and file partitions.
>
>
> May 21 2015 12:55 AM, "Ted Dunning" <[email protected]> wrote:
> > Drill loses locality information on anything but an HDFS-oriented file
> > system.  That might be part of what you are observing.  Having pre-split
> > files should allow parallelism.
> >
> > Can you describe your experiments in more detail?
> >
> > Also, what specifically do you mean by CFS and GFS?  Ceph and Gluster?
> >
> > It might help you if you check out the MapR community edition.  That would
> > give you a more standard view of a shared file system since it allows
> > distributed NFS service.  You also don't have to worry about the
> > implications of having an object store under your file system as with
> > Ceph.  Instead, the cluster (made up of any machines you have) would
> > present as a *very* standard file system with the exception of locking.
> > This would have the side effect of letting you experiment on the same data
> > from both kinds of API (NFS and HDFS) to check for differences.
> >
> > On Wed, May 20, 2015 at 1:25 PM, Yousef Lasi <[email protected]> wrote:
> >
> >> It appears that we will be implementing Drill before our Hadoop
> >> infrastructure is ready for production. A question that's come up related
> >> to deploying Drill on clustered Linux hosts (i.e. hosts with a shared file
> >> system but no HDFS) is whether Drill parallelization can take advantage of
> >> multiple drill bits in this scenario.
> >>
> >> Should we expect Drill to auto-split large CSV files and read/sort them
> >> in parallel? That does not appear to happen in our testing. We've had to
> >> manually partition large files into sets of files stored in a shared
> >> folder.
> >>
> >> Is there any value to having multiple drill bits with access to the same
> >> shared files in CFS/GFS?
> >>
> >> Thanks
>
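
The manual partitioning described in the thread (cutting one large CSV into a set of files in a shared folder) could be done along the following lines. This is only a minimal sketch, not what the poster actually ran: the function name `split_delimited_file`, its parameters, and the chunk naming scheme are hypothetical, and it assumes that no quoted field contains an embedded newline, so that every physical line is one record.

```python
import os


def split_delimited_file(src_path, out_dir, lines_per_chunk, has_header=False):
    """Split a delimited text file into chunk files on line boundaries.

    Hypothetical helper for illustration. Assumes one record per line
    (no newlines inside quoted fields). Returns the chunk paths written.
    """
    os.makedirs(out_dir, exist_ok=True)
    base, ext = os.path.splitext(os.path.basename(src_path))
    chunk_paths = []
    out = None
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", newline="") as src:
        # Optionally repeat the header line at the top of every chunk.
        header = src.readline() if has_header else ""
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:
                # Start a new chunk file every lines_per_chunk records.
                if out is not None:
                    out.close()
                name = f"{base}_{len(chunk_paths):04d}{ext}"
                path = os.path.join(out_dir, name)
                out = open(path, "w", newline="")
                out.write(header)
                chunk_paths.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths
```

Calling `split_delimited_file("big.csv", "shared/big_parts", 5_000_000)` (both paths are placeholders) would write `big_0000.csv`, `big_0001.csv`, and so on into one directory, which can then be queried as a single table. Splitting by line count rather than byte offset keeps every record whole, at the cost of reading the file sequentially once.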
