Can you publish the test queries and associated logical and physical plans?
On Thu, May 21, 2015 at 7:06 AM, Yousef Lasi <[email protected]> wrote:

> We do expect to use MapRFS at some point so data locality will be
> available to Drill once that happens. In the interim, we're trying to
> leverage Drill to pre-process large data sets. As an example, we're
> creating a view into a join across 4 large files (the largest of which is
> 20 GB). This join currently takes about 40 minutes on single server using a
> local file system. By manually splitting the files, we gain some
> performance as the elapsed time drops down to ~ 30 minutes.
>
> The part where we get a little lost is in understanding the optimization
> process. Based on the query plan, it appears that the majority of the time
> is spent on the hash joins. Logically, it would make sense that if we split
> the files into smaller chunks we would gain increasing efficiency. However,
> this doesn't appear to be the case as we're not really getting much
> improvement beyond the 30 minute range despite increasing parallelization
> by adding additional drill bits and file partitions.
>
>
> May 21 2015 12:55 AM, "Ted Dunning" <[email protected]> wrote:
> > Drill loses locality information on anything but an HDFS oriented file
> > system. That might be part of what you are observing. Having pre-split
> > files should allow parallelism.
> >
> > Can you describe your experiments in more detail?
> >
> > Also, what specifically do you mean by CFS and GFS? Ceph and Gluster?
> >
> > It might help you if you check out the MapR community edition. That would
> > give you a more standard view of a shared file system since it allows
> > distributed NFS service. You also don't have to worry about the
> > implications of having an object store under your file system as with
> > Ceph. Instead, the cluster (made up of any machines you have) would
> > present as a *very* standard file system with the exception of locking.
> >
> > This would have the side effect of letting you experiment on the same data
> > from both kinds of API (NFS and HDFS) to check for differences.
> >
> > On Wed, May 20, 2015 at 1:25 PM, Yousef Lasi <[email protected]> wrote:
> >
> >> It appears that we will be implementing Drill before our Hadoop
> >> infrastructure is ready for production. A question that's come up related
> >> to deploying Drill on clustered
> >> Linux hosts (i.e. hosts with a shared file system but no HDFS) is whether
> >> Drill parallelization can take advantage of multiple drill bits in this
> >> scenario.
> >>
> >> Should we expect Drill to auto-split large CSV files and read/sort them
> >> in parallel? That does not appear to happen in our testing. We've had to
> >> manually partition large files into sets of files stored in a shared folder.
> >>
> >> Is there any value to having multiple drill bits with access to the same
> >> shared files in CFS/GFS?
> >>
> >> Thanks
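
For anyone following along, the manual partitioning mentioned in the thread (splitting one large CSV into a set of smaller files in a shared folder) might look roughly like the sketch below. This is only an illustration, not the poster's actual procedure: the function name `split_csv`, the chunk size, and the output file naming are all assumptions. It also splits purely on line boundaries, so it assumes no quoted field contains an embedded newline.

```python
import os


def split_csv(src_path, out_dir, lines_per_chunk=1_000_000, has_header=False):
    """Split src_path into numbered chunk files under out_dir.

    Hypothetical helper for illustration only. Returns the list of chunk
    paths in the order they were written. Assumes one record per line
    (no embedded newlines inside quoted fields).
    """
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(src_path))[0]
    paths = []
    out = None
    # newline="" keeps the original line endings untouched on read and write.
    with open(src_path, "r", newline="") as src:
        # If the file has a header row, repeat it at the top of every chunk.
        header = src.readline() if has_header else None
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:
                # Start a new chunk file every lines_per_chunk data lines.
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"{base}_{len(paths):05d}.csv")
                out = open(path, "w", newline="")
                if header is not None:
                    out.write(header)
                paths.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return paths
```

Something like `split_csv("big.csv", "/shared/big_parts", lines_per_chunk=5_000_000)` (paths hypothetical) would then leave a directory of chunk files that can be queried as one table by pointing the query at the directory rather than at a single file. As the thread notes, though, more partitions did not help beyond a certain point when the time was dominated by the hash joins.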
