I've sent the full JSON profile of the query in a separate mail message.
May 21 2015 12:16 PM, "Ted Dunning" <[email protected]> wrote:

> Can you publish the test queries and the associated logical and physical
> plans?
>
> On Thu, May 21, 2015 at 7:06 AM, Yousef Lasi <[email protected]> wrote:
>
>> We do expect to use MapRFS at some point, so data locality will be
>> available to Drill once that happens. In the interim, we're trying to
>> leverage Drill to pre-process large data sets. As an example, we're
>> creating a view over a join across 4 large files (the largest of which
>> is 20 GB). This join currently takes about 40 minutes on a single server
>> using a local file system. By manually splitting the files, we gain some
>> performance, as the elapsed time drops to ~30 minutes.
>>
>> The part where we get a little lost is in understanding the optimization
>> process. Based on the query plan, it appears that the majority of the
>> time is spent on the hash joins. Logically, it would make sense that
>> splitting the files into smaller chunks would yield increasing
>> efficiency. However, that doesn't appear to be the case: we're not
>> getting much improvement beyond the 30-minute range despite increasing
>> parallelization by adding additional drillbits and file partitions.
>>
>> May 21 2015 12:55 AM, "Ted Dunning" <[email protected]> wrote:
>>
>>> Drill loses locality information on anything but an HDFS-oriented file
>>> system. That might be part of what you are observing. Having pre-split
>>> files should allow parallelism.
>>>
>>> Can you describe your experiments in more detail?
>>>
>>> Also, what specifically do you mean by CFS and GFS? Ceph and Gluster?
>>>
>>> It might help you to check out the MapR community edition. That would
>>> give you a more standard view of a shared file system, since it allows
>>> distributed NFS service. You also don't have to worry about the
>>> implications of having an object store under your file system, as with
>>> Ceph. Instead, the cluster (made up of any machines you have) would
>>> present as a *very* standard file system, with the exception of
>>> locking. This would have the side effect of letting you experiment on
>>> the same data from both kinds of API (NFS and HDFS) to check for
>>> differences.
>>>
>>> On Wed, May 20, 2015 at 1:25 PM, Yousef Lasi <[email protected]>
>>> wrote:
>>>
>>>> It appears that we will be implementing Drill before our Hadoop
>>>> infrastructure is ready for production. A question that's come up
>>>> related to deploying Drill on clustered Linux hosts (i.e. hosts with a
>>>> shared file system but no HDFS) is whether Drill parallelization can
>>>> take advantage of multiple drillbits in this scenario.
>>>>
>>>> Should we expect Drill to auto-split large CSV files and read/sort
>>>> them in parallel? That does not appear to happen in our testing. We've
>>>> had to manually partition large files into sets of files stored in a
>>>> shared folder.
>>>>
>>>> Is there any value to having multiple drillbits with access to the
>>>> same shared files in CFS/GFS?
>>>>
>>>> Thanks
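For reference, the manual partitioning described in the thread can be sketched with GNU coreutils. This is only an illustration of the approach, not the poster's actual procedure; the file names, chunk size, and sample data are placeholders (a small generated file stands in for the real 20 GB input):

```shell
# Sketch only: pre-split a large CSV at line boundaries so each chunk is a
# well-formed file that a drillbit can scan independently.
set -e
src=big.csv
outdir=big_split

# Build a small stand-in file for illustration (header row + 1000 data rows).
printf 'id,value\n' > "$src"
for i in $(seq 1 1000); do echo "$i,row$i"; done >> "$src"

mkdir -p "$outdir"
head -n 1 "$src" > header.csv            # keep the header separate
# Split the data rows into chunks of at most 4 KB, never breaking a line.
tail -n +2 "$src" | split -C 4k -d - "$outdir/part_"
# Re-attach the header so every chunk is a complete CSV on its own.
for f in "$outdir"/part_*; do
  cat header.csv "$f" > "$f.csv" && rm "$f"
done
ls "$outdir"
```

For a real data set you would point `src` at the shared folder and use a much larger chunk size (e.g. `-C 256m`). Because each chunk carries its own header line, the chunks can be queried as a directory of ordinary CSV files.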
