I've sent the full JSON profile of the query in a separate mail message.

May 21 2015 12:16 PM, "Ted Dunning" <[email protected]> wrote: 
> Can you publish the test queries and associated logical and physical plans?
> 
> On Thu, May 21, 2015 at 7:06 AM, Yousef Lasi <[email protected]> wrote:
> 
>> We do expect to use MapRFS at some point, so data locality will be
>> available to Drill once that happens. In the interim, we're trying to
>> leverage Drill to pre-process large data sets. As an example, we're
>> creating a view over a join across 4 large files (the largest of which is
>> 20 GB). This join currently takes about 40 minutes on a single server
>> using a local file system. By manually splitting the files, we gain some
>> performance: the elapsed time drops to ~30 minutes.
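>> 
>> For a concrete (if simplified) picture, the view is roughly of the shape
>> below -- the file paths and column positions are placeholders, not our
>> real schema:
>> 
>>   -- illustrative only; the real view selects many more columns
>>   CREATE VIEW dfs.tmp.combined AS
>>   SELECT a.columns[0] AS acct_id,
>>          t.columns[2] AS trade_amt
>>   FROM dfs.`/shared/data/accounts.csv`  a
>>   JOIN dfs.`/shared/data/trades.csv`    t ON t.columns[1] = a.columns[0]
>>   JOIN dfs.`/shared/data/positions.csv` p ON p.columns[1] = a.columns[0]
>>   JOIN dfs.`/shared/data/prices.csv`    x ON x.columns[0] = p.columns[2];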
>> 
>> The part where we get a little lost is in understanding the optimization
>> process. Based on the query plan, it appears that the majority of the time
>> is spent on the hash joins. Logically, it would make sense that splitting
>> the files into smaller chunks would yield increasing efficiency. However,
>> this doesn't appear to be the case: we're not getting much improvement
>> beyond the 30-minute range despite increasing parallelization by adding
>> additional drill bits and file partitions.
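>> 
>> Is this the kind of situation where the planner width options would come
>> into play? For illustration only (the values here are arbitrary, not a
>> recommendation):
>> 
>>   ALTER SESSION SET `planner.width.max_per_node` = 8;
>>   ALTER SESSION SET `planner.width.max_per_query` = 64;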
>> 
>> May 21 2015 12:55 AM, "Ted Dunning" <[email protected]> wrote:
>>> Drill loses locality information on anything but an HDFS-oriented file
>>> system.  That might be part of what you are observing.  Having pre-split
>>> files should allow parallelism.
>>> 
>>> Can you describe your experiments in more detail?
>>> 
>>> Also, what specifically do you mean by CFS and GFS?  Ceph and Gluster?
>>> 
>>> It might help you if you check out the MapR community edition.  That would
>>> give you a more standard view of a shared file system since it allows
>>> distributed NFS service.  You also don't have to worry about the
>>> implications of having an object store under your file system as with
>>> Ceph.  Instead, the cluster (made up of any machines you have) would
>>> present as a *very* standard file system with the exception of locking.
>>> This would have the side effect of letting you experiment on the same data
>>> from both kinds of API (NFS and HDFS) to check for differences.
>>> 
>>> On Wed, May 20, 2015 at 1:25 PM, Yousef Lasi <[email protected]> wrote:
>>> 
>>>> It appears that we will be implementing Drill before our Hadoop
>>>> infrastructure is ready for production. A question that's come up related
>>>> to deploying Drill on clustered
>>>> Linux hosts (i.e. hosts with a shared file system but no HDFS) is whether
>>>> Drill parallelization can take advantage of multiple drill bits in this
>>>> scenario.
>>>> 
>>>> Should we expect Drill to auto-split large CSV files and read/sort them
>>>> in parallel? That does not appear to happen in our testing. We've had to
>>>> manually partition large files into sets of files stored in a shared folder.
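>>>> 
>>>> For clarity, by "sets of files" we mean a directory of manually split
>>>> pieces that we then query as a single table -- the paths here are
>>>> illustrative, not our real layout:
>>>> 
>>>>   -- e.g. /shared/data/trades/part_000.csv, part_001.csv, ...
>>>>   SELECT COUNT(*) FROM dfs.`/shared/data/trades`;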
>>>> 
>>>> Is there any value to having multiple drill bits with access to the same
>>>> shared files in CFS/GFS?
>>>> 
>>>> Thanks
