Drill loses locality information on anything but an HDFS-oriented file
system.  That might be part of what you are observing.  Having pre-split
files should still allow parallelism, though, since each Drillbit can scan
a separate file.
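As a rough sketch of what I mean by pre-splitting (paths, sizes, and the
`trades` name are all illustrative, and this assumes GNU coreutils split):

```shell
# Stand-in for a large CSV; in practice this would be your real data.
mkdir -p /tmp/drill_demo/trades
seq 1 100 | sed 's/^/row,/' > /tmp/drill_demo/big.csv

# Split into fixed-size chunks. --additional-suffix keeps the .csv
# extension that Drill's text format plugin matches on.
split -l 25 --additional-suffix=.csv \
    /tmp/drill_demo/big.csv /tmp/drill_demo/trades/part_

ls /tmp/drill_demo/trades
```

Drill can then treat the directory as a single table, e.g.
SELECT COUNT(*) FROM dfs.`/tmp/drill_demo/trades`, and fan the per-file
scans out across Drillbits.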

Can you describe your experiments in more detail?

Also, what specifically do you mean by CFS and GFS?  Ceph and Gluster?

It might help you to check out the MapR Community Edition.  That would
give you a more standard view of a shared file system, since it allows
distributed NFS service.  You also wouldn't have to worry about the
implications of having an object store under your file system, as with
Ceph.  Instead, the cluster (made up of whatever machines you have) would
present as a *very* standard file system, with the exception of locking.
This would have the side effect of letting you experiment on the same data
through both kinds of API (NFS and HDFS) to check for differences.

On Wed, May 20, 2015 at 1:25 PM, Yousef Lasi <yousef.l...@gmail.com> wrote:

> It appears that we will be implementing Drill before our Hadoop
> infrastructure is ready for production. A question that's come up related
> to deploying Drill on clustered
>  Linux hosts (i.e. hosts with a shared file system but no HDFS) is whether
> Drill parallelization can take advantage of multiple drill bits in this
> scenario.
>
>  Should we expect Drill to auto-split large CSV files and read/sort them
> in parallel? That does not appear to happen in our testing. We've had to
> manually partition large files into sets of files stored in a shared folder.
>
>  Is there any value to having multiple drill bits with access to the same
> shared files in CFS/GFS?
>
>  Thanks
>