I almost wrote that each node needs access to a common namespace, but I decided to answer the question more in line with how it was originally asked. As Parth confirmed, your point is valid: if you are okay with reading all data over the network, NFS is definitely an option, since it just looks like part of the local disk but is guaranteed to be available on all of the machines as long as they mount it at the same path.
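One way to sanity-check that same-path requirement is to fingerprint the file listing on each node and compare the digests. This is a minimal sketch in Python, not anything shipped with Drill, and the mount path below is just an example:

import hashlib
import os

def listing_fingerprint(root):
    """Hash the relative paths and sizes of every file under root.

    Run this on each node against the shared mount point; if the
    digests differ, the nodes are not seeing the same files.
    """
    digest = hashlib.sha256()
    for dirpath, _, filenames in sorted(os.walk(root)):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            digest.update(f"{rel}:{os.path.getsize(path)}\n".encode())
    return digest.hexdigest()

# Compare this value across nodes (the path is hypothetical).
print(listing_fingerprint("/mnt/nfs/drill-data"))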
I will, however, say that I would lean toward doing what this JIRA suggests: disabling the accidental behavior where the non-NFS case happens to work if you have a series of machines with the same files (or filenames) on all of their local disks. It's just too fragile and will likely produce false assumptions. I do think the web log querying use case I described, which would require some core enhancements, should be considered a strong potential use case for Drill.

On Wed, Jul 29, 2015 at 10:13 PM, Parth Chandra <[email protected]> wrote:

> Yes, that would work too, though if there are inconsistencies in the
> copies of files made, then the results would be unreliable.
>
> Parth
>
> On Wed, Jul 29, 2015 at 6:45 PM, Adam Gilmore <[email protected]> wrote:
>
> > Just to clarify this, Jason - you don't necessarily need HDFS or the
> > like for this. If you had, say, an NFS volume (for example, Amazon
> > Elastic File System), you could still accomplish it, right? Or merely
> > if you had all files duplicated on every node locally.
> >
> > On Thu, Jul 30, 2015 at 10:00 AM, Jason Altekruse <[email protected]> wrote:
> >
> > > Put a little more simply, the node that we end up planning the query
> > > on is going to enumerate the files we will be reading in the query
> > > so that we can assign work to given nodes. This currently assumes we
> > > will know at planning time (on the single node) all of the files to
> > > be read. This happens to work in a single-node setup, because all of
> > > the work will be done on the single machine against one filesystem
> > > (the local fs). In the distributed case we currently require that
> > > each node have a connection to a DFS.
> > >
> > > There is an outstanding feature request to support a use case like
> > > querying a series of server logs, where each machine may have a
> > > different number of log files. We will need to modify the planning
> > > process to allow for a more flexible scan description that lets us
> > > enumerate the files on each machine separately when we go to
> > > actually read them.
> > >
> > > This JIRA discusses the issue you are facing in more detail; I
> > > believe we should have one outstanding for the feature request as
> > > well. I will try to take a look around for it and open one if I
> > > can't find it soon.
> > >
> > > https://issues.apache.org/jira/browse/DRILL-3230
> > >
> > > On Wed, Jul 29, 2015 at 4:14 PM, Kristine Hahn <[email protected]> wrote:
> > >
> > > > Yes, you need a distributed file system to take advantage of
> > > > Drill's query planning. If you use multiple Drillbits and do not
> > > > use a distributed file system, the consistency of the fragment
> > > > information cannot be maintained.
> > > >
> > > > Kristine Hahn
> > > > Sr. Technical Writer
> > > > 415-497-8107 @krishahn skype:krishahn
> > > >
> > > > On Wed, Jul 29, 2015 at 4:37 AM, Geercken, Uwe <[email protected]> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > If I have a list of partitioned Parquet files on the filesystem
> > > > > and two drillbits with access to the filesystem, and I query the
> > > > > data using the column I partitioned on in the where clause of
> > > > > the query, will both drillbits share the work?
> > > > >
> > > > > Or do I need a distributed filesystem such as Hadoop underneath
> > > > > to make the bits work in parallel (or work together)?
> > > > >
> > > > > Tks.
> > > > >
> > > > > Uwe
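To make the planning discussion above concrete: today the node that plans the query enumerates every file up front and hands the work out to workers, which is only safe when every worker can open the same paths; the feature request would instead let each node enumerate its own files at read time. Here is a toy sketch of the two models in Python, with no relation to Drill's actual planner code (all names and paths are hypothetical):

import glob

def plan_time_assignment(files, workers):
    """Current model: the planning node lists all files up front and
    round-robins them across workers. Every worker must be able to
    open any path it is handed, hence the DFS (or same-path NFS)
    requirement."""
    assignments = {w: [] for w in workers}
    for i, f in enumerate(files):
        assignments[workers[i % len(workers)]].append(f)
    return assignments

def read_time_assignment(workers, list_local_files):
    """Requested model: each worker enumerates its own local files
    when the scan actually runs, so the planner never needs a global
    view of the filesystem."""
    return {w: list_local_files(w) for w in workers}

workers = ["drillbit-1", "drillbit-2"]
print(plan_time_assignment(["a.parquet", "b.parquet", "c.parquet"], workers))
# In the deferred model, each node would run the glob on its own disk.
print(read_time_assignment(workers, lambda w: glob.glob("/var/log/app/*.log")))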
