While the format is columnar and we take advantage of certain aspects of
the layout, we do not split the read between columns, but instead by
Parquet's block abstraction, called row groups. Each row group contains a
chunk of every column, forming a complete set of rows.

If you want more parallelism when reading Parquet files, you will need to
generate them with smaller row groups. If you are using Drill to generate
them, you can do this by setting `store.parquet.block-size` to a smaller
value.
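For example, you could lower the block size for your session before writing
a table (the value shown here is illustrative, not a recommendation):

```sql
-- Set a smaller Parquet block (row group) size for this session.
-- The value is in bytes; 67108864 = 64 MB.
ALTER SESSION SET `store.parquet.block-size` = 67108864;
```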

That being said, there are some other important considerations here that I
think will impact your use case. First off, you want to align the row group
size as closely to the FS block size as possible, or make sure the row
groups are small enough that two whole row groups will fit in a block
(unless your FS blocks are quite large, this is unlikely to be the right
choice). This eliminates the risk that a row group spans two blocks, in
which case you risk double-reading a block.

All of this in mind, I think there might be a further expectation you have
of Drill that we are not providing today. This statement from your first
e-mail gives me this impression.

>However, running a 'select *' query against the same record using the same
>criteria is somewhat slow

Today in Drill, if you run a select * query, we will read all of the data
in the table, unless the storage system supports filter pushdown.
Unfortunately, this is not yet implemented in the Parquet reader. If you
select all of the columns with a filter, we will read all of the data and
send it through a downstream filter operation. The main optimization we
have for Parquet today is projection pushdown: as you have seen, we will
read only a subset of the columns if you request a subset.
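To illustrate (the table path and column names here are hypothetical):

```sql
-- Projection pushdown: only col_a and col_b are read from disk.
-- The WHERE clause is NOT pushed into the Parquet reader; it is
-- applied by a downstream filter operator after the read.
SELECT col_a, col_b
FROM dfs.`/data/events.parquet`
WHERE col_a = 'foo';
```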

We do have a set of workarounds for this limitation, namely support for
partitioning and partition pruning. Previously this had to be done by
manually partitioning the data into folders (where each folder contained
all of the data for one of the partition values) and then running a query
with a filter on the directory columns we expose (dir0, dir1, dir2, etc.).
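As a sketch, assuming a hypothetical layout of /data/logs/2014/...,
/data/logs/2015/..., etc., a filter on dir0 lets Drill prune whole
directories at plan time:

```sql
-- dir0 maps to the first directory level under the queried path,
-- so only files under /data/logs/2015 are read.
SELECT *
FROM dfs.`/data/logs`
WHERE dir0 = '2015';
```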

In the 1.1 release we introduced auto-partitioning, which simplifies this
significantly. You can specify a series of columns to partition your data
on; good candidates are columns you are likely to filter on that contain a
reasonably small number of unique values. Drill will automatically write
out separate files for each partition, and when you run queries filtering
on the partition columns, we will plan reads of only the necessary files.
Read here for more info:

https://drill.apache.org/docs/partition-by-clause/
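A minimal sketch of the 1.1 syntax, using hypothetical table and column
names (note the partition column must appear in the select list):

```sql
-- CTAS with auto-partitioning: Drill writes separate files for each
-- distinct value of yr, enabling pruning on later filtered queries.
CREATE TABLE dfs.tmp.`logs_by_year`
PARTITION BY (yr)
AS SELECT yr, col_a, col_b
FROM dfs.`/data/logs.parquet`;
```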



On Tue, Jul 7, 2015 at 8:44 AM, Yousef Lasi <[email protected]> wrote:

> We are currently running(testing) with Veritas CFS (attached to EMC SAN
> storage) which is visible across 6 servers. We also have a single test MapR
> node, but that's a small sandbox. The production implementation will be
> with a 10 node HDFS cluster
>
> The data files are 20 GB to 40 GB in size.
>
>
> July 7 2015 11:34 AM, "Ted Dunning" <[email protected]> wrote:
> > No.  A very simple model like that breaks down on many levels. The most
> important level that
> > reality intrudes in is the fact that your I/O probably can't really be
> threaded so widely.
> >
> > What kind of storage are you using? How big is your data?
> >
> > Sent from my iPhone
> >
> >> On Jul 7, 2015, at 6:38, "Yousef Lasi" <[email protected]> wrote:
> >>
> >> Am I correct in assuming that it will allocate a thread for each core
> available to all drill bits
> >> and read x numbers of columns in parallel? so if we have 48 cores
> available and the file has 48
> >> columns, then the time for the query for a single column should roughly
> equal the time for 48
> >> columns? All other factors, such as data types being the same of course.
>
