While the format is columnar and we do take advantage of certain aspects of the layout, we do not split the read between columns; instead we split by Parquet's block abstraction, the row group. Each of these blocks contains data from every column, forming a complete set of rows.
If you want more parallelism when reading Parquet files, you will need to generate them with smaller blocks. If you are using Drill to generate them, you can do this by setting `store.parquet.block-size` to a smaller value.

That said, there are some other important considerations that I think will impact your use case. First, you want to align the data as closely as possible to the filesystem block size, or make the row groups small enough that two of them fit entirely within one block (unless your FS blocks are quite large, this is unlikely to be the right choice). This eliminates the risk that a row group spans two blocks, in which case you risk reading a block twice.

With all of this in mind, I think there may be a further expectation you have of Drill that we do not meet today. This statement from your first e-mail gives me that impression:

> However, running a 'select *' query against the same record using the same criteria is somewhat slow

Today in Drill, if you run a select * query we will read all of the data in the table unless the storage system supports filter pushdown, and unfortunately that is not yet implemented in the Parquet reader. If you select all of the columns with a filter, we will read all of the data and send it through a downstream filter operation. The main optimization available for Parquet today is project pushdown: as you have seen, we will read only a subset of the columns if you request a subset.

We do have a set of workarounds for this limitation, namely support for partitioning and partition pruning. Previously this had to be done by manually partitioning the data into folders (where each folder contained all of the data for one partition value) and then running queries with a filter on the directory columns we expose: dir0, dir1, dir2, etc. In the 1.1 release we introduced auto-partitioning, which simplifies this significantly. For any columns that you are likely to filter on, and that contain a reasonably small number of unique values, you can specify a list of columns to partition your data on. Drill will automatically write out separate files for each partition, and when you run queries filtering on the partition columns, we will plan reads of only the necessary files. Read here for more info: https://drill.apache.org/docs/partition-by-clause/
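For example, a minimal sketch of shrinking the row group size with Drill itself; the 128 MB value, workspace, and file paths below are only illustrative, not from this thread:

```sql
-- Use a smaller Parquet block (row group) size for files written in this session;
-- 134217728 bytes = 128 MB, chosen here purely as an example.
ALTER SESSION SET `store.parquet.block-size` = 134217728;

-- Rewrite an existing table so the new files contain more, smaller row groups,
-- giving the scan more units of work to parallelize over.
CREATE TABLE dfs.tmp.`events_smaller_rg` AS
SELECT * FROM dfs.`/data/events`;
```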
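And a rough sketch of the two partitioning approaches; the table, paths, and the `event_date` column are hypothetical placeholders:

```sql
-- Manual approach: data laid out as /data/events/2015/07/..., filtered through
-- the implicit directory columns dir0, dir1, ...
SELECT *
FROM dfs.`/data/events`
WHERE dir0 = '2015' AND dir1 = '07';

-- 1.1 auto-partitioning: write the data partitioned on a low-cardinality column
-- you expect to filter on.
CREATE TABLE dfs.tmp.`events_by_date`
PARTITION BY (event_date) AS
SELECT * FROM dfs.`/data/events`;

-- Filters on the partition column can then be pruned to only the matching files.
SELECT *
FROM dfs.tmp.`events_by_date`
WHERE event_date = DATE '2015-07-01';
```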
On Tue, Jul 7, 2015 at 8:44 AM, Yousef Lasi <[email protected]> wrote:

> We are currently running (testing) with Veritas CFS (attached to EMC SAN
> storage) which is visible across 6 servers. We also have a single test MapR
> node, but that's a small sandbox. The production implementation will be
> with a 10 node HDFS cluster.
>
> The data files are 20 GB to 40 GB in size.
>
> July 7, 2015 11:34 AM, "Ted Dunning" <[email protected]> wrote:
>
> > No. A very simple model like that breaks down on many levels. The most
> > important level that reality intrudes in is the fact that your I/O
> > probably can't really be threaded so widely.
> >
> > What kind of storage are you using? How big is your data?
> >
> > Sent from my iPhone
> >
> >> On Jul 7, 2015, at 6:38, "Yousef Lasi" <[email protected]> wrote:
> >>
> >> Am I correct in assuming that it will allocate a thread for each core
> >> available to all drill bits and read x number of columns in parallel? So
> >> if we have 48 cores available and the file has 48 columns, then the time
> >> for the query for a single column should roughly equal the time for 48
> >> columns? All other factors, such as data types being the same, of course.
