Yadid - Some answers (may not be complete):

From what I understand, Drill enables concurrency by queuing requests. If
we are performing many reads, will writes to the same file be queued until
completion of the reads? This could potentially create a bottleneck.
- There are concurrency options, and I would suggest starting here to learn
more about the options available:
https://drill.apache.org/docs/configuring-a-multitenant-cluster-introduction/.
The reads and writes will likely be less of a contention point in Drill and
more so in HDFS. (I could be wrong here, but a CTAS (insert) is just a
query that will be queued.)

How does Drill manage Parquet file partitioning when using CTAS? Can we
control horizontal / vertical partitioning in some way by configuring the
drillbit?
- There are multiple options. One thing I use is directory-based
partitioning, similar to Apache Hive. Basically, if you query a directory,
Drill exposes its subdirectories through the implicit column "dir0".
So:

table
--- 2015-01-01
--- 2015-01-02
--- 2015-01-03

All the dates there are just directories under "table", so I could do
select * from table where dir0 >= '2015-01-02'; and it would exclude the
first directory.
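
For a slightly more concrete sketch, assuming the layout above lives under
an HDFS path like /data/table (the path is just a placeholder), the pruning
query would look roughly like:

SELECT *
FROM dfs.`/data/table`
WHERE dir0 >= '2015-01-02';

With that filter the 2015-01-01 directory is excluded, as described above.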

This is nice because you can create views over the table to turn "dir0"
into a field name like "part_date". (If you are using the 2015-01-01 format
for dates, fun tip: use to_date(dir0, 'yyyy-MM-dd') as part_date in your
view to get better performance.)
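
A rough sketch of such a view (dfs.data is a hypothetical writable
workspace, and /data/table is a placeholder for wherever the table lives):

CREATE OR REPLACE VIEW dfs.data.table_by_date AS
SELECT to_date(dir0, 'yyyy-MM-dd') AS part_date, t.*
FROM dfs.`/data/table` t;

Queries can then filter on the friendlier column:

SELECT * FROM dfs.data.table_by_date
WHERE part_date >= date '2015-01-02';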
This also plays nicely with CTAS: while you query the table as one table
with subdirectories, you can CTAS into the individual partitions with
CREATE TABLE `table/2015-01-04` as select field1, field2 from src_data.
That way it loads just that individual partition.
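
Spelled out against a writable workspace (dfs.data again is just a
stand-in, assumed to be rooted at /data):

USE dfs.data;
CREATE TABLE `table/2015-01-04` AS
SELECT field1, field2
FROM src_data;

The slash in the backticked name is what drops the new Parquet files into
that one partition subdirectory.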

Now, one challenge may be data that is being written while it is being
queried. One hack I've used is to CREATE TABLE `table/.2015-01-04` (note
the dot before the date). This puts it into a hidden directory, so while it
is being written the partition will not be read by other queries. Once the
CTAS completes successfully, I just do a mv of .2015-01-04 to 2015-01-04,
which is instant.
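
As a sketch of that workflow (same hypothetical dfs.data workspace; the
rename happens outside Drill, e.g. with hdfs dfs -mv):

USE dfs.data;
CREATE TABLE `table/.2015-01-04` AS
SELECT field1, field2
FROM src_data;

Then, once the CTAS succeeds, from the shell:

hdfs dfs -mv /data/table/.2015-01-04 /data/table/2015-01-04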

There was some talk about filing a JIRA for a similar feature. Not sure if
I have done that yet :)

Any alternative suggestions to the approach above? In terms of read
performance, would this result in better performance (for columnar-type
data) than something like HBase?
- Without sounding too dismissive of the question, it really depends on
your data. HBase is a nice option for certain data when Parquet does not
seem to fit the bill. That said, I think with good partitioning, and with
other improvements coming, Parquet can do quite well.

John



On Thu, Mar 10, 2016 at 12:26 PM, Yadid Ayzenberg <[email protected]>
wrote:

> Hi All,
>
> We are considering using Drill to access data for large-scale analytics
> on top of Parquet files stored on HDFS.
> We would like to add data to this data set in real time, as it arrives
> into our system. One proposed solution was to use Drill to perform both
> the inserts and the selects on our data set.
>
> Some questions that arose:
>
> From what I understand, Drill enables concurrency by queuing requests.
> If we are performing many reads, will writes to the same file be queued
> until completion of the reads? This could potentially create a bottleneck.
> How does Drill manage Parquet file partitioning when using CTAS? Can we
> control horizontal / vertical partitioning in some way by configuring the
> drillbit?
> Any alternative suggestions to the approach above? In terms of read
> performance, would this result in better performance (for columnar-type
> data) than something like HBase?
> Thanks,
>
> Yadid
