Yadid - Some answers (may not be complete):
> From what I understand, Drill enables concurrency by queuing requests. If we are performing many
> reads, will writes to the same file be queued until completion of the reads? This potentially
> could create a bottleneck.

There are concurrency options, and I would suggest starting here to learn more about what is
available: https://drill.apache.org/docs/configuring-a-multitenant-cluster-introduction/. The reads
and writes will likely be less of a contention point in Drill and more so in HDFS. (I could be
wrong here, but a CTAS (insert) is just a query that will be queued.) There is a rough sketch of
the queue options further below.

> How does Drill manage parquet file partitioning when using CTAS? Can we control horizontal /
> vertical partitioning in some way by configuring the drillbit?

There are multiple options. One thing I use is directory-based partitioning, similar to Apache
Hive. Basically, if you have a directory, you can query it as the table and then reference its
subdirectories through the implicit column "dir0". So with a layout like:

    table
    --- 2015-01-01
    --- 2015-01-02
    --- 2015-01-03

all the dates are just directories under "table", and I could do

    select * from table where dir0 >= '2015-01-02';

and it would exclude the first directory. This is nice because you can create views of the table
to turn "dir0" into a field name like "part_date". (If you are using the 2015-01-01 format for
dates, fun tip: use to_date(dir0, 'yyyy-MM-dd') as part_date in your view to get better
performance.) There is a sketch of such a view further below.

This also plays nicely with CTAS, because while you look at the table as one table with
subdirectories, you can do CTAS into the individual partitions:

    CREATE TABLE `table/2015-01-04` as select field1, field2 from src_data;

That way it loads the individual partition.

Now, one challenge may be data that is being written while being queried. One hack I've used is to

    CREATE TABLE `table/.2015-01-04` as ...

(note the dot before the date). This puts it into a hidden directory, so while it is being
written, the partition will not be read by other queries. Once it completes successfully, I then
just do a

    mv .2015-01-04 2015-01-04

which is instant. There was some talk about filing a JIRA for a similar feature; not sure if I did
that yet :) A fuller sketch of this pattern is also further below.

> Any alternative suggestions to the approach above? In terms of read performance, would this
> result in better performance (for columnar type data) than by using something like HBase?

Without sounding too dismissive of the question, it really depends on your data. HBase is a nice
option for certain data when Parquet seems to not fit the bill. That said, I think with good
partitioning, and with other improvements coming, Parquet can do quite well.
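First, the sketch of the query-queue options I had in mind for the concurrency question. The
option names and values below are from memory and only illustrative, so double-check them against
the docs page linked above before relying on them:

    -- Illustrative only: enable Drill's query queues and cap concurrency.
    ALTER SYSTEM SET `exec.queue.enable` = true;
    ALTER SYSTEM SET `exec.queue.small` = 20;              -- max concurrent "small" queries
    ALTER SYSTEM SET `exec.queue.large` = 2;               -- max concurrent "large" queries
    ALTER SYSTEM SET `exec.queue.threshold` = 30000000;    -- plan-cost cutoff between small and large
    ALTER SYSTEM SET `exec.queue.timeout_millis` = 300000; -- how long a query waits in the queue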
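Next, a sketch of the dir0 view idea. The workspace names (dfs.tmp, the root dfs workspace), the
path /data/table, and field1/field2 are placeholders; swap in whatever matches your storage plugin
configuration:

    -- Expose the partition directory as a real date column.
    CREATE OR REPLACE VIEW dfs.tmp.`events_v` AS
    SELECT to_date(dir0, 'yyyy-MM-dd') AS part_date,
           field1,
           field2
    FROM dfs.`/data/table`;

    -- Queries through the view prune whole date directories.
    SELECT field1, field2
    FROM dfs.tmp.`events_v`
    WHERE part_date >= date '2015-01-02';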
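And the hidden-directory CTAS pattern end to end. Here dfs.data is a hypothetical writable
workspace whose location is the parent directory of the table, and /staging/src_data is a
placeholder for wherever the incoming data lands:

    -- Write the new partition into a dot-prefixed (hidden) directory so
    -- readers of the table never see a half-written partition.
    -- dfs.data = hypothetical writable workspace over the table's parent dir.
    CREATE TABLE dfs.data.`table/.2015-01-04` AS
    SELECT field1, field2
    FROM dfs.`/staging/src_data`;

    -- When the CTAS finishes, a rename makes the partition visible; an HDFS
    -- rename is effectively instant:
    --   hadoop fs -mv /path/to/table/.2015-01-04 /path/to/table/2015-01-04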
John

On Thu, Mar 10, 2016 at 12:26 PM, Yadid Ayzenberg <[email protected]> wrote:

> Hi All,
>
> We are considering using drill to access data for large scale analytics on
> top of parquet files stored on HDFS.
> We would like to add data to this data-set in real-time, as it arrives
> into our system. One proposed solution was to use drill to perform both the
> inserts and the selects on our data set.
>
> Some questions that arose:
>
> From what I understand, Drill enables concurrency by queuing requests. If
> we are performing many reads, will writes to the same file be queued until
> completion of the reads? This potentially could create a bottleneck.
> How does Drill manage parquet file partitioning, when using CTAS? Can we
> control horizontal / vertical partitioning in some way by configuring the
> drill bit?
> Any alternative suggestions to the approach above?
> In terms of read performance, would this result in better performance (for
> columnar type data) than by using something like HBase?
>
> Thanks,
>
> Yadid
