Hi Drillers,

I have been meaning to share some thoughts on Drill for a long time, and
what I, or we at Activity Stream, believe would make Drill better (for us).
Please keep in mind that this is a one-sided view from a simple,
non-contributing user, and please excuse my English.

We love using Drill, and our setup includes Drill with Parquet, Avro, JSON,
JDBC sources and more. Drill offers many great things, but what originally
tipped our decision towards Drill, over Presto, was that we could use it
with both Hive/HDFS and local disk storage, together with its support for
the various data sources.

Working with Drill has not always been easy, and we have spent a lot of time
adjusting to "Drill quirks", like defaulting to Double for values that show
up too late, but the "this is awesome" moments have always been more
frequent than the "I don't believe this s**t" moments (please excuse the
language).

We see the main roles of Drill as the following:

   - Run distributed and fast SQL on top of various data sources and allow
   us to mix the data into a single result
   - Eliminate ETL by supporting evolving schema


Some discussion points:

*1. Null exists, let's use it! (some pun intended)*

   - If a field is missing, let's return Null
   - Schema validation is great, but in a polyglot, mixed-schema
   environment a missing field should not surprise anyone.
   - Drill already has a bunch of functions to deal with null values (the
   sketch right after this list shows the kind of query we have in mind).
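
For illustration only, this is roughly how we would deal with missing fields
today if Drill simply handed them back as Null (the file path and field
names are made up; only the standard COALESCE / CASE functions are assumed):

    SELECT COALESCE(t.country, 'unknown')           AS country,
           CASE WHEN t.zip_code IS NULL
                THEN 'n/a' ELSE t.zip_code END      AS zip_code
    FROM dfs.`/data/events.json` AS t;

Nothing new would be needed on the query side; the reader just has to return
the missing field as Null instead of failing.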

*2. String is the lowest common denominator*

   - Almost all values can be converted to and from Strings

   - Let's use String as the default value type when values are missing,
   instead of Double (pet peeve)

   - Let's always convert String values automatically when a function
   expects another value type and the value is suitable for conversion
   - Log a warning that this is being done when it affects performance,
   rather than throwing errors (a sketch of the casting boilerplate we
   would like to avoid follows below)
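
To make the point concrete, this is the kind of explicit casting we write
today versus what we would like to write (hypothetical file and field names;
only the standard CAST function is assumed):

    -- Today: explicit casts wherever a string column meets a numeric function
    SELECT SUM(CAST(t.amount AS DOUBLE)) AS total
    FROM dfs.`/data/orders.json` AS t
    WHERE CAST(t.quantity AS INT) > 10;

    -- What we would like: Drill converts the strings for us, and only logs
    -- a warning if the implicit conversion is hurting performance
    SELECT SUM(t.amount) AS total
    FROM dfs.`/data/orders.json` AS t
    WHERE t.quantity > 10;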

*3. Be as tolerant towards data as possible - Log warnings rather than
throw errors*

Let's minimize the "conversion boilerplate" needed in SQL by having a
more flexible infrastructure.

   - ISO date strings, timestamps and longs are all valid dates; let's
   treat them as such in any function or condition (see the sketch right
   after this list)

   - Integers can be accurately converted to Real/Double, so let's not make
   that difference matter (going the other way is not the same)

   - Other pointers
      - Missing tables in a union could return empty data sets rather than
      throw errors
      - Empty files should always return empty result sets
      - Valid JSON files, starting with "[" and ending with "]", should be
      trimmed so that they are suitable for Drill
      - It seems odd that Drill only supports non-standard lists
      - An "incomplete last line" in any log file (JSON, CSV etc.) should
      be ignored, as it could represent an incomplete append operation
      (live logs)
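
As an example of the date-related boilerplate, here is what a simple filter
over mixed date representations looks like for us today (field names are
hypothetical, and I may be misremembering the exact TO_TIMESTAMP
signatures):

    -- Today: every representation needs its own explicit conversion
    SELECT *
    FROM dfs.`/data/events.json` AS t
    WHERE TO_TIMESTAMP(t.iso_ts, 'yyyy-MM-dd''T''HH:mm:ss')
            > TIMESTAMP '2015-12-01 00:00:00'
       OR TO_TIMESTAMP(t.epoch_seconds)
            > TIMESTAMP '2015-12-01 00:00:00';

    -- What we would like: the raw values accepted as dates directly
    -- WHERE t.iso_ts > TIMESTAMP '2015-12-01 00:00:00'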

*4. Consistency between data/storage formats if at all possible*

Having different behavior in Parquet and Avro, for example, when it comes
to missing fields is counter-intuitive and feels fragmented.

Please keep "how Drill behaves" consistent rather than letting it fragment
into how every single format reader behaves.



This is by no means an exhaustive list, but I just wanted to see if I could
get the ball rolling.


Hope you are all enjoying the holidays.

Best regards,
 -Stefán

ps.
Our only contribution to Drill is this simple UDF library:
https://github.com/activitystream/asdrill (Apache license)
