So we are off to a flying start :)
On Thu, Oct 29, 2015 at 9:50 PM, Stefán Baxter <[email protected]>
wrote:
> Hi,
>
> We are using Avro, JSON and Parquet for collecting various types of data
> for analytical processing.
>
> I had not used Parquet before we started to play around with Drill, and
> now I'm wondering if we are planning our data structures correctly and if we
> will be able to get the most out of Drill+Parquet.
>
> I have some questions and I hope the answers can be turned into a Best
> Practices document.
>
> So here we go:
>
> - Are there any rules that we must abide by to make scanning of
> "low-cardinality" columns as effective as it can be?
> - As I understand it, the Parquet dictionary is scanned for the
> value(s), and if they are not in the dictionary the section is ignored
>
> - Can dictionary based scanning (as described above) work on arrays?
> - like: {"some":"simple","tags":["blue","green","yellow"]}
>
> - If I have multiple files, each containing a day's worth of logging, in
> chronological order, will all the irrelevant files be ignored when looking
> for a date or a date range?
> - AKA - Will the min-max headers in Parquet be used to prevent
> scanning of data outside the range?
>
> - Is there anything I need to do to make sure that the write
> optimizations in Parquet are used?
> - dictionaries for low cardinality fields
> - "number folding" for numerical sequences
> - compression etc.
>
> - Are there any Parquet features that are not available in Drill?
> - I know Drill is using a fork of Parquet and I wonder if any major
> improvements in Parquet are unavailable
>
> - Storing Dates with timezone information (stored in two separate
> fields?)
> - What is the common approach?
>
> - Are there any caveats in converting Avro to Parquet?
> - other than converting Unix dates from Avro (only long
> available) into timestamp fields in Parquet
>
>
> There will, in all likelihood, be future installments to this entry as new
> questions arise.
>
> All help is appreciated.
>
> Regards,
> -Stefan
>
>
>
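To make the two "skipping" questions above concrete, here is a minimal sketch of the idea in plain Python. It does not use Parquet or Drill at all: `FileStats`, `overlaps` and `files_to_scan` are hypothetical names, and the per-file min/max dates and tag dictionaries are made-up stand-ins for the statistics a Parquet footer could carry. It also assumes the dictionary covers every value in the file, which may not hold if a writer falls back to plain encoding. Whether Drill actually prunes this way is exactly the open question.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file summary; not a real Parquet or Drill API."""
    name: str
    min_date: date             # smallest date value stored in the file
    max_date: date             # largest date value stored in the file
    tag_dictionary: frozenset  # assumed to list every distinct tag in the file


def overlaps(stats, start, end):
    """True if the file's [min_date, max_date] range intersects [start, end]."""
    return stats.min_date <= end and stats.max_date >= start


def files_to_scan(files, start, end, tag=None):
    """Return names of files that cannot be ruled out from their stats alone."""
    selected = []
    for f in files:
        if not overlaps(f, start, end):
            continue  # min/max proves no row can fall inside the date range
        if tag is not None and tag not in f.tag_dictionary:
            continue  # value absent from the dictionary, so no row can match
        selected.append(f.name)
    return selected


# Made-up example: one file per day of logging.
files = [
    FileStats("2015-10-27.parquet", date(2015, 10, 27), date(2015, 10, 27),
              frozenset({"blue", "green"})),
    FileStats("2015-10-28.parquet", date(2015, 10, 28), date(2015, 10, 28),
              frozenset({"yellow"})),
    FileStats("2015-10-29.parquet", date(2015, 10, 29), date(2015, 10, 29),
              frozenset({"blue"})),
]
```

With these made-up files, a query for 28-29 October would only need to open the last two files, and adding a filter on the tag "blue" would leave only the last one. Files that survive the check still have to be scanned row by row; the statistics can only rule files out, never confirm a match.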