I've read through your post and had similar thoughts on trying to gather
information about Parquet files. I feel it would be really helpful to have
a section of the Drill User Docs dedicated to user stories about Parquet
files. I know stories sound odd to put into documentation, but the
challenge of explaining optimization of something like Parquet is that you
can either do it from a dry, academic point of view, which can be hard for
the user base to really understand, or you can provide lots of stories
that could be annotated by devs or improved with links to other stories.

What I mean by stories is example data along with how it is queried and
why it was stored that way (partitions based on directories, options for
"loading" data into directories, using partitions within the files, how
Parquet optimizes so folks know where to put extra effort into typing,
etc.)

As to your specific questions, I can't answer them myself; I've wondered
about some of them too, but haven't gotten around to asking. My
experiences with Parquet have been generally positive, but have involved
a good amount of trial and error (as you can see from some of my user
posts). Also, the user group has been great, but to my point about user
stories, my education has come from posting stories and getting feedback
from the community. It would be neat to see this as a first-class part of
the documentation, as I think it could help folks with Parquet, Drill,
and optimizing their environment.

Wish I could be of more help beyond +1 :)



On Sun, Nov 1, 2015 at 1:48 AM, Stefán Baxter <[email protected]>
wrote:

> So we are off to a flying start :)
>
> On Thu, Oct 29, 2015 at 9:50 PM, Stefán Baxter <[email protected]>
> wrote:
>
> > Hi,
> >
> > We are using Avro, JSON and Parquet for collecting various types of data
> > for analytical processing.
> >
> > I had not used Parquet before we started to play around with Drill, and
> > now I'm wondering if we are planning our data structures correctly and
> > if we will be able to get the most out of Drill+Parquet.
> >
> > I have some questions and I hope the answers can be turned into a Best
> > Practices document.
> >
> > So here we go:
> >
> >    - Are there any rules that we must abide by to make scanning of
> >    "low-cardinality" columns as effective as they can be?
> >    - My understanding is that the Parquet dictionary is scanned for the
> >    value(s), and if they are not in the dictionary, the section is
> >    ignored
> >
> >    - Can dictionary-based scanning (as described above) work on arrays?
> >    - like: {"some":"simple","tags":["blue","green","yellow"]}
> >
> >    - If I have multiple files containing a day's worth of logging, in
> >    chronological order, will all the irrelevant files be ignored when
> >    looking for a date or a date range?
> >    - AKA: will the min-max headers in Parquet be used to prevent
> >    scanning of data outside the range?
> >
> >    - Is there anything I need to do to make sure that the write
> >    optimizations in Parquet are used?
> >    - dictionaries for low cardinality fields
> >    - "number folding" for numerical sequences
> >    - compression etc.
> >
> >    - Are there any Parquet features that are not available in Drill?
> >    - I know Drill is using a fork of Parquet, and I wonder if any major
> >    improvements in Parquet are unavailable
> >
> >    - Storing Dates with timezone information (stored in two separate
> >    fields?)
> >    - What is the common approach?
> >
> >    - Are there any caveats in converting Avro to Parquet?
> >    - other than converting Unix dates from Avro (only long is
> >    available) into timestamp fields in Parquet
> >
> >
> > There will, in all likelihood, be future installments to this entry as
> > new questions arise.
> >
> > All help is appreciated.
> >
> > Regards,
> >  -Stefan
> >
> >
> >
>
