Hi Divya,

There are a couple of factors that play into how partition pruning helps.
(1) The access pattern: which dimensions (and how many of them) are
queried, and how frequently.
(2) Data distribution across the dimensions listed in (1). If there is
skew, partition pruning may be of little or no use.
(3) Cardinality of the dimensions listed in (1).
(4) The partition directory hierarchy, i.e., the ordering of dimensions
from top to bottom.

In your case, the directory structure you listed is a reasonable starting
point. However, I recommend also trying the reordered structure (location
on top, time below it) and testing both. Data layouts need to be
experimented with.

Instead of hand-waving, let me walk through a thought exercise to show you
how I think about these problems.

Make the following assumptions. Assume the simplest query engine: one that
returns all rows matching a filter condition. It does no optimization on
filters; it simply scans the records to check whether each filter
condition is met. Also assume that you have limited cluster resources to
scan these records (no matter how many ideally parallel machines you have,
the resources are not infinite).

(A) Probability of dimensions appearing in a query (assume they are
independent; even with dependence, you could work out the joint
probabilities using Bayes' rule)
1. Probability of time dimension in query: 90%
2. Probability of location dimension in query: 60%

(B) Number of records and baseline
Let us assume the dataset contains 1,000,000,000 (1 billion) records.
Without partition pruning, you would scan all 1 billion records regardless
of the filter condition. The data could be laid out in any order, so a
worst-case measurement makes sense.

(C) Assume the cardinality of the directory partitions (or the cardinality
of the dimensions themselves)
Time dimension: 1000
Location dimension: 10

(D) Assume that there is no data skew, i.e., a uniform distribution of
records. This means the 1 billion records are spread evenly:
1,000,000,000/1,000=1 million records per time partition, and within each
time partition, 1,000,000/10=100,000 records in each location partition.
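The uniform-split arithmetic above can be sketched in a few lines (the
record count and cardinalities are just the assumed values from this
exercise, not measurements):

```python
# Assumed values from this thought exercise, not measurements.
TOTAL_RECORDS = 1_000_000_000  # 1 billion records overall
TIME_CARD = 1_000              # cardinality of the time dimension
LOCATION_CARD = 10             # cardinality of the location dimension

# Uniform distribution: every folder at a level gets an equal share.
records_per_time = TOTAL_RECORDS // TIME_CARD         # 1,000,000 per time folder
records_per_leaf = records_per_time // LOCATION_CARD  # 100,000 per leaf folder

print(records_per_time, records_per_leaf)
```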

If the filter predicate contains at most 10 conditions on a subset of the
time dimension's values, the worst case cost of access is 10*1,000,000=10
million records, i.e., you will access 10 time folders, and that count is
multiplied by the number of records per folder.

If there is a second filter predicate that contains at most 5 conditions
on the location, the worst case cost of access is reduced further to
10*5*100,000=5 million records.
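Here is how I would compute those pruned scan costs from the uniform
split, as a small sketch under the stated assumptions (10 time conditions,
5 location conditions):

```python
# Folder sizes under the assumed uniform split (no skew).
records_per_time_folder = 1_000_000_000 // 1_000         # 1,000,000
records_per_leaf_folder = records_per_time_folder // 10  # 100,000

# A filter with up to 10 time conditions prunes the scan to 10 time folders.
time_only_scan = 10 * records_per_time_folder            # 10,000,000

# Adding up to 5 location conditions prunes each of those 10 time folders
# down to 5 leaf folders.
both_filters_scan = 10 * 5 * records_per_leaf_folder     # 5,000,000

print(time_only_scan, both_filters_scan)
```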

Any query coming into the system has four possibilities with respect to
what appears in the filter predicate:
1. No time, no location dimension. The probability of this happening is
(1-0.9)*(1-0.6)=0.1*0.4=0.04.

Worst case cost of access is 0.04*1,000,000,000=40,000,000=40 million.

2. Only time, no location dimension: The probability of this happening is
0.9*0.4=0.36
Worst case cost of access is 0.36*10*1,000,000 (10 time folders of 1
million records each)=3.6 million

3. Only location, no time dimension: The probability of this happening is
0.6*0.1=0.06
Worst case cost of access is 0.06*5*100,000*1000 (for all time folders)=30
million

4. Both location and time dimensions present: The probability of this
happening is 0.9*0.6=0.54. This is important: it tells you there is just
over a 50% chance that both dimensions appear in the filter.

Worst case cost of access is: 0.54*10*5*100,000=2,700,000

Add them all up and you have the expected worst case cost of access: 40
million+3.6 million+30 million+2.7 million=76.3 million. So partition
pruning based on this access pattern has reduced the scan from 1 billion
to about 76 million records, a ~92% reduction (roughly a 13x improvement).
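Putting the four cases together as an expected-cost calculation (a sketch
under the same assumptions: independent predicate probabilities, uniform
data, and full pruning on every filtered dimension):

```python
# Assumed values from the thought exercise.
P_TIME, P_LOC = 0.9, 0.6                 # predicate probabilities
TOTAL = 1_000_000_000                    # total records
PER_TIME, PER_LEAF = 1_000_000, 100_000  # folder sizes from the uniform split

cases = {
    # no usable filter: full scan
    "neither":   (1 - P_TIME) * (1 - P_LOC) * TOTAL,
    # 10 time folders of 1M records each
    "time only": P_TIME * (1 - P_LOC) * 10 * PER_TIME,
    # 5 location leaves inside every one of the 1,000 time folders
    "loc only":  (1 - P_TIME) * P_LOC * 5 * PER_LEAF * 1_000,
    # 10 time folders x 5 location leaves
    "both":      P_TIME * P_LOC * 10 * 5 * PER_LEAF,
}
expected = sum(cases.values())
print(f"{expected:,.0f}")  # roughly 76.3 million
```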

Now, let us consider the second case when you change the directory
structure to have location partition at the top and time dimension just
below it.

There are 10 location directories, each containing 1,000,000,000/10=100
million records.
Within each location directory, there are 1,000 time folders, and so we
have 100,000,000/1,000=100,000 records per time folder.

Any query coming into the system has the same four possibilities with
respect to what appears in the filter predicate:
1. No time, no location dimension. The probability of this happening is
(1-0.9)*(1-0.6)=0.1*0.4=0.04.

Worst case cost of access is 0.04*1,000,000,000=40,000,000=40 million.

2. Only time, no location dimension: The probability of this happening is
0.9*0.4=0.36
Worst case cost of access is 0.36*10*10*100,000 (10 time folders in each
of the 10 location directories)=3.6 million

3. Only location, no time dimension: The probability of this happening is
0.6*0.1=0.06
Worst case cost of access is 0.06*5*100 million=30 million

4. Both location and time dimensions present: The probability of this
happening is 0.9*0.6=0.54. Again, there is just over a 50% chance that
both dimensions appear in the filter.

Worst case cost of access is: 0.54*5*10*100,000=2,700,000. No change here.

Add them all up and you have the expected worst case cost of access: 40
million+3.6 million+30 million+2.7 million=76.3 million, the same as
before. Under these idealized assumptions (uniform distribution and
perfect pruning on every filtered dimension), reordering the hierarchy
does not change the expected scan cost. In practice, skew, partial
predicates, and file sizes break those assumptions, which is why I
recommend testing both layouts rather than relying on the arithmetic
alone.
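To see how the hierarchy ordering washes out under these idealized
assumptions, here is a small generalized sketch (dimension A at the top of
the directory tree, B nested below it; all parameter values are the
assumed ones from this exercise):

```python
def expected_scan(total, card_a, conds_a, p_a, card_b, conds_b, p_b):
    """Expected worst-case records scanned with dimension A at the top of
    the hierarchy and B nested below it, assuming uniform data and perfect
    pruning on every filtered dimension."""
    per_a = total // card_a     # records per top-level folder
    per_leaf = per_a // card_b  # records per leaf folder
    return (
        (1 - p_a) * (1 - p_b) * total                    # no filter: full scan
        + p_a * (1 - p_b) * conds_a * per_a              # A filter only
        + (1 - p_a) * p_b * conds_b * per_leaf * card_a  # B only, in every A folder
        + p_a * p_b * conds_a * conds_b * per_leaf       # both filters
    )

# time on top, location below        vs.  location on top, time below
time_top = expected_scan(1_000_000_000, 1_000, 10, 0.9, 10, 5, 0.6)
loc_top = expected_scan(1_000_000_000, 10, 5, 0.6, 1_000, 10, 0.9)
print(f"{time_top:,.0f} vs {loc_top:,.0f}")  # both roughly 76.3 million
```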

Thanks,
Saurabh

On Sun, Jul 30, 2017 at 7:15 PM, Divya Gehlot <[email protected]>
wrote:

> Hi Saurabh,
>
> If I know that the likelihood of partitioning by
> location is 60%, then I will create a nested directory structure with
> time(month) at top of the hierarchy and location just below it.
>
> You mean first have to create the partition directory by year,month and
> then create a nested directory structure
> Something like what is shown below
> /path/to/directory
>              /country1
>                /datafiles.parquet
>             /country2
>              datafiles.parquet
>
> Thanks,
> Divya
>
>
>
> On 26 July 2017 at 03:21, Saurabh Mahapatra <[email protected]>
> wrote:
>
> > Hi Divya,
> >
> > There is nothing as a naive question. Please feel free to post any
> > questions you have. There is someone in the community that will help you
> > out.
> >
> > This is my opinion:
> > There are a variety of BI tools in the market that offer excellent
> > visualization and interaction with data capabilities. Tableau,
> > MicroStrategy, Qlik to name a few. These are tools built by companies but
> > you could be building your own web app that is highly customized for your
> > users. The need for such tools has arisen from the needs of the end BI
> > (business intelligence) user, who does not have the time or patience to
> > type SQL queries. If you are slicing and dicing data while following your
> > intuition, imagine having to rewrite the SQL queries each time and
> ensuring
> > that they work syntactically.
> >
> > That is a lot to ask for the average user who wants to look at the data
> in
> > different ways and make a decision that hopefully results in some
> > action(and not just powerpoint slides). The latter is more important than
> > anything else inside a company.
> >
> > Drill provides the SQL query layer for interaction with the data
> underneath
> > scattered across various data sources.
> >
> > So before you jump in to standardize on any BI tool-ask yourself: who are
> > the business users, what are their needs in terms of decision making and
> > what kind of workflows do they envision. Then work backwards to find out
> > the tool that best meets your needs. Be open to the idea of building your
> > web app if that is something you envision will benefit your users in the
> > long term.
> >
> > As for the other questions, these are the ones related to data retrieval
> > (efficiency which results in performance). I will tell you what I know
> and
> > other can chime in with better info:
> >
> > 1. Metadata cache: Only for Parquet files. The idea here is to store the
> > metadata associated with Parquet rowgroups per file in a separate file so
> > that you avoid having to open and close every Parquet file to get that
> > info. Metadata can help you understand basic statistics such as mins and
> > maxes so that you can skip rowgroups or files that do not match your
> filter
> > condition. This idea of storing metadata is not new across other query
> > engines.
> >
> > Read more here:
> > https://drill.apache.org/docs/optimizing-parquet-metadata-reading/
> >
> > 2. Partitioning: Drill is "directory aware". This is an age old concept
> of
> > partitioning your data in a way so that Drill can skip directories that
> are
> > part of the filter condition. The layout and structuring of the data now
> > helps Drill. Partitioning schemes depend on query patterns. One rule of
> > thumb that I use is to look at the BI users and observe their workflows.
> If
> > they use a time range as the basis of every analysis, then I will
> partition
> > by time (say month). If I know that the likelihood of partitioning by
> > location is 60%, then I will create a nested directory structure with
> > time(month) at top of the hierarchy and location just below it.
> >
> > Read more here:
> > https://drill.apache.org/docs/partition-pruning-introduction/
> >
> > 3. Generation of Parquet file
> >
> > https://drill.apache.org/docs/parquet-format/
> >
> > Please pay attention to how you configure the writer:
> > https://drill.apache.org/docs/parquet-format/#configuring-
> > the-parquet-storage-format
> >
> > 4. Custom column calculation:
> > There is some out of the box stuff here:
> > https://drill.apache.org/docs/sql-window-functions-introduction/
> >
> > But you should also be aware of nesting operations as well:
> > https://drill.apache.org/docs/nested-data-limitations/
> >
> > and of course there are UDFs:
> > https://drill.apache.org/docs/adding-custom-functions-to-
> > drill-introduction/
> >
> > Please let us know if you have any additional questions.
> >
> > Happy drilling:)
> >
> > Saurabh
> >
> >
> >
> >
> >
> >
> >
> > On Tue, Jul 25, 2017 at 12:54 AM, Divya Gehlot <[email protected]>
> > wrote:
> >
> > > Hi,
> > > As a naive user I would like to know the benefits of Apache Drill with
> > tableau
> > > ?
> > >
> > > As per my understanding we to visualize we need to push the data to
> > tableau
> > > for granular visualization .
> > >
> > > Would like to understand a few features of Drill in terms of visualization
> or
> > > data retrieval :
> > > 1, Metadata Caching
> > > 2 .Partitioning
> > > 3.Generation of Parquet File
> > > 4.Custom column calculation.
> > >
> > >
> > > Thanks,
> > > Divya
> > >
> >
>
