Hi Saurabh,
If I know that the likelihood of partitioning by
location is 60%, then I will create a nested directory structure with
time(month) at top of the hierarchy and location just below it.
Do you mean I first have to create the partition directory by year and month,
and then create a nested directory structure inside it?
Something like the one shown below:
/path/to/directory
    /country1
        /datafiles.parquet
    /country2
        /datafiles.parquet
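
To make the layout under discussion concrete, here is a minimal Python sketch with month at the top of the hierarchy and location just below it. The month values, the helper names `partition_path` and `drill_query`, and the use of the `dfs` storage plugin are assumptions for illustration only. Drill exposes directory levels as the implicit columns `dir0`, `dir1`, and so on, so a filter on them is what lets it skip directories.

```python
from pathlib import PurePosixPath

# Placeholder base path taken from the example layout; not a real location.
BASE = PurePosixPath("/path/to/directory")


def partition_path(month: str, country: str) -> PurePosixPath:
    """Return the file path for one (month, country) partition.

    Month sits at the top of the hierarchy and country just below it.
    """
    return BASE / month / country / "datafiles.parquet"


def drill_query(month: str, country: str) -> str:
    """Build a Drill query string that filters on the directory levels.

    dir0 is the first directory level under the table root (month here)
    and dir1 is the second level (country here).
    """
    return (
        f"SELECT * FROM dfs.`{BASE}` "
        f"WHERE dir0 = '{month}' AND dir1 = '{country}'"
    )


if __name__ == "__main__":
    for month in ("2017-06", "2017-07"):  # hypothetical month values
        for country in ("country1", "country2"):
            print(partition_path(month, country))
    print(drill_query("2017-07", "country1"))
```

With this layout, a query that filters only on month can still prune at the top level, while a query that also filters on location prunes one level further down.
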
Thanks,
Divya
On 26 July 2017 at 03:21, Saurabh Mahapatra <[email protected]>
wrote:
> Hi Divya,
>
> There is no such thing as a naive question. Please feel free to post any
> questions you have. There is someone in the community who will help you
> out.
>
> This is my opinion:
> There are a variety of BI tools in the market that offer excellent
> capabilities for visualizing and interacting with data: Tableau,
> MicroStrategy, and Qlik, to name a few. These are tools built by companies,
> but you could also build your own web app that is highly customized for your
> users. The need for such tools has arisen from the end BI (business
> intelligence) user, who does not have the time or patience to type SQL
> queries. If you are slicing and dicing data while following your
> intuition, imagine having to rewrite the SQL queries each time and ensuring
> that they work syntactically.
>
> That is a lot to ask of the average user who wants to look at the data in
> different ways and make a decision that hopefully results in some
> action (and not just PowerPoint slides). That action is more important than
> anything else inside a company.
>
> Drill provides the SQL query layer for interacting with the underlying
> data, which may be scattered across various data sources.
>
> So before you jump in to standardize on any BI tool, ask yourself: who are
> the business users, what are their needs in terms of decision making, and
> what kind of workflows do they envision? Then work backwards to find
> the tool that best meets those needs. Be open to the idea of building your
> own web app if that is something you envision will benefit your users in
> the long term.
>
> As for the other questions, these are related to data retrieval
> (efficiency, which results in performance). I will tell you what I know and
> others can chime in with better info:
>
> 1. Metadata cache: Only for Parquet files. The idea here is to store the
> metadata associated with the Parquet row groups of each file in a separate
> file so that you avoid having to open and close every Parquet file to get
> that info. Metadata can give you basic statistics such as mins and
> maxes so that you can skip row groups or files that do not match your
> filter condition. This idea of storing metadata is not new; other query
> engines use it as well.
>
> Read more here:
> https://drill.apache.org/docs/optimizing-parquet-metadata-reading/
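
As an illustration of why min/max statistics help, here is a small Python sketch of skipping row groups against a range filter. It is not Drill's implementation: the `RowGroupStats` structure, the `row_groups_to_read` helper, and the sample numbers are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Min/max of one column for one row group (hypothetical structure)."""
    file: str
    row_group: int
    min_value: int
    max_value: int


def row_groups_to_read(stats, low, high):
    """Keep only row groups whose [min, max] range overlaps [low, high]."""
    return [s for s in stats if s.max_value >= low and s.min_value <= high]


# Made-up statistics, collected in one place the way a metadata cache would,
# so no Parquet file has to be opened to decide what to read.
cache = [
    RowGroupStats("a.parquet", 0, 1, 100),
    RowGroupStats("a.parquet", 1, 101, 200),
    RowGroupStats("b.parquet", 0, 201, 300),
]

if __name__ == "__main__":
    # Filter equivalent to: value BETWEEN 150 AND 250
    for s in row_groups_to_read(cache, 150, 250):
        print(s.file, s.row_group)
```

The first row group of `a.parquet` tops out at 100, so it cannot contain a match and is never read.
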
>
> 2. Partitioning: Drill is "directory aware". This is the age-old concept of
> partitioning your data so that Drill can skip directories that do not
> match the filter condition. The layout and structure of the data then
> help Drill. Partitioning schemes depend on query patterns. One rule of
> thumb that I use is to look at the BI users and observe their workflows. If
> they use a time range as the basis of every analysis, then I will partition
> by time (say month). If I know that the likelihood of also filtering by
> location is 60%, then I will create a nested directory structure with
> time (month) at the top of the hierarchy and location just below it.
>
> Read more here:
> https://drill.apache.org/docs/partition-pruning-introduction/
>
> 3. Generation of Parquet files:
>
> https://drill.apache.org/docs/parquet-format/
>
> Please pay attention to how you configure the writer:
> https://drill.apache.org/docs/parquet-format/#configuring-the-parquet-storage-format
>
> 4. Custom column calculation:
> There is some out of the box stuff here:
> https://drill.apache.org/docs/sql-window-functions-introduction/
>
> But you should also be aware of the limitations on nested data:
> https://drill.apache.org/docs/nested-data-limitations/
>
> and of course there are UDFs:
> https://drill.apache.org/docs/adding-custom-functions-to-drill-introduction/
>
> Please let us know if you have any additional questions.
>
> Happy drilling:)
>
> Saurabh
>
>
> On Tue, Jul 25, 2017 at 12:54 AM, Divya Gehlot <[email protected]>
> wrote:
>
> > Hi,
> > As a naive user, I would like to know the benefits of using Apache Drill
> > with Tableau.
> >
> > As per my understanding, to visualize the data we need to push it to
> > Tableau for granular visualization.
> >
> > I would like to understand a few features of Drill in terms of
> > visualization or data retrieval:
> > 1. Metadata caching
> > 2. Partitioning
> > 3. Generation of Parquet files
> > 4. Custom column calculation
> >
> >
> > Thanks,
> > Divya
> >
>