Re: append data to already existing table saved in parquet format

Divya Gehlot Wed, 26 Jul 2017 18:59:34 -0700

Yes Paul I am looking for the insert into partition feature .
In this way we just have to create the file for that particular partition
when new data comes in or any updation if its required .
Else every time when data comes in have run the view and recreate the
parquet files for whole data set which is very time consuming specially
when your data is being visualized in some real time dashboard .


Thanks,
Divya

On 27 July 2017 at 08:40, Paul Rogers <[email protected]> wrote:

> Hi Divya,
>
> Seems that you are asking for an “INSERT INTO” feature (DRILL-3534). The
> idea would be to create new Parquet files into an existing partition
> structure. That feature has not yet been started. So, the workarounds
> provided might help you for now.
>
> - Paul
>
> > On Jul 26, 2017, at 8:46 AM, Saurabh Mahapatra <
> [email protected]> wrote:
> >
> > Does Drill provide that kind of functionality? Theoretically yes. CTAS
> > should work. But your cluster has to be sized. But I would never put
> > something in such a pipeline without adequate testing. And I would always
> > consider a lambda architecture to ensure that if this path were to fail
> > (with Drill or any other combination of tools), you can recover from the
> > failure. Each failure that you have puts you behind. If you have several
> > failures, you will be backlogged and need a mechanism to catch up.
> >
> > For data growth, you would need to go back to the source of the data and
> > estimate the row cardinality. If this is coming from a OLTP system, then
> it
> > is related to volume of transactions in the business process. If you do
> not
> > understand that load, your system will eventually start failing in the
> > future with Drill or otherwise.
> >
> > Sizing and testing. Just do it.
> >
> > Thanks,
> > Saurabh
> >
> >
> >
> > On Wed, Jul 26, 2017 at 2:52 AM, Divya Gehlot <[email protected]>
> > wrote:
> >
> >> The data size is not big for every hour but  data size will grow with
> the
> >> time say if I have data for 2 years and data is coming on hourly basis
> and
> >> everytime creating the paruqet table is not the feasible solution .
> >> Likewise for hive create the partition and insert the data into
> partition
> >> accordingly .
> >> Was lookiing for that kind of solution.
> >> Does Drill provides that kind of functionalty ?
> >>
> >> Thanks,
> >> Divya
> >>
> >>
> >> On 26 July 2017 at 15:04, Saurabh Mahapatra <
> [email protected]>
> >> wrote:
> >>
> >>> I always recommend against using CTAS as a shortcut for a ETL type
> large
> >>> workload. You will need to size your Drill cluster accordingly.
> Consider
> >>> using Hive or Spark instead.
> >>>
> >>> What are the source file formats? For every hour, what is the size and
> the
> >>> number of rows for that data? Are you doing any aggregations? And what
> is
> >>> the lag between the streaming data and data available for analytics
> that
> >>> you are willing to tolerate?
> >>>
> >>> On Tue, Jul 25, 2017 at 11:27 PM, rahul challapalli <
> >>> [email protected]> wrote:
> >>>
> >>>> I am not aware of any clean way to do this. However if your data is
> >>>> partitioned based on directories, then you can use the below hack
> which
> >>>> leverages temporary tables [1]. Essentially, you backup your partition
> >>> to a
> >>>> temp table, then override it by taking the union of new partition data
> >>> and
> >>>> existing partition data. This way we are not over-writing the entire
> >>> table.
> >>>>
> >>>> create temporary table mytable_2017 (col1, col2....)  as select col1,
> >>> col2,
> >>>> ......from mytable where dir0 = "2017";
> >>>> drop table `mytable/2017`;
> >>>> create table `mytable/2017` as
> >>>>    select col1, col2 .........from new_partition_data
> >>>>    union
> >>>>    select col1, col2 ......... from mytable_2017;
> >>>> drop table mytable_2017;
> >>>>
> >>>> Caveat : Temporary tables get dropped automatically if the session
> ends
> >>> or
> >>>> the drillbit crashes. In the above sequence, if the connection gets
> >>> dropped
> >>>> (there are known issues causing this) between the client and drillbit
> >>> after
> >>>> executing the "DROP" statement, then your partition data is lost
> >>> forever.
> >>>> And since drill doesn't support transactions, the mentioned approach
> is
> >>>> dangerous.
> >>>>
> >>>> [1] https://drill.apache.org/docs/create-temporary-table-as-cttas/
> >>>>
> >>>>
> >>>> On Tue, Jul 25, 2017 at 10:52 PM, Divya Gehlot <
> [email protected]
> >>>>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>> I am naive to Apache drill.
> >>>>> As I have data coming in every hour , when I searched I couldnt find
> >>> the
> >>>>> insert into partition command in Apache drill.
> >>>>> How can we insert data to particular partition without rewriting the
> >>>> whole
> >>>>> data set ?
> >>>>>
> >>>>>
> >>>>> Appreciate the help.
> >>>>> Thanks,
> >>>>> Divya
> >>>>>
> >>>>
> >>>
> >>
> >>
>
>

Re: append data to already existing table saved in parquet format

Reply via email to