Re: append data to already existing table saved in parquet format

Paul Rogers Wed, 26 Jul 2017 17:41:02 -0700

Hi Divya,

It seems you are asking for an “INSERT INTO” feature (DRILL-3534). The idea 
would be to write new Parquet files into an existing partition structure. Work 
on that feature has not yet started, so the workarounds provided might help you 
for now.
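
As a rough sketch of one directory-based workaround (not the missing INSERT
INTO feature itself): each hourly batch could be written with CTAS into its own
new subdirectory under the table root, so existing Parquet files are never
rewritten. The workspace `dfs.tmp`, the staging path, the directory layout, and
the column names below are all assumptions for illustration.

```sql
-- Assumption: dfs.tmp is a writable workspace and mytable is a directory tree
-- laid out as mytable/<year>/<month>/<day>/<hour> (hypothetical layout).
ALTER SESSION SET `store.format` = 'parquet';

-- Write only the new hour into a directory that does not exist yet.
-- The staging path and column names are placeholders.
CREATE TABLE dfs.tmp.`mytable/2017/07/26/17` AS
SELECT col1, col2
FROM dfs.`/staging/hourly_batch.json`;

-- Query the table root; dir0..dir3 expose the directory levels.
SELECT col1, col2
FROM dfs.tmp.`mytable`
WHERE dir0 = '2017' AND dir1 = '07';
```

CTAS fails if the target directory already exists, so each batch needs its own
directory. This adds new files next to the old ones; it does not append to or
modify any existing file.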


- Paul

> On Jul 26, 2017, at 8:46 AM, Saurabh Mahapatra <[email protected]> 
> wrote:
> 
> Does Drill provide that kind of functionality? Theoretically yes. CTAS
> should work. But your cluster has to be sized. But I would never put
> something in such a pipeline without adequate testing. And I would always
> consider a lambda architecture to ensure that if this path were to fail
> (with Drill or any other combination of tools), you can recover from the
> failure. Each failure that you have puts you behind. If you have several
> failures, you will be backlogged and need a mechanism to catch up.
> 
> For data growth, you would need to go back to the source of the data and
> estimate the row cardinality. If this is coming from an OLTP system, then it
> is related to the volume of transactions in the business process. If you do
> not understand that load, your system will eventually start failing, with
> Drill or otherwise.
> 
> Sizing and testing. Just do it.
> 
> Thanks,
> Saurabh
> 
> 
> 
> On Wed, Jul 26, 2017 at 2:52 AM, Divya Gehlot <[email protected]>
> wrote:
> 
>> The data size is not big for each hour, but it will grow over time. Say I
>> have two years of data arriving on an hourly basis: recreating the Parquet
>> table every time is not a feasible solution.
>> In Hive, for example, you create the partition and insert the data into
>> that partition accordingly.
>> I was looking for that kind of solution.
>> Does Drill provide that kind of functionality?
>> 
>> Thanks,
>> Divya
>> 
>> 
>> On 26 July 2017 at 15:04, Saurabh Mahapatra <[email protected]>
>> wrote:
>> 
>>> I always recommend against using CTAS as a shortcut for a large ETL-type
>>> workload. You will need to size your Drill cluster accordingly. Consider
>>> using Hive or Spark instead.
>>> 
>>> What are the source file formats? For every hour, what is the size and the
>>> number of rows for that data? Are you doing any aggregations? And what is
>>> the lag between the streaming data and data available for analytics that
>>> you are willing to tolerate?
>>> 
>>> On Tue, Jul 25, 2017 at 11:27 PM, rahul challapalli <
>>> [email protected]> wrote:
>>> 
>>>> I am not aware of any clean way to do this. However, if your data is
>>>> partitioned by directory, you can use the hack below, which leverages
>>>> temporary tables [1]. Essentially, you back up your partition to a
>>>> temp table, then overwrite it with the union of the new partition data
>>>> and the existing partition data. This way we are not overwriting the
>>>> entire table.
>>>> 
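>>>> -- Back up the existing partition into a session-scoped temporary table.
>>>> create temporary table mytable_2017 (col1, col2, ...) as
>>>>    select col1, col2, ... from mytable where dir0 = '2017';
>>>> -- Drop the partition directory, then rebuild it from new + old rows.
>>>> drop table `mytable/2017`;
>>>> create table `mytable/2017` as
>>>>    select col1, col2, ... from new_partition_data
>>>>    union all  -- keeps duplicate rows; plain union would remove them
>>>>    select col1, col2, ... from mytable_2017;
>>>> drop table mytable_2017;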
>>>> 
>>>> Caveat: Temporary tables get dropped automatically if the session ends
>>>> or the drillbit crashes. In the above sequence, if the connection
>>>> between the client and drillbit gets dropped (there are known issues
>>>> causing this) after executing the "DROP" statement, then your partition
>>>> data is lost forever. And since Drill doesn't support transactions, the
>>>> mentioned approach is dangerous.
>>>> 
>>>> [1] https://drill.apache.org/docs/create-temporary-table-as-cttas/
>>>> 
>>>> 
>>>> On Tue, Jul 25, 2017 at 10:52 PM, Divya Gehlot <[email protected]>
>>>> wrote:
>>>> 
>>>>> Hi,
>>>>> I am new to Apache Drill.
>>>>> I have data coming in every hour. When I searched, I couldn't find an
>>>>> INSERT INTO partition command in Apache Drill.
>>>>> How can we insert data into a particular partition without rewriting the
>>>>> whole data set?
>>>>> 
>>>>> 
>>>>> Appreciate the help.
>>>>> Thanks,
>>>>> Divya
>>>>> 
>>>> 
>>> 
>> 
>>
