Re: append data to already existing table saved in parquet format

Saurabh Mahapatra Wed, 26 Jul 2017 19:33:07 -0700

But append-only means you are adding event records to a table (forget the layout 
for a while). That means you have to write to the end of the table. If there are 
too many writes, you have to batch them and then convert them into a columnar 
format.
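
A minimal sketch of that batching step, in Python purely for illustration: 
row-shaped event records are buffered, and each full batch is pivoted into 
per-column lists, which is the shape a columnar writer wants. The names 
RowBatcher, batch_size, and the event fields are hypothetical; nothing here 
is Drill or Parquet API.

```python
from typing import Any, Dict, List


class RowBatcher:
    """Buffer row-oriented events and emit them as column-oriented batches."""

    def __init__(self, columns: List[str], batch_size: int) -> None:
        self.columns = columns
        self.batch_size = batch_size
        self._rows: List[Dict[str, Any]] = []
        # Stand-in for "files written": each entry is one columnar batch.
        self.flushed: List[Dict[str, List[Any]]] = []

    def append(self, row: Dict[str, Any]) -> None:
        """Add one event; flush automatically once the batch is full."""
        self._rows.append(row)
        if len(self._rows) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Pivot the buffered rows into one list per column, then reset."""
        if not self._rows:
            return
        batch = {c: [r.get(c) for r in self._rows] for c in self.columns}
        self.flushed.append(batch)
        self._rows = []


batcher = RowBatcher(columns=["ts", "user", "amount"], batch_size=2)
for event in [
    {"ts": 1, "user": "a", "amount": 10},
    {"ts": 2, "user": "b", "amount": 20},
    {"ts": 3, "user": "a", "amount": 30},
]:
    batcher.append(event)
batcher.flush()  # write out the final, partially filled batch
```

The point of the sketch is that the row-to-column conversion happens once per 
batch, not once per event.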


This to me sounds like a Kafka workflow where you keep ingesting event data, 
then batch process it (or stream process it). Writing or appending to a 
columnar store when your data is in a row-like format does not sound efficient 
at all. I have not seen such a design in systems that actually work. I know 
there are query engines that try to do that, but their use is limited. You 
cannot scale. 

I always think of Parquet or a columnar data store as the repository of 
historical data that came from the OLTP world. You do not want to touch it once 
you have created it. You want to have a strategy where you batch the recent 
data, create the historical data and move on. 
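
One way to picture that write-once strategy, as a hedged Python sketch: every 
batch becomes a new file inside its partition directory, and existing files 
are never reopened. The layout (<table>/<partition>/batch-<n>.jsonl) and the 
helper name write_batch are assumptions for illustration, and a real pipeline 
would write Parquet rather than JSON lines.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def write_batch(table_dir: str, partition: str, rows: List[Dict[str, Any]]) -> str:
    """Write rows as a new immutable file under table_dir/partition; return its path."""
    part_dir = os.path.join(table_dir, partition)
    os.makedirs(part_dir, exist_ok=True)
    # Next batch number in this partition (assumes a single writer).
    seq = len(os.listdir(part_dir))
    path = os.path.join(part_dir, f"batch-{seq:06d}.jsonl")
    # Mode "x" fails if the file already exists, so history is never overwritten.
    with open(path, "x", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path


table = tempfile.mkdtemp(prefix="mytable_")
first = write_batch(table, "2017", [{"ts": 1, "amount": 10}])
second = write_batch(table, "2017", [{"ts": 2, "amount": 20}])
files = sorted(os.listdir(os.path.join(table, "2017")))
```

New data only ever adds files, so nothing already written has to be read back 
or rewritten.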

My 2 cents.

Saurabh

On Jul 26, 2017, at 6:58 PM, Divya Gehlot <[email protected]> wrote:

Yes Paul, I am looking for the insert-into-partition feature.
That way we only have to create the file for that particular partition
when new data comes in, or when an update is required.
Otherwise, every time data comes in we have to run the view and recreate the
Parquet files for the whole data set, which is very time consuming, especially
when the data is being visualized in a real-time dashboard.

Thanks,
Divya

> On 27 July 2017 at 08:40, Paul Rogers <[email protected]> wrote:
> 
> Hi Divya,
> 
> Seems that you are asking for an “INSERT INTO” feature (DRILL-3534). The
> idea would be to create new Parquet files in an existing partition
> structure. Work on that feature has not yet started, so the workarounds
> provided might help you for now.
> 
> - Paul
> 
>> On Jul 26, 2017, at 8:46 AM, Saurabh Mahapatra <[email protected]> wrote:
>> 
>> Does Drill provide that kind of functionality? Theoretically yes. CTAS
>> should work. But your cluster has to be sized. But I would never put
>> something in such a pipeline without adequate testing. And I would always
>> consider a lambda architecture to ensure that if this path were to fail
>> (with Drill or any other combination of tools), you can recover from the
>> failure. Each failure that you have puts you behind. If you have several
>> failures, you will be backlogged and need a mechanism to catch up.
>> 
>> For data growth, you would need to go back to the source of the data and
>> estimate the row cardinality. If this is coming from an OLTP system, then it
>> is related to the volume of transactions in the business process. If you do
>> not understand that load, your system will eventually start failing in the
>> future, with Drill or otherwise.
>> 
>> Sizing and testing. Just do it.
>> 
>> Thanks,
>> Saurabh
>> 
>> 
>> 
>> On Wed, Jul 26, 2017 at 2:52 AM, Divya Gehlot <[email protected]>
>> wrote:
>> 
>>> The data size for each hour is not big, but it will grow over time.
>>> Say I have data for 2 years and new data arrives hourly: recreating the
>>> Parquet table every time is not a feasible solution.
>>> In Hive, by comparison, you create the partition and insert the data into
>>> it accordingly.
>>> I was looking for that kind of solution.
>>> Does Drill provide that kind of functionality?
>>> 
>>> Thanks,
>>> Divya
>>> 
>>> 
>>> On 26 July 2017 at 15:04, Saurabh Mahapatra <[email protected]> wrote:
>>> 
>>>> I always recommend against using CTAS as a shortcut for a large ETL-type
>>>> workload. You will need to size your Drill cluster accordingly. Consider
>>>> using Hive or Spark instead.
>>>> 
>>>> What are the source file formats? For every hour, what is the size and the
>>>> number of rows for that data? Are you doing any aggregations? And what is
>>>> the lag between the streaming data and data available for analytics that
>>>> you are willing to tolerate?
>>>> 
>>>> On Tue, Jul 25, 2017 at 11:27 PM, rahul challapalli <
>>>> [email protected]> wrote:
>>>> 
>>>>> I am not aware of any clean way to do this. However, if your data is
>>>>> partitioned by directory, you can use the hack below, which
>>>>> leverages temporary tables [1]. Essentially, you back up your partition
>>>>> to a temp table, then overwrite it with the union of the new partition
>>>>> data and the existing partition data. This way we are not overwriting
>>>>> the entire table.
>>>>> 
>>>>> -- back up the existing partition into a session-scoped temp table
>>>>> create temporary table mytable_2017 (col1, col2....) as
>>>>>   select col1, col2, ...... from mytable where dir0 = '2017';
>>>>> drop table `mytable/2017`;
>>>>> -- note: union drops duplicate rows; use union all to keep them
>>>>> create table `mytable/2017` as
>>>>>   select col1, col2 ......... from new_partition_data
>>>>>   union
>>>>>   select col1, col2 ......... from mytable_2017;
>>>>> drop table mytable_2017;
>>>>> 
>>>>> Caveat: Temporary tables get dropped automatically if the session ends or
>>>>> the drillbit crashes. In the above sequence, if the connection between the
>>>>> client and drillbit gets dropped (there are known issues causing this)
>>>>> after the "DROP" statement executes, then your partition data is lost
>>>>> forever. And since Drill doesn't support transactions, this approach is
>>>>> dangerous.
>>>>> 
>>>>> [1] https://drill.apache.org/docs/create-temporary-table-as-cttas/
>>>>> 
>>>>> 
>>>>> On Tue, Jul 25, 2017 at 10:52 PM, Divya Gehlot <[email protected]> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> I am new to Apache Drill.
>>>>>> I have data coming in every hour, but when I searched I couldn't find
>>>>>> an insert-into-partition command in Apache Drill.
>>>>>> How can we insert data into a particular partition without rewriting the
>>>>>> whole data set?
>>>>>> 
>>>>>> 
>>>>>> Appreciate the help.
>>>>>> Thanks,
>>>>>> Divya
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
> 
>
