Hi Divya,

It seems you are asking for an “INSERT INTO” feature (DRILL-3534). The idea would be to create new Parquet files in an existing partition structure. Work on that feature has not yet started, so the workarounds provided might help you for now.

- Paul

> On Jul 26, 2017, at 8:46 AM, Saurabh Mahapatra <[email protected]> wrote:
>
> Does Drill provide that kind of functionality? Theoretically yes. CTAS
> should work. But your cluster has to be sized. But I would never put
> something in such a pipeline without adequate testing. And I would always
> consider a lambda architecture to ensure that if this path were to fail
> (with Drill or any other combination of tools), you can recover from the
> failure. Each failure that you have puts you behind. If you have several
> failures, you will be backlogged and need a mechanism to catch up.
>
> For data growth, you would need to go back to the source of the data and
> estimate the row cardinality. If this is coming from an OLTP system, then it
> is related to the volume of transactions in the business process. If you do
> not understand that load, your system will eventually start failing in the
> future with Drill or otherwise.
>
> Sizing and testing. Just do it.
>
> Thanks,
> Saurabh
>
> On Wed, Jul 26, 2017 at 2:52 AM, Divya Gehlot <[email protected]> wrote:
>
>> The data size is not big for any single hour, but it will grow with time.
>> Say I have data for 2 years, with data coming in on an hourly basis:
>> recreating the Parquet table every time is not a feasible solution.
>> In Hive, by comparison, you create the partition and insert the data into
>> that partition accordingly.
>> I was looking for that kind of solution.
>> Does Drill provide that kind of functionality?
>>
>> Thanks,
>> Divya
>>
>> On 26 July 2017 at 15:04, Saurabh Mahapatra <[email protected]> wrote:
>>
>>> I always recommend against using CTAS as a shortcut for an ETL-type large
>>> workload. You will need to size your Drill cluster accordingly. Consider
>>> using Hive or Spark instead.
>>>
>>> What are the source file formats? For every hour, what is the size and the
>>> number of rows for that data? Are you doing any aggregations? And what is
>>> the lag between the streaming data and data available for analytics that
>>> you are willing to tolerate?
>>>
>>> On Tue, Jul 25, 2017 at 11:27 PM, rahul challapalli <[email protected]> wrote:
>>>
>>>> I am not aware of any clean way to do this. However, if your data is
>>>> partitioned based on directories, then you can use the hack below, which
>>>> leverages temporary tables [1]. Essentially, you back up your partition
>>>> to a temp table, then overwrite it by taking the union of the new
>>>> partition data and the existing partition data. This way we are not
>>>> overwriting the entire table.
>>>>
>>>> create temporary table mytable_2017 (col1, col2....) as
>>>> select col1, col2, ...... from mytable where dir0 = '2017';
>>>> drop table `mytable/2017`;
>>>> create table `mytable/2017` as
>>>> select col1, col2 ......... from new_partition_data
>>>> union
>>>> select col1, col2 ......... from mytable_2017;
>>>> drop table mytable_2017;
>>>>
>>>> Caveat: Temporary tables get dropped automatically if the session ends or
>>>> the drillbit crashes. In the above sequence, if the connection between
>>>> the client and the drillbit gets dropped (there are known issues causing
>>>> this) after executing the "DROP" statement, then your partition data is
>>>> lost forever. And since Drill doesn't support transactions, the mentioned
>>>> approach is dangerous.
>>>>
>>>> [1] https://drill.apache.org/docs/create-temporary-table-as-cttas/
>>>>
>>>> On Tue, Jul 25, 2017 at 10:52 PM, Divya Gehlot <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>> I am new to Apache Drill.
>>>>> I have data coming in every hour, and when I searched I couldn't find
>>>>> an insert-into-partition command in Apache Drill.
>>>>> How can we insert data into a particular partition without rewriting
>>>>> the whole data set?
>>>>>
>>>>> Appreciate the help.
>>>>> Thanks,
>>>>> Divya
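
A possible lower-risk variant of the directory-based workaround quoted above, as a minimal sketch only: rather than dropping and recreating a whole partition, each new hour could be written with CTAS into its own new subdirectory under the existing table directory. The workspace (dfs.tmp), the year/month/day/hour directory layout, the column names, and the new_partition_data source are all assumptions for illustration, not something taken from the thread. This has not been validated against a particular Drill version, so it should be tested before being put into a pipeline.

```sql
-- Assumptions (hypothetical): `mytable` lives in a writable workspace
-- (dfs.tmp is only a placeholder) and is laid out as
-- mytable/<year>/<month>/<day>/<hour>. Column names and the
-- new_partition_data source are illustrative.
ALTER SESSION SET `store.format` = 'parquet';

-- Write only the new hour into its own, not-yet-existing subdirectory.
-- No existing partition is dropped, so a lost connection at this point
-- does not delete previously written data.
CREATE TABLE dfs.tmp.`mytable/2017/07/26/08` AS
SELECT col1, col2
FROM dfs.tmp.`new_partition_data`;

-- Querying the parent directory picks up the new files; the directory
-- levels are exposed as dir0, dir1, dir2, dir3.
SELECT col1, col2
FROM dfs.tmp.`mytable`
WHERE dir0 = '2017' AND dir1 = '07' AND dir2 = '26' AND dir3 = '08';
```

The trade-off is that every hourly load adds new files, so the column list and types have to stay consistent with the files already in the table, and many small hourly files may eventually need to be compacted.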
