RE: append data to already existing table saved in parquet format

Dan Holmes Thu, 27 Jul 2017 05:30:00 -0700

I have done something similar.  Each directory represents a year and the 
content of each directory is a parquet file.  I SELECT at the directory level 
("FROM dfs.otp").  As I added directories DRILL just figured it out.  I didn't 
have to do anything special.  When I created the parquet files I used the 
PARTITION BY `YEAR`.  I can now use dir0 = 'year' or `year` = 2008 to get 
pruning.


dan@ubuntu:~/otp$ ls
1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008
dan@ubuntu:~/otp$ ls 2000
0_0_1.parquet

Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com 

-----Original Message-----
From: Divya Gehlot [mailto:[email protected]] 
Sent: Thursday, July 27, 2017 1:56 AM
To: [email protected]
Subject: Re: append data to already existing table saved in parquet format

Hi Paul,
Let my try your approach of CTAS and save to  partition directory structure .

Thanks for the suggestion.

Thanks,
Divya

On 27 July 2017 at 11:57, Paul Rogers <[email protected]> wrote:

> Hi All,
>
> Saurabh, you are right. But, since Parquet does not allow appending to 
> existing files, we have to do the logical equivalent which is to 
> create a new Parquet file. For it to be part of the same “table” it 
> must be part of an existing partition structure as Divya described.
>
> The trick here is to choose a proper time grain. Too small and you end 
> up with a very large number of files, and performance will suffer. 
> (Once a second, for example, is too frequent.) Too slow and people 
> don’t get near-real-time results. But, with the hour grain, Divya is 
> using, then the number of files will not be too large per directory, 
> and each file might be of a reasonable size.
>
> Using Kafka to batch data would be a fine idea.
>
> Of course, this is still not as good as Saurabh's former project, 
> Druid, which builds aggregated cubes on the fly and has a lambda 
> architecture to allow querying both immediate and historical data. 
> Still, Divya’s design can work fine for some use cases when latency is 
> not an issue and data volume is reasonable.
>
> It would help if Drill had INSERT INTO support. But, I wonder, can it 
> be made to work with Drill today? Perhaps the query can simply include 
> the proper target directory in the CTAS statement. That is, data for 
> 2017-07-25
> 02:00 would go into “2017/07/26/2000.parquet”, say. That is, 
> do-it-yourself partitioning. I hope Drill won’t care how the Parquet 
> files got into the directories, only that the directories have the 
> expected structure. (Is this accurate? Haven’t tried it myself…)
>
> With single-threaded, hourly updates, there is no worry about the name 
> collisions and other tricky issues that INSERT INTO will have to solve.
>
> Divya, have you tried this solution?
>
> Thanks,
>
> - Paul
>
> > On Jul 26, 2017, at 7:32 PM, Saurabh Mahapatra <
> [email protected]> wrote:
> >
> > But append only means you are adding event record to a table(forget 
> > the
> layout for a while). That means you have to write to the end of a 
> table. If the writes are too many, you have to batch them and then 
> convert them into a column format.
> >
> > This to me sounds like a Kafka workflow where you keeping ingesting
> event data, then batch process it ( or stream process it). Writing or 
> appending to a columnar store when you data is in a row like format 
> does not sound efficient at all. I have not seen such a design in 
> systems that actually work. I know there are query engines that try to 
> do that but the use is limited. You cannot scale.
> >
> > I always think of Parquet or a columnar data store as the repository 
> > of
> historical data that came from the OLTP world. You do not want to 
> touch it once you created it. You want to have a strategy where you 
> batch the recent data, create the historical data and move on.
> >
> > My 2 cents.
> >
> > Saurabh
> >
> > On Jul 26, 2017, at 6:58 PM, Divya Gehlot <[email protected]>
> wrote:
> >
> > Yes Paul I am looking for the insert into partition feature .
> > In this way we just have to create the file for that particular 
> > partition when new data comes in or any updation if its required .
> > Else every time when data comes in have run the view and recreate 
> > the parquet files for whole data set which is very time consuming 
> > specially when your data is being visualized in some real time dashboard .
> >
> > Thanks,
> > Divya
> >
> >> On 27 July 2017 at 08:40, Paul Rogers <[email protected]> wrote:
> >>
> >> Hi Divya,
> >>
> >> Seems that you are asking for an “INSERT INTO” feature 
> >> (DRILL-3534). The idea would be to create new Parquet files into an 
> >> existing partition structure. That feature has not yet been 
> >> started. So, the workarounds provided might help you for now.
> >>
> >> - Paul
> >>
> >>> On Jul 26, 2017, at 8:46 AM, Saurabh Mahapatra <
> >> [email protected]> wrote:
> >>>
> >>> Does Drill provide that kind of functionality? Theoretically yes. 
> >>> CTAS should work. But your cluster has to be sized. But I would 
> >>> never put something in such a pipeline without adequate testing. 
> >>> And I would
> always
> >>> consider a lambda architecture to ensure that if this path were to 
> >>> fail (with Drill or any other combination of tools), you can 
> >>> recover from
> the
> >>> failure. Each failure that you have puts you behind. If you have
> several
> >>> failures, you will be backlogged and need a mechanism to catch up.
> >>>
> >>> For data growth, you would need to go back to the source of the 
> >>> data
> and
> >>> estimate the row cardinality. If this is coming from a OLTP 
> >>> system,
> then
> >> it
> >>> is related to volume of transactions in the business process. If 
> >>> you do
> >> not
> >>> understand that load, your system will eventually start failing in 
> >>> the future with Drill or otherwise.
> >>>
> >>> Sizing and testing. Just do it.
> >>>
> >>> Thanks,
> >>> Saurabh
> >>>
> >>>
> >>>
> >>> On Wed, Jul 26, 2017 at 2:52 AM, Divya Gehlot 
> >>> <[email protected]
> >
> >>> wrote:
> >>>
> >>>> The data size is not big for every hour but  data size will grow 
> >>>> with
> >> the
> >>>> time say if I have data for 2 years and data is coming on hourly 
> >>>> basis
> >> and
> >>>> everytime creating the paruqet table is not the feasible solution .
> >>>> Likewise for hive create the partition and insert the data into
> >> partition
> >>>> accordingly .
> >>>> Was lookiing for that kind of solution.
> >>>> Does Drill provides that kind of functionalty ?
> >>>>
> >>>> Thanks,
> >>>> Divya
> >>>>
> >>>>
> >>>> On 26 July 2017 at 15:04, Saurabh Mahapatra <
> >> [email protected]>
> >>>> wrote:
> >>>>
> >>>>> I always recommend against using CTAS as a shortcut for a ETL 
> >>>>> type
> >> large
> >>>>> workload. You will need to size your Drill cluster accordingly.
> >> Consider
> >>>>> using Hive or Spark instead.
> >>>>>
> >>>>> What are the source file formats? For every hour, what is the 
> >>>>> size
> and
> >> the
> >>>>> number of rows for that data? Are you doing any aggregations? 
> >>>>> And
> what
> >> is
> >>>>> the lag between the streaming data and data available for 
> >>>>> analytics
> >> that
> >>>>> you are willing to tolerate?
> >>>>>
> >>>>> On Tue, Jul 25, 2017 at 11:27 PM, rahul challapalli < 
> >>>>> [email protected]> wrote:
> >>>>>
> >>>>>> I am not aware of any clean way to do this. However if your 
> >>>>>> data is partitioned based on directories, then you can use the 
> >>>>>> below hack
> >> which
> >>>>>> leverages temporary tables [1]. Essentially, you backup your
> partition
> >>>>> to a
> >>>>>> temp table, then override it by taking the union of new 
> >>>>>> partition
> data
> >>>>> and
> >>>>>> existing partition data. This way we are not over-writing the 
> >>>>>> entire
> >>>>> table.
> >>>>>>
> >>>>>> create temporary table mytable_2017 (col1, col2....)  as select
> col1,
> >>>>> col2,
> >>>>>> ......from mytable where dir0 = "2017"; drop table 
> >>>>>> `mytable/2017`; create table `mytable/2017` as  select col1, 
> >>>>>> col2 .........from new_partition_data  union  select col1, col2 
> >>>>>> ......... from mytable_2017; drop table mytable_2017;
> >>>>>>
> >>>>>> Caveat : Temporary tables get dropped automatically if the 
> >>>>>> session
> >> ends
> >>>>> or
> >>>>>> the drillbit crashes. In the above sequence, if the connection 
> >>>>>> gets
> >>>>> dropped
> >>>>>> (there are known issues causing this) between the client and
> drillbit
> >>>>> after
> >>>>>> executing the "DROP" statement, then your partition data is 
> >>>>>> lost
> >>>>> forever.
> >>>>>> And since drill doesn't support transactions, the mentioned 
> >>>>>> approach
> >> is
> >>>>>> dangerous.
> >>>>>>
> >>>>>> [1] 
> >>>>>> https://drill.apache.org/docs/create-temporary-table-as-cttas/
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Jul 25, 2017 at 10:52 PM, Divya Gehlot <
> >> [email protected]
> >>>>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>> I am naive to Apache drill.
> >>>>>>> As I have data coming in every hour , when I searched I 
> >>>>>>> couldnt
> find
> >>>>> the
> >>>>>>> insert into partition command in Apache drill.
> >>>>>>> How can we insert data to particular partition without 
> >>>>>>> rewriting
> the
> >>>>>> whole
> >>>>>>> data set ?
> >>>>>>>
> >>>>>>>
> >>>>>>> Appreciate the help.
> >>>>>>> Thanks,
> >>>>>>> Divya
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>
> >>
>
>

RE: append data to already existing table saved in parquet format

Reply via email to