I have done something similar. Each directory represents a year and the
content of each directory is a parquet file. I SELECT at the directory level
("FROM dfs.otp"). As I added directories DRILL just figured it out. I didn't
have to do anything special. When I created the parquet files I used the
PARTITION BY `YEAR`. I can now use dir0 = 'year' or `year` = 2008 to get
pruning.
dan@ubuntu:~/otp$ ls
1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
dan@ubuntu:~/otp$ ls 2000
0_0_1.parquet
Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com
-----Original Message-----
From: Divya Gehlot [mailto:[email protected]]
Sent: Thursday, July 27, 2017 1:56 AM
To: [email protected]
Subject: Re: append data to already existing table saved in parquet format
Hi Paul,
Let my try your approach of CTAS and save to partition directory structure .
Thanks for the suggestion.
Thanks,
Divya
On 27 July 2017 at 11:57, Paul Rogers <[email protected]> wrote:
> Hi All,
>
> Saurabh, you are right. But, since Parquet does not allow appending to
> existing files, we have to do the logical equivalent which is to
> create a new Parquet file. For it to be part of the same “table” it
> must be part of an existing partition structure as Divya described.
>
> The trick here is to choose a proper time grain. Too small and you end
> up with a very large number of files, and performance will suffer.
> (Once a second, for example, is too frequent.) Too slow and people
> don’t get near-real-time results. But, with the hour grain, Divya is
> using, then the number of files will not be too large per directory,
> and each file might be of a reasonable size.
>
> Using Kafka to batch data would be a fine idea.
>
> Of course, this is still not as good as Saurabh's former project,
> Druid, which builds aggregated cubes on the fly and has a lambda
> architecture to allow querying both immediate and historical data.
> Still, Divya’s design can work fine for some use cases when latency is
> not an issue and data volume is reasonable.
>
> It would help if Drill had INSERT INTO support. But, I wonder, can it
> be made to work with Drill today? Perhaps the query can simply include
> the proper target directory in the CTAS statement. That is, data for
> 2017-07-25
> 02:00 would go into “2017/07/26/2000.parquet”, say. That is,
> do-it-yourself partitioning. I hope Drill won’t care how the Parquet
> files got into the directories, only that the directories have the
> expected structure. (Is this accurate? Haven’t tried it myself…)
>
> With single-threaded, hourly updates, there is no worry about the name
> collisions and other tricky issues that INSERT INTO will have to solve.
>
> Divya, have you tried this solution?
>
> Thanks,
>
> - Paul
>
> > On Jul 26, 2017, at 7:32 PM, Saurabh Mahapatra <
> [email protected]> wrote:
> >
> > But append only means you are adding event record to a table(forget
> > the
> layout for a while). That means you have to write to the end of a
> table. If the writes are too many, you have to batch them and then
> convert them into a column format.
> >
> > This to me sounds like a Kafka workflow where you keeping ingesting
> event data, then batch process it ( or stream process it). Writing or
> appending to a columnar store when you data is in a row like format
> does not sound efficient at all. I have not seen such a design in
> systems that actually work. I know there are query engines that try to
> do that but the use is limited. You cannot scale.
> >
> > I always think of Parquet or a columnar data store as the repository
> > of
> historical data that came from the OLTP world. You do not want to
> touch it once you created it. You want to have a strategy where you
> batch the recent data, create the historical data and move on.
> >
> > My 2 cents.
> >
> > Saurabh
> >
> > On Jul 26, 2017, at 6:58 PM, Divya Gehlot <[email protected]>
> wrote:
> >
> > Yes Paul I am looking for the insert into partition feature .
> > In this way we just have to create the file for that particular
> > partition when new data comes in or any updation if its required .
> > Else every time when data comes in have run the view and recreate
> > the parquet files for whole data set which is very time consuming
> > specially when your data is being visualized in some real time dashboard .
> >
> > Thanks,
> > Divya
> >
> >> On 27 July 2017 at 08:40, Paul Rogers <[email protected]> wrote:
> >>
> >> Hi Divya,
> >>
> >> Seems that you are asking for an “INSERT INTO” feature
> >> (DRILL-3534). The idea would be to create new Parquet files into an
> >> existing partition structure. That feature has not yet been
> >> started. So, the workarounds provided might help you for now.
> >>
> >> - Paul
> >>
> >>> On Jul 26, 2017, at 8:46 AM, Saurabh Mahapatra <
> >> [email protected]> wrote:
> >>>
> >>> Does Drill provide that kind of functionality? Theoretically yes.
> >>> CTAS should work. But your cluster has to be sized. But I would
> >>> never put something in such a pipeline without adequate testing.
> >>> And I would
> always
> >>> consider a lambda architecture to ensure that if this path were to
> >>> fail (with Drill or any other combination of tools), you can
> >>> recover from
> the
> >>> failure. Each failure that you have puts you behind. If you have
> several
> >>> failures, you will be backlogged and need a mechanism to catch up.
> >>>
> >>> For data growth, you would need to go back to the source of the
> >>> data
> and
> >>> estimate the row cardinality. If this is coming from a OLTP
> >>> system,
> then
> >> it
> >>> is related to volume of transactions in the business process. If
> >>> you do
> >> not
> >>> understand that load, your system will eventually start failing in
> >>> the future with Drill or otherwise.
> >>>
> >>> Sizing and testing. Just do it.
> >>>
> >>> Thanks,
> >>> Saurabh
> >>>
> >>>
> >>>
> >>> On Wed, Jul 26, 2017 at 2:52 AM, Divya Gehlot
> >>> <[email protected]
> >
> >>> wrote:
> >>>
> >>>> The data size is not big for every hour but data size will grow
> >>>> with
> >> the
> >>>> time say if I have data for 2 years and data is coming on hourly
> >>>> basis
> >> and
> >>>> everytime creating the paruqet table is not the feasible solution .
> >>>> Likewise for hive create the partition and insert the data into
> >> partition
> >>>> accordingly .
> >>>> Was lookiing for that kind of solution.
> >>>> Does Drill provides that kind of functionalty ?
> >>>>
> >>>> Thanks,
> >>>> Divya
> >>>>
> >>>>
> >>>> On 26 July 2017 at 15:04, Saurabh Mahapatra <
> >> [email protected]>
> >>>> wrote:
> >>>>
> >>>>> I always recommend against using CTAS as a shortcut for a ETL
> >>>>> type
> >> large
> >>>>> workload. You will need to size your Drill cluster accordingly.
> >> Consider
> >>>>> using Hive or Spark instead.
> >>>>>
> >>>>> What are the source file formats? For every hour, what is the
> >>>>> size
> and
> >> the
> >>>>> number of rows for that data? Are you doing any aggregations?
> >>>>> And
> what
> >> is
> >>>>> the lag between the streaming data and data available for
> >>>>> analytics
> >> that
> >>>>> you are willing to tolerate?
> >>>>>
> >>>>> On Tue, Jul 25, 2017 at 11:27 PM, rahul challapalli <
> >>>>> [email protected]> wrote:
> >>>>>
> >>>>>> I am not aware of any clean way to do this. However if your
> >>>>>> data is partitioned based on directories, then you can use the
> >>>>>> below hack
> >> which
> >>>>>> leverages temporary tables [1]. Essentially, you backup your
> partition
> >>>>> to a
> >>>>>> temp table, then override it by taking the union of new
> >>>>>> partition
> data
> >>>>> and
> >>>>>> existing partition data. This way we are not over-writing the
> >>>>>> entire
> >>>>> table.
> >>>>>>
> >>>>>> create temporary table mytable_2017 (col1, col2....) as select
> col1,
> >>>>> col2,
> >>>>>> ......from mytable where dir0 = "2017"; drop table
> >>>>>> `mytable/2017`; create table `mytable/2017` as select col1,
> >>>>>> col2 .........from new_partition_data union select col1, col2
> >>>>>> ......... from mytable_2017; drop table mytable_2017;
> >>>>>>
> >>>>>> Caveat : Temporary tables get dropped automatically if the
> >>>>>> session
> >> ends
> >>>>> or
> >>>>>> the drillbit crashes. In the above sequence, if the connection
> >>>>>> gets
> >>>>> dropped
> >>>>>> (there are known issues causing this) between the client and
> drillbit
> >>>>> after
> >>>>>> executing the "DROP" statement, then your partition data is
> >>>>>> lost
> >>>>> forever.
> >>>>>> And since drill doesn't support transactions, the mentioned
> >>>>>> approach
> >> is
> >>>>>> dangerous.
> >>>>>>
> >>>>>> [1]
> >>>>>> https://drill.apache.org/docs/create-temporary-table-as-cttas/
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Jul 25, 2017 at 10:52 PM, Divya Gehlot <
> >> [email protected]
> >>>>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>> I am naive to Apache drill.
> >>>>>>> As I have data coming in every hour , when I searched I
> >>>>>>> couldnt
> find
> >>>>> the
> >>>>>>> insert into partition command in Apache drill.
> >>>>>>> How can we insert data to particular partition without
> >>>>>>> rewriting
> the
> >>>>>> whole
> >>>>>>> data set ?
> >>>>>>>
> >>>>>>>
> >>>>>>> Appreciate the help.
> >>>>>>> Thanks,
> >>>>>>> Divya
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>
> >>
>
>