Hello, which one is more appropriate for my case (a Spark job that writes new files into a partitioned HDFS directory): ALTER TABLE ... RECOVER PARTITIONS, or the REFRESH statement on a specific partition?
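For context on the trade-off being asked about: RECOVER PARTITIONS scans the whole table directory tree for new partitions, while REFRESH with a PARTITION clause (supported since Impala 2.7) reloads metadata for one known partition only. Since the Spark job knows exactly which year/month/day it just wrote, the targeted REFRESH is the cheaper statement to issue. A minimal sketch that builds such a statement (the table name and partition values are hypothetical, for illustration only):

```python
def build_refresh(table: str, year: int, month: int, day: int) -> str:
    # REFRESH with a PARTITION clause reloads file metadata for a single
    # partition instead of scanning the whole table directory tree, which
    # is what ALTER TABLE ... RECOVER PARTITIONS does.
    return (f"REFRESH {table} "
            f"PARTITION (year={year}, month={month}, day={day})")

# The statement a cron or Spark post-write hook would issue after writing
# one file under .../year=2018/month=8/day=13/ (hypothetical table name):
print(build_refresh("events_db.event_clicks", 2018, 8, 13))
```

Note that REFRESH ... PARTITION only reloads an already-known partition; a partition directory created outside Impala still needs RECOVER PARTITIONS (or ALTER TABLE ... ADD PARTITION) once before it can be refreshed.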
I'm running 4800 statements a day and would like to follow best practices for my case.

On Mon, Aug 13, 2018 at 5:00 PM Fawze Abujaber <fawz...@gmail.com> wrote:
> Hi Jeszy,
>
> My problem is that it happens with specific events and not sporadically,
> which is what led me to rule out a bottleneck at the NameNode or the
> Catalog Server.
>
> On Mon, Aug 13, 2018 at 4:52 PM Jeszy <jes...@gmail.com> wrote:
>
>> I'd try to trace the update through catalog and statestore. SYNC_DDL=1
>> can be a problem, especially if there's a slow impalad or a lot of
>> concurrent catalog updates (a lot of data to stream from the statestore
>> node). The NameNode can also become a bottleneck. Catalog logs will
>> help point these out.
>>
>> On 13 August 2018 at 14:30, Fawze Abujaber <fawz...@gmail.com> wrote:
>> > Thanks Jeszy for your quick response. We are far away from moving to
>> > Kudu.
>> >
>> > Trying to figure out what can cause RECOVER PARTITIONS to run for a
>> > long time on some of the events.
>> >
>> > ===================
>> > Query (id=9a4ed4eabe44c9e5:3f0cde6300000000)
>> > Summary
>> > Session ID: 4c40102c98913f44:780678e7979e929b
>> > Session Type: BEESWAX
>> > Start Time: 2018-08-13 08:00:02.757409000
>> > End Time: 2018-08-13 08:09:15.258627000
>> > Query Type: DDL
>> > Query State: FINISHED
>> > Query Status: OK
>> > Impala Version: impalad version 2.10.0-cdh5.13.0 RELEASE (build
>> > 2511805f1eaa991df1460276c7e9f19d819cd4e4)
>> > User: AAAA
>> > Connected User: AAAA
>> > Delegated User:
>> > Network Address: ::ffff:172.16.136.1:48037
>> > Default Db: default
>> > Sql Statement: alter table BBBB recover partitions
>> > Coordinator: CCCC:22000
>> > Query Options (set by configuration): SYNC_DDL=1
>> > Query Options (set by configuration and planner): SYNC_DDL=1,MT_DOP=0
>> > DDL Type: ALTER_TABLE
>> >
>> > Query Timeline
>> > Query submitted: 316.03us (316031)
>> > Planning finished: 5.65s (5649375629)
>> > Request finished: 9.2m (552495528168)
>> > Unregister query: 9.2m (552500950559)
>> > ImpalaServer
>> > - CatalogOpExecTimer: 10.73s (10730796494)
>> > - ClientFetchWaitTimer: 5ms (5411560)
>> > - InactiveTotalTime: 0ns (0)
>> > - RowMaterializationTimer: 0ns (0)
>> > - TotalTime: 0ns (0)
>> >
>> > ==================
>> >
>> > Query (id=ae4266aad3cea1ed:754c9c3400000000)
>> > Summary
>> > Session ID: 24401399943bebf8:96c267d02619e7ac
>> > Session Type: BEESWAX
>> > Start Time: 2018-08-13 08:00:10.625885000
>> > End Time: 2018-08-13 08:09:15.194417000
>> > Query Type: DDL
>> > Query State: FINISHED
>> > Query Status: OK
>> > Impala Version: impalad version 2.10.0-cdh5.13.0 RELEASE (build
>> > 2511805f1eaa991df1460276c7e9f19d819cd4e4)
>> > User: AAAA
>> > Connected User: AAAA
>> > Delegated User:
>> > Network Address: ::ffff:172.16.136.1:48044
>> > Default Db: default
>> > Sql Statement: alter table DDDD recover partitions
>> > Coordinator: EEEE:22000
>> > Query Options (set by configuration): SYNC_DDL=1
>> > Query Options (set by configuration and planner): SYNC_DDL=1,MT_DOP=0
>> > DDL Type: ALTER_TABLE
>> >
>> > Query Timeline
>> > Query submitted: 502.36us (502357)
>> > Planning finished: 1ms (1077718)
>> > Request finished: 9.1m (544563396235)
>> > Unregister query: 9.1m (544568289284)
>> > ImpalaServer
>> > - CatalogOpExecTimer: 8.5m (511375736191)
>> > - ClientFetchWaitTimer: 4ms (4882019)
>> > - InactiveTotalTime: 0ns (0)
>> > - RowMaterializationTimer: 0ns (0)
>> > - TotalTime: 0ns (0)
>> >
>> >
>> > On Mon, Aug 13, 2018 at 12:55 PM Jeszy <jes...@gmail.com> wrote:
>> >>
>> >> Hey Fawze,
>> >>
>> >> Hm.
>> >> Just to make sure I got this right: you have 100 tables, each
>> >> partitioned by y/m/d, and you're updating a single partition of all
>> >> 100 tables every 20 minutes via a Spark job. Is that correct? I can't
>> >> think of a way to optimize your current setup for statement count
>> >> specifically (there's no way to refresh 100 tables in fewer than 100
>> >> statements).
>> >> However, it sounds like you would benefit from using Kudu in this
>> >> case. With Kudu, you don't need to REFRESH / RECOVER to pick up new
>> >> data; it becomes available immediately after ingestion. You could
>> >> create a landing table in Kudu, then migrate data to HDFS daily (or
>> >> so), and query a view UNIONing these two tables. With the daily
>> >> Kudu->HDFS move, you also remove the need for compaction on the HDFS
>> >> side.
>> >>
>> >> HTH
>> >> Jeszy
>> >>
>> >> On 13 August 2018 at 11:08, Fawze Abujaber <fawz...@gmail.com> wrote:
>> >> > Hi Community,
>> >> >
>> >> > I have a Spark job that produces Parquet files on HDFS, partitioned
>> >> > by year, month and day.
>> >> > The HDFS structure has 100 folders (one event per folder, and these
>> >> > events are partitioned by year, month and day).
>> >> > The job runs every 20 minutes and writes files into the 100 event
>> >> > folders (adding one file under the relevant partition for each
>> >> > event).
>> >> > On top of each event I have an external Impala table that I
>> >> > defined using Impala, partitioned by year, month and day.
>> >> >
>> >> > Is there a way to avoid running ALTER TABLE AAAA RECOVER PARTITIONS
>> >> > on the 100 tables every 20 minutes? (The RECOVER statements are run
>> >> > by an external cron job that scans the main folder and runs RECOVER
>> >> > PARTITIONS on all the events under it.)
>> >> >
>> >> > I know that the RECOVER PARTITIONS clause scans a partitioned table
>> >> > to detect whether any new partition directories were added outside
>> >> > of Impala, but I'm wondering if there is any other way to avoid
>> >> > running 4800 statements per day while keeping the refresh rate
>> >> > high.
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Take Care
>> >> > Fawze Abujaber
>> >
>> >
>> > --
>> > Take Care
>> > Fawze Abujaber
>
>
> --
> Take Care
> Fawze Abujaber

--
Take Care
Fawze Abujaber
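One way to cut down the per-table work described in the thread: if the Spark job (or the cron wrapper around it) can see which files were just written, the partitions that actually changed can be derived from the file paths, so each table gets one targeted REFRESH for its touched partition instead of a full RECOVER PARTITIONS scan. A rough sketch under that assumption (the paths, directory layout, and helper function are hypothetical, not from the thread):

```python
import re

# Partition directories follow the layout the thread describes:
# .../year=YYYY/month=M/day=D/<file>.parquet
PART_RE = re.compile(r"year=(\d+)/month=(\d+)/day=(\d+)")

def partitions_touched(paths):
    """Return the distinct (year, month, day) tuples found in HDFS paths.

    Each tuple would then drive one targeted statement such as
    REFRESH <table> PARTITION (year=Y, month=M, day=D), rather than a
    whole-table ALTER TABLE ... RECOVER PARTITIONS.
    """
    parts = set()
    for p in paths:
        m = PART_RE.search(p)
        if m:
            parts.add(tuple(int(g) for g in m.groups()))
    return sorted(parts)

# Hypothetical example paths for one event folder:
paths = [
    "/data/events/clicks/year=2018/month=8/day=13/part-0001.parquet",
    "/data/events/clicks/year=2018/month=8/day=13/part-0002.parquet",
]
print(partitions_touched(paths))  # [(2018, 8, 13)]
```

This doesn't reduce the statement count below one per table per run (as Jeszy notes, 100 tables need at least 100 statements), but each statement touches a single known partition, which avoids the full directory scan that RECOVER PARTITIONS performs.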