Hi Jeszy,

My problem is that it happens with specific events rather than sporadically, which is what led me to rule out a bottleneck at the NameNode or the Catalog Server.
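One thing worth noting: both query profiles quoted below show SYNC_DDL=1 set by configuration, which matches Jeszy's suspicion. If synchronous metadata visibility across all coordinators isn't required for these cron-driven statements, a quick experiment (sketch only, using the table name from the first profile) would be overriding it per session:

```sql
-- Experiment: disable SYNC_DDL for this session only, so the DDL returns
-- once the coordinator has applied the change instead of waiting for the
-- catalog update to be broadcast to every impalad.
SET SYNC_DDL=0;
ALTER TABLE BBBB RECOVER PARTITIONS;
```

If the 9-minute wall time disappears with SYNC_DDL=0, the time is being spent propagating the catalog update, not scanning HDFS.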
On Mon, Aug 13, 2018 at 4:52 PM Jeszy <jes...@gmail.com> wrote:
> I'd try to trace the update through the catalog and statestore. SYNC_DDL=1
> can be a problem, especially if there's a slow impalad or a lot of
> concurrent catalog updates (a lot of data to stream from the statestore
> node). The NameNode can also become a bottleneck. The catalog logs will
> help point these out.
>
> On 13 August 2018 at 14:30, Fawze Abujaber <fawz...@gmail.com> wrote:
> > Thanks Jeszy for your quick response. We are far away from moving to Kudu.
> >
> > I'm trying to figure out what can cause RECOVER PARTITIONS to run for a
> > long time on some of the events.
> >
> > ===================
> > Query (id=9a4ed4eabe44c9e5:3f0cde6300000000)
> >   Summary
> >     Session ID: 4c40102c98913f44:780678e7979e929b
> >     Session Type: BEESWAX
> >     Start Time: 2018-08-13 08:00:02.757409000
> >     End Time: 2018-08-13 08:09:15.258627000
> >     Query Type: DDL
> >     Query State: FINISHED
> >     Query Status: OK
> >     Impala Version: impalad version 2.10.0-cdh5.13.0 RELEASE
> >       (build 2511805f1eaa991df1460276c7e9f19d819cd4e4)
> >     User: AAAA
> >     Connected User: AAAA
> >     Delegated User:
> >     Network Address: ::ffff:172.16.136.1:48037
> >     Default Db: default
> >     Sql Statement: alter table BBBB recover partitions
> >     Coordinator: CCCC:22000
> >     Query Options (set by configuration): SYNC_DDL=1
> >     Query Options (set by configuration and planner): SYNC_DDL=1,MT_DOP=0
> >     DDL Type: ALTER_TABLE
> >
> >   Query Timeline
> >     Query submitted: 316.03us (316031)
> >     Planning finished: 5.65s (5649375629)
> >     Request finished: 9.2m (552495528168)
> >     Unregister query: 9.2m (552500950559)
> >   ImpalaServer
> >     - CatalogOpExecTimer: 10.73s (10730796494)
> >     - ClientFetchWaitTimer: 5ms (5411560)
> >     - InactiveTotalTime: 0ns (0)
> >     - RowMaterializationTimer: 0ns (0)
> >     - TotalTime: 0ns (0)
> >
> > ==================
> >
> > Query (id=ae4266aad3cea1ed:754c9c3400000000)
> >   Summary
> >     Session ID: 24401399943bebf8:96c267d02619e7ac
> >     Session Type: BEESWAX
> >     Start Time: 2018-08-13 08:00:10.625885000
> >     End Time: 2018-08-13 08:09:15.194417000
> >     Query Type: DDL
> >     Query State: FINISHED
> >     Query Status: OK
> >     Impala Version: impalad version 2.10.0-cdh5.13.0 RELEASE
> >       (build 2511805f1eaa991df1460276c7e9f19d819cd4e4)
> >     User: AAAA
> >     Connected User: AAAA
> >     Delegated User:
> >     Network Address: ::ffff:172.16.136.1:48044
> >     Default Db: default
> >     Sql Statement: alter table DDDD recover partitions
> >     Coordinator: EEEE:22000
> >     Query Options (set by configuration): SYNC_DDL=1
> >     Query Options (set by configuration and planner): SYNC_DDL=1,MT_DOP=0
> >     DDL Type: ALTER_TABLE
> >
> >   Query Timeline
> >     Query submitted: 502.36us (502357)
> >     Planning finished: 1ms (1077718)
> >     Request finished: 9.1m (544563396235)
> >     Unregister query: 9.1m (544568289284)
> >   ImpalaServer
> >     - CatalogOpExecTimer: 8.5m (511375736191)
> >     - ClientFetchWaitTimer: 4ms (4882019)
> >     - InactiveTotalTime: 0ns (0)
> >     - RowMaterializationTimer: 0ns (0)
> >     - TotalTime: 0ns (0)
> >
> > On Mon, Aug 13, 2018 at 12:55 PM Jeszy <jes...@gmail.com> wrote:
> >> Hey Fawze,
> >>
> >> Hm. Just to make sure I got this right: you have 100 tables, each
> >> partitioned by year/month/day, and you're updating a single partition
> >> of all 100 tables every 20 minutes via a Spark job. Is that correct?
> >> I can't think of a way to optimize your current setup for statement
> >> count specifically (there's no way to refresh 100 tables in fewer
> >> than 100 statements).
> >> However, it sounds like you would benefit from using Kudu in this
> >> case. With Kudu you don't need to REFRESH / RECOVER to pick up new
> >> data; it becomes available immediately after ingestion. You could
> >> create a landing table in Kudu, then migrate data to HDFS daily (or
> >> so), and query a view UNIONing these two tables. With the daily
> >> Kudu->HDFS move, you also remove the need for compaction on the HDFS
> >> side.
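Jeszy's landing-table idea above could look roughly like the following sketch. All names here (events_kudu, events_hdfs, events_v, the column list) are illustrative, not from the thread:

```sql
-- Hypothetical unified view over a Kudu landing table and the HDFS
-- history table; queries see fresh rows immediately, no REFRESH needed.
CREATE VIEW events_v AS
SELECT year, month, day, payload FROM events_kudu   -- visible on insert
UNION ALL
SELECT year, month, day, payload FROM events_hdfs;  -- daily-migrated history
```

The daily Kudu->HDFS job would then delete the migrated rows from the landing table, keeping it small.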
> >>
> >> HTH
> >> Jeszy
> >>
> >> On 13 August 2018 at 11:08, Fawze Abujaber <fawz...@gmail.com> wrote:
> >> > Hi Community,
> >> >
> >> > I have a Spark job that produces parquet files on HDFS with
> >> > partitions year, month, and day.
> >> > The HDFS structure has 100 folders (1 event per folder, and these
> >> > events are partitioned by year, month, and day).
> >> > The job runs every 20 minutes and writes files into the 100 event
> >> > folders (adding one file under the relevant partition for each
> >> > event).
> >> > On top of each event I have an external Impala table that I defined
> >> > using Impala, with partitions year, month, and day.
> >> >
> >> > Is there a way to avoid running ALTER TABLE AAAA RECOVER PARTITIONS
> >> > on the 100 tables every 20 minutes? (The RECOVER statements run from
> >> > an external cron that scans the main folder and runs RECOVER
> >> > PARTITIONS on all the events under it.)
> >> >
> >> > I know that the RECOVER PARTITIONS clause scans a partitioned table
> >> > to detect whether any new partition directories were added outside
> >> > of Impala, but I'm wondering whether there is any other way to avoid
> >> > running 4800 statements per day while keeping the refresh rate high.
> >> >
> >> > --
> >> > Take Care
> >> > Fawze Abujaber
> >
> > --
> > Take Care
> > Fawze Abujaber

--
Take Care
Fawze Abujaber
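For reference, the per-event statement the cron issues is the first form below. The second form is a possible narrower alternative not discussed in the thread: since the Spark job knows exactly which year/month/day partition it just wrote, it (or the cron) could register only that partition instead of scanning the whole directory tree. Table name and partition values are illustrative:

```sql
-- What the cron runs today: scan the table's directory tree for any
-- partition directories added outside of Impala.
ALTER TABLE event_aaaa RECOVER PARTITIONS;

-- Possible narrower alternative (assumes the writer knows the partition
-- it just created): register exactly one partition, no directory scan.
ALTER TABLE event_aaaa ADD IF NOT EXISTS PARTITION (year=2018, month=8, day=13);
```

This doesn't reduce the statement count (still one per table per cycle), but each statement touches far less metadata, which matters when SYNC_DDL=1 makes every catalog update expensive to propagate.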