Hi Jeszy,

My problem is that it happens with specific events rather than sporadically, which is what led me to rule out a bottleneck at the NameNode or the Catalog Server.
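One thing worth noting: both query profiles quoted below show SYNC_DDL=1 set by configuration, which matches Jeszy's suspicion. If synchronous metadata visibility across all coordinators isn't required for these cron-driven statements, a quick experiment (sketch only, using the table name from the first profile) would be overriding it per session:

```sql
-- Experiment: disable SYNC_DDL for this session only, so the DDL returns
-- once the coordinator has applied the change instead of waiting for the
-- catalog update to be broadcast to every impalad.
SET SYNC_DDL=0;
ALTER TABLE BBBB RECOVER PARTITIONS;
```

If the 9-minute wall time disappears with SYNC_DDL=0, the time is being spent propagating the catalog update, not scanning HDFS.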
On Mon, Aug 13, 2018 at 4:52 PM Jeszy <jes...@gmail.com> wrote:
> I'd try to trace the update through the catalog and statestore. SYNC_DDL=1
> can be a problem, especially if there's a slow impalad or a lot of
> concurrent catalog updates (a lot of data to stream from the statestore
> node). The NameNode can also become a bottleneck. The catalog logs will
> help point these out.
>
> On 13 August 2018 at 14:30, Fawze Abujaber <fawz...@gmail.com> wrote:
> > Thanks Jeszy for your quick response. We are far away from moving to Kudu.
> >
> > I'm trying to figure out what can cause RECOVER PARTITIONS to run for a
> > long time on some of the events.
> >
> > ===================
> > Query (id=9a4ed4eabe44c9e5:3f0cde6300000000)
> >   Summary
> >     Session ID: 4c40102c98913f44:780678e7979e929b
> >     Session Type: BEESWAX
> >     Start Time: 2018-08-13 08:00:02.757409000
> >     End Time: 2018-08-13 08:09:15.258627000
> >     Query Type: DDL
> >     Query State: FINISHED
> >     Query Status: OK
> >     Impala Version: impalad version 2.10.0-cdh5.13.0 RELEASE
> >       (build 2511805f1eaa991df1460276c7e9f19d819cd4e4)
> >     User: AAAA
> >     Connected User: AAAA
> >     Delegated User:
> >     Network Address: ::ffff:172.16.136.1:48037
> >     Default Db: default
> >     Sql Statement: alter table BBBB recover partitions
> >     Coordinator: CCCC:22000
> >     Query Options (set by configuration): SYNC_DDL=1
> >     Query Options (set by configuration and planner): SYNC_DDL=1,MT_DOP=0
> >     DDL Type: ALTER_TABLE
> >
> >   Query Timeline
> >     Query submitted: 316.03us (316031)
> >     Planning finished: 5.65s (5649375629)
> >     Request finished: 9.2m (552495528168)
> >     Unregister query: 9.2m (552500950559)
> >   ImpalaServer
> >     - CatalogOpExecTimer: 10.73s (10730796494)
> >     - ClientFetchWaitTimer: 5ms (5411560)
> >     - InactiveTotalTime: 0ns (0)
> >     - RowMaterializationTimer: 0ns (0)
> >     - TotalTime: 0ns (0)
> >
> > ==================
> >
> > Query (id=ae4266aad3cea1ed:754c9c3400000000)
> >   Summary
> >     Session ID: 24401399943bebf8:96c267d02619e7ac
> >     Session Type: BEESWAX
> >     Start Time: 2018-08-13 08:00:10.625885000
> >     End Time: 2018-08-13 08:09:15.194417000
> >     Query Type: DDL
> >     Query State: FINISHED
> >     Query Status: OK
> >     Impala Version: impalad version 2.10.0-cdh5.13.0 RELEASE
> >       (build 2511805f1eaa991df1460276c7e9f19d819cd4e4)
> >     User: AAAA
> >     Connected User: AAAA
> >     Delegated User:
> >     Network Address: ::ffff:172.16.136.1:48044
> >     Default Db: default
> >     Sql Statement: alter table DDDD recover partitions
> >     Coordinator: EEEE:22000
> >     Query Options (set by configuration): SYNC_DDL=1
> >     Query Options (set by configuration and planner): SYNC_DDL=1,MT_DOP=0
> >     DDL Type: ALTER_TABLE
> >
> >   Query Timeline
> >     Query submitted: 502.36us (502357)
> >     Planning finished: 1ms (1077718)
> >     Request finished: 9.1m (544563396235)
> >     Unregister query: 9.1m (544568289284)
> >   ImpalaServer
> >     - CatalogOpExecTimer: 8.5m (511375736191)
> >     - ClientFetchWaitTimer: 4ms (4882019)
> >     - InactiveTotalTime: 0ns (0)
> >     - RowMaterializationTimer: 0ns (0)
> >     - TotalTime: 0ns (0)
> >
> > On Mon, Aug 13, 2018 at 12:55 PM Jeszy <jes...@gmail.com> wrote:
> >> Hey Fawze,
> >>
> >> Hm. Just to make sure I got this right: you have 100 tables, each
> >> partitioned by year/month/day, and you're updating a single partition
> >> of all 100 tables every 20 minutes via a Spark job. Is that correct?
> >> I can't think of a way to optimize your current setup for statement
> >> count specifically (there's no way to refresh 100 tables in fewer
> >> than 100 statements).
> >> However, it sounds like you would benefit from using Kudu in this
> >> case. With Kudu you don't need to REFRESH / RECOVER to pick up new
> >> data; it becomes available immediately after ingestion. You could
> >> create a landing table in Kudu, then migrate data to HDFS daily (or
> >> so), and query a view UNIONing these two tables. With the daily
> >> Kudu->HDFS move, you also remove the need for compaction on the HDFS
> >> side.
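Jeszy's landing-table idea above could look roughly like the following sketch. All names here (events_kudu, events_hdfs, events_v, the column list) are illustrative, not from the thread:

```sql
-- Hypothetical unified view over a Kudu landing table and the HDFS
-- history table; queries see fresh rows immediately, no REFRESH needed.
CREATE VIEW events_v AS
SELECT year, month, day, payload FROM events_kudu   -- visible on insert
UNION ALL
SELECT year, month, day, payload FROM events_hdfs;  -- daily-migrated history
```

The daily Kudu->HDFS job would then delete the migrated rows from the landing table, keeping it small.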
> >>
> >> HTH
> >> Jeszy
> >>
> >> On 13 August 2018 at 11:08, Fawze Abujaber <fawz...@gmail.com> wrote:
> >> > Hi Community,
> >> >
> >> > I have a Spark job that produces parquet files on HDFS with
> >> > partitions year, month, and day.
> >> > The HDFS structure has 100 folders (1 event per folder, and these
> >> > events are partitioned by year, month, and day).
> >> > The job runs every 20 minutes and writes files into the 100 event
> >> > folders (adding one file under the relevant partition for each
> >> > event).
> >> > On top of each event I have an external Impala table that I defined
> >> > using Impala, with partitions year, month, and day.
> >> >
> >> > Is there a way to avoid running ALTER TABLE AAAA RECOVER PARTITIONS
> >> > on the 100 tables every 20 minutes? (The RECOVER statements run from
> >> > an external cron that scans the main folder and runs RECOVER
> >> > PARTITIONS on all the events under it.)
> >> >
> >> > I know that the RECOVER PARTITIONS clause scans a partitioned table
> >> > to detect whether any new partition directories were added outside
> >> > of Impala, but I'm wondering whether there is any other way to avoid
> >> > running 4800 statements per day while keeping the refresh rate high.
> >> >
> >> > --
> >> > Take Care
> >> > Fawze Abujaber
> >
> > --
> > Take Care
> > Fawze Abujaber

--
Take Care
Fawze Abujaber
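For reference, the per-event statement the cron issues is the first form below. The second form is a possible narrower alternative not discussed in the thread: since the Spark job knows exactly which year/month/day partition it just wrote, it (or the cron) could register only that partition instead of scanning the whole directory tree. Table name and partition values are illustrative:

```sql
-- What the cron runs today: scan the table's directory tree for any
-- partition directories added outside of Impala.
ALTER TABLE event_aaaa RECOVER PARTITIONS;

-- Possible narrower alternative (assumes the writer knows the partition
-- it just created): register exactly one partition, no directory scan.
ALTER TABLE event_aaaa ADD IF NOT EXISTS PARTITION (year=2018, month=8, day=13);
```

This doesn't reduce the statement count (still one per table per cycle), but each statement touches far less metadata, which matters when SYNC_DDL=1 makes every catalog update expensive to propagate.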