Hello, which one is more appropriate for my case (a Spark job that writes new files into a partitioned HDFS directory): ALTER TABLE ... RECOVER PARTITIONS, or the REFRESH statement on a specific partition?
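For context on the trade-off being asked about: RECOVER PARTITIONS scans the whole table directory tree for new partitions, while REFRESH with a PARTITION clause (supported since Impala 2.7) reloads metadata for one known partition only. Since the Spark job knows exactly which year/month/day it just wrote, the targeted REFRESH is the cheaper statement to issue. A minimal sketch that builds such a statement (the table name and partition values are hypothetical, for illustration only):

```python
def build_refresh(table: str, year: int, month: int, day: int) -> str:
    # REFRESH with a PARTITION clause reloads file metadata for a single
    # partition instead of scanning the whole table directory tree, which
    # is what ALTER TABLE ... RECOVER PARTITIONS does.
    return (f"REFRESH {table} "
            f"PARTITION (year={year}, month={month}, day={day})")

# The statement a cron or Spark post-write hook would issue after writing
# one file under .../year=2018/month=8/day=13/ (hypothetical table name):
print(build_refresh("events_db.event_clicks", 2018, 8, 13))
```

Note that REFRESH ... PARTITION only reloads an already-known partition; a partition directory created outside Impala still needs RECOVER PARTITIONS (or ALTER TABLE ... ADD PARTITION) once before it can be refreshed.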
I'm running 4800 statements a day and would like to follow best practices for my case.

On Mon, Aug 13, 2018 at 5:00 PM Fawze Abujaber <fawz...@gmail.com> wrote:
> Hi Jeszy,
>
> My problem is that it happens with specific events and not sporadically,
> which is what led me to rule out a bottleneck at the NameNode or the
> Catalog Server.
>
> On Mon, Aug 13, 2018 at 4:52 PM Jeszy <jes...@gmail.com> wrote:
>
>> I'd try to trace the update through catalog and statestore. SYNC_DDL=1
>> can be a problem, especially if there's a slow impalad or a lot of
>> concurrent catalog updates (a lot of data to stream from the statestore
>> node). The NameNode can also become a bottleneck. Catalog logs will
>> help point these out.
>>
>> On 13 August 2018 at 14:30, Fawze Abujaber <fawz...@gmail.com> wrote:
>> > Thanks Jeszy for your quick response. We are far away from moving to
>> > Kudu.
>> >
>> > Trying to figure out what can cause RECOVER PARTITIONS to run for a
>> > long time on some of the events.
>> >
>> > ===================
>> > Query (id=9a4ed4eabe44c9e5:3f0cde6300000000)
>> > Summary
>> > Session ID: 4c40102c98913f44:780678e7979e929b
>> > Session Type: BEESWAX
>> > Start Time: 2018-08-13 08:00:02.757409000
>> > End Time: 2018-08-13 08:09:15.258627000
>> > Query Type: DDL
>> > Query State: FINISHED
>> > Query Status: OK
>> > Impala Version: impalad version 2.10.0-cdh5.13.0 RELEASE (build
>> > 2511805f1eaa991df1460276c7e9f19d819cd4e4)
>> > User: AAAA
>> > Connected User: AAAA
>> > Delegated User:
>> > Network Address: ::ffff:172.16.136.1:48037
>> > Default Db: default
>> > Sql Statement: alter table BBBB recover partitions
>> > Coordinator: CCCC:22000
>> > Query Options (set by configuration): SYNC_DDL=1
>> > Query Options (set by configuration and planner): SYNC_DDL=1,MT_DOP=0
>> > DDL Type: ALTER_TABLE
>> >
>> > Query Timeline
>> > Query submitted: 316.03us (316031)
>> > Planning finished: 5.65s (5649375629)
>> > Request finished: 9.2m (552495528168)
>> > Unregister query: 9.2m (552500950559)
>> > ImpalaServer
>> > - CatalogOpExecTimer: 10.73s (10730796494)
>> > - ClientFetchWaitTimer: 5ms (5411560)
>> > - InactiveTotalTime: 0ns (0)
>> > - RowMaterializationTimer: 0ns (0)
>> > - TotalTime: 0ns (0)
>> >
>> > ==================
>> >
>> > Query (id=ae4266aad3cea1ed:754c9c3400000000)
>> > Summary
>> > Session ID: 24401399943bebf8:96c267d02619e7ac
>> > Session Type: BEESWAX
>> > Start Time: 2018-08-13 08:00:10.625885000
>> > End Time: 2018-08-13 08:09:15.194417000
>> > Query Type: DDL
>> > Query State: FINISHED
>> > Query Status: OK
>> > Impala Version: impalad version 2.10.0-cdh5.13.0 RELEASE (build
>> > 2511805f1eaa991df1460276c7e9f19d819cd4e4)
>> > User: AAAA
>> > Connected User: AAAA
>> > Delegated User:
>> > Network Address: ::ffff:172.16.136.1:48044
>> > Default Db: default
>> > Sql Statement: alter table DDDD recover partitions
>> > Coordinator: EEEE:22000
>> > Query Options (set by configuration): SYNC_DDL=1
>> > Query Options (set by configuration and planner): SYNC_DDL=1,MT_DOP=0
>> > DDL Type: ALTER_TABLE
>> >
>> > Query Timeline
>> > Query submitted: 502.36us (502357)
>> > Planning finished: 1ms (1077718)
>> > Request finished: 9.1m (544563396235)
>> > Unregister query: 9.1m (544568289284)
>> > ImpalaServer
>> > - CatalogOpExecTimer: 8.5m (511375736191)
>> > - ClientFetchWaitTimer: 4ms (4882019)
>> > - InactiveTotalTime: 0ns (0)
>> > - RowMaterializationTimer: 0ns (0)
>> > - TotalTime: 0ns (0)
>> >
>> >
>> > On Mon, Aug 13, 2018 at 12:55 PM Jeszy <jes...@gmail.com> wrote:
>> >>
>> >> Hey Fawze,
>> >>
>> >> Hm.
>> >> Just to make sure I got this right: you have 100 tables, each
>> >> partitioned by y/m/d, and you're updating a single partition of all
>> >> 100 tables every 20 minutes via a Spark job. Is that correct? I can't
>> >> think of a way to optimize your current setup for statement count
>> >> specifically (there's no way to refresh 100 tables in fewer than 100
>> >> statements).
>> >> However, it sounds like you would benefit from using Kudu in this
>> >> case. With Kudu, you don't need to REFRESH / RECOVER to pick up new
>> >> data; it becomes available immediately after ingestion. You could
>> >> create a landing table in Kudu, then migrate data to HDFS daily (or
>> >> so), and query a view UNIONing these two tables. With the daily
>> >> Kudu->HDFS move, you also remove the need for compaction on the HDFS
>> >> side.
>> >>
>> >> HTH
>> >> Jeszy
>> >>
>> >> On 13 August 2018 at 11:08, Fawze Abujaber <fawz...@gmail.com> wrote:
>> >> > Hi Community,
>> >> >
>> >> > I have a Spark job that produces Parquet files on HDFS, partitioned
>> >> > by year, month and day.
>> >> > The HDFS structure has 100 folders (one event per folder, and these
>> >> > events are partitioned by year, month and day).
>> >> > The job runs every 20 minutes and writes files into the 100 event
>> >> > folders (adding one file under the relevant partition for each
>> >> > event).
>> >> > On top of each event I have an external Impala table that I
>> >> > defined using Impala, partitioned by year, month and day.
>> >> >
>> >> > Is there a way to avoid running ALTER TABLE AAAA RECOVER PARTITIONS
>> >> > on the 100 tables every 20 minutes? (The RECOVER statements are run
>> >> > by an external cron job that scans the main folder and runs RECOVER
>> >> > PARTITIONS on all the events under it.)
>> >> >
>> >> > I know that the RECOVER PARTITIONS clause scans a partitioned table
>> >> > to detect whether any new partition directories were added outside
>> >> > of Impala, but I'm wondering if there is any other way to avoid
>> >> > running 4800 statements per day while keeping the refresh rate
>> >> > high.
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Take Care
>> >> > Fawze Abujaber
>> >
>> >
>> > --
>> > Take Care
>> > Fawze Abujaber
>
>
> --
> Take Care
> Fawze Abujaber

--
Take Care
Fawze Abujaber
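One way to cut down the per-table work described in the thread: if the Spark job (or the cron wrapper around it) can see which files were just written, the partitions that actually changed can be derived from the file paths, so each table gets one targeted REFRESH for its touched partition instead of a full RECOVER PARTITIONS scan. A rough sketch under that assumption (the paths, directory layout, and helper function are hypothetical, not from the thread):

```python
import re

# Partition directories follow the layout the thread describes:
# .../year=YYYY/month=M/day=D/<file>.parquet
PART_RE = re.compile(r"year=(\d+)/month=(\d+)/day=(\d+)")

def partitions_touched(paths):
    """Return the distinct (year, month, day) tuples found in HDFS paths.

    Each tuple would then drive one targeted statement such as
    REFRESH <table> PARTITION (year=Y, month=M, day=D), rather than a
    whole-table ALTER TABLE ... RECOVER PARTITIONS.
    """
    parts = set()
    for p in paths:
        m = PART_RE.search(p)
        if m:
            parts.add(tuple(int(g) for g in m.groups()))
    return sorted(parts)

# Hypothetical example paths for one event folder:
paths = [
    "/data/events/clicks/year=2018/month=8/day=13/part-0001.parquet",
    "/data/events/clicks/year=2018/month=8/day=13/part-0002.parquet",
]
print(partitions_touched(paths))  # [(2018, 8, 13)]
```

This doesn't reduce the statement count below one per table per run (as Jeszy notes, 100 tables need at least 100 statements), but each statement touches a single known partition, which avoids the full directory scan that RECOVER PARTITIONS performs.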