[jira] [Commented] (IMPALA-12933) Catalogd should set eventTypeSkipList when fetching specific events for a table
[ https://issues.apache.org/jira/browse/IMPALA-12933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839699#comment-17839699 ]

ASF subversion and git services commented on IMPALA-12933:
----------------------------------------------------------

Commit 0767d656ef00a381441fdcc3ebb3f146fb0d179c in impala's branch refs/heads/branch-4.4.0 from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=0767d656e ]

IMPALA-12933: Avoid fetching unnecessary events of unwanted types

There are several places where catalogd will fetch all events of a
specific type on a table. E.g. in TableLoader#load(), if the table has
an old createEventId, catalogd will fetch all CREATE_TABLE events after
that createEventId on the table.

Fetching the list of events is expensive since the filtering is done on
the client side, i.e. catalogd fetches all events and filters them
locally based on the event type and table name. This could take hours
if there are lots of events (e.g. 1M) in HMS.

This patch sets the eventTypeSkipList to the complement set of the
wanted type, so the get_next_notification RPC can filter out some
events on the HMS side. To avoid adding too much computation overhead
to HMS's underlying RDBMS when evaluating predicates of the form
EVENT_TYPE != 'xxx', rare event types (e.g. DROP_ISCHEMA) are not added
to the list. A new flag, common_hms_event_types, is added to specify
the common HMS event types. Once HIVE-28146 is resolved, we can set the
wanted types directly in the HMS RPC and this approach can be
simplified.

UPDATE_TBL_COL_STAT_EVENT and UPDATE_PART_COL_STAT_EVENT are the most
common unused events for Impala. They are also added to the default
skip list. A new flag, default_skipped_hms_event_types, is added to
configure this list.

This patch also fixes an issue where events of the non-default catalog
are not filtered out.

In a local perf test, I generated 100K RELOAD events after creating a
table in Hive. Then I used the table in Impala to trigger metadata
loading on it, which fetches the latest CREATE_TABLE event by polling
all events after the last known CREATE_TABLE event. Before this patch,
fetching the events takes 1s779ms. Now it takes only 395.377ms. Note
that in a prod env the event messages are usually larger, so we could
see a larger speedup.

Tests:
 - Added an FE test
 - Ran CORE tests

Change-Id: Ieabe714328aa2cc605cb62b85ae8aa4bd537dbe9
Reviewed-on: http://gerrit.cloudera.org:8080/21186
Reviewed-by: Csaba Ringhofer
Tested-by: Impala Public Jenkins

> Catalogd should set eventTypeSkipList when fetching specific events for a table
> -------------------------------------------------------------------------------
>
>                 Key: IMPALA-12933
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12933
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Critical
>             Fix For: Impala 4.4.0
>
> There are several places where catalogd will fetch all events of a specific
> type on a table. E.g. in TableLoader#load(), if the table has an old
> createEventId, catalogd will fetch all CREATE_TABLE events after that
> createEventId on the table.
> Fetching the list of events is expensive since the filtering is done on the
> client side, i.e. catalogd fetches all events and filters them locally based
> on the event type and table name:
> [https://github.com/apache/impala/blob/14e3ed4f97292499b2e6ee8d5a756dc648d9/fe/src/main/java/org/apache/impala/catalog/TableLoader.java#L98-L102]
> [https://github.com/apache/impala/blob/b7ddbcad0dd6accb559a3f391a897a8c442d1728/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEventsProcessor.java#L336]
> This could take hours if there are lots of events (e.g. 1M) in HMS. In fact,
> NotificationEventRequest can specify an eventTypeSkipList, so catalogd can
> filter by event type on the HMS side. On higher Hive versions that have
> HIVE-27499, catalogd can also specify the table name in the request
> (IMPALA-12607).
> This Jira focuses on specifying the eventTypeSkipList when fetching events of
> a particular type on a table.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
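The skip-list construction described in the commit message above (default skipped types plus the complement of the wanted type within the common types) can be sketched roughly as follows. This is a minimal illustration in Python, not Impala's actual code: the function name build_event_type_skip_list and the sample contents of both lists are assumptions made for the example. Only the flag names common_hms_event_types and default_skipped_hms_event_types and the individual event type names come from this thread.

```python
from typing import Iterable, List

# Hypothetical sample values. The real defaults of the common_hms_event_types
# and default_skipped_hms_event_types flags are not stated in this thread.
COMMON_HMS_EVENT_TYPES = [
    "CREATE_TABLE", "ALTER_TABLE", "DROP_TABLE",
    "ADD_PARTITION", "RELOAD", "OPEN_TXN",
]
DEFAULT_SKIPPED_HMS_EVENT_TYPES = [
    "OPEN_TXN", "UPDATE_TBL_COL_STAT_EVENT", "UPDATE_PART_COL_STAT_EVENT",
]


def build_event_type_skip_list(
    wanted_type: str,
    common_types: Iterable[str] = COMMON_HMS_EVENT_TYPES,
    default_skipped: Iterable[str] = DEFAULT_SKIPPED_HMS_EVENT_TYPES,
) -> List[str]:
    """Return the event types the server should skip when only
    `wanted_type` is needed.

    The result is the default skipped types followed by every common type
    other than the wanted one. Rare types (e.g. DROP_ISCHEMA) are left out
    on purpose so the server-side predicate stays short; they still have to
    be filtered on the client.
    """
    skip_list: List[str] = []
    for event_type in list(default_skipped) + list(common_types):
        # Never skip the type we are looking for, and avoid duplicates.
        if event_type != wanted_type and event_type not in skip_list:
            skip_list.append(event_type)
    return skip_list


if __name__ == "__main__":
    print(build_event_type_skip_list("CREATE_TABLE"))
```

With the sample lists above, asking for CREATE_TABLE yields a skip list that contains RELOAD and the two column-stat event types but neither CREATE_TABLE nor DROP_ISCHEMA. In catalogd the resulting list would be passed as the eventTypeSkipList of the NotificationEventRequest, and the existing client-side filter would still run on whatever HMS returns.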
[jira] [Commented] (IMPALA-12933) Catalogd should set eventTypeSkipList when fetching specific events for a table
[ https://issues.apache.org/jira/browse/IMPALA-12933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839297#comment-17839297 ]

ASF subversion and git services commented on IMPALA-12933:
----------------------------------------------------------

Commit db09d58ef767b2b759792412efcc9481777c464b in impala's branch refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=db09d58ef ]

IMPALA-12933: Avoid fetching unnecessary events of unwanted types

There are several places where catalogd will fetch all events of a
specific type on a table. E.g. in TableLoader#load(), if the table has
an old createEventId, catalogd will fetch all CREATE_TABLE events after
that createEventId on the table.

Fetching the list of events is expensive since the filtering is done on
the client side, i.e. catalogd fetches all events and filters them
locally based on the event type and table name. This could take hours
if there are lots of events (e.g. 1M) in HMS.

This patch sets the eventTypeSkipList to the complement set of the
wanted type, so the get_next_notification RPC can filter out some
events on the HMS side. To avoid adding too much computation overhead
to HMS's underlying RDBMS when evaluating predicates of the form
EVENT_TYPE != 'xxx', rare event types (e.g. DROP_ISCHEMA) are not added
to the list. A new flag, common_hms_event_types, is added to specify
the common HMS event types. Once HIVE-28146 is resolved, we can set the
wanted types directly in the HMS RPC and this approach can be
simplified.

UPDATE_TBL_COL_STAT_EVENT and UPDATE_PART_COL_STAT_EVENT are the most
common unused events for Impala. They are also added to the default
skip list. A new flag, default_skipped_hms_event_types, is added to
configure this list.

This patch also fixes an issue where events of the non-default catalog
are not filtered out.

In a local perf test, I generated 100K RELOAD events after creating a
table in Hive. Then I used the table in Impala to trigger metadata
loading on it, which fetches the latest CREATE_TABLE event by polling
all events after the last known CREATE_TABLE event. Before this patch,
fetching the events takes 1s779ms. Now it takes only 395.377ms. Note
that in a prod env the event messages are usually larger, so we could
see a larger speedup.

Tests:
 - Added an FE test
 - Ran CORE tests

Change-Id: Ieabe714328aa2cc605cb62b85ae8aa4bd537dbe9
Reviewed-on: http://gerrit.cloudera.org:8080/21186
Reviewed-by: Csaba Ringhofer
Tested-by: Impala Public Jenkins
[jira] [Commented] (IMPALA-12933) Catalogd should set eventTypeSkipList when fetching specific events for a table
[ https://issues.apache.org/jira/browse/IMPALA-12933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829845#comment-17829845 ]

Quanlong Huang commented on IMPALA-12933:
-----------------------------------------

In IMPALA-12399, we added OPEN_TXN to the eventTypeSkipList. We can also add
UPDATE_PART_COL_STAT_EVENT, which is also unused by Impala.