[jira] [Commented] (IMPALA-12933) Catalogd should set eventTypeSkipList when fetching specific events for a table

2024-04-22 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839699#comment-17839699
 ] 

ASF subversion and git services commented on IMPALA-12933:
--

Commit 0767d656ef00a381441fdcc3ebb3f146fb0d179c in impala's branch 
refs/heads/branch-4.4.0 from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=0767d656e ]

IMPALA-12933: Avoid fetching unnecessary events of unwanted types

There are several places where catalogd will fetch all events of a
specific type on a table. E.g. in TableLoader#load(), if the table has
an old createEventId, catalogd will fetch all CREATE_TABLE events after
that createEventId on the table.

Fetching the list of events is expensive since the filtering is done on
the client side, i.e. catalogd fetches all events and filters them locally
based on the event type and table name. This could take hours if there
are lots of events (e.g. 1M) in HMS.

This patch sets the eventTypeSkipList to the complement of the wanted
type, so the get_next_notification RPC can filter out some events on the
HMS side. To avoid adding too much computation overhead to HMS's
underlying RDBMS when evaluating predicates like EVENT_TYPE != 'xxx', rare
event types (e.g. DROP_ISCHEMA) are not added to the list. A new flag,
common_hms_event_types, is added to specify the common HMS event types.
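
As a rough illustration of the "complement of the wanted type" idea, the
sketch below builds a skip list from a set of common event types plus the
types that are skipped by default. It is not the code from this patch: the
class and method names are hypothetical, and the comma-separated flag
format and the example type lists are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: not the code from the Impala patch. */
final class SkipListSketch {
  // Hypothetical stand-in for the value of the common_hms_event_types flag.
  // The comma-separated format and the listed types are assumptions.
  static final String COMMON_TYPES =
      "CREATE_TABLE,DROP_TABLE,ALTER_TABLE,ADD_PARTITION,RELOAD";

  // Hypothetical stand-in for the default_skipped_hms_event_types flag.
  static final String DEFAULT_SKIPPED =
      "UPDATE_TBL_COL_STAT_EVENT,UPDATE_PART_COL_STAT_EVENT";

  /** Splits a comma-separated flag value into an ordered set of type names. */
  static Set<String> parse(String csv) {
    Set<String> types = new LinkedHashSet<>();
    for (String part : csv.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) types.add(trimmed);
    }
    return types;
  }

  /**
   * Returns the event types to skip when only 'wantedType' is needed:
   * the default-skipped types plus every common type except the wanted one.
   * Rare types that appear in neither flag are never added, so the
   * predicate list evaluated by the HMS backend stays short.
   */
  static List<String> buildSkipList(String wantedType) {
    Set<String> skip = parse(DEFAULT_SKIPPED);
    skip.addAll(parse(COMMON_TYPES));
    skip.remove(wantedType);
    return new ArrayList<>(skip);
  }
}
```

In catalogd the resulting list would be set as the eventTypeSkipList of
the NotificationEventRequest. Events of rare types can still come back
from HMS, so the client-side type check remains necessary.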

Once HIVE-28146 is resolved, we can set the wanted types directly in the
HMS RPC and this approach can be simplified.

UPDATE_TBL_COL_STAT_EVENT and UPDATE_PART_COL_STAT_EVENT are the most
common event types that Impala does not use. They are also added to the
default skip list. A new flag, default_skipped_hms_event_types, is added
to configure this list.

This patch also fixes an issue where events of non-default catalogs
were not filtered out.
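
The client-side filtering described above, including the catalog check,
might look roughly like the following. This is a simplified sketch, not
the actual catalogd code: the Event class is a hypothetical stand-in for
an HMS notification event, and the case-insensitive name comparison is an
assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: not the code from the Impala patch. */
final class EventFilterSketch {
  /** Hypothetical, simplified stand-in for an HMS notification event. */
  static final class Event {
    final long id;
    final String catName;
    final String dbName;
    final String tblName;
    final String eventType;

    Event(long id, String catName, String dbName, String tblName,
        String eventType) {
      this.id = id;
      this.catName = catName;
      this.dbName = dbName;
      this.tblName = tblName;
      this.eventType = eventType;
    }
  }

  /**
   * Keeps only the events of 'wantedType' on the given table. Events whose
   * catalog differs from 'catName' are dropped as well, even when the
   * database and table names match.
   */
  static List<Event> filter(List<Event> fetched, String catName,
      String dbName, String tblName, String wantedType) {
    List<Event> result = new ArrayList<>();
    for (Event e : fetched) {
      // equalsIgnoreCase(null) is false, so events without a name are dropped.
      if (!catName.equalsIgnoreCase(e.catName)) continue;
      if (!dbName.equalsIgnoreCase(e.dbName)) continue;
      if (!tblName.equalsIgnoreCase(e.tblName)) continue;
      if (!wantedType.equals(e.eventType)) continue;
      result.add(e);
    }
    return result;
  }
}
```

Every event in 'fetched' has already been transferred from HMS before
this loop runs, which is why reducing the number of fetched events on
the server side matters more than the cost of the loop itself.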

In a local perf test, I generated 100K RELOAD events after creating a
table in Hive, then used the table in Impala to trigger metadata loading
on it, which fetches the latest CREATE_TABLE event by polling all
events after the last known CREATE_TABLE event. Before this patch,
fetching the events took 1s779ms. Now it takes only 395.377ms. Note
that in production environments the event messages are usually larger,
so the speedup could be larger.

Tests:
 - Added an FE test
 - Ran CORE tests

Change-Id: Ieabe714328aa2cc605cb62b85ae8aa4bd537dbe9
Reviewed-on: http://gerrit.cloudera.org:8080/21186
Reviewed-by: Csaba Ringhofer 
Tested-by: Impala Public Jenkins 


> Catalogd should set eventTypeSkipList when fetching specific events for a 
> table
> ---
>
> Key: IMPALA-12933
> URL: https://issues.apache.org/jira/browse/IMPALA-12933
> Project: IMPALA
> Issue Type: Bug
> Components: Catalog
> Reporter: Quanlong Huang
> Assignee: Quanlong Huang
> Priority: Critical
> Fix For: Impala 4.4.0
>
>
> There are several places where catalogd will fetch all events of a specific 
> type on a table. E.g. in TableLoader#load(), if the table has an old 
> createEventId, catalogd will fetch all CREATE_TABLE events after that 
> createEventId on the table.
> Fetching the list of events is expensive since the filtering is done on the 
> client side, i.e. catalogd fetches all events and filters them locally based on 
> the event type and table name:
> [https://github.com/apache/impala/blob/14e3ed4f97292499b2e6ee8d5a756dc648d9/fe/src/main/java/org/apache/impala/catalog/TableLoader.java#L98-L102]
> [https://github.com/apache/impala/blob/b7ddbcad0dd6accb559a3f391a897a8c442d1728/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEventsProcessor.java#L336]
> This could take hours if there are lots of events (e.g. 1M) in HMS. In fact, 
> NotificationEventRequest can specify an eventTypeSkipList, so catalogd can do 
> the event type filtering on the HMS side. On newer Hive versions that have 
> HIVE-27499, catalogd can also specify the table name in the request 
> (IMPALA-12607).
> This Jira focuses on specifying the eventTypeSkipList when fetching events of a 
> particular type on a table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-12933) Catalogd should set eventTypeSkipList when fetching specific events for a table

2024-04-20 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839297#comment-17839297
 ] 

ASF subversion and git services commented on IMPALA-12933:
--

Commit db09d58ef767b2b759792412efcc9481777c464b in impala's branch 
refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=db09d58ef ]

IMPALA-12933: Avoid fetching unnecessary events of unwanted types

There are several places where catalogd will fetch all events of a
specific type on a table. E.g. in TableLoader#load(), if the table has
an old createEventId, catalogd will fetch all CREATE_TABLE events after
that createEventId on the table.

Fetching the list of events is expensive since the filtering is done on
the client side, i.e. catalogd fetches all events and filters them locally
based on the event type and table name. This could take hours if there
are lots of events (e.g. 1M) in HMS.

This patch sets the eventTypeSkipList to the complement of the wanted
type, so the get_next_notification RPC can filter out some events on the
HMS side. To avoid adding too much computation overhead to HMS's
underlying RDBMS when evaluating predicates like EVENT_TYPE != 'xxx', rare
event types (e.g. DROP_ISCHEMA) are not added to the list. A new flag,
common_hms_event_types, is added to specify the common HMS event types.

Once HIVE-28146 is resolved, we can set the wanted types directly in the
HMS RPC and this approach can be simplified.

UPDATE_TBL_COL_STAT_EVENT and UPDATE_PART_COL_STAT_EVENT are the most
common event types that Impala does not use. They are also added to the
default skip list. A new flag, default_skipped_hms_event_types, is added
to configure this list.

This patch also fixes an issue where events of non-default catalogs
were not filtered out.

In a local perf test, I generated 100K RELOAD events after creating a
table in Hive, then used the table in Impala to trigger metadata loading
on it, which fetches the latest CREATE_TABLE event by polling all
events after the last known CREATE_TABLE event. Before this patch,
fetching the events took 1s779ms. Now it takes only 395.377ms. Note
that in production environments the event messages are usually larger,
so the speedup could be larger.

Tests:
 - Added an FE test
 - Ran CORE tests

Change-Id: Ieabe714328aa2cc605cb62b85ae8aa4bd537dbe9
Reviewed-on: http://gerrit.cloudera.org:8080/21186
Reviewed-by: Csaba Ringhofer 
Tested-by: Impala Public Jenkins 








[jira] [Commented] (IMPALA-12933) Catalogd should set eventTypeSkipList when fetching specific events for a table

2024-03-22 Thread Quanlong Huang (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829845#comment-17829845
 ] 

Quanlong Huang commented on IMPALA-12933:
-

In IMPALA-12399, we added OPEN_TXN to the eventTypeSkipList. We can also add 
UPDATE_PART_COL_STAT_EVENT, which is also unused by Impala.



