Sahil Takiar created IMPALA-10117:
-------------------------------------

             Summary: Skip calls to FsPermissionCache for blob stores
                 Key: IMPALA-10117
                 URL: https://issues.apache.org/jira/browse/IMPALA-10117
             Project: IMPALA
          Issue Type: Improvement
            Reporter: Sahil Takiar


The {{FsPermissionCache}} is described as:
{code:java}
/**
 * Simple non-thread-safe cache for resolved file permissions. This allows
 * pre-caching permissions by listing the status of all files within a directory,
 * and then using that cache to avoid round trips to the FileSystem for later
 * queries of those paths.
 */ {code}

I confirmed that {{FsPermissionCache#precacheChildrenOf}} is called for data 
stored on S3. However, {{FsPermissionCache#getPermissions}} is called inside 
{{HdfsTable#getAvailableAccessLevel}}, which is skipped for S3, so none of the 
cached metadata is ever read. Meanwhile, {{precacheChildrenOf}} calls 
{{getFileStatus}} for all files, which results in unnecessary metadata 
operations against S3 and in cached metadata that is never used.
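
To make the wasted work concrete, here is a toy model of the behavior described 
above. It is only a sketch: {{ToyPermissionCache}}, its counters, and the 
placeholder permission string are hypothetical and are not Impala's actual 
{{FsPermissionCache}}. It assumes one simulated remote call per child path 
during pre-caching.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only; NOT Impala's FsPermissionCache. All names are hypothetical. */
public class ToyPermissionCache {
  private final Map<String, String> cache = new HashMap<>();
  int remoteCalls = 0;  // simulated metadata round trips to the store
  int cacheReads = 0;   // lookups that were served from the cache

  /** Simulates pre-caching: one (assumed) remote status call per child path. */
  void precacheChildrenOf(List<String> children) {
    for (String child : children) {
      remoteCalls++;
      cache.put(child, "rwxr-xr-x");  // placeholder permission value
    }
  }

  /** Returns the cached permission if present, else simulates a remote fetch. */
  String getPermissions(String path) {
    String cached = cache.get(path);
    if (cached != null) {
      cacheReads++;
      return cached;
    }
    remoteCalls++;
    return "rwxr-xr-x";
  }

  public static void main(String[] args) {
    ToyPermissionCache c = new ToyPermissionCache();
    c.precacheChildrenOf(List.of("p=1", "p=2", "p=3"));
    // For S3 the access-level check is skipped, so getPermissions is never
    // called and the entries populated above are never read.
    System.out.println(c.remoteCalls + " remote calls, " + c.cacheReads + " cache reads");
  }
}
```

In this model the S3 path pays for three remote calls and gets zero cache reads 
in return, which is the pattern the issue describes.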

{{precacheChildrenOf}} is actually only invoked in the specific scenario 
described below:
{code}
    // Only preload permissions if the number of partitions to be added is
    // large (3x) relative to the number of existing partitions. This covers
    // two common cases:
    //
    // 1) initial load of a table (no existing partition metadata)
    // 2) ALTER TABLE RECOVER PARTITIONS after creating a table pointing to
    // an already-existing partition directory tree
    //
    // Without this heuristic, we would end up using a "listStatus" call to
    // potentially fetch a bunch of irrelevant information about existing
    // partitions when we only want to know about a small number of newly-added
    // partitions.
{code}

Regardless, skipping the call to {{precacheChildrenOf}} for blob stores should 
(1) improve table loading time for S3-backed tables, and (2) decrease catalogd 
memory requirements when loading many tables stored on S3.
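
One possible shape for such a check is sketched below. This is an illustration 
under stated assumptions, not the actual patch: the class {{PrecacheGuard}}, the 
methods {{isBlobStore}} and {{shouldPrecache}}, and the list of URI schemes are 
all hypothetical, and the exact form of the 3x comparison is assumed from the 
quoted comment rather than taken from the Impala source.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch; names and scheme list are NOT from the Impala code base. */
public class PrecacheGuard {
  // Assumed set of blob-store URI schemes, for illustration only.
  private static final Set<String> BLOB_STORE_SCHEMES =
      Set.of("s3a", "s3n", "s3", "abfs", "abfss", "adl");

  /** Returns true if the location's URI scheme is in the assumed blob-store set. */
  static boolean isBlobStore(URI location) {
    String scheme = location.getScheme();
    return scheme != null
        && BLOB_STORE_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
  }

  /**
   * Decides whether to preload permissions. Blob stores are skipped outright,
   * because the cached permissions would never be read for them. Otherwise the
   * 3x heuristic from the quoted comment applies (exact comparison assumed).
   */
  static boolean shouldPrecache(URI tableLocation, int numNewPartitions,
      int numExistingPartitions) {
    if (isBlobStore(tableLocation)) return false;
    return numNewPartitions > 3L * numExistingPartitions;
  }

  public static void main(String[] args) {
    URI hdfs = URI.create("hdfs://namenode/warehouse/t");
    URI s3 = URI.create("s3a://bucket/warehouse/t");
    System.out.println(shouldPrecache(hdfs, 10, 0));   // initial load on HDFS
    System.out.println(shouldPrecache(s3, 10, 0));     // same load on S3
    System.out.println(shouldPrecache(hdfs, 1, 100));  // few new partitions
  }
}
```

Placing the blob-store test ahead of the heuristic means an S3-backed table 
never triggers the pre-caching calls, regardless of how many partitions are 
being added.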



