Rajesh Balamohan created HIVE-23597:
---------------------------------------

Summary: VectorizedOrcAcidRowBatchReader::ColumnizedDeleteEventRegistry reads delete delta directories multiple times
Key: HIVE-23597
URL: https://issues.apache.org/jira/browse/HIVE-23597
Project: Hive
Issue Type: Improvement
Reporter: Rajesh Balamohan

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java#L1562]

{code:java}
try {
  final Path[] deleteDeltaDirs = getDeleteDeltaDirsFromSplit(orcSplit);
  if (deleteDeltaDirs.length > 0) {
    int totalDeleteEventCount = 0;
    for (Path deleteDeltaDir : deleteDeltaDirs) {
{code}

Consider a directory layout like the following. It was created by running a simple series of "insert --> update --> select" queries.

{noformat}
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/base_0000001
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/base_0000002
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000003_0000003_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000004_0000004_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000005_0000005_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000006_0000006_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000007_0000007_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000008_0000008_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000009_0000009_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000010_0000010_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000011_0000011_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000012_0000012_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delete_delta_0000013_0000013_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000003_0000003_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000004_0000004_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000005_0000005_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000006_0000006_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000007_0000007_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000008_0000008_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000009_0000009_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000010_0000010_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000011_0000011_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000012_0000012_0000
/warehouse-1591131255-hl5z/warehouse/tablespace/managed/hive/sequential_update_4/delta_0000013_0000013_0000
{noformat}

The OrcSplit contains all the delete delta folder information. For a directory layout like this, it would create {{~12 splits}}. For every split, VRBAcidReader constructs a "ColumnizedDeleteEventRegistry" and ends up reading all of these delete delta folders again. In this case, it would read them approximately {{121 times!}}. This causes a huge delay when running simple queries like "{{select * from tab_x}}" on cloud storage.
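
To make the cost concrete, here is a minimal, self-contained sketch of the read amplification described above. It is a simulation only and not Hive code: {{DeleteDeltaReadCount}}, {{readDeleteDelta}}, {{buildRegistry}} and {{cachedRegistry}} are hypothetical names, and the shared cache is just one possible mitigation, not necessarily what a fix for this issue would do. The split count of 11 is an assumption chosen so that the uncached total matches the "approximately 121" figure above; the description itself estimates about 12 splits.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class DeleteDeltaReadCount {
    // Counts simulated reads of a delete delta directory.
    static final AtomicInteger dirReads = new AtomicInteger();

    // Hypothetical cache keyed by the list of delete delta directories of a split.
    static final Map<List<String>, Set<Long>> cache = new ConcurrentHashMap<>();

    // Hypothetical stand-in for loading delete events from one directory.
    public static Set<Long> readDeleteDelta(String dir) {
        dirReads.incrementAndGet();
        return Collections.singleton((long) dir.hashCode());
    }

    // Mirrors the per-split behaviour: read every delete delta directory.
    public static Set<Long> buildRegistry(List<String> dirs) {
        Set<Long> events = new HashSet<>();
        for (String dir : dirs) {
            events.addAll(readDeleteDelta(dir));
        }
        return events;
    }

    // Assumed mitigation: build the registry once per distinct directory list.
    public static Set<Long> cachedRegistry(List<String> dirs) {
        return cache.computeIfAbsent(new ArrayList<>(dirs), DeleteDeltaReadCount::buildRegistry);
    }

    // Every split builds its own registry, so reads = splits * directories.
    public static int readsWithoutCache(int splits, List<String> dirs) {
        dirReads.set(0);
        for (int s = 0; s < splits; s++) {
            buildRegistry(dirs);
        }
        return dirReads.get();
    }

    // Splits share one registry, so reads = directories.
    public static int readsWithCache(int splits, List<String> dirs) {
        dirReads.set(0);
        cache.clear();
        for (int s = 0; s < splits; s++) {
            cachedRegistry(dirs);
        }
        return dirReads.get();
    }

    public static void main(String[] args) {
        // The 11 delete delta directories from the listing (write IDs 3 to 13).
        List<String> dirs = new ArrayList<>();
        for (int w = 3; w <= 13; w++) {
            dirs.add(String.format("delete_delta_%07d_%07d_0000", w, w));
        }
        System.out.println("without cache: " + readsWithoutCache(11, dirs));
        System.out.println("with cache:    " + readsWithCache(11, dirs));
    }
}
```

Under these assumptions the uncached path performs 11 x 11 = 121 directory reads, while the shared registry performs 11. A real implementation would also need to account for cache lifetime, memory use, and splits that are processed in different processes, none of which this sketch models.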