I'm afraid I have another guess. In this scenario, did TEZ-4248 fail?


---- Replied Message ----
| From | Sungwoo Park<glap...@gmail.com> |
| Date | 08/27/2025 15:02 |
| To | user@hive.apache.org |
| Cc | |
| Subject | Re: ORC Predicate Pushdown (SARG) Not Applied (allowSARGToFilter: false) |
Hi,
I have a quick question. Did you try setting orc.sarg.to.filter to true in 
hive-site.xml?
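
For reference, a minimal sketch of what that entry could look like. The property name is the one mentioned above; whether Hive 4.0.1 actually passes it through to the ORC reader in your setup is an assumption you would need to verify, for example by checking that the TezChild log then reports allowSARGToFilter: true.

```xml
<!-- hive-site.xml fragment (sketch only).
     Assumption: the ORC reader used by Hive picks this property up
     from the job configuration; confirm via the ReaderImpl log line. -->
<property>
  <name>orc.sarg.to.filter</name>
  <value>true</value>
</property>
```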


--- Sungwoo


On Wed, Aug 27, 2025 at 3:02 PM 서연 <seoyonie...@gmail.com> wrote:

Hello Hive Development Team,


We are observing a significant performance issue with queries on a 
non-partitioned ORC table. Our investigation indicates that ORC predicate 
pushdown (SARG) is not being applied at the storage layer, forcing full data 
scans instead of efficient, filtered reads.


From the TezChild logs, we can see that Hive correctly identifies the pushdown 
predicate. However, it then explicitly instructs the ORC reader to ignore it 
for filtering by setting the allowSARGToFilter option to false.



```
2025-08-27 13:21:52,149 [INFO] [TezChild] |orc.OrcInputFormat|: ORC pushdown predicate: (and leaf-(BETWEEN inv_quantity_on_hand 100 500) (not leaf-(IS_NULL inv_item_sk)) (not leaf-(IS_NULL inv_date_sk)))
2025-08-27 13:21:52,149 [INFO] [TezChild] |orc.ReaderImpl|: Reading ORC rows from hdfs://.../inventory/000000_0 with {..., sarg: (and leaf-(BETWEEN inv_quantity_on_hand 100 500) ...), ..., allowSARGToFilter: false, ...}
```



However, we have confirmed that when we run the exact same query on the same 
data in our Hive 2.3.2 environment, predicate pushdown works correctly, and the 
data is filtered at the ORC reader level as expected.

Our hypothesis is that this difference is due to changes in the ORC 
integration. We suspect that the ORC version used in Hive 2.3.2 (likely ORC 
1.3.3) did not have the allowSARGToFilter parameter and would always apply a 
filter if a sarg was present. The introduction of this flag in newer versions 
seems to have inadvertently caused this performance regression in our use case.

Given this, we strongly believe that there should be a way for users to control 
this behavior. We propose that Hive should provide a configuration (e.g., a 
session variable or a table property) to explicitly set allowSARGToFilter to 
true. This would restore the efficient behavior of older versions and provide a 
crucial performance tuning capability.

What are your thoughts on this? Is our analysis correct, and would you be open 
to considering such a feature?

For context, here is our environment information:
Hive Version: 4.0.1
Execution Engine: Tez 0.10.4
Query: TPC-DS scale 300, query82



Thank you for your time and any guidance you can offer.

Best regards,

seoyeon.
