Re: CachedStore for hive.metastore.rawstore.impl in Hive 3.0

2024-02-28 Thread Sungwoo Park
We didn't make any other attempt to fix the problem and just decided not to
use CachedStore. However, I think our installation of Metastore based on
Hive 3.1.3 is running without any serious problems.

Could you share how long it takes to compile typical queries in your
environment (with Hive 1 and with Hive 3)?

FYI, in our environment, sometimes it takes about 10 seconds to compile a
query on TPC-DS 10TB datasets. Specifically, the average compilation time
of 103 queries is 1.7 seconds (as reported by Hive), and the longest
compilation time is 9.6 seconds (query 49). The compilation time includes
the time for accessing Metastore.

Thanks,

--- Sungwoo


On Wed, Feb 28, 2024 at 9:59 PM Takanobu Asanuma 
wrote:

> Thanks for your detailed answer!
>
> In the original email, you reported "the query compilation takes long" in
> Hive 3.0, but has this issue been resolved in your fork of Hive 3.1.3?
> Thank you for sharing the issue with CachedStore and the JIRA tickets.
> I will also try out metastore.stats.fetch.bitvector=true.
>
> Regards,
> - Takanobu
>
> 2024年2月28日(水) 18:49 Sungwoo Park :
>
>> Hello Takanobu,
>>
>> We did not test with vanilla Hive 3.1.3 and Metastore databases can be
>> different, so I don't know why Metastore responses are very slow. I can
>> only share some results of testing CachedStore in Metastore. Please note
>> that we did not use vanilla Hive 3.1.3 and instead used our own fork of
>> Hive 3.1.3 (which applies many additional patches).
>>
>> 1.
>> When CachedStore is enabled, column stats are not computed. As a result,
>> some queries generate very inefficient plans because of wrong/inaccurate
>> stats.
>>
>> Perhaps this is because not all patches for CachedStore have been merged
>> to Hive 3.1.3. For example, these patches are not merged. Or, there might
>> be some way to properly configure CachedStore so that it correctly computes
>> column stats.
>>
>> HIVE-20896: CachedStore fail to cache stats in multiple code paths
>> HIVE-21063: Support statistics in cachedStore for transactional table
>> HIVE-24258: Data mismatch between CachedStore and ObjectStore for
>> constraint
>>
>> So, we decided that CachedStore should not be enabled in Hive 3.1.3.
>>
>> (If anyone is running Hive Metastore 3.1.3 in production with CachedStore
>> enabled, please let us know how you configure it.)
>>
>> 2.
>> Setting metastore.stats.fetch.bitvector=true can also help generate more
>> efficient query plans.
>>
>> --- Sungwoo
>>
>>
>> On Wed, Feb 28, 2024 at 1:40 PM Takanobu Asanuma 
>> wrote:
>>
>>> Hi Sungwoo Park,
>>>
>>> I'm sorry for the late reply to this old email.
>>> We are attempting to upgrade Hive MetaStore from Hive1 to Hive3, and
>>> noticed that the response of the Hive3 MetaStore is very slow.
>>> We suspect that HIVE-14187 might be causing this slowness.
>>> Could you tell me if you have resolved this problem? Are there still any
>>> problems when you enable CachedStore?
>>>
>>> Regards,
>>> - Takanobu
>>>
>>> 2018年6月13日(水) 0:37 Sungwoo Park :
>>>
 Hello Hive users,

 I am experience a problem with MetaStore in Hive 3.0.

 1. Start MetaStore
 with 
 hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.ObjectStore.

 2. Generate TPC-DS data.

 3. TPC-DS queries run okay and produce correct results. E.g., from
 query 1:
 +---+
 |   c_customer_id   |
 +---+
 | CHAA  |
 | DCAA  |
 | DDAA  |
 ...
 | AAAILIAA  |
 +---+
 100 rows selected (69.901 seconds)

 However, the query compilation takes long (
 https://issues.apache.org/jira/browse/HIVE-16520).

 4. Now, restart MetaStore with
 hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.cache.CachedStore.

 5. TPC-DS queries run okay, but produce wrong results. E.g, from query
 1:
 ++
 | c_customer_id  |
 ++
 ++
 No rows selected (37.448 seconds)

 What I noticed is that with hive.metastore.rawstore.impl=CachedStore,
 HiveServer2 produces such log messages:

 2018-06-12T23:50:04,223  WARN [b3041385-0290-492f-aef8-c0249de328ad
 HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
 tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
 2018-06-12T23:50:04,223  INFO [b3041385-0290-492f-aef8-c0249de328ad
 HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
 tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
 2018-06-12T23:50:04,225  WARN [b3041385-0290-492f-aef8-c0249de328ad
 HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
 tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk
 2018-06-12T23:50:04,225  INFO [b3041385-0290-492f-aef8-c0249de328ad
 HiveServer2-Handler-Pool: Thread-59] 

Re: CachedStore for hive.metastore.rawstore.impl in Hive 3.0

2024-02-28 Thread Pau Tallada
Hi,

We also had to disable CachedStore as it was producing wrong results in our
queries.
I'm sorry I cannot provide more detailed info.

Cheers,

Pau.

Missatge de Takanobu Asanuma  del dia dc., 28 de febr.
2024 a les 13:59:

> Thanks for your detailed answer!
>
> In the original email, you reported "the query compilation takes long" in
> Hive 3.0, but has this issue been resolved in your fork of Hive 3.1.3?
> Thank you for sharing the issue with CachedStore and the JIRA tickets.
> I will also try out metastore.stats.fetch.bitvector=true.
>
> Regards,
> - Takanobu
>
> 2024年2月28日(水) 18:49 Sungwoo Park :
>
>> Hello Takanobu,
>>
>> We did not test with vanilla Hive 3.1.3 and Metastore databases can be
>> different, so I don't know why Metastore responses are very slow. I can
>> only share some results of testing CachedStore in Metastore. Please note
>> that we did not use vanilla Hive 3.1.3 and instead used our own fork of
>> Hive 3.1.3 (which applies many additional patches).
>>
>> 1.
>> When CachedStore is enabled, column stats are not computed. As a result,
>> some queries generate very inefficient plans because of wrong/inaccurate
>> stats.
>>
>> Perhaps this is because not all patches for CachedStore have been merged
>> to Hive 3.1.3. For example, these patches are not merged. Or, there might
>> be some way to properly configure CachedStore so that it correctly computes
>> column stats.
>>
>> HIVE-20896: CachedStore fail to cache stats in multiple code paths
>> HIVE-21063: Support statistics in cachedStore for transactional table
>> HIVE-24258: Data mismatch between CachedStore and ObjectStore for
>> constraint
>>
>> So, we decided that CachedStore should not be enabled in Hive 3.1.3.
>>
>> (If anyone is running Hive Metastore 3.1.3 in production with CachedStore
>> enabled, please let us know how you configure it.)
>>
>> 2.
>> Setting metastore.stats.fetch.bitvector=true can also help generate more
>> efficient query plans.
>>
>> --- Sungwoo
>>
>>
>> On Wed, Feb 28, 2024 at 1:40 PM Takanobu Asanuma 
>> wrote:
>>
>>> Hi Sungwoo Park,
>>>
>>> I'm sorry for the late reply to this old email.
>>> We are attempting to upgrade Hive MetaStore from Hive1 to Hive3, and
>>> noticed that the response of the Hive3 MetaStore is very slow.
>>> We suspect that HIVE-14187 might be causing this slowness.
>>> Could you tell me if you have resolved this problem? Are there still any
>>> problems when you enable CachedStore?
>>>
>>> Regards,
>>> - Takanobu
>>>
>>> 2018年6月13日(水) 0:37 Sungwoo Park :
>>>
 Hello Hive users,

 I am experience a problem with MetaStore in Hive 3.0.

 1. Start MetaStore
 with 
 hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.ObjectStore.

 2. Generate TPC-DS data.

 3. TPC-DS queries run okay and produce correct results. E.g., from
 query 1:
 +---+
 |   c_customer_id   |
 +---+
 | CHAA  |
 | DCAA  |
 | DDAA  |
 ...
 | AAAILIAA  |
 +---+
 100 rows selected (69.901 seconds)

 However, the query compilation takes long (
 https://issues.apache.org/jira/browse/HIVE-16520).

 4. Now, restart MetaStore with
 hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.cache.CachedStore.

 5. TPC-DS queries run okay, but produce wrong results. E.g, from query
 1:
 ++
 | c_customer_id  |
 ++
 ++
 No rows selected (37.448 seconds)

 What I noticed is that with hive.metastore.rawstore.impl=CachedStore,
 HiveServer2 produces such log messages:

 2018-06-12T23:50:04,223  WARN [b3041385-0290-492f-aef8-c0249de328ad
 HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
 tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
 2018-06-12T23:50:04,223  INFO [b3041385-0290-492f-aef8-c0249de328ad
 HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
 tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
 2018-06-12T23:50:04,225  WARN [b3041385-0290-492f-aef8-c0249de328ad
 HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
 tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk
 2018-06-12T23:50:04,225  INFO [b3041385-0290-492f-aef8-c0249de328ad
 HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
 tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk
 2018-06-12T23:50:04,226  WARN [b3041385-0290-492f-aef8-c0249de328ad
 HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
 tpcds_bin_partitioned_orc_1000@customer, Columns: c_customer_sk,
 c_customer_id
 2018-06-12T23:50:04,226  INFO [b3041385-0290-492f-aef8-c0249de328ad
 HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
 

Re: CachedStore for hive.metastore.rawstore.impl in Hive 3.0

2024-02-28 Thread Takanobu Asanuma
Thanks for your detailed answer!

In the original email, you reported "the query compilation takes long" in
Hive 3.0, but has this issue been resolved in your fork of Hive 3.1.3?
Thank you for sharing the issue with CachedStore and the JIRA tickets.
I will also try out metastore.stats.fetch.bitvector=true.

Regards,
- Takanobu

2024年2月28日(水) 18:49 Sungwoo Park :

> Hello Takanobu,
>
> We did not test with vanilla Hive 3.1.3 and Metastore databases can be
> different, so I don't know why Metastore responses are very slow. I can
> only share some results of testing CachedStore in Metastore. Please note
> that we did not use vanilla Hive 3.1.3 and instead used our own fork of
> Hive 3.1.3 (which applies many additional patches).
>
> 1.
> When CachedStore is enabled, column stats are not computed. As a result,
> some queries generate very inefficient plans because of wrong/inaccurate
> stats.
>
> Perhaps this is because not all patches for CachedStore have been merged
> to Hive 3.1.3. For example, these patches are not merged. Or, there might
> be some way to properly configure CachedStore so that it correctly computes
> column stats.
>
> HIVE-20896: CachedStore fail to cache stats in multiple code paths
> HIVE-21063: Support statistics in cachedStore for transactional table
> HIVE-24258: Data mismatch between CachedStore and ObjectStore for
> constraint
>
> So, we decided that CachedStore should not be enabled in Hive 3.1.3.
>
> (If anyone is running Hive Metastore 3.1.3 in production with CachedStore
> enabled, please let us know how you configure it.)
>
> 2.
> Setting metastore.stats.fetch.bitvector=true can also help generate more
> efficient query plans.
>
> --- Sungwoo
>
>
> On Wed, Feb 28, 2024 at 1:40 PM Takanobu Asanuma 
> wrote:
>
>> Hi Sungwoo Park,
>>
>> I'm sorry for the late reply to this old email.
>> We are attempting to upgrade Hive MetaStore from Hive1 to Hive3, and
>> noticed that the response of the Hive3 MetaStore is very slow.
>> We suspect that HIVE-14187 might be causing this slowness.
>> Could you tell me if you have resolved this problem? Are there still any
>> problems when you enable CachedStore?
>>
>> Regards,
>> - Takanobu
>>
>> 2018年6月13日(水) 0:37 Sungwoo Park :
>>
>>> Hello Hive users,
>>>
>>> I am experience a problem with MetaStore in Hive 3.0.
>>>
>>> 1. Start MetaStore
>>> with 
>>> hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.ObjectStore.
>>>
>>> 2. Generate TPC-DS data.
>>>
>>> 3. TPC-DS queries run okay and produce correct results. E.g., from query
>>> 1:
>>> +---+
>>> |   c_customer_id   |
>>> +---+
>>> | CHAA  |
>>> | DCAA  |
>>> | DDAA  |
>>> ...
>>> | AAAILIAA  |
>>> +---+
>>> 100 rows selected (69.901 seconds)
>>>
>>> However, the query compilation takes long (
>>> https://issues.apache.org/jira/browse/HIVE-16520).
>>>
>>> 4. Now, restart MetaStore with
>>> hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.cache.CachedStore.
>>>
>>> 5. TPC-DS queries run okay, but produce wrong results. E.g, from query 1:
>>> ++
>>> | c_customer_id  |
>>> ++
>>> ++
>>> No rows selected (37.448 seconds)
>>>
>>> What I noticed is that with hive.metastore.rawstore.impl=CachedStore,
>>> HiveServer2 produces such log messages:
>>>
>>> 2018-06-12T23:50:04,223  WARN [b3041385-0290-492f-aef8-c0249de328ad
>>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
>>> tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
>>> 2018-06-12T23:50:04,223  INFO [b3041385-0290-492f-aef8-c0249de328ad
>>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
>>> tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
>>> 2018-06-12T23:50:04,225  WARN [b3041385-0290-492f-aef8-c0249de328ad
>>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
>>> tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk
>>> 2018-06-12T23:50:04,225  INFO [b3041385-0290-492f-aef8-c0249de328ad
>>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
>>> tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk
>>> 2018-06-12T23:50:04,226  WARN [b3041385-0290-492f-aef8-c0249de328ad
>>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
>>> tpcds_bin_partitioned_orc_1000@customer, Columns: c_customer_sk,
>>> c_customer_id
>>> 2018-06-12T23:50:04,226  INFO [b3041385-0290-492f-aef8-c0249de328ad
>>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
>>> tpcds_bin_partitioned_orc_1000@customer, Columns: c_customer_sk,
>>> c_customer_id
>>>
>>> 2018-06-12T23:50:05,158 ERROR [b3041385-0290-492f-aef8-c0249de328ad
>>> HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
>>> Invalid column stats: No of nulls > cardinality
>>> 2018-06-12T23:50:05,159 ERROR [b3041385-0290-492f-aef8-c0249de328ad
>>> 

Re: CachedStore for hive.metastore.rawstore.impl in Hive 3.0

2024-02-28 Thread Sungwoo Park
Hello Takanobu,

We did not test with vanilla Hive 3.1.3 and Metastore databases can be
different, so I don't know why Metastore responses are very slow. I can
only share some results of testing CachedStore in Metastore. Please note
that we did not use vanilla Hive 3.1.3 and instead used our own fork of
Hive 3.1.3 (which applies many additional patches).

1.
When CachedStore is enabled, column stats are not computed. As a result,
some queries generate very inefficient plans because of wrong/inaccurate
stats.

Perhaps this is because not all patches for CachedStore have been merged to
Hive 3.1.3. For example, these patches are not merged. Or, there might be
some way to properly configure CachedStore so that it correctly computes
column stats.

HIVE-20896: CachedStore fail to cache stats in multiple code paths
HIVE-21063: Support statistics in cachedStore for transactional table
HIVE-24258: Data mismatch between CachedStore and ObjectStore for constraint

So, we decided that CachedStore should not be enabled in Hive 3.1.3.

(If anyone is running Hive Metastore 3.1.3 in production with CachedStore
enabled, please let us know how you configure it.)

2.
Setting metastore.stats.fetch.bitvector=true can also help generate more
efficient query plans.

--- Sungwoo


On Wed, Feb 28, 2024 at 1:40 PM Takanobu Asanuma 
wrote:

> Hi Sungwoo Park,
>
> I'm sorry for the late reply to this old email.
> We are attempting to upgrade Hive MetaStore from Hive1 to Hive3, and
> noticed that the response of the Hive3 MetaStore is very slow.
> We suspect that HIVE-14187 might be causing this slowness.
> Could you tell me if you have resolved this problem? Are there still any
> problems when you enable CachedStore?
>
> Regards,
> - Takanobu
>
> 2018年6月13日(水) 0:37 Sungwoo Park :
>
>> Hello Hive users,
>>
>> I am experience a problem with MetaStore in Hive 3.0.
>>
>> 1. Start MetaStore
>> with 
>> hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.ObjectStore.
>>
>> 2. Generate TPC-DS data.
>>
>> 3. TPC-DS queries run okay and produce correct results. E.g., from query
>> 1:
>> +---+
>> |   c_customer_id   |
>> +---+
>> | CHAA  |
>> | DCAA  |
>> | DDAA  |
>> ...
>> | AAAILIAA  |
>> +---+
>> 100 rows selected (69.901 seconds)
>>
>> However, the query compilation takes long (
>> https://issues.apache.org/jira/browse/HIVE-16520).
>>
>> 4. Now, restart MetaStore with
>> hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.cache.CachedStore.
>>
>> 5. TPC-DS queries run okay, but produce wrong results. E.g, from query 1:
>> ++
>> | c_customer_id  |
>> ++
>> ++
>> No rows selected (37.448 seconds)
>>
>> What I noticed is that with hive.metastore.rawstore.impl=CachedStore,
>> HiveServer2 produces such log messages:
>>
>> 2018-06-12T23:50:04,223  WARN [b3041385-0290-492f-aef8-c0249de328ad
>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
>> tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
>> 2018-06-12T23:50:04,223  INFO [b3041385-0290-492f-aef8-c0249de328ad
>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
>> tpcds_bin_partitioned_orc_1000@date_dim, Columns: d_date_sk, d_year
>> 2018-06-12T23:50:04,225  WARN [b3041385-0290-492f-aef8-c0249de328ad
>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
>> tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk
>> 2018-06-12T23:50:04,225  INFO [b3041385-0290-492f-aef8-c0249de328ad
>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
>> tpcds_bin_partitioned_orc_1000@store, Columns: s_state, s_store_sk
>> 2018-06-12T23:50:04,226  WARN [b3041385-0290-492f-aef8-c0249de328ad
>> HiveServer2-Handler-Pool: Thread-59] calcite.RelOptHiveTable: No Stats for
>> tpcds_bin_partitioned_orc_1000@customer, Columns: c_customer_sk,
>> c_customer_id
>> 2018-06-12T23:50:04,226  INFO [b3041385-0290-492f-aef8-c0249de328ad
>> HiveServer2-Handler-Pool: Thread-59] SessionState: No Stats for
>> tpcds_bin_partitioned_orc_1000@customer, Columns: c_customer_sk,
>> c_customer_id
>>
>> 2018-06-12T23:50:05,158 ERROR [b3041385-0290-492f-aef8-c0249de328ad
>> HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
>> Invalid column stats: No of nulls > cardinality
>> 2018-06-12T23:50:05,159 ERROR [b3041385-0290-492f-aef8-c0249de328ad
>> HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
>> Invalid column stats: No of nulls > cardinality
>> 2018-06-12T23:50:05,160 ERROR [b3041385-0290-492f-aef8-c0249de328ad
>> HiveServer2-Handler-Pool: Thread-59] annotation.StatsRulesProcFactory:
>> Invalid column stats: No of nulls > cardinality
>>
>> However, even after computing column stats, queries still return wrong
>> results, despite the fact that the above log messages disappear.
>>
>> I guess I am missing some configuration parameters