
[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-29 Thread Peter Vary (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16527294#comment-16527294
 ] 

Peter Vary commented on HIVE-19418:
---

[~sershe]: But of course. Reverting the whole change was never my intention :). 
In Hungary we say: Do not throw out the baby with the bathwater :)

Filed: HIVE-20034

> add background stats updater similar to compactor
> -
>
> Key: HIVE-19418
> URL: https://issues.apache.org/jira/browse/HIVE-19418
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Priority: Major
> Fix For: 3.1.0, 4.0.0
>
> Attachments: HIVE-19418.01.patch, HIVE-19418.02.patch, 
> HIVE-19418.03.patch, HIVE-19418.04.patch, HIVE-19418.05.patch, 
> HIVE-19418.06.patch, HIVE-19418.06.patch, HIVE-19418.07.patch, 
> HIVE-19418.07.patch, HIVE-19418.patch
>
>
> There's a JIRA HIVE-19416 to add snapshot version to stats for MM/ACID tables 
> to make them usable in a transaction without breaking ACID (for metadata-only 
> optimization). However, stats for ACID tables can still become unusable if 
> e.g. two parallel inserts run - neither sees the data written by the other, 
> so after both finish, the snapshots on either set of stats won't match the 
> current snapshot and the stats will be unusable.
> Additionally, for ACID and non-ACID tables alike, a lot of the stats, with 
> some exceptions like numRows, cannot be aggregated (i.e. you cannot combine 
> ndvs from two inserts), and for ACID even less can be aggregated (you cannot 
> derive min/max if some rows are deleted but you don't scan the rest of the 
> dataset).
> Therefore we will add background logic to metastore (similar to, and 
> partially inside, the ACID compactor) to update stats.
> It will have 3 modes of operation.
> 1) Off.
> 2) Update only the stats that exist but are out of date (generating stats can 
> be expensive, so if the user is only analyzing a subset of tables it should 
> be able to only update that subset). We can simply look at existing stats and 
> only analyze for the relevant partitions and columns.
> 3) On: 2 + create stats for all tables and columns missing stats.
> There will also be a table parameter to skip stats update. 
> In phase 1, the process will operate outside of compactor, and run analyze 
> command on the table. The analyze command will automatically save the stats 
> with ACID snapshot information if needed, based on HIVE-19416, so we don't 
> need to do any special state management and this will work for all table 
> types. However it's also more expensive.
> In phase 2, we can explore adding stats collection during MM compaction that 
> uses a temp table. If we don't have open writers during major compaction (so 
> we overwrite all of the data), the temp table stats can simply be copied over 
> to the main table with correct snapshot information, saving us a table scan.
> In phase 3, we can add custom stats collection logic to full ACID compactor 
> that is not query based, the same way as we'd do for (2). Alternatively we 
> can wait for ACID compactor to become query based and just reuse (2).
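
The description's point that numRows can be aggregated across inserts while NDVs cannot can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not part of Hive; the sketch only contrasts an additive statistic with a plain distinct-value count, which needs the underlying values (that is, a rescan) to be combined correctly.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not Hive code.
public class StatsMergeSketch {

    // Row counts are additive: the rows written by two inserts simply add up.
    static long mergeNumRows(long a, long b) {
        return a + b;
    }

    // An exact NDV needs the values themselves, which is what a rescan provides.
    static long exactNdv(List<Integer> first, List<Integer> second) {
        Set<Integer> distinct = new HashSet<>(first);
        distinct.addAll(second);
        return distinct.size();
    }

    public static void main(String[] args) {
        List<Integer> insertA = Arrays.asList(1, 2, 3);
        List<Integer> insertB = Arrays.asList(3, 4);

        long ndvA = new HashSet<>(insertA).size();
        long ndvB = new HashSet<>(insertB).size();

        // numRows: adding the per-insert counts gives the right answer.
        System.out.println("numRows = " + mergeNumRows(insertA.size(), insertB.size()));
        // NDV: adding the per-insert counts double-counts the shared value 3.
        System.out.println("ndvA + ndvB = " + (ndvA + ndvB));
        System.out.println("true NDV = " + exactNdv(insertA, insertB));
    }
}
```

Summing the two NDV counts overstates the result whenever the inserts share values, and the counts alone do not say how many values are shared. That is why a background job that re-runs analyze is proposed instead of merging per-insert stats.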



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-28 Thread Sergey Shelukhin (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526670#comment-16526670
 ] 

Sergey Shelukhin commented on HIVE-19418:
-

Sure, works for me. I feel these calls (with nulls, etc.) are silly enough that 
no one should be intentionally handling exceptions for them, so it can throw 
anything.
Please don't revert the functionality of this patch though...


[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-28 Thread Alan Gates (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526553#comment-16526553
 ] 

Alan Gates commented on HIVE-19418:
---

I prefer to keep the exceptions the same for backwards compatibility, so I'd be 
in favor of changing it back.


[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-28 Thread Peter Vary (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526097#comment-16526097
 ] 

Peter Vary commented on HIVE-19418:
---

[~sershe], [~alangates]: As per our discussion on hive-dev list (See: [Apache 
dev list 
archive|http://mail-archives.apache.org/mod_mbox/hive-dev/201712.mbox/%3CCDF09DF1-746E-4A4F-8644-8B441F386937%40cloudera.com%3E]),
 I think the consensus was that we would like to keep the original exceptions 
for the MetaStore Thrift API.

Shall I create a new patch that reverts the part of this change where 
{{HiveMetaStoreClient.createTable}} started to throw 
{{InvalidOperationException}} instead of the original {{MetaException}}? Or do 
we see enough reason to reopen the discussion?
 Thanks,
 Peter
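
To make the backward-compatibility concern concrete, here is a minimal sketch of how a caller written against the old contract behaves once the thrown type changes. The two exception classes are local stand-ins, not the real Thrift-generated Hive classes, and are assumed to be unrelated sibling types; `createTable` and `legacyCaller` are hypothetical helpers for illustration only.

```java
// Hypothetical illustration only; not Hive code.
public class ExceptionCompatSketch {

    // Stand-ins for the metastore exception types, assumed here to be siblings,
    // so a catch clause for one does not match the other.
    static class MetaException extends Exception {
        MetaException(String message) { super(message); }
    }

    static class InvalidOperationException extends Exception {
        InvalidOperationException(String message) { super(message); }
    }

    // Hypothetical client call; 'newBehavior' selects which exception a null name triggers.
    static void createTable(String name, boolean newBehavior)
            throws MetaException, InvalidOperationException {
        if (name == null) {
            if (newBehavior) {
                throw new InvalidOperationException("table name is null");
            }
            throw new MetaException("table name is null");
        }
    }

    // A caller written against the old contract: it only expects MetaException.
    static String legacyCaller(boolean newBehavior) {
        try {
            createTable(null, newBehavior);
            return "created";
        } catch (MetaException e) {
            return "handled";
        } catch (Exception e) {
            // In real client code this would typically propagate to the caller's caller.
            return "unhandled: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(legacyCaller(false)); // old behavior reaches the MetaException handler
        System.out.println(legacyCaller(true));  // new behavior bypasses it
    }
}
```

Under these assumptions, existing handlers for the original exception type silently stop matching, which is the compatibility risk raised in this comment.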


[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-25 Thread Peter Vary (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16521996#comment-16521996
 ] 

Peter Vary commented on HIVE-19418:
---

Hi [~sershe],
I see that this patch changed the exception thrown by 
{{HiveMetaStoreClient.alter_table}} and {{HiveMetaStoreClient.createTable}} 
from {{MetaException}} to {{InvalidOperationException}}. Is this an intentional 
change? I am asking because there was a discussion on the dev list about 
cleaning up the exception handling of the MetaStore Thrift interface, but we 
ultimately decided against it in order to keep backward compatibility. Do we 
have different plans for 4.0.0?
Thanks,
Peter


[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-12 Thread Sergey Shelukhin (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510136#comment-16510136
 ] 

Sergey Shelukhin commented on HIVE-19418:
-

Fixed


[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-12 Thread Sergey Shelukhin (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510038#comment-16510038
 ] 

Sergey Shelukhin commented on HIVE-19418:
-

Let me fix this in an addendum commit on branch-3.
We've increased the timeouts for TestStatsUpdaterThread; however, the timeout 
for one of the tests may need to be increased further.


[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-11 Thread Alisha Prabhu (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509188#comment-16509188
 ] 

Alisha Prabhu commented on HIVE-19418:
--

Hi [~sershe], [~kgyrtkirk],
I was able to reproduce the failure for TestStatsUpdaterThread in our local 
environments (x86, ppc64le). The test case passed after increasing the 
timeout.


[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-11 Thread Vineet Garg (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509120#comment-16509120
 ] 

Vineet Garg commented on HIVE-19418:


[~sershe] This patch in branch-3 is causing the following failures (I have 
confirmed this using git bisect):
* org.apache.hadoop.hive.metastore.client.TestFunctions.testCreateFunctionNullDatabaseName[Embedded]
* org.apache.hadoop.hive.metastore.client.TestFunctions.testCreateFunctionNullDatabaseName[Remote]

Can you please fix these tests or revert your patch from branch-3? I don't 
understand why this was pushed to branch-3 without a test run.

Ref: https://builds.apache.org/job/PreCommit-HIVE-Build/11694/testReport/


[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-07 Thread Sergey Shelukhin (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505084#comment-16505084
 ] 

Sergey Shelukhin commented on HIVE-19418:
-

Looks like it was making progress when it died; the metastore was just slow.
I'll increase the timeouts in an addendum patch.

> In phase 2, we can explore adding stats collection during MM compaction that 
> uses a temp table. If we don't have open writers during major compaction (so 
> we overwrite all of the data), the temp table stats can simply be copied over 
> to the main table with correct snapshot information, saving us a table scan.
> In phase 3, we can add custom stats collection logic to full ACID compactor 
> that is not query based, the same way as we'd do for (2). Alternatively we 
> can wait for ACID compactor to become query based and just reuse (2).
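
The description's claim that numRows can be aggregated across inserts while NDVs and min/max cannot may be easier to see with a small example. The Java sketch below is an editorial illustration only; the class and method names are hypothetical and it uses exact distinct counts on in-memory lists, not the metastore's statistics objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustration of which stats can be combined from per-insert values; names are hypothetical. */
final class StatsAggregationSketch {

    /** Exact number of distinct values in one batch of inserted rows. */
    static long ndv(List<Integer> values) {
        return new HashSet<>(values).size();
    }

    /** True NDV of two batches; needs the values themselves, not just the two counts. */
    static long combinedNdv(List<Integer> first, List<Integer> second) {
        Set<Integer> all = new HashSet<>(first);
        all.addAll(second);
        return all.size();
    }

    /** True max after deleting one occurrence of a value; requires rescanning the remaining rows. */
    static int maxAfterDelete(List<Integer> values, int deleted) {
        List<Integer> remaining = new ArrayList<>(values);
        remaining.remove(Integer.valueOf(deleted));
        return Collections.max(remaining);
    }
}
```

With inserts `[1, 2, 3]` and `[3, 4]`, the row counts add up to 5, but the per-insert NDVs (3 and 2) only bound the true NDV of 4 between 3 and 5. Likewise, once the row holding the stored max is deleted, the new max can only be found by scanning what is left.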




[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-07 Thread Zoltan Haindrich (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504359#comment-16504359
 ] 

Zoltan Haindrich commented on HIVE-19418:
-

org.apache.hadoop.hive.ql.stats.TestStatsUpdaterThread has failed, but I was 
not able to reproduce it (I ran it more than 3 times).
See HIVE-19237: 
https://builds.apache.org/job/PreCommit-HIVE-Build/11575/testReport/

> add background stats updater similar to compactor
> -
>
> Key: HIVE-19418
> URL: https://issues.apache.org/jira/browse/HIVE-19418
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Fix For: 3.1.0, 4.0.0
>
> Attachments: HIVE-19418.01.patch, HIVE-19418.02.patch, 
> HIVE-19418.03.patch, HIVE-19418.04.patch, HIVE-19418.05.patch, 
> HIVE-19418.06.patch, HIVE-19418.06.patch, HIVE-19418.07.patch, 
> HIVE-19418.07.patch, HIVE-19418.patch
>
>




[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-06 Thread Hive QA (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503568#comment-16503568
 ] 

Hive QA commented on HIVE-19418:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12926621/HIVE-19418.07.patch

SUCCESS: +1 due to 2 test(s) being added or modified.

SUCCESS: +1 due to 14476 tests passed

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/11562/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/11562/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-11562/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12926621 - PreCommit-HIVE-Build

> add background stats updater similar to compactor
> -
>
> Key: HIVE-19418
> URL: https://issues.apache.org/jira/browse/HIVE-19418
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HIVE-19418.01.patch, HIVE-19418.02.patch, 
> HIVE-19418.03.patch, HIVE-19418.04.patch, HIVE-19418.05.patch, 
> HIVE-19418.06.patch, HIVE-19418.06.patch, HIVE-19418.07.patch, 
> HIVE-19418.07.patch, HIVE-19418.patch
>
>




[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-06 Thread Hive QA (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503531#comment-16503531
 ] 

Hive QA commented on HIVE-19418:


| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
|| || || || master Compile Tests ||
| 0 | mvndep | 0m 31s | Maven dependency ordering for branch |
| +1 | mvninstall | 6m 20s | master passed |
| +1 | compile | 1m 47s | master passed |
| +1 | checkstyle | 1m 8s | master passed |
| 0 | findbugs | 3m 27s | ql in master has 2280 extant Findbugs warnings. |
| 0 | findbugs | 2m 32s | standalone-metastore in master has 214 extant Findbugs warnings. |
| +1 | javadoc | 1m 45s | master passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 8s | Maven dependency ordering for patch |
| -1 | mvninstall | 0m 16s | hcatalog-unit in the patch failed. |
| -1 | mvninstall | 0m 26s | ql in the patch failed. |
| +1 | compile | 1m 47s | the patch passed |
| +1 | javac | 1m 47s | the patch passed |
| -1 | checkstyle | 0m 9s | itests/hcatalog-unit: The patch generated 1 new + 27 unchanged - 0 fixed = 28 total (was 27) |
| -1 | checkstyle | 0m 33s | ql: The patch generated 32 new + 120 unchanged - 0 fixed = 152 total (was 120) |
| -1 | checkstyle | 0m 25s | standalone-metastore: The patch generated 16 new + 1470 unchanged - 0 fixed = 1486 total (was 1470) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| -1 | findbugs | 3m 34s | ql generated 4 new + 2280 unchanged - 0 fixed = 2284 total (was 2280) |
| -1 | findbugs | 2m 51s | standalone-metastore generated 2 new + 213 unchanged - 1 fixed = 215 total (was 214) |
| +1 | javadoc | 1m 50s | the patch passed |
|| || || || Other Tests ||
| -1 | asflicense | 0m 11s | The patch generated 2 ASF License warnings. |
| | | 31m 7s | |

|| Reason || Tests ||
| FindBugs | module:ql |
|  |  Dead store to writeIds in org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.processOneTable(MetaStoreUtils$FullTableName)  At StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.processOneTable(MetaStoreUtils$FullTableName)  At StatsUpdaterThread.java:[line 221] |
|  |  Synchronization performed on java.util.concurrent.atomic.AtomicInteger in org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.markAnalyzeDone(StatsUpdaterThread$AnalyzeWork)  At StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.markAnalyzeDone(StatsUpdaterThread$AnalyzeWork)  At StatsUpdaterThread.java:[line 501] |
|  |  Synchronization performed on java.util.concurrent.atomic.AtomicInteger in org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.waitForQueuedCommands()  At StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.waitForQueuedCommands()  At StatsUpdaterThread.java:[line 642] |
|  |  org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.buildPartColStr(Table)
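
For readers unfamiliar with the FindBugs complaint about synchronizing on an `AtomicInteger`: the warning fires when a `java.util.concurrent` object is used as a monitor, which mixes two locking schemes on one object. The Java sketch below shows one common way to avoid it, using a dedicated private lock object and a plain `int`. It is an editorial illustration, not the code in `StatsUpdaterThread`; the class and method names are hypothetical.

```java
/** Counts queued work items and lets a caller wait until all are done; names are hypothetical. */
final class QueuedWorkCounter {
    // A dedicated monitor, so no java.util.concurrent object is ever used in a synchronized block.
    private final Object lock = new Object();
    private int pending;  // guarded by lock

    /** Records that one more work item has been queued. */
    void workQueued() {
        synchronized (lock) {
            pending++;
        }
    }

    /** Records that one work item finished and wakes waiters when nothing is left. */
    void workDone() {
        synchronized (lock) {
            pending--;
            if (pending <= 0) {
                lock.notifyAll();
            }
        }
    }

    /** Blocks until no work is pending; returns false if the thread was interrupted first. */
    boolean awaitIdle() {
        synchronized (lock) {
            try {
                while (pending > 0) {
                    lock.wait();
                }
                return true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // preserve the interrupt for the caller
                return false;
            }
        }
    }

    /** Current number of pending work items. */
    int pending() {
        synchronized (lock) {
            return pending;
        }
    }
}
```

If waiting is not needed, keeping the `AtomicInteger` and using only its atomic methods (`incrementAndGet`, `decrementAndGet`) without any `synchronized` block also avoids the warning.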

[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-05 Thread Hive QA (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502295#comment-16502295
 ] 

Hive QA commented on HIVE-19418:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12926449/HIVE-19418.07.patch

SUCCESS: +1 due to 2 test(s) being added or modified.

ERROR: -1 due to 7 failed/errored test(s), 14475 tests executed
*Failed tests:*
{noformat}
org.apache.hive.jdbc.TestTriggersWorkloadManager.testTriggerCustomCreatedDynamicPartitions (batchId=241)
org.apache.hive.jdbc.TestTriggersWorkloadManager.testTriggerCustomCreatedDynamicPartitionsMultiInsert (batchId=241)
org.apache.hive.jdbc.TestTriggersWorkloadManager.testTriggerCustomCreatedDynamicPartitionsUnionAll (batchId=241)
org.apache.hive.jdbc.TestTriggersWorkloadManager.testTriggerCustomNonExistent (batchId=241)
org.apache.hive.jdbc.TestTriggersWorkloadManager.testTriggerHighBytesRead (batchId=241)
org.apache.hive.jdbc.TestTriggersWorkloadManager.testTriggerHighShuffleBytes (batchId=241)
org.apache.hive.jdbc.TestTriggersWorkloadManager.testTriggerVertexRawInputSplitsNoKill (batchId=241)
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/11529/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/11529/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-11529/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 7 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12926449 - PreCommit-HIVE-Build

> add background stats updater similar to compactor
> -
>
> Key: HIVE-19418
> URL: https://issues.apache.org/jira/browse/HIVE-19418
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HIVE-19418.01.patch, HIVE-19418.02.patch, 
> HIVE-19418.03.patch, HIVE-19418.04.patch, HIVE-19418.05.patch, 
> HIVE-19418.06.patch, HIVE-19418.06.patch, HIVE-19418.07.patch, 
> HIVE-19418.patch
>
>




[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-05 Thread Hive QA (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502238#comment-16502238
 ] 

Hive QA commented on HIVE-19418:


| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
|| || || || master Compile Tests ||
| 0 | mvndep | 0m 25s | Maven dependency ordering for branch |
| +1 | mvninstall | 5m 35s | master passed |
| +1 | compile | 1m 48s | master passed |
| +1 | checkstyle | 1m 9s | master passed |
| 0 | findbugs | 3m 22s | ql in master has 2280 extant Findbugs warnings. |
| 0 | findbugs | 2m 37s | standalone-metastore in master has 214 extant Findbugs warnings. |
| +1 | javadoc | 1m 48s | master passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 8s | Maven dependency ordering for patch |
| -1 | mvninstall | 0m 16s | hcatalog-unit in the patch failed. |
| -1 | mvninstall | 0m 27s | ql in the patch failed. |
| +1 | compile | 1m 48s | the patch passed |
| +1 | javac | 1m 48s | the patch passed |
| -1 | checkstyle | 0m 9s | itests/hcatalog-unit: The patch generated 1 new + 27 unchanged - 0 fixed = 28 total (was 27) |
| -1 | checkstyle | 0m 33s | ql: The patch generated 32 new + 120 unchanged - 0 fixed = 152 total (was 120) |
| -1 | checkstyle | 0m 27s | standalone-metastore: The patch generated 16 new + 1470 unchanged - 0 fixed = 1486 total (was 1470) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| -1 | findbugs | 3m 39s | ql generated 4 new + 2280 unchanged - 0 fixed = 2284 total (was 2280) |
| -1 | findbugs | 2m 42s | standalone-metastore generated 2 new + 213 unchanged - 1 fixed = 215 total (was 214) |
| +1 | javadoc | 1m 47s | the patch passed |
|| || || || Other Tests ||
| -1 | asflicense | 0m 11s | The patch generated 2 ASF License warnings. |
| | | 30m 17s | |

|| Reason || Tests ||
| FindBugs | module:ql |
|  |  Dead store to writeIds in org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.processOneTable(MetaStoreUtils$FullTableName)  At StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.processOneTable(MetaStoreUtils$FullTableName)  At StatsUpdaterThread.java:[line 221] |
|  |  Synchronization performed on java.util.concurrent.atomic.AtomicInteger in org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.markAnalyzeDone(StatsUpdaterThread$AnalyzeWork)  At StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.markAnalyzeDone(StatsUpdaterThread$AnalyzeWork)  At StatsUpdaterThread.java:[line 501] |
|  |  Synchronization performed on java.util.concurrent.atomic.AtomicInteger in org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.waitForQueuedCommands()  At StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.waitForQueuedCommands()  At StatsUpdaterThread.java:[line 642] |
|  |  org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.buildPartColStr(Table)

[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-04 Thread Sergey Shelukhin (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16500844#comment-16500844
 ] 

Sergey Shelukhin commented on HIVE-19418:
-

Cannot repro this after a few runs and the logs are gone. Trying again...

> add background stats updater similar to compactor
> -
>
> Key: HIVE-19418
> URL: https://issues.apache.org/jira/browse/HIVE-19418
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HIVE-19418.01.patch, HIVE-19418.02.patch, 
> HIVE-19418.03.patch, HIVE-19418.04.patch, HIVE-19418.05.patch, 
> HIVE-19418.06.patch, HIVE-19418.06.patch, HIVE-19418.07.patch, 
> HIVE-19418.patch
>
>




[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-03 Thread Hive QA (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499470#comment-16499470
 ] 

Hive QA commented on HIVE-19418:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12926175/HIVE-19418.06.patch

ERROR: -1 due to build exiting with an error

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/11477/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/11477/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-11477/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Tests exited with: Exception: Patch URL https://issues.apache.org/jira/secure/attachment/12926175/HIVE-19418.06.patch was found in seen patch url's cache and a test was probably run already on it. Aborting...
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12926175 - PreCommit-HIVE-Build

> add background stats updater similar to compactor
> -
>
> Key: HIVE-19418
> URL: https://issues.apache.org/jira/browse/HIVE-19418
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HIVE-19418.01.patch, HIVE-19418.02.patch, 
> HIVE-19418.03.patch, HIVE-19418.04.patch, HIVE-19418.05.patch, 
> HIVE-19418.06.patch, HIVE-19418.06.patch, HIVE-19418.patch
>
>
> There's a JIRA HIVE-19416 to add snapshot version to stats for MM/ACID tables 
> to make them usable in a transaction without breaking ACID (for metadata-only 
> optimization). However, stats for ACID tables can still become unusable if 
> e.g. two parallel inserts run - neither sees the data written by the other, 
> so after both finish, the snapshots on either set of stats won't match the 
> current snapshot and the stats will be unusable.
> Additionally, for ACID and non-ACID tables alike, a lot of the stats, with 
> some exceptions like numRows, cannot be aggregated (i.e. you cannot combine 
> ndvs from two inserts), and for ACID even less can be aggregated (you cannot 
> derive min/max if some rows are deleted but you don't scan the rest of the 
> dataset).
> Therefore we will add background logic to metastore (similar to, and 
> partially inside, the ACID compactor) to update stats.
> It will have 3 modes of operation.
> 1) Off.
> 2) Update only the stats that exist but are out of date (generating stats can 
> be expensive, so if the user is only analyzing a subset of tables it should 
> be able to only update that subset). We can simply look at existing stats and 
> only analyze for the relevant partitions and columns.
> 3) On: 2 + create stats for all tables and columns missing stats.
> There will also be a table parameter to skip stats update. 
> In phase 1, the process will operate outside of compactor, and run analyze 
> command on the table. The analyze command will automatically save the stats 
> with ACID snapshot information if needed, based on HIVE-19416, so we don't 
> need to do any special state management and this will work for all table 
> types. However it's also more expensive.
> In phase 2, we can explore adding stats collection during MM compaction that 
> uses a temp table. If we don't have open writers during major compaction (so 
> we overwrite all of the data), the temp table stats can simply be copied over 
> to the main table with correct snapshot information, saving us a table scan.
> In phase 3, we can add custom stats collection logic to full ACID compactor 
> that is not query based, the same way as we'd do for (2). Alternatively we 
> can wait for ACID compactor to become query based and just reuse (2).
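
The three modes and the per-table skip parameter described above can be sketched as a small decision function. This is only an illustration: `StatsUpdateMode`, `TableStatsState`, and `StatsUpdatePlanner` are hypothetical names, not classes from the HIVE-19418 patch, and a real staleness check would work per partition and column rather than on a single boolean per table.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical mode switch mirroring modes 1) to 3) of the description. */
enum StatsUpdateMode { OFF, EXISTING, ALL }

/** Hypothetical, simplified view of what the updater knows about one table. */
final class TableStatsState {
    final String name;
    final boolean hasStats;     // stats were collected at some point
    final boolean statsCurrent; // existing stats still match the data
    final boolean skipUpdate;   // stands in for the "skip stats update" table parameter

    TableStatsState(String name, boolean hasStats, boolean statsCurrent, boolean skipUpdate) {
        this.name = name;
        this.hasStats = hasStats;
        this.statsCurrent = statsCurrent;
        this.skipUpdate = skipUpdate;
    }
}

final class StatsUpdatePlanner {
    /** Decides whether a background analyze should be queued for one table. */
    static boolean needsAnalyze(StatsUpdateMode mode, TableStatsState table) {
        if (mode == StatsUpdateMode.OFF || table.skipUpdate) {
            return false;                   // mode 1, or the table opted out
        }
        if (table.hasStats) {
            return !table.statsCurrent;     // modes 2 and 3: refresh stale stats only
        }
        return mode == StatsUpdateMode.ALL; // mode 3 only: create missing stats
    }

    /** Returns the names of the tables to analyze, in input order. */
    static List<String> tablesToAnalyze(StatsUpdateMode mode, List<TableStatsState> tables) {
        List<String> result = new ArrayList<>();
        for (TableStatsState table : tables) {
            if (needsAnalyze(mode, table)) {
                result.add(table.name);
            }
        }
        return result;
    }
}
```

In phase 1 of the proposal, each selected table would then be handed to the regular analyze command, which is why no special snapshot handling appears in the sketch.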



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-02 Thread Hive QA (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499119#comment-16499119
 ] 

Hive QA commented on HIVE-19418:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12926175/HIVE-19418.06.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 14446 tests executed
*Failed tests:*
{noformat}
TestActivePassiveHA - did not produce a TEST-*.xml file (likely timed out) (batchId=242)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/11444/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/11444/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-11444/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12926175 - PreCommit-HIVE-Build

> add background stats updater similar to compactor
> -
>
> Key: HIVE-19418
> URL: https://issues.apache.org/jira/browse/HIVE-19418
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HIVE-19418.01.patch, HIVE-19418.02.patch, 
> HIVE-19418.03.patch, HIVE-19418.04.patch, HIVE-19418.05.patch, 
> HIVE-19418.06.patch, HIVE-19418.06.patch, HIVE-19418.patch

[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-02 Thread Hive QA (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499116#comment-16499116
 ] 

Hive QA commented on HIVE-19418:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 29s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  6m 13s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 48s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 11s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 26s{color} | {color:blue} ql in master has 2278 extant Findbugs warnings. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  2m 35s{color} | {color:blue} standalone-metastore in master has 214 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 49s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m  8s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 17s{color} | {color:red} hcatalog-unit in the patch failed. {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 28s{color} | {color:red} ql in the patch failed. {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 45s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m  9s{color} | {color:red} itests/hcatalog-unit: The patch generated 1 new + 27 unchanged - 0 fixed = 28 total (was 27) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 35s{color} | {color:red} ql: The patch generated 32 new + 120 unchanged - 0 fixed = 152 total (was 120) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 27s{color} | {color:red} standalone-metastore: The patch generated 16 new + 1470 unchanged - 0 fixed = 1486 total (was 1470) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  3m 41s{color} | {color:red} ql generated 4 new + 2278 unchanged - 0 fixed = 2282 total (was 2278) {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  2m 52s{color} | {color:red} standalone-metastore generated 2 new + 213 unchanged - 1 fixed = 215 total (was 214) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 51s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 11s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 31m 26s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:ql |
|  |  Dead store to writeIds in org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.processOneTable(MetaStoreUtils$FullTableName)  At StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.processOneTable(MetaStoreUtils$FullTableName)  At StatsUpdaterThread.java:[line 221] |
|  |  Synchronization performed on java.util.concurrent.atomic.AtomicInteger in org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.markAnalyzeDone(StatsUpdaterThread$AnalyzeWork)  At StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.markAnalyzeDone(StatsUpdaterThread$AnalyzeWork)  At StatsUpdaterThread.java:[line 501] |
|  |  Synchronization performed on java.util.concurrent.atomic.AtomicInteger in org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.waitForQueuedCommands()  At StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.waitForQueuedCommands()  At StatsUpdaterThread.java:[line 642] |
|  |
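
The two "Synchronization performed on java.util.concurrent.atomic.AtomicInteger" warnings above are raised because the code takes the intrinsic monitor of an object from java.util.concurrent, which FindBugs treats as suspicious. One common way to avoid that pattern, sketched here under the assumption that the counter tracks outstanding analyze commands, is to guard a plain int with a dedicated lock object. `PendingWorkTracker` and its methods are hypothetical and are not taken from StatsUpdaterThread.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: counts queued work and lets a caller wait until it drains. */
final class PendingWorkTracker {
    private final Object lock = new Object(); // dedicated monitor, never exposed
    private int pending;                      // guarded by lock

    /** Records that one unit of work was queued. */
    void workQueued() {
        synchronized (lock) {
            pending++;
        }
    }

    /** Records that one unit of work finished and wakes waiters when none remain. */
    void workDone() {
        synchronized (lock) {
            pending--;
            if (pending == 0) {
                lock.notifyAll();
            }
        }
    }

    /** Waits until no work is pending; returns false if the timeout elapses first. */
    boolean awaitDrained(long timeoutMs) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        synchronized (lock) {
            while (pending > 0) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    return false;
                }
                // Releases the monitor while waiting; the loop handles spurious wakeups.
                TimeUnit.NANOSECONDS.timedWait(lock, remainingNanos);
            }
            return true;
        }
    }

    /** Returns the current number of pending units. */
    int pending() {
        synchronized (lock) {
            return pending;
        }
    }
}
```

Because the counter and the wait/notify calls share one private monitor, no java.util.concurrent object is used as a lock, which is the pattern the warning targets.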

[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-01 Thread Sergey Shelukhin (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498571#comment-16498571
 ] 

Sergey Shelukhin commented on HIVE-19418:
-

Updated the test; extra null checks in the helper caused some exception types 
to change.

> add background stats updater similar to compactor
> -
>
> Key: HIVE-19418
> URL: https://issues.apache.org/jira/browse/HIVE-19418
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HIVE-19418.01.patch, HIVE-19418.02.patch, 
> HIVE-19418.03.patch, HIVE-19418.04.patch, HIVE-19418.05.patch, 
> HIVE-19418.06.patch, HIVE-19418.patch

[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-01 Thread Hive QA (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498165#comment-16498165
 ] 

Hive QA commented on HIVE-19418:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12925979/HIVE-19418.05.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 14453 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.metastore.client.TestTablesCreateDropAlterTruncate.testAlterTableNullDatabaseInNew[Embedded] (batchId=211)
org.apache.hadoop.hive.metastore.client.TestTablesCreateDropAlterTruncate.testAlterTableNullDatabaseInNew[Remote] (batchId=211)
org.apache.hadoop.hive.metastore.client.TestTablesCreateDropAlterTruncate.testCreateTableNullDatabase[Embedded] (batchId=211)
org.apache.hadoop.hive.metastore.client.TestTablesCreateDropAlterTruncate.testCreateTableNullDatabase[Remote] (batchId=211)
org.apache.hive.jdbc.TestJdbcWithMiniLlapArrow.testDataTypes (batchId=242)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/11414/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/11414/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-11414/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12925979 - PreCommit-HIVE-Build

> add background stats updater similar to compactor
> -
>
> Key: HIVE-19418
> URL: https://issues.apache.org/jira/browse/HIVE-19418
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HIVE-19418.01.patch, HIVE-19418.02.patch, 
> HIVE-19418.03.patch, HIVE-19418.04.patch, HIVE-19418.05.patch, 
> HIVE-19418.patch

[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-06-01 Thread Hive QA (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498136#comment-16498136
 ] 

Hive QA commented on HIVE-19418:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 38s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  6m 46s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 54s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 11s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 36s{color} | {color:blue} ql in master has 2278 extant Findbugs warnings. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  2m 54s{color} | {color:blue} standalone-metastore in master has 214 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 55s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m  9s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 17s{color} | {color:red} hcatalog-unit in the patch failed. {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 31s{color} | {color:red} ql in the patch failed. {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 56s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 10s{color} | {color:red} itests/hcatalog-unit: The patch generated 1 new + 27 unchanged - 0 fixed = 28 total (was 27) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 36s{color} | {color:red} ql: The patch generated 32 new + 120 unchanged - 0 fixed = 152 total (was 120) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 29s{color} | {color:red} standalone-metastore: The patch generated 16 new + 1453 unchanged - 0 fixed = 1469 total (was 1453) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  3m 52s{color} | {color:red} ql generated 4 new + 2278 unchanged - 0 fixed = 2282 total (was 2278) {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  2m 57s{color} | {color:red} standalone-metastore generated 2 new + 213 unchanged - 1 fixed = 215 total (was 214) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 54s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 12s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 33m 26s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:ql |
|  |  Dead store to writeIds in org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.processOneTable(MetaStoreUtils$FullTableName)  At StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.processOneTable(MetaStoreUtils$FullTableName)  At StatsUpdaterThread.java:[line 221] |
|  |  Synchronization performed on java.util.concurrent.atomic.AtomicInteger in org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.markAnalyzeDone(StatsUpdaterThread$AnalyzeWork)  At StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.markAnalyzeDone(StatsUpdaterThread$AnalyzeWork)  At StatsUpdaterThread.java:[line 501] |
|  |  Synchronization performed on java.util.concurrent.atomic.AtomicInteger in org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.waitForQueuedCommands()  At StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.waitForQueuedCommands()  At StatsUpdaterThread.java:[line 642] |
|  |

[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-05-31 Thread Ashutosh Chauhan (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496690#comment-16496690
 ] 

Ashutosh Chauhan commented on HIVE-19418:
-

+1

> add background stats updater similar to compactor
> -
>
> Key: HIVE-19418
> URL: https://issues.apache.org/jira/browse/HIVE-19418
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HIVE-19418.01.patch, HIVE-19418.02.patch, 
> HIVE-19418.03.patch, HIVE-19418.04.patch, HIVE-19418.04.patch, 
> HIVE-19418.patch

[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-05-30 Thread Sergey Shelukhin (JIRA)



[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496000#comment-16496000
 ] 

Sergey Shelukhin commented on HIVE-19418:
-

Removed string concatenation

> add background stats updater similar to compactor
> -
>
> Key: HIVE-19418
> URL: https://issues.apache.org/jira/browse/HIVE-19418
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HIVE-19418.01.patch, HIVE-19418.02.patch, 
> HIVE-19418.03.patch, HIVE-19418.04.patch, HIVE-19418.04.patch, 
> HIVE-19418.patch

[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-05-28 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492700#comment-16492700
 ] 

Hive QA commented on HIVE-19418:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12925235/HIVE-19418.03.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 14416 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.metastore.client.TestTablesCreateDropAlterTruncate.testCreateTableNullDatabase[Embedded] (batchId=210)
org.apache.hadoop.hive.metastore.client.TestTablesCreateDropAlterTruncate.testCreateTableNullDatabase[Remote] (batchId=210)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/11297/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/11297/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-11297/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12925235 - PreCommit-HIVE-Build

> add background stats updater similar to compactor
> -
>
> Key: HIVE-19418
> URL: https://issues.apache.org/jira/browse/HIVE-19418
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HIVE-19418.01.patch, HIVE-19418.02.patch, 
> HIVE-19418.03.patch, HIVE-19418.patch
>
>
> There's a JIRA HIVE-19416 to add snapshot version to stats for MM/ACID tables 
> to make them usable in a transaction without breaking ACID (for metadata-only 
> optimization). However, stats for ACID tables can still become unusable if 
> e.g. two parallel inserts run - neither sees the data written by the other, 
> so after both finish, the snapshots on either set of stats won't match the 
> current snapshot and the stats will be unusable.
> Additionally, for ACID and non-ACID tables alike, a lot of the stats, with 
> some exceptions like numRows, cannot be aggregated (i.e. you cannot combine 
> ndvs from two inserts), and for ACID even less can be aggregated (you cannot 
> derive min/max if some rows are deleted but you don't scan the rest of the 
> dataset).
> Therefore we will add background logic to metastore (similar to, and 
> partially inside, the ACID compactor) to update stats.
> It will have 3 modes of operation.
> 1) Off.
> 2) Update only the stats that exist but are out of date (generating stats can 
> be expensive, so if the user is only analyzing a subset of tables it should 
> be able to only update that subset). We can simply look at existing stats and 
> only analyze for the relevant partitions and columns.
> 3) On: 2 + create stats for all tables and columns missing stats.
> There will also be a table parameter to skip stats update. 
> In phase 1, the process will operate outside of compactor, and run analyze 
> command on the table. The analyze command will automatically save the stats 
> with ACID snapshot information if needed, based on HIVE-19416, so we don't 
> need to do any special state management and this will work for all table 
> types. However it's also more expensive.
> In phase 2, we can explore adding stats collection during MM compaction that 
> uses a temp table. If we don't have open writers during major compaction (so 
> we overwrite all of the data), the temp table stats can simply be copied over 
> to the main table with correct snapshot information, saving us a table scan.
> In phase 3, we can add custom stats collection logic to full ACID compactor 
> that is not query based, the same way as we'd do for (2). Alternatively we 
> can wait for ACID compactor to become query based and just reuse (2).
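
The three modes and the per-table opt-out described above can be read as a small decision function. The following is only a sketch under stated assumptions: the class name, the enum constants, and the parameter key `example.skip.stats.update` are hypothetical and do not come from the patch, and "out of date" is reduced to a boolean supplied by the caller.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the three updater modes; not Hive's actual code. */
final class StatsUpdateModes {
    /** Illustrative names for modes 1) Off, 2) existing stats only, 3) On. */
    enum Mode { OFF, EXISTING_ONLY, ALL }

    /** Hypothetical table parameter that opts a table out of background updates. */
    static final String SKIP_PARAM = "example.skip.stats.update";

    /**
     * Decides whether the background updater would analyze a table.
     * statsExist and statsUpToDate are assumed to be computed elsewhere.
     */
    static boolean shouldAnalyze(Mode mode, Map<String, String> tableParams,
                                 boolean statsExist, boolean statsUpToDate) {
        if (mode == Mode.OFF) {
            return false;                      // mode 1: never update
        }
        if ("true".equalsIgnoreCase(tableParams.get(SKIP_PARAM))) {
            return false;                      // table-level opt-out
        }
        if (statsExist) {
            return !statsUpToDate;             // modes 2 and 3: refresh stale stats
        }
        return mode == Mode.ALL;               // mode 3 only: create missing stats
    }

    public static void main(String[] args) {
        Map<String, String> noParams = Collections.emptyMap();
        Map<String, String> skip = new HashMap<>();
        skip.put(SKIP_PARAM, "true");

        System.out.println(shouldAnalyze(Mode.EXISTING_ONLY, noParams, true, false));
        System.out.println(shouldAnalyze(Mode.EXISTING_ONLY, noParams, false, false));
        System.out.println(shouldAnalyze(Mode.ALL, noParams, false, false));
        System.out.println(shouldAnalyze(Mode.ALL, skip, false, false));
    }
}
```

In this reading, mode 2 never creates stats for a table that has none, which is what lets a user who analyzes only a subset of tables keep the cost limited to that subset.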



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-05-28 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492668#comment-16492668
 ] 

Hive QA commented on HIVE-19418:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
23s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  6m 
12s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
50s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
12s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 
29s{color} | {color:blue} ql in master has 2324 extant Findbugs warnings. 
{color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  2m 
47s{color} | {color:blue} standalone-metastore in master has 216 extant 
Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
49s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m  
8s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
17s{color} | {color:red} hcatalog-unit in the patch failed. {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
29s{color} | {color:red} ql in the patch failed. {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
56s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m  
9s{color} | {color:red} itests/hcatalog-unit: The patch generated 1 new + 27 
unchanged - 0 fixed = 28 total (was 27) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
35s{color} | {color:red} ql: The patch generated 32 new + 120 unchanged - 0 
fixed = 152 total (was 120) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
27s{color} | {color:red} standalone-metastore: The patch generated 10 new + 
1408 unchanged - 0 fixed = 1418 total (was 1408) {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 2 line(s) that end in whitespace. Use git 
apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply 
{color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  3m 
47s{color} | {color:red} ql generated 4 new + 2324 unchanged - 0 fixed = 2328 
total (was 2324) {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  2m 
49s{color} | {color:red} standalone-metastore generated 2 new + 215 unchanged - 
1 fixed = 217 total (was 216) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
51s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
11s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 31m 46s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:ql |
|  |  Dead store to writeIds in 
org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.processOneTable(String)  At 
StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.processOneTable(String)
  At StatsUpdaterThread.java:[line 223] |
|  |  Synchronization performed on java.util.concurrent.atomic.AtomicInteger in 
org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.markAnalyzeDone(StatsUpdaterThread$AnalyzeWork)
  At 
StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.markAnalyzeDone(StatsUpdaterThread$AnalyzeWork)
  At StatsUpdaterThread.java:[line 503] |
|  |  Synchronization performed on java.util.concurrent.atomic.AtomicInteger in 
org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.waitForQueuedCommands()  At 
StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.waitForQueuedCommands()
  At StatsUpdaterThread.java:[line 651] |
|  |
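
For context on the two "Synchronization performed on java.util.concurrent.atomic.AtomicInteger" warnings: a `synchronized` block on an `AtomicInteger` locks the object's intrinsic monitor, which has nothing to do with its atomic operations, so other threads can still change the value without holding that lock. One common way to avoid the warning is a plain `int` guarded by a dedicated lock object. The sketch below is a guess at the intent based only on the method names in the report (`markAnalyzeDone`, `waitForQueuedCommands`); the class and method names here are hypothetical and this is not the code from the patch.

```java
/** Hypothetical in-flight work counter guarded by a dedicated lock object. */
final class PendingWorkCounter {
    private final Object lock = new Object();
    private int pending;   // only read or written while holding lock

    /** Records that one unit of work was queued. */
    void workQueued() {
        synchronized (lock) {
            pending++;
        }
    }

    /** Records that one unit of work finished and wakes waiters when idle. */
    void workDone() {
        synchronized (lock) {
            pending--;
            if (pending == 0) {
                lock.notifyAll();
            }
        }
    }

    /** Blocks until no work is pending; the loop guards against spurious wakeups. */
    void awaitIdle() throws InterruptedException {
        synchronized (lock) {
            while (pending > 0) {
                lock.wait();
            }
        }
    }

    /** Returns the current number of pending units. */
    int pending() {
        synchronized (lock) {
            return pending;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PendingWorkCounter counter = new PendingWorkCounter();
        counter.workQueued();
        counter.workQueued();

        // A worker thread completes both units while the main thread waits.
        Thread worker = new Thread(() -> {
            counter.workDone();
            counter.workDone();
        });
        worker.start();

        counter.awaitIdle();
        worker.join();
        System.out.println("pending = " + counter.pending());
    }
}
```

Because the counter and the wait/notify calls share one private lock, the check of `pending` and the decision to wait happen atomically, which is the property the original `synchronized` blocks were presumably after.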

[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-05-25 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491451#comment-16491451
 ] 

Sergey Shelukhin commented on HIVE-19418:
-

Rebased and updated the patch


[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-05-24 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488825#comment-16488825
 ] 

Hive QA commented on HIVE-19418:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12924666/HIVE-19418.02.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 14388 tests 
executed
*Failed tests:*
{noformat}
TestJdbcNonKrbSASLWithMiniKdc - did not produce a TEST-*.xml file (likely timed 
out) (batchId=255)
org.apache.hadoop.hive.metastore.client.TestTablesCreateDropAlterTruncate.testCreateTableNullDatabase[Embedded]
 (batchId=210)
org.apache.hadoop.hive.metastore.client.TestTablesCreateDropAlterTruncate.testCreateTableNullDatabase[Remote]
 (batchId=210)
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/11179/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/11179/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-11179/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12924666 - PreCommit-HIVE-Build


[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-05-24 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488802#comment-16488802
 ] 

Hive QA commented on HIVE-19418:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
1s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
37s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  6m 
49s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m  
0s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
16s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 
51s{color} | {color:blue} ql in master has 2322 extant Findbugs warnings. 
{color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  2m 
52s{color} | {color:blue} standalone-metastore in master has 216 extant 
Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
13s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m  
8s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
17s{color} | {color:red} hcatalog-unit in the patch failed. {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
31s{color} | {color:red} ql in the patch failed. {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m  
4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m  
4s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
11s{color} | {color:red} itests/hcatalog-unit: The patch generated 1 new + 27 
unchanged - 0 fixed = 28 total (was 27) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
39s{color} | {color:red} ql: The patch generated 31 new + 120 unchanged - 0 
fixed = 151 total (was 120) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
27s{color} | {color:red} standalone-metastore: The patch generated 11 new + 
1407 unchanged - 0 fixed = 1418 total (was 1407) {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 2 line(s) that end in whitespace. Use git 
apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply 
{color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  4m  
5s{color} | {color:red} ql generated 4 new + 2322 unchanged - 0 fixed = 2326 
total (was 2322) {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  3m  
5s{color} | {color:red} standalone-metastore generated 2 new + 215 unchanged - 
1 fixed = 217 total (was 216) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
23s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
12s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 35m 21s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:ql |
|  |  Dead store to writeIds in 
org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.processOneTable(String)  At 
StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.processOneTable(String)
  At StatsUpdaterThread.java:[line 226] |
|  |  Synchronization performed on java.util.concurrent.atomic.AtomicInteger in 
org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.markAnalyzeDone(StatsUpdaterThread$AnalyzeWork)
  At 
StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.markAnalyzeDone(StatsUpdaterThread$AnalyzeWork)
  At StatsUpdaterThread.java:[line 506] |
|  |  Synchronization performed on java.util.concurrent.atomic.AtomicInteger in 
org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.waitForQueuedCommands()  At 
StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.waitForQueuedCommands()
  At StatsUpdaterThread.java:[line 654] |
|  |

[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-05-22 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486532#comment-16486532
 ] 

Sergey Shelukhin commented on HIVE-19418:
-

Rebased the patch (no conflicts, just some offset changes) to run HiveQA again


[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-05-17 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479621#comment-16479621
 ] 

Sergey Shelukhin commented on HIVE-19418:
-

Test failures look bogus. Given the state of HiveQA, it makes no sense to rerun 
for now; I'm assuming there will be some changes based on the review anyway.


[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-05-17 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478717#comment-16478717
 ] 

Hive QA commented on HIVE-19418:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12923843/HIVE-19418.02.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 14416 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[union_stats]
 (batchId=159)
org.apache.hadoop.hive.metastore.client.TestTablesCreateDropAlterTruncate.testCreateTableNullDatabase[Embedded]
 (batchId=209)
org.apache.hadoop.hive.metastore.client.TestTablesCreateDropAlterTruncate.testCreateTableNullDatabase[Remote]
 (batchId=209)
org.apache.hive.hcatalog.pig.TestHCatLoaderComplexSchema.testMapWithComplexData[5]
 (batchId=196)
org.apache.hive.hcatalog.pig.TestTextFileHCatStorer.testWriteDate2 (batchId=196)
org.apache.hive.hcatalog.pig.TestTextFileHCatStorer.testWriteTinyint 
(batchId=196)
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/11021/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/11021/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-11021/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12923843 - PreCommit-HIVE-Build


[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-05-17 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478684#comment-16478684
 ] 

Hive QA commented on HIVE-19418:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
39s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  6m 41s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m  4s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 16s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 49s{color} | {color:blue} ql in master has 2320 extant Findbugs warnings. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  2m 52s{color} | {color:blue} standalone-metastore in master has 215 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 15s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m  8s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 18s{color} | {color:red} hcatalog-unit in the patch failed. {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 28s{color} | {color:red} ql in the patch failed. {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m  0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m  0s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 10s{color} | {color:red} itests/hcatalog-unit: The patch generated 1 new + 27 unchanged - 0 fixed = 28 total (was 27) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 38s{color} | {color:red} ql: The patch generated 31 new + 119 unchanged - 0 fixed = 150 total (was 119) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 27s{color} | {color:red} standalone-metastore: The patch generated 11 new + 1407 unchanged - 0 fixed = 1418 total (was 1407) {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  0s{color} | {color:red} The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  4m  5s{color} | {color:red} ql generated 4 new + 2320 unchanged - 0 fixed = 2324 total (was 2320) {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  3m  5s{color} | {color:red} standalone-metastore generated 2 new + 214 unchanged - 1 fixed = 216 total (was 215) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 26s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 12s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 35m  7s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:ql |
|  |  Dead store to writeIds in org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.processOneTable(String)  At StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.processOneTable(String)  At StatsUpdaterThread.java:[line 226] |
|  |  Synchronization performed on java.util.concurrent.atomic.AtomicInteger in org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.markAnalyzeDone(StatsUpdaterThread$AnalyzeWork)  At StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.markAnalyzeDone(StatsUpdaterThread$AnalyzeWork)  At StatsUpdaterThread.java:[line 506] |
|  |  Synchronization performed on java.util.concurrent.atomic.AtomicInteger in org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.waitForQueuedCommands()  At StatsUpdaterThread.java:org.apache.hadoop.hive.ql.stats.StatsUpdaterThread.waitForQueuedCommands()  At StatsUpdaterThread.java:[line 654] |
|  |
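
For readers unfamiliar with the "Synchronization performed on java.util.concurrent.atomic.AtomicInteger" warning: FindBugs flags it because taking the intrinsic monitor of an atomic object is unrelated to that object's own atomic operations, which usually signals a mix-up. The following is only a minimal sketch of one common alternative, a plain counter guarded by a dedicated lock object. The class and method names are hypothetical and this is not the code from the patch.

```java
// Hypothetical helper; not taken from StatsUpdaterThread.
final class PendingWorkTracker {
    private final Object lock = new Object();
    private int pending; // guarded by lock

    // Called when a unit of work is queued.
    void workQueued() {
        synchronized (lock) {
            pending++;
        }
    }

    // Called when a unit of work finishes; wakes waiters once nothing is pending.
    void workDone() {
        synchronized (lock) {
            pending--;
            if (pending == 0) {
                lock.notifyAll();
            }
        }
    }

    // Blocks until every queued unit of work has finished.
    void awaitIdle() throws InterruptedException {
        synchronized (lock) {
            while (pending > 0) {
                lock.wait();
            }
        }
    }

    int pending() {
        synchronized (lock) {
            return pending;
        }
    }
}

final class PendingWorkTrackerDemo {
    public static void main(String[] args) throws InterruptedException {
        PendingWorkTracker tracker = new PendingWorkTracker();
        tracker.workQueued();
        Thread worker = new Thread(tracker::workDone);
        worker.start();
        tracker.awaitIdle(); // returns once the worker has called workDone()
        System.out.println("pending = " + tracker.pending());
    }
}
```

Because the counter and the wait/notify calls share one private lock, the counter no longer needs to be atomic, and no outside code can synchronize on the same monitor.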

[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-05-16 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478268#comment-16478268
 ] 

Hive QA commented on HIVE-19418:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12923755/HIVE-19418.01.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/11002/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/11002/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-11002/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hiveptest/working/scratch/source-prep.sh' failed with exit 
status 1 and output '+ date '+%Y-%m-%d %T.%3N'
2018-05-17 00:09:36.953
+ [[ -n /usr/lib/jvm/java-8-openjdk-amd64 ]]
+ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ export 
PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ 
PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'MAVEN_OPTS=-Xmx1g '
+ MAVEN_OPTS='-Xmx1g '
+ cd /data/hiveptest/working/
+ tee /data/hiveptest/logs/PreCommit-HIVE-Build-11002/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ git = \s\v\n ]]
+ [[ git = \g\i\t ]]
+ [[ -z master ]]
+ [[ -d apache-github-source-source ]]
+ [[ ! -d apache-github-source-source/.git ]]
+ [[ ! -d apache-github-source-source ]]
+ date '+%Y-%m-%d %T.%3N'
2018-05-17 00:09:36.956
+ cd apache-github-source-source
+ git fetch origin
+ git reset --hard HEAD
HEAD is now at b329afa HIVE-19572: Add option to mask stats and data size in q 
files (Jesus Camacho Rodriguez, reviewed by Prasanth Jayachandran) (addendum)
+ git clean -f -d
Removing ${project.basedir}/
+ git checkout master
Already on 'master'
Your branch is up-to-date with 'origin/master'.
+ git reset --hard origin/master
HEAD is now at b329afa HIVE-19572: Add option to mask stats and data size in q 
files (Jesus Camacho Rodriguez, reviewed by Prasanth Jayachandran) (addendum)
+ git merge --ff-only origin/master
Already up-to-date.
+ date '+%Y-%m-%d %T.%3N'
2018-05-17 00:09:38.075
+ rm -rf ../yetus_PreCommit-HIVE-Build-11002
+ mkdir ../yetus_PreCommit-HIVE-Build-11002
+ git gc
+ cp -R . ../yetus_PreCommit-HIVE-Build-11002
+ mkdir /data/hiveptest/logs/PreCommit-HIVE-Build-11002/yetus
+ patchCommandPath=/data/hiveptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hiveptest/working/scratch/build.patch
+ [[ -f /data/hiveptest/working/scratch/build.patch ]]
+ chmod +x /data/hiveptest/working/scratch/smart-apply-patch.sh
+ /data/hiveptest/working/scratch/smart-apply-patch.sh 
/data/hiveptest/working/scratch/build.patch
Going to apply patch with: git apply -p0
/data/hiveptest/working/scratch/build.patch:471: trailing whitespace.
 
/data/hiveptest/working/scratch/build.patch:1481: trailing whitespace.
 
warning: 2 lines add whitespace errors.
+ [[ maven == \m\a\v\e\n ]]
+ rm -rf /data/hiveptest/working/maven/org/apache/hive
+ mvn -B clean install -DskipTests -T 4 -q 
-Dmaven.repo.local=/data/hiveptest/working/maven
protoc-jar: executing: [/tmp/protoc8726333768938589461.exe, --version]
protoc-jar: executing: [/tmp/protoc8726333768938589461.exe, 
-I/data/hiveptest/working/apache-github-source-source/standalone-metastore/src/main/protobuf/org/apache/hadoop/hive/metastore,
 
--java_out=/data/hiveptest/working/apache-github-source-source/standalone-metastore/target/generated-sources,
 
/data/hiveptest/working/apache-github-source-source/standalone-metastore/src/main/protobuf/org/apache/hadoop/hive/metastore/metastore.proto]
libprotoc 2.5.0
ANTLR Parser Generator  Version 3.5.2
Output file 
/data/hiveptest/working/apache-github-source-source/standalone-metastore/target/generated-sources/org/apache/hadoop/hive/metastore/parser/FilterParser.java
 does not exist: must build 
/data/hiveptest/working/apache-github-source-source/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/parser/Filter.g
org/apache/hadoop/hive/metastore/parser/Filter.g
log4j:WARN No appenders could be found for logger (DataNucleus.Persistence).
log4j:WARN Please initialize the log4j system properly.
DataNucleus Enhancer (version 4.1.17) for API "JDO"
DataNucleus Enhancer completed with success for 40 classes.
ANTLR Parser Generator  Version 3.5.2
Output file 
/data/hiveptest/working/apache-github-source-source/ql/target/generated-sources/antlr3/org/apache/hadoop/hive/ql/parse/HiveLexer.java
 does not exist: must build

[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-05-14 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475284#comment-16475284
 ] 

Sergey Shelukhin commented on HIVE-19418:
-

This should eventually integrate with ACID stats to determine what stats are 
out of date, when that is done. Probably in a separate JIRA if this goes in first.
[~ashutoshc] can you review? This is mostly metastore/stats related for now.

> add background stats updater similar to compactor
> -
>
> Key: HIVE-19418
> URL: https://issues.apache.org/jira/browse/HIVE-19418
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HIVE-19418.patch
>
>
> There's a JIRA HIVE-19416 to add snapshot version to stats for MM/ACID tables 
> to make them usable in a transaction without breaking ACID (for metadata-only 
> optimization). However, stats for ACID tables can still become unusable if 
> e.g. two parallel inserts run - neither sees the data written by the other, 
> so after both finish, the snapshots on either set of stats won't match the 
> current snapshot and the stats will be unusable.
> Additionally, for ACID and non-ACID tables alike, a lot of the stats, with 
> some exceptions like numRows, cannot be aggregated (i.e. you cannot combine 
> ndvs from two inserts), and for ACID even less can be aggregated (you cannot 
> derive min/max if some rows are deleted but you don't scan the rest of the 
> dataset).
> Therefore we will add background logic to metastore (similar to, and 
> partially inside, the ACID compactor) to update stats.
> It will have 3 modes of operation.
> 1) Off.
> 2) Update only the stats that exist but are out of date (generating stats can 
> be expensive, so if the user is only analyzing a subset of tables it should 
> be able to only update that subset). We can simply look at existing stats and 
> only analyze for the relevant partitions and columns.
> 3) On: 2 + create stats for all tables and columns missing stats.
> There will also be a table parameter to skip stats update. 
> In phase 1, the process will operate outside of compactor, and run analyze 
> command on the table. The analyze command will automatically save the stats 
> with ACID snapshot information if needed, based on HIVE-19416, so we don't 
> need to do any special state management and this will work for all table 
> types. However it's also more expensive.
> In phase 2, we can explore adding stats collection during MM compaction that 
> uses a temp table. If we don't have open writers during major compaction (so 
> we overwrite all of the data), the temp table stats can simply be copied over 
> to the main table with correct snapshot information, saving us a table scan.
> In phase 3, we can add custom stats collection logic to full ACID compactor 
> that is not query based, the same way as we'd do for (2). Alternatively we 
> can wait for ACID compactor to become query based and just reuse (2).
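
The three modes in the quoted description, together with the per-table skip parameter, can be sketched as a small decision function. This is only an illustration under assumptions: the enum, class, and field names are hypothetical, and the sketch works at table granularity, whereas the description talks about relevant partitions and columns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; these are not Hive's actual classes or config values.
enum StatsUpdateMode { OFF, EXISTING, ALL }

final class TableStatsState {
    final String name;
    final boolean hasStats;       // some stats already exist for this table
    final boolean statsUpToDate;  // existing stats still match the data
    final boolean skipUpdate;     // per-table parameter that opts out of updates

    TableStatsState(String name, boolean hasStats, boolean statsUpToDate, boolean skipUpdate) {
        this.name = name;
        this.hasStats = hasStats;
        this.statsUpToDate = statsUpToDate;
        this.skipUpdate = skipUpdate;
    }
}

final class StatsModeDemo {
    // Decides whether the background updater should run analyze for one table.
    static boolean needsAnalyze(StatsUpdateMode mode, TableStatsState t) {
        if (mode == StatsUpdateMode.OFF || t.skipUpdate) {
            return false;
        }
        if (t.hasStats) {
            return !t.statsUpToDate;         // modes 2 and 3 both refresh stale stats
        }
        return mode == StatsUpdateMode.ALL;  // only mode 3 creates missing stats
    }

    static List<String> tablesToAnalyze(StatsUpdateMode mode, List<TableStatsState> tables) {
        List<String> result = new ArrayList<>();
        for (TableStatsState t : tables) {
            if (needsAnalyze(mode, t)) {
                result.add(t.name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<TableStatsState> tables = new ArrayList<>();
        tables.add(new TableStatsState("fresh", true, true, false));
        tables.add(new TableStatsState("stale", true, false, false));
        tables.add(new TableStatsState("missing", false, false, false));
        tables.add(new TableStatsState("opted_out", true, false, true));
        for (StatsUpdateMode mode : StatsUpdateMode.values()) {
            System.out.println(mode + " -> " + tablesToAnalyze(mode, tables));
        }
    }
}
```

In this sketch the opt-out flag wins over every mode, and a table without stats is touched only in the ALL mode, which mirrors "3) On: 2 + create stats for all tables and columns missing stats".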



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-05-07 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466290#comment-16466290
 ] 

Sergey Shelukhin commented on HIVE-19418:
-

 HIVE-19442 for phase 4


[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-05-04 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464272#comment-16464272
 ] 

Sergey Shelukhin commented on HIVE-19418:
-

HyperLogLog is not precise; ndv is used for count distinct queries, and that 
cannot be merged.
W.r.t. background logic: yes, [~alangates] had some plans for the compactor, and 
we'd have to do the same for this one. I'm assuming we'll move them into HS2; 
however, for the load-balanced case there'd need to be a primary hosting them.

There's also phase 4 btw - where all the updates write increments, instead of 
total stats, to metastore... then, we can get stats state with one DB query, 
and also "compact" stats.
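
The "write increments, then compact" idea can be sketched for numRows, the one statistic the issue description names as aggregatable. This is an assumption-laden illustration only: the class is hypothetical and an in-memory list stands in for whatever metastore table would hold the increments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a metastore table of per-write row-count increments.
final class RowCountDeltaLog {
    private final List<Long> deltas = new ArrayList<>();

    // Each write appends its own increment instead of rewriting the total.
    void append(long delta) {
        deltas.add(delta);
    }

    // The current value is the sum of all increments (one aggregate query in a real store).
    long currentNumRows() {
        long sum = 0;
        for (long d : deltas) {
            sum += d;
        }
        return sum;
    }

    // "Compacting" collapses the log into a single entry holding the same total.
    void compact() {
        long total = currentNumRows();
        deltas.clear();
        deltas.add(total);
    }

    int entryCount() {
        return deltas.size();
    }
}

final class RowCountDeltaLogDemo {
    public static void main(String[] args) {
        RowCountDeltaLog log = new RowCountDeltaLog();
        log.append(100);  // first insert
        log.append(50);   // second, parallel insert
        log.append(-30);  // a delete
        System.out.println("numRows before compact: " + log.currentNumRows());
        log.compact();
        System.out.println("entries after compact: " + log.entryCount());
        System.out.println("numRows after compact: " + log.currentNumRows());
    }
}
```

Parallel writers never need to see each other's totals in this scheme, since each only appends its own increment.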


[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-05-03 Thread Gopal V (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463257#comment-16463257
 ] 

Gopal V commented on HIVE-19418:


bq. some exceptions like numRows, cannot be aggregated (i.e. you cannot combine 
ndvs from two inserts)

For pure insert queries all stats can be merged - because nDVs are actually 
stored as HyperLogLog bitsets which have a merge() op.
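
To illustrate the merge property being referred to, here is a simplified HyperLogLog-style register array. It is a sketch under assumptions, not Hive's implementation: the class name, register count, and hash mixer are chosen for illustration, and cardinality estimation is left out. The point is only that merging two sketches (element-wise max of registers) yields exactly the sketch that would have been built from the union of the inputs.

```java
import java.util.Arrays;

// Simplified HyperLogLog-style sketch; illustrative only, not Hive's class.
final class DistinctSketch {
    private static final int P = 10;      // index bits
    private static final int M = 1 << P;  // 1024 registers
    final byte[] registers = new byte[M];

    // Integer bit mixer (the murmur3 32-bit finalizer) used as the hash.
    private static int mix(int x) {
        x ^= x >>> 16;
        x *= 0x85ebca6b;
        x ^= x >>> 13;
        x *= 0xc2b2ae35;
        x ^= x >>> 16;
        return x;
    }

    void add(int value) {
        int h = mix(value);
        int idx = h >>> (32 - P);  // top P bits pick the register
        int rest = h << P;         // remaining bits determine the rank
        int rank = Math.min(Integer.numberOfLeadingZeros(rest) + 1, 32 - P + 1);
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    // Merge keeps, per register, the larger rank seen by either sketch.
    void merge(DistinctSketch other) {
        for (int i = 0; i < M; i++) {
            registers[i] = (byte) Math.max(registers[i], other.registers[i]);
        }
    }
}

final class DistinctSketchDemo {
    public static void main(String[] args) {
        DistinctSketch first = new DistinctSketch();
        DistinctSketch second = new DistinctSketch();
        DistinctSketch union = new DistinctSketch();
        for (int i = 0; i < 1000; i++) { first.add(i); union.add(i); }
        for (int i = 500; i < 1500; i++) { second.add(i); union.add(i); }
        first.merge(second);
        System.out.println("merged equals union: "
            + Arrays.equals(first.registers, union.registers));
    }
}
```

This holds for insert-only workloads; registers cannot be "un-maxed", so a sketch of this kind cannot account for deleted rows.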

bq. Therefore we will add background logic to metastore (similar to, and 
partially inside, the ACID compactor)

With standalone-metastore, adding more background logic to the metastore is 
going to become a big problem - I'd argue that even the compactor needs to be 
moved out & the metastore can only keep the book-keeping for pending tasks (a 
generic task queue + priorities) because it will no longer have a yarn-site.xml 
in its configurations.



[jira] [Commented] (HIVE-19418) add background stats updater similar to compactor

2018-05-03 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463241#comment-16463241
 ] 

Sergey Shelukhin commented on HIVE-19418:
-

cc [~steveyeom2017] [~ekoifman] [~ashutoshc]

