[jira] [Updated] (PHOENIX-6081) Improvements to snapshot based MR input format

2021-05-21 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-6081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated PHOENIX-6081:
--
Fix Version/s: (was: 4.16.1)
   4.16.2

> Improvements to snapshot based MR input format
> --
>
> Key: PHOENIX-6081
> URL: https://issues.apache.org/jira/browse/PHOENIX-6081
> Project: Phoenix
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 5.0.0, 4.14.3
>Reporter: Bharath Vissapragada
>Priority: Major
> Fix For: 4.17.0, 4.16.2
>
>
> Recently we switched an MR application from scanning live tables to scanning 
> snapshots (PHOENIX-3744). We ran into a severe performance issue, which 
> turned out to a correctness issue due to over-lapping scan splits generation. 
> After some debugging we figured that it has been fixed via PHOENIX-4997. Even 
> with that fix there are quite a few things we could improve about the 
> snapshot based input format. Listing them here, perhaps we can break them 
> into subtasks as needed.
> - Do not restore the snapshot per map task. Currently we restore the snapshot 
> once per map task into a temp directory. For large tables on big clusters, 
> this creates a storm of NN RPCs. We can do this once per job and let all the 
> map tasks operate on the same restored snapshot. HBase already did this via 
> HBASE-18806, we can do something similar.
> - Disable 
> [cacheBlocks|[https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setCacheBlocks-boolean-]]
>  on scans generated by input format. In our experiments block cache took a 
> lot of memory for MR jobs. For full table scans this isn't of much use and 
> can save a lot of memory.
> - Short circuit live-table codepaths when snapshots are enabled. Currently 
> some codepaths make live table HBase RPCs to get a bunch of data. For example
> {noformat}
> private List generateSplits(final QueryPlan qplan, Configuration 
> config) throws IOException {
> // We must call this in order to initialize the scans and splits from the 
> query plan
>   
> // Get the RegionSizeCalculator
> try(org.apache.hadoop.hbase.client.Connection connection =
> 
> HBaseFactoryProvider.getHConnectionFactory().createConnection(config)) {
> RegionLocator regionLocator = 
> connection.getRegionLocator(TableName.valueOf(tableName));
> RegionSizeCalculator sizeCalculator = new RegionSizeCalculator(regionLocator, 
> connection
> .getAdmin()); {noformat}
> This defeats the purpose of using snapshots. Refactor the code in a way that 
> the snapshot based codepaths do minimal HBase RPCs and rely solely on 
> snapshot manifest. Even better, improve locality of task scheduling based on 
> snapshot's hfile block locations.
> - Disable indexes for query plan for scanning over snapshots. If there is an 
> index based access path, getScans() can potentially return index based splits 
> which is not what we want for snapshots.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PHOENIX-6081) Improvements to snapshot based MR input format

2020-10-27 Thread Chinmay Kulkarni (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-6081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chinmay Kulkarni updated PHOENIX-6081:
--
Affects Version/s: (was: master)
   5.0.0

> Improvements to snapshot based MR input format
> --
>
> Key: PHOENIX-6081
> URL: https://issues.apache.org/jira/browse/PHOENIX-6081
> Project: Phoenix
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 5.0.0, 4.14.3
>Reporter: Bharath Vissapragada
>Priority: Major
>
> Recently we switched an MR application from scanning live tables to scanning 
> snapshots (PHOENIX-3744). We ran into a severe performance issue, which 
> turned out to a correctness issue due to over-lapping scan splits generation. 
> After some debugging we figured that it has been fixed via PHOENIX-4997. Even 
> with that fix there are quite a few things we could improve about the 
> snapshot based input format. Listing them here, perhaps we can break them 
> into subtasks as needed.
> - Do not restore the snapshot per map task. Currently we restore the snapshot 
> once per map task into a temp directory. For large tables on big clusters, 
> this creates a storm of NN RPCs. We can do this once per job and let all the 
> map tasks operate on the same restored snapshot. HBase already did this via 
> HBASE-18806, we can do something similar.
> - Disable 
> [cacheBlocks|[https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setCacheBlocks-boolean-]]
>  on scans generated by input format. In our experiments block cache took a 
> lot of memory for MR jobs. For full table scans this isn't of much use and 
> can save a lot of memory.
> - Short circuit live-table codepaths when snapshots are enabled. Currently 
> some codepaths make live table HBase RPCs to get a bunch of data. For example
> {noformat}
> private List generateSplits(final QueryPlan qplan, Configuration 
> config) throws IOException {
> // We must call this in order to initialize the scans and splits from the 
> query plan
>   
> // Get the RegionSizeCalculator
> try(org.apache.hadoop.hbase.client.Connection connection =
> 
> HBaseFactoryProvider.getHConnectionFactory().createConnection(config)) {
> RegionLocator regionLocator = 
> connection.getRegionLocator(TableName.valueOf(tableName));
> RegionSizeCalculator sizeCalculator = new RegionSizeCalculator(regionLocator, 
> connection
> .getAdmin()); {noformat}
> This defeats the purpose of using snapshots. Refactor the code in a way that 
> the snapshot based codepaths do minimal HBase RPCs and rely solely on 
> snapshot manifest. Even better, improve locality of task scheduling based on 
> snapshot's hfile block locations.
> - Disable indexes for query plan for scanning over snapshots. If there is an 
> index based access path, getScans() can potentially return index based splits 
> which is not what we want for snapshots.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PHOENIX-6081) Improvements to snapshot based MR input format

2020-10-27 Thread Chinmay Kulkarni (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-6081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chinmay Kulkarni updated PHOENIX-6081:
--
Fix Version/s: 4.16.1

> Improvements to snapshot based MR input format
> --
>
> Key: PHOENIX-6081
> URL: https://issues.apache.org/jira/browse/PHOENIX-6081
> Project: Phoenix
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 5.0.0, 4.14.3
>Reporter: Bharath Vissapragada
>Priority: Major
> Fix For: 4.16.1
>
>
> Recently we switched an MR application from scanning live tables to scanning 
> snapshots (PHOENIX-3744). We ran into a severe performance issue, which 
> turned out to a correctness issue due to over-lapping scan splits generation. 
> After some debugging we figured that it has been fixed via PHOENIX-4997. Even 
> with that fix there are quite a few things we could improve about the 
> snapshot based input format. Listing them here, perhaps we can break them 
> into subtasks as needed.
> - Do not restore the snapshot per map task. Currently we restore the snapshot 
> once per map task into a temp directory. For large tables on big clusters, 
> this creates a storm of NN RPCs. We can do this once per job and let all the 
> map tasks operate on the same restored snapshot. HBase already did this via 
> HBASE-18806, we can do something similar.
> - Disable 
> [cacheBlocks|[https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setCacheBlocks-boolean-]]
>  on scans generated by input format. In our experiments block cache took a 
> lot of memory for MR jobs. For full table scans this isn't of much use and 
> can save a lot of memory.
> - Short circuit live-table codepaths when snapshots are enabled. Currently 
> some codepaths make live table HBase RPCs to get a bunch of data. For example
> {noformat}
> private List generateSplits(final QueryPlan qplan, Configuration 
> config) throws IOException {
> // We must call this in order to initialize the scans and splits from the 
> query plan
>   
> // Get the RegionSizeCalculator
> try(org.apache.hadoop.hbase.client.Connection connection =
> 
> HBaseFactoryProvider.getHConnectionFactory().createConnection(config)) {
> RegionLocator regionLocator = 
> connection.getRegionLocator(TableName.valueOf(tableName));
> RegionSizeCalculator sizeCalculator = new RegionSizeCalculator(regionLocator, 
> connection
> .getAdmin()); {noformat}
> This defeats the purpose of using snapshots. Refactor the code in a way that 
> the snapshot based codepaths do minimal HBase RPCs and rely solely on 
> snapshot manifest. Even better, improve locality of task scheduling based on 
> snapshot's hfile block locations.
> - Disable indexes for query plan for scanning over snapshots. If there is an 
> index based access path, getScans() can potentially return index based splits 
> which is not what we want for snapshots.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PHOENIX-6081) Improvements to snapshot based MR input format

2020-10-27 Thread Chinmay Kulkarni (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-6081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chinmay Kulkarni updated PHOENIX-6081:
--
Fix Version/s: 4.17.0

> Improvements to snapshot based MR input format
> --
>
> Key: PHOENIX-6081
> URL: https://issues.apache.org/jira/browse/PHOENIX-6081
> Project: Phoenix
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 5.0.0, 4.14.3
>Reporter: Bharath Vissapragada
>Priority: Major
> Fix For: 4.16.1, 4.17.0
>
>
> Recently we switched an MR application from scanning live tables to scanning 
> snapshots (PHOENIX-3744). We ran into a severe performance issue, which 
> turned out to a correctness issue due to over-lapping scan splits generation. 
> After some debugging we figured that it has been fixed via PHOENIX-4997. Even 
> with that fix there are quite a few things we could improve about the 
> snapshot based input format. Listing them here, perhaps we can break them 
> into subtasks as needed.
> - Do not restore the snapshot per map task. Currently we restore the snapshot 
> once per map task into a temp directory. For large tables on big clusters, 
> this creates a storm of NN RPCs. We can do this once per job and let all the 
> map tasks operate on the same restored snapshot. HBase already did this via 
> HBASE-18806, we can do something similar.
> - Disable 
> [cacheBlocks|[https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setCacheBlocks-boolean-]]
>  on scans generated by input format. In our experiments block cache took a 
> lot of memory for MR jobs. For full table scans this isn't of much use and 
> can save a lot of memory.
> - Short circuit live-table codepaths when snapshots are enabled. Currently 
> some codepaths make live table HBase RPCs to get a bunch of data. For example
> {noformat}
> private List generateSplits(final QueryPlan qplan, Configuration 
> config) throws IOException {
> // We must call this in order to initialize the scans and splits from the 
> query plan
>   
> // Get the RegionSizeCalculator
> try(org.apache.hadoop.hbase.client.Connection connection =
> 
> HBaseFactoryProvider.getHConnectionFactory().createConnection(config)) {
> RegionLocator regionLocator = 
> connection.getRegionLocator(TableName.valueOf(tableName));
> RegionSizeCalculator sizeCalculator = new RegionSizeCalculator(regionLocator, 
> connection
> .getAdmin()); {noformat}
> This defeats the purpose of using snapshots. Refactor the code in a way that 
> the snapshot based codepaths do minimal HBase RPCs and rely solely on 
> snapshot manifest. Even better, improve locality of task scheduling based on 
> snapshot's hfile block locations.
> - Disable indexes for query plan for scanning over snapshots. If there is an 
> index based access path, getScans() can potentially return index based splits 
> which is not what we want for snapshots.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PHOENIX-6081) Improvements to snapshot based MR input format

2020-10-27 Thread Chinmay Kulkarni (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-6081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chinmay Kulkarni updated PHOENIX-6081:
--
Affects Version/s: (was: 4.15.1)

> Improvements to snapshot based MR input format
> --
>
> Key: PHOENIX-6081
> URL: https://issues.apache.org/jira/browse/PHOENIX-6081
> Project: Phoenix
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 4.14.3, master
>Reporter: Bharath Vissapragada
>Priority: Major
>
> Recently we switched an MR application from scanning live tables to scanning 
> snapshots (PHOENIX-3744). We ran into a severe performance issue, which 
> turned out to a correctness issue due to over-lapping scan splits generation. 
> After some debugging we figured that it has been fixed via PHOENIX-4997. Even 
> with that fix there are quite a few things we could improve about the 
> snapshot based input format. Listing them here, perhaps we can break them 
> into subtasks as needed.
> - Do not restore the snapshot per map task. Currently we restore the snapshot 
> once per map task into a temp directory. For large tables on big clusters, 
> this creates a storm of NN RPCs. We can do this once per job and let all the 
> map tasks operate on the same restored snapshot. HBase already did this via 
> HBASE-18806, we can do something similar.
> - Disable 
> [cacheBlocks|[https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setCacheBlocks-boolean-]]
>  on scans generated by input format. In our experiments block cache took a 
> lot of memory for MR jobs. For full table scans this isn't of much use and 
> can save a lot of memory.
> - Short circuit live-table codepaths when snapshots are enabled. Currently 
> some codepaths make live table HBase RPCs to get a bunch of data. For example
> {noformat}
> private List generateSplits(final QueryPlan qplan, Configuration 
> config) throws IOException {
> // We must call this in order to initialize the scans and splits from the 
> query plan
>   
> // Get the RegionSizeCalculator
> try(org.apache.hadoop.hbase.client.Connection connection =
> 
> HBaseFactoryProvider.getHConnectionFactory().createConnection(config)) {
> RegionLocator regionLocator = 
> connection.getRegionLocator(TableName.valueOf(tableName));
> RegionSizeCalculator sizeCalculator = new RegionSizeCalculator(regionLocator, 
> connection
> .getAdmin()); {noformat}
> This defeats the purpose of using snapshots. Refactor the code in a way that 
> the snapshot based codepaths do minimal HBase RPCs and rely solely on 
> snapshot manifest. Even better, improve locality of task scheduling based on 
> snapshot's hfile block locations.
> - Disable indexes for query plan for scanning over snapshots. If there is an 
> index based access path, getScans() can potentially return index based splits 
> which is not what we want for snapshots.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)