[jira] [Comment Edited] (HBASE-18872) Backup scaling for multiple table and millions of row
[ https://issues.apache.org/jira/browse/HBASE-18872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268322#comment-17268322 ]

Mallikarjun edited comment on HBASE-18872 at 1/20/21, 2:13 AM:
---------------------------------------------------------------

Moving this from B testing to Backup/Restore Phase 4. [~vrodionov], move it wherever it is more suitable. [~vishk] FYI

was (Author: rda3mon):
Moving this from B testing to Backup/Restore Phase 4. [~vrodionov], move it wherever it is more suitable. [~vishk] FYI

> Backup scaling for multiple table and millions of row
> -----------------------------------------------------
>
>                 Key: HBASE-18872
>                 URL: https://issues.apache.org/jira/browse/HBASE-18872
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Vishal Khandelwal
>            Assignee: Vladimir Rodionov
>            Priority: Major
>
> I did a simple experiment of loading ~200 million rows into table 1 and nothing into table 2. The test was done on a local cluster; approx. 3-4 containers were running in parallel. The focus of the test was not on how much time the backup takes, but on the time spent on the table where no data had been changed.
> *Table without data:*
> Elapsed: 44mins, 52sec
> Average Map Time: 3sec
> Average Shuffle Time: 2mins, 35sec
> Average Merge Time: 0sec
> Average Reduce Time: 0sec
> Map: 2052
> Reduce: 1
> *Table with data:*
> Elapsed: 1hrs, 44mins, 10sec
> Average Map Time: 4sec
> Average Shuffle Time: 37sec
> Average Merge Time: 3sec
> Average Reduce Time: 47sec
> Map: 2052
> Reduce: 64
> All the above numbers are from a single-node cluster, so not many mappers ran in parallel. But let's extrapolate to a 20-node cluster with ~100 tables and approx. 2000 WALs of data to be backed up. Say each of the 20 nodes can process 3 containers, i.e. 60 WALs in parallel, and assume 3 sec are spent on each WAL: 2000 WALs * 3 sec = 6000 sec per table, / 60 parallel containers --> 100 sec per table --> ~10000 sec for all 100 tables, i.e. ~166 mins --> ~2.7 hrs just for filtering. This does not seem to scale. (These are just rough numbers from a basic test.) All parsing is O(m * n), where m is the number of WALs and n is the number of tables.
> The main intent of this test is to show that even the backup of a table with very little churn can take a good amount of time just for filtering the data. As the number of tables or the data size increases, this does not seem scalable.
> Even in our current cluster I can see numbers easily close to 100 tables, 200 million rows, 200-300 GB.
> I would suggest that filtering should parse the WALs once and segregate the edits into multiple per-table WALs, then build HFiles from the per-table WALs (just a rough idea), as in the sketch below.
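The single-pass segregation proposed above could look roughly like the following sketch. This is a minimal illustration, assuming the HBase 2.x WAL reader API (WALFactory.createReader / WAL.Entry); the class name SinglePassWalSplitter and the in-memory buffering are hypothetical, and a real implementation would stream each per-table group to an HFile or WAL writer (much as WALPlayer does) instead of holding cells in memory.

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.wal.WAL;
import org.apache.hadoop.hbase.wal.WALFactory;

// Hypothetical sketch of the "parse WALs once" proposal: one scan over
// each WAL file, with edits grouped by table as they are read.
public class SinglePassWalSplitter {

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // Edits grouped by table, filled in ONE pass over every WAL file,
    // so the scan cost is O(m) instead of O(m * n).
    Map<TableName, List<Cell>> editsByTable = new HashMap<>();

    for (String wal : args) { // each argument is one WAL file path
      try (WAL.Reader reader = WALFactory.createReader(fs, new Path(wal), conf)) {
        WAL.Entry entry;
        while ((entry = reader.next()) != null) {
          TableName table = entry.getKey().getTableName();
          editsByTable
              .computeIfAbsent(table, t -> new ArrayList<>())
              .addAll(entry.getEdit().getCells());
        }
      }
    }

    // A real implementation would flush each group to a per-table WAL or
    // HFile here; printing counts keeps the sketch self-contained.
    editsByTable.forEach(
        (table, cells) -> System.out.println(table + " -> " + cells.size() + " cells"));
  }
}
{code}

With this shape, each WAL is read once regardless of how many tables exist, so the filtering cost becomes O(m) WAL scans plus a per-table write, rather than re-reading all m WALs for each of the n tables.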
[jira] [Comment Edited] (HBASE-18872) Backup scaling for multiple table and millions of row
[ https://issues.apache.org/jira/browse/HBASE-18872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268313#comment-17268313 ]

Mallikarjun edited comment on HBASE-18872 at 1/20/21, 1:58 AM:
---------------------------------------------------------------

[~vishk] Did your experiment use incremental backup or full backup? Based on my experiments, I might have totally different numbers.

Full backup does a snapshot copy --> this should have no dependence on WAL files.
Incremental backup generates HFiles from the WALs and does a DistCp --> this should be relatively small in size and need not scale to the extent a full backup does.

was (Author: rda3mon):
[~vishk] Did your experiment use incremental backup or full backup? Based on my experiments, I might have totally different numbers.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
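For reference, the two modes being compared can be driven from the backup CLI. A minimal sketch, assuming the `hbase backup create` syntax from the HBase reference guide; the backup root path and table names are made up, and flags can vary across versions:

{code}
# Full backup: snapshot-based copy of the listed tables; does not depend on WALs
hbase backup create full hdfs://backup-nn:8020/hbase-backups -t table1,table2

# Incremental backup: converts WAL entries accumulated since the previous
# backup image in the same backup root into HFiles and ships them with DistCp
hbase backup create incremental hdfs://backup-nn:8020/hbase-backups -t table1,table2
{code}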