[ https://issues.apache.org/jira/browse/HIVE-16865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048991#comment-16048991 ]

anishek commented on HIVE-16865:
--------------------------------

h4. Bootstrap Replication Dump
* db metadata is dumped one db at a time
* function metadata is dumped one function at a time
* get all tableNames, which should not be overwhelming
* tables are also requested one at a time; all partition definitions are loaded in one go for a table, and we don't expect a table with more than 5-10 partition definition columns
* the partition objects themselves will be fetched in batches via the PartitionIterable (see the batching sketch after this list)
* the only problem seems to be when writing data to _files, wherein we load all the FileStatus objects in memory per partition (for partitioned tables), or per table otherwise. This might lead to OOM cases. Decision: this is not a problem, as we do the same for split computation, where we have not faced this issue.
* we create a ReplCopyTask that will create the _files for all tables / partitions etc. during analysis time and then go to the execution engine; this will lead to a lot of objects stored in memory given the above scale targets. *possibly*:
** move the dump into a task which manages its own thread pools to subsequently analyze/dump tables in the execution phase; this will blur the demarcation between the analysis and execution phases within hive.
** another option might be to provide lazy incremental tasks from the analysis phase to the execution phase, such that both phases run simultaneously rather than one completing before the other starts; this would require significant code changes and currently seems to be needed only for replication.
** we might have to do the same for _incremental replication dump_ too, as the range between the _*from*_ and _*to*_ event ids might span millions of events, all of them being inserts. Though the creation of _files is handled differently there, in that we write the _files along with the metadata, we should be able to do the same for bootstrap replication as well rather than creating a ReplCopyTask; that would mean ReplCopyTask is effectively only used during load time. The only problem with this approach is that, since the process is single threaded, we would dump data sequentially and it might take a long time, unless we do some threading in ReplicationSemanticAnalyzer to dump tables in parallel, since there is no dependency between tables when dumping them (see the thread-pool sketch after this list); a similar approach might be required for partitions within a table.
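
A minimal sketch of the batching idea behind PartitionIterable, assuming a hypothetical fetchPartitionBatch helper standing in for the metastore call; only the (small) partition names are held in memory, and full partition objects are materialized batchSize at a time:
{code}
import java.util.ArrayList;
import java.util.List;

// Sketch of batched partition fetching: hold only the partition names in
// memory, and materialize full partition objects batchSize at a time, the
// way PartitionIterable avoids loading every partition at once.
public class BatchedPartitionFetch {

    // Stand-in for the real metastore Partition object.
    static class Partition {
        final String name;
        Partition(String name) { this.name = name; }
    }

    // Hypothetical stand-in for the metastore call that resolves a batch
    // of partition names into full partition objects.
    static List<Partition> fetchPartitionBatch(List<String> names) {
        List<Partition> batch = new ArrayList<>();
        for (String n : names) {
            batch.add(new Partition(n));
        }
        return batch;
    }

    public static void main(String[] args) {
        List<String> partitionNames = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            partitionNames.add("ds=2017-06-" + String.format("%02d", i + 1));
        }

        int batchSize = 3;
        // Walk the names in fixed-size windows; at most batchSize full
        // partition objects are alive at any point.
        for (int start = 0; start < partitionNames.size(); start += batchSize) {
            int end = Math.min(start + batchSize, partitionNames.size());
            List<Partition> batch =
                fetchPartitionBatch(partitionNames.subList(start, end));
            for (Partition p : batch) {
                System.out.println("dumping partition " + p.name);
            }
        }
    }
}
{code}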
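And a minimal sketch of the thread-pool idea for the parallel table dump, assuming a hypothetical dumpTable method; since tables have no dependency on each other during dump, a bounded pool can fan them out instead of writing them sequentially:
{code}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of dumping tables in parallel: tables are independent during
// dump, so each dump can be submitted to a bounded thread pool instead
// of being written out one after another.
public class ParallelTableDump {

    // Hypothetical stand-in for dumping one table's metadata + _files.
    static void dumpTable(String tableName) {
        System.out.println(Thread.currentThread().getName()
            + " dumping table " + tableName);
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> tables = Arrays.asList("t1", "t2", "t3", "t4", "t5");

        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String table : tables) {
            pool.submit(() -> dumpTable(table));
        }

        // Accept no new work; wait for all in-flight dumps to finish.
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
{code}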

h4. Bootstrap Replication Load
* list all the table metadata files per db. For massive databases, per the above scale targets, we will load on the order of a million FileStatus objects in memory. This is a significantly higher order of objects than is loaded during split computation, and hence might need a closer look. Most probably move to
{code}org.apache.hadoop.fs.RemoteIterator<LocatedFileStatus> listFiles(Path f, boolean recursive){code}
(a usage sketch follows this list)
* a task will be created for each type of operation; in the case of bootstrap, one task per table / partition / function / database, hence we will encounter the last problem noted in _*Bootstrap Replication Dump*_
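
A usage sketch of the streaming listing above: FileSystem.listFiles returns a RemoteIterator, so file statuses are consumed one at a time instead of being materialized as a million-entry array (the dump path here is illustrative):
{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Sketch of streaming a large listing: RemoteIterator pulls FileStatus
// objects incrementally, so memory stays flat no matter how many table
// metadata files a db has.
public class StreamingListing {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path dbDumpRoot = new Path("/user/hive/repl/dump/db1"); // illustrative
        FileSystem fs = dbDumpRoot.getFileSystem(conf);

        RemoteIterator<LocatedFileStatus> it = fs.listFiles(dbDumpRoot, true);
        while (it.hasNext()) {
            LocatedFileStatus status = it.next(); // one status at a time
            System.out.println(status.getPath());
        }
    }
}
{code}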



h4. Additional thoughts
* Since there can be multiple instances of the metastore, from an integration w.r.t. beacon for replication, would it be better to have a dedicated metastore instance for replication-related workloads (at least for bootstrap)? Since the execution of tasks will take place on the metastore instance, it might serve the customer better to have one metastore for replication and others to handle normal workloads. This can be achieved, I think, based on how the URLs are configured on the HS2/beacon side (a config sketch follows this list).
* On calling distcp in ReplCopyTask, can we log the source path to dest path mapping? Otherwise, if there are problems during copying, we won't know the actual paths (a logging sketch also follows this list).
* On the replica warehouse, since replication tasks will run alongside normal execution of other hive tasks (assuming there are multiple dbs on the replica), how do we constrain resource allocation for replication vs normal tasks? How do we manage this such that replication does not lag behind significantly?
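
One way to read the URL-configuration point, as a hedged example: the replication-driving client could point at a dedicated metastore via hive.metastore.uris while regular clients keep the shared instances (the host name is illustrative):
{code}
<!-- hive-site.xml seen by the replication path only (illustrative host) -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://repl-metastore.example.com:9083</value>
</property>
{code}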
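And a minimal sketch of the logging suggestion, not ReplCopyTask's actual code: log each source-to-destination pair before handing it to distcp, so a failed copy can be traced back to concrete paths:
{code}
import org.apache.hadoop.fs.Path;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch only: record every source -> destination pair before invoking
// distcp, so a failed copy can be traced back to concrete paths.
public class CopyLoggingSketch {
    private static final Logger LOG =
        LoggerFactory.getLogger(CopyLoggingSketch.class);

    static void logAndCopy(Path source, Path destination) {
        LOG.info("replication copy: {} -> {}", source, destination);
        // ... hand (source, destination) to distcp here ...
    }

    public static void main(String[] args) {
        logAndCopy(new Path("/warehouse/db1/t1/_files"),
                   new Path("/replica/db1/t1/_files"));
    }
}
{code}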


> Handle replication bootstrap of large databases
> -----------------------------------------------
>
>                 Key: HIVE-16865
>                 URL: https://issues.apache.org/jira/browse/HIVE-16865
>             Project: Hive
>          Issue Type: Improvement
>          Components: HiveServer2
>    Affects Versions: 3.0.0
>            Reporter: anishek
>            Assignee: anishek
>             Fix For: 3.0.0
>
>
> for larger databases make sure that we can handle replication bootstrap.
> * Assuming a large database can have close to a million tables, or a few 
> tables with a few hundred thousand partitions. 
> * for function replication, if a primary warehouse has a large number of 
> custom functions defined such that the same binary file incorporates most of 
> these functions, then on the replica warehouse there might be a problem in 
> loading all these functions, as we will have the same jar on primary copied 
> over for each function such that each function will have a local copy of the 
> jar; loading all these jars might lead to excessive memory usage. 
>  


