RES: Bulk Import & Data Locality

Cristofer Weber Wed, 18 Jul 2012 10:28:28 -0700

Hi Alex

Here we do bulk imports by creating the HFiles in an MR job and then finishing 
the load by calling the doBulkLoad method of the LoadIncrementalHFiles class 
(probably the same method used by the completebulkload tool). The HFiles 
generated by the reducer tasks are correctly 'adopted' by the corresponding 
region servers because the files are placed in the correct directories.


I never wondered whether doBulkLoad is aware of region locations when copying 
files, because our major compaction runs right after the bulk load. What occurs 
to me right now is that you can check block locations using the NameNode UI, 
since region names match the region directories inside your table dir in HDFS. 

I tried it here and they do in fact match, but we had already run a major 
compaction, so the HFiles are bound to be collocated with the corresponding RS.

Regards,
Cristofer


-----Original Message-----
From: Alex Baranau [mailto:[email protected]] 
Sent: Wednesday, July 18, 2012 12:46
To: [email protected]; [email protected]; 
[email protected]
Subject: Bulk Import & Data Locality

Hello,

As far as I understand, the Bulk Import functionality does not take Data 
Locality into account. The MR job creates as many reducer tasks as there are 
regions to write into, but it does not "advise" on which nodes these tasks 
should run. In that case, a Reducer task that writes the HFiles of some region 
may not be physically located on the same node as the RS that serves that 
region. Given the way HDFS writes data, there will (likely) be one full replica 
of the blocks of this region's HFiles on the node where the Reducer task ran, 
and the other replicas (if replication > 1) will be distributed randomly over 
the cluster. Thus the RS, while serving data of that region, will (most likely) 
not read local data (data will be transferred from other datanodes). I.e. data 
locality will be broken.
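
To make the concern above concrete, here is a small sketch in plain Java. It is 
not HBase or HDFS code: the class name LocalityModel, the localFraction helper 
and the host names are all made up for illustration. It assumes each block is 
represented by the list of hosts that hold one of its replicas, and it reports 
the fraction of blocks that have a replica on a given host.

```java
import java.util.List;

// Hypothetical model for illustration only; not an HBase or HDFS API.
final class LocalityModel {

    // Fraction of blocks that have at least one replica on the given host.
    static double localFraction(List<List<String>> blockReplicaHosts, String host) {
        if (blockReplicaHosts.isEmpty()) {
            return 0.0;
        }
        long local = blockReplicaHosts.stream()
                .filter(hosts -> hosts.contains(host))
                .count();
        return (double) local / blockReplicaHosts.size();
    }

    public static void main(String[] args) {
        // Assumed scenario: the Reducer ran on node-a, so every block has its
        // first replica there; the other replicas landed elsewhere.
        List<List<String>> blocks = List.of(
                List.of("node-a", "node-c", "node-d"),
                List.of("node-a", "node-b", "node-e"),
                List.of("node-a", "node-d", "node-e"),
                List.of("node-a", "node-c", "node-e"));

        // Node where the Reducer wrote the HFiles.
        System.out.println(localFraction(blocks, "node-a")); // prints 1.0
        // Node where the RS serving the region runs (assumed to be node-b).
        System.out.println(localFraction(blocks, "node-b")); // prints 0.25
    }
}
```

In this made-up layout the RS on node-b would find only one block in four 
locally and would have to read the rest from other datanodes.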

Is this correct?

If yes, I guess it would help if we could tell the MR framework where (on which 
nodes) to launch certain Reducer tasks. I believe this is not possible with 
MR1; please correct me if I'm wrong. Perhaps this is possible with MR2?

I assume there's also no way to give the NameNode a "hint" about where to place 
the blocks of a new file, right?

Thank you,
--
Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
