"sync" is a fluffy term in HDFS. HDFS has hsync and hflush. hflush forces all current changes at a DFSClient to all replica nodes (but not to disk).
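To make the distinction concrete, here is a toy, in-memory sketch. It is not the HDFS client API: `ToySyncDemo`, `Replica`, `Pipeline` and `crashAllReplicas` are hypothetical names made up for illustration, and "memory" versus "disk" are just two lists. The only thing it tries to show is that an hflush gets the data to every replica without putting it on disk, while an hsync additionally forces it to disk.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; none of these classes exist in Hadoop or HBase. */
class ToySyncDemo {

    /** Stand-in for one replica node: data is received into memory, then optionally forced to disk. */
    static class Replica {
        final List<String> memory = new ArrayList<>(); // received, but lost if the node loses power
        final List<String> disk = new ArrayList<>();   // survives a crash

        void receive(List<String> edits) { memory.addAll(edits); }

        void forceToDisk() {
            disk.addAll(memory);
            memory.clear();
        }

        void crash() { memory.clear(); }

        boolean hasSeen(String edit) { return memory.contains(edit) || disk.contains(edit); }
    }

    /** Stand-in for a client writing through a pipeline of replicas. */
    static class Pipeline {
        final List<String> clientBuffer = new ArrayList<>();
        final List<Replica> replicas = new ArrayList<>();

        Pipeline(int replication) {
            for (int i = 0; i < replication; i++) {
                replicas.add(new Replica());
            }
        }

        /** Buffers the edit at the client; no replica has seen it yet. */
        void write(String edit) { clientBuffer.add(edit); }

        /** hflush-like: push everything buffered at the client to all replicas (memory only). */
        void hflush() {
            for (Replica r : replicas) {
                r.receive(clientBuffer);
            }
            clientBuffer.clear();
        }

        /** hsync-like: same as hflush, then each replica forces what it holds to disk. */
        void hsync() {
            hflush();
            for (Replica r : replicas) {
                r.forceToDisk();
            }
        }

        /** Simulates every replica losing power at the same time. */
        void crashAllReplicas() {
            for (Replica r : replicas) {
                r.crash();
            }
        }

        boolean allReplicasHave(String edit) {
            for (Replica r : replicas) {
                if (!r.hasSeen(edit)) return false;
            }
            return true;
        }

        boolean allReplicasHaveOnDisk(String edit) {
            for (Replica r : replicas) {
                if (!r.disk.contains(edit)) return false;
            }
            return true;
        }
    }

    public static void main(String[] args) {
        Pipeline p = new Pipeline(3);
        p.write("put-1");
        System.out.println("before hflush, on all replicas: " + p.allReplicasHave("put-1"));
        p.hflush();
        System.out.println("after hflush, on all replicas:  " + p.allReplicasHave("put-1"));
        System.out.println("after hflush, on all disks:     " + p.allReplicasHaveOnDisk("put-1"));
        p.write("put-2");
        p.hsync();
        p.crashAllReplicas();
        System.out.println("after hsync + crash, on disks:  " + p.allReplicasHaveOnDisk("put-2"));
    }
}
```

In this model, an edit that was only written (no hflush) exists at the client alone, so there is no "other copy" anywhere yet. An edit that was hflushed survives the loss of individual nodes, but not a simultaneous power loss on all replicas; only the hsync'ed edit does.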
Before HDFS-744, hsync was identical to hflush. Since HDFS-744, hsync can be used to force data to disk at the replicas.

When HBase refers to "sync", hflush semantics are meant (at least until HBASE-5954 is finished). I.e., a sync here ensures that the replica nodes have seen the changes, which is what you want. So when you say "since another copy is always there on the replica nodes", that is only guaranteed after an hflush (which, again, HBase calls sync).

I've also written about this here: http://hadoop-hbase.blogspot.com/2012/05/hbase-hdfs-and-durable-sync.html

-- Lars

________________________________
From: Mohit Anchlia <[email protected]>
To: [email protected]
Sent: Tuesday, July 31, 2012 6:09 PM
Subject: sync on writes

In the HBase book it is mentioned that the default behaviour of a write is to call sync on each node before sending replica copies to the nodes in the pipeline. Is there a reason this was kept as the default? If data is being written to multiple nodes, then the likelihood of losing data is really low, since another copy is always there on the replica nodes. Is it OK to make this sync async, and is it advisable?
