[ https://issues.apache.org/jira/browse/HDFS-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yongjun Zhang updated HDFS-10942:
---------------------------------
    Description: 
We use EditsDoubleBuffer to handle edit logs:
{code}
/**
 * A double-buffer for edits. New edits are written into the first buffer
 * while the second is available to be flushed. Each time the double-buffer
 * is flushed, the two internal buffers are swapped. This allows edits
 * to progress concurrently to flushes without allocating new buffers each
 * time.
 */
{code}

The following code, which flushes the ready buffer, first copies the ready buffer 
into a local buffer and then flushes that copy.

{code}

QuorumOutputStream (buf in the code below is an instance of EditsDoubleBuffer):

  @Override
  protected void flushAndSync(boolean durable) throws IOException {
    int numReadyBytes = buf.countReadyBytes();
    if (numReadyBytes > 0) {
      int numReadyTxns = buf.countReadyTxns();
      long firstTxToFlush = buf.getFirstReadyTxId();

      assert numReadyTxns > 0;

      // Copy from our double-buffer into a new byte array. This is for
      // two reasons:
      // 1) The IPC code has no way of specifying to send only a slice of
      //    a larger array.
      // 2) because the calls to the underlying nodes are asynchronous, we
      //    need a defensive copy to avoid accidentally mutating the buffer
      //    before it is sent.
      DataOutputBuffer bufToSend = new DataOutputBuffer(numReadyBytes);
      buf.flushTo(bufToSend);
      assert bufToSend.getLength() == numReadyBytes;
      byte[] data = bufToSend.getData();
      assert data.length == bufToSend.getLength();
{code}

The above code does not appear to prevent the original buffer inside buf from 
being swapped by the following method while the flush is in progress.

{code}
EditsDoubleBuffer:

 public void setReadyToFlush() {
    assert isFlushed() : "previous data not flushed yet";
    TxnBuffer tmp = bufReady;
    bufReady = bufCurrent;
    bufCurrent = tmp;
  }

{code}
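To make the suspected interleaving concrete, here is a minimal, self-contained toy model. This is an illustration only, not the real EditsDoubleBuffer: the class and its types are simplified stand-ins, and the "concurrent" second swap is simulated inline on a single thread.

{code}
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Toy stand-in for EditsDoubleBuffer (simplified, illustration only).
class ToyDoubleBuffer {
  private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream();
  private ByteArrayOutputStream bufReady = new ByteArrayOutputStream();

  void write(byte[] b) throws IOException {
    bufCurrent.write(b);
  }

  // Swap with no lock and no enforced precondition, mirroring what
  // setReadyToFlush() does when its assert is disabled.
  void setReadyToFlush() {
    ByteArrayOutputStream tmp = bufReady;
    bufReady = bufCurrent;
    bufCurrent = tmp;
  }

  int countReadyBytes() {
    return bufReady.size();
  }

  void flushTo(ByteArrayOutputStream out) throws IOException {
    bufReady.writeTo(out);
    bufReady.reset();
  }
}

public class RaceSketch {
  public static void main(String[] args) throws IOException {
    ToyDoubleBuffer buf = new ToyDoubleBuffer();
    buf.write(new byte[] {1, 2, 3});
    buf.setReadyToFlush();                      // 3 bytes become "ready"

    int numReadyBytes = buf.countReadyBytes();  // the flusher reads 3

    // Suppose another thread swaps again right here, before flushTo()
    // runs. With the assert disabled, nothing stops it:
    buf.write(new byte[] {4});
    buf.setReadyToFlush();

    ByteArrayOutputStream bufToSend = new ByteArrayOutputStream();
    buf.flushTo(bufToSend);

    // The counts no longer agree; in flushAndSync() only a disabled
    // assert would have caught this.
    System.out.println(numReadyBytes + " != " + bufToSend.size()); // 3 != 1
  }
}
{code}

Note that after the second swap the original three ready bytes are stranded back in bufCurrent, so they would be flushed late or, if the process dies first, not at all.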

Though we have some runtime assertions in this code, assertions are normally 
disabled in production, so the condition an assert expects may actually be false 
at runtime without anyone noticing, which could silently corrupt the edit log 
stream. When a condition checked by an assertion does not hold, a real exception 
should arguably be thrown instead.
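For illustration, hardening setReadyToFlush() with an explicit check might look like the sketch below; the exception type and message are assumptions for the sake of the example, not an actual patch:

{code}
  public void setReadyToFlush() {
    if (!isFlushed()) {
      // Illustrative hardening; the exception type is an assumption.
      // Fails loudly even when JVM assertions (-ea) are disabled.
      throw new IllegalStateException(
          "previous data not flushed yet; refusing to swap buffers");
    }
    TxnBuffer tmp = bufReady;
    bufReady = bufCurrent;
    bufCurrent = tmp;
  }
{code}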

So, two issues in summary:
- How do we synchronize between the flush and the swap of the two buffers? (one 
possible direction is sketched below)
- Should we throw real exceptions, as sketched above, instead of relying on 
asserts that are normally disabled at runtime?
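For the first point, one possible direction (again a sketch with simplified stand-in types, not a proposed patch) is to serialize the swap and the flush on the same monitor:

{code}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Sketch: swap and flush are guarded by the same lock, so neither can
// interleave with the other. Types are simplified stand-ins.
class LockedDoubleBuffer {
  private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream();
  private ByteArrayOutputStream bufReady = new ByteArrayOutputStream();

  public synchronized void setReadyToFlush() throws IOException {
    if (bufReady.size() != 0) {
      throw new IOException("previous data not flushed yet");
    }
    ByteArrayOutputStream tmp = bufReady;
    bufReady = bufCurrent;
    bufCurrent = tmp;
  }

  public synchronized int countReadyBytes() {
    return bufReady.size();
  }

  public synchronized void flushTo(OutputStream out) throws IOException {
    bufReady.writeTo(out);
    bufReady.reset();
  }
}
{code}

Note that flushAndSync() calls countReadyBytes() and flushTo() separately, so per-method locking alone is not sufficient; the caller would also need to hold the same lock across both calls, or flushTo() would need to return the number of bytes it actually wrote.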

A possible third issue:
* If the NN crashes before the edits in bufCurrent are swapped into bufReady and 
flushed, those edits will be lost.


> Incorrect handling of flushing edit logs to JN
> ----------------------------------------------
>
>                 Key: HDFS-10942
>                 URL: https://issues.apache.org/jira/browse/HDFS-10942
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>            Reporter: Yongjun Zhang