[ 
https://issues.apache.org/jira/browse/TRAFODION-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14708643#comment-14708643
 ] 

Atanu Mishra commented on TRAFODION-270:
----------------------------------------

Oliver Bucaojit (oliver-bucaojit) wrote on 2014-05-15:  #1
In the transactional region, we delay the split from occurring until there are 
no active transactions. The transactionstate is held by each region and work 
would need to be done to move the necessary data to the daughter region on 
split if we are to support region splitting.

Our current fix for this issue is to have regions delay splitting while there 
are preparing or active transactions, which is why the log keeps repeating the 
message that it is "Preparing to close transactional region". One workaround 
on the user side is to check for open sqlci or JDBC connections with 
transactions running and close them. This will allow the region to split 
immediately.
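
To make the delay behaviour concrete, here is a minimal sketch in Python. It is 
a toy model only, not the actual TransactionalRegion code (which is Java); the 
class name, method names, and the max_checks limit are all made up for 
illustration. The log text is copied from the regionserver messages in the 
description.

```python
import time


class RegionModel:
    """Toy stand-in for a transactional region (hypothetical, not Trafodion code)."""

    def __init__(self, name):
        self.name = name
        self.active = set()          # ids of transactions still running
        self.pending_commit = set()  # ids of prepared transactions awaiting commit

    def can_split(self):
        # The split is allowed only when nothing is in flight on this region.
        return not self.active and not self.pending_commit

    def prepare_to_split(self, max_checks, sleep_seconds=1.0, sleep=time.sleep):
        """Poll until the region is idle.

        Returns how many checks had to wait before the region was idle,
        or None if it was still busy after max_checks checks.
        """
        for waited in range(max_checks):
            if self.can_split():
                return waited
            print(f"Preparing to close transactional region [{self.name}], "
                  f"but still have [{len(self.pending_commit)}] transactions "
                  f"that are pending commit. And [{len(self.active)}] active "
                  f"transactions. Sleeping")
            sleep(sleep_seconds)
        return None


region = RegionModel("TBL500")
region.active.add(0)  # a transaction that nobody will ever end
# With a stuck transaction the poll never succeeds (sleep is stubbed out here).
stuck = region.prepare_to_split(max_checks=3, sleep=lambda seconds: None)
region.active.clear()  # e.g. the offending client connection was closed
idle = region.prepare_to_split(max_checks=3, sleep=lambda seconds: None)
```

In this model, closing the connection that owns the transaction is what empties 
the active set, which is why the split can then go ahead on the next check.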

Another solution that I've implemented disables the split delay. Transactions 
that are in flight will then be aborted, since they will not be able to 
communicate with the region after it has been split and relocated. This is 
useful in development, or when we want the split to go ahead and don't want 
the region to get stuck in a loop. I have seen cases where we get stuck in a 
loop because the C++ side DTM aborts or gets killed and a transaction remains 
on the HBase region. The property to disable the split delay is below, and is 
added to conf/hbase-site.xml:
      <property>
        <name>hbase.regionserver.region.split.delay</name>
        <value>false</value>
      </property>
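
A quick way to confirm the property actually made it into the file is to parse 
it. The snippet below is a small sketch using only the Python standard library; 
read_property is a hypothetical helper, and the embedded XML is an example 
document that wraps the property above in the usual <configuration> root.

```python
import xml.etree.ElementTree as ET

SPLIT_DELAY_PROPERTY = "hbase.regionserver.region.split.delay"


def read_property(site_xml_text, name):
    """Return the <value> text for the named property, or None if it is absent."""
    root = ET.fromstring(site_xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Example content only; a real check would read conf/hbase-site.xml instead.
EXAMPLE_SITE_XML = """
<configuration>
  <property>
    <name>hbase.regionserver.region.split.delay</name>
    <value>false</value>
  </property>
</configuration>
"""

print(read_property(EXAMPLE_SITE_XML, SPLIT_DELAY_PROPERTY))
```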

If there are no sqlci sessions or transactions running and the HBase region is 
still stuck in this loop waiting for active transactions while the TM is still 
running, then there is a bug and we will need to gather more logging and 
process information to debug it. An easy way to check whether a transaction is 
running from the TM perspective is through dtmci: running dtmci and using the 
'list' command will print out the current transactions and their state.

Bouncing the system will also get HBase back into a normal state because there 
will be no active transactions at that point. If there were any prepared 
transactions, they would go through the recovery flow and get redriven to 
abort or commit.

Atanu Mishra (atanu-mishra) wrote on 2014-05-19:        #2
Analysis by Narendra --

Later today, I will be checking in these changes to the datalake branch. 
Basically, only use the TransactionalRegion when the transid is not 0. That 
way, we will not ‘implicitly’ start a transaction with transid=0. This should 
help with the Launchpad bug 1319965.

What happens is that once we start a transaction (in the region server) with 
transid=0, it stays put (as the user did not start a transaction with 
‘transid=0’), and hence when the user tries to split the region manually, the 
split does not happen (as the transactionsById object is not empty).
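
The effect of the transid check can be shown with a small Python model. This is 
only a sketch of the idea described above, not the checked-in change: the class, 
the count_rows method, and the guard_zero switch are hypothetical, and only the 
transactions_by_id dictionary is meant to mirror the transactionsById object.

```python
class RegionTransactions:
    """Toy model of per-region transaction bookkeeping (hypothetical names)."""

    def __init__(self):
        # Mirrors the transactionsById object mentioned in the analysis.
        self.transactions_by_id = {}

    def count_rows(self, rows, transid, guard_zero=True):
        """Model of an aggregate call such as a row count."""
        if transid != 0 or not guard_zero:
            # Without the guard, transid=0 is registered here and, since no
            # user transaction owns it, nothing ever removes it again.
            self.transactions_by_id.setdefault(transid, "ACTIVE")
        return len(rows)

    def commit(self, transid):
        self.transactions_by_id.pop(transid, None)

    def can_split(self):
        # A split is only allowed when no transaction is registered.
        return not self.transactions_by_id


before_fix = RegionTransactions()
before_fix.count_rows([1, 2, 3], transid=0, guard_zero=False)
print(before_fix.can_split())  # False: the implicit transaction blocks the split

after_fix = RegionTransactions()
after_fix.count_rows([1, 2, 3], transid=0)
print(after_fix.can_split())   # True: nothing was registered for transid=0
```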

[I applied the same code updates to the other aggregator methods in this class 
(getMax/Min/Sum…)]

I am hoping that it would help with the bug 1309121 too.

Changed in trafodion:
status: New → In Progress
Atanu Mishra (atanu-mishra) on 2014-05-20
Changed in trafodion:
status: In Progress → Fix Committed
Stacey Johnson (sjohnson-w) on 2014-06-10
information type:       Proprietary → Public
Alice Chen (alchen) on 2014-10-15
Changed in trafodion:
milestone:      none → r0.8
status: Fix Committed → Fix Released


> LP Bug: 1319965 - Regionserver looping on active transaction.
> -------------------------------------------------------------
>
>                 Key: TRAFODION-270
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-270
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: dtm
>            Reporter: Guy Groulx
>            Assignee: Oliver Bucaojit
>            Priority: Critical
>             Fix For: 0.8 (pre-incubation)
>
>
> I was trying to create a scenario for Trina on the errors we are getting when 
> a split and/or balance happens.
> So I tried to use the split command on a table to force a split. It 
> looked like the split command was not working. I started looking at the 
> various logs and found that the regionserver containing my table goes into a 
> loop about a transaction not completing.
> Here’s my table.
> Table Regions
> Name  Region Server   Start Key       End Key Requests
> TRAFODION.MXOLTP.TBL500,,1400169591672.616691fe476f230bfebe6b7a3907f0b8.      
> n006.cm.cluster:60030           \x00\x07\xBF\x14
> TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.
>       n006.cm.cluster:60030   \x00\x07\xBF\x14
> I.e., both regions are currently on n006.
> Starting trafci:
> SQL>set schema trafodion.mxoltp;
> --- SQL operation complete.
> SQL>prepare cmd from select count(*) from TBL500;
> --- SQL command prepared.
> SQL>execute cmd;
> (EXPR)
> --------------------
>              1015500
> --- 1 row(s) selected.
> SQL>
> In hbase shell, I enabled balancer.
> split 'TRAFODION.MXOLTP.TBL500', '\x00\x07\x00\x00'
> And the regionserver on n006 starts looping on the following:
> 2014-05-15 18:20:02,910 INFO 
> org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: 
> Preparing to split region 
> TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.
> 2014-05-15 18:20:02,911 INFO 
> org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: 
> Preparing to close transactional region 
> [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.],
>  but still have [0] transactions that are pending commit. And [ 1] active 
> transactions. Sleeping
> 2014-05-15 18:20:03,911 INFO 
> org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: 
> Preparing to close transactional region 
> [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.],
>  but still have [0] transactions that are pending commit. And [ 1] active 
> transactions. Sleeping
> 2014-05-15 18:20:04,912 INFO 
> org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: 
> Preparing to close transactional region 
> [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.],
>  but still have [0] transactions that are pending commit. And [ 1] active 
> transactions. Sleeping
> 2014-05-15 18:20:05,912 INFO 
> org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: 
> Preparing to close transactional region 
> [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.],
>  but still have [0] transactions that are pending commit. And [ 1] active 
> transactions. Sleeping
> 2014-05-15 18:20:06,913 INFO 
> org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: 
> Preparing to close transactional region 
> [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.],
>  but still have [0] transactions that are pending commit. And [ 1] active 
> transactions. Sleeping
> 2014-05-15 18:20:07,913 INFO 
> org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: 
> Preparing to close transactional region 
> [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.],
>  but still have [0] transactions that are pending commit. And [ 1] active 
> transactions. Sleeping
> 2014-05-15 18:20:08,913 INFO 
> org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: 
> Preparing to close transactional region 
> [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.],
>  but still have [0] transactions that are pending commit. And [ 1] active 
> transactions. Sleeping
> 2014-05-15 18:20:09,914 INFO 
> org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: 
> Preparing to close transactional region 
> [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.],
>  but still have [0] transactions that are pending commit. And [ 1] active 
> transactions. Sleeping
> 2014-05-15 18:20:10,914 INFO 
> org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: 
> Preparing to close transactional region 
> [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.],
>  but still have [0] transactions that are pending commit. And [ 1] active 
> transactions. Sleeping
> 2014-05-15 18:20:11,915 INFO 
> org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: 
> Preparing to close transactional region 
> [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.],
>  but still have [0] transactions that are pending commit. And [ 1] active 
> transactions. Sleeping
> ^C
> [root@n006 hbase]#
> Even if I stop my hpdci session, the messages above continue.
> I then have to bounce the regionserver to recover from this. And the split 
> never actually happens.
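
For anyone who wants to spot this loop in a regionserver log without reading it 
by eye, the following Python sketch pulls the pending and active counts out of 
the "Preparing to close transactional region" message. The regular expression 
is written against the message text quoted above (unwrapped onto one line); the 
function name is hypothetical, and other versions may format the message 
differently.

```python
import re

WAIT_PATTERN = re.compile(
    r"Preparing to close transactional region \[(?P<region>.+?)\],\s*"
    r"but still have \[\s*(?P<pending>\d+)\] transactions that are pending commit\.\s+"
    r"And \[\s*(?P<active>\d+)\] active transactions"
)


def parse_wait_line(line):
    """Return region name and transaction counts, or None if the line does not match."""
    match = WAIT_PATTERN.search(line)
    if match is None:
        return None
    return {
        "region": match.group("region"),
        "pending": int(match.group("pending")),
        "active": int(match.group("active")),
    }


# One of the log entries from the description, joined back into a single line.
SAMPLE_LINE = (
    r"2014-05-15 18:20:03,911 INFO "
    r"org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: "
    r"Preparing to close transactional region "
    r"[TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], "
    r"but still have [0] transactions that are pending commit. And [ 1] active "
    r"transactions. Sleeping"
)

print(parse_wait_line(SAMPLE_LINE))
```

A run of consecutive matches for the same region with a non-zero active count 
and no change over time is the signature of the loop reported here.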



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
