[ https://issues.apache.org/jira/browse/TRAFODION-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14708643#comment-14708643 ]
Atanu Mishra commented on TRAFODION-270: ---------------------------------------- Oliver Bucaojit (oliver-bucaojit) wrote on 2014-05-15: #1 In the transactional region, we delay the split from occurring until there are no active transactions. The transactionstate is held by each region and work would need to be done to move the necessary data to the daughter region on split if we are to support region splitting. Our current fix for this issue is to have regions delay splitting when there are preparing or active transactions during a split, which is why the log message is being repeated that it is stuck Preparing to close the transactional region. One solution that can be done on the user side is to check for open sqlci or jdbc connections with transactions running and close them. This will allow the region to split immediately. Another solution that I've implemented would disable the split delaying and transactions that are in flight will then be aborted since they will not be able to communicate with the region after it has been split and relocated. This would be useful in development or if we want to split and wouldn't want the region to get stuck in a loop. I have seen cases where we get stuck in a loop where the C++ side DTM aborts or gets killed and a transaction remains on the HBase region. This property to disable the split delay is below, and is added to conf/hbase-site.xml: <property> <name>hbase.regionserver.region.split.delay</name> <value>false</value> </property> If there is a case where there are no sqlci or transactions running and we see that the HBase region is stuck in this loop waiting for active transactions, and the TM is still running, then there is a bug and we will need to gather more logging and process information to debug this issue. One way to easily check if there is a transaction running from the TM perspective is through dtmci. Running dtmci and using the 'list' command will print out the current transactions and it's state. Bouncing the system will also get HBase back into a normal state because there will be no active transactions at that point. If there were any prepared transactions then it would go through the recovery flow and get redriven to abort or commit. Atanu Mishra (atanu-mishra) wrote on 2014-05-19: #2 Analysis by Narendra -- Later today, I would be checking in these changes to the datalake branch. Basically, only use the TransactionalRegion when the transid is not 0. That way, we will not ‘implicitly’ start a transaction with transid=0. This should help with the Launchpad bug 1319965. What happens is that once we start a transaction (in the region server) with transid=0, it stays put (as the user did not start a transaction with ‘transid=0’), and hence when the user tries to split the region manually, the splitting does not happen (as the object transactionsById is not empty) [I applied the same code updates to the other aggregator methods in this class (getMax/Min/Sum…)] I am hoping that it would help with the bug 1309121 too. Changed in trafodion: status: New → In Progress Atanu Mishra (atanu-mishra) on 2014-05-20 Changed in trafodion: status: In Progress → Fix Committed Stacey Johnson (sjohnson-w) on 2014-06-10 information type: Proprietary → Public Alice Chen (alchen) on 2014-10-15 Changed in trafodion: milestone: none → r0.8 status: Fix Committed → Fix Released > LP Bug: 1319965 - Regionserver looping on active transaction. > ------------------------------------------------------------- > > Key: TRAFODION-270 > URL: https://issues.apache.org/jira/browse/TRAFODION-270 > Project: Apache Trafodion > Issue Type: Bug > Components: dtm > Reporter: Guy Groulx > Assignee: Oliver Bucaojit > Priority: Critical > Fix For: 0.8 (pre-incubation) > > > I was trying to create a scenario for Trina on the errors we are getting when > a split and/or balance happens. > So I tried to use the split command on a table to force a split. It > looked like the split command was not working. I started looking at the > various logs and found that the regionserver containing my table goes into a > loop about a transaction not completing. > Here’s my table. > Table Regions > Name Region Server Start Key End Key Requests > Name Region Server Start Key End Key Requests > TRAFODION.MXOLTP.TBL500,,1400169591672.616691fe476f230bfebe6b7a3907f0b8. > n006.cm.cluster:60030 \x00\x07\xBF\x14 > TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753. > n006.cm.cluster:60030 \x00\x07\xBF\x14 > IE: Currently both on n006. > Starting trafci: > SQL>set schema trafodion.mxoltp; > --- SQL operation complete. > SQL>prepare cmd from select count(*) from TBL500; > --- SQL command prepared. > SQL>execute cmd; > (EXPR) > -------------------- > 1015500 > --- 1 row(s) selected. > SQL> > In hbase shell, I enabled balancer. > split 'TRAFODION.MXOLTP.TBL500’, ‘\x00\x07\x00\x00’ > And the regionserver on n006 starts looping on the following: > 2014-05-15 18:20:02,910 INFO > org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: > Preparing to split region > TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753. > 2014-05-15 18:20:02,911 INFO > org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: > Preparing to close transactional region > [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], > but still have [0] transactions that are pending commit. And [ 1] active > transactions. Sleeping > 2014-05-15 18:20:03,911 INFO > org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: > Preparing to close transactional region > [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], > but still have [0] transactions that are pending commit. And [ 1] active > transactions. Sleeping > 2014-05-15 18:20:04,912 INFO > org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: > Preparing to close transactional region > [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], > but still have [0] transactions that are pending commit. And [ 1] active > transactions. Sleeping > 2014-05-15 18:20:05,912 INFO > org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: > Preparing to close transactional region > [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], > but still have [0] transactions that are pending commit. And [ 1] active > transactions. Sleeping > 2014-05-15 18:20:06,913 INFO > org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: > Preparing to close transactional region > [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], > but still have [0] transactions that are pending commit. And [ 1] active > transactions. Sleeping > 2014-05-15 18:20:07,913 INFO > org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: > Preparing to close transactional region > [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], > but still have [0] transactions that are pending commit. And [ 1] active > transactions. Sleeping > 2014-05-15 18:20:08,913 INFO > org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: > Preparing to close transactional region > [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], > but still have [0] transactions that are pending commit. And [ 1] active > transactions. Sleeping > 2014-05-15 18:20:09,914 INFO > org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: > Preparing to close transactional region > [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], > but still have [0] transactions that are pending commit. And [ 1] active > transactions. Sleeping > 2014-05-15 18:20:10,914 INFO > org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: > Preparing to close transactional region > [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], > but still have [0] transactions that are pending commit. And [ 1] active > transactions. Sleeping > 2014-05-15 18:20:11,915 INFO > org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: > Preparing to close transactional region > [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], > but still have [0] transactions that are pending commit. And [ 1] active > transactions. Sleeping > ^C > [root@n006 hbase]# > Even if I stop my hpdci session, the msgs above continue. > I then have to bounce the regionserver to recover from this. And the spit > never actually happens. -- This message was sent by Atlassian JIRA (v6.3.4#6332)