Hi, Qiao’s HBase log shows that HBase failed to open the table regions under the “_MD_” schema; the error stack looks like this:
2016-09-08 16:44:36,327 ERROR [RS_OPEN_REGION-hadoop2slave7:60020-0] handler.OpenRegionHandler: Failed open of region=TRAFODION._MD_.COLUMNS,,1471946223350.b6191867e73d4203d3ac6fad3c860138., starting to roll back the global memstore size.
org.apache.hadoop.hbase.DroppedSnapshotException: region: TRAFODION._MD_.COLUMNS,,1471946223350.b6191867e73d4203d3ac6fad3c860138.
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2243)
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1972)
    at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:3826)
    at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionStores(HRegion.java:969)
    at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:841)
    at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:814)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:5828)
    at org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion.openHRegion(TransactionalRegion.java:101)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:5794)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:5765)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:5721)
    at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:5672)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:356)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:126)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.AssertionError: Key \xB9"b*M3c\x00ADMCKID /#1:\x01/1473306352163/Put/vlen=8/seqid=1749 followed by a smaller key \xB9"b*M3c\x00ADMCKID /#1:\x01/1473306352163/Put/vlen=8/seqid=4003 in cf #1
    at org.apache.hadoop.hbase.regionserver.StoreScanner.checkScanOrder(StoreScanner.java:699)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:493)
    at org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:115)
    at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:71)
    at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:940)
    at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2217)
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2197)
    ... 17 more

I am not sure why this happened; HBase itself was working well. Since the metadata was not available and Qiao’s data is just test data, he reinitialized Trafodion and it recovered. I do not have enough information to determine the root cause yet. The error above happened during HBase startup.
There are some HDFS errors before the HBase abort:
---------------------------------------------------------------------------------
2016-09-07 22:34:21,228 ERROR [regionserver/hadoop2slave7/10.1.1.22:60020] wal.ProtobufLogWriter: Got IOException while writing trailer
java.nio.channels.ClosedChannelException
    at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:1635)
    at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:104)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at com.google.protobuf.CodedOutputStream.refreshBuffer(CodedOutputStream.java:833)
    at com.google.protobuf.CodedOutputStream.flush(CodedOutputStream.java:843)
    at com.google.protobuf.AbstractMessageLite.writeTo(AbstractMessageLite.java:80)
    at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.writeWALTrailer(ProtobufLogWriter.java:157)
    at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.close(ProtobufLogWriter.java:130)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.shutdown(FSHLog.java:1149)
    at org.apache.hadoop.hbase.wal.DefaultWALProvider.shutdown(DefaultWALProvider.java:114)
    at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:215)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1248)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1003)
    at java.lang.Thread.run(Thread.java:745)

And:

2016-09-07 22:34:20,765 ERROR [sync.4] wal.FSHLog: Error syncing, request close of wal
java.io.IOException: All datanodes DatanodeInfoWithStorage[10.1.1.23:50010,DS-ba16d69a-682c-42db-8a4f-e0d369e5d397,DISK] are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1218)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)
2016-09-07 22:34:20,767 WARN [RS_OPEN_META-hadoop2slave7:60020-0-MetaLogRoller] wal.FSHLog: Failed last sync but no outstanding unsync edits so falling through to close; java.io.IOException: All datanodes DatanodeInfoWithStorage[10.1.1.23:50010,DS-ba16d69a-682c-42db-8a4f-e0d369e5d397,DISK] are bad. Aborting...
2016-09-07 22:34:20,767 ERROR [RS_OPEN_META-hadoop2slave7:60020-0-MetaLogRoller] wal.ProtobufLogWriter: Got IOException while writing trailer
java.io.IOException: All datanodes DatanodeInfoWithStorage[10.1.1.23:50010,DS-ba16d69a-682c-42db-8a4f-e0d369e5d397,DISK] are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1218)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)
2016-09-07 22:34:20,767 ERROR [RS_OPEN_META-hadoop2slave7:60020-0-MetaLogRoller] wal.FSHLog: Failed close of WAL writer hdfs://hadoop2slave7:8020/hbase/WALs/hadoop2slave7,60020,1473040797512/hadoop2slave7%2C60020%2C1473040797512..meta.1473255260637.meta, unflushedEntries=0
java.io.IOException: All datanodes DatanodeInfoWithStorage[10.1.1.23:50010,DS-ba16d69a-682c-42db-8a4f-e0d369e5d397,DISK] are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1218)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)
2016-09-07 22:34:20,767 FATAL [RS_OPEN_META-hadoop2slave7:60020-0-MetaLogRoller] regionserver.HRegionServer: ABORTING region server hadoop2slave7,60020,1473040797512: Failed log close in log roller
org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: hdfs://hadoop2slave7:8020/hbase/WALs/hadoop2slave7,60020,1473040797512/hadoop2slave7%2C60020%2C1473040797512..meta.1473255260637.meta, unflushedEntries=0
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:978)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:716)
    at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:137)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: All datanodes DatanodeInfoWithStorage[10.1.1.23:50010,DS-ba16d69a-682c-42db-8a4f-e0d369e5d397,DISK] are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1218)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1016)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:560)
2016-09-07 22:34:20,768 FATAL [RS_OPEN_META-hadoop2slave7:60020-0-MetaLogRoller] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.AggregateImplementation, org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint, org.apache.hadoop.hbase.coprocessor.transactional.TrxRegionObserver, org.apache.hadoop.hbase.coprocessor.transactional.TrxRegionEndpoint]
------------------------------------------------------------------------------------

I am not sure whether this log information helps to find the root cause of the metadata corruption; I am still investigating.

Thanks,
Ming

From: 乔彦克 [mailto:qya...@gmail.com]
Sent: Friday, September 09, 2016 11:27 AM
To: d...@trafodion.incubator.apache.org; user@trafodion.incubator.apache.org
Cc: Amanda Moran <amanda.mo...@esgyn.com>; Selva Govindarajan <selva.govindara...@esgyn.com>; Liu, Ming (Ming) <ming....@esgyn.cn>
Subject: Re: Load with log error rows gets Trafodion not work

Thanks to Selva and Amanda. I loaded three data sets from Hive into Trafodion yesterday; the other two succeeded, and the last one got the error. This error meant that I could not execute any query from trafci except "initialize trafodion, drop" (thanks @Liuming for telling me to do so). Ming analyzed the HBase log and found that the data regions belonging to Trafodion could not be opened. After I initialized Trafodion again, I reloaded the three data sets and everything went well.

@Selva, Trafodion and HBase are running normally; below is the result of 'sqvers -u':

perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LC_CTYPE = "UTF-8",
    LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
cat: /opt/hptc/pdsh/nodes: No such file or directory
MY_SQROOT=/home/trafodion/apache-trafodion_server-2.0.1-incubating
who@host=trafodion@hadoop2slave7
JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
linux=2.6.32-220.el6.x86_64
redhat=6.2
NO patches
Most common Apache_Trafodion Release 2.0.1 (Build release [DEV], branch -, date 24Jun16)
UTT count is 2
[8] Apache_Trafodion Release 2.0.1 (Build release [DEV], branch release2.0, date 24Jun16)
    export/lib/hbase-trx-apache1_0_2-2.0.1.jar
    export/lib/hbase-trx-hdp2_3-2.0.1.jar
    export/lib/sqmanvers.jar
    export/lib/trafodion-dtm-apache1_0_2-2.0.1.jar
    export/lib/trafodion-dtm-hdp2_3-2.0.1.jar
    export/lib/trafodion-sql-apache1_0_2-2.0.1.jar
    export/lib/trafodion-sql-hdp2_3-2.0.1.jar
    export/lib/trafodion-utility-2.0.1.jar
[3] Release 2.0.1 (Build release [DEV], branch release2.0, date 24Jun16)
    export/lib/jdbcT2.jar
    export/lib/jdbcT4.jar
    export/lib/lib_mgmt.jar

@Amanda: The HDFS /user directory does not contain the user trafodion, just root and hive. But I can load and insert data into Trafodion, so I don't think the problem is there. Thank you for your replies.

Many thanks again,
Qiao

Amanda Moran <amanda.mo...@esgyn.com> wrote on Friday, September 9, 2016 at 1:03 AM:

Please run this command:

sudo su hdfs --command "hadoop fs -ls /user"

Please verify that you have the trafodion user id listed there.

Thanks!
Amanda

On Thu, Sep 8, 2016 at 8:08 AM, Selva Govindarajan <selva.govindara...@esgyn.com> wrote:

> Hi Qiao,
>
> The JIRA you mentioned in the message is already fixed and merged to
> Trafodion on July 20th. It is unfortunate that this JIRA wasn't marked
> as resolved. I have marked it as resolved now.
> This JIRA deals with the issue of a Trafodion process aborting when there
> is an error while logging the error rows. The error rows are logged in
> HDFS directly. Most likely the "trafodion" user has no write permission
> to the HDFS directory where the error rows are logged.
>
> You can try a "LOAD WITH CONTINUE ON ERROR ..." command instead and check
> if it works.
>
> Can you also please send the output of the command below, to confirm that
> the installed version has the above fix:
>
> sqvers -u
>
> Can you also issue the following commands, to confirm that Trafodion and
> HBase started successfully:
>
> hbcheck
> sqcheck
>
> Selva
>
> From: 乔彦克 [mailto:qya...@gmail.com]
> Sent: Thursday, September 8, 2016 12:20 AM
> To: user@trafodion.incubator.apache.org; dev@trafodion.incubator.apache.org
> Subject: Load with log error rows gets Trafodion not work
>
> Hi, all,
>
> I used LOAD WITH LOG ERROR ROWS to load data from Hive and got the
> following error:
>
> [image: loaderr.png]
>
> which led to the HBase region server crashing.
>
> I restarted the HBase region server and Trafodion, but queries in Trafodion
> get no response, even the simplest ones such as "get tables;" or "get schemas;".
>
> Can someone help me get Trafodion back to normal?
>
> https://issues.apache.org/jira/browse/TRAFODION-2109, this JIRA describes
> the same problem.
>
> Any reply is appreciated.
>
> Thank you,
> Qiao

--
Thanks,
Amanda Moran
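[Editor's note] For reference, the two load variants discussed in this thread could look like the following sketch in trafci. The target table trafodion.seabase.t1 and the Hive source table hive.hive.t1 are hypothetical names, not taken from this thread:

```sql
-- Variant that triggered the problem here: failed rows are written to an
-- HDFS error-log directory, which requires the trafodion user to have
-- write permission on that directory.
LOAD WITH LOG ERROR ROWS INTO trafodion.seabase.t1
  SELECT * FROM hive.hive.t1;

-- Selva's suggested workaround: skip error rows without logging them,
-- avoiding the HDFS write entirely.
LOAD WITH CONTINUE ON ERROR INTO trafodion.seabase.t1
  SELECT * FROM hive.hive.t1;
```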