Hi all,

Thanks for the support. I've been trying to replicate the problem since this
morning. Before doing that, I played with the configuration. I used to have
only one user and had set all the permissions accordingly. Now I've followed
the Cloudera manuals and set permissions for the hdfs and mapred users
(changed hbase-env.sh).
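While re-checking the configuration, a quick way to confirm which limits a running process actually picked up (the same /proc check discussed further down in the thread) is sketched below; the pid lookup is an assumption, adjust it to your install:

```shell
# Read the file-descriptor limit a live process actually inherited.
# Pass the region server's pid as the first argument (e.g. from
# `pgrep -f HRegionServer`); defaults to the current shell ("self")
# purely for illustration.
pid=${1:-self}
grep 'Max open files' "/proc/${pid}/limits"
```

If the printed soft/hard values don't match what you set in limits.conf, the process was started before the new limits took effect (or under a different user).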
I had 2 trials; in both, the YCSB test failed with lots of "0" readings, but
the region servers didn't die. At some points in the test (including when it
failed), the HBase master threw exceptions about not being able to reach one
of the servers. I also lost the SSH connection to that server, but after a
while it recovered (the HMaster did too). The last thing in the region server
logs was that it was going for a flush. I'll go over the tests again and
provide you with clean log files from all servers (Hadoop, HBase, namenode,
and master node logs). If you have any suggestions or directions for me to
better diagnose the problem, that would be lovely.

btw: these servers do not have ECC memory, but I do not see any corruption in
the data.

Thanks!

--
deniz

On Mon, Mar 7, 2011 at 7:47 PM, Jean-Daniel Cryans <[email protected]> wrote:
> Along with a bigger portion of the log, it might be good to check if
> there's anything in the .out file that looks like a JVM error.
>
> J-D
>
> On Mon, Mar 7, 2011 at 9:22 AM, M.Deniz OKTAR <[email protected]> wrote:
> > I ran every kind of benchmark I could find on those machines and they
> > seemed to work fine. Did memory/disk tests too.
> >
> > The master node and the other nodes log some information and exceptions
> > about not being able to reach the dead node.
> >
> > Btw, sometimes the process does not die but loses the connection.
> >
> > --
> > deniz
> >
> > On Mon, Mar 7, 2011 at 7:19 PM, Stack <[email protected]> wrote:
> >
> >> I'm stumped. I have nothing to go on when there are no death throes or
> >> complaints. This hardware for sure is healthy? Other stuff runs w/o
> >> issue?
> >> St.Ack
> >>
> >> On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR <[email protected]> wrote:
> >> > I don't know if it's normal, but I see a lot of '0's in the test
> >> > results when it tends to fail, such as:
> >> >
> >> > 1196 sec: 7394901 operations; 0 current ops/sec;
> >> >
> >> > --
> >> > deniz
> >> >
> >> > On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR <[email protected]> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> Thanks for the effort, answers below:
> >> >>
> >> >> On Mon, Mar 7, 2011 at 6:08 PM, Stack <[email protected]> wrote:
> >> >>
> >> >>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR <[email protected]> wrote:
> >> >>> > We have a 5-node cluster, 4 of them being region servers. I am
> >> >>> > running a custom workload with YCSB, and while the data is loading
> >> >>> > (heavy insert) at least one of the region servers dies after about
> >> >>> > 600000 operations.
> >> >>>
> >> >>> Tell us the character of your 'custom workload' please.
> >> >>>
> >> >> The workload is below; the part that fails is the loading part
> >> >> (-load), which inserts all the records first:
> >> >>
> >> >> recordcount=10000000
> >> >> operationcount=3000000
> >> >> workload=com.yahoo.ycsb.workloads.CoreWorkload
> >> >>
> >> >> readallfields=true
> >> >>
> >> >> readproportion=0.5
> >> >> updateproportion=0.1
> >> >> scanproportion=0
> >> >> insertproportion=0.35
> >> >> readmodifywriteproportion=0.05
> >> >>
> >> >> requestdistribution=zipfian
> >> >>
> >> >>> > There are no abnormalities in the logs as far as I can see; the
> >> >>> > only common point is that all of them (in different trials,
> >> >>> > different region servers fail) request a flush as their last log
> >> >>> > lines, given below. The .out files are empty. I am looking at the
> >> >>> > /var/log/hbase folder for logs. Running the latest Sun Java 6.
> >> >>> > I couldn't find any logs that indicate a problem with Java. Tried
> >> >>> > the tests with OpenJDK and had the same results.
> >> >>>
> >> >>> It's strange that flush is the last thing in your log. The process
> >> >>> is dead? We are exiting w/o a note in the logs? That's unusual. We
> >> >>> usually scream loudly when dying.
> >> >>
> >> >> Yes, that's the strange part. The last line is a flush, as if the
> >> >> process never failed. Yes, the process is dead and HBase cannot see
> >> >> the node.
> >> >>
> >> >>> > I have set ulimits (50000) and xceivers (20000) for multiple users
> >> >>> > and am certain that they are correct.
> >> >>>
> >> >>> The first line in an hbase log prints out the ulimit it sees. You
> >> >>> might check that the hbase process for sure is picking up your
> >> >>> ulimit setting.
> >> >>
> >> >> That was a mistake I made a couple of days ago; I checked it with
> >> >> cat /proc/<pid of regionserver>/limits, and all related users like
> >> >> 'hbase' have those limits. Checked the logs:
> >> >>
> >> >> Mon Mar 7 06:41:15 EET 2011 Starting regionserver on test-1
> >> >> ulimit -n 52768
> >> >>
> >> >>> > Also in the kernel logs, there are no apparent problems.
> >> >>>
> >> >>> (The mystery compounds)
> >> >>>
> >> >>> > 2011-03-07 15:07:58,301 DEBUG
> >> >>> > org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
> >> >>> > requested for
> >> >>> > usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3.
> >> >>> > because regionserver60020.cacheFlusher; priority=3, compaction
> >> >>> > queue size=18
> >> >>> > 2011-03-07 15:07:58,301 DEBUG
> >> >>> > org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore
> >> >>> > for region
> >> >>> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.,
> >> >>> > flushing=false, writesEnabled=false
> >> >>> > 2011-03-07 15:07:58,301 DEBUG
> >> >>> > org.apache.hadoop.hbase.regionserver.HRegion: Started memstore
> >> >>> > flush for
> >> >>> > usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6.,
> >> >>> > current region memstore size 68.6m
> >> >>> > 2011-03-07 15:07:58,310 DEBUG
> >> >>> > org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on
> >> >>> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.
> >> >>> > -end of log file-
> >> >>> > ---
> >> >>>
> >> >>> Nothing more?
> >> >>
> >> >> No, nothing after that. But quite a lot of logs before that; I can
> >> >> send them if you'd like.
> >> >>
> >> >>> Thanks,
> >> >>> St.Ack
> >> >>
> >> >> Thanks a lot!
