Hi,

Thanks for the effort, answers below:

On Mon, Mar 7, 2011 at 6:08 PM, Stack <[email protected]> wrote:

> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR <[email protected]> wrote:
> > We have a 5 node cluster, 4 of them being region servers. I am running a
> > custom workload with YCSB and when the data is loading (heavy insert) at
> > least one of the region servers are dying after about 600000 operations.
>
> Tell us the character of your 'custom workload' please.

The workload is below; the part that fails is the loading phase (-load), which inserts all the records first:

recordcount=10000000
operationcount=3000000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0.5
updateproportion=0.1
scanproportion=0
insertproportion=0.35
readmodifywriteproportion=0.05
requestdistribution=zipfian

> > There are no abnormalities in the logs as far as I can see, the only common
> > point is that all of them(in different trials, different region servers
> > fail) request for a flush as the last logs, given below. .out files are
> > empty. I am looking at the /var/log/hbase folder for logs. Running sun java
> > 6 latest version. I couldn't find any logs that indicates a problem with
> > java. Tried the tests with openjdk and had the same results.
>
> Its strange that flush is the last thing in your log. The process is
> dead? We are exiting w/o a note in logs? Thats unusual. We usually
> scream loudly when dying.

Yes, that's the strange part. The last line is a flush, as if the process never failed. Yes, the process is dead and hbase cannot see the node.

> > I have set ulimits(50000) and xceivers(20000) for multiple users and certain
> > that they are correct.
>
> The first line in an hbase log prints out the ulimit it sees. You
> might check that the hbase process for sure is picking up your ulimit
> setting.

That was a mistake I made a couple of days ago. I checked it with cat /proc/<pid of regionserver>/limits, and all related users like 'hbase' have those limits.

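For anyone repeating the /proc check, it can also be scripted. Below is a minimal sketch in Python, assuming a Linux host and the usual column layout of /proc/<pid>/limits (limit name, soft limit, hard limit, units). The function names are made up for illustration and are not part of HBase or YCSB.

```python
from pathlib import Path

ROW_NAME = "Max open files"


def _to_int(value):
    # The kernel prints 'unlimited' when no limit is set.
    return None if value == "unlimited" else int(value)


def parse_open_files_limit(limits_text):
    """Return (soft, hard) open-file limits from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith(ROW_NAME):
            # After the row name come the soft limit, hard limit and units.
            soft, hard = line[len(ROW_NAME):].split()[:2]
            return _to_int(soft), _to_int(hard)
    raise ValueError("no 'Max open files' row found")


def open_files_limit(pid):
    """Read the limits the kernel applies to a live process (Linux only)."""
    return parse_open_files_limit(Path(f"/proc/{pid}/limits").read_text())
```

Calling open_files_limit(pid) with the region server's pid reports what the running process actually got, which can differ from what the shell's ulimit -n shows for the same user.
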
Checked the logs:

Mon Mar 7 06:41:15 EET 2011 Starting regionserver on test-1
ulimit -n 52768

> > Also in the kernel logs, there are no apparent problems.
>
> (The mystery compounds)
>
> > 2011-03-07 15:07:58,301 DEBUG
> > org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
> > requested for
> > usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3.
> > because regionserver60020.cacheFlusher; priority=3, compaction queue size=18
> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> > NOT flushing memstore for region
> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.,
> > flushing=false, writesEnabled=false
> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> > Started memstore flush for
> > usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6.,
> > current region memstore size 68.6m
> > 2011-03-07 15:07:58,310 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> > Flush requested on
> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.
> > -end of log file-
> > ---
>
> Nothing more?

No, nothing after that. But there are quite a lot of logs before that; I can send them if you'd like.

> Thanks,
> St.Ack

Thanks a lot!
