I don't know if it's normal, but I see a lot of '0's in the test results when a run is about to fail, such as:
1196 sec: 7394901 operations; 0 current ops/sec;

--
deniz

On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR <[email protected]> wrote:
> Hi,
>
> Thanks for the effort, answers below:
>
> On Mon, Mar 7, 2011 at 6:08 PM, Stack <[email protected]> wrote:
>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR <[email protected]> wrote:
>> > We have a 5-node cluster, 4 of them being region servers. I am running a
>> > custom workload with YCSB, and while the data is loading (heavy insert),
>> > at least one of the region servers dies after about 600,000 operations.
>>
>> Tell us the character of your 'custom workload', please.
>>
> The workload is below; the part that fails is the loading part (-load),
> which inserts all the records first:
>
> recordcount=10000000
> operationcount=3000000
> workload=com.yahoo.ycsb.workloads.CoreWorkload
>
> readallfields=true
>
> readproportion=0.5
> updateproportion=0.1
> scanproportion=0
> insertproportion=0.35
> readmodifywriteproportion=0.05
>
> requestdistribution=zipfian
>
>> > There are no abnormalities in the logs as far as I can see; the only
>> > common point is that all of them (in different trials, different region
>> > servers fail) request a flush as their last log entries, given below.
>> > The .out files are empty. I am looking at the /var/log/hbase folder for
>> > logs. Running Sun Java 6, latest version. I couldn't find any logs that
>> > indicate a problem with Java. Tried the tests with OpenJDK and had the
>> > same results.
>>
>> It's strange that a flush is the last thing in your log. The process is
>> dead? We are exiting without a note in the logs? That's unusual. We
>> usually scream loudly when dying.
>>
> Yes, that's the strange part. The last line is a flush, as if the process
> never failed. Yes, the process is dead and HBase cannot see the node.
>
>> > I have set ulimits (50000) and xceivers (20000) for multiple users and
>> > am certain that they are correct.
>>
>> The first line in an HBase log prints out the ulimit it sees. You
>> might check that the HBase process for sure is picking up your ulimit
>> setting.
>>
> That was a mistake I made a couple of days ago; I checked it with
> cat /proc/<pid of regionserver>/limits, and all related users, like
> 'hbase', have those limits. Checked the logs:
>
> Mon Mar 7 06:41:15 EET 2011 Starting regionserver on test-1
> ulimit -n 52768
>
>> > Also, in the kernel logs there are no apparent problems.
>>
>> (The mystery compounds.)
>>
>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
>> > requested for usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3.
>> > because regionserver60020.cacheFlusher; priority=3, compaction queue size=18
>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore
>> > for region usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.,
>> > flushing=false, writesEnabled=false
>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush
>> > for usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6.,
>> > current region memstore size 68.6m
>> > 2011-03-07 15:07:58,310 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on
>> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.
>> > -end of log file-
>> > ---
>>
>> Nothing more?
>>
> No, nothing after that. But quite a lot of logs before that; I can send
> them if you'd like.
>
>> Thanks,
>> St.Ack
>
> Thanks a lot!
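
For anyone reproducing the load phase above: the workload file gets passed
to the YCSB client with -P, and the HBase binding needs the cluster's conf
directory on its classpath plus a columnfamily property. A minimal sketch of
the invocation (jar names, paths, and the column family name are assumptions
about a typical install, not necessarily the poster's exact setup):

    # Build a classpath holding the YCSB jar, the HBase jars, and the HBase
    # conf dir, so the client can find the cluster. Paths are placeholders.
    CP=build/ycsb.jar:$HBASE_HOME/conf
    for j in $HBASE_HOME/*.jar $HBASE_HOME/lib/*.jar; do CP=$CP:$j; done

    # -load runs the insert phase of the workload file given with -P;
    # -p columnfamily must name an existing family in 'usertable'.
    java -cp $CP com.yahoo.ycsb.Client -load \
        -db com.yahoo.ycsb.db.HBaseClient \
        -P path/to/custom-workload \
        -p columnfamily=family \
        -s

The periodic "N sec: ... operations; ... current ops/sec" status lines quoted
at the top of the thread are what the -s flag emits, so "0 current ops/sec"
means the client made no progress at all during that reporting interval.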
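
Likewise, when double-checking the ulimit and xceiver settings discussed
above, it is the running process that matters, not what limits.conf
declares. Something like the following works; the pgrep pattern and the
hdfs-site.xml location are assumptions about a typical Hadoop 0.20-era
layout:

    # Ask the kernel what the live region server is actually allowed;
    # this is the cat /proc/<pid>/limits check, narrowed to open files.
    PID=$(pgrep -f HRegionServer)
    grep 'Max open files' /proc/$PID/limits

    # The xceiver ceiling is a datanode-side setting in hdfs-site.xml;
    # note the historical misspelling 'xcievers' in the real property name.
    grep -B1 -A2 'dfs.datanode.max.xcievers' $HADOOP_HOME/conf/hdfs-site.xml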
