Along with a bigger portion of the log, it might be good to check if there's anything in the .out file that looks like a JVM error.
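Something like this would do for a first pass (a sketch: the *.out location comes from the /var/log/hbase directory mentioned below, and hs_err_pid*.log is the JVM's default crash-report name, normally dropped in the process's working directory, so adjust the paths to your install):

  # anything JVM-ish captured on stdout/stderr?
  grep -iE 'fatal|error|exception|outofmemory' /var/log/hbase/*.out

  # a hard JVM crash usually leaves a crash report behind
  find /usr/lib/hbase /var/log/hbase -name 'hs_err_pid*.log' 2>/dev/null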
J-D

On Mon, Mar 7, 2011 at 9:22 AM, M.Deniz OKTAR <[email protected]> wrote:
> I ran every kind of benchmark I could find on those machines and they
> seemed to work fine. Did memory/disk tests too.
>
> The master node or other nodes provide some information and exceptions
> about not being able to reach the dead node.
>
> Btw, sometimes the process does not die but loses the connection.
>
> --
> deniz
>
> On Mon, Mar 7, 2011 at 7:19 PM, Stack <[email protected]> wrote:
>
>> I'm stumped. I have nothing to go on when there are no death throes or
>> complaints. This hardware for sure is healthy? Other stuff runs w/o
>> issue?
>> St.Ack
>>
>> On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR <[email protected]> wrote:
>> > I don't know if it's normal, but I see a lot of '0's in the test
>> > results when it tends to fail, such as:
>> >
>> > 1196 sec: 7394901 operations; 0 current ops/sec;
>> >
>> > --
>> > deniz
>> >
>> > On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> Thanks for the effort, answers below:
>> >>
>> >> On Mon, Mar 7, 2011 at 6:08 PM, Stack <[email protected]> wrote:
>> >>
>> >>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR <[email protected]> wrote:
>> >>> > We have a 5 node cluster, 4 of them being region servers. I am
>> >>> > running a custom workload with YCSB, and while the data is loading
>> >>> > (heavy insert) at least one of the region servers dies after about
>> >>> > 600000 operations.
>> >>>
>> >>> Tell us the character of your 'custom workload' please.
>> >>>
>> >> The workload is below; the part that fails is the loading part (-load),
>> >> which inserts all the records first:
>> >>
>> >> recordcount=10000000
>> >> operationcount=3000000
>> >> workload=com.yahoo.ycsb.workloads.CoreWorkload
>> >>
>> >> readallfields=true
>> >>
>> >> readproportion=0.5
>> >> updateproportion=0.1
>> >> scanproportion=0
>> >> insertproportion=0.35
>> >> readmodifywriteproportion=0.05
>> >>
>> >> requestdistribution=zipfian
>> >>
>> >>> > There are no abnormalities in the logs as far as I can see; the only
>> >>> > common point is that all of them (in different trials, different
>> >>> > region servers fail) request a flush as their last log entries,
>> >>> > given below. The .out files are empty. I am looking at the
>> >>> > /var/log/hbase folder for logs. Running Sun Java 6, latest version.
>> >>> > I couldn't find any logs that indicate a problem with Java. Tried
>> >>> > the tests with OpenJDK and had the same results.
>> >>>
>> >>> It's strange that a flush is the last thing in your log. The process
>> >>> is dead? We are exiting w/o a note in the logs? That's unusual. We
>> >>> usually scream loudly when dying.
>> >>>
>> >> Yes, that's the strange part. The last line is a flush, as if the
>> >> process never failed. Yes, the process is dead and HBase cannot see
>> >> the node.
>> >>
>> >>> > I have set ulimits (50000) and xceivers (20000) for multiple users
>> >>> > and am certain that they are correct.
>> >>>
>> >>> The first line in an hbase log prints out the ulimit it sees. You
>> >>> might check that the hbase process for sure is picking up your ulimit
>> >>> setting.
>> >>>
>> >> That was a mistake I made a couple of days ago; checked it with cat
>> >> /proc/<pid of regionserver>/limits, and all related users like 'hbase'
>> >> have those limits.
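>> >> For reference, that check was along these lines (a sketch, not the
>> >> exact command history; the pid lookup via jps is just one way to
>> >> find it):
>> >>
>> >> RS_PID=$(jps | awk '/HRegionServer/ {print $1}')
>> >> grep 'open files' /proc/$RS_PID/limits
>> >>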
>> >> Checked the logs:
>> >>
>> >> Mon Mar 7 06:41:15 EET 2011 Starting regionserver on test-1
>> >> ulimit -n 52768
>> >>
>> >>> > Also in the kernel logs, there are no apparent problems.
>> >>>
>> >>> (The mystery compounds)
>> >>>
>> >>> > 2011-03-07 15:07:58,301 DEBUG
>> >>> > org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
>> >>> > requested for
>> >>> > usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3.
>> >>> > because regionserver60020.cacheFlusher; priority=3, compaction
>> >>> > queue size=18
>> >>> > 2011-03-07 15:07:58,301 DEBUG
>> >>> > org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore
>> >>> > for region
>> >>> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.,
>> >>> > flushing=false, writesEnabled=false
>> >>> > 2011-03-07 15:07:58,301 DEBUG
>> >>> > org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush
>> >>> > for
>> >>> > usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6.,
>> >>> > current region memstore size 68.6m
>> >>> > 2011-03-07 15:07:58,310 DEBUG
>> >>> > org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on
>> >>> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.
>> >>> > -end of log file-
>> >>> > ---
>> >>>
>> >>> Nothing more?
>> >>>
>> >> No, nothing after that. But quite a lot of logs before that; I can
>> >> send them if you'd like.
>> >>
>> >>> Thanks,
>> >>> St.Ack
>> >>
>> >> Thanks a lot!
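>> >>
>> >> PS: the kernel-log check above was roughly the following (a sketch,
>> >> not the exact command history; 'Killed process' is the marker the
>> >> Linux OOM killer leaves, and the syslog path varies by distro):
>> >>
>> >> dmesg | grep -iE 'oom|killed process'
>> >> grep -iE 'oom|killed process' /var/log/messages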
