Something new came up!
I tried to truncate 'usertable', which had ~12M entries.
The shell stayed at "disabling table" for a long time. The processes were still
there, but there were no requests, so I interrupted it with ctrl-c.
Then I ran count 'usertable' to see whether any data remained; the shell gave an
error, and one of the regionservers had a log such as the one below.
The master logs were similar (I tried to disable again, and the master
log is from that attempt):
Regionserver 2:
2011-03-08 16:47:24,852 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
2011-03-08 16:47:27,765 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=39.63 MB, free=4.65 GB, max=4.68 GB, blocks=35, accesses=376070, hits=12035, hitRatio=3.20%%, cachingAccesses=12070, cachingHits=12035, cachingHitsRatio=99.71%%, evictions=0, evicted=0, evictedPerRun=NaN
2011-03-08 16:47:28,863 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
2011-03-08 16:47:28,865 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: org.apache.hadoop.hbase.UnknownScannerException: Name: -1
        at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1795)
        at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
Master server:
... (same as above, then:)
2011-03-08 16:51:34,679 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=usertable,user1948102037,1299592536693.d5bae6bbe54aa182e1215ab626e0011e.
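
In case it helps, this is roughly what I was planning to try next from the
shell to get the table back into a known state (just a sketch using the
standard shell and hbck commands, so please correct me if this is the wrong
approach):

    hbase shell
    > is_enabled 'usertable'   # check what state the master thinks the table is in
    > disable 'usertable'      # retry the disable that I interrupted
    > truncate 'usertable'     # truncate again once the disable completes

    # and from the command line, check region assignment consistency:
    hbase hbck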
--
deniz
On Tue, Mar 8, 2011 at 4:34 PM, M.Deniz OKTAR <[email protected]> wrote:
> Hi all,
>
> Thanks for the support. I've been trying to replicate the problem since this
> morning. Before doing that, I played with the configuration. I used to have
> only one user and had set all the permissions according to that. Now I've
> followed the Cloudera manuals and set up permissions for the hdfs and mapred
> users (and changed hbase-env.sh accordingly).
>
> I made two attempts; in both, the Yahoo (YCSB) test failed because of
> receiving lots of "0"s, but the region servers didn't die. At some points in
> the test (including when it failed), the HBase master gave exceptions about
> not being able to reach one of the servers. I also lost the ssh connection to
> that server, but after a while it recovered (so did the HMaster). The last
> thing in the regionserver logs was that it was going for a flush.
>
> I'll go over the tests again and provide you with clean log files from all
> servers (hadoop, hbase, namenode, and master node logs).
>
> If you have any suggestions or directions for me to better diagnose the
> problem, that would be lovely.
>
> By the way, these servers do not have ECC memory, but I do not see any
> corruption in the data.
>
> Thanks!
>
> --
> deniz
>
>
> On Mon, Mar 7, 2011 at 7:47 PM, Jean-Daniel Cryans <[email protected]> wrote:
>
>> Along with a bigger portion of the log, it might be good to check if
>> there's anything in the .out file that looks like a JVM error.
>>
>> J-D
>>
>> On Mon, Mar 7, 2011 at 9:22 AM, M.Deniz OKTAR <[email protected]>
>> wrote:
>> > I ran every kind of benchmark I could find on those machines and they
>> > seemed to work fine. I did memory/disk tests too.
>> >
>> > The master node and other nodes log some information and exceptions
>> > about not being able to reach the dead node.
>> >
>> > Btw, sometimes the process does not die but loses the connection.
>> >
>> > --
>> >
>> > deniz
>> >
>> > On Mon, Mar 7, 2011 at 7:19 PM, Stack <[email protected]> wrote:
>> >
>> >> I'm stumped. I have nothing to go on when there are no death throes or
>> >> complaints. This hardware is for sure healthy? Other stuff runs w/o
>> >> issue?
>> >> St.Ack
>> >>
>> >> On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR <[email protected]>
>> >> wrote:
>> >> > I don't know if it's normal, but I see a lot of '0's in the test
>> >> > results when it tends to fail, such as:
>> >> >
>> >> > 1196 sec: 7394901 operations; 0 current ops/sec;
>> >> >
>> >> > --
>> >> > deniz
>> >> >
>> >> > On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR <[email protected]> wrote:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> Thanks for the effort, answers below:
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Mon, Mar 7, 2011 at 6:08 PM, Stack <[email protected]> wrote:
>> >> >>
>> >> >>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR <[email protected]> wrote:
>> >> >>> > We have a 5-node cluster, 4 of them being region servers. I am
>> >> >>> > running a custom workload with YCSB, and while the data is loading
>> >> >>> > (heavy insert) at least one of the region servers dies after about
>> >> >>> > 600000 operations.
>> >> >>>
>> >> >>>
>> >> >>> Tell us the character of your 'custom workload' please.
>> >> >>>
>> >> >>>
>> >> >> The workload is below; the part that fails is the loading part
>> >> >> (-load), which inserts all the records first. A rough example of how
>> >> >> I invoke it follows the properties.
>> >> >>
>> >> >> recordcount=10000000
>> >> >> operationcount=3000000
>> >> >> workload=com.yahoo.ycsb.workloads.CoreWorkload
>> >> >>
>> >> >> readallfields=true
>> >> >>
>> >> >> readproportion=0.5
>> >> >> updateproportion=0.1
>> >> >> scanproportion=0
>> >> >> insertproportion=0.35
>> >> >> readmodifywriteproportion=0.05
>> >> >>
>> >> >> requestdistribution=zipfian
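>> >> >>
>> >> >> For completeness, this is roughly how I invoke the load phase (the
>> >> >> classpath, the workload file name, and the 'family' column family are
>> >> >> from my setup, so treat it as a sketch rather than the exact command):
>> >> >>
>> >> >> java -cp ycsb.jar:$HBASE_HOME/conf:$HBASE_HOME/*:$HBASE_HOME/lib/* \
>> >> >>     com.yahoo.ycsb.Client -load \
>> >> >>     -db com.yahoo.ycsb.db.HBaseClient \
>> >> >>     -P workloads/customworkload \
>> >> >>     -p columnfamily=family \
>> >> >>     -threads 10 -s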
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>>
>> >> >>> > There are no abnormalities in the logs as far as I can see; the
>> >> >>> > only common point is that all of them (in different trials,
>> >> >>> > different region servers fail) request a flush as their last log
>> >> >>> > entries, given below. The .out files are empty. I am looking in
>> >> >>> > the /var/log/hbase folder for logs. Running the latest Sun Java 6.
>> >> >>> > I couldn't find any logs that indicate a problem with Java. Tried
>> >> >>> > the tests with OpenJDK and had the same results.
>> >> >>> >
>> >> >>>
>> >> >>> It's strange that flush is the last thing in your log. The process
>> >> >>> is dead? We are exiting w/o a note in the logs? That's unusual. We
>> >> >>> usually scream loudly when dying.
>> >> >>>
>> >> >>
>> >> >> Yes, that's the strange part. The last line is a flush, as if the
>> >> >> process never failed. Yes, the process is dead and HBase cannot see
>> >> >> the node.
>> >> >>
>> >> >>
>> >> >>>
>> >> >>> > I have set ulimits (50000) and xceivers (20000) for multiple users
>> >> >>> > and am certain that they are correct.
>> >> >>>
>> >> >>> The first line in an hbase log prints out the ulimit it sees. You
>> >> >>> might check that the hbase process for sure is picking up your
>> >> >>> ulimit setting.
>> >> >>>
>> >> >> That was a mistake I made a couple of days ago; I checked it with cat
>> >> >> /proc/<pid of regionserver>/limits, and all related users like 'hbase'
>> >> >> have those limits. Checked the logs:
>> >> >>
>> >> >> Mon Mar 7 06:41:15 EET 2011 Starting regionserver on test-1
>> >> >> ulimit -n 52768
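>> >> >>
>> >> >> For reference, these are the two places I set those limits (the values
>> >> >> are the ones from my cluster, so this is just a sketch of my settings,
>> >> >> not a recommendation):
>> >> >>
>> >> >> # /etc/security/limits.conf
>> >> >> hbase  -  nofile  52768
>> >> >> hdfs   -  nofile  52768
>> >> >>
>> >> >> <!-- hdfs-site.xml on the datanodes -->
>> >> >> <property>
>> >> >>   <name>dfs.datanode.max.xcievers</name>
>> >> >>   <value>20000</value>
>> >> >> </property>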
>> >> >>
>> >> >>>
>> >> >>> > Also in the kernel logs, there are no apparent problems.
>> >> >>> >
>> >> >>>
>> >> >>> (The mystery compounds)
>> >> >>>
>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3. because regionserver60020.cacheFlusher; priority=3, compaction queue size=18
>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc., flushing=false, writesEnabled=false
>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6., current region memstore size 68.6m
>> >> >>> > 2011-03-07 15:07:58,310 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.
>> >> >>> > -end of log file-
>> >> >>> > ---
>> >> >>> >
>> >> >>>
>> >> >>> Nothing more?
>> >> >>>
>> >> >>>
>> >> >> No, nothing after that. But there are quite a lot of logs before
>> >> >> that; I can send them if you'd like.
>> >> >>
>> >> >>
>> >> >>
>> >> >>> Thanks,
>> >> >>> St.Ack
>> >> >>>
>> >> >>
>> >> >> Thanks a lot!
>> >> >>
>> >> >>
>> >> >
>> >>
>> >
>>
>
>