Along with a bigger portion of the log, it might be good to check if there's anything in the .out file that looks like a JVM error.
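Something like this would do for a first pass (a sketch: the *.out location comes from the /var/log/hbase directory mentioned below, and hs_err_pid*.log is the JVM's default crash-report name, normally dropped in the process's working directory, so adjust the paths to your install):

  # anything JVM-ish captured on stdout/stderr?
  grep -iE 'fatal|error|exception|outofmemory' /var/log/hbase/*.out

  # a hard JVM crash usually leaves a crash report behind
  find /usr/lib/hbase /var/log/hbase -name 'hs_err_pid*.log' 2>/dev/null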
J-D

On Mon, Mar 7, 2011 at 9:22 AM, M.Deniz OKTAR <[email protected]> wrote:
> I ran every kind of benchmark I could find on those machines and they
> seemed to work fine. Did memory/disk tests too.
>
> The master node or other nodes provide some information and exceptions
> about not being able to reach the dead node.
>
> Btw, sometimes the process does not die but loses the connection.
>
> --
> deniz
>
> On Mon, Mar 7, 2011 at 7:19 PM, Stack <[email protected]> wrote:
>
>> I'm stumped. I have nothing to go on when there are no death throes or
>> complaints. This hardware for sure is healthy? Other stuff runs w/o
>> issue?
>> St.Ack
>>
>> On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR <[email protected]> wrote:
>> > I don't know if it's normal, but I see a lot of '0's in the test
>> > results when it tends to fail, such as:
>> >
>> > 1196 sec: 7394901 operations; 0 current ops/sec;
>> >
>> > --
>> > deniz
>> >
>> > On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> Thanks for the effort, answers below:
>> >>
>> >> On Mon, Mar 7, 2011 at 6:08 PM, Stack <[email protected]> wrote:
>> >>
>> >>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR <[email protected]> wrote:
>> >>> > We have a 5 node cluster, 4 of them being region servers. I am
>> >>> > running a custom workload with YCSB, and while the data is loading
>> >>> > (heavy insert) at least one of the region servers dies after about
>> >>> > 600000 operations.
>> >>>
>> >>> Tell us the character of your 'custom workload' please.
>> >>>
>> >> The workload is below; the part that fails is the loading part (-load),
>> >> which inserts all the records first:
>> >>
>> >> recordcount=10000000
>> >> operationcount=3000000
>> >> workload=com.yahoo.ycsb.workloads.CoreWorkload
>> >>
>> >> readallfields=true
>> >>
>> >> readproportion=0.5
>> >> updateproportion=0.1
>> >> scanproportion=0
>> >> insertproportion=0.35
>> >> readmodifywriteproportion=0.05
>> >>
>> >> requestdistribution=zipfian
>> >>
>> >>> > There are no abnormalities in the logs as far as I can see; the only
>> >>> > common point is that all of them (in different trials, different
>> >>> > region servers fail) request a flush as their last log entries,
>> >>> > given below. The .out files are empty. I am looking at the
>> >>> > /var/log/hbase folder for logs. Running Sun Java 6, latest version.
>> >>> > I couldn't find any logs that indicate a problem with Java. Tried
>> >>> > the tests with OpenJDK and had the same results.
>> >>>
>> >>> It's strange that a flush is the last thing in your log. The process
>> >>> is dead? We are exiting w/o a note in the logs? That's unusual. We
>> >>> usually scream loudly when dying.
>> >>>
>> >> Yes, that's the strange part. The last line is a flush, as if the
>> >> process never failed. Yes, the process is dead and HBase cannot see
>> >> the node.
>> >>
>> >>> > I have set ulimits (50000) and xceivers (20000) for multiple users
>> >>> > and am certain that they are correct.
>> >>>
>> >>> The first line in an hbase log prints out the ulimit it sees. You
>> >>> might check that the hbase process for sure is picking up your ulimit
>> >>> setting.
>> >>>
>> >> That was a mistake I made a couple of days ago; checked it with cat
>> >> /proc/<pid of regionserver>/limits, and all related users like 'hbase'
>> >> have those limits.
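>> >> For reference, that check was along these lines (a sketch, not the
>> >> exact command history; the pid lookup via jps is just one way to
>> >> find it):
>> >>
>> >> RS_PID=$(jps | awk '/HRegionServer/ {print $1}')
>> >> grep 'open files' /proc/$RS_PID/limits
>> >>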
>> >> Checked the logs:
>> >>
>> >> Mon Mar 7 06:41:15 EET 2011 Starting regionserver on test-1
>> >> ulimit -n 52768
>> >>
>> >>> > Also in the kernel logs, there are no apparent problems.
>> >>>
>> >>> (The mystery compounds)
>> >>>
>> >>> > 2011-03-07 15:07:58,301 DEBUG
>> >>> > org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
>> >>> > requested for
>> >>> > usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3.
>> >>> > because regionserver60020.cacheFlusher; priority=3, compaction
>> >>> > queue size=18
>> >>> > 2011-03-07 15:07:58,301 DEBUG
>> >>> > org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore
>> >>> > for region
>> >>> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.,
>> >>> > flushing=false, writesEnabled=false
>> >>> > 2011-03-07 15:07:58,301 DEBUG
>> >>> > org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush
>> >>> > for
>> >>> > usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6.,
>> >>> > current region memstore size 68.6m
>> >>> > 2011-03-07 15:07:58,310 DEBUG
>> >>> > org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on
>> >>> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.
>> >>> > -end of log file-
>> >>> > ---
>> >>>
>> >>> Nothing more?
>> >>>
>> >> No, nothing after that. But quite a lot of logs before that; I can
>> >> send them if you'd like.
>> >>
>> >>> Thanks,
>> >>> St.Ack
>> >>
>> >> Thanks a lot!
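>> >>
>> >> PS: the kernel-log check above was roughly the following (a sketch,
>> >> not the exact command history; 'Killed process' is the marker the
>> >> Linux OOM killer leaves, and the syslog path varies by distro):
>> >>
>> >> dmesg | grep -iE 'oom|killed process'
>> >> grep -iE 'oom|killed process' /var/log/messages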
