Hi,

Thanks for the effort, answers below:




On Mon, Mar 7, 2011 at 6:08 PM, Stack <[email protected]> wrote:

> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR <[email protected]>
> wrote:
> > We have a 5-node cluster, 4 of them being region servers. I am running a
> > custom workload with YCSB, and when the data is loading (heavy insert) at
> > least one of the region servers dies after about 600000 operations.
>
>
> Tell us the character of your 'custom workload' please.
>
>
The workload is below. The part that fails is the loading phase (-load),
which inserts all the records first.

recordcount=10000000
operationcount=3000000
workload=com.yahoo.ycsb.workloads.CoreWorkload

readallfields=true

readproportion=0.5
updateproportion=0.1
scanproportion=0
insertproportion=0.35
readmodifywriteproportion=0.05

requestdistribution=zipfian
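As a sanity check on the workload above, here is a small Java sketch that loads these properties with java.util.Properties, checks that the five operation proportions sum to 1.0, and shows how operationcount would be split among them. The class and method names are hypothetical and it does not call YCSB itself. As far as I understand YCSB, the -load phase only inserts recordcount rows (10,000,000 here), so this mix applies to the later run phase, not to the load that dies at around 600000 inserts.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper: checks a CoreWorkload-style operation mix.
public class WorkloadMix {
    // Proportion keys as they appear in the workload file above.
    static final String[] KEYS = {
        "readproportion", "updateproportion", "scanproportion",
        "insertproportion", "readmodifywriteproportion"
    };

    // Returns the expected number of operations per type for operationcount.
    // A missing key is treated as 0 here; YCSB's own defaults are not modeled.
    static Map<String, Long> expectedCounts(Properties p) {
        long ops = Long.parseLong(p.getProperty("operationcount"));
        double sum = 0.0;
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String key : KEYS) {
            double share = Double.parseDouble(p.getProperty(key, "0"));
            sum += share;
            counts.put(key, Math.round(share * ops));
        }
        // Allow for floating-point rounding when summing decimal fractions.
        if (Math.abs(sum - 1.0) > 1e-9) {
            throw new IllegalArgumentException("proportions sum to " + sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        String workload = String.join("\n",
            "recordcount=10000000",
            "operationcount=3000000",
            "readproportion=0.5",
            "updateproportion=0.1",
            "scanproportion=0",
            "insertproportion=0.35",
            "readmodifywriteproportion=0.05");
        Properties p = new Properties();
        p.load(new StringReader(workload));
        // The load phase inserts recordcount rows; the mix applies to the run phase.
        System.out.println("load phase inserts: " + p.getProperty("recordcount"));
        expectedCounts(p).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

With these numbers the run phase would be roughly 1.5M reads, 300K updates, 1.05M inserts and 150K read-modify-writes.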




>
> > There are no abnormalities in the logs as far as I can see. The only
> > common point is that all of them (in different trials, different region
> > servers fail) request a flush as the last log lines, given below. The .out
> > files are empty. I am looking at the /var/log/hbase folder for logs. Running
> > the latest Sun Java 6. I couldn't find any logs that indicate a problem with
> > Java. Tried the tests with OpenJDK and had the same results.
> >
>
> It's strange that flush is the last thing in your log.  The process is
> dead?  We are exiting w/o a note in logs?  That's unusual.  We usually
> scream loudly when dying.
>

Yes, that's the strange part. The last line is a flush, as if the process
never failed. Yes, the process is dead and HBase cannot see the node.


>
> > I have set ulimits (50000) and xceivers (20000) for multiple users and am
> > certain that they are correct.
>
> The first line in an hbase log prints out the ulimit it sees.  You
> might check that the hbase process for sure is picking up your ulimit
> setting.
>

That was a mistake I made a couple of days ago. I checked it with cat
/proc/<pid of regionserver>/limits, and all related users like 'hbase' have
those limits. Checked the logs:

Mon Mar  7 06:41:15 EET 2011 Starting regionserver on test-1
ulimit -n 52768
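The same /proc check can be scripted. Below is a minimal Java sketch, assuming Linux's /proc/<pid>/limits layout (a "Max open files" row followed by soft limit, hard limit and units columns). The class and method names are hypothetical, and reading another user's process entry may require running as that user or root.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: reads the soft "Max open files" limit of a running
// process from Linux's /proc/<pid>/limits.
public class OpenFilesLimit {
    static final String ROW = "Max open files";

    // Returns the soft limit column of the "Max open files" row, if present.
    // The value can be a number or "unlimited", so it is kept as a string.
    static Optional<String> softOpenFiles(List<String> lines) {
        for (String line : lines) {
            if (line.startsWith(ROW)) {
                String[] cols = line.substring(ROW.length()).trim().split("\\s+");
                return Optional.of(cols[0]);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        // Pass the regionserver pid as the first argument; defaults to this JVM.
        String pid = args.length > 0 ? args[0] : "self";
        List<String> lines = Files.readAllLines(Paths.get("/proc", pid, "limits"));
        System.out.println("Max open files (soft) for " + pid + ": "
            + softOpenFiles(lines).orElse("not found"));
    }
}
```

Run against the regionserver pid, the soft value printed should agree with the `ulimit -n` line HBase writes at startup.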

>
> > Also in the kernel logs, there are no apparent problems.
> >
>
> (The mystery compounds)
>
> > 2011-03-07 15:07:58,301 DEBUG
> > org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
> > requested for
> > usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3.
> > because regionserver60020.cacheFlusher; priority=3, compaction queue size=18
> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> > NOT flushing memstore for region
> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.,
> > flushing=false, writesEnabled=false
> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> > Started memstore flush for
> > usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6.,
> > current region memstore size 68.6m
> > 2011-03-07 15:07:58,310 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> > Flush requested on
> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.
> > -end of log file-
> > ---
> >
>
> Nothing more?
>
>
No, nothing after that. There are quite a lot of log lines before it, though;
I can send them if you'd like.



> Thanks,
> St.Ack
>

Thanks a lot!
